[OpenAFS] Re: OpenAFS freeze problems

Derrick Brashear shadow@gmail.com
Tue, 28 Feb 2012 08:34:20 -0500


On Mon, Feb 27, 2012 at 9:00 PM, John Tang Boyland
<boyland@pabst.cs.uwm.edu> wrote:
> ] About every few hours or so, AFS "freezes" on a write:
> ] the attempt to write blocks for about 30 seconds or so.
>
> ...
>
> As suspected, there is no problem with the number of threads; the rxdebug
> command shows 0 threads used out of 11 while a freeze is happening.
>
> Some people suggested I blacklist clients that (apparently)
> don't respond to callback breaking.  But that won't work because
> (1) it could be that the campus wireless is blocking access
>     (not sure here)
> (2) when you close a laptop it won't respond to anything.
>     (Most of the students using AFS on our cell have OpenAFS on
>      their laptops.)
> (3) If you move your laptop to a new location on campus, you get a new
>     IP address, and no one will respond at the old IP address.
> None of these are the fault of the client.
>
> So the only solution would be to decouple callback breaking from
> giving permission to write.  Right now, the attempt to write
> stalls while the server attempts to tell clients the callbacks are
> broken.  I don't understand why the client doing the write
> has to wait for the other clients to ack the callback breaks.
> Why not permit the write to go ahead while the server continues
> to try to notify the other clients of the write?
>
> In other words, is there any information that these clients
> (whose callbacks are being broken) could say that would cause the
> server to deny the write attempt?  If not, then why delay it?

The AFS coherency model means that when you get a successful reply to
your write, other clients have been notified. You're delayed until we
can tell you we notified clients.

It means that if your software uses the return code from the write to
know it can tell other nodes to take some action on the data, those
nodes will have access to the data. Otherwise their cache manager would
not re-FetchStatus the data; the still-valid callback would mean they
just use what they have.
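
To make that concrete, here is a toy sketch of the ordering, in Python.
It is not OpenAFS code: the Client and Server classes, the
break_before_reply switch, and demo() are all made up for illustration.
It only shows what a second client reads right after the writer is told
the write succeeded, depending on whether callbacks were broken before
or after that reply.

```python
# Toy model only: every name here is hypothetical, not an OpenAFS API.

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}         # path -> cached data
        self.callbacks = set()  # paths for which we hold a valid callback

    def read(self, server, path):
        # A valid callback means the cached copy is trusted as-is,
        # so the server is not contacted at all.
        if path in self.callbacks and path in self.cache:
            return self.cache[path]
        data = server.fetch(self, path)
        self.cache[path] = data
        self.callbacks.add(path)
        return data

    def break_callback(self, path):
        self.callbacks.discard(path)


class Server:
    def __init__(self, break_before_reply=True):
        self.files = {}
        self.holders = {}   # path -> set of clients holding a callback
        self.pending = []   # breaks not yet delivered (decoupled mode)
        self.break_before_reply = break_before_reply

    def fetch(self, client, path):
        self.holders.setdefault(path, set()).add(client)
        return self.files.get(path)

    def store(self, writer, path, data):
        self.files[path] = data
        current = self.holders.get(path, set())
        others = current - {writer}
        self.holders[path] = current & {writer}
        if self.break_before_reply:
            # Current behaviour described above: notify first, then reply.
            for c in others:
                c.break_callback(path)
        else:
            # The proposed decoupling: reply now, notify "later".
            self.pending.extend((c, path) for c in others)
        return True  # the success reply the writer sees


def demo(break_before_reply):
    server = Server(break_before_reply)
    a, b = Client("a"), Client("b")
    server.files["f"] = "old"
    b.read(server, "f")            # b caches "old" and holds a callback
    server.store(a, "f", "new")    # a's write returns success
    # a now tells b (out of band) to go look at f.
    return b.read(server, "f")
```

In this model demo(True) returns "new", while demo(False) returns "old":
with the break deferred, b still holds what it believes is a valid
callback and never asks the server.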

-- 
Derrick