[OpenAFS] Re: "afs: Lost contact with file server" on the same machine?

Adam Megacz megacz@hcoop.net
Sat, 13 Jun 2009 17:29:35 -0700


Esther Filderman <mizmoose@gmail.com> writes:
>  - Does the "lost contact with server" occur on all clients at the
> same time?  Or is it scattered which one loses contact?

It is definitely scattered; we've seen situations where one client
"lost contact" while another seemed to be having no troubles.

>  - For how long does the "lost contact" occur?  Is it seconds or
> minutes or longer?

Around 10-15 minutes, or until the next "fs checks", whichever comes
first.  Some users know to run "fs checks" to make this go away, but
most don't.  Others are seeing unsupervised cron/at jobs fail as a
result of this.

>  - Simple, stupid question: Have you confirmed your hardware is OK and
> not causing hiccups in the system?

Yes.

>  - Have you tried using rxdebug to see if the fileserver is getting
> caught up on something?  Try running it when one of the clients claims
> it's lost contact with the server.

Unfortunately we can't reproduce the bug "on demand".  It tends to
happen when nobody's looking, and it clears up quickly enough that by
the time somebody gets an admin's attention it is already gone.
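
One way to catch it unattended would be a small watcher that polls the
client and runs rxdebug the moment a server is marked down.  What
follows is only a rough sketch, not something we have running: the
helper names, the log file name, and the polling interval are all made
up for illustration, and the parsing assumes "fs checkservers" prints
either "All servers are running." or a single line listing the
unavailable servers after a colon.  It uses "-fast" on the assumption
that we want to read the cache manager's current list without sending
a probe that might itself clear the condition.

```python
import subprocess
import time

ALL_OK = "All servers are running."
FILESERVER_PORT = "7000"  # standard AFS fileserver port


def down_servers(output):
    """Return the server names that `fs checkservers` reports as unavailable.

    Assumes the names follow a colon on the report line and that the
    list may end with a trailing period.
    """
    text = output.strip()
    if not text or text.startswith(ALL_OK):
        return []
    _, sep, tail = text.partition(":")
    if not sep:
        return []
    names = [token.rstrip(".") for token in tail.split()]
    return [name for name in names if name]


def snapshot(server):
    """Run rxdebug against one fileserver and return its text output."""
    try:
        result = subprocess.run(
            ["rxdebug", server, FILESERVER_PORT],
            capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return "rxdebug timed out\n"
    return result.stdout


def watch(interval=30, logfile="rxdebug-snapshots.log"):
    """Poll the client forever; append an rxdebug snapshot per down server."""
    while True:
        check = subprocess.run(
            ["fs", "checkservers", "-fast"],
            capture_output=True, text=True,
        )
        for server in down_servers(check.stdout):
            with open(logfile, "a") as log:
                log.write(f"=== {time.ctime()} {server} ===\n")
                log.write(snapshot(server))
        time.sleep(interval)
```

Calling watch() on an affected client (from a screen session, say)
would leave a timestamped rxdebug dump behind for each episode, which
someone could then look at after the fact.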

Thanks for your help,

  - a