[OpenAFS] Timeouts and odd behavior with 1.6.0 file servers

Thu, 26 Jan 2012 09:35:10 -0500

Russ, can you link me to some more information on the data corruption 
issue with 1.6.0?

Jeff White - Linux/Unix Systems Engineer
University of Pittsburgh - CSSD

On 01/25/2012 05:22 PM, Russ Allbery wrote:
> Jack Neely <jjneely@pams.ncsu.edu> writes:
>
>> We are working our way through a migration from old Sun AFS hardware
>> running OpenAFS 1.4.11 to HP Blades running RHEL 6 with OpenAFS 1.6.0.
>> At this point we've completed most of our file servers.
> Don't use the 1.6.0 file server.  It has a data corruption problem when
> you have an inode clone (such as a backup volume or a migration clone) and
> directories are moved with mv.  This is fixed in 1.6.1pre1 and in the
> Debian 1.6.0-3 packages.  This may be what you're running into.
>
>> RHEL 6 / 1.6.0 clients wired into the network occasionally have long pauses
>> when doing AFS operations, such as running ls.  It may take 30 seconds
>> to a minute for the AFS server (the datacenter is downstairs) to
>> respond.  We are not seeing high load or any signs on the server that
>> something is wrong.
>> The above applies as well to our web servers that are RHEL 6 / 1.6.0.
>> Several times a week load on the web servers will suddenly spike and
>> rxdebug tells us that RX calls to one of the AFS servers are all/mostly
>> in the reader_wait state.  Just as suddenly as it starts, it's over with.
>>      call 0: # 5231, state active, mode: receiving, flags: reader_wait
>> Our cron job that mirrors CPAN to AFS space now often fails with timeout
>> errors.
>>      readlink_stat("/afs/...") failed: Connection timed out (110)
> Yes, this is consistent with the problems that we're seeing on our web
> servers with OpenAFS 1.4 as well, which are probably due at least in part
> to the pathological idledead interactions with the way that server threads
> can back up waiting for vnode locks.  1.6.1pre2 (coming shortly) has both
> client and server fixes for the idledead part of this.
>
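
For anyone who wants to quantify the reader_wait pile-up described in the
quoted message, here is a minimal sketch that tallies call states and flags
from saved rxdebug output. The line pattern is modeled only on the single
sample line quoted in this thread, so treat it as an assumption: real rxdebug
output may carry other fields or layouts. The names CALL_RE and
summarize_calls are hypothetical and are not part of OpenAFS.

```python
import re
from collections import Counter

# Pattern modeled on the one sample line quoted in this thread
# ("call 0: # 5231, state active, mode: receiving, flags: reader_wait").
# This is an assumption; real rxdebug output may differ.
CALL_RE = re.compile(
    r"call\s+(?P<slot>\d+):\s+#\s*(?P<number>\d+),\s+state\s+(?P<state>\w+)"
    r"(?:,\s+mode:\s+(?P<mode>\w+))?"
    r"(?:,\s+flags:\s+(?P<flags>.+))?$"
)


def summarize_calls(text):
    """Count parsed call lines by state and by flag; skip unmatched lines."""
    states = Counter()
    flags = Counter()
    for line in text.splitlines():
        match = CALL_RE.search(line.strip())
        if not match:
            continue
        states[match.group("state")] += 1
        # Flags are assumed to be separated by spaces and/or commas.
        for flag in (match.group("flags") or "").replace(",", " ").split():
            flags[flag] += 1
    return states, flags


sample = """\
    call 0: # 5231, state active, mode: receiving, flags: reader_wait
"""
states, flags = summarize_calls(sample)
print("active calls:", states["active"])
print("calls in reader_wait:", flags["reader_wait"])
```

Feeding it output captured at intervals during a load spike would show
whether the share of calls flagged reader_wait rises and falls along with
the pauses.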