[OpenAFS] Timeouts and odd behavior with 1.6.0 file servers

Derrick Brashear shadow@gmail.com
Thu, 26 Jan 2012 11:47:35 -0500


On Thu, Jan 26, 2012 at 9:35 AM, Jeff White <jaw171@pitt.edu> wrote:
> Russ, can you link me to some more information on the data corruption issue
> with 1.6.0?

Effectively, there's an issue where files can end up corrupted when a
copy-on-write copy is modified (that is, when a backup or readonly
clone is present on the read-write site).


> Jeff White - Linux/Unix Systems Engineer
> University of Pittsburgh - CSSD
>
>
> On 01/25/2012 05:22 PM, Russ Allbery wrote:
>>
>> Jack Neely <jjneely@pams.ncsu.edu> writes:
>>
>>> We are working our way through a migration from old Sun AFS hardware
>>> running OpenAFS 1.4.11 to HP Blades running RHEL 6 with OpenAFS 1.6.0.
>>> At this point we've completed most of our file servers.
>>
>> Don't use the 1.6.0 file server.  It has a data corruption problem when
>> you have an inode clone (such as a backup volume or a migration clone) and
>> directories are moved with mv.  These are fixed in 1.6.1pre1 and in the
>> Debian 1.6.0-3 packages.  This may be what you're running into.
>>
>>> RHEL 6 / 1.6.0 clients wired into the network occasionally have long
>>> pauses when doing AFS operations, such as running ls.  It may take 30
>>> seconds to a minute for the AFS server (the datacenter is downstairs) to
>>> respond.  We are not seeing high load or any signs on the server that
>>> something is wrong.
>>> The above applies as well to our web servers that are RHEL 6 / 1.6.0.
>>> Several times a week load on the web servers will suddenly spike and
>>> rxdebug tells us that RX calls to one of the AFS servers are all/mostly
>>> in the reader_wait state.  Just as suddenly as it starts, it's over with.
>>>     call 0: # 5231, state active, mode: receiving, flags: reader_wait
>>> Our cron job that mirrors CPAN to AFS space now often fails with
>>> timeout errors.
>>>     readlink_stat("/afs/...") failed: Connection timed out (110)
>>
>> Yes, this is consistent with the problems that we're seeing on our web
>> servers with OpenAFS 1.4 as well, which are probably due at least in part
>> to the pathological idledead interactions with the way that server threads
>> can back up waiting for vnode locks.  1.6.1pre2 (coming shortly) has both
>> client and server fixes for the idledead part of this.
>>



-- 
Derrick