[OpenAFS] 1.4.x, select() and recent RHEL kernels beware

Dan Van Der Ster daniel.vanderster@cern.ch
Thu, 8 Nov 2012 15:41:57 +0000


Dear OpenAFS 1.4.x Users,

At CERN we just suffered from a confusing problem where the fileserver
process would regularly segfault (on only one new server just put into
production). Since a gdb of the fileserver core file was showing random
bit flips here and there, we initially suspected a bad memory chip.
However, the memory tested OK.

Finally we realised this was due to the use of select()/FD_SET in 1.4's
fssync.c and the corrupting behaviour of those functions when a process
has more than 1024 file descriptors open: FD_SET on a descriptor numbered
FD_SETSIZE (1024) or higher writes past the end of the fixed-size fd_set.
Until quite recently this hadn't been a problem, since RHEL kernels used
ulimit -Hn 1024 by default. However, as of kernel 2.6.32-279 the limit
was raised to 4096 (to purge certain distros of dangerous applications ;) ).
This means that all 1.4.x servers running with 2.6.32-279 and later will
get corrupted stacks in fssync.c and probably crash.
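
To illustrate the failure mode, here is a minimal sketch of the kind of
range check that avoids the out-of-bounds write. It is not the fssync.c
code; the helper names fd_fits_in_fd_set() and safe_fd_set() are
hypothetical, and the only assumption is the standard FD_SETSIZE constant
from <sys/select.h>.

```c
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd can legally be stored in an
 * fd_set, 0 otherwise. An fd_set only has room for descriptors in the
 * range 0 .. FD_SETSIZE-1. */
static int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: add fd to the set only when it is in range.
 * Returns 0 on success, -1 if the descriptor was refused. Calling
 * FD_SET directly with an out-of-range fd would write outside the
 * fd_set, which is the kind of corruption described above. */
static int safe_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A caller would FD_ZERO() the set first and treat a -1 return as "this
descriptor cannot be watched with select()".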

Note that 1.6 and beyond are safe from this RHEL kernel change, since
Simon already patched fssync to use poll() 5 years ago ;)
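
For anyone wondering why poll() does not have this problem: it takes an
array of struct pollfd rather than a fixed-size bitmap, so the descriptor
number is not bounded by FD_SETSIZE. The following is only a sketch of the
idea, not the actual 1.6 patch; wait_readable() and poll_demo() are
hypothetical names.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable (or at EOF), 0 on timeout, -1 on error.
 * The fd is passed by value in a struct, so large descriptor numbers
 * are fine. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;              /* poll() itself failed */
    if (rc == 0)
        return 0;               /* timed out, nothing to read */
    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}

/* Small self-check on a pipe: returns 1 if the helper reports
 * "not ready" on an empty pipe and "ready" after one byte is written. */
static int poll_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return 0;
    int empty = wait_readable(p[0], 0);
    ssize_t n = write(p[1], "x", 1);
    int ready = wait_readable(p[0], 1000);
    close(p[0]);
    close(p[1]);
    return n == 1 && empty == 0 && ready == 1;
}
```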

All of the nasty details of this incident are here:
    https://afs.web.cern.ch/afs/reports/html/afs200SegFaults.html

We're now running with a workaround,
  ulimit -Hn 1024; ulimit -Sn 1024
in our init scripts until we manage to upgrade to 1.6.
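
If you would rather enforce the same cap from inside a wrapper program
than from an init script, a rough C equivalent could look like the sketch
below. cap_nofile() is a hypothetical helper, not something that exists in
OpenAFS, and unlike the ulimit commands it only ever lowers the limits.
Note that an unprivileged process cannot raise the hard limit again
afterwards.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the soft and hard RLIMIT_NOFILE limits to
 * at most `limit`. Limits that are already lower are left alone.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
static int cap_nofile(rlim_t limit)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_max == RLIM_INFINITY || rl.rlim_max > limit)
        rl.rlim_max = limit;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > limit)
        rl.rlim_cur = limit;

    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Calling cap_nofile(1024) before exec'ing the fileserver would have
roughly the same effect as the two ulimit commands above, since the
limits are inherited across exec.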

Hope this saves someone the effort of troubleshooting this again.

Cheers,
Dan van der Ster
CERN IT-DSS