[OpenAFS] 1.4.x, select() and recent RHEL kernels beware

Derrick Brashear shadow@gmail.com
Thu, 8 Nov 2012 11:11:05 -0500


On Thu, Nov 8, 2012 at 10:41 AM, Dan Van Der Ster
<daniel.vanderster@cern.ch> wrote:
> Dear OpenAFS 1.4.x Users,
>
> At CERN we just suffered from a confusing problem where the fileserver
> process would regularly segfault (on only one new server just put into
> production). Since a gdb of the fileserver core file was showing random
> bit flips here and there, we initially suspected a bad memory chip.
> However, the memory tested OK.
>
> Finally we realised this was due to fssync.c in 1.4's use of
> select()/FD_SET and the corrupting behaviour of those functions when
> using >1024 file descriptors per process. Until quite recently this
> hadn't been a problem, since RHEL kernels used ulimit -Hn 1024 by
> default. However, as of kernel 2.6.32-279 the limit was raised to 4096
> (to purge certain distros of dangerous applications ;) ). This means
> that all 1.4.x servers running with 2.6.32-279 and later will get
> corrupted stacks in fssync.c and probably crash.
>
> Note that 1.6 and beyond is safe from this RHEL kernel change since
> Simon already patched fssync to use poll() 5 years ago ;)
>
> All of the nasty details of this incident here:
>     https://afs.web.cern.ch/afs/reports/html/afs200SegFaults.html
>
> We're now running with a workaround,
>   ulimit -Hn 1024; ulimit -Sn 1024
> in our init scripts until we manage to upgrade to 1.6.
>
> Hope this saves someone the effort of troubleshooting this again.

Unless you manually set HAVE_POLL, you may not have it enabled in 1.6:
we didn't actually do the configure test for it. It will be fixed in 1.6.2.

Incidentally, note that salvsync, unlike fssync, currently never tries
poll().
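
For anyone who wants to see the select()/FD_SET hazard next to the poll()
alternative, here is a minimal sketch. It is not the fssync or salvsync
code; the function names wait_readable_select and wait_readable_poll are
made up for illustration, and timeout_ms is assumed to be non-negative.
The select() variant refuses any descriptor at or above FD_SETSIZE
(typically 1024 on Linux/glibc) instead of letting FD_SET write past the
end of the fd_set, while the poll() variant has no such ceiling.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper, not OpenAFS code: wait for fd to become readable
 * using select(). Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    /* FD_SET on a descriptor >= FD_SETSIZE writes outside the fd_set,
     * which is the kind of stack corruption described above. Bail out. */
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EINVAL;
        return -1;
    }

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    return rc > 0 ? 1 : rc;
}

/* Hypothetical helper, not OpenAFS code: same contract, using poll().
 * poll() takes the descriptor by value, so large fd numbers are fine. */
int wait_readable_poll(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc > 0 && (pfd.revents & POLLNVAL)) {
        /* The descriptor is not open; report it as an error. */
        errno = EBADF;
        return -1;
    }
    return rc > 0 ? 1 : rc;
}
```

With a 4096 hard limit, a descriptor numbered 1024 or higher would make
the select() variant return -1 with EINVAL, whereas the poll() variant
would simply wait on it.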