[OpenAFS] 1.4.x, select() and recent RHEL kernels beware
Thu, 8 Nov 2012 11:11:05 -0500
On Thu, Nov 8, 2012 at 10:41 AM, Dan Van Der Ster
> Dear OpenAFS 1.4.x Users,
> At CERN we just suffered from a confusing problem where the fileserver
> process would regularly segfault (on only one new server just put into
> production). Since a gdb of the fileserver core file was showing random
> bit flips here and there, we initially suspected a bad memory chip.
> However, the memory tested OK.
> Finally we realised this was due to fssync.c in 1.4's use of
> select()/FD_SET and the corrupting behaviour of those functions when
> using >1024 file descriptors per process. Until quite recently this
> hadn't been a problem, since RHEL kernels used ulimit -Hn 1024 by
> default. However, as of kernel 2.6.32-279 the limit was raised to 4096
> (to purge certain distros of dangerous applications ;) ). This means
> that all 1.4.x servers running with 2.6.32-279 and later will get
> corrupted stacks in fssync.c and probably crash.
> Note that 1.6 and beyond is safe from this RHEL kernel change since
> Simon already patched fssync to use poll() 5 years ago ;)
> All of the nasty details of this incident here:
> We're now running with a workaround,
> ulimit -Hn 1024; ulimit -Sn 1024
> in our init scripts until we manage to upgrade to 1.6.
> Hope this saves someone the effort of troubleshooting this again.
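For anyone who wants the failure mode and the workaround in concrete terms: FD_SET() on a descriptor numbered FD_SETSIZE or higher writes past the end of the fd_set, which is how a stack-allocated set gets corrupted. The following is only a minimal sketch, not code from fssync.c. The helper names fd_fits_select() and cap_nofile() are made up for illustration; cap_nofile() is roughly an in-process equivalent of the two ulimit commands above, and it only ever lowers the limits.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd can be passed to FD_SET() without
 * writing outside the fd_set, 0 otherwise. */
static int fd_fits_select(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: lower the soft and hard RLIMIT_NOFILE of the calling
 * process to at most `limit`. Limits already below `limit` are left alone.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
static int cap_nofile(rlim_t limit)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_max > limit)
        rl.rlim_max = limit;
    if (rl.rlim_cur > limit)
        rl.rlim_cur = limit;
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Calling cap_nofile(FD_SETSIZE) early in startup means every descriptor the process can later obtain still fits in an fd_set, at the cost of limiting how many files it can have open.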
Unless you manually set HAVE_POLL, you may not have it enabled in 1.6:
we didn't actually do the configure test for it. It will be fixed in 1.6.2.
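To illustrate why the missing configure test matters, here is a rough sketch of a wait routine that is selected at compile time. This is not the actual fssync code; sync_wait_readable() is a hypothetical name, and the only thing taken from the above is the HAVE_POLL macro. If nothing defines HAVE_POLL, the select() branch is what gets built, so the sketch refuses descriptors that do not fit in an fd_set instead of corrupting memory.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#ifdef HAVE_POLL
#include <poll.h>
#endif

/* Hypothetical helper: wait until fd is readable.
 * timeout_ms < 0 waits indefinitely.
 * Returns >0 if an event is pending, 0 on timeout, -1 on error. */
static int sync_wait_readable(int fd, int timeout_ms)
{
#ifdef HAVE_POLL
    /* poll() takes the descriptor by value, so fd may exceed FD_SETSIZE. */
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms);
#else
    /* select() fallback: FD_SET() is only defined for 0 <= fd < FD_SETSIZE. */
    fd_set rfds;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EBADF;
        return -1;
    }
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (timeout_ms < 0)
        return select(fd + 1, &rfds, NULL, NULL, NULL);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &rfds, NULL, NULL, &tv);
#endif
}
```

Building with -DHAVE_POLL picks the poll() branch; without it the same source silently falls back to select(), which is the situation described above for 1.6 before the configure test is added.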
Incidentally, of note, currently salvsync unlike fssync doesn't ever try po=