[OpenAFS] DAFS Salvager failure

Pavel Semerad semerad@lab.ms.mff.cuni.cz
Sat, 20 Oct 2012 01:03:22 +0200


> Folks,
> 
> One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
> RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
> salvager hung and eventually the dafileserver stopped responding to
> clients.
> 
  I had a similar problem on Monday and Tuesday this week. The dafileserver
crashed and was restarted by the bosserver, but after some time the salvager
stopped salvaging (the configured number of salvage processes was present,
but they were only sleeping and not repairing data). There were also some
FSSYNC error messages in the log. I then manually restarted the fileserver
process, and it worked for some time, salvaging volumes, but only until the
next dafileserver crash. This happened several times, including with older
binaries from openafs-1.6.1 (the current ones were openafs-1.6.1a).

  After recompiling openafs with debug info, the next crash showed that it
segfaulted in FD_ISSET in the function CallHandler in src/vol/fssync-server.c.
  I saw that the code can use the poll() interface instead of select(), so
I forced it to use the poll() code (#define HAVE_POLL), and it has been
running without a crash from Tuesday until now.
  I don't know whether this has any issues. I didn't find a test for poll()
in the configure script, so this poll() code doesn't seem to be normally used.
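
  For illustration only, here is a minimal sketch of the general difference
between the two interfaces. It is not taken from the OpenAFS source, and the
function names fd_fits_in_fd_set and wait_readable_poll are hypothetical.
It assumes one possible cause of a fault in FD_ISSET, which I have not
confirmed for this crash: POSIX leaves FD_SET/FD_ISSET undefined for a
descriptor that is negative or not less than FD_SETSIZE, while poll() takes
an array of descriptors and has no such limit.

```c
#include <poll.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd may be stored in an fd_set,
 * 0 otherwise. Using FD_SET/FD_ISSET outside this range is undefined. */
static int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to
 * become readable, using poll() so the descriptor value is not limited
 * by FD_SETSIZE.
 * Returns 1 if readable (or the peer hung up), 0 on timeout, -1 on error. */
static int wait_readable_poll(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;              /* poll() itself failed, see errno */
    if (rc == 0)
        return 0;               /* timed out, nothing to read */
    if (pfd.revents & (POLLIN | POLLHUP))
        return 1;               /* data available or end of stream */
    return -1;                  /* POLLERR or POLLNVAL */
}
```

  A select()-based caller could check fd_fits_in_fd_set(fd) before touching
an fd_set and refuse or close descriptors that do not fit; a poll()-based
caller does not need that check.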

Pavel Semerad