[OpenAFS] Re: client kernel panic on EL6

Mon, 21 Oct 2013 10:13:43 -0500

On Mon, 21 Oct 2013 15:18:06 +0100
Stephen Quinney <stephen@jadevine.org.uk> wrote:

> Has anyone else seen a kernel panic like this on EL6 with 1.6.5 and
> kernel 2.6.32-358.14.1.el6? Or does anyone have any suggestions as to
> what might have caused the problem?
> 
> afs: disk cache read error in CacheItems slot 353815 off 28305220/36284420
> code -4/80
> openafs: assertion failed: tdc, file:

In short: this means we got an error when reading from the cache fs. I
assume that -4 is -EINTR, so that means we probably need to block
signals on Linux when reading from the cache fs. (Or support getting
interrupted by signals, but we don't do that now.)
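
As a rough userspace analogy only (this is not the OpenAFS kernel code,
and read_signals_blocked is a hypothetical name), "block signals around
the read" can be sketched with POSIX calls. The kernel client would use
kernel-side mechanisms instead; the sketch only shows the shape of the
idea for a single-threaded program:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to len bytes from fd with every
 * blockable signal masked, so a signal cannot interrupt the read.
 * Returns what read() returns; errno is preserved on failure.
 */
static ssize_t read_signals_blocked(int fd, void *buf, size_t len)
{
    sigset_t all, saved;
    ssize_t n;
    int saved_errno;

    assert(buf != NULL);            /* programming error, not an I/O error */

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;

    /* Retry anyway in case the read is still reported as interrupted. */
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    /* Restore the caller's mask without clobbering read()'s errno. */
    saved_errno = errno;
    sigprocmask(SIG_SETMASK, &saved, NULL);
    errno = saved_errno;
    return n;
}
```

The retry loop corresponds to the other option mentioned above
(tolerating an interrupted read and trying again); in a real fix one
of the two would normally be enough.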

Some more info:

Historically, the Unix client hasn't really handled cache I/O errors at
all. In various places a failed read from or write to the cache would
panic the machine, return an error to userspace, corrupt cache
accounting information, and so on. This is improving over time, and it's
much better now than it has been in the past, but not all instances have
been fixed yet.

The particular error you saw occurs when reading a dcache slot from
disk. In the past, an error like this would corrupt cache information,
since the old code assumed that such errors never happened. We've
suspected that this is what's behind some other reports of crashes and
minor cache corruption involving the dslot hash chains, but we didn't
really _know_, since in those cases the underlying problem had already
happened at some earlier point by the time the crash occurred.

This dslot read error has been a candidate for the cause of those issues
(and I believe, pretty much the only candidate that hasn't been
otherwise ruled out). So we added a log message when that error occurs,
and made the relevant function return an error. Various code paths were
adjusted to try to handle the error as gracefully as possible, but some
were more complex/difficult than others. A few, such as the one in the
specific backtrace you mentioned, have not hit 1.6 yet: we didn't
really know what was going on in these scenarios, so I was concerned
about introducing new/different errors in the error handling itself.
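
To make the "log it and return an error" pattern concrete, here is a
minimal userspace sketch. It is not the OpenAFS code: read_slot,
slot_offset, HDR_SIZE and SLOT_SIZE are hypothetical names, and the 20
and 80 byte sizes are an assumption chosen only because they are
consistent with the quoted log line (353815 * 80 + 20 = 28305220, with
80 being the expected read size):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout: a fixed header followed by fixed-size slot records. */
#define HDR_SIZE  20
#define SLOT_SIZE 80

/* Byte offset of a slot record within the (hypothetical) cache file. */
static off_t slot_offset(long slot)
{
    return (off_t)slot * SLOT_SIZE + HDR_SIZE;
}

/*
 * Read one slot record. Returns 0 on success, or a negative errno-style
 * code on failure. An I/O failure is logged and reported to the caller
 * rather than being treated as something that can never happen.
 */
static int read_slot(int fd, long slot, unsigned char out[SLOT_SIZE])
{
    off_t off;
    ssize_t n;

    assert(slot >= 0);              /* caller bug, not a disk error */

    off = slot_offset(slot);
    n = pread(fd, out, SLOT_SIZE, off);
    if (n != SLOT_SIZE) {
        /* Capture errno before calling anything that may change it. */
        int code = (n < 0) ? -errno : -EIO;
        fprintf(stderr, "cache read error: slot %ld off %lld code %d/%d\n",
                slot, (long long)off, (n < 0) ? code : (int)n, SLOT_SIZE);
        return code;
    }
    return 0;
}
```

A caller then checks the return value and backs out cleanly, instead
of asserting that the slot was obtained. The hard part described above
is doing that backing out correctly in every code path.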

Anyway, you're the first person I've seen actually hit this since the
more informative log messages were introduced, so hooray! Now we finally
know what's going on (in one scenario, at least).

Developer references: detection of the disk error was added in gerrit
7940, though many other commits have since changed it and fixed issues.
The "easy" error-handling cases mentioned above are in 7941 and 9287.
Some of the "harder" cases are 8376, 8377, 8405, 8406.

-- 
Andrew Deason
adeason@sinenomine.net