[OpenAFS] OpenAFS 1.8.4 Linux kernel BUG
Benjamin Kaduk
kaduk@mit.edu
Fri, 3 Apr 2020 20:35:31 -0700
On Wed, Apr 01, 2020 at 01:53:04PM +0100, Chris Cooke wrote:
> Hi,
>
> A machine of ours recently became unresponsive - these are the messages reported by journalctl for the time it happened:
>
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 320036 off 25602900/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 319975 off 25598020/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 231566 off 18525300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 320007 off 25600580/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 239740 off 19179220/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 229899 off 18391940/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 319838 off 25587060/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: failed to store file (5/0)
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: openafs: afs_InvalidateAllSegments tdc count
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: ------------[ cut here ]------------
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: kernel BUG at /builddir/build/BUILD/openafs-1.8.4/src/libafs/MODLOAD-3.10.0-1062.7.1.el7.x86_64-SP/afs_segments.c:556!
>
> Nothing further was logged until a reboot nearly an hour later.
> The machine runs Scientific Linux 7.6, and here's the output of "rpm -q kernel openafs" :
>
> kernel-3.10.0-1062.7.1.el7.x86_64
> openafs-1.8.4-1.el7.x86_64
That line number in 1.8.4 is a panic call in afs_InvalidateAllSegments():
543 if (afs_indexUnique[index] == avc->f.fid.Fid.Unique) {
544 tdc = afs_GetValidDSlot(index);
545 if (!tdc) {
546 /* In the case of fatal errors during stores, we MUST
547 * invalidate all of the relevant chunks. Otherwise, the chunks
548 * will be left with the 'new' data that was never successfully
549 * written to the server, but the DV in the dcache is still the
550 * old DV. So, we may indefinitely serve data to applications
551 * that is not actually in the file on the fileserver. If we
552 * cannot afs_GetValidDSlot the appropriate entries, currently
553 * there is no way to ensure the dcache is invalidated. So for
554 * now, to avoid risking serving bad data from the cache, panic
555 * instead. */
556 osi_Panic("afs_InvalidateAllSegments tdc count");
(The previous log message comes from afs_UFSGetDSlot() before it returns
failure, which would trickle back up to the afs_GetValidDSlot() call.)
Essentially, we think that we should have something in a cache file, e.g.,
because we already have an in-memory handle to it, but it wasn't there when
we went looking for it. Had this machine been running for a long time
without restart or needing to flush the (AFS) cache? How full is/was the
partition that the disk cache lives on?
There's not a whole lot to go on if we only have the one instance of the
crash, I fear.
-Ben