[OpenAFS] OpenAFS 1.8.4 Linux kernel BUG

Benjamin Kaduk kaduk@mit.edu
Fri, 3 Apr 2020 20:35:31 -0700


On Wed, Apr 01, 2020 at 01:53:04PM +0100, Chris Cooke wrote:
> Hi,
> 
> A machine of ours recently became unresponsive - these are the messages reported by journalctl for the time it happened:
> 
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 320036 off 25602900/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 319975 off 25598020/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 231566 off 18525300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 320007 off 25600580/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 239740 off 19179220/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 229899 off 18391940/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 319838 off 25587060/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: failed to store file (5/0)
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: afs: disk cache read error in CacheItems slot 653166 off 52253300/113096500 code -5/80
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: openafs: afs_InvalidateAllSegments tdc count
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: ------------[ cut here ]------------
> Mar 31 22:51:07 lute.inf.ed.ac.uk kernel: kernel BUG at /builddir/build/BUILD/openafs-1.8.4/src/libafs/MODLOAD-3.10.0-1062.7.1.el7.x86_64-SP/afs_segments.c:556!
> 
> Nothing further was logged until a reboot nearly an hour later.
> The machine runs Scientific Linux 7.6, and here's the output of "rpm -q kernel openafs" :
> 
> kernel-3.10.0-1062.7.1.el7.x86_64
> openafs-1.8.4-1.el7.x86_64

That line number in 1.8.4 is a panic call in afs_InvalidateAllSegments():

    543         if (afs_indexUnique[index] == avc->f.fid.Fid.Unique) {
    544             tdc = afs_GetValidDSlot(index);
    545             if (!tdc) {
    546                 /* In the case of fatal errors during stores, we MUST
    547                  * invalidate all of the relevant chunks. Otherwise, the chunks
    548                  * will be left with the 'new' data that was never successfully
    549                  * written to the server, but the DV in the dcache is still the
    550                  * old DV. So, we may indefinitely serve data to applications
    551                  * that is not actually in the file on the fileserver. If we
    552                  * cannot afs_GetValidDSlot the appropriate entries, currently
    553                  * there is no way to ensure the dcache is invalidated. So for
    554                  * now, to avoid risking serving bad data from the cache, panic
    555                  * instead. */
    556                 osi_Panic("afs_InvalidateAllSegments tdc count");

(The previous log message comes from afs_UFSGetDSlot() before it returns
failure, which would trickle back up to the afs_GetValidDSlot() call.)

Essentially, we think that we should have something in a cache file, e.g.,
because we already have an in-memory handle to it, but it wasn't there when
we went looking for it.  Had this machine been running for a long time
without restart or needing to flush the (AFS) cache?  How full is/was the
partition that the disk cache lives on?

There's not a whole lot to go on if we only have the one instance of the
crash, I fear.

-Ben