From cg2v@andrew.cmu.edu Fri Mar 26 14:14:10 2021
From: cg2v@andrew.cmu.edu (Chaskiel M Grundman)
Date: Fri, 26 Mar 2021 13:14:10 +0000
Subject: [OpenAFS-devel] short CacheItems reads - AND - vcache locking for afs_InvalidateAllSegments
Message-ID: <58a0f3617d62409087a702fe821330ed@andrew.cmu.edu>

While investigating a performance issue affecting timeshares at our institution (which I am provisionally blaming on other clients driving up IO load on the fileservers), I encountered a rerun of an issue that's been reported on openafs-info twice before:

[42342.692729] afs: disk cache read error in CacheItems slot 100849 off 8067940/8750020 code -5/80
(repeated)

But this one ends differently than https://lists.openafs.org/pipermail/openafs-info/2018-October/042576.html or https://lists.openafs.org/pipermail/openafs-info/2020-April/042930.html

[42342.697743] afs: Failed to invalidate cache chunks for fid NNN.NNN.NNN.NNN; our local disk cache may be throwing errors. We must invalidate these chunks to avoid possibly serving incorrect data, so we'll retry until we succeed. If AFS access seems to hang, this may be why.

[42342.697771] openafs: assertion failed: WriteLocked(&tvc->lock), file: /var/lib/dkms/openafs/1.8.6-2.el7_9/build/src/libafs/MODLOAD-3.10.0-1160.6.1.el7.x86_64-SP/afs_daemons.c, line: 606

The first thing I'm going to assert is that this isn't a hardware error. It affects multiple virtual systems, and no IO errors are logged by the kernel.

My assertion is that EIO is coming from osi_rdwr, which will turn a short read or write into EIO. My supposition, shared by others who have looked at this, is that the source of the problem is using ext4 as a cache (and perhaps also the dedicated cache filesystem being >80% full), and we're remediating that on these systems.

This does leave us with two problems in openafs:

* The use of EIO, leading to claims that people have hardware errors when they may not.
* The lock breakage.

For the former, I'd recommend that either the short IOs be logged, or a different code (perhaps ENODATA, if available?) be used to differentiate them from hardware errors.
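For illustration, here's roughly the pattern I mean. This is a hypothetical userspace-style sketch with made-up names, not the actual osi_rdwr code; it only shows how a short read gets folded into EIO today, and how a log line plus a distinct code (ENODATA where it exists) would keep that case separable from a real device error:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical illustration only -- not OpenAFS source; names are made up. */
static int
cache_read_exact(int fd, void *buf, size_t want, off_t off)
{
    ssize_t got = pread(fd, buf, want, off);

    if (got < 0)
        return -errno;          /* a genuine error reported by the kernel */

    if ((size_t)got != want) {
        /* Today this case is folded into EIO, making it indistinguishable
         * from a device error.  Logging it and returning a distinct code
         * (ENODATA where available) keeps the two cases separable. */
        fprintf(stderr, "short cache read: wanted %zu, got %zd at offset %lld\n",
                want, got, (long long)off);
#ifdef ENODATA
        return -ENODATA;
#else
        return -EIO;
#endif
    }
    return 0;
}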

For the latter, I believe that there's an inconsistency about the locking requirements of afs_InvalidateAllSegments. This comment claims the lock is held:

/*
 * Ask a background daemon to do this request for us. Note that _we_ hold
 * the write lock on 'avc', while the background daemon does the work. This
 * is a little weird, but it helps avoid any issues with lock ordering
 * or if our caller does not expect avc->lock to be dropped while
 * running.
 */

When called from afs_StoreAllSegments's error path, avc->lock is clearly held, because StoreAllSegments itself downgrades and upgrades the lock.

When called from afs_dentry_iput via afs_InactiveVCache, it seems like it isn't: none of the callers on any platform seems to lock the vcache before calling inactive (unless on some platforms there's aliasing between a VFS-level lock and vc->lock), and afs_remunlink expects to be called with avc unlocked.
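For illustration only, here is roughly what it would look like for the inactive path to satisfy the comment's contract. This is a sketch, not a proposed patch: the lock-ID number is a placeholder, I've elided/simplified the afs_InvalidateAllSegments argument list, and I have not checked whether taking avc->lock at this point is safe with respect to lock ordering under dentry_iput:

/* Sketch only -- assumes the usual OpenAFS kernel-module headers and types.
 * The lock ID is a placeholder and the argument list of
 * afs_InvalidateAllSegments is simplified. */
static void
inactive_invalidate_sketch(struct vcache *avc)
{
    ObtainWriteLock(&avc->lock, 999);      /* placeholder lock ID */
    osi_Assert(WriteLocked(&avc->lock));   /* the precondition the daemon code asserts */
    afs_InvalidateAllSegments(avc);        /* argument list elided/simplified */
    ReleaseWriteLock(&avc->lock);
}

The alternative is for the background-daemon path to stop asserting WriteLocked() on behalf of the inactive caller; either way, the contract should be written down and enforced the same way everywhere.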

From hozer@hozed.org Sat Mar 27 19:47:09 2021
From: hozer@hozed.org (Troy Benjegerdes)
Date: Sat, 27 Mar 2021 11:47:09 -0700
Subject: [OpenAFS-devel] rxgk and ipv6 status
Message-ID: <20210327184709.GA27911@bc.grid.coop>

The best I can find on Google on this subject is:

https://www.sinenomine.net/news/openafs-1.9.0

Is there any new movement on this project? What sort of problems should I expect if I run the 1.9.0 release in 'production', and what companies or consultants do any of you recommend (including yourselves) for quotes/estimates on the cost of getting to the point where I can either:

a) run openafs servers with IPv4 and port forwarding and/or Dynamic DNS, or
b) get full rxgk/ipv6 support?

Thanks

Troy, 7 Elements LLC