[OpenAFS] Cache corruption on Ubuntu 20.04 LTS (GNU/Linux 5.4.0-42-generic x86_64)

Rich Sudlow rich@nd.edu
Sat, 8 Aug 2020 09:37:02 -0400



Greetings Craig

We run the latest version of RH7 and ran into many of the same problems
you've described with the earlier OpenAFS 1.8.4 and 1.8.5 clients. We worked
with our OpenAFS support vendor, SineNomine, and those issues were fixed and
the fixes incorporated into the recent 1.8.6 client release.
We've been running 1.8.6 in production for the past 3 weeks on our busy
frontend machines (> 100 simultaneous users) without problems. We build from
SRPM. We've been using 1.8.6 on Fedora 32, RH7 and RH8 at the latest patch
levels, but testing mostly on RH7.
I'm fairly certain 1.8.6 will fix your issue.
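
Separately from the upgrade, the first fs getcacheparms output quoted below
shows 95% of the cache files in use. A minimal sketch, in Python, of how that
condition could be flagged from saved output. The function names and the
threshold parameter are illustrative assumptions (the 0.90 default mirrors the
"over 90%" warning in the quoted output), and the parser only understands the
two summary lines in the form they take in this thread.

```python
import re

# Patterns for the two summary lines as they appear in the quoted output.
BLOCKS_RE = re.compile(r"\((\d+) of (\d+) 1k blocks\)")
FILES_RE = re.compile(r"\((\d+) of (\d+) files\)")


def parse_cache_usage(text):
    """Return used/total counts for cache blocks and cache files."""
    blocks = BLOCKS_RE.search(text)
    files = FILES_RE.search(text)
    if blocks is None or files is None:
        raise ValueError("unrecognised fs getcacheparms output")
    return {
        "blocks_used": int(blocks.group(1)),
        "blocks_total": int(blocks.group(2)),
        "files_used": int(files.group(1)),
        "files_total": int(files.group(2)),
    }


def files_nearly_exhausted(usage, threshold=0.90):
    """True when the fraction of cache files in use exceeds the threshold."""
    return usage["files_used"] / usage["files_total"] > threshold


# Summary lines copied from the first machine's output in this thread.
SAMPLE = """\
AFS using    51% of cache blocks (1068658 of 2097152 1k blocks)
            95% of the cache files (62256 of 65536 files)
"""

if __name__ == "__main__":
    usage = parse_cache_usage(SAMPLE)
    print(usage)
    print("cache files nearly exhausted:", files_nearly_exhausted(usage))
```

On the first machine's figures this flags the cache as nearly out of files; on
the second machine's (5900 of 190570 files) it does not.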

Rich


On Sat, Aug 8, 2020 at 6:25 AM STRACHAN Craig <Craig.Strachan@ed.ac.uk>
wrote:

> Hi, sent this before but from the wrong email address. Apologies if it
> pops up twice.
>
> I was wondering if anyone has seen something like this, or has suggestions
> about how I could debug the issue should it happen again.
>
> We are moving our desktop environment from SL7 to Ubuntu 20.04 LTS. After
> a couple of weeks of trouble free performance, on Monday two different
> users on different machines (KVM guests if that makes any difference)
> suffered problems with cache corruption in their home directories within a
> couple of hours of each other. The messages in syslog looked like:
>
> Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory
> (5.536870965.13859.4201870 [inf.ed.ac.uk] @ffffb303425613c8, pos 0)
> Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory
> (5.536870965.13995.4201950 [inf.ed.ac.uk] @ffffb303423b7ec8, pos 0)
> Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory
> (5.536870965.13997.4201995 [inf.ed.ac.uk] @ffffb303423b75c8, pos 0)
> Aug  3 15:38:52 gazebo kernel: afs: Corrupt directory
> (5.536870965.13737.4201771 [inf.ed.ac.uk] @ffffb303423b69c8, pos 0)
>
> One user also saw input/output errors when trying to access some files.
>
> There were a number of byte-range locking warnings in both syslogs but
> none which referred to anything in the corrupted directories. The effect of
> the corruption was the appearance of one or more entries of the form
>
> -????????? ? ?       ?         ?            ? registrymodifications.xcu
>
> when doing an ls of the affected directory. Fs flush cleared up all but
> one of the issues. This required halting afsd and manually deleting the
> cache files to get things working again.
>
> Both users were very near the upper limits of their quotas when this
> happened but there was plenty of space in the file server partition and in
> both cache partitions. Both home volumes are on the same server and
> partition but there’s no evidence of anything going wrong in the server
> logs and none of our SL7 users have reported similar issues. The Ubuntu
> machines are running openafs 1.8.4~pre1-1ubuntu2-debian, the server is
> running SL7.6, kernel 3.10.0-1062.4.3.el7.x86_64 and
> openafs-server-1.8.4-1.el7.x86_64. Fs getcacheparms returns
>
> AFS using    51% of cache blocks (1068658 of 2097152 1k blocks)
>             95% of the cache files (62256 of 65536 files)
> afs_cacheFiles:      65536
> IFFree:               3280
> IFEverUsed:           9654
> IFDataMod:               1
> IFDirtyPages:            0
> IFAnyPages:              0
> IFDiscarded:             1
> DCentries:        9998
>  0k-   4K:       9087
>  4k-  16k:        460
> 16k-  64k:         70
> 64k- 256k:         21
> 256k-   1M:          6
>      >=1M:        354
> [cache file usage over 90%, consider increasing '-files' argument to afsd]
>
> on one machine and
>
> AFS using    29% of cache blocks (1783025 of 6098259 1k blocks)
>              3% of the cache files (5900 of 190570 files)
> afs_cacheFiles:     190570
> IFFree:             184670
> IFEverUsed:           2270
> IFDataMod:              50
> IFDirtyPages:            0
> IFAnyPages:              0
> IFDiscarded:             0
> DCentries:        9998
>  0k-   4K:       5639
>  4k-  16k:       1638
> 16k-  64k:        606
> 64k- 256k:        308
> 256k-   1M:        262
>      >=1M:       1545
>
> on the other.
>
> Does anyone have any idea what might be going on or any further steps I
> can take to investigate the problem if it happens again? All suggestions
> welcome!
>
> Thanks in advance,
> Craig.
> ---
> Craig Strachan, Computing Officer,
> School of Informatics, University of Edinburgh
>
>
>
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>


-- 
Rich Sudlow
University of Notre Dame
Center for Research Computing - Union Station
506 W. South St
South Bend, In 46601

(574) 631-7258 (office)
(574) 807-1046 (cell)
