[OpenAFS] disk cache read error in CacheItems

Martin Flemming martin.flemming@desy.de
Fri, 26 Oct 2018 14:00:16 +0200 (CEST)


Hi, and thanks for the response!

In the last few days we've had the identical situation with these error messages ...
sometimes they started being logged on all machines at the same time ...
network traffic is not extremely high ...


The filesystem of the AFS cache is ext4 and its size is 8 GB.

The afsd options are: /usr/vice/etc/afsd -afsdb -dynroot -fakestat

The cacheinfo file (/usr/vice/etc/cacheinfo) contains: /afs:/var/cache/afs:5552000
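
For reference, the cacheinfo line has three colon-separated fields: the AFS mount point, the cache directory, and the cache size in 1 KiB blocks. A minimal sketch of reading it, where the names CacheInfo and parse_cacheinfo are made up for this illustration and are not part of OpenAFS:

```python
from typing import NamedTuple


class CacheInfo(NamedTuple):
    mount_point: str   # where AFS is mounted, e.g. /afs
    cache_dir: str     # directory holding the disk cache
    size_kb: int       # configured cache size in 1 KiB blocks


def parse_cacheinfo(line: str) -> CacheInfo:
    """Split a cacheinfo line of the form mount:cachedir:size."""
    mount_point, cache_dir, size = line.strip().split(":")
    return CacheInfo(mount_point, cache_dir, int(size))


info = parse_cacheinfo("/afs:/var/cache/afs:5552000")
# Convert 1 KiB blocks to GiB for comparison with the partition size.
print(info.cache_dir, f"{info.size_kb / 1024**2:.2f} GiB")
```

The configured size comes out at a bit over 5 GiB, so the cache is set well below the size of the partition it lives on.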

[root@bird070 ~]# fs getcacheparms -excessive
AFS using    88% of cache blocks (4908415 of 5552000 1k blocks)
              29% of the cache files (49470 of 173500 files)
 	afs_cacheFiles:     173500
 	IFFree:             124030
 	IFEverUsed:           9551
 	IFDataMod:               3
 	IFDirtyPages:            0
 	IFAnyPages:              0
 	IFDiscarded:             0
 	DCentries:        9997
 	  0k-   4K:        267
 	  4k-  16k:        229
 	 16k-  64k:       9061
 	 64k- 256k:        212
 	256k-   1M:         10
 	      >=1M:        218
[root@bird070 ~]# df -i|grep cache |grep afs
/dev/sda3                                          512064    173599     338465   34% /var/cache/afs
[root@bird070 ~]# df -h|grep cache |grep afs
/dev/sda3                                       7.6G  4.7G  2.5G  66% /var/cache/afs
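
The same space and inode numbers that df reports can also be collected from a script, which may be handy for checking many nodes. This is only a sketch: fs_headroom is a hypothetical helper name, and the demo call uses "/" so that it runs anywhere; on a client one would pass /var/cache/afs.

```python
import os


def fs_headroom(path: str) -> dict:
    """Return free space and inode counts for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        # Blocks available to unprivileged users, converted to KiB.
        "free_kb": st.f_bavail * st.f_frsize // 1024,
        "free_inodes": st.f_ffree,
        "total_inodes": st.f_files,
    }


# Assumption: "/" is used here only so the example runs on any host.
report = fs_headroom("/")
print(report)
```

On bird070 the df -i output above shows 173599 inodes in use against 173500 cache files, so neither space nor inodes look exhausted there.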

[root@bird058 ~]# fs getcacheparms -excessive
AFS using    86% of cache blocks (4768364 of 5552000 1k blocks)
              25% of the cache files (43806 of 173500 files)
 	afs_cacheFiles:     173500
 	IFFree:             129694
 	IFEverUsed:           9929
 	IFDataMod:               2
 	IFDirtyPages:            0
 	IFAnyPages:              0
 	IFDiscarded:             0
 	DCentries:        9998
 	  0k-   4K:       5074
 	  4k-  16k:       1639
 	 16k-  64k:       1728
 	 64k- 256k:        440
 	256k-   1M:        115
 	      >=1M:       1002

[root@bird652 ~]# fs getcacheparms -excessive
AFS using    89% of cache blocks (4917473 of 5552000 1k blocks)
              34% of the cache files (58678 of 173500 files)
 	afs_cacheFiles:     173500
 	IFFree:             114822
 	IFEverUsed:           9913
 	IFDataMod:               0
 	IFDirtyPages:            0
 	IFAnyPages:              0
 	IFDiscarded:             0
 	DCentries:        9999
 	  0k-   4K:       2372
 	  4k-  16k:       4863
 	 16k-  64k:       2047
 	 64k- 256k:        154
 	256k-   1M:         78
 	      >=1M:        485
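
To compare the three hosts at a glance, the first line of each report can be parsed and the block usage recomputed. A rough sketch, assuming the summary line always has the form shown above; block_usage and SUMMARY_RE are names chosen for this example only:

```python
import re

# Matches e.g. "AFS using    88% of cache blocks (4908415 of 5552000 1k blocks)"
SUMMARY_RE = re.compile(
    r"AFS using\s+(\d+)% of cache blocks \((\d+) of (\d+) 1k blocks\)"
)


def block_usage(text: str) -> tuple:
    """Return (reported_percent, used_blocks, total_blocks)."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no cache block summary found")
    percent, used, total = map(int, match.groups())
    return percent, used, total


# Summary lines copied from the three reports above.
samples = {
    "bird070": "AFS using    88% of cache blocks (4908415 of 5552000 1k blocks)",
    "bird058": "AFS using    86% of cache blocks (4768364 of 5552000 1k blocks)",
    "bird652": "AFS using    89% of cache blocks (4917473 of 5552000 1k blocks)",
}

for host, line in samples.items():
    percent, used, total = block_usage(line)
    print(f"{host}: {used / total:.1%} used, {total - used} blocks free")
```

All three clients sit between 85% and 90% of the configured cache blocks, with more than 600000 1k blocks still free on each.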

thanks & cheers,

             martin
On Tue, 23 Oct 2018, Benjamin Kaduk wrote:

> On Tue, Oct 23, 2018 at 02:14:38PM +0200, Stephan Wiesand wrote:
>>
>>> On 23. Oct 2018, at 12:16, Andreas Ladanyi <andreas.ladanyi@kit.edu> wrote:
>>>
>>>> In the last few days we've observed an increasing number of nodes
>>>> which can no longer be reached and have to be rebooted.
>>>>
>>>> In /var/log/messages we see a lot of lines such as:
>>>>
>>>> Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
>>>> CacheItems slot 25254 off 2020340/13880020 code -5/80
>>>> Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
>>>> CacheItems slot 25253 off 2020260/13880020 code -5/80
>>>> Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
>>>> CacheItems slot 25252 off 2020180/13880020 code -5/80
>>>> Oct 22 18:48:26 bird858 kernel: afs: disk cache read error in
>>>> CacheItems slot 25251 off 2020100/13880020 code -5/80
>>>>
>>>> until nothing happens anymore ...
>>>>
>>>> The clients are CentOS 7.5, 3.10.0-862.14.4.el7.x86_64, OpenAFS
>>>> 1.6.23 built 2018-09-12 (289.sl7.862.11.6@fnal.gov)
>>>>
>>>> Any hints as to the possible reason?
>>>
>>> I have the same setup with the AFS 1.6.23 client from the jsbilling repo.
>>>
>>> I can't see these messages in /var/log/messages yet.
>>
>> We're running the same kernel version and the same client build (it's the SL one) on a fair number of SL 7.4 systems, and don't see these issues either.
>>
>> -5 is EIO, meaning an actual I/O error is reported.
>>
>> What's the size and type of the cache filesystems? What does "fs getcache" report? What are the afsd parameters? Could these nodes be out of space or inodes for the cache?
>
> It's also possible that the actual disk is having trouble, and/or got
> remounted RO.  dmesg and/or syslog might have some clues.
>
> (Interestingly enough, we had some changes go by recently on master to make
> the error handling for certain cases in this same class more graceful (i.e.,
> fail requests but not panic), though those changes are not in 1.6.23.)
>
> -Ben

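The kernel messages quoted above can be taken apart to see how the numbers relate. The sketch below is an interpretation of the numbers only: reading the trailing 80 as a per-slot record size and the leftover 20 bytes as a file header is an assumption that fits the arithmetic, not something confirmed in this thread. The helper name decode is made up for the example.

```python
import errno
import os
import re

LOG_RE = re.compile(
    r"CacheItems slot (\d+) off (\d+)/(\d+) code (-?\d+)/(\d+)"
)


def decode(message: str) -> dict:
    """Extract the fields of a CacheItems read-error message."""
    match = LOG_RE.search(message)
    if match is None:
        raise ValueError("not a CacheItems read-error message")
    slot, offset, file_size, code, record = map(int, match.groups())
    # Assumption: offset = slot * record + header, so the remainder is a header.
    header = offset - slot * record
    return {
        "slot": slot,
        "errno_name": errno.errorcode.get(-code, "unknown"),
        "errno_text": os.strerror(-code),
        "header_bytes": header,
        "slots_in_file": (file_size - header) // record,
    }


# One message from the quoted log, with the mail line wrap removed.
line = ("afs: disk cache read error in CacheItems slot 25254 "
        "off 2020340/13880020 code -5/80")
info = decode(line)
print(info)
```

Under that reading, the file size of 13880020 bytes corresponds to 173500 slots, which matches the afs_cacheFiles value in the reports above, and code -5 maps to EIO as Stephan noted. The offsets in the messages are therefore within the expected bounds of CacheItems, which points at the read itself failing rather than at a bogus offset.
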
Regards

        Martin Flemming


______________________________________________________
Martin Flemming
DESY / IT          office : Building 2b / 008a
Notkestr. 85       phone  : 040 - 8998 - 4667
22603 Hamburg      mail   : martin.flemming@desy.de
______________________________________________________