[OpenAFS] Re: accessing R/O volume becomes slow

Andrew Deason adeason@sinenomine.net
Wed, 26 Nov 2014 14:15:10 -0600


On Wed, 26 Nov 2014 10:51:00 +0100
Hans-Werner Paulsen <hans@MPA-Garching.MPG.DE> wrote:

> this is on Linux 3.14.8 x86_64, and OpenAFS 1.6.9. The machine runs 
> normally for several months, and then accessing a specific R/O 
> volume (e.g. ls -lR <large_volume>) becomes slow.

Do you mean it's slow when you hit the net, or even when you expect
everything to be cached? (That is, if you run ls -lR twice in a row, is
the second run still slow?)

I also echo Ben's suggestion to try other volumes on the same server.
Try to isolate whether the problem is the server itself, the specific
partition, or just that one volume. Or maybe it could be a specific dir
somewhere in the volume.
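
For example (the server, partition, and volume names here are
placeholders you would fill in from your own cell):

    # find the fileserver and volume behind the path in question
    fs whereis /afs/example.org/some/ro/volume
    vos examine <volume_name>

    # list other volumes on the same server/partition, then try an
    # ls -lR in one or two of them for comparison
    vos listvol <fileserver> <partition>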

> Checking the machine I see more than 5 million afs_inode_cache slab 
> entries. Is this normal? Any hint on how to proceed?

That's not unusual if you are accessing a lot of files (say, about 5
million recently accessed). But having a lot of vcaches in memory can
make certain operations slow; a fix was just added in 1.6.10 that
improves the speed of a background cleanup process when there are many
files (well, and PAGs): 94f1d4.
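
If you want to watch the slab usage and the client's limits, something
like the following may help (the afsd options file location varies by
distribution, so treat that path as an example):

    # how many afs_inode_cache objects the kernel currently holds
    grep afs_inode_cache /proc/slabinfo

    # the stat/vcache limit is set when the client starts (afsd -stat);
    # check the options your init scripts pass to afsd
    grep stat /etc/sysconfig/openafs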

Other information that could be gathered: fstrace data (though if data
is going by too quickly, it can be hard to get anything useful out of
it), 'strace' syscall timing information (to see which syscalls are
slow), or a network dump if you are hitting the net in the cases you're
talking about. A network dump can help show whether it's the client or
the server that's being slow when you compare a 'fast' run against a
'slow' run.
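
Roughly, the sort of commands I mean (output paths are just examples):

    # fstrace: enable the cm trace set, reproduce the problem, dump it
    fstrace setset -set cm -active
    fstrace dump -set cm -file /tmp/fstrace.out
    fstrace setset -set cm -inactive

    # strace: per-syscall timing (-T), or a per-syscall summary (-c)
    strace -f -T -o /tmp/ls.strace ls -lR <large_volume> > /dev/null
    strace -c ls -lR <large_volume> > /dev/null

    # network dump of fileserver traffic (the fileserver is UDP port 7000)
    tcpdump -s 0 -w /tmp/afs.pcap udp port 7000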

Traces like that are hard to look at when you have a ton of data to sort
through, but it's still feasible to compare timings from a 'slow' run to
a 'fast' run to try to see if the speed difference is coming from a
particular place.
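
If it helps, capturing the summary form of strace for both a 'fast' and
a 'slow' run makes that comparison fairly mechanical (the volume paths
and filenames below are placeholders):

    strace -c -o /tmp/fast.summary ls -lR <some_other_volume> > /dev/null
    strace -c -o /tmp/slow.summary ls -lR <large_volume> > /dev/null
    diff /tmp/fast.summary /tmp/slow.summary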

-- 
Andrew Deason
adeason@sinenomine.net