[OpenAFS] Performance issue with "many" volumes in a single /vicep?

Tom Keiser tkeiser@sinenomine.net
Wed, 24 Mar 2010 23:43:32 -0400


On Wed, Mar 24, 2010 at 4:32 PM, Steve Simmons <scs@umich.edu> wrote:
>
> On Mar 18, 2010, at 2:37 AM, Tom Keiser wrote:
>
>>> On Wed, Mar 17, 2010 at 7:41 PM, Derrick Brashear <shadow@gmail.com> wrote:
>>> On Wed, Mar 17, 2010 at 12:50 PM, Steve Simmons <scs@umich.edu> wrote:
>>>> We've been seeing issues for a while that seem to relate to the
>>>> number of volumes in a single vice partition. The numbers and data
>>>> are inexact because there are so many damned possible parameters
>>>> that affect performance, but it appears that somewhere between
>>>> 10,000 and 14,000 volumes performance falls off significantly.
>>>> That 40% difference in volume count results in 2x to 3x falloffs
>>>> for performance in issues that affect the /vicep as a whole -
>>>> backupsys, nightly dumps, vos listvol, etc.
>>>>
>>
>> First off, could you describe how you're measuring the performance
>> drop-off?
>
> Wall clock, mostly. Operations which touch all the volumes on a
> server take disproportionately longer on servers with 14,000 volumes
> vs servers with 10,000. The best operations to show this are vos
> backupsys and our nightly dumps, which call vos dump with various
> parameters on every volume on the server.
>

Ok.  Well, this likely rules out the volume hash chain suggestion--we
don't directly use the hash table in the volserver (although we do
perform at least two lookups as a consequence of performing fssync
ops as part of the volume transaction).  The reason I say it's
unlikely is that fssync overhead is an insignificant component of the
execution time for the vos ops you're talking about.


>> The fact that this relationship b/t volumes and performance is
>> superlinear makes me think you're exceeding a magic boundary (e.g
>> you're now causing eviction pressure on some cache where you weren't
>> previously...).
>
> Our estimate too. But before drilling down, it seemed worth checking
> if anyone else has a similar server - ext3 with 14,000 or more
> volumes in a single vice partition - and has seen a difference.
> Note, though, that it's not the number of inodes or total disk usage
> in the partition; the servers that exhibited the problem had a large
> number of mostly empty volumes.
>

Sure.  Makes sense.  The one thing that does come to mind is that,
regardless of the number of inodes, ISTR some people were having
trouble with ext performance when htree indices were turned on:
spatial locality of reference against the inode tables goes way down
when you process files in the order returned by readdir(), since
readdir() in htree mode returns entries in hash chain order rather
than more-or-less inode order.  This could definitely have a huge
impact on the salvager [especially GetVolumeSummary(), and to a
lesser extent ListViceInodes() and friends].  I'm less certain how it
would affect things in the volserver, but it would certainly have an
effect on operations which delete clones, since the nuke code also
calls ListViceInodes().
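
If that turns out to be what's happening, the usual mitigation is to
slurp the readdir() results and sort them by inode number before
touching the inodes, so the inode tables get walked more or less
sequentially.  Here's a minimal standalone sketch of the idea -- plain
POSIX, not OpenAFS code, and whether it actually helps on your
partitions would need to be measured:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* One directory entry: the name plus the inode number readdir() handed us. */
struct ent {
    char name[NAME_MAX + 1];
    ino_t ino;
};

static int cmp_ino(const void *a, const void *b)
{
    const struct ent *x = a, *y = b;
    return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
    DIR *d;
    struct dirent *de;
    struct ent *ents = NULL;
    size_t n = 0, cap = 0, i;
    struct stat st;
    char path[PATH_MAX];

    if (argc != 2) {
        fprintf(stderr, "usage: %s <dir>\n", argv[0]);
        return 1;
    }
    if ((d = opendir(argv[1])) == NULL) {
        perror("opendir");
        return 1;
    }

    /* Slurp the whole directory first (hash-chain order on htree)... */
    while ((de = readdir(d)) != NULL) {
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            ents = realloc(ents, cap * sizeof(*ents));
            if (ents == NULL) { perror("realloc"); return 1; }
        }
        strncpy(ents[n].name, de->d_name, NAME_MAX);
        ents[n].name[NAME_MAX] = '\0';
        ents[n].ino = de->d_ino;
        n++;
    }
    closedir(d);

    /* ...then process it in inode order instead. */
    qsort(ents, n, sizeof(*ents), cmp_ino);
    for (i = 0; i < n; i++) {
        snprintf(path, sizeof(path), "%s/%s", argv[1], ents[i].name);
        if (stat(path, &st) == 0)
            printf("%lu %s\n", (unsigned long)st.st_ino, ents[i].name);
    }
    free(ents);
    return 0;
}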

In addition, with regard to ext htree indices, I'll pose the
(completely untested) hypothesis that htree indices aren't necessarily
a net win for the namei workload.  Given that namei goes to great
lengths to avoid large directories (with the notable exception of the
/vicepXX root dir itself), it is conceivable that htree overhead is
actually a net loss.  I don't know for sure, but I'd say it's worth
doing further study.  In a volume with files >> dirs you're going to
see on the order of ~256 files per namei directory.  A linear search
averaging 128 entries certainly isn't free, but it's worth verifying
empirically whether htree wins at that size, because we don't know how
much overhead htree and its side-effects produce.  Regrettably, there
don't seem to be any published results on the threshold above which
htree becomes a net win...
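
If anyone feels like putting a number on it, something as crude as the
sketch below -- again plain POSIX, not OpenAFS code; point it at any
directory with a couple hundred entries -- run on an ext3 file system
with and without dir_index would at least show which way the lookup
cost goes at namei directory sizes.  (Note it mostly measures cached
lookups; drop the caches between runs if you want cold numbers.)

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>

#define ROUNDS 1000   /* repeat lookups to get a measurable elapsed time */

int main(int argc, char **argv)
{
    DIR *d;
    struct dirent *de;
    char (*names)[NAME_MAX + 1] = NULL;
    size_t n = 0, cap = 0, i;
    int r;
    struct stat st;
    struct timeval t0, t1;
    char path[PATH_MAX];
    double elapsed;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <dir>\n", argv[0]);
        return 1;
    }
    if ((d = opendir(argv[1])) == NULL) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(d)) != NULL) {
        if (n == cap) {
            cap = cap ? cap * 2 : 512;
            names = realloc(names, cap * sizeof(*names));
            if (names == NULL) { perror("realloc"); return 1; }
        }
        strncpy(names[n], de->d_name, NAME_MAX);
        names[n][NAME_MAX] = '\0';
        n++;
    }
    closedir(d);

    /* Time ROUNDS passes of name->inode lookups over the directory. */
    gettimeofday(&t0, NULL);
    for (r = 0; r < ROUNDS; r++)
        for (i = 0; i < n; i++) {
            snprintf(path, sizeof(path), "%s/%s", argv[1], names[i]);
            stat(path, &st);
        }
    gettimeofday(&t1, NULL);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%zu entries, %d rounds: %.3f s (%.2f us/lookup)\n",
           n, ROUNDS, elapsed, 1e6 * elapsed / (ROUNDS * n));
    return 0;
}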

Finally, you did tune2fs -O dir_index <dev> before populating the file
system, right?


>> Another possibility accounting for the superlinearity, which would
>> very much depend upon your workload, is that by virtue of increased
>> volume count you're now experiencing higher volume operation
>> concurrency, thus causing higher rates of partition lock contention.
>> However, this would be very specific to the volume server and
>> salvager--it should not have any substantial effect on the file
>> server, aside from some increased VOL_LOCK contention...
>
> Salvager is not involved, or at least, hasn't yet been involved.
> It's vos backupsys and vos dump where we see it mostly.


What I was trying to say is that if the observed performance
regression involves either the volserver or the salvager, then it
could involve partition lock contention.  However, this will only
come into play if you're running a lot of vos jobs in parallel
against the same vice partition...
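
To be clear about the failure mode: everything holding the partition
lock executes one at a time, so N parallel jobs against one /vicep
degrade toward sequential.  A toy illustration of that pattern --
flock() on a placeholder lock file, not the actual OpenAFS locking
code -- is below; eight forked "jobs" each holding the lock for 200ms
take roughly 1.6s of wall clock instead of 0.2s:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define NJOBS   8
#define WORK_MS 200   /* pretend each "vos op" holds the lock this long */

/* Simulate one vos job: grab the per-partition lock, "work", release. */
static void job(const char *lockfile)
{
    int fd = open(lockfile, O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); _exit(1); }
    flock(fd, LOCK_EX);          /* serializes with every other holder */
    usleep(WORK_MS * 1000);      /* stand-in for the real volume op    */
    flock(fd, LOCK_UN);
    close(fd);
    _exit(0);
}

int main(int argc, char **argv)
{
    /* All jobs share one lock file, modeling jobs on one partition. */
    const char *lockfile = (argc > 1) ? argv[1] : "/tmp/vicepa.lock";
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < NJOBS; i++)
        if (fork() == 0)
            job(lockfile);
    while (wait(NULL) > 0)
        ;
    gettimeofday(&t1, NULL);

    printf("%d jobs, one %dms op each: %.2f s total\n", NJOBS, WORK_MS,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    return 0;
}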

Cheers,

-Tom