[OpenAFS] Re: odd problem with RW site after a botched replica

Timothy Balcer timothy@telmate.com
Wed, 31 Oct 2012 10:43:02 -0700



CPU and IO, it seemed. Load average was at 3+ on a VM that had only 2 cores,
so more CPU would have been better.

The vice partitions are slices on an underlying LVM system from its dom0.
So there are definitely other bottlenecks.
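
For anyone who wants to put rough numbers on "CPU and IO" during a long
salvage, here is a minimal sketch, assuming a Linux fileserver where
/proc/stat is readable. It samples the aggregate "cpu" line twice and
reports how the elapsed ticks split between busy, iowait and idle. The
function names and the 5 second interval are illustrative choices, not
anything from OpenAFS, and inside a VM the numbers only reflect what the
guest sees.

```python
import time

# Leading columns of the aggregate "cpu" line in /proc/stat.
# The trailing guest columns are skipped: guest time is already in "user".
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of ticks spent busy, in iowait, and idle between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in a}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("no ticks elapsed between samples")
    busy = total - delta["idle"] - delta["iowait"]
    return {
        "busy": 100.0 * busy / total,
        "iowait": 100.0 * delta["iowait"] / total,
        "idle": 100.0 * delta["idle"] / total,
    }


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # illustrative sampling interval
    second = read_cpu_line()
    print(cpu_breakdown(first, second))
```

A high "busy" share with little iowait suggests the salvager would
benefit from more cores; a high iowait share points back at the storage
underneath the vice partitions.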

I have been considering running VMs to spread out fs operations on a
machine with many cores. On a 12 core machine, for example, I would make
something like 4 fileservers each with 3 cores, and the underlying OS would
be doing nothing except servicing those VMs. Do you think this would allow
for better performance?

On Wed, Oct 31, 2012 at 8:29 AM, Andrew Deason <adeason@sinenomine.net> wrote:

> On Tue, 30 Oct 2012 20:07:57 -0700
> Timothy Balcer <timothy@telmate.com> wrote:
>
> > In other news, the latest salvage has been running for 12 hours... I
> > straced the busiest pid and it is happily verifying all the links and
> > contents (open(), close(), pread() ad infinitum), so it's not wedged.
> > This volume has literally slightly less than 32k directory entries in
> > various places (yes, I made SURE the limits were observed ;-) ) and so
> > I imagine it will take a very long time to traverse the entire
> > thing... interesting that this is the fourth salvage and it actually
> > seems to be working at it this time. Last three times it stopped after
> > a bit over an hour.
>
> I am just curious; does the machine seem to be cpu-bound during this
> process? There has been some work done to parallelize this, so in the
> future this could be faster (if, among other things, it seems cpu-bound
> and you have multiple cores).
>
> > I'll keep you all posted. There wasn't an error in the AFS logs that
> > indicated that salvager processes had been killed due to OOM. It was
> > only in the kernel logs.
>
> If you started this via 'bos salvage', there should be something in
> BosLog to say that it was killed by signal 9.
>

Yeah... as I mentioned, those logs are gone :-( Definitely making
syslog-style logging a priority!

-- 
Timothy Balcer / IT Services
Telmate / San Francisco, CA
Direct / (415) 300-4313
Customer Service / (800) 205-5510
