[OpenAFS] Slow loading of virtually hosted web content

Kendrick Hernandez kendrick.hernandez@umbc.edu
Mon, 29 Nov 2021 13:11:44 -0500


Hi all,

Thanks for all the replies and info!

We were able to narrow the problem down to DNS timeouts from an internal
DNS server that had reached its limit for netfilter (NF) connection tracking.
Once that limit was increased, the issue went away.
With some forwarded insights from the folks at CMU and some isolated testing,
we were also able to confirm that disabling dynamic root and DNS-based server
discovery on the cache manager works around the issue.
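
For anyone hitting the same thing, the changes boil down to something like
the following (values and file locations are illustrative, not our exact
settings):

  # on the DNS server: raise the netfilter conntrack ceiling
  sysctl -w net.netfilter.nf_conntrack_max=262144
  echo 'net.netfilter.nf_conntrack_max = 262144' > /etc/sysctl.d/90-conntrack.conf

  # on the web servers: remove -afsdb (DNS-based server discovery) and
  # -dynroot from the afsd arguments, e.g. in /etc/sysconfig/openafs
  # (the exact file and variable name depend on your packaging)
  AFSD_ARGS="-stat 100000 -daemons 6"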

Thanks again!
k-

On Fri, Nov 19, 2021 at 9:34 PM Jeffrey E Altman <jaltman@auristor.com>
wrote:

> On 11/10/2021 3:27 PM, Kendrick Hernandez (kendrick.hernandez@umbc.edu)
> wrote:
>
> Hi all,
>
> We host around 240 departmental and campus web sites (individual AFS
> volumes) across 6 virtual web servers on AFS storage. The web servers are
> 4-core, 16G VMs, and the 4 file servers are 4-core, 32G VMs. All are CentOS 7
> systems.
>
> In the past week or so, we've encountered high load on the web servers
> (the primary consumers being apache and afsd) during periods of increased
> traffic, and we're trying to identify ways to tune performance.
>
> "In the past week or so" appears to imply that the high-load was not
> observed previously.  If that is the case, one question to ask is "what
> changed?"  Analysis of the Apache access and error logs compared to the
> prior period might provide some important clues.
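>
> For instance (assuming the stock combined access log format), a crude
> per-minute request count for comparing a busy period against the prior
> period can be pulled with something like:
>
>   awk '{print $4}' access_log | cut -d: -f1-3 | sort | uniq -c | sort -rn | head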
>
> After seeing the following in the logs:
>
> 2021 11 08 08:52:03 -05:00 virthost4 [kern.warning] kernel: afs: Warning:
>> We are having trouble keeping the AFS stat cache trimmed down under the
>> configured limit (current -stat setting: 3000, current vcache usage: 18116).
>> 2021 11 08 08:52:03 -05:00 virthost4 [kern.warning] kernel: afs: If AFS
>> access seems slow, consider raising the -stat setting for afsd.
>
> There is a one-to-one mapping between AFS vnodes and Linux inodes.  Unlike
> some other platforms with OpenAFS kernel modules, the Linux kernel module
> does not strictly enforce the vnode cache (aka vcache) limit.  When the
> limit is reached, instead of recycling an existing vnode, new vnodes are
> created and a background task attempts to prune the excess.  It's that
> background task which is logging the text quoted above.
>
> I increased the disk cache to 10g and the -stat parameter to 100000, which
> has improved things somewhat, but we're not quite there yet.
>
> As Ben Kaduk mentioned in his reply, callback promises must be tracked by
> both the fileserver and the client.  Increasing the vcache (-stat) limit
> increases the number of vnodes for which callbacks must be tracked.  The
> umbc.edu cell is behind a firewall, so it's not possible for me to probe
> the fileserver statistics to determine if increasing to 100,000 on the
> clients also requires an increase on the fileservers.  If the fileserver
> callback table is full, then it might have to prematurely break callback
> promises to satisfy the new allocation.  A callback break requires issuing
> an RPC to the client whose promise is being broken.
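>
> As a rough sketch only (the value below is a placeholder, not a
> recommendation for this cell), the callback table is sized by the
> fileserver's -cb argument, so a matching server-side change would look
> something like:
>
>   /usr/afs/bin/fileserver -cb 1500000 <existing options>
>
> applied to the fs bnode and followed by a restart of the instance
> (bos restart <server> fs).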
>
> This is the current client cache configuration from one of the web servers:
>
> Chunk files:   281250
>> Stat caches:   100000
>> Data caches:   10000
>>
> The data cache might need to be increased if the web servers are serving
> content from more than 18,000 files.
>
> Volume caches: 200
>>
> If the web servers are serving data from 240 volumes, then 200 volumes is
> too small.
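>
> For illustration only (numbers are placeholders, not a recommendation for
> this site), the relevant afsd knobs could be raised together along these
> lines:
>
>   afsd -stat 100000 -files 281250 -dcache 50000 -volumes 512
>
> where -files corresponds to the "Chunk files" figure in the listing,
> -dcache to "Data caches", and -volumes to "Volume caches".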
>
> Chunk size:    1048576
>> Cache size:    9000000 kB
>> Set time:      no
>> Cache type:    disk
>
>
> Has anyone else experienced this? I think the bottleneck is with the cache
> manager and not the file servers themselves, because they don't seem to be
> impacted much during those periods of high load, and I can access files in
> those web volumes from my local client without any noticeable lag.
>
> Apart from the cache settings, how the web server is configured and how it
> accesses content from /afs matters.
>
> * Are the web servers configured with mod_waklog to obtain tokens for
> authenticated users?
>
> * Are PAGs in use?
>
> * How many active RX connections are there from the cache manager to the
> fileservers?  (One quick way to count them is sketched after these
> questions.)
>
> * Are the volumes being served primarily RW volumes or RO volumes?
>
> * Are the contents of the volumes frequently changing?
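>
> (On the RX connection question above: one quick, approximate way to look at
> it from a web server is something like
>
>   rxdebug <webserver> 7001 | grep -c "Connection from"
>
> which asks the cache manager's callback service on port 7001 to enumerate
> its connections; the same against port 7000 on a fileserver shows the
> server side.)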
>
> Finally, compared to the AuriStorFS and kafs clients, the OpenAFS cache
> manager suffers from a number of bottlenecks on multiprocessor systems due
> to reliance on a global lock to protect internal data structures.  The
> cache manager's callback service is another potential bottleneck because
> only one incoming RPC can be processed at a time and each incoming RPC must
> acquire the aforementioned global lock for the life of the call.
>
> Good luck,
>
> Jeffrey Altman
>
>
>

-- 
Kendrick Hernandez
*UNIX Systems Administrator*
Division of Information Technology
University of Maryland, Baltimore County
