[OpenAFS] accessing /afs processes go into device wait

Stephan Wiesand stephan.wiesand@desy.de
Thu, 8 Nov 2018 19:59:18 +0100


Have you tried w/o -afsdb?
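
In case it helps to compare timings with and without -afsdb: as far as
I know, with -dynroot and -afsdb a lookup of an unknown name directly
under /afs (such as .htaccess) can make the cache manager ask DNS for a
cell of that name, so a slow or unanswered query could show up as this
kind of wait. Below is a minimal sketch for timing repeated stat calls.
The time_stat helper name is made up for illustration, and the script
assumes GNU date (for %N), as shipped on RHEL.

```shell
# Time repeated stat calls on a path and print one elapsed time per run.
# Hypothetical helper, not part of OpenAFS. Assumes GNU date (%N = nanoseconds).
time_stat() {
    path=$1
    runs=${2:-5}    # number of runs, default 5
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        # The result is ignored on purpose: the file may not exist,
        # only the time the lookup takes is of interest.
        stat "$path" >/dev/null 2>&1
        end=$(date +%s%N)
        echo "run $((i + 1)): $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}
```

Something like "time_stat /afs/.htaccess 5" could then be run once with
and once without -afsdb. Note that a lookup that hangs indefinitely
would also hang this loop.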

> On 08 Nov 2018, at 19:48, John Sopko <sopko@cs.unc.edu> wrote:
>
> nsswitch and DNS are the same, the AFSDB records resolve fine, and the
> /afs/cs.unc.edu cell works fine, just not /afs.
>
> On Thu, Nov 8, 2018 at 12:52 PM Stephan Wiesand <stephan.wiesand@desy.de> wrote:
>>
>>> On 8. Nov 2018, at 18:22, John Sopko <sopko@cs.unc.edu> wrote:
>>>
>>> I have been running two legacy Red Hat 6.x web servers for several
>>> years. Over the last few days the Apache httpd processes on one of the
>>> servers started going into device wait state; the other server is fine,
>>> and both are configured pretty much the same. I tracked this down to the
>>> web server trying to stat /afs/.htaccess. If I run ls in /afs or
>>> cat /afs/.htaccess (which does not exist), the commands first go into
>>> device wait state and take a long time to complete: it can take several
>>> minutes, or they may hang indefinitely. The AFS file system otherwise
>>> seems to be working fine; only access directly under /afs is the
>>> problem. On other Red Hat 6.x systems, accessing /afs is fast and there
>>> are no problems.
>>
>> Are the nsswitch and DNS resolver configurations the same on all systems?
>> Any differences in network restrictions?
>> Does it help to run afsd without -afsdb?
>>
>> Just a wild guess,
>>        Stephan
>>
>>>
>>> I am running afsd with:
>>>
>>> /usr/vice/etc/afsd -dynroot -fakestat-all -afsdb
>>>
>>> Note that I tried -fakestat-all to see if that would help; I had been
>>> running with just -fakestat. Our db servers have AFSDB records.
>>>
>>> I removed all cells except our own from CellServDB, so I only have this:
>>>
>>> % pwd
>>> /afs
>>>
>>> % ls -l
>>> total 4
>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu/
>>> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu/
>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu/
>>>
>>> I reformatted the /usr/vice/cache partition and that did not help.
>>>
>>> I cannot find any hardware problems, and there are no clues in the
>>> syslog or on the console. The system disk, including the cache, is on a
>>> RAID1 mirror. This is a Dell server and I run Dell OpenManage, which is
>>> really good at reporting system and especially disk errors.
>>>
>>> I am running the same afsd version on our remaining RHEL 6.x servers:
>>>
>>> % fs version
>>> openafs 1.6.22.2
>>>
>>> Distributor ID: RedHatEnterpriseWorkstation
>>> Release:        6.10
>>>
>>> The problem is intermittent, but the command goes into device wait
>>> most of the time. For example, the first run below was fine and the
>>> second took 14.96 seconds.
>>>
>>> % time ls -l
>>> total 4
>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
>>> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
>>> 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
>>>
>>> % time ls -l
>>> total 4
>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
>>> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
>>> 0.000u 0.000s 0:14.96 0.0%      0+0k 0+0io 0pf+0w
>>>
>>> Thanks for any help or ideas to try.