[OpenAFS] accessing /afs processes go into device wait

Stephan Wiesand stephan.wiesand@desy.de
Thu, 8 Nov 2018 20:53:54 +0100


My guess is that attempting to retrieve SRV and then AFSDB DNS
records for an "htaccess" top level domain is very slow to fail
on the problematic system for some reason.
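
As a rough way to check that guess by hand: with -dynroot, a leading dot
on a name directly under /afs is normally read as the read-write form of
a cell name, so ".htaccess" would be looked up as a cell called
"htaccess". The sketch below only prints the two lookups that would be
expected. afs_dns_names is a made-up helper, not part of OpenAFS, and it
assumes the RFC 5864 SRV name _afs3-vlserver._udp.<cell>. The trailing
dot stops host from appending search domains, which is also an
assumption about how the cache manager's resolver behaves.

```shell
# Hypothetical helper (not an OpenAFS tool): print the record type and
# DNS name for the SRV lookup and then the AFSDB lookup of a cell.
afs_dns_names() {
    cell="${1#.}"    # strip the leading dot: ".htaccess" -> "htaccess"
    printf 'SRV _afs3-vlserver._udp.%s.\n' "$cell"
    printf 'AFSDB %s.\n' "$cell"
}

# Turn the names into timed lookups for the entry from this thread.
# Nothing is sent to DNS here; the commands are only printed.
afs_dns_names .htaccess | while read -r rtype name; do
    echo "time host -t $rtype $name"
done
```

Running the printed commands on both the good and the bad server should
show whether one of them is much slower to get a negative answer.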

I think it's a known issue which has cropped up in the past for names
like ".trash" as well.

You could probably find out where things get stuck by comparing
tcpdump outputs.
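
A minimal sketch of such a capture, assuming the resolver traffic goes
over port 53, that tcpdump is run as root, and that "any" is accepted as
an interface name (true on Linux). The function name and the default
output path are placeholders.

```shell
# Hypothetical wrapper: record DNS traffic while reproducing the slow
# access. Run it on the good and the bad host, then compare the files.
capture_afs_dns() {
    out="${1:-/tmp/afs-dns.pcap}"           # placeholder output file
    # -n: tcpdump does no name lookups itself; -w: write raw packets
    tcpdump -n -i any -w "$out" port 53 &
    pid=$!
    sleep 1                                 # give tcpdump time to start
    ls /afs/.htaccess                       # the access that stalls on the bad host
    kill "$pid"                             # SIGTERM lets tcpdump flush the file
    wait "$pid"
}
```

After calling capture_afs_dns on each host, "tcpdump -n -r <file>" prints
the recorded queries and replies with timestamps, so the gap before the
failure shows up directly.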

- Stephan

> On 08 Nov 2018, at 20:41, John Sopko <sopko@cs.unc.edu> wrote:
>
> Wow! Removing -afsdb and adding our db servers to the CellServDB seems
> to have fixed the problem. This does not make any sense: this machine
> and others have been running for many years with -afsdb. And
> fs listcells works when -afsdb is used:
>
> % fs listcells
> Cell dynroot on hosts.
> Cell cs.unc.edu on hosts toucan.cs.unc.edu quail.cs.unc.edu kiwi.cs.unc.edu.
>
> % host -t AFSDB cs.unc.edu
> cs.unc.edu has AFSDB record 1 kiwi.cs.unc.edu.
> cs.unc.edu has AFSDB record 1 quail.cs.unc.edu.
> cs.unc.edu has AFSDB record 1 toucan.cs.unc.edu.
>
> Thanks for the help. Is this a known issue?
>
>
> On Thu, Nov 8, 2018 at 1:59 PM Stephan Wiesand <stephan.wiesand@desy.de> wrote:
>>
>> Have you tried w/o -afsdb?
>>
>>> On 08 Nov 2018, at 19:48, John Sopko <sopko@cs.unc.edu> wrote:
>>>
>>> nsswitch and DNS the same, the AFSDB records resolve fine, the
>>> /afs/cs.unc.edu cell works fine, just not /afs.
>>>
>>>
>>> On Thu, Nov 8, 2018 at 12:52 PM Stephan Wiesand <stephan.wiesand@desy.de> wrote:
>>>>
>>>>
>>>>> On 8. Nov 2018, at 18:22, John Sopko <sopko@cs.unc.edu> wrote:
>>>>>
>>>>> I have been running two legacy Red Hat 6.x web servers for several
>>>>> years. Over the last few days the Apache httpd processes on one of
>>>>> the servers started going into device wait state; the other server
>>>>> is fine, and both are configured pretty much the same. I tracked
>>>>> this down to the web server trying to stat /afs/.htaccess. If I run
>>>>> ls in /afs, or cat /afs/.htaccess (which does not exist), the
>>>>> commands first go into device wait state and take a long time to
>>>>> complete: it can take several minutes, or they may hang
>>>>> indefinitely. The AFS file system otherwise seems to be working
>>>>> fine; only access directly under /afs is a problem. On other
>>>>> Red Hat 6.x systems accessing /afs is fast and has no problems.
>>>>
>>>> Are the nsswitch and DNS resolver configurations the same on all
>>>> systems?
>>>> Any differences in network restrictions?
>>>> Does it help to run afsd without -afsdb?
>>>>
>>>> Just a wild guess,
>>>>       Stephan
>>>>
>>>>>
>>>>> I am running afsd with:
>>>>>
>>>>> /usr/vice/etc/afsd -dynroot -fakestat-all -afsdb
>>>>>
>>>>> Note I tried -fakestat-all to see if that would help; I had been
>>>>> running with just -fakestat. Our db servers have AFSDB records.
>>>>>
>>>>> I removed all cells except our own from CellServDB, so I only have
>>>>> this:
>>>>>
>>>>> % pwd
>>>>> /afs
>>>>>
>>>>> % ls -l
>>>>> total 4
>>>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu/
>>>>> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu/
>>>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu/
>>>>>
>>>>> I re-formatted the /usr/vice/cache partition and that did not help.
>>>>>
>>>>> I cannot find any hardware problems, and there are no clues in the
>>>>> syslog or on the console. The system disk, including the cache, is
>>>>> on a RAID1/mirror disk. This is a Dell server and I run Dell
>>>>> OpenManage, which is really good at reporting system and especially
>>>>> disk errors.
>>>>>
>>>>> I am running the same afsd version on our remaining RHEL 6.x servers:
>>>>>
>>>>> % fs version
>>>>> openafs 1.6.22.2
>>>>>
>>>>> Distributor ID: RedHatEnterpriseWorkstation
>>>>> Release:        6.10
>>>>>
>>>>> The problem is intermittent, but commands go into device wait most
>>>>> of the time. For example, the first run below was fine and the
>>>>> second took 14.96 seconds.
>>>>>
>>>>> % time ls -l
>>>>> total 4
>>>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
>>>>> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
>>>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
>>>>> 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
>>>>>
>>>>> % time ls -l
>>>>> total 4
>>>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
>>>>> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
>>>>> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
>>>>> 0.000u 0.000s 0:14.96 0.0%      0+0k 0+0io 0pf+0w
>>>>>
>>>>> Thanks for any help or ideas to try.