[OpenAFS] accessing /afs processes go into device wait

Stephan Wiesand stephan.wiesand@desy.de
Thu, 8 Nov 2018 18:52:54 +0100


> On 8. Nov 2018, at 18:22, John Sopko <sopko@cs.unc.edu> wrote:
> 
> I have been running two legacy Red Hat 6.x web servers for several
> years. Over the last few days the Apache httpd processes on one of the
> servers have started going into device wait state. The other server is
> fine, and both are configured pretty much the same. I tracked this down
> to the web server trying to stat /afs/.htaccess. If I run ls in /afs
> or cat /afs/.htaccess (which does not exist), the commands first go
> into device wait state and take a long time to complete: several
> minutes, or they may hang indefinitely. The AFS file system itself
> seems to be working fine; only access directly under /afs is a problem.
> On our other Red Hat 6.x systems, accessing /afs is fast and trouble-free.

Are the nsswitch and DNS resolver configurations the same on all systems?
Any differences in network restrictions?
Does it help to run afsd without -afsdb?

Just a wild guess,
	Stephan
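
One way to check the resolver question on each host is to time a few name
lookups and compare the numbers between the good and the bad server. The
following is only a minimal sketch: the helper name time_resolution is
hypothetical, and it exercises the system resolver path (nsswitch "hosts"
and resolv.conf) through getaddrinfo(), not the AFSDB query that afsd
itself issues, so a slow result here is a hint rather than a diagnosis.

```python
import socket
import time


def time_resolution(name, port=None):
    """Return (seconds, resolved) for a single getaddrinfo() call.

    Hypothetical helper for illustration; it is not part of OpenAFS.
    """
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, port)
        resolved = True
    except OSError:
        # socket.gaierror is a subclass of OSError; a failed lookup is
        # still interesting here because of how long it took to fail.
        resolved = False
    return time.monotonic() - start, resolved
```

Calling time_resolution("cs.unc.edu") and time_resolution("htaccess") on
both servers should show whether failed lookups return quickly on one host
and only after a long timeout on the other.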

> 
> I am running afsd with:
> 
> /usr/vice/etc/afsd -dynroot -fakestat-all -afsdb
> 
> Note that I tried -fakestat-all to see if that would help; I had been
> running with just -fakestat. Our db servers have AFSDB records.
> 
> I removed all cells except for our cell in CellServDB, so I only have this:
> 
> % pwd
> /afs
> 
> % ls -l
> total 4
> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu/
> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu/
> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu/
> 
> I re-formatted the /usr/vice/cache partition and that did not help.
> 
> I cannot find any hardware problems, and there are no clues in the
> syslog or on the console. The system disk, including the cache, is on a
> RAID1/mirror disk. This is a Dell server and I run Dell OpenManage,
> which is really good at reporting system and especially disk errors.
> 
> I am running the same afsd version on our remaining RHEL 6.x servers:
> 
> % fs version
> openafs 1.6.22.2
> 
> Distributor ID: RedHatEnterpriseWorkstation
> Release:        6.10
> 
> The problem is intermittent, but access goes into device wait most of
> the time. For example, the first run below was fine, while the second
> took 14.96 seconds.
> 
> % time ls -l
> total 4
> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
> 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
> 
> % time ls -l
> total 4
> lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
> drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
> lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
> 0.000u 0.000s 0:14.96 0.0%      0+0k 0+0io 0pf+0w
> 
> Thanks for any help or ideas to try.
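
Because the delay is intermittent, a single "time ls -l" can be misleading.
A small loop that repeats the same lookup and records each duration makes
the pattern easier to see. This is a rough sketch under assumptions: the
function name time_lookups and the default of five attempts are made up for
illustration. As far as I understand it, with -dynroot and -afsdb a name
that is not found directly under /afs can be treated as a possible cell
name and trigger a DNS lookup, which is why a nonexistent path such as
/afs/.htaccess is a useful probe.

```python
import os
import time


def time_lookups(path, attempts=5):
    """Time repeated lstat() calls on `path`.

    Returns a list of (seconds, exists) tuples, one per attempt.
    Hypothetical helper for illustration only.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            os.lstat(path)
            exists = True
        except OSError:
            # A missing file is expected; the elapsed time is what matters.
            exists = False
        results.append((time.monotonic() - start, exists))
    return results
```

Running time_lookups("/afs/.htaccess") on the affected server and on a
healthy one gives directly comparable numbers, and running it again with
afsd started without -afsdb would show whether the DNS lookups are involved.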

-- 
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany