[OpenAFS] accessing /afs processes go into device wait

John Sopko sopko@cs.unc.edu
Thu, 8 Nov 2018 12:22:49 -0500


I have been running two legacy Red Hat 6.x web servers for several
years. Over the last few days, the Apache httpd processes on one of
the servers started going into device wait state. The other server is
fine, and both are configured pretty much the same. I tracked this
down to the web server trying to stat /afs/.htaccess. If I run ls in
/afs or cat /afs/.htaccess (which does not exist), the commands first
go into device wait state and then take a long time to complete: it
can take several minutes, or they may hang indefinitely. The AFS file
system itself seems to be working fine; only access directly under
/afs is the problem. On our other Red Hat 6.x systems, accessing /afs
is fast and has no problems.
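
For reference, a minimal sketch of how the processes stuck in device
wait can be listed. It assumes the procps ps shipped with RHEL 6, and
filter_dwait is a hypothetical helper name, not an OpenAFS tool:

```shell
#!/bin/sh
# Keep only processes in uninterruptible sleep ("device wait").
# Expects lines of "PID STAT WCHAN COMMAND" on stdin; the STAT column
# begins with "D" for such processes (for example "D" or "D+").
filter_dwait() {
    awk '$2 ~ /^D/ { print }'
}
```

On a live system this would be used as
`ps -eo pid,stat,wchan:32,comm | filter_dwait`; the WCHAN column shows
the kernel function each process is sleeping in.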

I am running afsd with:

/usr/vice/etc/afsd -dynroot -fakestat-all -afsdb

Note that I added -fakestat-all to see if it would help; I had been
running with just -fakestat. Our db servers have AFSDB records.

I removed all cells except our own from CellServDB, so I only have this:

% pwd
/afs

 % ls -l
total 4
lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu/
drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu/
lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu/

I re-formatted the /usr/vice/cache partition and that did not help.

I cannot find any hardware problems, and there are no clues in the
syslog or on the console. The system disk, including the cache, is on
a RAID 1 mirror. This is a Dell server, and I run Dell OpenManage,
which is really good at reporting system errors, especially disk
errors.

I am running the same afsd version as on our remaining RHEL 6.x servers:

% fs version
openafs 1.6.22.2

Distributor ID: RedHatEnterpriseWorkstation
Release:        6.10

The problem is intermittent, but the commands go into device wait most
of the time. For example, the first run below completed immediately
and the second took 14.96 seconds.

% time ls -l
total 4
lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w

 % time ls -l
total 4
lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
0.000u 0.000s 0:14.96 0.0%      0+0k 0+0io 0pf+0w
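
A rough sketch of how the intermittent delay could be measured over
several attempts instead of by hand. It assumes GNU date (for %N) and
GNU stat; time_stat is a hypothetical helper name, and the loop will
itself block if a stat call hangs indefinitely:

```shell
#!/bin/sh
# Time repeated stat calls on one path and print the elapsed
# wall-clock time of each attempt in milliseconds.
# Usage: time_stat PATH [COUNT]   (COUNT defaults to 5)
time_stat() {
    path=$1
    count=${2:-5}
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s%N)
        # A path that does not exist still exercises the lookup,
        # so the exit status of stat is deliberately ignored.
        stat "$path" >/dev/null 2>&1
        end=$(date +%s%N)
        echo "attempt $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}
```

For example, `time_stat /afs/.htaccess 10` would print ten timings for
the same lookup the web server performs.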

Thanks for any help or ideas to try.

-- 
John W. Sopko Jr.
University of North Carolina
Computer Science Dept CB 3175
Chapel Hill, NC 27599-3175

Fred Brooks Building; Room 140
Computer Services Systems Specialist
email: sopko AT cs.unc.edu
phone: 919-590-6144