[OpenAFS] accessing /afs processes go into device wait

John Sopko sopko@cs.unc.edu
Thu, 8 Nov 2018 13:48:08 -0500


The nsswitch and DNS resolver configurations are the same, the AFSDB
records resolve fine, and the /afs/cs.unc.edu cell works fine; only
/afs itself has the problem.
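
To put numbers on the intermittent delay, something like the following
could be run on both the good and the bad server. This is only a rough
sketch, not something from the OpenAFS tools: the helper name time_stat
and the choice of five runs are my own assumptions, and a call that
hangs in device wait will block the script as well.

```python
import os
import time


def time_stat(path, runs=5):
    """Return the wall-clock seconds each os.stat(path) attempt took."""
    timings = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            os.stat(path)
        except OSError:
            # A missing file (e.g. /afs/.htaccess) still exercises the lookup.
            pass
        timings.append(time.monotonic() - start)
    return timings


if __name__ == "__main__":
    # Paths taken from this thread; adjust for the local cell.
    for target in ("/afs", "/afs/.htaccess", "/afs/cs.unc.edu"):
        print(target, ["%.2f" % t for t in time_stat(target)])
```

Comparing the per-path timings should show whether the slowness is
confined to the dynroot directory or also affects paths inside the cell.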


On Thu, Nov 8, 2018 at 12:52 PM Stephan Wiesand <stephan.wiesand@desy.de> wrote:
>
>
> > On 8. Nov 2018, at 18:22, John Sopko <sopko@cs.unc.edu> wrote:
> >
> > I have been running two legacy Redhat 6.x web servers for several
> > years. The apache httpd processes started to go into device wait state
> > the last few days on one of the servers, the other server is fine,
> > both are configured pretty much the same. I tracked this down to the
> > web server trying to stat /afs/.htaccess. If I try to do an ls in /afs
> > or cat /afs/.htaccess which does not exist, the commands take a long
> > time to complete and first go into device wait state, it can take
> > several minutes or they may hang indefinitely. The afs file system
> > seems to be working fine, just accessing under /afs is the problem. On
> > other Redhat 6.x systems accessing /afs is fast and has no problems.
>
> Are the nsswitch and DNS resolver configurations the same on all systems?
> Any differences in network restrictions?
> Does it help to run afsd without -afsdb?
>
> Just a wild guess,
>         Stephan
>
> >
> > I am running afsd with:
> >
> > /usr/vice/etc/afsd -dynroot -fakestat-all -afsdb
> >
> > Note I tried -fakestat-all to see if that would help; I had been
> > running just -fakestat. Our db servers have AFSDB records.
> >
> > I removed all cells except for our cell in CellServDB so I only have this:
> >
> > % pwd
> > /afs
> >
> > % ls -l
> > total 4
> > lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu/
> > drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu/
> > lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu/
> >
> > I re-formatted the /usr/vice/cache partition and that did not help.
> >
> > I cannot find any hardware problems, no clues in the syslog or on the
> > console, the system disk including the cache is on a raid1/mirror
> > disk. This is a Dell server and I run Dell OpenManage, which is really
> > good at reporting system and especially disk errors.
> >
> > I am running the same afsd version on our remaining rhel 6.x servers:
> >
> > % fs version
> > openafs 1.6.22.2
> >
> > Distributor ID: RedHatEnterpriseWorkstation
> > Release:        6.10
> >
> > The problem is intermittent but goes into device wait most of the
> > time. For example, the first run below was fine and the second
> > took 14.96 seconds.
> >
> > % time ls -l
> > total 4
> > lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
> > drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
> > lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
> > 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
> >
> > % time ls -l
> > total 4
> > lrwxr-xr-x 1 root root   10 Dec 31  1969 cs -> cs.unc.edu
> > drwxr-xr-x 8 root root 2048 Mar  6  2015 cs.unc.edu
> > lrwxr-xr-x 1 root root   10 Dec 31  1969 unc -> cs.unc.edu
> > 0.000u 0.000s 0:14.96 0.0%      0+0k 0+0io 0pf+0w
> >
> > Thanks for any help or ideas to try.
>
> --
> Stephan Wiesand
> DESY -DV-
> Platanenallee 6
> 15738 Zeuthen, Germany
>
>
>


-- 
John W. Sopko Jr.
University of North Carolina
Computer Science Dept CB 3175
Chapel Hill, NC 27599-3175

Fred Brooks Building; Room 140
Computer Services Systems Specialist
email: sopko AT cs.unc.edu
phone: 919-590-6144