[OpenAFS] Problem with openafs-1.3.81 on kernel

Dr A V Le Blanc <LeBlanc@mcc.ac.uk>
Thu, 14 Apr 2005 09:04:52 +0100

I reported on March 8 that I had a serious problem with openafs-1.3.79,
but I was unable to give details because the machine got into such a
state that it had to be rebooted.

A similar problem has now surfaced on a machine running openafs-1.3.81
with kernel, and this time the machine is still accessible,
though it has serious problems.  The kernel module was compiled from
the openafs-modules-source package, version 1.3.81-3, and the kernel
was a standard kernel with the hidden arp patch applied, compiled
with gcc 3.3.5 (Debian gcc 1:3.3.5-8).  The system is Debian sarge,
patched up to date.  The openafs-client package is also version 1.3.81-3.

Our web service is provided using apache with over 300 virtual servers;
the service is provided by 5 or 6 'back end' machines to which incoming
requests are directed via Linux virtual server software using ipvs.
I compiled the kernel and openafs, installed them, and rebooted the
machine yesterday at 8:03.  After initial testing, I started apache
at 11:58.  Before I stopped it, there were 92402 requests.

At 13:14 problems began to appear in the logs with the message

     (5)Input/output error: access to [some file] failed

There were 3043 such errors in the log files before I stopped
apache at 15:10.  About 13:45 errors like this one started to appear
in the main error file:

     [notice] child pid 4601 exit signal Bus error (7)

There are 352 of these, and I suspect they are related to the
afs problem.  In the system logs I start finding messages at 13:20
like this one:

     kernel: afs: failed to store file (partition full)

There are 26 of these.  In fact the /var/cache/openafs partition
is full; it is a 500 MB partition, and the cache size is set to
300 MB in the cacheinfo file, so this is certainly contributing to
the problem.  The dmesg output contains the 'failed to store file'
messages as well as eight like this:

     AFS_VMA_CLOSE(8072): Skipping Already locked vcp=de257f38 vmap=de257f48

No other unusual messages are in the syslog.
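
Counts like the ones above can be reproduced with a short script.  What
follows is only a minimal Python sketch: the regular expressions are built
from the four messages quoted in this post, while the labels (io_error,
bus_error, store_failed, vma_close) and the function name tally are
made up for illustration.  The log file paths are not given here, so they
are taken from the command line.

```python
import re
import sys
from collections import Counter

# Patterns derived from the messages quoted in this post.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "io_error": re.compile(r"\(5\)Input/output error: access to .* failed"),
    "bus_error": re.compile(r"child pid \d+ exit signal Bus error \(7\)"),
    "store_failed": re.compile(r"afs: failed to store file \(partition full\)"),
    "vma_close": re.compile(r"AFS_VMA_CLOSE\(\d+\): Skipping Already locked"),
}


def tally(lines):
    """Count how many lines match each pattern (first match wins)."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break
    return counts


if __name__ == "__main__":
    # Usage: python tally.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            print(path, dict(tally(fh)))
```

Running it over the apache error logs and the syslog separately keeps the
apache-side and kernel-side counts apart.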

After I shut apache off, I left the system up, and it has now been
doing nothing for 18 hours.  The cache partition is still 167% full,
but 'fs getca' reports

     AFS using 277938 of the cache's available 300000 1K byte blocks.

/afs is still accessible, or at least partially:

     # cd ftp
     -bash: cd: ftp: Input/output error
     # cd common/
     # ls
     WWW    admwork  etc       info   lynx     passwd  terminfo  texmf
     admin  emacs    examples  local  ncurses  rsync   tex       zope
     # cd etc
     -bash: cd: etc: Input/output error

In other words, directories already in the cache are readable, but
others are not.  The cache is on its own partition, which is ext2.
The problems I see are these:

First, the cache is filling far beyond the configured limit.
Second, 'fs getca' ought to be seeing this and isn't.  Third, the
cache is not being cleared, even after hours of inactivity.
I'm a little puzzled by the numbers as well; on a 2.4.30 machine
with openafs 1.2.13-1, I get this:

     # df /var/cache/openafs
     Filesystem           1K-blocks      Used Available Use% Mounted on
     /dev/hda6               489992    279729    184963  61% /var/cache/openafs
     # du -s /var/cache/openafs
     279729  /var/cache/openafs

but on the system I see this:

     # df /var/cache/openafs
     Filesystem           1K-blocks      Used Available Use% Mounted on
     /dev/sda6               521748    521748         0 100% /var/cache/openafs
     # du -s /var/cache/openafs
     832908  /var/cache/openafs

which suggests file system corruption.  I did not re-run mkfs on the
partition before rebooting; perhaps I should try this?
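
To put the figures from the failing machine side by side, here is a small
Python sketch of the arithmetic.  All numbers are copied from the df, du
and 'fs getca' output quoted above (1K blocks); the constant names and the
helper percent are just labels chosen for this example.

```python
# Figures copied from the output quoted in this post (units: 1K blocks).
CACHE_LIMIT = 300000    # cache size configured in cacheinfo
AFS_REPORTED = 277938   # blocks in use according to 'fs getca'
DF_TOTAL = 521748       # partition size according to df
DF_USED = 521748        # used blocks according to df
DU_USED = 832908        # 'du -s' on the cache directory


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


print("fs getca vs. configured limit: %.1f%%" % percent(AFS_REPORTED, CACHE_LIMIT))
print("df used vs. partition size:    %.1f%%" % percent(DF_USED, DF_TOTAL))
print("du vs. partition size:         %.1f%%" % percent(DU_USED, DF_TOTAL))
print("du vs. configured limit:       %.1f%%" % percent(DU_USED, CACHE_LIMIT))
print("du beyond partition size:      %d blocks" % (DU_USED - DF_TOTAL))
print("du beyond 'fs getca' figure:   %d blocks" % (DU_USED - AFS_REPORTED))
```

The point of the comparison is that du reports more blocks than the
partition holds according to df, while the cache manager's own count is
still below the configured limit.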

As you will understand, this system was under very heavy load for
about 3 hours.  Is anyone else seeing problems with openafs 1.3.81
under similar loads?

     -- Owen