[OpenAFS] solaris 10 versions supporting inode fileservers

Hartmut Reuter reuter@rzg.mpg.de
Wed, 13 May 2009 18:48:45 +0200

David R Boldt wrote:
> We use Solaris 10 SPARC exclusively for our AFS servers.
> After upgrading to 1.4.10 from 1.4.8 we had a very few
> volumes that started spontaneously going off-line, recovering,
> and then going off-line again until they needed to be salvaged.
> Hearing that this might be related to inode, we moved these
> volumes to a set of little-used fileservers that were running
> namei at 1.4.10. It made no discernible difference.
> Two volumes in particular accounted for >90% of our off-line
> volume issues.
> FileLog:
> Mon Apr 27 10:56:09 2009 Volume 2023867468 now offline, must be salvaged.
> Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.
> Mon Apr 27 10:56:15 2009 Volume 2023867468 now offline, must be salvaged.
> Mon Apr 27 10:56:22 2009 fssync: volume 2023867469 restored; breaking
> all call backs
> (restored vol above being R/O for R/W in need of salvage)

That's interesting: I saw similar behavior on some of our volumes,
however with AFS/OSD fileservers. I made the ViceLog messages more
verbose and found that this always happened when IH_OPEN failed.
IH_OPEN can fail when the ihandle in the vnode is missing. To prevent
that I added some lines to VGetVnode_r: when an already existing vnode
structure is found, check whether the handle is in place and, if not,
do a new IH_INIT (and write a message to the log). I found about 100
such cases per day in our cell, but not all of them would have ended
with the volume being taken off-line, because in many cases the handle
would never have been used (all the GetStatus RPCs). Since then I have
never again seen volumes going off-line.

> Both of the volumes most frequently impacted have content
> completely rewritten roughly every 20 minutes while being on
> an automated replication schedule of 15 minutes. One of them
> 25MB, the other 95MB, both at about 80% quota.
> We downgraded just the fileserver binary to 1.4.8 on all of
> our servers and have not seen a single off-line message in
> 36 hours.
>                                         -- David Boldt
>                                         <dboldt@usgs.gov>

Hartmut Reuter                  e-mail  reuter@rzg.mpg.de
                                phone   +49-89-3299-1328
                                fax     +49-89-3299-1301
RZG (Rechenzentrum Garching)    web     http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)