[OpenAFS] openafs-1.0.1 on sun4x_57, ate my volume

Nathan Rawling nrawling@firedrake.net
Tue, 01 May 2001 12:39:03 -0400


My apologies in advance for the length of this message. I find that when
I'm trying to sort out what happened, skipping details just prolongs the
wait for diagnosis.

I had a new and disturbing problem with my AFS cell today. I've been running
openafs-1.0.1 on a number of Solaris 7 machines (Generic_106541-12) and
overall, I've had a very good experience.

This morning, I had my first bad break. I have my home directory stored in
AFS, and I attempted to download a file in Netscape (written to my homedir)
and compose an email message (creating a temporary file in my home directory).

Suddenly, all access attempts to my home directory started to block on a
number of different client machines. This worried me, naturally, but I had
seen similar behavior periodically in my past experience with Transarc AFS.

I waited a few minutes, and when the problem did not immediately go away, I 
started investigating the status of my fileservers. I currently have three
DB/File servers in my cell, and bos status on two of the three reported core
files from the volserver. That didn't particularly interest me, since Transarc
AFS used to drop a core file every once in a while anyhow.

So I decided to try restarting the fileserver. This is a development cell, so
it wasn't a big deal. I am basically the only user using it. So I logged into
the fileserver machine, got a token, and bos restarted the fs instance.

The command hung for a long while. I examined the logs, which said that the
shutdown process had started and that it was waiting to close out a few volume
accesses before it stopped completely and bos started it again.

At this point, I shrugged my shoulders and went to lunch.

I returned from lunch and discovered that things seemed to have stabilized. My
home directory was available. A brief examination of the FileLog suggested that
one volume couldn't be attached.

So I then checked the SalvageLog, which had a bunch of stuff I'd never seen
before. I believe it was the log of the salvager trying to repair a couple of
corrupted volumes.

I then noticed that about half of my home directory was missing. I tried to
copy the files from the backup volume, only to discover that the backup volume
was the one volume that didn't attach.

At this point, I tried to salvage that volume:

root@afsdev03:afs/logs# /usr/afs/bin/salvager -partition /vicepa -volumeid 536870932
root@afsdev03:afs/logs# more SalvageLog
@(#)CML not accessible: No version Information
05/01/2001 16:13:41 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager -partition /vicepa -volumeid 536870932)
05/01/2001 16:13:51 Scanning inodes on device /dev/md/rdsk/d7...
05/01/2001 16:13:56 No applicable vice inodes on d7; not salvaged
Temporary file /vicepa/salvage.inodes.d7.2287 is missing...


So at this point, my home directory is trashed beyond my ability to
rescue it. For reasons that I won't go into here, we don't have tape backups
of the cell. If it proves impossible to recover the backup volume, I will
just have to write off my losses and start over.

I would appreciate any help or advice (aside from getting some tape backups =)
that anyone can give. As irritating as it is, I can afford to lose my homedir
without much trouble. However, I really worry about what volume could be 
next.

I have tar'd and gzip'd the /usr/afs/logs directory, and made the contents
available at:

http://www.firedrake.net/~nrawling/afs/afscrash20010501.tar.gz

Thanks in advance.

Nathan