[OpenAFS] Problems with power outages

Karl M. Davis karl@ridgetop-group.com
Wed, 15 Aug 2007 15:59:18 -0700


Since this is a VM, I just took a snapshot and tried to repro by
"powering off" the VM myself.  Here's the terminal log after booting
back up from that:
<<
login as: karl
karl@coronado.ridgetop-group.local's password:
Linux coronado 2.6.20-16-server #2 SMP Thu Jun 7 20:26:23 UTC 2007 i686

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
Last login: Wed Aug 15 15:25:47 2007
karl@coronado:~$ ls /afs
karl@coronado:~$ sudo bos status -server localhost
bos: no such entry (getting tickets)
bos: running unauthenticated
Instance ptserver, currently running normally.
Instance vlserver, currently running normally.
Instance fs, currently running normally.
    Auxiliary status is: salvaging file system.
karl@coronado:~$ sudo bos status -server localhost
bos: no such entry (getting tickets)
bos: running unauthenticated
Instance ptserver, currently running normally.
Instance vlserver, currently running normally.
Instance fs, currently running normally.
    Auxiliary status is: file server running.
karl@coronado:~$ ls /afs
karl@coronado:~$
>>

After noting the status changed, I rebooted (the right way) and this
time, it came back up just fine.  So I can't reproduce this by just
killing the VM.  I'm not really interested in powering off the VM's
host, either (for one, that would have to be done after-hours).
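
The log above shows the fs instance reporting "salvaging file system"
before it reaches "file server running".  A minimal sketch for waiting
out the salvage before testing /afs, assuming the auxiliary status
strings match the ones in that log.  fs_ready and wait_for_fs are
hypothetical helper names, not part of OpenAFS, and the 5-second
interval is arbitrary:

```shell
#!/bin/sh
# fs_ready: hypothetical helper. Succeeds if the captured `bos status`
# text ($1) shows the fs instance has finished salvaging and is serving.
fs_ready() {
    printf '%s\n' "$1" | grep -q 'Auxiliary status is: file server running'
}

# wait_for_fs: hypothetical helper. Polls `bos status` on host $1 until
# fs_ready succeeds. Loops forever if the fileserver never comes up.
wait_for_fs() {
    host=$1
    until fs_ready "$(bos status -server "$host" 2>/dev/null)"; do
        sleep 5
    done
}
```

Usage would be something like "wait_for_fs localhost && ls /afs".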

Anyways, if it happens again, I'll catch the logs and file a report on
the segfault.  In the meantime, I'll go learn how to set dynroot up.

Thanks,
Karl


-----Original Message-----
From: Jeffrey Altman [mailto:jaltman@secure-endpoints.com]
Sent: Wednesday, August 15, 2007 1:28 PM
To: Karl M. Davis
Cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] Problems with power outages

Karl M. Davis wrote:
> Hey there all,
>
> I just recently set up the Debian openafs 1.4.4 packages on an Ubuntu
> server box, running in a virtual machine.  It’s monsoon season here in
> Tucson and we’ve had a couple of long power outages and problems with
> the UPS.  Both times the server has gone down unexpectedly, AFS didn’t
> come back up correctly.  The symptoms I note are that “ls /afs” returns
> empty on the server and the Windows client can’t connect.
>
> For whatever reason, the thing that has fixed it both times is running
> “fs checkvolumes”.  Of course, “fs checkvolumes” segfaults when I run
> it, but if I reboot after that, everything comes back up fine, clients
> can connect, and further “fs checkvolumes” runs don’t segfault.
> Rebooting before running that specific command (with the segfault) does
> nothing; “ls /afs” still returns empty.
>
> So… a couple of questions:
>
> How do I ensure AFS can survive a power outage/unexpected poweroff
> without getting borked?
>
> If it does get borked, why would a segfaulting “fs checkvolumes” fix
> things?

fs checkvolumes doesn't really check anything.  It instructs the AFS
cache manager to invalidate its knowledge of all of the volume location
information, thereby forcing the data to be reloaded from the volume
database servers.

If you are not using dynroot on UNIX or freelance on Windows, and the
file servers are all down or all of the copies of the 'root.afs'
volume are offline when the client starts, the client will be unable to
mount the volume.  In the case of the Windows clients, they will stop
with a panic condition that is logged to %WinDir%\temp\afsd_init.log.
If you file a bug report to openafs-bugs@openafs.org with a stack trace
for the segfault on Linux, someone can attempt to fix that.  My guess is
that it is failing because the volume list is empty or some boundary
condition like that.
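
One possible way to collect such a stack trace is to run the failing
command under gdb in batch mode.  This is only a sketch: it assumes gdb
is installed, capture_backtrace is a hypothetical helper name, and the
output file name is arbitrary.  Without debugging symbols for the fs
binary the trace will show little detail.

```shell
#!/bin/sh
# capture_backtrace: hypothetical helper. Runs the given command under gdb,
# prints a backtrace if it crashes, and saves everything to a text file.
capture_backtrace() {
    # -batch exits when the commands finish; --args passes the rest to the program
    gdb -batch -ex run -ex bt --args "$@" 2>&1 | tee fs-backtrace.txt
}
```

Usage would be "capture_backtrace fs checkvolumes", with the resulting
fs-backtrace.txt attached to the bug report.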

I have no idea how/why "fs checkvolumes" segfaulting would be a
requirement for subsequent access.

Jeffrey Altman