[OpenAFS] 'vos' command does not finish, file service works ok (sort of)

Andreas Hirczy ahi@itp.tugraz.at
Thu, 24 Jul 2008 03:00:41 +0200


Jeffrey Altman <jaltman@secure-endpoints.com> writes:

> The Volserver is trying to establish a connection with the Fileserver
> and it can't.  As a result after five retries it exits with an assertion
> failure.
>
> What is the state of the File Server?  (FileLog)

Thu Jul 24 01:00:26 2008 File server starting
Thu Jul 24 01:00:26 2008 afs_krb_get_lrealm failed, using itp.tugraz.at.
Thu Jul 24 02:25:27 2008 Set Debug On level = 1
Thu Jul 24 02:25:53 2008 [0] Set Debug On level = 5

After a few restarts the verbose mode is of course no longer active and the
log file has been moved away. I have now reactivated more verbose logging, but
apart from the message about increased logging shown above I cannot remember
anything unusual.

Dump again; unlike my latest attempts, there is now some reaction on the
command line:

root@faeppc18:~# vos dump -id user.zuzi -file /tmp/backup -localauth -verbose
Full Dump ...
Starting transaction on volume 536872626...

And that's all after 10 minutes. In my latest full archival backup this dump
file is about 300 MB. The log files have not changed much; in particular,
FileLog stays the same:
<http://itp.tugraz.at/~ahi/openafs/>
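
A dump that hangs like this can at least be bounded with a timeout, so that it
does not block a whole backup run. A minimal Python sketch: the helper name
run_with_timeout and the 600 second limit are assumptions made for
illustration, and only the vos arguments are taken from the command above.

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it was killed on timeout."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when timeout_s seconds pass without the command finishing.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None

# Hypothetical usage with the command line from above (limit is arbitrary):
# rc = run_with_timeout(
#     ["vos", "dump", "-id", "user.zuzi", "-file", "/tmp/backup",
#      "-localauth", "-verbose"],
#     600,
# )
# if rc is None:
#     print("vos dump did not finish within the limit")
```

This only limits how long the client side waits; it does not address why the
volserver cannot reach the fileserver.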

> What version of OpenAFS are you using?

1.4.7.dfsg1-1, a backport of Sam Hartman's Debian packages (unstable) to
Debian stable. The kernel is a custom Linux 2.6.25.11.

There are 3 DB servers and 3 file servers, all running this version. This is
the only machine acting as both a file and a DB server. The DB server binds to
a virtual ethernet socket provided by fake (ARP poisoning). This has worked
with very few problems for about a year now, but I had to reboot because we
had a scheduled power outage last Friday and another yesterday (Wednesday).

Another file server (on a big UPS, so not rebooted, but also running 1.4.7) is
much more verbose after startup:

Sun Jul 20 04:00:11 2008 File server starting
Sun Jul 20 04:00:11 2008 afs_krb_get_lrealm failed, using itp.tugraz.at.
Sun Jul 20 04:00:49 2008 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Sun Jul 20 04:00:49 2008 Set thread id 11 for FSYNC_sync
Sun Jul 20 04:00:49 2008 FSYNC_sync: bind failed with (98), removed bogus /var/lib/openafs/local/fssync.sock
Sun Jul 20 04:00:49 2008 Partition /vicepa: attaching volumes
Sun Jul 20 04:01:15 2008 Partition /vicepa: attached 362 volumes; 0 volumes not attached
Sun Jul 20 04:01:15 2008 Getting FileServer name...
Sun Jul 20 04:01:15 2008 FileServer host name is 'faepsv07'
Sun Jul 20 04:01:15 2008 Getting FileServer address...
Sun Jul 20 04:01:15 2008 FileServer faepsv07 has address 129.27.161.111 (0x6fa11b81 or 0x811ba16f in host byte order)
Sun Jul 20 04:01:15 2008 File Server started Sun Jul 20 04:01:15 2008
Sun Jul 20 04:01:15 2008 Set thread id 15 for 'FiveMinuteCheckLWP'
Sun Jul 20 04:01:15 2008 Set thread id 16 for 'HostCheckLWP'
Sun Jul 20 04:01:15 2008 Set thread id 17 for 'FsyncCheckLWP'
Sun Jul 20 20:04:29 2008 CB: ProbeUuid for 78.104.3.214:51209 failed -01
Sun Jul 20 20:08:56 2008 CB: ProbeUuid for 78.104.3.214:51227 failed -01
.....
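
When comparing addresses across such logs, it helps to reproduce the two hex
forms that the "has address" line prints. A minimal Python sketch, not part of
OpenAFS; the function name hex_forms is an assumption made for illustration:

```python
import socket
import struct

def hex_forms(dotted):
    """Return (byte-swapped, network-order) hex strings for a dotted-quad IPv4 address."""
    packed = socket.inet_aton(dotted)              # 4 bytes in network order
    network = struct.unpack("!I", packed)[0]       # read as big-endian
    swapped = struct.unpack("<I", packed)[0]       # same bytes read as little-endian
    return "0x%08x" % swapped, "0x%08x" % network

print(hex_forms("129.27.161.111"))
# → ('0x6fa11b81', '0x811ba16f')
```

The two values match the pair shown in the FileLog line for faepsv07 above.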

I now really suspect those problems stem from the file server and db server
listening on different IP addresses on the same machine.

Thanks for caring!
Andreas
-- 
Andreas Hirczy <ahi@itp.tugraz.at>                   http://itp.tugraz.at/~ahi/
Graz University of Technology                        phone: +43/316/873-   8190
Institute of Theoretical and Computational Physics     fax: +43/316/873-10 8190
Petersgasse 16, A-8010 Graz                         mobile: +43/664/859 23 57