[OpenAFS] Crazy DAFS problem (with log)

Ryan C. Underwood nemesis@icequake.net
Sun, 20 Mar 2011 16:50:16 -0500


Shortly after the weekly scheduled fileserver restart, things blew up in
a big way.  My RW root.cell was inaccessible in the end.  No kernel
messages indicating filesystem or disk problems underneath.  I force-fscked
the vice partition (ext4) and no problems were found.  Any speculation
on what in the world happened here?  This is running -pre3 with the
patch from Andrew mentioned earlier.


Sun Mar 20 04:00:29 2011 File server starting (/usr/lib/openafs/dafileserver -p 123 -pctspare 20 -L -busyat 50 -rxpck 2000 -rxbind -cb 4000000 -vattachpar 128 -vlruthresh 1440 -vlrumax 8 -vhashsize 11)
Sun Mar 20 04:00:29 2011 afs_krb_get_lrealm failed, using icequake.net.
Sun Mar 20 04:00:30 2011 VLRU: starting scanner with the following configuration parameters:
Sun Mar 20 04:00:30 2011 VLRU:  offlining volumes after minimum of 86400 seconds of inactivity
Sun Mar 20 04:00:30 2011 VLRU:  running VLRU soft detach pass every 120 seconds
Sun Mar 20 04:00:30 2011 VLRU:  taking up to 8 volumes offline per pass
Sun Mar 20 04:00:30 2011 VLRU:  scanning generation 0 for inactive volumes every 10800 seconds
Sun Mar 20 04:00:30 2011 VLRU:  scanning for promotion/demotion between generations 0 and 1 every 172800 seconds
Sun Mar 20 04:00:30 2011 VLRU:  scanning for promotion/demotion between generations 1 and 2 every 345600 seconds
Sun Mar 20 04:00:30 2011 Set thread id 3 for FSYNC_sync
Sun Mar 20 04:00:30 2011 VInitVolumePackage: beginning parallel fileserver startup
Sun Mar 20 04:00:30 2011 VInitVolumePackage: using 1 threads to pre-attach volumes on 1 partitions
Sun Mar 20 04:00:30 2011 Scanning partitions on thread 1 of 1
Sun Mar 20 04:00:30 2011 Partition /vicepa: pre-attaching volumes
Sun Mar 20 04:00:30 2011 fs_stateRestore: commencing fileserver state restore
Sun Mar 20 04:00:30 2011 Partition scan thread 1 of 1 ended
Sun Mar 20 04:00:30 2011 fs_stateRestore: host table restored
Sun Mar 20 04:00:30 2011 fs_stateRestore: FileEntry and CallBack tables restored
Sun Mar 20 04:00:30 2011 fs_stateRestore: host table indices remapped
Sun Mar 20 04:00:30 2011 fs_stateRestore: FileEntry and CallBack indices remapped
Sun Mar 20 04:00:30 2011 fs_stateRestore: restore phase complete
Sun Mar 20 04:00:30 2011 fs_stateRestore: beginning state verification phase
Sun Mar 20 04:00:30 2011 fs_stateRestore: fileserver state verification complete
Sun Mar 20 04:00:30 2011 fs_stateRestore: restore was successful
Sun Mar 20 04:00:30 2011 Getting FileServer name...
Sun Mar 20 04:00:30 2011 Set thread id 0000007E for 'FiveMinuteCheckLWP'
Sun Mar 20 04:00:30 2011 FileServer host name is 'valhalla'
Sun Mar 20 04:00:30 2011 Set thread id 00000082 for 'FsyncCheckLWP'
Sun Mar 20 04:00:30 2011 Set thread id 0000007F for 'HostCheckLWP'
Sun Mar 20 04:00:30 2011 Getting FileServer address...
Sun Mar 20 04:00:30 2011 FileServer valhalla has address 10.0.1.232 (0xe801000a or 0xa0001e8 in host byte order)
Sun Mar 20 04:00:30 2011 File Server started Sun Mar 20 04:00:30 2011
Sun Mar 20 04:02:02 2011 Scheduling salvage for volume 536870915 on part /vicepa over SALVSYNC
Sun Mar 20 04:02:02 2011 VVGC_scan_partition: finished scanning /vicepa: 147 volumes in 74 groups
Sun Mar 20 04:02:03 2011 fssync: breaking all call backs for volume 536870915
Sun Mar 20 04:02:05 2011 Scheduling salvage for volume 536871273 on part /vicepa over SALVSYNC
Sun Mar 20 04:02:05 2011 fssync: breaking all call backs for volume 536871273
Sun Mar 20 04:04:01 2011 VAttachVolume: Error reading diskDataHandle header for vol 536870916; error=101
Sun Mar 20 04:04:01 2011 Scheduling salvage for volume 536870916 on part /vicepa over SALVSYNC
Sun Mar 20 04:04:01 2011 warning: volume 536870916 recursively checked out by programType id 4
Sun Mar 20 04:04:13 2011 Scheduling salvage for volume 536871274 on part /vicepa over SALVSYNC
Sun Mar 20 04:04:13 2011 warning: volume 536871274 recursively checked out by programType id 4
Sun Mar 20 04:06:02 2011 fssync: breaking all call backs for volume 536870916
Sun Mar 20 04:10:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:12:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:14:03 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:14:14 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
Sun Mar 20 04:14:14 2011 CopyOnWrite failed: requesting salvage
Sun Mar 20 04:14:14 2011 Scheduling salvage for volume 536871273 on part /vicepa over SALVSYNC
Sun Mar 20 04:14:14 2011 fssync: breaking all call backs for volume 536871273
Sun Mar 20 04:16:02 2011 Scheduling salvage for volume 536871274 on part /vicepa over SALVSYNC
Sun Mar 20 04:18:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:20:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:20:04 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
Sun Mar 20 04:20:04 2011 CopyOnWrite failed: requesting salvage
Sun Mar 20 04:20:04 2011 Scheduling salvage for volume 536871273 on part /vicepa over SALVSYNC
Sun Mar 20 04:20:04 2011 fssync: breaking all call backs for volume 536871273
Sun Mar 20 04:20:07 2011 Scheduling salvage for volume 536871274 on part /vicepa over SALVSYNC
Sun Mar 20 04:24:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:26:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:26:03 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
Sun Mar 20 04:26:03 2011 CopyOnWrite failed: requesting salvage
Sun Mar 20 04:26:03 2011 Scheduling salvage for volume 536871273 on part /vicepa over SALVSYNC
Sun Mar 20 04:26:03 2011 fssync: breaking all call backs for volume 536871273
Sun Mar 20 04:26:08 2011 Scheduling salvage for volume 536871274 on part /vicepa over SALVSYNC
Sun Mar 20 04:30:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:32:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:32:54 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
Sun Mar 20 04:32:54 2011 CopyOnWrite failed: requesting salvage
Sun Mar 20 04:32:54 2011 Scheduling salvage for volume 536871273 on part /vicepa over SALVSYNC
Sun Mar 20 04:32:54 2011 fssync: breaking all call backs for volume 536871273
Sun Mar 20 04:34:02 2011 fssync: breaking all call backs for volume 536870916
Sun Mar 20 04:34:10 2011 Scheduling salvage for volume 536871274 on part /vicepa over SALVSYNC
Sun Mar 20 04:35:51 2011 CopyOnWrite failed: volume 536870915 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
Sun Mar 20 04:35:51 2011 CopyOnWrite failed: requesting salvage
Sun Mar 20 04:35:51 2011 Scheduling salvage for volume 536870915 on part /vicepa over SALVSYNC
Sun Mar 20 04:35:51 2011 fssync: breaking all call backs for volume 536870915
Sun Mar 20 04:36:01 2011 VAttachVolume: Error reading diskDataHandle header for vol 536870916; error=101
Sun Mar 20 04:36:01 2011 Scheduling salvage for volume 536870916 on part /vicepa over SALVSYNC
Sun Mar 20 04:36:09 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:38:02 2011 fssync: breaking all call backs for volume 536870916
Sun Mar 20 04:38:12 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:38:13 2011 CopyOnWrite failed: volume 536871273 in partition /vicepa  (tried reading 8192, read 0, wrote 0, errno 0) volume needs salvage
Sun Mar 20 04:38:13 2011 CopyOnWrite failed: requesting salvage
Sun Mar 20 04:38:13 2011 Scheduling salvage for volume 536871273 on part /vicepa over SALVSYNC
Sun Mar 20 04:38:13 2011 fssync: breaking all call backs for volume 536871273
Sun Mar 20 04:38:18 2011 Scheduling salvage for volume 536871274 on part /vicepa over SALVSYNC
Sun Mar 20 04:42:02 2011 fssync: breaking all call backs for volume 536871274
Sun Mar 20 04:44:02 2011 fssync: breaking all call backs for volume 536870916
[..]
Sun Mar 20 05:46:16 2011 VRequestSalvage: volume 536871273 online salvaged too many times; forced offline.
Sun Mar 20 05:48:02 2011 VRequestSalvage: volume 536871273 online salvaged too many times; forced offline.
Sun Mar 20 05:48:02 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871273
Sun Mar 20 05:48:05 2011 VRequestSalvage: volume 536871273 online salvaged too many times; forced offline.
Sun Mar 20 05:48:05 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871273
Sun Mar 20 05:48:08 2011 VRequestSalvage: volume 536871273 online salvaged too many times; forced offline.
Sun Mar 20 05:48:08 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871273
[..]
Sun Mar 20 16:32:17 2011 VRequestSalvage: volume 536871273 online salvaged too many times; forced offline.
Sun Mar 20 16:32:17 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871273
Sun Mar 20 16:32:17 2011 Shutting down file server at Sun Mar 20 16:32:17 2011
Sun Mar 20 16:32:17 2011 Vice was last started at Sun Mar 20 04:00:30 2011

Sun Mar 20 16:32:17 2011 Volume hash summary: 2048 buckets
Sun Mar 20 16:32:17 2011  chain length : min=0, max=1, avg=0, total=147
Sun Mar 20 16:32:17 2011  looks : min=0, max=19189, avg=95, total=195480
Sun Mar 20 16:32:17 2011  gets : min=0, max=19190, avg=95, total=195627
Sun Mar 20 16:32:17 2011  reorders : min=0, max=0, avg=0, total=0
Sun Mar 20 16:32:17 2011 Partition /vicepa has 147 online volumes
Sun Mar 20 16:32:17 2011 VLRU State Dump:

Sun Mar 20 16:32:17 2011 	VLRU_QUEUE_NEW:
Sun Mar 20 16:32:17 2011 		536871132, 536870916, 536871131, 536870913, 536871279,
Sun Mar 20 16:32:17 2011 		536871218, 536871107, 536870952, 536871119, 536871098,
Sun Mar 20 16:32:17 2011 		536871101, 536871143, 536871110, 536871134, 536871152,
Sun Mar 20 16:32:17 2011 		536871164, 536871062, 536871155, 536871113, 536871065,
Sun Mar 20 16:32:17 2011 		536871122, 536871149, 536871140, 536871162, 536870970,
Sun Mar 20 16:32:17 2011 		536871183, 536870967, 536870966, 536870957, 536871212,
Sun Mar 20 16:32:17 2011 		536870951, 536871182, 536870960, 536870969, 536870954,
Sun Mar 20 16:32:17 2011 		536871161, 536870981, 0, 0, 0
Sun Mar 20 16:32:17 2011 
Sun Mar 20 16:32:17 2011 	VLRU_QUEUE_MID:
Sun Mar 20 16:32:17 2011 
Sun Mar 20 16:32:17 2011 	VLRU_QUEUE_OLD:
Sun Mar 20 16:32:17 2011 
Sun Mar 20 16:32:17 2011 	VLRU_QUEUE_CANDIDATE:
Sun Mar 20 16:32:17 2011 
Sun Mar 20 16:32:17 2011 	VLRU_QUEUE_HELD:
Sun Mar 20 16:32:17 2011 
Sun Mar 20 16:32:17 2011 Large vnode cache, 600 entries, 38 allocs, 38392 gets (2434 reads), 171 writes
Sun Mar 20 16:32:17 2011 Small vnode cache,600 entries, 9 allocs, 47958 gets (22008 reads), 1339 writes
Sun Mar 20 16:32:17 2011 Volume header cache, 600 entries, 78165 gets, 442 replacements
Sun Mar 20 16:32:17 2011 Partition /vicepa: 519115596 available 1K blocks (minfree=27780708), 216780036 free blocks
Sun Mar 20 16:32:17 2011 With 120 directory buffers; 1296 reads resulted in 13 read I/Os
Sun Mar 20 16:32:17 2011 Total Client entries = 1083, blocks = 15; Host entries = 9, blocks = 1
Sun Mar 20 16:32:17 2011 There are 1083 connections, process size 161040
Sun Mar 20 16:32:17 2011 There are 8 workstations, 4 are active (req in < 15 mins), 0 marked "down"
Sun Mar 20 16:32:17 2011 CopyOnWrite: calls 194 off0 80 size0 80 maxsize 0xf00000
Sun Mar 20 16:32:17 2011 VShutdown:  shutting down on-line volumes on 1 partition...
Sun Mar 20 16:32:17 2011 VShutdown: partition /vicepa has 37 volumes with attached headers
Sun Mar 20 16:32:17 2011 VShutdown:  beginning parallel fileserver shutdown
Sun Mar 20 16:32:17 2011 VShutdown:  using 128 threads to offline volumes on 1 partition
Sun Mar 20 16:32:17 2011 SYNC_getCom:  error receiving command
Sun Mar 20 16:32:17 2011 FSYNC_com:  read failed; dropping connection (cnt=156067)
Sun Mar 20 16:32:17 2011 VShutdown:  pass 0 completed using the 1 thread per partition algorithm
Sun Mar 20 16:32:17 2011 VShutdown:  starting passes 1 through 3 using finely-granular mp-fast algorithm
Sun Mar 20 16:32:17 2011 ShutdownController:  schedule version=1, vol_remaining=37, pass=1
Sun Mar 20 16:32:17 2011 ShutdownController:  n_threads_complete=0, n_parts_done_pass=0
Sun Mar 20 16:32:17 2011 ShutdownController:  part[0] : (len=37, thread_target=128, done_pass=0, pass_head=097C2170)
Sun Mar 20 16:32:17 2011 SYNC_getCom:  error receiving command
Sun Mar 20 16:32:17 2011 FSYNC_com:  read failed; dropping connection (cnt=156068)
Sun Mar 20 16:32:17 2011 VShutdown:  pass 1 completed using 128 threads on 1 partitions
Sun Mar 20 16:32:17 2011 VShutdown:  pass 2 completed using 128 threads on 1 partitions
Sun Mar 20 16:32:17 2011 VShutdown:  done shutting down volumes on partition /vicepa.
Sun Mar 20 16:32:17 2011 VShutdown:  pass 3 completed using 128 threads on 1 partitions
Sun Mar 20 16:32:18 2011 VShutdown:  /vicepa stats : (pass[0]=110, pass[1]=1, pass[2]=36, pass[3]=0)
Sun Mar 20 16:32:18 2011 VShutdown:  shutdown finished using 128 threads
Sun Mar 20 16:32:18 2011 VShutdown:  complete.
Sun Mar 20 16:32:18 2011 fs_stateSave: commencing fileserver state dump
Sun Mar 20 16:32:18 2011 fs_stateSave: performing internal consistency checks before proceeding with state dump
Sun Mar 20 16:32:18 2011 h_stateVerifyUuidHash: warning: uuid hash entry points to different host struct (7, 5)
Sun Mar 20 16:32:18 2011 h_stateVerifyUuidHash: warning: uuid hash entry points to different host struct (6, 5)
Sun Mar 20 16:32:18 2011 fs_stateSave: proceeding with dump
Sun Mar 20 16:32:18 2011 h_stateSave:  hostCount=9
Sun Mar 20 16:32:18 2011 fs_stateSave: fileserver state dump completed successfully
Sun Mar 20 16:32:18 2011 File server has terminated normally at Sun Mar 20 16:32:18 2011