[OpenAFS] AFS namei file servers, SAN, any issues elsewhere? We've had some. Can AFS _cause_ SAN issues?

Thu, 13 Mar 2008 13:45:15 -0600

We're using Hitachi USP and Hitachi 9585 SAN devices, and have had a 
series of incidents that, after two years of success, significantly 
affected AFS reliability for a period of six months.

I'm wondering if anyone else has had any issues using SANs for vice 
partitions.

Also, to make a long story short, I've been asked by my management to 
determine whether AFS itself can cause SANs to misbehave.

I can't see how, but I committed to getting additional opinions.

Please opine!

Any experience, good or bad, with AFS impact from using SANs for namei 
vicep's would be very helpful.
Any theories about how AFS could confuse a SAN would also be very helpful.

Thanks.

Kim

(Below I've added some detail about the SAN/AFS interaction I've seen, 
for those who are interested.)

========================================
For the record, here's what I've been experiencing.  The worst of it, 
as detailed below, was the impact on the creation of move and release 
clones, but not backup clones.

AFS IMPACT

We were running 1.4.1 with some patches.  (Upgrading to 1.4.6 has been 
part of what is, thus far, a definitive fix for the 9585 issues.)

The worst of the six-month stretch occurred when the primary and 
secondary controller roles (9585 only thus far) were reversed as a 
consequence of SAN fabric rebuilds.  For whatever reason, the time 
required to create volume clones for AFS 'vos release' and 'vos move' 
(using 'vos status' to audit clone time) increased from a typical 
several seconds to minutes, then ten minutes, and in one case four 
hours.  The RW volume is of course unwritable during the clone operation.

'vos remove' times on afflicted partitions were also affected, with 
increased time required to remove a volume.

I don't know why the creation of .backup clones was not similarly 
affected.  For a given volume the create time/refresh time for a move 
clone or release clone might have been fifteen minutes, while the 
.backup clone created quickly and took only slightly longer than usual.

With 'vos move' out of the picture, I moved volumes that were not 
frequently or recently updated with plain dump/restore.  For the 
others I used dump/restore followed by a synchronization tool, Unison, 
to create a new RW volume, then changed the mount point to point to 
the name of the new volume, then waited until the previous RW volume 
no longer showed any updates for a few days.
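
For anyone who wants to try the same approach, here is a rough sketch 
of the dump/restore-to-new-name sequence.  It only builds the command 
lines and executes nothing.  The helper name and the example volume, 
server, partition and path names are hypothetical; the Unison sync 
step and the wait before removing the old volume are left out.  
Note that if the parent volume is replicated, the mount point change 
also needs a 'vos release' of the parent, which the sketch does not do.

```python
import shlex


def migration_commands(volume, new_volume, server, partition, mount_dir):
    """Return shell commands for a dump/restore-to-new-name move, in order.

    Nothing is executed here; the intent is to review the commands and
    run them by hand. All argument values are supplied by the caller.
    """
    q = shlex.quote
    return [
        # Full dump of the source volume to stdout, restored under a new name.
        f"vos dump -id {q(volume)} | "
        f"vos restore -server {q(server)} -partition {q(partition)} "
        f"-name {q(new_volume)}",
        # Repoint the mount point at the new volume.
        f"fs rmmount -dir {q(mount_dir)}",
        f"fs mkmount -dir {q(mount_dir)} -vol {q(new_volume)}",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in migration_commands(
        "user.kim", "user.kim.new", "fs1.example.com", "/vicepb",
        "/afs/example.com/user/kim",
    ):
        print(cmd)
```

The old volume is deliberately left in place, so it can still be 
checked for late updates before it is removed.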

(If anyone is interested in Unison let me know.  I'm thinking of talking 
about it at Best Practices this year.)

The USP continues to spew SCSI command timeouts.

I tried dump|restore -overwrite -- which turned up interesting 
behavior.  The restore didn't update the VLDB entry until after the 
remove of the 'overwritten' volume.  Since deleteVolume was taking a 
very long time on affected vice partitions, I stopped using 
dump|restore -overwrite on frequently changed volumes and used 
dump|restore-to-newname|change mount points instead.

(This behavior of 'vos restore' may not be true of 1.4.6, as I suspect 
the behavior may have been related to the single threading of the 
volserver, which was fixed in 1.4.6.)

I had always thought that the code to create clones of volumes was 
shared, and I don't have a good explanation for the .backup creation 
differing from the move and release clone creation.  I haven't looked 
to see whether the .backup code is separate.  Could it simply have been 
that the creation of a .backup volume is likely to be an incremental 
update of an existing clone, while a move clone and a release clone are 
more than likely full clone operations?

SAN SYMPTOMS, FOR THOSE INTERESTED

I'm seeing SCSI command timeouts and UFS log timeouts (on vice 
partitions using the SAN for storage) on LUNs used for vicep's on the 
Hitachi USP, and was also seeing them on the 9585 until a recent 
configuration change.

At first I thought this was load related, so wrote scripts to generate a 
goodly load.  It turns out that even with a one second sleep between 
file create/write/close operations and between rm operations the SCSI 
command timeouts still occur, and that it's not load but simply activity 
that turns up the timeouts.
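
For reference, a minimal sketch of that kind of low-rate activity 
script follows.  It is not the script I ran: the function name, file 
names, payload size and the fsync call are assumptions made for 
illustration.  It creates, writes and closes a few files with a pause 
after each, then removes them the same way, and records how long each 
operation took so a stalled one stands out.

```python
import os
import tempfile
import time


def churn(directory, count=5, pause=1.0, payload=b"x" * 4096):
    """Create/write/close `count` files, then remove them, pausing each time.

    Returns a list of (operation, path, seconds) tuples.
    """
    timings = []
    paths = [os.path.join(directory, f"churn-{i:04d}.tmp") for i in range(count)]

    for path in paths:
        start = time.monotonic()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # assumption: force the write out to storage
        timings.append(("create", path, time.monotonic() - start))
        time.sleep(pause)

    for path in paths:
        start = time.monotonic()
        os.remove(path)
        timings.append(("remove", path, time.monotonic() - start))
        time.sleep(pause)

    return timings


if __name__ == "__main__":
    # Assumption: a scratch directory stands in for one on a SAN-backed
    # filesystem; point `churn` at a real test directory instead.
    with tempfile.TemporaryDirectory() as scratch:
        for op, path, seconds in churn(scratch, count=2, pause=1.0):
            print(f"{op:6s} {seconds:8.4f}s {os.path.basename(path)}")
```

Comparing the recorded times against the timestamps of the timeouts in 
the system log is then a manual step.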

AFS is an excellent diagnostic for storage and network burps, and we've 
unsurprisingly seen more of the SCSI command timeouts and UFS log 
timeouts (Solaris) on the AFS file servers than anywhere else, but have 
seen some occurrences elsewhere.

The impact of the Solaris UFS log timeout is confined to the vicep which 
is, in response to the log timeout, unmounted by Solaris.  It must be 
fsck'd and remounted.  Not great with several hundred GB out of service 
for the duration of the fsck.  One UFS log timeout resulted in the loss 
of ~200 GB of data.  (More accurately: fsck ran for more than five 
days.  I'd already restored the data from tape, and chose to 'lose' the 
recovered data after fsck completed, since I couldn't figure out what 
the heck fsck had been doing for five days and didn't trust the 
results.  By then the restored volumes also had five days of updates, 
and there was no requirement to merge the recovered data with the 
restored data.)

The HBAs on the 9585 were apparently configured as active/passive and 
not active/active (or vice versa), and I've not seen SCSI command 
timeouts on any of the 9585 LUNs since the configuration was changed.

IN CLOSING

I realize this isn't a SAN forum and my inquiry isn't about SANs; I 
provide the information above just to share my experiences with SAN over 
the past six months.  We ran successfully for two years prior to the 
onset of these issues, and if anyone wants to discuss SAN issues 
offline, my email address is below.  I can tell you what we saw and what 
we've done to correct issues, but I'm not a SAN expert by any means.

TIA

Kim Kimball
dhk at ccre period com