[OpenAFS] Re: Large files with 1.6.0pre2

Andrew Deason adeason@sinenomine.net
Sat, 5 Mar 2011 16:49:17 -0600

On Sat, 5 Mar 2011 21:54:20 +0000
Simon Wilkinson <sxw@inf.ed.ac.uk> wrote:

> On 5 Mar 2011, at 21:31, Ryan C. Underwood wrote:
> > 
> > I also looked for any read, seek, or stat call that returned
> > negative.  No luck.  It seems like all the threads are being
> > captured...

The fileserver storage backend code caches file descriptors, so a
previous access could have opened it. Either that, or we're somehow
failing before we get to accessing the file data. But that seems
unlikely if reads before the 2G mark are fine; you can access the
beginning of the file, right?

You could restart the dafileserver process and start strace'ing right
away, or try to correlate the open file descriptors in /proc/foo/fd; of
course, you can't do that if you wait until after the salvage happened,
since the fileserver won't have it open anymore.
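
As a rough illustration of the /proc approach, here is a minimal Python
sketch that lists which descriptors a process holds under a given path
prefix. The function name fds_under and its arguments are made up for
this example, and it assumes a Linux /proc layout and permission to
read the target process's fd directory (typically root):

```python
import os

def fds_under(pid, prefix):
    """Return {fd: target} for descriptors of `pid` whose target path
    starts with `prefix` (hypothetical helper, Linux /proc only)."""
    fd_dir = "/proc/%d/fd" % pid
    matches = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith(prefix):
            matches[int(name)] = target
    return matches
```

Calling fds_under(pid, "/vicepa/AFSIDat/") with the dafileserver pid
would then give the fd numbers to look for in the strace output.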

And there probably won't be any relevant seeks (just pread, open, and
fstat), and a short read can trigger this, which won't show up as a
negative return value.
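
To look for short reads rather than negative returns, something like
the following Python sketch could filter saved strace output. It
assumes the usual strace line shape for pread/pread64, that is
"pread64(fd, buf, count, offset) = ret"; the regex and the name
short_preads are illustrative only and may need adjusting for your
strace version and options:

```python
import re

# Assumed strace line shape: pread64(fd, "data"..., count, offset) = ret
PREAD_RE = re.compile(r"pread(?:64)?\((\d+), .*, (\d+), (\d+)\)\s*= (-?\d+)")

def short_preads(lines):
    """Yield (fd, requested, offset, returned) for each pread that
    succeeded but returned fewer bytes than were requested."""
    for line in lines:
        m = PREAD_RE.search(line)
        if m is None:
            continue  # not a completed pread line
        fd, requested, offset, returned = map(int, m.groups())
        if 0 <= returned < requested:
            yield fd, requested, offset, returned
```

Note that a short read at the real end of a file is normal, so hits
only matter where the offset is still inside the expected file length.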

Also, can you check if /vicepa/AFSIDat/3=/3=++U/8/L3/Que++kB44 is
actually 2147483648 bytes long? Can you read the contents of the file
directly from vicepa successfully? (just don't change anything in the
data or metadata of the file)
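
A minimal Python sketch of that check, opening the file read-only and
comparing the size reported by fstat with the number of bytes that can
actually be read. The name check_readable is hypothetical; also note
that reading may still update the access time, depending on mount
options:

```python
import os

def check_readable(path, chunk=1 << 20):
    """Open `path` read-only and return (stat_size, bytes_read)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        stat_size = os.fstat(fd).st_size
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break  # end of file
            total += len(buf)
    finally:
        os.close(fd)
    return stat_size, total
```

If both numbers come back as 2147483648, the backing file itself is at
least readable end to end; an OSError partway through would point at
the underlying filesystem instead.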

> Another thing you could try is (if this is a test system) attach to
> the fileserver process with gdb, and set a breakpoint at VTakeOffline.
> Then try and reproduce the problem. Hopefully, when the fileserver
> decides to take the volume offline, you'll hit that breakpoint, and
> 'bt' will let us know exactly where this is being triggered.

Or this. Or we could change the error messages to actually provide
useful information, since they're currently amazingly unhelpful.

Andrew Deason