[OpenAFS] status of samba serving AFS file space? other non-native windows access?

Jeffrey Altman jaltman@secure-endpoints.com
Tue, 17 Oct 2006 10:14:17 -0400


Dan Pritts wrote:
> On Mon, Oct 16, 2006 at 05:05:20PM -0400, Jeffrey Altman wrote:
>> Danno:
>>
>> I suspect that if people filed more bug reports when problems were
>> experienced that things would get fixed faster.  I understand that folks
> 
> for the record (and I know *you* know this), our windows support folks
> have in fact filed some bugs, or at least contacted you directly,
> specifically regarding the delay issues.  Also for the record you
> were apparently responsive but somehow along the way the problem 
> didn't get solved.  

Searching through the bugs queue, the only tickets I find submitted from
internet2.edu are from you.  That is not to say that folks have not
submitted tickets from other organizations, but I would be unable to
tie them back to Internet2.  If I had recognized the relationship, I
would have contacted you directly when the problem was reported.

It is often the case that the information provided within a report
is not sufficient to reproduce a problem or even narrow down a problem.
Problems such as a deadlock or a panic are easy to fix.  Problems
involving protocol behaviors are harder to identify but still not so
hard to fix.  Problems that involve complex interactions between
clients, networks, file servers, and the stored metadata are the most
challenging.

I have dozens of tickets in the openafs.org queue that have either been
marked stalled or resolved simply because the submitter stopped providing
responses to requests for more information.

If I am unable to reproduce a problem and cannot identify the cause from
the client side log files and if the file server logs are not
accessible, the easiest way for me to debug a problem would be to be
given remote access on a system within the cell that is experiencing the
problem.  Therefore, if I had seen an unresolvable ticket from
internet2.edu I would certainly have contacted you personally about it
because you would have been able to facilitate access to the necessary
information.

If you can provide the ticket number for the bug report I would be more
than happy to take another look at it.

> It may be that all of our problems have been either the long delays
> accessing AFS (which I think was due to the server bug that was fixed in
> 1.4.1), or the "missing data" problems, which I never bothered to report
> because I was sure it's just locking.  (and, because try as we could,
> we could never reproduce it).  Thanks for the info on the bug fixed
> in the upcoming release - it sounds like that will be an improvement,
> but I don't think it's related here, the users are editing the same file
> over and over.  

Every edit of an office document produces a temporary file in the same
directory as the original.  The edits are made to the temporary file and
the original is only updated after a "save" is executed.  I suspect that
if the users have the 5-minute auto-save feature enabled, they would
have triggered the bugs I described.
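
As a rough illustration of that save pattern (not what Office itself
does internally), here is a minimal Python sketch.  The function name
save_via_temp and the "~"/".tmp" naming are assumptions made up for the
example.  Note that every save creates and then removes a directory
entry next to the original, which is what generates the extra directory
traffic.

```python
import os
import tempfile


def save_via_temp(path, data):
    """Write data to a temporary file beside path, then replace path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the original,
    # so creating it changes that directory's contents.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="~", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The original is only updated here, at "save" time.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```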

If you have multiple people editing in the same directory, the creation
and deletion of temporary files will result in callbacks to the clients
that can also trigger the callback break race conditions.

> Part of the problem is that my team is overloaded and didn't do proper
> followup with this beta customer in our organization - so they got very
> frustrated.  I'm loath to frustrate them any further, eg, with -dev branch
> code that may have further problems, since they process my paycheck.

The fact that a release is labeled "stable" vs "devel" doesn't mean that
the code is more likely to work.  "Stable" simply means that fewer
changes will occur in its successors because no new features or
behavioral changes will be added to that branch.  This doesn't prevent
bugs from being discovered or introduced in that code.  What it does
mean is that if there is a bug that requires a major redesign of the
code in order to fix it, the bug fix may not go onto the "stable" branch
at all and only be applied to the development branch.

There are many organizations which deploy code off of the development
branch for Windows but the stable branch for UNIX.  If you are using
64-bit Windows you don't have a choice.  If you are using Office
applications and require locking you don't have a choice.  If you are
using files larger than 2GB you don't have a choice.  And now if you are
using large numbers of temporary files in a common directory from more
than one machine you don't really have a choice.

Another thing to think about is that bugs on the devel branch get fixed
faster than bugs on the stable branch.  Unlike a stable branch release,
a devel branch release does not have to work on, and have binaries built
for, all platforms.  Therefore, it is much easier for a fix to be tested
in the Windows client and for a release to be issued.

> Thanks also for the info on all the other afs access methods.  Sounds
> like samba isn't the right solution for us.   We may give the SFTPdrive
> thing a shot - I believe it will work, but it's not clear it enforces
> locking either.  

Be sure to check what their implementation does.  If it is a copy, local
edit, and replace model then for office documents which consist of a
single file you should be ok.  If the office documents in question use
multiple files with links between them, then I suspect that your users
will require simultaneous access to the same file images with support
for byte range locking.  As AFS does not support byte range locks in the
file server, if they are not simulated with full file locks by the
SFTPdrive (or other access client) you will continue to have problems.
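
To make "simulated with full file locks" concrete, here is a minimal
in-memory Python sketch.  It is not how SFTPdrive or the OpenAFS client
actually implements this; the class WholeFileLockTable and its methods
are hypothetical names, and it models exclusive locks only.  Any byte
range request is mapped onto a lock on the whole file, so a second
owner is refused even when the ranges do not overlap.

```python
class WholeFileLockTable:
    """Hypothetical model: byte range locks backed by whole-file locks."""

    def __init__(self):
        # path -> (owner, set of (offset, length) ranges that owner holds)
        self._held = {}

    def lock_range(self, path, owner, offset, length):
        """Return True if owner obtains the (whole-file) lock for a range."""
        holder = self._held.get(path)
        if holder is not None and holder[0] != owner:
            # Someone else holds the file: refuse, whatever the range is.
            return False
        if holder is None:
            holder = (owner, set())
            self._held[path] = holder
        holder[1].add((offset, length))
        return True

    def unlock_range(self, path, owner, offset, length):
        """Drop one range; release the file lock when no ranges remain."""
        holder = self._held.get(path)
        if holder is None or holder[0] != owner:
            return
        holder[1].discard((offset, length))
        if not holder[1]:
            del self._held[path]
```

The cost of this approach is lost concurrency: two users who would each
lock a different part of the file are serialized instead.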

Jeffrey Altman