[OpenAFS] More file corruption with disconnected mode

Fri, 03 Sep 2010 18:52:15 -0400

Thanks for the reply, Simon.

> I'm aware of a number of people who are successfully using  
> disconnected mode. Whilst it is not without some rough edges, it is  
> certainly stable enough for day to day use. I have certainly seen no  
> reports of the corruption problems you are seeing.

I assumed as much -- it has been around for a while now, so I was amazed
that I could break it so easily. Like you, I'm guessing the cause might be
recent changes in the 1.5 branch.

> I'm away at the moment, and so can't investigate your problem further.  

If you have time to look at it later and I can do anything to provide more
info, let me know. I would *love* to use disconnected AFS in my day-to-day
life; disconnected operation is key for me, since it means I can use AFS
across my workstation and laptop, which are my two main computing platforms.

I think getting disconnected operation going in AFS would open up the system
to a *lot* of users. It doesn't need to have sophisticated conflict
resolution. After all, the standard use case is a user's home directory. When
the user is working on a disconnected client, he's *disconnected* -- so he's
almost certainly not causing side-effects via some other host.

It *does* need to support pinning, though: a key use case is that
I'm sitting at my workstation preparing, say, a presentation.
When it's time to go, I want to just stand up, close the lid on
my laptop and walk out of my office, knowing that when I saved my
file on my workstation, AFS shipped the bits out to my laptop for
me. I would be willing to run a single command on the laptop to
ensure it is synced, but what would not be so good is having to
issue some kind of "find . -type f -exec cat {} \; > /dev/null"
tree-walk on my laptop to ensure everything is pulled in: the
work is proportional to the size of the volume, so it doesn't
scale. If disconnected AFS would just support some simple pinning
mode -- or, even dumber and simpler, if clients simply had
an "eager" global mode that caused them to always reload a file
when receiving a broken-callback notice, instead of dropping the
file, so that files *only* get dropped when the cache overflows
-- then I'd be golden. A global "eager" mode wouldn't be too hard,
would it?
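
To make the "eager" idea concrete, here is a toy sketch in Python. It
is not OpenAFS code: ToyCache, fetch, and callback_broken are names I
made up for illustration, and a real cache manager tracks much more
state. It only shows the policy difference: on a broken callback, a
lazy client drops the file, while an eager one refetches it, and
eviction happens only when the cache overflows.

```python
from collections import OrderedDict


class ToyCache:
    """Toy model of a client file cache (hypothetical, not OpenAFS code)."""

    def __init__(self, fetch, capacity, eager=False):
        self.fetch = fetch          # hypothetical callable: path -> bytes
        self.capacity = capacity    # maximum number of cached files
        self.eager = eager          # True: refetch on broken callback
        self.files = OrderedDict()  # path -> data, least recently used first

    def _store(self, path, data):
        self.files[path] = data
        self.files.move_to_end(path)
        # Files are evicted only when the cache overflows.
        while len(self.files) > self.capacity:
            self.files.popitem(last=False)

    def read(self, path):
        if path in self.files:
            self.files.move_to_end(path)
            return self.files[path]
        data = self.fetch(path)
        self._store(path, data)
        return data

    def callback_broken(self, path):
        """The server reports that our cached copy of path is stale."""
        if path not in self.files:
            return
        if self.eager:
            # Eager mode: reload the file right away.
            self._store(path, self.fetch(path))
        else:
            # Lazy mode: drop the file and refetch on the next read.
            del self.files[path]
```

With eager=True, the laptop's copy is already current when the lid
closes, at the cost of fetching files that may never be read there.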

> From the information you provide it looks like the replay to the  
> server is silently failing, and so you're left with the locally  
> written data in the cache, but the server's metadata.

That was my uneducated guess, as well.

> Do you have valid tokens when you make the reconnection call?

Yes. But it shouldn't matter, should it? If I do have a valid token,
the reconnection should win. If not, it should lose. But a split outcome,
where the metadata wins and the regular data loses, shouldn't happen in
any event.

> Of course, 1.5.75 is a development release, and there has been a lot  
> of churn there. It is entirely possible that someone has broken the  
> disconnected replay code since it was last tested.

I was nervous about using 1.5.75 for exactly this reason. And, in
fact, I was unable to get the 1.5.75 server to come up and work
on my Ubuntu system -- I tried but had to fall back to 1.4. (To
eliminate a conjecture: after falling back to 1.4, I deleted all
my volumes and recreated them, just in case the 1.5 server
lossage had somehow messed them up.) As for the 1.5 client:
disconnected operation's not available in the 1.4 branch, so I
had to use 1.5.

While I'm writing, I have a question about the scalability of disconnected
operation. What is the essential algorithm that happens when I reconnect a
disconnected client? If the client has n files in its cache, does the client
do an O(n) scan of all of the cache to verify that the files aren't stale?
Or does it do something cleverer that would take little work if little has
changed server-side since we disconnected?
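
To make the question concrete, here are the two strategies I have in
mind, as a toy Python sketch. I don't know what the client actually
does; the per-file and per-volume version numbers and all of the
function names here are assumptions for illustration only.

```python
def stale_per_file(cache, file_version):
    """Naive strategy: query the server once per cached file (n queries).

    cache maps path -> (volume, version seen before disconnecting).
    file_version is a hypothetical callable that asks the server for
    a file's current version.
    """
    return sorted(p for p, (_vol, seen) in cache.items()
                  if file_version(p) != seen)


def stale_per_volume(cache, volumes_seen, volume_version, file_version):
    """Cheaper strategy: one query per volume, then per-file queries
    only inside volumes whose (hypothetical) version number changed."""
    changed = {v for v, seen in volumes_seen.items()
               if volume_version(v) != seen}
    return sorted(p for p, (vol, seen) in cache.items()
                  if vol in changed and file_version(p) != seen)
```

Under the second strategy, the per-file work is proportional to the
number of cached files in volumes that changed, not to the size of the
whole cache.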

Again, thanks for the fast reply. Let me know if there's anything else 
I can do.
    -Olin