[OpenAFS] Problems on AFS Unix clients after AFS fileserver moves
Rich Sudlow
rich@nd.edu
Wed, 10 Aug 2005 10:36:26 -0500
Todd DeSantis wrote:
> Hi Rich -
>
> I am glad that
>
> fs checkvolumes
>
> was able to help you get rid of this problem.
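For anyone hitting this later, the fix was just the following, run on the
affected client (no arguments needed, and it's safe on a live machine):

    # ask the cache manager to re-check volume location info against the VLDB
    fs checkvolumes
    # then list which servers the client still thinks are down
    fs checkservers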
>
> Hopefully this was not a coincidence and the "vos release"
> of the bogus root.cell.readonly also did not happen around
> this time.
>
> To help understand why your clients were in this state
> I would like to ask some questions:
>
> - a kdump snapshot would have been able to give us some
> information on the state of the client and could have
> helped us determine if any volume and/or vcache entry
> was still pointing at this old fileserver
Yes - that would be nice. I wish I used these tools more and
was more proficient with them, but I'm no longer supposed
to do this ;-) As I mentioned, though, these problems have been
happening for a number of years. I've also seen very inconsistent
releases of root.cell at our site, e.g. some sites going offline
and a LOT of communication errors during the release, which
happens daily at 7 A.M.
>
> Did you just not build kdump for the client, or does
> OpenAFS not build kdump by default ?
I don't remember - I believe there are "problems" getting
this to build under OpenAFS.
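Next time I'll try to grab some of that state with cmdebug pointed at the
client itself (port 7001 is the cache manager, the same port rxdebug talks
to further down); roughly:

    # dump the client's cache entries, callbacks and lock state
    cmdebug localhost -long
    # the cache manager port can also be given explicitly
    cmdebug localhost -port 7001 -long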
>
> - when was this fileserver taken out of commission, was it
> within 2 hours ?
No - MUCH longer, more than a day.
>
> Normal callback timeouts on volumes would be 2 hours.
> There is a daemon on the client that runs every 2 hours
> and clears the "volume status" flag on the volumes in the
> volume cache if the expiration time has elapsed. I think
> readonly volumes had a maximum 2-hour timeout.
What happens when the first readonly volume is "screwed up"
as we saw yesterday, due to the lack of a vos release on root.cell?
Although, as mentioned, this always used to work (transparent
fileserver moves and reconfigurations) until the last couple of years.
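When a release looks suspect, the quickest check I know of is against the
VLDB itself (the exact flag wording in the output is from memory, so treat
this as a sketch):

    # show the RW/RO sites for root.cell and any "Not released" / "Old release" flags
    vos examine root.cell
    # if a site looks stale, re-release the readonly clones
    vos release root.cell -verbose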
>
> This process also causes the vcache structures to have
> their CStatd bit cleared. This tells the client to run
> a FetchStatus call to determine if my cached version is
> still the correct version of the file/dir.
>
> This is the way that the IBM Transarc clients work. It is
> possible that the OpenAFS code has changed the callback timing
> a bit, I am not sure of this.
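That matches what we see here. For a single stuck file or directory I assume
the per-object equivalent can be forced by hand (the path is just an example):

    # throw away cached status and data for one object, forcing a fresh FetchStatus
    fs flush /afs/nd.edu/some/stuck/path
    # or drop everything cached from that object's volume
    fs flushvolume /afs/nd.edu/some/stuck/path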
>
> But the above procedures will cause the following to happen
> the next time you try to access a file or directory that
> has had its volume status flag cleared:
>
> - contact the vlserver and get location information for
> the volume. If the client still thought that this file
> lived on the bad fileserver, and the VLDB information is
> correct, then it would get the new server location info.
>
> - it would then contact the fileserver with a FetchStatus
> call to determine if its cache is current, or if it
> needs to do a FetchData call to the fileserver for your
> directories and files.
>
> - and at this time, it has located the directory/file you
> are looking for
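A quick way to see whether the client and the VLDB agree on where a volume
lives, for any path that's acting up (path and volume name are placeholders):

    # volume name/ID and status as the client sees it
    fs examine /afs/nd.edu/some/stuck/path
    # where the client's cache manager thinks that volume is
    fs whereis /afs/nd.edu/some/stuck/path
    # what the VLDB actually says, using the volume name from fs examine
    vos listvldb -name <volume-name>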
>
> Other ways that the volume location information can get cleared is
> with
>
> - fs checkvolumes, as Kim and I suggested to Rich
> - vos move
> - vos release
> - bringing more volumes into the cache than the -volumes option
> in afsd. This causes some volumes to cycle out of the cache
> and this can clear the status flag for the volume
> - and possibly other vos transactions on the volume
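That one about the -volumes option may be relevant for our "busy cache"
machines. The volume cache size is set when afsd starts (on our Red Hat
clients I believe that's /etc/sysconfig/afs); the numbers here are only an
illustration, not a recommendation:

    # raise the number of volume cache entries (and stat entries) at client start-up
    afsd -volumes 200 -stat 3000 <other options>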
>
> Also, as Derrick mentioned in the first email, once the client knows
> about a fileserver, it will remember it until the client is rebooted.
> And every once in a while the CheckServersDaemon will run and it will
> see that it does not get an answer from this fileserver. And then
> every 5 minutes or so, the client will send a GetTime request to the
> fileserver IP to determine if the fileserver is back up. This could
> have been the tcpdump traffic you saw going to this old fileserver IP,
> the GetTime call.
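That would explain it. For reference, the probes should be visible with
something like this (the interface name is just an example, and 7000 is the
usual fileserver port):

    # watch the periodic probes to the decommissioned fileserver
    tcpdump -n -i eth0 udp port 7000 and host reno.helios.nd.edu
    # or poke the fileserver port directly to see whether anything answers
    rxdebug reno.helios.nd.edu 7000 -version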
>
> Sorry for chiming in on this one, but I wanted to add some information
> to this issue, since the "checkv" seems to have gotten us out of this
> problem.
No - thank YOU very much!!
>
> A kdump snapshot would have really helped.
OK
>
> And one more thing to check is if OpenAFS changed any of the
> callback timing for volumes.
OK - thanks. I did see some very similar messages reported
for the Windows client - and mention of some recent server
changes to go with them - though I'm not 100% sure these are related.
https://lists.openafs.org/pipermail/openafs-info/2005-June/018298.html
Thanks for your help Todd ;-)
Rich
>
> Thanks
>
> Todd DeSantis
> AFS Support
> IBM Pittsburgh Lab
>
>
>
>
> Rich Sudlow <rich@nd.edu>
> Sent by: openafs-info-admin@openafs.org
> To: dhk@ccre.com
> cc: "openafs" <openafs-info@openafs.org>
> Date: 08/09/2005 05:21 PM
> Subject: Re: [OpenAFS] Problems on AFS Unix clients after AFS fileserver moves
>
> Dexter 'Kim' Kimball wrote:
>
>>fs checkv will cause the client to discard what it remembers about
>>volumes.
>>
>>Did you try that?
>
>
> No - That worked!
>
> Thanks
>
> Rich
>
>
>>Kim
>>
>>
>> -----Original Message-----
>> From: openafs-info-admin@openafs.org
>> [mailto:openafs-info-admin@openafs.org] On Behalf Of Rich Sudlow
>> Sent: Tuesday, August 09, 2005 9:58 AM
>> To: openafs
>> Subject: [OpenAFS] Problems on AFS Unix clients after AFS
>> fileserver moves
>>
>>
>> For the last couple of years we've been having problems in our
>> cell with AFS clients after fileservers are taken out of service.
>> Before that, things seemed to work OK when doing fileserver moves
>> and rebuilds. All data was moved off the fileserver, but the
>> clients still seem to have some need to talk to it. In the past
>> the AFS admins have left the fileservers up and empty for a
>> number of days, but that doesn't resolve the issue.
>>
>> A recent example:
>>
>> The fileserver reno.helios.nd.edu was shut down after all data
>> was moved off of it. However, clients still can't get to
>> a number of AFS files.
>>
>> [root@xeon109 root]# fs checkservers
>> These servers unavailable due to network or server problems:
>> reno.helios.nd.edu.
>> [root@xeon109 root]# cmdebug reno.helios.nd.edu -long
>> cmdebug: error checking locks: server or network not responding
>> cmdebug: failed to get cache entry 0 (server or network
>> not responding)
>> [root@xeon109 root]# cmdebug reno.helios.nd.edu
>> cmdebug: error checking locks: server or network not responding
>> cmdebug: failed to get cache entry 0 (server or network
>> not responding)
>> [root@xeon109 root]#
>>
>> [root@xeon109 root]# vos listvldb -server reno.helios.nd.edu
>> VLDB entries for server reno.helios.nd.edu
>>
>> Total entries: 0
>> [root@xeon109 root]#
>>
>> on the client:
>> rxdebug localhost 7001 -version
>> Trying 127.0.0.1 (port 7001):
>> AFS version: OpenAFS 1.2.11 built 2004-01-11
>>
>>
>> This is a Linux 2.4 client and I don't have kdump - we've
>> also had these problems on sun4x_58 clients.
>>
>> I should mention that we've seen some correlation with this
>> happening on machines with "busy" AFS caches - which makes it
>> even more frustrating, as it seems to affect the machines that
>> depend on AFS the most. We've tried lots of fs flush* variants -
>> so far we've ended up rebooting, which does fix the
>> problem.
>>
>> Does anyone have any clues what the problem is or what a workaround
>> might be?
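On the "busy cache" point, cache pressure can at least be measured without
disturbing the client (assuming the default cache location); neither command
changes any state:

    # how full the AFS disk cache is versus its configured maximum
    fs getcacheparms
    # and how full the partition holding it is (default cache directory shown)
    df /usr/vice/cache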
>>
>> Thanks
>>
>> Rich
>>
--
Rich Sudlow
University of Notre Dame
Office of Information Technologies
321 Information Technologies Center
PO Box 539
Notre Dame, IN 46556-0539
(574) 631-7258 office phone
(574) 631-9283 office fax