[OpenAFS] Problems on AFS Unix clients after AFS fileserver moves

Rich Sudlow rich@nd.edu
Wed, 10 Aug 2005 10:36:26 -0500


Todd DeSantis wrote:
> Hi Rich -
> 
> I am glad that
> 
>       fs checkvolumes
> 
> was able to help you get rid of this problem.
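> 
> (For anyone following along, it is run on the client, optionally
> followed by a flush of any path that is still misbehaving - for
> example, with an illustrative path:
> 
>       fs checkvolumes
>       fs flushvolume /afs/nd.edu
> 
> Any path inside the affected volume would do.)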
> 
> Hopefully this was not a coincidence, i.e. the "vos release"
> of the bogus root.cell.readonly did not also happen around
> this time.
> 
> To help understand why your clients were in this state
> I would like to ask some questions:
> 
>  - a kdump snapshot would have been able to give us some
>    information on the state of the client and could have
>    helped us determine if any volume and/or vcache entry
>    was still pointing at this old fileserver
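> 
>    (Capturing one is straightforward if the tool is built: run it
>    as root on the client and save the output, e.g.
> 
>       kdump > /tmp/afs-kdump.out
> 
>    though the exact options vary by platform and build.)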

Yes - that would be nice. I wish I used these tools more and
were more proficient with them, but I'm no longer supposed
to do this ;-)  As I mentioned, these problems have been happening
for a number of years.  I've also seen very inconsistent
releases of root.cell at our site, e.g. some replicas going offline
and a LOT of communication errors during the release,
which happens daily at 7 A.M.

> 
>    Did you just not build kdump for the client, or does
>    OpenAFS not build kdump by default?

I don't remember - I believe there are "problems" getting
this to build on OpenAFS.

> 
>  - when was this fileserver taken out of commission, was it
>    within 2 hours?

No - MUCH longer, more than a day.

> 
>    Normal callback timeouts on volumes would be 2 hours.
>    There is a daemon on the client that will run every 2
>    hours and it will clear the "volume status" flag on
>    the volumes in the volume cache, if the expiration time
>    has elapsed.  I think readonly volumes had a maximum
>    2 hour timeout.
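> 
>    (You can inspect what the cache manager itself is holding by
>    pointing cmdebug at the client rather than at the fileserver,
>    e.g. run on the client:
> 
>       cmdebug localhost -long
> 
>    This lists the cached vcache entries and should show any that
>    still reference the old fileserver.)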

What happens when the first readonly volume is "screwed up"
as we saw yesterday, due to the lack of a vos release on root.cell?
Although, as mentioned, this always used to work (transparent
fileserver moves and reconfigurations) until the last couple of years.
> 
>    This process also causes the vcache structures to have
>    their CStatd bit cleared.  This tells the client to run
>    a FetchStatus call to determine if its cached version is
>    still the correct version of the file/dir.
> 
>    This is the way that the IBM Transarc clients work.  It is
>    possible that the OpenAFS code has changed the callback timing
>    a bit, I am not sure of this.
> 
>    But the above procedures will cause the following to happen
>    the next time you try to access a file or directory that
>    has had its volume status flag cleared (example commands for
>    checking this follow the steps):
> 
>       - contact the vlserver and get location information for
>         the volume.  If the client still thought that this file
>         lived on the bad fileserver, and the VLDB information is
>         correct, then it would get the new server location info.
> 
>       - it would then contact the fileserver with a FetchStatus
>         call to determine if its cache is current, or if it
>         needs to do a FetchData call to the fileserver for your
>         directories and files.
> 
>       - and at this time, it has located the directory/file you
>         are looking for
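> 
> You can check both halves of that by hand: "fs whereis" shows where
> the client thinks a file lives, and "vos listvldb" shows what the
> VLDB says, e.g. (the path is only an illustration)
> 
>       fs whereis /afs/nd.edu/some/path
>       vos listvldb -name root.cell
> 
> If the two disagree, the client is working from stale volume
> location information.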
> 
> Other ways that the volume location information can get cleared
> include (examples follow the list):
> 
>       - fs checkvolumes, as Kim and I suggested to Rich
>       - vos move
>       - vos release
>       - bringing more volumes into the cache than the -volumes option
>         in afsd.  This causes some volumes to cycle out of the cache
>         and this can clear the status flag for the volume
>       - and possibly other vos transactions on the volume
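> 
> For example (the server and partition names in the move are only
> placeholders):
> 
>       fs checkvolumes
>       vos release root.cell
>       vos move -id root.cell -fromserver fs1.example.edu -frompartition a \
>               -toserver fs2.example.edu -topartition a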
> 
> Also, as Derrick mentioned in the first email, once the client knows
> about a fileserver, it will remember it until the client is rebooted.
> And every once in a while the CheckServersDaemon will run and it will
> see that it does not get an answer from this fileserver.  And then
> every 5 minutes or so, the client will send a GetTime request to the
> fileserver IP to determine if the fileserver is back up.  This could
> have been the tcpdump traffic you saw going to this old fileserver IP,
> the GetTime call.
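> 
> If you want to confirm that, something like
> 
>       tcpdump -n host reno.helios.nd.edu and udp port 7000
> 
> run on the client should show the periodic probes to the fileserver
> port (7000 is the AFS fileserver's UDP port; the hostname is the
> retired server from your example).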
> 
> Sorry for chiming in on this one, but I wanted to add some information
> to this issue, since the "checkv" seemed to get us out of this
> problem.

Not at all - thank you very much!!

> 
> A kdump snapshot would have really helped.

OK

> 
> And one more thing to check is if OpenAFS changed any of the
> callback timing for volumes.

OK - thanks. I did see some very similar messages reported
for the Windows client - with mention of some recent server
changes to go with them - though I'm not 100% sure these are related.

https://lists.openafs.org/pipermail/openafs-info/2005-June/018298.html

Thanks for your help Todd ;-)

Rich

> 
> Thanks
> 
> Todd DeSantis
> AFS Support
> IBM Pittsburgh Lab
> 
> 
> 
> From: Rich Sudlow <rich@nd.edu>
> Sent by: openafs-info-admin@openafs.org
> To: dhk@ccre.com
> Cc: "'openafs'" <openafs-info@openafs.org>
> Date: 08/09/2005 05:21 PM
> Subject: Re: [OpenAFS] Problems on AFS Unix clients after AFS fileserver moves
> 
> Dexter 'Kim' Kimball wrote:
> 
>>fs checkv will cause the client to discard what it remembers about
>>volumes.
>>
>>Did you try that?
> 
> No - That worked!
> 
> Thanks
> 
> Rich
> 
> 
>>Kim
>>
>>
>>     -----Original Message-----
>>     From: openafs-info-admin@openafs.org
>>     [mailto:openafs-info-admin@openafs.org] On Behalf Of Rich Sudlow
>>     Sent: Tuesday, August 09, 2005 9:58 AM
>>     To: openafs
>>     Subject: [OpenAFS] Problems on AFS Unix clients after AFS
>>     fileserver moves
>>
>>
>>     We've been having problems in our cell for the last couple of
>>     years with AFS clients after fileservers are taken out of service.
>>     Before that, things seemed to work OK when doing fileserver
>>     moves and rebuilding.  All data was moved off the fileserver, but
>>     the clients still seem to have some need to talk to it.  In the
>>     past the AFS admins have left the fileservers up and empty for a
>>     number of days, but that doesn't resolve the issue.
>>
>>     A recent example:
>>
>>     The fileserver reno.helios.nd.edu was shut down after all data
>>     was moved off of it.  However, the client still can't get to
>>     a number of AFS files.
>>
>>     [root@xeon109 root]# fs checkservers
>>     These servers unavailable due to network or server problems:
>>     reno.helios.nd.edu.
>>     [root@xeon109 root]# cmdebug reno.helios.nd.edu -long
>>     cmdebug: error checking locks: server or network not responding
>>     cmdebug: failed to get cache entry 0 (server or network
>>     not responding)
>>     [root@xeon109 root]# cmdebug reno.helios.nd.edu
>>     cmdebug: error checking locks: server or network not responding
>>     cmdebug: failed to get cache entry 0 (server or network
>>     not responding)
>>     [root@xeon109 root]#
>>
>>     [root@xeon109 root]#  vos listvldb -server reno.helios.nd.edu
>>     VLDB entries for server reno.helios.nd.edu
>>
>>     Total entries: 0
>>     [root@xeon109 root]#
>>
>>     on the client:
>>     rxdebug localhost 7001 -version
>>     Trying 127.0.0.1 (port 7001):
>>     AFS version:  OpenAFS 1.2.11 built  2004-01-11
>>
>>
>>     This is a Linux 2.4 client and I don't have kdump - I've
>>     also had these problems on sun4x_58 clients.
>>
>>     I should mention that we've seen some correlation with this
>>     happening on machines with "busy" AFS caches - which makes it
>>     even more frustrating, as it seems to affect the machines that
>>     depend on AFS the most.  We've tried lots of fs flush* variants
>>     (see below).  So far we've ended up rebooting, which does fix
>>     the problem.
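>>
>>     (The flush attempts looked something like this, with the paths
>>     being whatever was hanging at the time:
>>
>>     fs flush /afs/nd.edu/some/path
>>     fs flushvolume /afs/nd.edu/some/path
>>
>>     but none of it helped short of a reboot.)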
>>
>>     Does anyone have any clues what the problem is or what a workaround
>>     might be?
>>
>>     Thanks
>>
>>     Rich
>>
>>     --
>>     Rich Sudlow
>>     University of Notre Dame
>>     Office of Information Technologies
>>     321 Information Technologies Center
>>     PO Box 539
>>     Notre Dame, IN 46556-0539
>>
>>     (574) 631-7258 office phone
>>     (574) 631-9283 office fax
>>
> 
> 
> 
> 
> 
> --
> Rich Sudlow
> University of Notre Dame
> Office of Information Technologies
> 321 Information Technologies Center
> PO Box 539
> Notre Dame, IN 46556-0539
> 
> (574) 631-7258 office phone
> (574) 631-9283 office fax
> 
> 


-- 
Rich Sudlow
University of Notre Dame
Office of Information Technologies
321 Information Technologies Center
PO Box 539
Notre Dame, IN 46556-0539

(574) 631-7258 office phone
(574) 631-9283 office fax