[OpenAFS] Strange caching failures

Stephan Wonczak a0033@rrz.uni-koeln.de
Thu, 28 Feb 2013 15:37:29 +0100 (CET)


   Hi all!
   For the past few weeks we have been struck by very weird behavior 
regarding cache updates on AFS clients. It looks like the callback 
sometimes does not work, leaving one client stuck with an older version 
of the file in question. Example:

   Write to file 'foo' on client A every five minutes.
   Clients B, C and D dutifully update their caches and see the updates.
   After some time, client B suddenly does not see the updates any more, 
while clients C and D continue working fine.
   A 'fs flush foo' on client B corrects the problem.
   Other files are *NOT* affected and are updated fine on Client B.

   This behavior is not really repeatable, though: sometimes it is client D 
that stops working, or any other client.
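
   Since the failure moves around between clients, one way to catch it in 
the act might be a small check run periodically on every reading client. 
The following is only a rough sketch under assumptions that are not part 
of our current setup: client A would have to write the current Unix 
timestamp as the first line of 'foo', and the 60-second slack as well as 
the names parse_stamp, is_stale, check and report are made up for 
illustration.

```python
import time

INTERVAL = 300  # client A writes 'foo' every five minutes
SLACK = 60      # assumed tolerance for write and callback delay


def parse_stamp(text):
    """Return the Unix timestamp stored on the first line of the file."""
    return float(text.strip().splitlines()[0])


def is_stale(stamp, now, interval=INTERVAL, slack=SLACK):
    """True if the stamp is older than one write interval plus slack."""
    return now - stamp > interval + slack


def check(path, now=None):
    """Read the file through the AFS cache; return (stale, stamp)."""
    now = time.time() if now is None else now
    with open(path) as fh:
        stamp = parse_stamp(fh.read())
    return is_stale(stamp, now), stamp


def report(path, now=None):
    """Build one log line describing the cached state of the file."""
    now = time.time() if now is None else now
    stale, stamp = check(path, now)
    state = "STALE" if stale else "ok"
    return "%s %s age=%.0fs" % (state, path, now - stamp)
```

   Calling report() on the path of 'foo' from cron on clients B, C and D 
and logging the result would at least show which client went stale and 
roughly when, before anyone runs 'fs flush foo' on it.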
   When taking into account that the clients I am talking about here are 
web servers, you can imagine that this behavior is less than desirable.

   Now for the versions:
   Clients are a mix of 1.4.x and 1.6.x (mainly 1.6.1-1.el5 and 1.4.12-el5, 
with other versions thrown into the mix).
   Servers are version 1.4.11-el5.
   The OS on both clients and servers is RHEL5.

   It might be a coincidence, but we became aware of this problem shortly 
after updating a bunch of clients to 1.6.1-1.el5.

   Any ideas on how to go about debugging this?

 	Dipl. Chem. Dr. Stephan Wonczak

         Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
         Universitaet zu Koeln, Weyertal 121, 50931 Koeln
         Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625