[OpenAFS] out-of-sync files in client cache (1.4.0)

Peter Somogyi psomogyi@gamax.hu
Thu, 20 Apr 2006 14:43:54 +0200


Hi,

We have a problem at a customer site (in a production environment): the 
machines run SLES9, kernel 2.6.5-7.193-smp, 586.
All OpenAFS clients and servers are 1.4.0.

Symptom: when a client creates, deletes, or modifies any file in a given 
directory, that client sees the change, but a small portion of the other 
clients do not.
The affected clients stay in this out-of-sync state for 2-3 hours.
"Small portion" means that 1-4 of the 20-40 clients don't see the changes.
/The affected clients are otherwise alive: when I write a file on one of 
them, the change is visible both on that client and on the other "good" 
clients./

I've looked at the tcpdumps, and the problem is that somehow the server 
doesn't send "Operation callback(204)" to these (1-4) problematic client(s) 
when a file or a directory changes. (But the other clients are notified 
correctly.)
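
For anyone who wants to run the same check on their own capture, here is a 
minimal sketch of the comparison I did by eye. It is only an illustration 
under assumptions: the line shape matched by LINE_RE, the "callback" marker 
text, and the function name clients_without_callbacks are all hypothetical 
and have to be adapted to whatever your capture tool actually prints. Port 
7001 is the standard AFS callback port, but ordinary fileserver replies also 
arrive on it, which is why the sketch looks at the decoded payload too.

```python
import re

# Assumed shape of one decoded capture line (adapt to your tool's output):
#   "<timestamp> IP <src>.<sport> > <dst>.<dport>: <decoded payload>"
LINE_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): (.*)"
)


def clients_without_callbacks(lines, server_ip, expected_clients,
                              callback_port=7001, marker="callback"):
    """Return the expected clients that never received a callback packet.

    A line counts as a callback when it goes from server_ip to the
    client's callback port and its decoded payload contains `marker`
    (the marker string is an assumption about the decoder's wording).
    """
    notified = set()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a line in the assumed format
        src, _sport, dst, dport, payload = match.groups()
        if (src == server_ip and int(dport) == callback_port
                and marker in payload):
            notified.add(dst)
    return sorted(set(expected_clients) - notified)
```

Feeding it the capture taken on the server around one file change, together 
with the list of clients that had the directory cached, should leave exactly 
the out-of-sync clients in the result.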

Has anybody met the same problem? We would appreciate any suggestions, ideas, 
or help.

Client config:
-stat 50000 -dcache 4200 -daemons 6 -volumes 256 -nosettime -chunksize 17 -rxpck 1500

Fileserver config: (BosConfig)
restarttime 16 0 0 0 0
checkbintime 3 0 5 0 0
bnode fs fs 0
parm /usr/lib/openafs/fileserver -pctspare 10 -L -udpsize 1310720 -nojumbo -abortthreshold 0 -busyat 1800
parm /usr/lib/openafs/volserver -p 16 -syslog -udpsize 1310720 -nojumbo
parm /usr/lib/openafs/salvager -parallel 4 -syslog -DontSalvage
end

Notes:
- the server is heavily loaded (both CPU and memory; some other heavy apps 
are running there, too)
- I couldn't find any unusual messages in the logs
- reproducibility: it occurs sporadically in the production environment, 
once every 1-3 days, and lasts for a few hours

If someone is interested in the details (tcpdumps, log files, configs), I can 
send some, but I may first have to ask our customer for permission to 
forward or request them, and this may take time.

-- 
Peter Somogyi
Gamax Kft
Bartok Bela ut 15/D
H-1114, Budapest, Hungary