[OpenAFS] client problems

Martin Schulz schulz@iwrmm.math.uni-karlsruhe.de
11 Apr 2001 12:27:51 +0200



Derrick J Brashear <shadow@dementia.org> writes:
 
> Linux 2.2 SMP AFS seems to have some problems. I've been working on one
> with little luck thus far:
> create a volume
> mount the volume
> set the acl
> set a quota
> 
> fs sq never returns, and the afs client subsequently cannot be shut
> down. my assumption is there are locking issues somewhere, but thus far
> nothing reveals that to be the case.

I also occasionally observed that the afs client could not be shut
down, tough I have no problems with the fs, vos, bos etc commands. It
is the writes that are incredibly slow, reads are fast, even after a
restart of the client (The cache should then be empty, isn't it?)

I observed that the /usr/vice/etc/afsd ${OPTIONS} command the start
script apparently never returns. Should it do so? I have put a & behind it 
so that it would not block the rest. 


In the meanwhile, I poked around with the fstrace dump commands (still
smp client config). This is a dump from writing a 1MB file to afs, with
interspersed comments as I interpret it. Please correct me if I am
wrong.

-------------------------------------------------------------------------
AFS Trace Dump -

   Date: Tue Apr 10 16:44:18 2001

Found 1 logs.

Contents of log cmfx:
time 388.796991, pid 0: Tue Apr 10 15:07:16 2001


time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800 
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 388.796991, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800 
time 388.796991, pid 26353: Access vp 0xf8969ffc mode 0x80 len 0x100000 

As I interpret it, I started the copy command here..

time 390.997086, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 390.997086, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 390.997086, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 390.997086, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 390.997086, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 390.997086, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 390.997086, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 390.997086, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 390.997086, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 390.997086, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800 
time 390.997086, pid 26353: Access vp 0xf8969a2c mode 0xc0 len 0x800 
time 390.997086, pid 26353: remove dir 0xf8969a2c name bla 
time 391.007087, pid 26353: Analyze RPC op 6 conn 0xd7497ae0 code 0x0 user 0x41d2fb44 
time 391.007087, pid 26353: TryToSmush vp 0xf8969ffc len 0x100000 
time 391.007087, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 391.007087, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 391.007087, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 391.007087, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 391.007087, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 391.007087, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 391.007087, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 391.007087, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 391.007087, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 391.007087, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800 
time 391.007087, pid 26353: Lookup adp 0xf8969a2c name bla fid (1:536870921.2.3), code=2 
time 391.007087, pid 26353: Returning code 2 from 19 
time 391.007087, pid 26353: Access vp 0xf8969a2c mode 0xc0 len 0x800 
time 391.007087, pid 26353: create file in vp 0xf8969a2c name bla mode 0x8180 
time 391.017087, pid 26353: Analyze RPC op 7 conn 0xd7497ae0 code 0x0 user 0x41d2fb44 
time 391.017087, pid 26353: ProcessFS vp 0xf896a170 old len 0x0 new len 0x0 
time 391.017087, pid 26353: Access vp 0xf896a170 mode 0x0 len 0x0 
time 391.017087, pid 26353: Open 0xf896a170 flags 0x8241 
time 391.017087, pid 26353: Open 0xf896a170 flags 0xf423f 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x0 count 0x1000 code 1869f 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06aae00 count 0x2 code 99999 
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x0 resid 0x1000 file length 0x0 
time 391.017087, pid 26353: Getdcache vp 0xf896a170 failed to find chunk 0x0 
time 391.017087, pid 26353: GetdCache vp 0xf896a170 dcache 0xdc6e8c3c dcache low-version 0xffffffff, vcache low-version 0x0 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06aae00 count 0x2 code 4096 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x1000 count 0x1000 code 1000 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x1000 count 0x1000 code 1869f 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc068bbb8 count 0x2 code 99999 
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x1000 resid 0x1000 file length 0x1000 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc068bbb8 count 0x2 code 4096 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x2000 count 0x1000 code 1000 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x2000 count 0x1000 code 1869f 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc062dfe0 count 0x2 code 99999 
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x2000 resid 0x1000 file length 0x2000 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc062dfe0 count 0x2 code 4096 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x3000 count 0x1000 code 1000 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x3000 count 0x1000 code 1869f 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06842c8 count 0x2 code 99999 
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x3000 resid 0x1000 file length 0x3000 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06842c8 count 0x2 code 4096 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x4000 count 0x1000 code 1000 
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x4000 count 0x1000 code 1869f 
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06928a0 count 0x2 code 99999 

[snip] lots of Iupdatepage, Iwrite and Write get repeated here with
increasing pos number, but only marginally increasing time stamp..

time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0xfe000 count 0x1000 code 1869f 
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c98 count 0x2 code 99999 
time 391.037088, pid 26353: Write vp 0xf896a170 off 0xfe000 resid 0x1000 file length 0xfe000 
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c98 count 0x2 code 4096 
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0xff000 count 0x1000 code 1000 
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0xff000 count 0x1000 code 1869f 
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c70 count 0x2 code 99999 
time 391.037088, pid 26353: Write vp 0xf896a170 off 0xff000 resid 0x1000 file length 0xff000 
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c70 count 0x2 code 4096 
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0x100000 count 0x1000 code 1000 
time 391.037088, pid 26353: Close 0xf896a170 flags 0x8001 

This "Close" gives me the impression that up to here, everything is
ok. No much time consumed yet. Now comes the time-consuming part:

time 391.037088, pid 26247: Bkg daemon executing 0x2 request 
time 391.037088, pid 26247: StoreAll vp 0xf896a170 len 0x100000 
time 391.037088, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 391.037088, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 391.057089, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 391.067089, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 391.087090, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 391.097091, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 421.108390, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 441.159258, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 451.179691, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 451.189692, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 451.209693, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 491.251426, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 501.281860, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 521.312727, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 521.332728, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 
time 531.333160, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000 

What is happening there? It takes 140 seconds (= 4min 20sec) of idle
client and idle server. The rest seems to work fast again:

time 531.353161, pid 26247: Analyze RPC op 3 conn 0xd7497ae0 code 0x0 user 0x41d2fb44 
time 531.353161, pid 26247: StoreAllDCache Done for vp 0xf896a170 (returns 0x0) 
time 531.353161, pid 26247: StoreAll Done vp 0xf896a170 length 0x100000 (returns 0x0) 
time 531.353161, pid 26247: Vnode Lock vp 0xf896a170 wait 0x0 excl 0x2 
time 531.353161, pid 26247: Vnode Lock vp 0xf896a170 wait 0x0 readers 0 waiters 0 
time 531.353161, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800 
time 531.353161, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0 
time 531.353161, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405) 
time 531.353161, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800 
time 531.353161, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0 
time 531.353161, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3) 
time 531.353161, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800 
time 531.353161, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0 
time 531.353161, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3) 
time 531.353161, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800 
time 531.353161, pid 26353: Lookup adp 0xf8969a2c name bla fid (1:536870921.2.11), code=0 
time 531.353161, pid 26353: Setattr vp 0xf896a170 len 0x100000 
time 531.353161, pid 26353: RXAFS_StoreStatus vp 0xf896a170 len 0x100000 
time 531.353161, pid 26353: Analyze RPC op 5 conn 0xd7497ae0 code 0x0 user 0x41d2fb44 
time 531.353161, pid 26353: SimpleVStat vp 0xf896a170 old len 0x100000 new len 0x100000 

AFS Trace Dump - Completed
----------------------------------------------------------------------------------------------

Any comments? What is that 0x2 request? What does "StoreProc" do and
why may it take so long?

To exclude any issues with the smp config, I rebooted into
single-processor kernel but the problems remain. 

> -Get a recent tcpdump and dump e.g.
> tcpdump -vv -s 1500 port 7001 and host yourhost

Dont have one handy, will need to get one first.

> -from another machine, run cmdebug yourhost

cmdebug exits without output from the server as well as from the client 

> -if kdump builds successfully (and due to header file fun it may not) you
> can look at state with kdump also. 

kdump was built here from the src.rpm and prints out *lots* of
information, but I cannot interpret a single bit. 

(I had the header fun to get the whole thing compile. Redhat appends
the patchlevel to the kernelversion e.g. 2.2.26-22smp, has a
multi-purpose version.h which in turn include a rhconfig.h, and
MAKEFLAGS=-j2 seems to be asking for trouble...)

> -rxdebug yourhost 7001

Gives me the following (uniprocessor client setup): 

--------------------------------------------------------------
$ rxdebug   iwr07 7001
Trying 129.13.229.77 (port 7001):
Free packets: 130, packet reclaims: 0, calls: 187, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
Connection from host 172.22.5.201, port 7000, Cuid a33ebe91/456014dc
  serial 890,  natMTU 1444, flags pktCksum, security index 2, client conn
  rxkad: level clear, flags pktCksum
  Received 1669 bytes in 20 packets
  Sent 1049024 bytes in 762 packets
    call 0: # 20, state dally, mode: receiving, flags: receive_done
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Connection from host 172.22.5.201, port 7003, Cuid bad41eca/eb445ee8
  serial 11,  natMTU 1444, flags DESTROYED, security index 0, client conn
    call 0: # 7, state dally, mode: receiving, flags: receive_done
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Connection from host 172.22.5.201, port 7000, Cuid bad41eca/eb445eec
  serial 6,  natMTU 1444, flags DESTROYED, security index 0, client conn
    call 0: # 4, state dally, mode: receiving, flags: receive_done
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Connection from host 172.22.5.201, port 7000, Cuid a33ebe91/eb445ef4
  serial 14,  natMTU 1444, security index 0, client conn
    call 0: # 7, state dally, mode: receiving, flags: receive_done
    call 1: # 0, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Done.
$ 
-----------------------------------------------------------------

What does this tells me? 


Yours,
-- 
Martin Schulz                             schulz@iwrmm.math.uni-karlsruhe.de
Uni Karlsruhe, Institut f. wissenschaftliches Rechnen u. math. Modellbildung
Engesser Str. 6, 76128 Karlsruhe