[OpenAFS] client problems
Martin Schulz
schulz@iwrmm.math.uni-karlsruhe.de
11 Apr 2001 12:27:51 +0200
Derrick J Brashear <shadow@dementia.org> writes:
> Linux 2.2 SMP AFS seems to have some problems. I've been working on one
> with little luck thus far:
> create a volume
> mount the volume
> set the acl
> set a quota
>
> fs sq never returns, and the afs client subsequently cannot be shut
> down. my assumption is there are locking issues somewhere, but thus far
> nothing reveals that to be the case.
I also occasionally observed that the afs client could not be shut
down, tough I have no problems with the fs, vos, bos etc commands. It
is the writes that are incredibly slow, reads are fast, even after a
restart of the client (The cache should then be empty, isn't it?)
I observed that the /usr/vice/etc/afsd ${OPTIONS} command the start
script apparently never returns. Should it do so? I have put a & behind it
so that it would not block the rest.
In the meanwhile, I poked around with the fstrace dump commands (still
smp client config). This is a dump from writing a 1MB file to afs, with
interspersed comments as I interpret it. Please correct me if I am
wrong.
-------------------------------------------------------------------------
AFS Trace Dump -
Date: Tue Apr 10 16:44:18 2001
Found 1 logs.
Contents of log cmfx:
time 388.796991, pid 0: Tue Apr 10 15:07:16 2001
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 388.796991, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800
time 388.796991, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 388.796991, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 388.796991, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 388.796991, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 388.796991, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 388.796991, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800
time 388.796991, pid 26353: Access vp 0xf8969ffc mode 0x80 len 0x100000
As I interpret it, I started the copy command here..
time 390.997086, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 390.997086, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 390.997086, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 390.997086, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 390.997086, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 390.997086, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 390.997086, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 390.997086, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 390.997086, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 390.997086, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800
time 390.997086, pid 26353: Access vp 0xf8969a2c mode 0xc0 len 0x800
time 390.997086, pid 26353: remove dir 0xf8969a2c name bla
time 391.007087, pid 26353: Analyze RPC op 6 conn 0xd7497ae0 code 0x0 user 0x41d2fb44
time 391.007087, pid 26353: TryToSmush vp 0xf8969ffc len 0x100000
time 391.007087, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 391.007087, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 391.007087, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 391.007087, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 391.007087, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 391.007087, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 391.007087, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 391.007087, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 391.007087, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 391.007087, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800
time 391.007087, pid 26353: Lookup adp 0xf8969a2c name bla fid (1:536870921.2.3), code=2
time 391.007087, pid 26353: Returning code 2 from 19
time 391.007087, pid 26353: Access vp 0xf8969a2c mode 0xc0 len 0x800
time 391.007087, pid 26353: create file in vp 0xf8969a2c name bla mode 0x8180
time 391.017087, pid 26353: Analyze RPC op 7 conn 0xd7497ae0 code 0x0 user 0x41d2fb44
time 391.017087, pid 26353: ProcessFS vp 0xf896a170 old len 0x0 new len 0x0
time 391.017087, pid 26353: Access vp 0xf896a170 mode 0x0 len 0x0
time 391.017087, pid 26353: Open 0xf896a170 flags 0x8241
time 391.017087, pid 26353: Open 0xf896a170 flags 0xf423f
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x0 count 0x1000 code 1869f
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06aae00 count 0x2 code 99999
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x0 resid 0x1000 file length 0x0
time 391.017087, pid 26353: Getdcache vp 0xf896a170 failed to find chunk 0x0
time 391.017087, pid 26353: GetdCache vp 0xf896a170 dcache 0xdc6e8c3c dcache low-version 0xffffffff, vcache low-version 0x0
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06aae00 count 0x2 code 4096
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x1000 count 0x1000 code 1000
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x1000 count 0x1000 code 1869f
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc068bbb8 count 0x2 code 99999
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x1000 resid 0x1000 file length 0x1000
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc068bbb8 count 0x2 code 4096
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x2000 count 0x1000 code 1000
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x2000 count 0x1000 code 1869f
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc062dfe0 count 0x2 code 99999
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x2000 resid 0x1000 file length 0x2000
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc062dfe0 count 0x2 code 4096
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x3000 count 0x1000 code 1000
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x3000 count 0x1000 code 1869f
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06842c8 count 0x2 code 99999
time 391.017087, pid 26353: Write vp 0xf896a170 off 0x3000 resid 0x1000 file length 0x3000
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06842c8 count 0x2 code 4096
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x4000 count 0x1000 code 1000
time 391.017087, pid 26353: Iwrite ip 0xf896a170 pos 0x4000 count 0x1000 code 1869f
time 391.017087, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc06928a0 count 0x2 code 99999
[snip] lots of Iupdatepage, Iwrite and Write get repeated here with
increasing pos number, but only marginally increasing time stamp..
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0xfe000 count 0x1000 code 1869f
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c98 count 0x2 code 99999
time 391.037088, pid 26353: Write vp 0xf896a170 off 0xfe000 resid 0x1000 file length 0xfe000
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c98 count 0x2 code 4096
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0xff000 count 0x1000 code 1000
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0xff000 count 0x1000 code 1869f
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c70 count 0x2 code 99999
time 391.037088, pid 26353: Write vp 0xf896a170 off 0xff000 resid 0x1000 file length 0xff000
time 391.037088, pid 26353: Iupdatepage ip 0xf896a170 pp 0xc07f4c70 count 0x2 code 4096
time 391.037088, pid 26353: Iwrite ip 0xf896a170 pos 0x100000 count 0x1000 code 1000
time 391.037088, pid 26353: Close 0xf896a170 flags 0x8001
This "Close" gives me the impression that up to here, everything is
ok. No much time consumed yet. Now comes the time-consuming part:
time 391.037088, pid 26247: Bkg daemon executing 0x2 request
time 391.037088, pid 26247: StoreAll vp 0xf896a170 len 0x100000
time 391.037088, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 391.037088, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 391.057089, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 391.067089, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 391.087090, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 391.097091, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 421.108390, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 441.159258, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 451.179691, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 451.189692, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 451.209693, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 491.251426, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 501.281860, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 521.312727, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 521.332728, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
time 531.333160, pid 26247: StoreProc vp 0xf896a170 tlen 0x100000 alen 0x10000
What is happening there? It takes 140 seconds (= 4min 20sec) of idle
client and idle server. The rest seems to work fast again:
time 531.353161, pid 26247: Analyze RPC op 3 conn 0xd7497ae0 code 0x0 user 0x41d2fb44
time 531.353161, pid 26247: StoreAllDCache Done for vp 0xf896a170 (returns 0x0)
time 531.353161, pid 26247: StoreAll Done vp 0xf896a170 length 0x100000 (returns 0x0)
time 531.353161, pid 26247: Vnode Lock vp 0xf896a170 wait 0x0 excl 0x2
time 531.353161, pid 26247: Vnode Lock vp 0xf896a170 wait 0x0 readers 0 waiters 0
time 531.353161, pid 26353: Access vp 0xf8969000 mode 0x40 len 0x800
time 531.353161, pid 26353: Lookup adp 0xf8969000 name iwrmm.uni-karlsruhe.de fid (1:536870913.2.405), code=0
time 531.353161, pid 26353: Mount point is to vp 0xf8969174 fid (1:536870913.2.405)
time 531.353161, pid 26353: Access vp 0xf896945c mode 0x40 len 0x800
time 531.353161, pid 26353: Lookup adp 0xf896945c name user fid (1:536870916.2.3), code=0
time 531.353161, pid 26353: Mount point is to vp 0xf89695d0 fid (1:536870916.2.3)
time 531.353161, pid 26353: Access vp 0xf8969744 mode 0x40 len 0x800
time 531.353161, pid 26353: Lookup adp 0xf8969744 name schulz fid (1:536870918.2.3), code=0
time 531.353161, pid 26353: Mount point is to vp 0xf89698b8 fid (1:536870918.2.3)
time 531.353161, pid 26353: Access vp 0xf8969a2c mode 0x40 len 0x800
time 531.353161, pid 26353: Lookup adp 0xf8969a2c name bla fid (1:536870921.2.11), code=0
time 531.353161, pid 26353: Setattr vp 0xf896a170 len 0x100000
time 531.353161, pid 26353: RXAFS_StoreStatus vp 0xf896a170 len 0x100000
time 531.353161, pid 26353: Analyze RPC op 5 conn 0xd7497ae0 code 0x0 user 0x41d2fb44
time 531.353161, pid 26353: SimpleVStat vp 0xf896a170 old len 0x100000 new len 0x100000
AFS Trace Dump - Completed
----------------------------------------------------------------------------------------------
Any comments? What is that 0x2 request? What does "StoreProc" do and
why may it take so long?
To exclude any issues with the smp config, I rebooted into
single-processor kernel but the problems remain.
> -Get a recent tcpdump and dump e.g.
> tcpdump -vv -s 1500 port 7001 and host yourhost
Dont have one handy, will need to get one first.
> -from another machine, run cmdebug yourhost
cmdebug exits without output from the server as well as from the client
> -if kdump builds successfully (and due to header file fun it may not) you
> can look at state with kdump also.
kdump was built here from the src.rpm and prints out *lots* of
information, but I cannot interpret a single bit.
(I had the header fun to get the whole thing compile. Redhat appends
the patchlevel to the kernelversion e.g. 2.2.26-22smp, has a
multi-purpose version.h which in turn include a rhconfig.h, and
MAKEFLAGS=-j2 seems to be asking for trouble...)
> -rxdebug yourhost 7001
Gives me the following (uniprocessor client setup):
--------------------------------------------------------------
$ rxdebug iwr07 7001
Trying 129.13.229.77 (port 7001):
Free packets: 130, packet reclaims: 0, calls: 187, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
Connection from host 172.22.5.201, port 7000, Cuid a33ebe91/456014dc
serial 890, natMTU 1444, flags pktCksum, security index 2, client conn
rxkad: level clear, flags pktCksum
Received 1669 bytes in 20 packets
Sent 1049024 bytes in 762 packets
call 0: # 20, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Connection from host 172.22.5.201, port 7003, Cuid bad41eca/eb445ee8
serial 11, natMTU 1444, flags DESTROYED, security index 0, client conn
call 0: # 7, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Connection from host 172.22.5.201, port 7000, Cuid bad41eca/eb445eec
serial 6, natMTU 1444, flags DESTROYED, security index 0, client conn
call 0: # 4, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Connection from host 172.22.5.201, port 7000, Cuid a33ebe91/eb445ef4
serial 14, natMTU 1444, security index 0, client conn
call 0: # 7, state dally, mode: receiving, flags: receive_done
call 1: # 0, state not initialized
call 2: # 0, state not initialized
call 3: # 0, state not initialized
Done.
$
-----------------------------------------------------------------
What does this tells me?
Yours,
--
Martin Schulz schulz@iwrmm.math.uni-karlsruhe.de
Uni Karlsruhe, Institut f. wissenschaftliches Rechnen u. math. Modellbildung
Engesser Str. 6, 76128 Karlsruhe