[OpenAFS] Re: busy .backup volumes

Kristen J. Webb kwebb@teradactyl.com
Thu, 02 May 2013 11:19:41 -0600


Since a bos restart new vos dumps were initiated
and killed with the resulting .backup volumes remaining
in the busy state...forever....

(names and IP address have been changed)

# vos ex 1953047510
**** Volume 1953047510 is busy ****
# vos ex 1953054615
**** Volume 1953054615 is busy ****

I've omitted the volumes from backup so I can control the
order of things better next time.  Is there anything else
useful I can get before restarting the server?

vos status server:

Total transactions: 2
--------------------------------------
transaction: 4018  created: Sun Apr 28 12:08:57 2013
lastActiveTime: Thu May  2 12:25:08 2013
attachFlags:  busy
volume: 1953047510  partition: /vicepc  procedure: Dump
packetRead: 1  lastReceiveTime: Sun Apr 28 12:17:07 2013
packetSend: 1  lastSendTime: Sun Apr 28 12:24:07 2013
--------------------------------------

--------------------------------------
transaction: 25  created: Sun Apr 28 04:29:46 2013
lastActiveTime: Thu May  2 12:25:08 2013
attachFlags:  busy
volume: 1953054615  partition: /vicepc  procedure: Dump
packetRead: 1  lastReceiveTime: Sun Apr 28 12:18:13 2013
packetSend: 1  lastSendTime: Sun Apr 28 12:24:36 2013
--------------------------------------

NOTE: lastActiveTime: changes on both transactions
each time I run vos status.

# rxdebug malvirto 7005 -rxstats -peers
Trying 10.10.10.10 (port 7005):
Free packets: 302, packet reclaims: 0, calls: 55729, used FDs: 14
not waiting for packets.
0 calls waiting for a thread
9 threads are idle
0 calls have waited for a thread
rx stats: free packets 302, allocs 7826045, alloc-failures(rcv 0/0,send 0/0,ack 0)
    greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 0, selects 
0, sendSelects 0
    packets read: data 55734 ack 3558763 busy 0 abort 0 ackall 0 challenge 0 
response 12551 debug 23 params 0 unused 0 unused 0 unused 0 version 0
    other read counters: data 55731, ack 3558763, dup 0 spurious 0 dally 0
    packets sent: data 7044345 ack 3507 busy 3 abort 57 ackall 0 challenge 12552 
response 0 debug 0 params 0 unused 0 unused 0 unused 0 version 0
    other send counters: ack 3507, data 14088690 (not resends), resends 78, 
pushed 0, acked&ignored 11259158
    	(these should be small) sendFailed 0, fatalErrors 0
    Average rtt is 0.001, with 35827501 samples
    Minimum rtt is 0.000, maximum is 5.777
    2 server connections, 0 client connections, 2 peer structs, 1535 call 
structs, 1533 free call structs
Connection from host 10.10.10.11, port 59580, Cuid ba45de02/36cab25c
   serial 188825,  natMTU 1444, flags pktCksum, security index 2, server conn
   rxkad: level clear, flags authenticated pktCksum, expires in 813181.2 hours
   Received 28 bytes in 2 packets
   Sent 267029284 bytes in 188581 packets
     call 0: # 2, state active, mode: error
     call 1: # 0, state not initialized
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
Connection from host 10.10.10.11, port 33564, Cuid 9fac4ade/afb3f28
   serial 78284,  natMTU 1444, flags pktCksum, security index 2, server conn
   rxkad: level clear, flags authenticated pktCksum, expires in 813181.2 hours
   Received 28 bytes in 2 packets
   Sent 110840236 bytes in 78278 packets
     call 0: # 2, state active, mode: error
     call 1: # 0, state not initialized
     call 2: # 0, state not initialized
     call 3: # 0, state not initialized
Done.
Peer at host 10.10.10.11, port 33564
	ifMTU 1444	natMTU 1444	maxMTU 1444
	packets sent 156556	packet resends 0
	bytes sent high 0 low 110840437
	bytes received high 0 low 1488749
	rtt 26 msec, rtt_dev 38 msec
	timeout 0.000 sec
Peer at host 10.10.10.11, port 59580
	ifMTU 1444	natMTU 1444	maxMTU 1444
	packets sent 377162	packet resends 0
	bytes sent high 0 low 267038291
	bytes received high 0 low 3596654
	rtt 25 msec, rtt_dev 38 msec
	timeout 0.000 sec
Done.


On 4/22/13 10:11 AM, Kristen J. Webb wrote:
> Hi Andrew,
> The filer was restarted early this a.m.  We've seen
> this come up semi-frequently in the past 6 months,
> so I expect to see it again.
>
> As for the timestamp from vos status fileserver,
> after I ran entrans, it increased continuously.
> I cannot say if it was unchanging before I ran
> entrans.  I'll try and gather this on the next
> volume.
>
> I'll also see what I can do about generating
> any traces to determine if this is a some
> sort of Rx timeout, RPC hang, an overcount
> error, or something else.
>
> Thanks for all the help everyone!
> Kris
>
> On 4/20/13 5:01 PM, Andrew Deason wrote:
>> On Fri, 19 Apr 2013 18:51:12 -0600
>> "Kristen J. Webb" <kwebb@teradactyl.com> wrote:
>>
>>> I can say for sure that server that issued the vos dump has been
>>> rebooted since the transaction started.  The other thing I am
>>> observing is that repeated vos status on the fileserver shows the
>>> lastActiveTime as current (increasing).
>>
>> It would be set to the current time when you ran the 'vos endtrans'
>> command. I assume you just saw it increase once, and not increasing
>> constantly.
>>
>> So again, everything you've said suggests there is an RPC holding a
>> reference to that transaction, which is why it's not going away. So
>> either:
>>
>>   - The Rx call is still alive. Even if the client is gone, for some
>>     reason the Rx call is not dying (i.e. some bug in Rx; a timer not
>>     going off or something).
>>
>>   - The Rx call died, but the RPC is still running. Maybe the volser RPC
>>     is hanging on some lock or some other thing.
>>
>>   - There is no RPC still running, but the transaction still says someone
>>     is using it. We have a bug with a reference overcount on that
>>     transaction.
>>
>> The only way to know which it is is to look at a stack trace or core of
>> the volserver process, or maybe 'rxdebug' would show a stuck call if the
>> problem is a stuck Rx call.
>>
>> There have been some bugs in the past with the volserver not accessing
>> transactions from multiple threads correctly (not locking things right).
>> It could be something like that (though I don't think I've seen that
>> _specific_ manifestation), or it could be something else entirely.
>>
>

-- 
This message is NOT encrypted
--------------------------------
Mr. Kristen J. Webb
Chief Technology Officer
Teradactyl LLC.
2450 Baylor Dr. S.E.
Albuquerque, New Mexico 87106
Phone: 1-505-338-6000
Email: kwebb@teradactyl.com
Web: http://www.teradactyl.com

Providers of Scalable Backup Solutions
    for Unique Data Environments

--------------------------------
NOTICE TO RECIPIENTS: Any information contained in or attached to this message 
is intended solely for the use of the intended recipient(s). If you are not the 
intended recipient of this transmittal, you are hereby notified that you 
received this transmittal in error, and we request that you please delete and 
destroy all copies and attachments in your possession, notify the sender that 
you have received this communication in error, and note that any review or 
dissemination of, or the taking of any action in reliance on, this communication 
is expressly prohibited.


Regular internet e-mail transmission cannot be guaranteed to be secure or 
error-free. Therefore, we do not represent that this information is complete or 
accurate, and it should not be relied upon as such. If you prefer to communicate 
with Teradactyl LLC. using secure (i.e., encrypted and/or digitally signed) 
e-mail transmission, please notify the sender. Otherwise, you will be deemed to 
have consented to communicate with Teradactyl via regular internet e-mail 
transmission. Please note that Teradactyl reserves the right to intercept, 
monitor, and retain all e-mail messages (including secure e-mail messages) sent 
to or from its systems as permitted by applicable law.