[OpenAFS] Tokens discarded during large file transfer

Wed, 21 Feb 2007 11:53:42 -0500 (EST)

After some further investigation, it appears that our problem is with a
server running on a 64-bit machine (x86_64):

I can create two volumes: one on a 64-bit server and another on a 32-bit
server (i386). It seems that the problem might be with the cache
write-through to the 64-bit server. The following succeed when writing to
the volume on the 32-bit server, but fail when writing to the volume
hosted on the 64-bit server:

(1)
cp largefile directory-in-afs/
(When directory-in-afs/ is on the volume hosted by the 64-bit server, this
fails after about a minute and a half with a 'Permission denied' error,
and 'sealed data inconsistent' shows up in the system log. When
directory-in-afs/ is on the volume hosted by the 32-bit server, the
command succeeds.)

(2)
find ext2fs-directory/ -type f -exec sha1sum {} \; > \
directory-in-afs/SHA1SUM
(This also fails with the 'Permission denied'/'sealed data inconsistent'
error and discards my tokens, but the time until failure varies. Again,
it fails only on the volume on the 64-bit server and succeeds on the
volume on the 32-bit server.)

I did a third experiment: after the copy to the volume on the 32-bit
server succeeded, I tried moving the volume from the 32-bit server to the
64-bit server. Again, this failed with a 'sealed data inconsistent' error.
The VolserLog on the 32-bit server had the following error:

  Volser: DumpVolume: Rx call failed during dump, error 19270410
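
The error number in that VolserLog line is the same one that
translate_et reports as "(rxk).10 = sealed data inconsistent". As a
minimal sketch of how such a number splits into an error-table base and
an offset, assuming the usual com_err convention that the low 8 bits
hold the offset (split_error_code is a hypothetical helper, not part of
OpenAFS):

```python
# Sketch: split a com_err-style error code into table base and offset.
# Assumption: the low 8 bits are the offset within the error table.
OFFSET_BITS = 8


def split_error_code(code):
    """Return (table_base, offset) for a com_err-style error code."""
    offset = code & ((1 << OFFSET_BITS) - 1)
    base = code - offset
    return base, offset


base, offset = split_error_code(19270410)
print(base, offset)  # offset 10 matches the ".10" in the translate_et output
```

This only separates the two parts; the message text for offset 10 still
comes from translate_et.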

I then tried moving the volume from the 32-bit server to another 32-bit
server, and it completed successfully.

Note: in my previous message I noted the successful test on my home
network. There, I am running only 32-bit architectures, which is
consistent with the results I have outlined above.

Has anyone else experienced these types of problems on the x86_64
architecture? My 64-bit machine is a dual AMD Opteron 256 at 3GHz. I'm
running OpenAFS 1.4.1 on RHEL4 with kernel 2.6.9-42.0.8.ELsmp on all
servers and clients used in these experiments.

Thanks,
Mark

Some answers to previous questions are included below:

> W. Mark Smith <online+lists.afs-info@coffeefreak.net> wrote:
>> Most people who get this error have the problem immediately after
>> they log in. In my case, it consistently happens during large file
>> transfers. The problem occurs when I am copying a large (>1GB) file
>> to an AFS directory. At about the 500MB point in the transfer, I
>> get a "permission denied". I have to unlog then aklog to get my AFS
>> tokens again. Here is the error code that shows up in my message
>> log:
>
> Can you provide the exact syntax of the command you are using to do the
> copy?

A number of commands have the same result:
cp largefile some-directory-in-afs/
dd if=/dev/zero of=some-directory-in-afs/a.tmp bs=1024 count=1000000
rsync -avP some-ext2-directory-with-large-files/ some-directory-in-afs/
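
To see roughly how far a write gets before it fails, the dd test above
could be approximated with a short script. This is only a sketch:
write_until_failure and its parameters are hypothetical names, and
because the cache manager may report a failed store on a later write or
at close, the byte count is an approximation rather than the exact
point of failure.

```python
import os


def write_until_failure(path, total_bytes, chunk_size=1024 * 1024):
    """Write zeros to path in chunks.

    Returns (bytes_handed_to_os, error), where error is None on success
    or the OSError raised by write/flush/close.
    """
    block = b"\0" * chunk_size
    written = 0
    try:
        # The try wraps the with-block so an error at close is caught too.
        with open(path, "wb") as f:
            while written < total_bytes:
                n = min(chunk_size, total_bytes - written)
                f.write(block[:n])
                f.flush()  # push the chunk out of Python's buffer
                written += n
    except OSError as exc:
        return written, exc
    return written, None
```

For example, write_until_failure("some-directory-in-afs/a.tmp",
1000000 * 1024) writes the same amount of data as the dd command above.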

>
> Can you run "id" as well and include it here?

uid=501(wmsmith) gid=10(wheel) groups=34413,36562,10(wheel)

>
> If "id" shows two high-numbered groups, the command is probably running
> inside a PAG.  Maybe try not using a PAG and run the copy?  (It might be
> hard to get a session without a PAG, but an su command should work,
> provided you've disabled any AFS related PAM.)  I suspect the issue
> isn't PAG related though...
>
>> Feb 10 11:40:43 [hostname] kernel: afs: Tokens for user of AFS id
>> [id] for cell [cell] are discarded (rxkad error=19270410)
>>
>> And the corresponding error is:
>>
>> # translate_et 19270410
>> 19270410 (rxk).10 = sealed data inconsistent
>>
>> The server is compiled with large file support. When I do the same
>> thing on my (slower) home network, I do not have this problem, and
>> I can write files larger than 2GB.
>>
>> My configuration is RHEL4, kernel 2.6.9-42.0.8.ELsmp, and I have
>> tried both openafs 1.4.1 and 1.4.2.
>
> On the server?  Or client(s)?
>
> What does rxdebug <server> 7000 -version return?
>

AFS version:  OpenAFS 1.4.1 built  2006-05-19

>> Does anyone have any suggestions?
>
> Is this "home network" test using the exact same client and server?  Or
> just ones with a similar configuration?
>
> Is it possible that there is an actual error in your networking hardware
> that is breaking some of the packets under higher loads?
>
> Can you try forcing the link speed on your client's ethernet adapter to
> 10BASE, 100BASE, 1000BASE (if applicable) and see if the commands
> complete at the slower speed?  Or otherwise verify your duplex speed
> between server and switch and client and switch and any other network
> links in between.  (I'd say to plug your client into the server directly
> using a cross-over, or at least into the same physical switch, just to
> test, if possible.)
>
> Do you happen to have another OS version that you can test with?
> Windows, Solaris, or even another Linux kernel version?  Have you tried
> not using an SMP kernel (just to test)?
>
> <<CDC