[OpenAFS-devel] linux45: smoke test failed

Stephan Wiesand stephan.wiesand@desy.de
Fri, 17 Jun 2016 17:30:28 +0200


On Jun 17, 2016, at 04:45 , Benjamin Kaduk wrote:

> On Thu, 16 Jun 2016, Stephan Wiesand wrote:
>
>> I smoke tested what was planned to be OpenAFS 1.6.18.1, as discussed
>> in yesterday's release team meeting, on a Fedora 23 x86_64 VM with
>> kernel 4.5.6-200 today. The result was disappointing:
>>
>> git clone git://gerrit.openafs.org/openafs.git
>
> Is the pwd the root of a volume?

No, everything happens at least one level below.

>> cd openafs
>> git log
>> # scrolled through a few dozen changes, took a couple of seconds
>> git checkout openafs-stable-1_6_18
>>
>> At this point I got the following error:
>>
>> fatal: Unable to read current working directory: No such file or directory
>>
>> A "cd; cd -" cures this for a while, and there's no apparent data
>> corruption. I'm still worried. The problem isn't 100% reproducible, but
>> it doesn't take too many tries checking out random tags or branches.
>>
>> This was plain 1.6.18 + gerrit 12300 12301 12302 12274.
>>
>> Cache is on ext4, no separate partition, default size as set by our RPM
>> (I think 100MB, but I don't have access to the VM right now to check).
>>
>> The small cache size may contribute to the problem. But I found no
>> errors logged anywhere, and this shouldn't happen no matter how small
>> the cache is.
>
> Please check if the cmdebug output is empty (I expect it is, but it is
> good to check).

It is empty.

>> NB we have a user report of exactly this problem happening frequently
>> while just editing files in a local git repo in AFS space. The data is
>> a bit sketchy, but it's probably Ubuntu 14.04 with its current default
>> kernel and the openafs packages from Anders' ppa. I'll try to get us
>> more data.
>>
>> Any thoughts? For the time being I'm considering this a showstopper for
>> 1.6.18.1, and it looks like we're not quite there yet regarding Linux
>> 4.5, let alone 4.6 or the 4.7 due in a few weeks :-(
>
> Can you run the same test on a 4.4 kernel for comparison?

I tried under the last F22 kernel, 4.4.6-200.fc22. And ok, it's not 4.5
specific, though it seems to happen more frequently with 4.5.2 than with
4.4.6.

By chance I found a pretty reliable reproducer:

	cd /vol/ume/root
	mkdir g; cd g
	git clone git://gerrit.openafs.org/openafs.git; sleep 180; git log

Note that there is indeed no "cd openafs". Of course this should complain
about the cwd not being a git repo. But most of the time it will complain
about the cwd issue instead.
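
For anyone who wants to help test, the recipe above could be wrapped in a
small loop that counts how often the cwd error appears. This is only a
rough sketch, assuming a POSIX sh; the function names, the "g1", "g2", ...
directory names and the default try count are made up for illustration,
and whether running it this way triggers the problem as readily as the
interactive recipe is untested.

```shell
#!/bin/sh
# Sketch only: function names and defaults are illustrative, not part of
# OpenAFS or git.

# Classify one git error message. The first pattern is the message quoted
# in this thread; anything else is lumped together as "other".
classify_git_error() {
    case "$1" in
        *"Unable to read current working directory"*) echo cwd-lost ;;
        *) echo other ;;
    esac
}

# Repeat the clone / sleep / "git log" sequence in fresh subdirectories of
# the current directory and count how often the cwd error shows up.
# $1 = number of tries, $2 = seconds to sleep (180 in the recipe above).
run_reproducer() {
    tries=${1:-5}
    delay=${2:-180}
    lost=0
    i=1
    while [ "$i" -le "$tries" ]; do
        mkdir "g$i" || return 1
        # Subshell, so the cd does not leak into the next iteration.
        result=$(
            cd "g$i" || exit 1
            git clone -q git://gerrit.openafs.org/openafs.git || exit 1
            sleep "$delay"
            # No "cd openafs" on purpose: git log is expected to fail here,
            # and only its error message is of interest.
            err=$(git log 2>&1 >/dev/null)
            classify_git_error "$err"
        ) || return 1
        if [ "$result" = cwd-lost ]; then
            lost=$((lost + 1))
        fi
        i=$((i + 1))
    done
    echo "cwd lost in $lost of $tries tries"
}
```

After sourcing the file, something like "run_reproducer 10 180" from a
scratch directory in AFS space would do ten rounds. The classification is
kept separate from the loop so it can be checked without a network or an
AFS client.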

I'm planning to verify that plain 1.6.18 behaves the same on 4.4.6, and
if it does I'll proceed with the 1.6.18.1 release.

I couldn't reproduce this with any EL clients, but those have larger
caches (it's indeed 100 MB on that Fedora VM), so there's more to test.
Help welcome...