[OpenAFS-devel] Problem with mounts in AFS on CentOS 7.4 with openafs 1.6.2[01].1

Ragnar Sundblad ragge@csc.kth.se
Fri, 3 Nov 2017 17:29:58 +0100


Hi Mark,

> On 3 Nov 2017, at 15:51, Mark Vitale <mvitale@sinenomine.net> wrote:
>
> Ragge,
>
>> On Nov 3, 2017, at 9:46 AM, Ragnar Sundblad <ragge@csc.kth.se> wrote:
>>
>> We have compute clusters where the nodes have almost everything of
>> their roots in afs; most things in /, such as /etc and /usr, are soft
>> links into a complete os installation in afs. To be able to have some
>> writable files and directories, such as /etc/adjtime or /var/tmp, we
>> bind mount files and directories in the tree which is actually in afs
>> (mainly using the rwtab functionality), and we also have a lustre
>> client that gets mounted in the afs tree.
>>
>> When we upgraded from CentOS 7.3 to 7.4, kernel
>> 3.10.0-693.5.2.el7.x86_64, and using OpenAFS client 1.6.21.1 or
>> 1.6.20.1, when users having home directories in afs log in and start
>> accessing their data, mounts in the afs tree start to get randomly
>> unmounted. In the lustre case, the lustre client nicely reports that
>> it unmounts, so the unmounts seem to be handled in an orderly manner.
>>
>> We have a suspicion this may be related to the problem reported in
>> the thread “getcwd() error for RHEL 7.4 kernel”, and that the kernel
>> for some reason decides that the path to the mount point is no good
>> and unmounts.
>> In addition, when this has started to happen, we are not able to
>> mount anything more into afs; mount returns ENOENT.
>>
>> This is pretty easy to repeat.
> Thank you for your detailed report.
> I have an idea about what this may be, but I will try to duplicate it
> on my test system first.

Thanks for investigating! :-)

>> Our workaround for now is to use the tmpfs based root all the way
>> down to the mount points, and have soft links into afs further down
>> for the rest, which seems to work.
> It’s good that you have a workaround; thank you for sharing that as
> well.
>
>> Please let us know if we can provide any help debugging this.
> For now I would like to see your afsd options, and also the output
> from ‘cmdebug <client> -cache’ for an affected client.

We start it like so:
/bin/chroot /sysimage /usr/vice/etc/afsd -memcache -verbose -nosettime -dynroot -mountdir /afs
(Before systemd is started, we set up the runtime root in /sysimage,
then chroot there, and start systemd to let it bring up the system.)

Here is a cmdebug:
# cmdebug tegner-login-2 -cache
Chunk files:   1562
Stat caches:   2343
Data caches:   1562
Volume caches: 200
Chunk size:    65536
Cache size:    100000 kB
Set time:      no
Cache type:    memory

I now see that I forgot to mention that we use memory cache (since the
nodes are diskless).

> Although you haven’t reported the getcwd() problem, could you please
> confirm if you’ve seen it or not?

We have not seen it, but we haven’t really looked for it either.
Is there some test we could try?
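
One simple check that comes to mind, only as a sketch and not a
confirmed reproducer of the problem in the other thread: chdir into a
directory and see whether getcwd() returns a path or an error. The
function name check_getcwd is made up for illustration, and any /afs
path passed to it would be site specific.

```python
import errno
import os


def check_getcwd(path):
    """chdir into path, call getcwd(), and return (ok, detail).

    detail is the reported working directory on success, or the
    symbolic errno name (for example ENOENT) on failure.
    """
    # Keep a descriptor for the current directory so that we can return
    # to it even if getcwd() itself is what is failing.
    saved = os.open(".", os.O_RDONLY)
    try:
        os.chdir(path)
        try:
            return True, os.getcwd()
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.fchdir(saved)
        os.close(saved)


# Example use (the path is a placeholder, not a real cell):
#   ok, detail = check_getcwd("/afs/example.org/home/someuser")
#   print("getcwd ok" if ok else "getcwd failed", detail)
```

Running this for a few directories in afs, before and after users
start accessing their data, would at least show whether getcwd() ever
fails on the affected nodes.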

> And finally, just to confirm, you have seen bind mounts in /afs
> unmounted at CentOS 7.4 with both OpenAFS 1.6.21.1 and 1.6.20.1, but
> _not_ with CentOS 7.3 and those same OpenAFS client releases - correct?

With 7.3 (kernel 3.10.0-514.26.2.el7.x86_64) we actually used openafs
client 1.6.20.2, but with that combination this mount-within-afs thing
worked just fine.
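
To see exactly when a mount disappears, one could poll
/proc/self/mountinfo and compare it against the mounts that are
expected to be there. The following is a minimal sketch under that
assumption; the helper names (mounts_under, missing_mounts) and the
example paths are hypothetical, and the /afs prefix would have to match
how the paths appear inside the chroot.

```python
def mounts_under(mountinfo_text, prefix="/afs"):
    """Return the mount points at or below prefix.

    mountinfo_text is the content of /proc/self/mountinfo; the fifth
    whitespace-separated field of each line is the mount point.
    """
    prefix = prefix.rstrip("/")
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a well-formed mountinfo line
        # The kernel writes spaces in paths as the octal escape \040.
        mount_point = fields[4].replace("\\040", " ")
        if mount_point == prefix or mount_point.startswith(prefix + "/"):
            found.append(mount_point)
    return found


def missing_mounts(expected, mountinfo_text, prefix="/afs"):
    """Return the expected mount points that are no longer mounted."""
    present = set(mounts_under(mountinfo_text, prefix))
    return sorted(set(expected) - present)


# Example use on a node (the expected paths are placeholders):
#   with open("/proc/self/mountinfo") as f:
#       gone = missing_mounts(["/afs/example.org/os/var/tmp"], f.read())
#   if gone:
#       print("unmounted:", ", ".join(gone))
```

Run from cron or in a loop with timestamps, this could be lined up
against user logins to narrow down what triggers the unmounts.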

Thanks!

/ragge