[OpenAFS] Help: OpenAFS suddenly completely stopped working

Kendrick Hernandez kendrick.hernandez@umbc.edu
Thu, 14 Jan 2021 08:32:40 -0500



We're seeing a similar issue. We recently migrated all of our
dafileservers to 1.8.6 (the three database servers are still on 1.6.24).
We're running CentOS 7.9 (kernel 3.10.0-1160.2.2), and these are all VMs
on VMware.

The db servers appear to be okay (vos listvldb works, udebug shows recovery
state 1f), and the fileservers still *seem* to be serving content (could be
cached), but a 'vos partinfo localhost -localauth' returns:

Could not fetch the list of partitions from the server
Possible communication failure
Error in vos listpart command.
Possible communication failure


even though the underlying storage is attached, and 'find /vicepa -ls' can
traverse the vice mount and hasn't returned any errors.

I restarted the afs processes on one server, and post restart I'm seeing
the following in FileLog:

> Thu Jan 14 07:17:34 2021 File server has terminated normally at Thu Jan 14
> 07:17:34 2021
> Thu Jan 14 07:17:34 2021 File server starting (/usr/afs/bin/dafileserver
> -L -p 256 -vattachpar 8 -vc 32768 -s 10000 -l 20000 -hr 1 -cb 5000000
> -nobusy -udpsize 524288 -rxpck 800 -b 16000)
> Thu Jan 14 07:19:54 2021 VL_RegisterAddrs rpc failed; will retry
> periodically (code=5377, err=0)
> Thu Jan 14 07:24:35 2021 File server starting (/usr/afs/bin/dafileserver
> -L -p 256 -vattachpar 8 -vc 32768 -s 10000 -l 20000 -hr 1 -cb 5000000
> -nobusy -udpsize 524288 -rxpck 800 -b 16000)
> Thu Jan 14 07:26:55 2021 VL_RegisterAddrs rpc failed; will retry
> periodically (code=-1, err=0)
> Thu Jan 14 07:30:25 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 07:32:40 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 07:34:55 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 07:37:10 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
>
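
As a rough way to read the retry cadence out of the FileLog excerpt above,
here is a small Python sketch that computes the gaps between consecutive
"Couldn't get CPS" messages. The helper name retry_gaps and the sample list
are illustrative only (not part of OpenAFS); the sample lines are the ones
quoted above, unwrapped. It assumes every line starts with a ctime-style
timestamp and that month/day names are in English.

```python
import re
from datetime import datetime

# ctime-style prefix as seen in the FileLog excerpt, e.g. "Thu Jan 14 07:30:25 2021"
STAMP = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) (.*)$")


def retry_gaps(lines, needle):
    """Return the seconds between consecutive log lines whose message contains `needle`."""
    times = []
    for line in lines:
        m = STAMP.match(line)
        if m and needle in m.group(2):
            times.append(datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Lines copied from the FileLog excerpt above (unwrapped, quote markers removed).
sample = [
    "Thu Jan 14 07:26:55 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)",
    "Thu Jan 14 07:30:25 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 07:32:40 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 07:34:55 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
    "Thu Jan 14 07:37:10 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.",
]

print(retry_gaps(sample, "Couldn't get CPS"))  # [135, 135, 135]
```

On this excerpt the gaps come out at 135 seconds rather than the 30 the
message mentions, which might mean each attempt blocks for well over a
minute before failing (a timeout rather than an immediate rejection). That
is only an inference from the timestamps, not something the log states.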

The dasalvager process keeps exiting (exit code 1), and SalsrvLog shows:

> Thu Jan 14 08:22:57 2021 @(#)OpenAFS 1.8.6 2020-07-15
> root@c7-nukeable1.core.umbc.edu
> Thu Jan 14 08:22:57 2021 Starting OpenAFS Online Salvage Server 2.4
> (/usr/afs/bin/salvageserver)
> Thu Jan 14 08:23:43 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:23:59 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:24:23 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:24:55 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> Thu Jan 14 08:25:35 2021 SYNC_connect: temporary failure on circuit
> 'FSSYNC' (will retry)
> SYNC_connect failed (giving up!): Connection refused
> Thu Jan 14 08:26:23 2021 Unable to connect to file server; aborted
>

I'm really at a loss as to what else to look for.

Best regards,
k-


On Thu, Jan 14, 2021 at 7:45 AM Valtteri Vuorikoski <vuori@notcom.org>
wrote:

>
> I have a small OpenAFS 1.8.6 setup using the Debian and Ubuntu packages.
> Last night everything was working fine, this morning machines were
> timing out trying to talk to volume servers. Database replication was
> also stuck.
>
> While there is a single backup database and file server, databases and
> volumes are primarily on a single server. I logged in to that server
> ("afs1"), made it the only machine in the cell by editing client and
> server CellServDB and set out trying to restore things.
>
> afs1 is running Debian bullseye. Kernel 5.8 (running at the time when
> things broke) and 5.10 result in an equally non-functional system. There
> are no iptables rules on the system.
>
> OpenAFS is almost 100% dead for no apparent reason:
>
> - "pts listentries" and "vos listvldb localhost" work. udebug shows both
>   servers in recovery state 1f, site is sync site and there are no
>   replicas (as expected at this point).
>
> - After restarting services, vos status -localauth -server localhost
>   prints the following:
>
> Could not access status information about the server
> Possible communication failure
> Error in vos status command.
> Possible communication failure
>
> - After a while, vos status no longer prints anything, just hangs. All
>   AFS client access times out.
>
> - There is mostly nothing in the logs. Starting
>   vlserver/ptserver/dafileserver with -d 125 doesn't lead to any extra
>   output. Nothing out of the ordinary (except AFS client errors) appears
>   in dmesg or journalctl -b. After starting dafileserver -L, the
>   following log appears:
>
> Thu Jan 14 11:59:54 2021 File server starting
> (/usr/lib/openafs/dafileserver -L)
> Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry
> periodically (code=5376, err=0)
> Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
> Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in
> 30 seconds; code=-1.
>  [the last message keeps repeating]
>
> - dasalvager appears to run successfully. I'm currently running a
>   voldump to recover data and it's running fine so far. There is plenty
>   of disk space.
>
> - Kerberos appears to be working. kinit works, aklog works, pts/vos
>   commands without -localauth work when a superuser token is present.
>   KDC (Samba) doesn't show any problems related to the afs principal.
>   Clocks are accurate.
>
> - Rebooting the whole system (a qemu VM) makes no difference.
>
> After four hours of debugging, I'm at the end of my wits. Even
> temporarily removing all databases, restarting ptserver and vlserver and
> touching NoAuth won't make fileserver/volserver happy. It seems like RX
> communication is failing somehow, but I have no idea why.
>
> Any ideas what's going on here?
>
>  -Valtteri
>


-- 
Kendrick Hernandez
*UNIX Systems Administrator*
Division of Information Technology
University of Maryland, Baltimore County
