[OpenAFS] Preliminary findings on today's brokenness

Ben Carter bhc@pitt.edu
Thu, 14 Jan 2021 10:26:03 -0500


We are running 1.6 code and we definitely have a problem.  In our case 
a sync site is being elected, but running vos examine from a client 
seems to hang.  Actual access to files in AFS seems to be working fine, 
but we have not restarted any file server processes.

Ben

On 1/14/21 10:21 AM, Chaskiel Grundman wrote:
> None of these things is confirmed yet, but based on some analysis and 
> testing Carnegie Mellon has done today:
> 
> - The problem is in RX (the transport layer), not any of the applications
> - It likely affects 1.8.0 and newer, but not 1.6
> - It seems to be triggered by the RX epoch being after the Unix time 
> 0x60000000, aka 1610612736, aka Thu Jan 14 08:25:36 UTC 2021
> 
> 
> So any cache manager and server that has been running since before that 
> time will continue to work until they are restarted. Sites may wish to 
> try and avoid having critical systems reboot or restart until a fix or 
> workaround for this issue is identified.
> 
> If anyone has a system running something 1.8.0 or newer where the command
> vos status afs-01.andrew.cmu.edu -noauth
> 
> succeeds, I'd appreciate knowing about it, as it will change this analysis.
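
For anyone who wants to double-check the arithmetic on the threshold quoted above, here is a minimal Python sketch. It only converts 0x60000000 to decimal and to a UTC timestamp, and compares a process start time against it. The helper name started_after_threshold is hypothetical, and the strict "after" comparison is an assumption taken from the wording of the report, not from the RX source.

```python
from datetime import datetime, timezone

# Threshold quoted in the report: Unix time 0x60000000.
THRESHOLD = 0x60000000


def threshold_utc() -> datetime:
    """Return the threshold as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(THRESHOLD, tz=timezone.utc)


def started_after_threshold(start_time: int) -> bool:
    """Hypothetical helper: True if a process start time (Unix seconds)
    falls after the threshold. Strict '>' is an assumption based on the
    word "after" in the report."""
    return start_time > THRESHOLD


if __name__ == "__main__":
    print(THRESHOLD)                      # decimal value of 0x60000000
    print(threshold_utc().isoformat())    # same instant as a UTC timestamp
```

If the preliminary analysis holds, a cache manager or server started before that instant would be unaffected until its next restart, which is why started_after_threshold takes the process start time and not the current time.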


-- 
Ben Carter
System Engineer/Operations
University of Pittsburgh Information Technology
Office: 412-624-6470
bhc@pitt.edu