[OpenAFS-devel] Possible 1.3.85 fileserver/volserver problem

Hans-Werner Paulsen hans@MPA-Garching.MPG.DE
Mon, 25 Jul 2005 10:46:12 +0200


On Sat, Jul 23, 2005 at 11:00:37AM -0400, Robert Banz wrote:
> I upgraded two (pretty busy) fileservers from 1.3.84 to 1.3.85 last 
> Sunday.  Everything seemed to be working right; however, last night both 
> of them got into the meltdown syndrome where they 'busy' all requests 
> causing much badness to clients that were using them.
> 
> The platform is Solaris 10 amd64, up to current patch.
> 
> Unfortunately, I can't provide much debugging information on this -- it 
> happened at 2am, so I wasn't quite in the mental state for "collecting 
> information".  No out-of-the-ordinary messages were in the fileserver or 
> volserver logs; the only 'out of the ordinary' event that was occurring 
> at the time is that it was well in the middle of our backup window. 
> From what I could tell, .backup snapshot creation had finished about 20 
> minutes before things started to go bad, and it looks like 
> dumping-to-tape had begun.  Could there be any open fileserver/volserver 
> IPC issues?
> 

We had the same problem with i386 Linux 2.6.12 and OpenAFS 1.3.84.
An AFS backup ran on the *.backup volumes, the fileserver was very
busy during this time, and then for about 30 to 45 minutes no access
to any files on this fileserver from any client was possible. But there
were no crashes of the fileserver/volserver and there were no entries in
the logfiles.
We had this problem twice, and both times the AFS backup
was writing the *.backup volumes to tape. (Normally this is done
during the night, but due to a hardware problem with the tape library,
we started the backup by hand during working hours.)

Hans-Werner

-- 
Hans-Werner Paulsen		hans@MPA-Garching.MPG.DE
MPI für Astrophysik		Tel 089-30000-2602
Karl-Schwarzschild-Str. 1	Fax 089-30000-2235	
D-85741 Garching