[OpenAFS] Re: fs process doesn't exit until I send a signal 9

Hartmut Reuter reuter@rzg.mpg.de
Sat, 26 Feb 2005 18:01:17 +0100


Mike Polek wrote:
> On Thu, 24 Feb 2005, Derrick J Brashear wrote:
> 
>   On Thu, 24 Feb 2005, Gabe Castillo wrote:
> 
>   >>   I had to shut down one of my AFS servers to replace a disk. When
>   >> I issue a "bos shutdown" command, all the processes seem to shut
>   >> down, except for the fs process. When I "vos status" the server, it
>   >> says that the fileserver has been disabled, and the sub-status is
>   >> "in the process of shutting down". Is there
> 
> 
>  >Wait while it breaks callbacks. You can watch the status in
>  >/usr/afs/logs/FileLog
> 
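If you would rather follow the log from a script than by eye, here is a
minimal sketch in Python. It is only an illustration: the helper name
read_new_lines is my own invention and not part of OpenAFS, and it makes
no assumption about what the fileserver actually writes to FileLog. It
just returns whatever complete lines were appended since the last call.

```python
import os


def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for text appended to path since offset.

    Hypothetical helper for illustration; not an OpenAFS tool.
    """
    size = os.path.getsize(path)
    if size < offset:
        # The file shrank (truncated or rotated), so start from the top.
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Keep only complete lines; a trailing partial line is left for the
    # next call so it is never reported half-written.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Called in a loop with a short sleep, passing /usr/afs/logs/FileLog as the
path and feeding the returned offset back in, this behaves roughly like
"tail -f" on the log.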
> ---
> For what it's worth, I have servers that have thousands of volumes
> on each partition. (Ok... maybe a poor design choice, but I didn't
> know the single threaded volume server would be an issue when I did
> the design...) After 30 minutes, the bosserver assumes that the
> fileserver isn't going to stop, and does a kill -9 to stop it.
> I'm pretty sure it's just because of the sheer number of volumes
> to unmount.
> 
> 1) Is there an easy way to change the timeout value? I'm not sure
>    yet if it's faster to do the kill -9 one minute into the shutdown
>    and just let the salvager do its thing, or if it's better to
>    let the shutdown take an hour. I can say that it would be helpful
>    to have an emergency procedure that won't corrupt volumes for when
>    the shutdown is triggered by a power failure. :-)

I think it's insane if the shutdown takes that long. There must be a
problem with your clients, perhaps switched-off PCs, so that the callback
breaks have to wait for timeouts. Writing the volume information to disk
should never take that long, even if you have 10000 volumes on a server.
If you have compiled with --enable-fast-restart you can kill your
fileserver after a while (after all active RPCs have finished) and the
only disadvantage at restart may be that the uniquifier is too low.
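
The pattern Mike describes (ask the process to stop, wait for a fixed
grace period, then fall back to kill -9) can be sketched in a few lines
of Python. This is not the bosserver's code. The function name
stop_with_deadline, the use of SIGTERM as the polite signal, and the
grace period being a parameter are all assumptions made for
illustration.

```python
import signal
import subprocess


def stop_with_deadline(proc, grace_seconds, term_signal=signal.SIGTERM):
    """Ask proc to exit, wait up to grace_seconds, then force-kill it.

    proc is a subprocess.Popen object. Returns "stopped" if the process
    went away within the grace period, "killed" if SIGKILL was needed.
    Hypothetical helper for illustration; POSIX only.
    """
    proc.send_signal(term_signal)          # polite request to shut down
    try:
        proc.wait(timeout=grace_seconds)   # give it time to finish cleanly
        return "stopped"
    except subprocess.TimeoutExpired:
        proc.kill()                        # SIGKILL on POSIX, i.e. kill -9
        proc.wait()                        # reap the child
        return "killed"
```

With a short grace period you trade a quick stop for cleanup work at the
next start; with a long one you wait, but the process gets the chance to
finish on its own.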

> 
> 2) I noticed that in the 1.3 branch, the volume server is multi-
>    threaded. (THANK YOU!!!) Does anybody know how this affects
>    shutdown/startup time? Should I still be looking for a way to
>    reduce the number of volumes on a server?

The volserver has nothing to do with the time needed by the fileserver
to shut down. The volserver only does volume operations such as move,
backup or release.


> 
> 3) I've seen references to a "NoSalvage" option. Is that also new
>    in 1.3? or is it some sort of patch? Anybody have a really good
>    way of dealing with lots of volumes on a server? We currently
>    have almost 60T of storage, and it's growing. I like the idea
>    of having things well organized into finite volumes... it works
>    for our setup.

Is your NoSalvage option the same as --enable-fast-restart? If so, I
introduced it to avoid hours of salvaging after a crash. My experience
was that the log almost never contained a real error message. I think
it's better to let the fileserver automatically take a volume off-line
when it detects an inconsistency than to have to wait hours for a restart.

Hartmut
> 
> 
> Any assistance is appreciated.
> 
> Thanks,
> Mike
> 


-- 
-----------------------------------------------------------------
Hartmut Reuter                           e-mail reuter@rzg.mpg.de
                                           phone +49-89-3299-1328
RZG (Rechenzentrum Garching)               fax   +49-89-3299-1301
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------