[OpenAFS-devel] Re: [OpenAFS] Volume root corruptions - anybody seen those?
Hartmut Reuter
reuter@rzg.mpg.de
Thu, 05 Jun 2008 14:27:10 +0200
Sorry, I was too fast: the line
> + if (volumes[i].volumeID == volume)
must be
> + if (volumes[i].volumeID == command.volume)
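
For readers who want to try the idea outside the fileserver, here is a
minimal standalone sketch of the check the patch quoted below adds: a
second handler asking for a volume that is already offline is refused.
It is not the fssync.c code. The single shared table, the owner field,
the names take_offline/put_online, the SKETCH_* return codes and the
table size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAXOFFLINEVOLUMES 30    /* assumed size, not taken from fssync.c */

/* Hypothetical stand-ins for the FSYNC_OK / FSYNC_DENIED return codes. */
enum { SKETCH_OK = 0, SKETCH_DENIED = 1 };

struct offline_slot {
    unsigned int volumeID;      /* 0 marks a free slot */
    int owner;                  /* hypothetical id of the handler holding it */
};

/* Assumption: one table shared by all handlers (salvager, volserver). */
struct offline_slot volumes[MAXOFFLINEVOLUMES];

/* Record that `handler` wants `volume` offline, unless someone else has it. */
int take_offline(int handler, unsigned int volume)
{
    struct offline_slot *free_slot = NULL;
    size_t i;

    assert(volume != 0);        /* 0 is reserved for "slot unused" */
    for (i = 0; i < MAXOFFLINEVOLUMES; i++) {
        if (volumes[i].volumeID == volume) {
            /* Asking again is fine for the owner; anyone else is refused. */
            return volumes[i].owner == handler ? SKETCH_OK : SKETCH_DENIED;
        }
        if (volumes[i].volumeID == 0 && free_slot == NULL)
            free_slot = &volumes[i];    /* remember the first free slot */
    }
    if (free_slot == NULL)
        return SKETCH_DENIED;   /* table full */
    free_slot->volumeID = volume;
    free_slot->owner = handler;
    return SKETCH_OK;
}

/* Release the volume; only the handler that took it offline may do so. */
int put_online(int handler, unsigned int volume)
{
    size_t i;

    for (i = 0; i < MAXOFFLINEVOLUMES; i++) {
        if (volumes[i].volumeID == volume) {
            if (volumes[i].owner != handler)
                return SKETCH_DENIED;
            volumes[i].volumeID = 0;
            return SKETCH_OK;
        }
    }
    return SKETCH_DENIED;       /* volume was not offline */
}
```

In this sketch the owner check in put_online also covers the second
half of the scenario: a handler that never got the volume cannot put it
back online while another one is still working on it.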
Hartmut Reuter wrote:
>
> Small patch to make sure salvager and volserver cannot get the same
> volume from the fileserver in parallel:
>
>
> --- /afs/ipp/.cs/openafs/openafs-1.4.7-osd/src/vol/fssync.c 2008-05-06 14:42:29.000000000 +0200
> +++ ./fssync.c 2008-06-05 14:13:52.000000000 +0200
> @@ -591,8 +591,19 @@
>      case FSYNC_OFF:
>      case FSYNC_NEEDVOLUME:{
>          leaveonline = 0;
> -        /* not already offline, we need to find a slot for newly offline volume */
>          if (!v) {
> +            /* not already offline by this handler */
> +            /* Check that no other handler has it offline */
> +            int found = 0;
> +            for (i = 0; i < MAXOFFLINEVOLUMES; i++) {
> +                if (volumes[i].volumeID == volume)
> +                    found = 1;
> +            }
> +            if (found) {
> +                rc = FSYNC_DENIED;
> +                break;
> +            }
> +            /* Find a slot for newly offline volume */
>              for (i = 0; i < MAXOFFLINEVOLUMES; i++) {
>                  if (volumes[i].volumeID == 0) {
>                      v = &volumes[i];
>
>
> Rainer Toebbicke wrote:
>
>> Sorry, I started this in openafs-info a few days ago, but here's
>> probably the better place:
>>
>> the problem: "salvager -part xxx -vol NNN" corrupts volumes
>> (BTW, even with -nowrite!)
>>
>> the scenario:
>>
>> 1. the volserver takes a volume offline, the salvager takes the same
>> volume offline without knowing about the volserver. Possible, since
>> the partition lock only protects the VAttachVolume, and the "offline"
>> works (of course) even if the volume is already offline.
>>
>> 2. the salvager and the volserver work on the same data, first
>> opportunity for a thorough mess.
>>
>> 3. assuming they didn't, e.g. salvager -nowrite, the salvager finishes
>> and puts the volume back online while the volserver is still running.
>> There is no "reference count" on offline/online. The fileserver sees
>> this, the user jumps onto it, and we have a second opportunity to blow
>> everything apart.
>>
>>
>> I see a number of alternatives,
>>
>> 1. never use "salvager -vol", always shut down the volserver first
>> (a challenge, since bosserver restarts it), or try gymnastics with
>> "vos lock", "vos offline", etc... all of which looks pretty cumbersome.
>>
>> 2. salvage a single volume within the volserver, triggered by a "vos
>> salvage" command; probably most logical, analogous to "vos zap" et al.
>> I haven't checked yet if the code can be shared without problems,
>> looks like some non-trivial Makefile and #ifdef gymnastics to get
>> right. And of course, it changes the documented commands, with changes
>> in bos and vos.
>>
>> 3. have the salvager create a transaction in the volserver, which
>> takes care of offline/online. Would have to care about deadlocks, as
>> the salvager doesn't just salvage the volume itself but also the RW
>> parent in case of a RO or BK. It would also have to handle the (usual)
>> case that the volume is so damaged that it cannot be attached. But
>> it's minimal intrusion otherwise, no external changes.
>>
>>
>> Before I go and implement 2. or 3., please give me your thoughts about
>> the scenario if you care (I may have missed something and would have
>> to look for another culprit - Derrick hinted at "-orphans attach",
>> which I do not fully understand yet), and whether 2. is preferable
>> over 3.
>>
>>
>> Cheers, Rainer
>>
>>
>
>
--
-----------------------------------------------------------------
Hartmut Reuter e-mail reuter@rzg.mpg.de
phone +49-89-3299-1328
fax +49-89-3299-1301
RZG (Rechenzentrum Garching) web http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------