[OpenAFS] Mount point weirdness: fs lsm X, fs lq X return different volumes for same mount point.

Jeffrey Altman jaltman@secure-endpoints.com
Fri, 03 Oct 2008 17:08:04 -0400


What I am parsing out of this is the following:

1. The mount point target strings were correct as viewed from all of the
cache managers.

2. The FID reported by "fs examine <mountpoint>" was the wrong volume ID
on all cache managers.

3. vos examine <volume-name> reported the correct volume ID

4. vos examine <volume-id> reported the correct volume name

5. fs checkvolumes, which resets all of the volume location information
in the cache manager, did not fix the problem.
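
The check behind items 1 and 2 can be scripted. What follows is only a minimal sketch of the "fs lsm" versus "fs lq" comparison from the original report quoted below. It assumes "fs lsmount" prints a line of the form "'<path>' is a mount point for volume '#<name>'" and that "fs listquota" prints a "Volume Name" header followed by one data row; both formats are assumptions and should be checked against your client version. The function names are hypothetical, and the sketch works on captured output text rather than running the commands itself.

```python
import re


def base_name(volume):
    """Strip a .readonly or .backup suffix so clones compare equal to their parent."""
    for suffix in (".readonly", ".backup"):
        if volume.endswith(suffix):
            return volume[: -len(suffix)]
    return volume


def mount_target(lsmount_output):
    """Return the volume name from assumed `fs lsmount` output (cell prefix dropped)."""
    m = re.search(r"mount point for volume '[#%](?:[^:']+:)?([^']+)'", lsmount_output)
    if m is None:
        raise ValueError("unrecognized fs lsmount output")
    return m.group(1)


def accessed_volume(listquota_output):
    """Return the volume name from the first data row of assumed `fs listquota` output."""
    for line in listquota_output.splitlines():
        if line.strip() and not line.startswith("Volume Name"):
            return line.split()[0]
    raise ValueError("unrecognized fs listquota output")


def is_cross_mounted(lsmount_output, listquota_output):
    """True when the volume named in the mount point is not the volume actually reached."""
    named = base_name(mount_target(lsmount_output))
    actual = base_name(accessed_volume(listquota_output))
    return named != actual
```

The suffix stripping is there because a mount point for volumeA may legitimately resolve to volumeA.readonly, which is not the failure described here.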

This seems to imply that ubik_VL_GetEntryByNameN() was succeeding (vos
examine) but VL_GetEntryByNameU() was failing.  This could indicate that
different VL servers were being contacted by vos vs the cache manager
and one of the VLDB instances was corrupt.

There are two key items to confirm.  The first is that the FID for the
target volume is in fact wrong, which I suspect it must be since you have
so many different client versions affected simultaneously.  The second is
to use a network monitor to confirm which VLDB instance has bad data.
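
Once the per-server answers have been pulled out of a capture, spotting the odd one out is simple. This is a rough sketch only: it assumes you have already recorded, by whatever means, which volume ID each VL server returned for one volume name. The server names, the IDs, and the function name are hypothetical, and the majority answer is not guaranteed to be the correct one.

```python
from collections import Counter


def disagreeing_servers(answers):
    """Return the servers whose answer differs from the most common answer.

    answers maps a VL server name to the volume ID it returned for a single
    volume name. How those answers were collected is outside this sketch.
    """
    if not answers:
        return {}
    # The most common ID is treated as the reference; this is an assumption.
    majority_id, _count = Counter(answers.values()).most_common(1)[0]
    return {server: vid for server, vid in answers.items() if vid != majority_id}
```

An empty result means every server gave the same answer, which would point away from a single corrupt VLDB instance.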

Jeffrey Altman
Secure Endpoints Inc.

Kim Kimball wrote:
> Jeffrey Altman wrote:
>> Questions that pop into mind:
>> 1) what versions of the clients were involved?
> Multiple, Solaris, Linux, Windows, Macintosh.
>> 2) what was the output of vos examine on the volume names?
> Correct output -- using volume name, returned correct numeric ID and on
> line status.  Using number, returned correct name and on line status.
>> 3) same problem after a cache manager shutdown and restart?
> Not attempted.  Issue resolved within five minutes of fix -- first fix
> was to dump/restore affected volumes to force new volIDs.  This worked.
> All clients fine after fix, simultaneously, so don't think restart would
> have helped.
> fs checkv was used after each fix effort, along with lots of fs checkv
> just for good measure.
>> 4) same problem from Unix and Windows clients?
> Yes.
> Thanks!
> Kim
>> Jeffrey Altman
>> Kim Kimball wrote:
>>> Had a weird one on Thursday, and am looking for any plausible
>>> explanation so I can close out the incident report.
>>> My best answer right now is NAFC (not an effing clue.)
>>> I'm using "X-mounted" to describe "volume named in mountpoint" not equal
>>> to "volume accessed at mountpoint"
>>> Probably relevant:  We were moving volumes to clear a file server, and
>>> noticed an unusual number of orphaned volumes.
>>> When I went to start 'vos zapping' the orphans, many of them  turned out
>>> to be those that incorrectly showed up at a given mount point.
>>> Could it be that the 'vos move' failures that created the orphans are
>>> the proximate cause of the X-mounts?  If so, how could the two be
>>> related?
>>> Any FC greatly appreciated.
>>> Kim
>>> ====================================
>>> Synopsis:
>>> From any AFS client, the volume named in a mount point was not the
>>> volume actually accessed
>>> Initial symptom:
>>>       web servers start puking when invoking perl modules
>>>       cd to path where perl modules are expected, and instead of perl
>>> modules see bunch of unrelated png libraries
>>>       check mount point to volume containing perl modules, and mount
>>> point correctly names perl volume
>>>       fs lq on mount point returns name of volume containing png
>>> libraries -- not the name of the volume specified in fs lsm
>>> The diagnostic:
>>>    fs lsm <path/mountpoint>   --> volumeA
>>>    fs lq   <path/mountpoint>   --> volumeZ
>>> Confirmation:
>>>    cd <path/mountpoint>
>>>    ls
>>>            ----- returns list of files/directories stored in volumeZ
>>> The mount point is correct; that is, fs lsm returns the expected volume
>>> name.
>>> The volume accessed at the mount point is incorrect.
>>> The files/directories in the incorrectly accessed volume are correct.
>>> -------------------------------
>>> We turned up forty plus instances of  X-mounted (for lack of a better
>>> word) volumes.
>>> The fix:
>>>    remove the mount point
>>>    release the volume (containing mount point)
>>>    create same mount point
>>>    release volume again
>>>    vos addsite newserver newpart _mounted_ volume (as named in mount
>>> point)
>>>    vos release _mounted_ volume
>>>      fs checkv
>>> Then get expected responses.
>>>        fs lsm <path/mountpoint>   --> volumeA
>>>        fs lq   <path/mountpoint>   --> volumeA
>>> ========================
>>> Other efforts:
>>> I did restart the fs instances on all file servers, suspecting some sort
>>> of off-by-one'ish glitch in some unknown index/table/?
>>> The restarts had no impact.
>>> 'vos move' of the volume containing the mount point did not help.
>>> -----------------
>>> _______________________________________________
>>> OpenAFS-info mailing list
>>> OpenAFS-info@openafs.org
>>> https://lists.openafs.org/mailman/listinfo/openafs-info