[OpenAFS] IP-based ACL problem with 1.2.13 fileserver? (host cache, GetHostCPS())

Christopher Allen Wing wingc@engin.umich.edu
Tue, 24 Jan 2006 10:41:06 -0500 (EST)


Hi. We are having a permissions problem with pts groups containing IP 
addresses. It looks like this may be a bug in the 1.2.13 fileserver.

Our cell configuration is as follows:

 	- fileservers running openafs 1.2.13 on Solaris 9 (sparc)
 	  We are running the binaries from openafs.org.

 	- database servers running Transarc AFS 3.6 (to be upgraded)

 	- fileserver processes are run with '-hr 1' so that cached IP
 	  address group membership should expire after 1 hour.

The clients are a bunch of PCs (some 32-bit, some 64-bit AMD/Intel)
which are dual booted between Red Hat Enterprise Linux 4 and Windows XP:

 	- on RHEL4, openafs-1.4.1rc2	(compiled myself)
 	- on WinXP, openafs 1.3.87	(binary from openafs.org)

We use both 32-bit and 64-bit RHEL4 on the clients; all machines have 
32-bit WinXP only.

In general, each machine gets booted into both Windows and Linux at 
least once per day. So if there is some code in the fileserver which 
would get confused by this behavior (client UUID tracking?), we may be 
tripping it.


After upgrading our fileservers to OpenAFS (from Transarc AFS 3.6), we
began to have intermittent problems with fileservers not providing
correct access to certain clients. In brief, certain fileservers appear
to not grant access to hosts whose IP addresses are in a pts group that
should be giving them access.

The exact details of the problem are interesting:

 	- only certain fileservers are affected. One fileserver may not
be granting correct permissions to a group of clients, while another one
will grant the correct permissions to all clients.

 	- only certain client IP addresses are affected. The affected IP
addresses do not seem to change from day to day. Most of our clients 
have never been affected.

 	- the affected client IP addresses are confined to a few
networks in a few physical locations. The affected IP addresses are
generally clustered close together, though not strictly consecutive;
for example, we've seen that:

 		141.213.66.174
 		141.213.66.177
 		141.213.66.180
 		141.213.66.181
 		141.213.66.184

are affected, but the hosts in between are not.

 	- a fileserver restart clears up the problem on that particular
fileserver, but the problem has then recurred several days later.

 	- sometimes, the problem will go away spontaneously on a
particular fileserver. After this happened, I didn't see anything
interesting in /usr/afs/logs/FileLog, though.

 	- when a fileserver is not correctly granting access to an IP
address which is a member of one pts group, it may still grant access to
that same IP address which is a member of a different pts group. In
other words, it appears as though the client host's cached group
membership stored in (struct host).hcps is not empty, but somehow
corrupted.
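
As an illustration of how a cached membership list could be non-empty
yet deny one group while still granting another, here is a toy model in
Python. It assumes, purely for illustration, that membership is tested
with a binary search over a list of group IDs that is expected to be
sorted. That is an assumption, not a description of the 1.2.13
fileserver code, and the group IDs are made up.

```python
from bisect import bisect_left

def is_member(cps, group_id):
    """Binary-search membership test; only correct if cps is sorted."""
    i = bisect_left(cps, group_id)
    return i < len(cps) and cps[i] == group_id

# Hypothetical group IDs (pts group IDs are negative numbers).
good_cps = [-310, -204, -202, -101]       # sorted, as the search expects
corrupt_cps = [-310, -202, -204, -101]    # same IDs, two entries swapped

for gid in (-204, -101):
    print(gid, "good:", is_member(good_cps, gid),
          "corrupt:", is_member(corrupt_cps, gid))
```

With the swapped list, the lookup for -204 fails while the lookup for
-101 still succeeds, even though both IDs are present in the list.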


I wrote a test program to exercise the pr_GetHostCPS() function, which
is what the fileserver appears to use to actually obtain the pts group
membership for the IP address of a client host. My test program is here:

 	http://www-personal.engin.umich.edu/~wingc/afs.bug/gethostcps.c


I ran this test program against all of our database servers, and it
seemed to behave properly. In other words, all of the client IP addresses
which are showing this problem are members of the correct pts group
according to pr_GetHostCPS(). Using the '-d' option and a custom
CellServDB file, I can force the test program to contact a particular
database server when it makes its query. I tested all of our database
servers and got the correct result.

The output of 'pts mem' for the pts group is also correct, obviously.



Does anyone have any ideas on how we could go about debugging this 
problem? Is anyone here using 1.2.x fileservers and IP-based ACLs?


To summarize, what we think we are seeing is:

 	- fileserver is not applying correct permissions for an IP
 	  address which belongs to a pts group; it is behaving as though
 	  that IP address does not belong to the pts group in question

 	- we think the problem started when we upgraded fileservers from
 	  Transarc AFS 3.6 to OpenAFS 1.2.13

 	- problem seems to be restricted to only certain client IP
 	  addresses; it does not seem to happen for all clients at random

 	- problem exists for both Linux 1.4.1rc2 and Windows 1.3.87
 	  clients. The clients boot into each OS daily, in case this may
 	  be confusing the fileservers somehow

 	- the problem does not go away after the fileserver timeout
 	  interval for hostcps cache information (running with '-hr 1')

 	- the problem did seem to go away spontaneously on several
 	  fileservers (it went away for all client IP addresses at
 	  once). It then reoccurred on one of the fileservers several
 	  days later.
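
Since only certain addresses are affected, one way to look for a
pattern is to view them as 32-bit integers. The short Python sketch
below does only that arithmetic on the five addresses listed earlier.
It assumes nothing about the fileserver beyond the host being
identified by its IPv4 address.

```python
import ipaddress

# Affected client addresses as listed in the report.
affected = [
    "141.213.66.174",
    "141.213.66.177",
    "141.213.66.180",
    "141.213.66.181",
    "141.213.66.184",
]

def as_int(addr):
    """Return the IPv4 address as an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))

values = [as_int(a) for a in affected]
# Distance from each affected address to the next affected one.
gaps = [b - a for a, b in zip(values, values[1:])]

for addr, value in zip(affected, values):
    print(f"{addr:15s}  0x{value:08X}  last octet {value & 0xFF:08b}")
print("gaps between affected addresses:", gaps)
```

The same listing could be extended with the unaffected neighbours
(.175, .176, and so on) to compare their bit patterns side by side.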


Based on the behavior of the fileservers, my assumption is that the 
following is happening:

 	- the cached "hostcps" information in the fileserver gets
 	  corrupted for some reason

 	- the (corrupted) information does not time out after 1 hour as
 	  it should, but it remains for a long time.
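
The expectation behind '-hr 1' can be written down as a small model: a
per-host entry is reused until it is older than the refresh interval,
then fetched again. The Python sketch below is a toy model of that
expectation only. The class name, the lookup callback and the group ID
are hypothetical, and it is not taken from the fileserver source.

```python
import time

class HostCpsCache:
    """Toy per-host group-membership cache with a refresh interval."""

    def __init__(self, lookup, refresh_seconds=3600, clock=time.time):
        self.lookup = lookup                  # stand-in for a ptserver query
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self.entries = {}                     # host -> (fetched_at, group IDs)

    def get(self, host):
        now = self.clock()
        entry = self.entries.get(host)
        # Refetch when there is no entry or the entry is too old.
        if entry is None or now - entry[0] >= self.refresh_seconds:
            entry = (now, list(self.lookup(host)))
            self.entries[host] = entry
        return entry[1]

# Fake clock so the example does not need to wait an hour.
now = [0.0]
cache = HostCpsCache(lambda host: [-204], clock=lambda: now[0])

host = "141.213.66.174"
cache.entries[host] = (0.0, [])   # simulate a bad entry cached at t=0

now[0] = 1800.0
before = cache.get(host)          # still within the hour: bad entry reused
now[0] = 3600.0
after = cache.get(host)           # interval elapsed: entry is refetched
print(before, after)
```

In this model a bad entry cannot outlive the interval, so a problem
that persists past one hour points either at the refresh not being
triggered for those hosts or at the refetched data being bad again.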

Does this sound possible? The other possibilities I have considered are:

 	- incompatibility between Transarc AFS 3.6 ptserver and OpenAFS
1.2.13 fileserver that results in pr_GetHostCPS() returning invalid
information to the fileservers (although my test program, which calls
the same function, seems to work OK)

 	- client weirdness (either in the Linux 1.4.1rc2 or WinXP 1.3.87
client, or in the dual boot, or some kind of network problem) that
prevents the fileservers from properly applying the correct permissions
for the clients' IP addresses. However, the problem affects only
access via IP-based ACLs; permissions granted to user tokens are OK.



Is there any chance of trying to fix this with the 1.2 fileserver? Or
will we need to upgrade to 1.4 to get anywhere?


Thanks a lot,

Chris Wing
wingc@engin.umich.edu