[OpenAFS] 1.4.1 kaserver issues

Russ Allbery rra@stanford.edu
Thu, 13 Jul 2006 07:40:58 -0700

Not sure that anyone else cares much about this at this point, and we have
a workaround (for right now at least), but throwing it out for the
archives and in case anyone has any thoughts.

We did a VLDB upgrade this morning from a patched OpenAFS 1.2.6 to OpenAFS
1.4.1, and everything went smoothly except for the kaserver (which we
still have to run for another year or two as we get clients transitioned
to K5).  After the upgrade, the kaserver would periodically log:

Thu Jul 13 07:06:33 2006 Ubik: Error reading database file: errno=22

usually in batches of five to ten at a time, and then keep answering
queries.  From the client behavior we were seeing, it looked like whenever
the kaserver logged this error, the client request failed with errors
like "principal unknown".

Platform is Solaris 8, which has:

#define EINVAL  22      /* Invalid argument                     */

I think I traced the likely failing code down to src/ubik/phys.c, in either
uphys_open or uphys_read.  Given the errno value, I'm guessing it's
probably the lseek in uphys_read, but beyond that I'm not at all sure what
would cause this.

We rolled back to the old kaserver binary, which is working fine.  The
only drawback to that is that we were hoping to move the VLDB servers to
Linux prior to turning off K4, and I'm worried this problem will crop up
again with 1.4.1 on Linux.  We can probably rebuild the old binaries, but
that makes me nervous.

The vlserver and ptserver seem to be running fine, and I'm not seeing any
errors like that from them.  My guess is that something changed in shared
code that exposed some problem in kaserver and no one is giving kaserver
love (understandably), but I don't know what could have changed that would
cause this sort of intermittent failure.

If anyone has any thoughts, I'm all ears.

Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>