[OpenAFS] amd64_linux24 AFS 1.2.11 cache problem

Kenneth Gole kengole@cadence.com
Tue, 30 Mar 2004 10:58:51 -0500


We're running OpenAFS 1.2.11 on our Opteron machines (amd64_linux24),
stock RedHat Advanced Server 3. We built OpenAFS directly from the
source rpm. In general, this setup works great. However, when we try to
write large files to AFS, the machine crashes with no messages or
diagnostics; the kernel simply halts. If we write the same file to local
disk or NFS, it works fine. The crash seems to happen when the file
exceeds 70-80% of our /usr/vice/cache size (we run a 1 gigabyte cache)
and when it's a core-type file (memory-mapped files seem to do it, too).
Here's an example of a C program that crashes the system if the core
file is written to AFS, but works if it is written to local disk or NFS:

-----------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>

int main(void)
{
  /* Map 768 MB of shared anonymous memory so the core dump is large. */
  size_t length = (size_t)3 * 256 * 1024 * 1024;
  int prot = PROT_READ | PROT_WRITE;
  int flags = MAP_SHARED | MAP_ANONYMOUS;
  void *base = mmap(NULL, length, prot, flags, -1, 0);

  if (base == MAP_FAILED)
  {
    perror("mmap");
    return 1;
  }
  printf("Our segment begins at %p, here comes the abort...\n", base);
  abort(); /* Generate a 768+ megabyte core file */
  return 0;
}
-------------------------------------------------------------

However, a similar program that uses fwrite() to write a file of the same
size as the core file works fine, even to AFS:

-----------------------------------------------------
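#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  FILE *file;
  size_t size = 805613568; /* same size as the core file above */
  void *p = calloc(1, size);

  if (p == NULL)
  {
    perror("calloc");
    return 1;
  }
  file = fopen("bigfile", "wb");
  if (file != NULL)
  {
    /* Write the whole zero-filled buffer in a single fwrite() call. */
    if (fwrite(p, size, 1, file) != 1)
      perror("fwrite");
    fclose(file);
  }
  free(p);
  return 0;
}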

-------------------------------------------------------------------

Changing the size of the AFS cache masks the problem, but as soon as we
again reach a file size of about 75% of the cache size, it crashes. I
checked the documentation and the mailing lists, but I don't see any
debugging options for cache manager problems. We're using "$XLARGE" in our
afs sysconfig, and I've tried adding the -debug and -verbose flags to afsd
with no luck. This fails consistently with both SMP and uniprocessor
kernels, and also with OpenAFS 1.3.62.

I'd appreciate any suggestions or advice on how we can get to the bottom
of this problem - it's killing our productivity when our systems keep
crashing. Thanks!

Ken Gole
Cadence Design Systems
Endicott, NY
kengole@cadence.com
(607) 762-1342