[OpenAFS] OpenAFS client cache overrun?

Eric Chris Garrison ecgarris@iu.edu
Fri, 07 Mar 2014 13:51:06 -0500


I'll have to look for that message from Andrew to gather data if the
problem crops up again.

Thank you for the advice about .pst files. We'd already gotten the user to
stop what they were doing, but it's good to know there's some backing,
even from Microsoft, for heading off future issues.

Thanks also for the mention of AFS cache bypass. I think that may be a BIG
help with this problem.

Cheers,

Chris

On 3/6/14 5:20 PM, "Jeffrey Altman" <jaltman@your-file-system.com> wrote:

>Andrew previously explained the steps you can take to collect
>information regarding the state of the afs cache manager.  I never saw
>any follow-up e-mails containing the requested state information.  If
>the problem is a deadlock in the afs cache manager, the only way to fix
>it is to identify where the deadlock resides.
>
>I wish to provide some advice regarding the use of Outlook Personal
>Folders (.pst) over a network file system.  DON'T!   A quick Google
>search of "outlook pst network file system" will show you that Microsoft
>says not to do so as do many other web sites.   The Outlook PST is an
>indexed database that acts as a cache of data stored on mail servers.
>It is not meant to be stored in a redirected portion of the user
>profile.  It is meant to be local and small.
>
>Accessing Outlook PST files over a network increases (not decreases) the
>amount of network traffic generated by Outlook.  Outlook assumes the
>file is on local disk.  The .pst file is the equivalent of a Microsoft
>Access Database and access to it makes heavy use of byte range oplocks.
> Byte range locking is not supported by OpenAFS.  The native Windows AFS
>client goes to great lengths to simulate byte range lock semantics and
>Windows share modes to ensure that data that is modified is flushed to
>the file servers at the appropriate times.  I cannot say the same for
>Samba sitting on top of a UNIX afs cache manager.  The interactions
>between SMB protocol and AFS protocol are not pretty.
>
>You might want to try enabling afs cache bypass
>
>  http://docs.openafs.org/Reference/1/fs_bypassthreshold.html
>
>for files over a couple of GBs.  That way when the PST files are
>accessed they won't force the cached data for other users out of the
>cache.
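
A minimal sketch of turning "a couple of GBs" into a concrete setting. It
assumes the `fs bypassthreshold -size <bytes>` form described on the page
linked above, a 2 GiB cutoff chosen only for illustration, and a Linux
client built with cache bypass support. The helper name is made up, and
the command is only printed, not run.

```python
# Hypothetical helper: builds the fs command line but does not execute it.
GIB = 1024 ** 3


def bypass_threshold_command(threshold_gib):
    """Return the argv that would set the cache bypass threshold (in bytes)."""
    size_bytes = int(threshold_gib * GIB)
    return ["fs", "bypassthreshold", "-size", str(size_bytes)]


if __name__ == "__main__":
    # Assumed cutoff of 2 GiB ("a couple of GBs"); tune this for the site.
    print(" ".join(bypass_threshold_command(2)))
```

As far as the reference page goes, running `fs bypassthreshold` without
`-size` should report the current threshold, which is a cheap way to check
whether the client supports the feature before relying on it.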
>
>In the end though I would strongly recommend instructing your user to
>not store Outlook PST files in /afs if they are going to access them via
>an SMB share.
>
>Jeffrey Altman
>
>
>
>On 3/6/2014 4:32 PM, Eric Chris Garrison wrote:
>> We upgraded the gateways mentioned in the original email to
>> openafs-client-1.6.6-0.pre1 a while back, since there was a bugfix for
>> cache overrun in it (thanks for the help, Derrick). And for a while it
>> seemed like it had worked: our AFS clients on the gateway hosts weren't
>> locking up.
>> 
>> But the problem is back. We've had many lockups, requiring reboot, over
>> the past month, usually happening in clusters, like a user locking up
>> one host, then moving to another to lock it up.
>> 
>> After taking "smbstatus" snapshots each time it locks up, I've finally
>> found a common factor: large Outlook .pst files being locked:
>> 
>> One host:
>> 
>> 4229         46516      DENY_ALL   0x7019f     RDWR       NONE
>>   /afs/iu.edu/home/b/e/xxxxxx
>> xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
>> Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49
>> UTC).pst   Mon Mar  3 13:57:23 2014
>> 12868        46516      DENY_ALL   0x7019f     RDWR       NONE
>>   /afs/iu.edu/home/b/e/xxxxxx
>> xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
>> Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49
>> UTC).pst   Mon Mar  3 16:30:32 2014
>> 30686        46516      DENY_ALL   0x7019f     RDWR       NONE
>>   /afs/iu.edu/home/b/e/xxxxxx
>> xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
>> Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49
>> UTC).pst   Thu Mar  6 14:53:39 2014
>> 
>> On another host:
>> 
>> /home/b/e/xxxxxx
>> xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
>> Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_02_27 18_02_49
>> UTC).pst   Mon Mar  3 15:21:53 2014
>> ecg-ss2:24849        46516      DENY_ALL   0x7019f     RDWR       NONE
>>           /afs/iu.edu/home/b/e/xxxxxx
>> xxxxxx@ads.iu.edu/BL-ECON-WY214-1/Data/C/Users/xxxxxx/Documents/Outlook
>> Files/Xxxx Xxxxxx E-Mail Archive (2006-2011) (2014_03_06 18_11_27
>> UTC).pst   Thu Mar  6 14:44:26 2014
>> 
>> These are always present on each host that's locked up. Same .pst file,
>> even. It is a 6.5 GB file. Our AFS client cache is 7GB in size on a 9GB
>> partition.
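
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes the 7500000 in cacheinfo is still the configured size and is
counted in 1 KiB blocks, that "6.5 GB" means 6.5 GiB, and that
`-chunksize 22` means 2^22-byte chunks. The function names are
illustrative only.

```python
import math

CACHE_KIB_BLOCKS = 7_500_000      # third field of cacheinfo (assumed current)
PST_BYTES = int(6.5 * 1024 ** 3)  # assumes "6.5 GB" means 6.5 GiB
CHUNK_BYTES = 2 ** 22             # -chunksize 22 -> 4 MiB chunks (assumed)


def cache_share(file_bytes, cache_kib_blocks):
    """Fraction of the configured cache that one fully cached file would occupy."""
    return file_bytes / (cache_kib_blocks * 1024)


def chunks_needed(file_bytes, chunk_bytes):
    """Number of cache chunks needed to hold the whole file."""
    return math.ceil(file_bytes / chunk_bytes)


if __name__ == "__main__":
    print(f"share of cache: {cache_share(PST_BYTES, CACHE_KIB_BLOCKS):.0%}")
    print(f"chunks needed:  {chunks_needed(PST_BYTES, CHUNK_BYTES)}")
```

Under those assumptions, a single fully read copy of that file would take
roughly nine tenths of the cache, which leaves very little room for
anything else the gateway is serving.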
>> 
>> I'm writing to the user to see if he's doing anything extraordinary.
>> 
>> Still looking for ideas. I haven't tried Kim Kaball's idea of lowering
>> the cache size to 2.5GB yet. I may try that next, but I worry that it'll
>> impact performance too much.
>> 
>> Thanks!!!
>> 
>> Chris Garrison
>> Indiana University
>> UITS Research Storage
>> 
>> From: Chris Garrison <ecgarris@iu.edu>
>> Date: Wednesday, November 20, 2013 4:47 PM
>> To: "openafs-info@openafs.org" <openafs-info@openafs.org>
>> Subject: [OpenAFS] OpenAFS client cache overrun?
>> 
>> Hello,
>> 
>> We have some RHEL 5.5 servers with openafs-client-1.6.1-1 running. There
>> are 4 of them in a round-robin DNS, with Apache and Samba sitting on top
>> of the OpenAFS filesystem.
>> 
>> The hosts' /etc/sysconfig/openafs files look like this:
>> 
>>   # OpenAFS Client Configuration
>>   AFSD_ARGS="-dynroot -fakestat-all -daemons 8 -chunksize 22"
>> 
>> The hosts' /usr/vice/etc/cacheinfo files look like this:
>> 
>>   /afs:/usr/vice/cache:7500000
>> 
>> I realize it's better for users to all use the openafs client for their
>> own OS, but we have a large base of users who insist on wanting to just
>> map a drive without installing a client. We have been running like this
>> for 8+ years now, so it's not a new setup.
>> 
>> Something has been locking up the openafs client in the past month or
>> so.  The cache will show as more and more full in "df" and then at some
>> point, AFS stops answering, and any attempt to do a directory listing or
>> to access a file results in a zombie process.
>> 
>> The zombie processes mount up fast, the load on the machine skyrockets,
>> and the only solution seems to be to reboot.
>> 
>> What could cause that lockup? It's usually only on one host at a time,
>> and seems like it will "move" from host to host, even returning to the
>> same host in the same day after reboot once in a while.
>> 
>> I doubled the cache size on these hosts, and it seemed to slow things
>> down, but we had another lockup today after a restart of all the clients
>> on Sunday during a hardware upgrade on the SAN, so no host had been
>> running more than 3 days.
>> 
>> To me, it feels like maybe someone is forcing a huge file through and
>> running the machine out of cache. Though if that's so, I wonder why it
>> only just started happening after all these years. If nothing else, it
>> seems like something new is going on with the user end that's causing
>> it.
>> 
>> Any help would be appreciated, anything from a fix by limiting something
>> in the openafs client or the cache or ideas as to what someone could be
>> doing. Because at this point, it's like a denial of service attack
>> that's making lots of problems for us.
>> 
>> Thank you,
>> 
>> Chris Garrison
>> Indiana University Research Storage
>