FS-Cache: Count culled objects and objects rejected due to lack of space

Count the number of objects that get culled by the cache backend and the
number of objects that the cache backend declines to instantiate due to lack
of space in the cache.

These numbers are made available through /proc/fs/fscache/stats.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
sched: Remove proliferation of wait_on_bit() action functions

The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and
 wait_on_bit_lock_action to make it explicit that they need an action
 function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{,_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 substitute their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible".

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps' (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since the first version of this patch (against 3.15) two new action
functions appeared, one in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer-aware
schedule call as NFS.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
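The caller-side rule described in this commit (check for non-zero, then choose the error code yourself) can be sketched in user space. This is only a model under stated assumptions: `model_wait_on_bit()` and `model_wait_for_flag()` are hypothetical names, the model does not sleep, and `-ERESTARTSYS` is just one error code a caller might pick.

```c
#include <stdbool.h>

/* Kernel-internal errno value; userspace <errno.h> does not export it. */
#define ERESTARTSYS 512

/*
 * Hypothetical stand-in for the new wait_on_bit(): returns 0 when the wait
 * finished normally and some non-zero value when a signal cut it short.
 * The real function sleeps on the bit; this model only mimics the result.
 */
static int model_wait_on_bit(bool signal_arrived)
{
	return signal_arrived ? 1 : 0;
}

/* Caller pattern: test for non-zero, then supply the error code yourself. */
static int model_wait_for_flag(bool signal_arrived)
{
	if (model_wait_on_bit(signal_arrived))
		return -ERESTARTSYS;
	return 0;
}
```

The point of the design is that the wait primitive no longer decides which error a signal maps to; each call site does.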
FS-Cache: Provide the ability to enable/disable cookies

Provide the ability to enable and disable fscache cookies.  A disabled cookie
will reject or ignore further requests to:

        Acquire a child cookie
        Invalidate and update backing objects
        Check the consistency of a backing object
        Allocate storage for backing page
        Read backing pages
        Write to backing pages

but still allows:

        Checks/waits on the completion of already in-progress objects
        Uncaching of pages
        Relinquishment of cookies

Two new operations are provided:

 (1) Disable a cookie:

        void fscache_disable_cookie(struct fscache_cookie *cookie,
                                    bool invalidate);

     If the cookie is not already disabled, this locks the cookie against other
     dis/enablement ops, marks the cookie as being disabled, discards or
     invalidates any backing objects and waits for cessation of activity on any
     associated object.

     This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
     but it reinitialises the cookie such that it can be reenabled.

     All possible failures are handled internally.  The caller should consider
     calling fscache_uncache_all_inode_pages() afterwards to make sure all page
     markings are cleared up.

 (2) Enable a cookie:

        void fscache_enable_cookie(struct fscache_cookie *cookie,
                                   bool (*can_enable)(void *data),
                                   void *data)

     If the cookie is not already enabled, this locks the cookie against other
     dis/enablement ops, invokes can_enable() and, if the cookie is not an index
     cookie, will begin the procedure of acquiring backing objects.

     The optional can_enable() function is passed the data argument and returns
     a ruling as to whether or not enablement should actually be permitted to
     begin.

     All possible failures are handled internally.  The cookie will only be
     marked as enabled if provisional backing objects are allocated.

A later patch will introduce these to NFS.  Cookie enablement during nfs_open()
is then contingent on i_writecount <= 0.  can_enable() checks for a race
between open(O_RDONLY) and open(O_WRONLY/O_RDWR).  This simplifies NFS's cookie
handling and allows us to get rid of open(O_RDONLY) accidentally introducing
caching to an inode that's open for writing already.

One operation has its API modified:

 (3) Acquire a cookie.

        struct fscache_cookie *fscache_acquire_cookie(
                struct fscache_cookie *parent,
                const struct fscache_cookie_def *def,
                void *netfs_data,
                bool enable);

     This now has an additional argument that indicates whether the requested
     cookie should be enabled by default.  It doesn't need the can_enable()
     function because the caller must prevent multiple calls for the same netfs
     object and it doesn't need to take the enablement lock because no one else
     can get at the cookie before this returns.

Signed-off-by: David Howells <dhowells@redhat.com>
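The enable/disable rules above can be illustrated with a small user-space model. It is a sketch, not the kernel code: `struct model_cookie`, the `model_*` functions and `model_no_writers()` are hypothetical, locking is omitted, and backing-object allocation is assumed always to succeed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct fscache_cookie. */
struct model_cookie {
	bool enabled;
	bool is_index;
	int backing_objects;
};

/* Enable only if not already enabled and the optional callback agrees. */
static void model_enable_cookie(struct model_cookie *cookie,
				bool (*can_enable)(void *data), void *data)
{
	if (cookie->enabled)
		return;
	if (can_enable && !can_enable(data))
		return;
	if (!cookie->is_index)
		cookie->backing_objects = 1;	/* assumed to succeed */
	cookie->enabled = true;
}

/* Disable: drop the backing objects but keep the cookie reusable. */
static void model_disable_cookie(struct model_cookie *cookie, bool invalidate)
{
	(void)invalidate;	/* the model discards objects either way */
	if (!cookie->enabled)
		return;
	cookie->backing_objects = 0;
	cookie->enabled = false;
}

/* NFS-style ruling from the commit text: allow only if i_writecount <= 0. */
static bool model_no_writers(void *data)
{
	return *(int *)data <= 0;
}
```

Passing the ruling as a callback lets the check run under the same exclusion as the enablement itself, which is how the commit describes closing the open(O_RDONLY) versus open(O_WRONLY/O_RDWR) race.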
fscache: Netfs function for cleanup post readpages

Currently the fscache code expects the netfs to call fscache_readpages_or_alloc
inside the aops readpages callback.  It marks all the pages in the list
provided by readahead with PG_private_2.  In the cases that the netfs fails to
read all the pages (which is legal) it ends up returning to the readahead and
triggering a BUG.  This happens because the page list still contains marked
pages.

This patch implements a simple fscache_readpages_cancel function that the netfs
should call before returning from readpages.  It will revoke the pages from the
underlying cache backend and unmark them.

The problem was originally worked out in the Ceph devel tree, but it also
occurs in CIFS.  It appears that NFS, AFS and 9P are okay as read_cache_pages()
will clean up the unprocessed pages in the case of an error.

This can be used to address the following oops:

[12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e
[12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0
[12410647.597298] page flags: 0x200000000001000(private_2)
...
[12410647.597334] Call Trace:
[12410647.597345] [<ffffffff815523f2>] dump_stack+0x19/0x1b
[12410647.597356] [<ffffffff8111def7>] bad_page+0xc7/0x120
[12410647.597359] [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120
[12410647.597361] [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170
[12410647.597363] [<ffffffff81123507>] __put_single_page+0x27/0x30
[12410647.597365] [<ffffffff81123df5>] put_page+0x25/0x40
[12410647.597376] [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph]
[12410647.597379] [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260
[12410647.597382] [<ffffffff81122ea1>] ra_submit+0x21/0x30
[12410647.597384] [<ffffffff81118f64>] filemap_fault+0x254/0x490
[12410647.597387] [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0
[12410647.597391] [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0
[12410647.597395] [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0
[12410647.597398] [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930
[12410647.597401] [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110
[12410647.597403] [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10
[12410647.597405] [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[12410647.597407] [<ffffffff8113f361>] handle_mm_fault+0x251/0x370
[12410647.597411] [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30
[12410647.597414] [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550
[12410647.597418] [<ffffffff8108011d>] ? up_write+0x1d/0x20
[12410647.597422] [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0
[12410647.597425] [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240
[12410647.597427] [<ffffffff8155c3ae>] do_page_fault+0xe/0x10
[12410647.597431] [<ffffffff81558818>] page_fault+0x28/0x30

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
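The failure mode and the fix can be modelled in user space as follows. This is a rough sketch under assumptions: `struct model_page` and both functions are hypothetical, and a boolean stands in for the PG_private_2 page flag; the real fscache_readpages_cancel() also revokes the pages from the cache backend.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page: the flag stands in for PG_private_2. */
struct model_page {
	bool private_2;
};

/* Models the cache marking every page in the readahead list. */
static void model_mark_for_caching(struct model_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].private_2 = true;
}

/*
 * Models the cancel step: clear the mark on every page the netfs did not
 * get round to reading, so that freeing them no longer trips the
 * "Bad page state" check. Returns the number of pages unmarked.
 */
static size_t model_readpages_cancel(struct model_page *leftover, size_t n)
{
	size_t unmarked = 0;

	for (size_t i = 0; i < n; i++) {
		if (leftover[i].private_2) {
			leftover[i].private_2 = false;
			unmarked++;
		}
	}
	return unmarked;
}
```

A netfs following this pattern would call the cancel step on whatever remains of the list just before returning from its readpages callback.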
FS-Cache: Fix heading in documentation

Fix a heading in the documentation to make it consistent with the contents
list.

Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache: Add interface to check consistency of a cached object

Extend the fscache netfs API so that the netfs can ask as to whether a cache
object is up to date with respect to its corresponding netfs object:

        int fscache_check_consistency(struct fscache_cookie *cookie)

This will call back to the netfs to check whether the auxiliary data associated
with a cookie is correct.  It returns 0 if it is and -ESTALE if it isn't; it
may also return -ENOMEM and -ERESTARTSYS.

The backends now have to implement a mandatory operation pointer:

        int (*check_consistency)(struct fscache_object *object)

that corresponds to the above API call.  FS-Cache takes care of pinning the
object and the cookie in memory and managing this call with respect to the
object state.

Original-author: Hongyi Jia <jiayisuse@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hongyi Jia <jiayisuse@gmail.com>
cc: Milosz Tanski <milosz@adfin.com>
FS-Cache: Provide proper invalidation

Provide a proper invalidation method rather than relying on the netfs retiring
the cookie it has and getting a new one.  The problem with this is that it
isn't easy for the netfs to make sure that it has completed/cancelled all its
outstanding storage and retrieval operations on the cookie it is retiring.

Instead, have the cache provide an invalidation method that will cancel or wait
for all currently outstanding operations before invalidating the cache, and
will cause new operations to queue up behind that.  Whilst invalidation is in
progress, some requests will be rejected until the cache can stack a barrier on
the operation queue to cause new operations to be deferred behind it.

Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache: Fix operation state management and accounting

Fix the state management of internal fscache operations and the accounting of
what operations are in what states.

This is done by:

 (1) Give struct fscache_operation an enum variable that directly represents
     the state it's currently in, rather than spreading this knowledge over a
     bunch of flags, who's processing the operation at the moment and whether
     it is queued or not.

     This makes it easier to write assertions to check the state at various
     points and to prevent invalid state transitions.

 (2) Add an 'operation complete' state and supply a function to indicate the
     completion of an operation (fscache_op_complete()) and make things call
     it.  The final call to fscache_put_operation() can then check that an op
     is in the appropriate state (complete or cancelled).

 (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
     govern the state of an object:

        (a) The ->n_ops is now the number of extant operations on the object
            and is now decremented by fscache_put_operation() only.

        (b) The ->n_in_progress is simply the number of operations that have
            been taken off of the object's pending queue for the purposes of
            being run.  This is decremented by fscache_op_complete() only.

        (c) The ->n_exclusive is the number of exclusive ops that have been
            submitted and queued or are in progress.  It is decremented by
            fscache_op_complete() and by fscache_cancel_op().

     fscache_put_operation() and fscache_operation_gc() now no longer try to
     clean up ->n_exclusive and ->n_in_progress.  That was leading to double
     decrements against fscache_cancel_op().

     fscache_cancel_op() now no longer decrements ->n_ops.  That was leading to
     double decrements against fscache_put_operation().

     fscache_submit_exclusive_op() now decides whether it has to queue an op
     based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
     will persist in being true even after all preceding operations have been
     cancelled or completed.
     Furthermore, if an object is active and there are runnable ops against
     it, there must be at least one op running.

 (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
     provide a function to record completion of the pages as they complete.

     When n_pages reaches 0, the operation is deemed to be complete and
     fscache_op_complete() is called.

     Add calls to fscache_retrieval_complete() anywhere we've finished with a
     page we've been given to read or allocate for.  This includes places where
     we just return pages to the netfs for reading from the server and where
     accessing the cache fails and we discard the proposed netfs page.

The bugs in the unfixed state management manifest themselves as oopses like the
following where the operation completion gets out of sync with return of the
cookie by the netfs.  This is possible because the cache unlocks and returns
all the netfs pages before recording its completion - which means that there's
nothing to stop the netfs discarding them and returning the cookie.

FS-Cache: Cookie 'NFS.fh' still has outstanding reads
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
CPU 1
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
RIP: 0010:[<ffffffffa007050a>] [<ffffffffa007050a>] __fscache_relinquish_cookie+0x170/0x343 [fscache]
RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
Stack:
 ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
 ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
 ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
Call Trace:
 [<ffffffffa00b2c91>] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
 [<ffffffffa008f25f>] nfs_clear_inode+0x3c/0x41 [nfs]
 [<ffffffffa0090df1>] nfs4_evict_inode+0x2f/0x33 [nfs]
 [<ffffffff810d8d47>] evict+0xa1/0x15c
 [<ffffffff810d8e2e>] dispose_list+0x2c/0x38
 [<ffffffff810d9ebd>] prune_icache_sb+0x28c/0x29b
 [<ffffffff810c56b7>] prune_super+0xd5/0x140
 [<ffffffff8109b615>] shrink_slab+0x102/0x1ab
 [<ffffffff8109d690>] balance_pgdat+0x2f2/0x595
 [<ffffffff8103e009>] ? process_timeout+0xb/0xb
 [<ffffffff8109dba3>] kswapd+0x270/0x289
 [<ffffffff8104c5ea>] ? __init_waitqueue_head+0x46/0x46
 [<ffffffff8109d933>] ? balance_pgdat+0x595/0x595
 [<ffffffff8104bf7a>] kthread+0x7f/0x87
 [<ffffffff813ad6b4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
 [<ffffffff813abcdd>] ? retint_restore_args+0xe/0xe
 [<ffffffff8104befb>] ? __init_kthread_worker+0x53/0x53
 [<ffffffff813ad6b0>] ? gs_change+0xb/0xb

Signed-off-by: David Howells <dhowells@redhat.com>
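The counter ownership rules from this commit can be sketched as a tiny user-space state machine. Everything here is a model: the `model_*` names and the reduced state list are hypothetical, and no queueing or locking is shown. What it does follow from the text is which function is allowed to decrement which counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced state list; one variable holds the whole state. */
enum model_op_state {
	MODEL_OP_PENDING,
	MODEL_OP_IN_PROGRESS,
	MODEL_OP_COMPLETE,
	MODEL_OP_CANCELLED,
	MODEL_OP_DEAD
};

struct model_object {
	int n_ops;		/* extant ops; dropped only by put */
	int n_in_progress;	/* ops taken off the pending queue */
	int n_exclusive;	/* exclusive ops queued or running */
};

struct model_op {
	enum model_op_state state;
	struct model_object *object;
	bool exclusive;
};

static void model_submit(struct model_op *op, struct model_object *obj,
			 bool exclusive)
{
	op->object = obj;
	op->exclusive = exclusive;
	op->state = MODEL_OP_PENDING;
	obj->n_ops++;
	if (exclusive)
		obj->n_exclusive++;
}

static void model_run(struct model_op *op)
{
	assert(op->state == MODEL_OP_PENDING);
	op->state = MODEL_OP_IN_PROGRESS;
	op->object->n_in_progress++;
}

/* Completion drops n_in_progress and, if needed, n_exclusive. */
static void model_complete(struct model_op *op)
{
	assert(op->state == MODEL_OP_IN_PROGRESS);
	op->object->n_in_progress--;
	if (op->exclusive)
		op->object->n_exclusive--;
	op->state = MODEL_OP_COMPLETE;
}

/* Cancellation touches n_exclusive only, never n_ops. */
static bool model_cancel(struct model_op *op)
{
	if (op->state != MODEL_OP_PENDING)
		return false;
	if (op->exclusive)
		op->object->n_exclusive--;
	op->state = MODEL_OP_CANCELLED;
	return true;
}

/* The final put checks the state and is the only place n_ops drops. */
static void model_put(struct model_op *op)
{
	assert(op->state == MODEL_OP_COMPLETE ||
	       op->state == MODEL_OP_CANCELLED);
	op->object->n_ops--;
	op->state = MODEL_OP_DEAD;
}
```

Because each counter has exactly one or two owners, the double decrements described in the commit cannot occur in this model, and the assertions catch invalid transitions early.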
doc: fix broken references

There are numerous broken references to Documentation files (in other
Documentation files, in comments, etc.).  These broken references are
caused by typos in the references, and by renames or removals of the
Documentation files.  Some broken references are simply odd.

Fix these broken references, sometimes by dropping the irrelevant text
they were part of.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
FS-Cache: Add a helper to bulk uncache pages on an inode

Add an FS-Cache helper to bulk uncache pages on an inode.  This will
only work for the circumstance where the pages in the cache correspond
1:1 with the pages attached to an inode's page cache.

This is required for CIFS and NFS: When disabling an inode cookie, we were
returning the cookie and setting cifsi->fscache to NULL but failed to
invalidate any previously mapped pages.  This resulted in "Bad page
state" errors and manifested in other kinds of errors when running
fsstress.  Fix it by uncaching mapped pages when we disable the inode
cookie.

This patch should fix the following oops and "Bad page state" errors
seen during fsstress testing.

  ------------[ cut here ]------------
  kernel BUG at fs/cachefiles/namei.c:201!
  invalid opcode: 0000 [#1] SMP
  Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
  RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
  RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
  RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
  RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
  R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
  R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
  FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
  Stack:
   0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
   ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
   ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
  Call Trace:
   cachefiles_lookup_object+0x78/0xd4 [cachefiles]
   fscache_lookup_object+0x131/0x16d [fscache]
   fscache_object_work_func+0x1bc/0x669 [fscache]
   process_one_work+0x186/0x298
   worker_thread+0xda/0x15d
   kthread+0x84/0x8c
   kernel_thread_helper+0x4/0x10
  RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
  ---[ end trace 1d481c9af1804caa ]---

I tested the uncaching by the following means:

 (1) Create a big file on my NFS server (104857600 bytes).

 (2) Read the file into the cache with md5sum on the NFS client.  Look in
     /proc/fs/fscache/stats:

        Pages : mrk=25601 unc=0

 (3) Open the file for read/write ("bash 5<>/warthog/bigfile").  Look in proc
     again:

        Pages : mrk=25601 unc=25601

Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
fscache: convert object to use workqueue instead of slow-work

Make fscache object state transition callbacks use workqueue instead
of slow-work.  New dedicated unbound CPU workqueue fscache_object_wq
is created.  get/put callbacks are renamed and modified to take
@object and called directly from the enqueue wrapper and the work
function.  While at it, make all open coded instances of get/put to
use fscache_get/put_object().

* Unbound workqueue is used.

* work_busy() output is printed instead of slow-work flags in object
  debugging outputs.  They mean basically the same thing bit-for-bit.

* sysctl fscache.object_max_active added to control concurrency.  The
  default value is nr_cpus clamped between 4 and
  WQ_UNBOUND_MAX_ACTIVE.

* slow_work_sleep_till_thread_needed() is replaced with fscache
  private implementation fscache_object_sleep_till_congested() which
  waits on fscache_object_wq congestion.

* debugfs support is dropped for now.  Tracing API based debug
  facility is planned to be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
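The default for fscache.object_max_active described above is a plain clamp. A minimal sketch, assuming hypothetical helper names and taking the upper bound as a parameter rather than asserting the kernel's actual value of WQ_UNBOUND_MAX_ACTIVE:

```c
/* Clamp v into the inclusive range [lo, hi]. */
static unsigned int model_clamp(unsigned int v, unsigned int lo,
				unsigned int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Default concurrency as described in the commit: the CPU count, but never
 * below 4 and never above the unbound workqueue limit passed in by the
 * caller (the kernel uses WQ_UNBOUND_MAX_ACTIVE for that limit).
 */
static unsigned int model_default_object_max_active(unsigned int nr_cpus,
						     unsigned int wq_unbound_max)
{
	return model_clamp(nr_cpus, 4, wq_unbound_max);
}
```

The lower bound keeps small machines from serialising object state transitions; the upper bound keeps large machines within what the workqueue allows.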
CacheFiles: Catch an overly long wait for an old active object

Catch an overly long wait for an old, dying active object when we want to
replace it with a new one.  The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.

What we do instead is:

 (1) if there's nothing in the slow work queue, we sleep until either the dying
     object has finished dying or there is something in the slow work queue
     behind which we can queue our object.

 (2) if there is something in the slow work queue, we return ETIMEDOUT to
     fscache_lookup_object(), which then puts us back on the slow work queue,
     presumably behind the deletion that we're blocked by.  We are then
     deferred for a while until we work our way back through the queue -
     without blocking a slow-work thread unnecessarily.

A backtrace similar to the following may appear in the log without this patch:

  INFO: task kslowd004:5711 blocked for more than 120 seconds.
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kslowd004 D 0000000000000000 0 5711 2 0x00000080
   ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
   ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
   000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
  Call Trace:
   [<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
   [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
   [<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
   [<ffffffff81353153>] __wait_on_bit+0x43/0x76
   [<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
   [<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
   [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
   [<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
   [<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
   [<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
   [<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
   [<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
   [<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
   [<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
   [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
   [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
   [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
   [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
   [<ffffffff8104be91>] kthread+0x7a/0x82
   [<ffffffff8100beda>] child_rip+0xa/0x20
   [<ffffffff8100b87c>] ? restore_args+0x0/0x30
   [<ffffffff8104be17>] ? kthread+0x0/0x82
   [<ffffffff8100bed0>] ? child_rip+0x0/0x20
  1 lock held by kslowd004/5711:
   #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]

Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache: Start processing an object's operations on that object's death

Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.

Furthermore, make sure that read and allocation operations handle being woken
up on a dead object.  Such events are recorded in the Allocs.abt and
Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.

The code for waiting for object activation for the read and allocation
operations is also extracted into its own function as it is much the same in
all cases, differing only in the stats incremented.

Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache: Handle pages pending storage that get evicted under OOM conditions

Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache.  Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.

The problem is typified by the following trace of a stuck process:

  kslowd005 D 0000000000000000 0 4253 2 0x00000080
   ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
   0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
   000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
  Call Trace:
   [<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
   [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
   [<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
   [<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
   [<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
   [<ffffffff810885d3>] try_to_release_page+0x32/0x3b
   [<ffffffff81093203>] shrink_page_list+0x316/0x4ac
   [<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
   [<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
   [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
   [<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
   [<ffffffff81093aa2>] shrink_list+0x8d/0x8f
   [<ffffffff81093d1c>] shrink_zone+0x278/0x33c
   [<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
   [<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
   [<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
   [<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
   [<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
   [<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
   [<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
   [<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
   [<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
   [<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
   [<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
   [<ffffffff810b2e82>] do_sync_write+0xe3/0x120
   [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
   [<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
   [<ffffffff810b1a76>] ? dentry_open+0x82/0x89
   [<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
   [<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
   [<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
   [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
   [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
   [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
   [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
   [<ffffffff8104be91>] kthread+0x7a/0x82
   [<ffffffff8100beda>] child_rip+0xa/0x20
   [<ffffffff8100b87c>] ? restore_args+0x0/0x30
   [<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
   [<ffffffff8104be17>] ? kthread+0x0/0x82
   [<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

 (1) A page storage operation is being executed by a slow-work thread
     (fscache_write_op()).

 (2) FS-Cache farms the operation out to the cache to perform
     (cachefiles_write_page()).

 (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
     standard write (do_sync_write()) under KERNEL_DS directly from the netfs
     page.

 (4) However, for Ext3 to perform the write, it must allocate some memory, in
     particular, it must allocate at least one page cache page into which it
     can copy the data from the netfs page.

 (5) Under OOM conditions, the memory allocator can't immediately come up with
     a page, so it uses vmscan to find something to discard
     (try_to_free_pages()).

 (6) vmscan finds a clean netfs page it might be able to discard (possibly the
     one it's trying to write out).

 (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
     called with __GFP_WAIT, so the netfs decides to wait for the store to
     complete (__fscache_wait_on_page_write()).

 (8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed.  This means that some data won't make it into the
cache this time.  To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan".  There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.

What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages.  If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
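The four "VmScan" counters map onto four outcomes of the release decision, which can be sketched in user space. The `model_*` names and the page-status enum are hypothetical, and the boolean return (true meaning "the page may be released") is an assumption about how a caller would use the result, not a statement of the kernel function's contract.

```c
#include <stdbool.h>

/* Hypothetical classification of a page at the moment vmscan asks. */
enum model_page_status {
	MODEL_PAGE_NOT_PENDING,		/* was never pending storage */
	MODEL_PAGE_GONE_BY_LOCK,	/* pending at first look, done by lock time */
	MODEL_PAGE_BEING_WRITTEN,	/* actively being written right now */
	MODEL_PAGE_PENDING		/* queued, write not yet started */
};

/* Mirrors the four counters named in the commit text. */
struct model_vmscan_stats {
	unsigned int nos;
	unsigned int gon;
	unsigned int bsy;
	unsigned int can;
};

/*
 * Count the decision and report whether the page may be released.
 * Only a write that is actually in flight keeps the page pinned; a
 * merely queued write is cancelled instead of being waited for.
 */
static bool model_maybe_release_page(enum model_page_status status,
				     struct model_vmscan_stats *stats)
{
	switch (status) {
	case MODEL_PAGE_NOT_PENDING:
		stats->nos++;
		return true;
	case MODEL_PAGE_GONE_BY_LOCK:
		stats->gon++;
		return true;
	case MODEL_PAGE_BEING_WRITTEN:
		stats->bsy++;
		return false;
	case MODEL_PAGE_PENDING:
		stats->can++;
		return true;
	}
	return false;
}
```

The key property is that no branch sleeps, so a slow-work thread that ends up in reclaim can never block waiting on a store that it is itself responsible for.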
FS-Cache: Handle read request vs lookup, creation or other cache failureFS-Cache doesn't correctly handle the netfs requesting a read from the cacheon an object that failed or was withdrawn by the cache. A trace similar tothe following might be seen: CacheFiles: Lookup failed error -105 [exe ] unexpected submission OP165afe [OBJ6cac OBJECT_LC_DYING] [exe ] objstate=OBJECT_LC_DYING [OBJECT_LC_DYING] [exe ] objflags=0 [exe ] objevent=9 [fffffffffffffffb] [exe ] ops=0 inp=0 exc=0 Pid: 6970, comm: exe Not tainted 2.6.32-rc6-cachefs #50 Call Trace: [<ffffffffa0076477>] fscache_submit_op+0x3ff/0x45a [fscache] [<ffffffffa0077997>] __fscache_read_or_alloc_pages+0x187/0x3c4 [fscache] [<ffffffffa00b6480>] ? nfs_readpage_from_fscache_complete+0x0/0x66 [nfs] [<ffffffffa00b6388>] __nfs_readpages_from_fscache+0x7e/0x176 [nfs] [<ffffffff8108e483>] ? __alloc_pages_nodemask+0x11c/0x5cf [<ffffffffa009d796>] nfs_readpages+0x114/0x1d7 [nfs] [<ffffffff81090314>] __do_page_cache_readahead+0x15f/0x1ec [<ffffffff81090228>] ? __do_page_cache_readahead+0x73/0x1ec [<ffffffff810903bd>] ra_submit+0x1c/0x20 [<ffffffff810906bb>] ondemand_readahead+0x227/0x23a [<ffffffff81090762>] page_cache_sync_readahead+0x17/0x19 [<ffffffff8108a99e>] generic_file_aio_read+0x236/0x5a0 [<ffffffffa00937bd>] nfs_file_read+0xe4/0xf3 [nfs] [<ffffffff810b2fa2>] do_sync_read+0xe3/0x120 [<ffffffff81354cc3>] ? _spin_unlock_irq+0x2b/0x31 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34 [<ffffffff811848e5>] ? selinux_file_permission+0x5d/0x10f [<ffffffff81352bdb>] ? thread_return+0x3e/0x101 [<ffffffff8117d7b0>] ? security_file_permission+0x11/0x13 [<ffffffff810b3b06>] vfs_read+0xaa/0x16f [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130 [<ffffffff810b3c84>] sys_read+0x45/0x6c [<ffffffff8100ae2b>] system_call_fastpath+0x16/0x1bThe object state might also be OBJECT_DYING or OBJECT_WITHDRAWING.This should be handled by simply rejecting the new operation with ENOBUFS.There's no need to log an error for it. 
Events of this type now appear in the stats file under Ops:rej.

Signed-off-by: David Howells <dhowells@redhat.com>
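
The rejection rule in the commit above can be sketched in userspace C. This is only a simplified model, not the kernel code: the enum, the counter and submit_op() are hypothetical names chosen for illustration, and only the three dying states named in the commit message are modelled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified object states named after those in the trace. */
enum obj_state {
    OBJ_ACTIVE,
    OBJ_LC_DYING,
    OBJ_DYING,
    OBJ_WITHDRAWING,
};

/* Stands in for the "Ops:rej" statistic (assumed name). */
static unsigned long n_op_rejected;

/*
 * Accept an operation only if the object is not dying.
 * Returns 0 on acceptance, -ENOBUFS on quiet rejection (no error is logged).
 */
static int submit_op(enum obj_state state)
{
    switch (state) {
    case OBJ_LC_DYING:
    case OBJ_DYING:
    case OBJ_WITHDRAWING:
        n_op_rejected++;
        return -ENOBUFS;
    default:
        return 0;
    }
}
```

A caller would treat -ENOBUFS as "not cached" and fall back to reading from the server.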
FS-Cache: Fix lock misorder in fscache_write_op()

FS-Cache has two structs internally for keeping track of the internal state of
a cached file: the fscache_cookie struct, which represents the netfs's state,
and fscache_object struct, which represents the cache's state.  Each has a
pointer that points to the other (when both are in existence), and each has a
spinlock for pointer maintenance.

Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock.  Cache operations, on the other
hand, approach from the object side, and get the object lock first.  It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:

 (1) increment the cookie usage counter, drop the object lock and then get both
     locks in order, or

 (2) simply hold the object lock as certain parts of the cookie may not be
     altered whilst the object lock is held.

It is also not permitted to follow either pointer without holding the lock at
the end you start with.  To break the pointers between the cookie and the
object, both locks must be held.

fscache_write_op(), however, violates the locking rules: It attempts to get the
cookie lock without (a) checking that the cookie pointer is a valid pointer,
and (b) holding the object lock to protect the cookie pointer whilst it follows
it.  This is so that it can access the pending page store tree without
interference from __fscache_write_page().

This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.

The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.

Signed-off-by: David Howells <dhowells@redhat.com>
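
The ordering rule described above (cookie lock, then object lock, then the new page-store lock) can be illustrated with a small rank checker. This is a sketch under assumptions: the rank values, struct held_locks and take() are hypothetical and are not how the kernel enforces the rule. It models acquisition order only, not release.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks: a lock may only be taken if its rank is higher
 * than every lock already held. */
enum lock_rank {
    RANK_COOKIE = 1,  /* cookie lock: taken first */
    RANK_OBJECT = 2,  /* object lock: taken after the cookie lock */
    RANK_STORES = 3,  /* page-store tracking lock: subordinate to both */
};

struct held_locks {
    int highest;      /* highest rank currently held, 0 if none */
};

/* Try to take a lock; refuse if doing so would violate the ordering. */
static bool take(struct held_locks *h, enum lock_rank rank)
{
    if ((int)rank <= h->highest)
        return false;   /* out of order: potential deadlock */
    h->highest = rank;
    return true;
}
```

With this model, a cache operation that holds the object lock is refused the cookie lock but may still take the page-store lock, which matches the fix described in the commit.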
FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase

Permit the operations to retrieve data from the cache or to allocate space in
the cache for future writes to be interrupted whilst they're waiting for
permission for the operation to proceed.  Typically this wait occurs whilst the
cache object is being looked up on disk in the background.

If an interruption occurs, and the operation has not yet been given the
go-ahead to run, the operation is dequeued and cancelled, and control returns
to the read operation of the netfs routine with none of the requested pages
having been read or in any way marked as known by the cache.

This means that the initial wait is done interruptibly rather than
uninterruptibly.

In addition, extra stats values are made available to show the number of ops
cancelled and the number of cache space allocations interrupted.

Signed-off-by: David Howells <dhowells@redhat.com>
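
The cancellation rule above can be modelled as a small decision function. This is a hedged userspace sketch: struct retrieval_op, wait_for_go_ahead() and the counter are hypothetical, and -EINTR and -EAGAIN are stand-ins chosen for the model rather than the values the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum op_state { OP_PENDING, OP_RUNNING, OP_CANCELLED };

struct retrieval_op {
    enum op_state state;
};

/* Stands in for the "ops cancelled" statistic (assumed name). */
static unsigned long n_ops_cancelled;

/*
 * One step of the initial wait.
 *   go_ahead:  the cache has granted permission to proceed
 *   signalled: a signal arrived whilst waiting
 * An op that already has the go-ahead is never cancelled.
 */
static int wait_for_go_ahead(struct retrieval_op *op,
                             bool go_ahead, bool signalled)
{
    if (go_ahead) {
        op->state = OP_RUNNING;
        return 0;
    }
    if (signalled) {
        op->state = OP_CANCELLED;   /* dequeued; no pages were read */
        n_ops_cancelled++;
        return -EINTR;
    }
    return -EAGAIN;                 /* model only: keep waiting */
}
```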
FS-Cache: Add counters for entry/exit to/from cache operation functions

Count entries to and exits from cache operation table functions.  Maintain
these as a single counter that's added to or removed from as appropriate.

Signed-off-by: David Howells <dhowells@redhat.com>
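
A single in-progress counter of the kind described above might look like the following C11 sketch. The names cop_enter(), cop_exit() and cop_active() are hypothetical; the kernel uses its own atomic types rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Number of cache-operation functions currently executing (assumed name). */
static atomic_int n_cop_active;

/* Call on entry to a cache operation table function. */
static void cop_enter(void)
{
    atomic_fetch_add(&n_cop_active, 1);
}

/* Call on exit from the same function. */
static void cop_exit(void)
{
    atomic_fetch_sub(&n_cop_active, 1);
}

/* Snapshot of how many calls are currently in flight. */
static int cop_active(void)
{
    return atomic_load(&n_cop_active);
}
```

Keeping one counter rather than separate entry and exit tallies means a non-zero reading directly identifies calls that are still inside the backend.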
FS-Cache: Allow the current state of all objects to be dumped

Allow the current state of all fscache objects to be dumped by doing:

    cat /proc/fs/fscache/objects

By default, all objects and all fields will be shown.  This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):

    keyctl add user fscache:objlist "<restrictions>" @s

The <restrictions> are:

    K   Show hexdump of object key (don't show if not given)
    A   Show hexdump of object aux data (don't show if not given)

And paired restrictions:

    C   Show objects that have a cookie
    c   Show objects that don't have a cookie
    B   Show objects that are busy
    b   Show objects that aren't busy
    W   Show objects that have pending writes
    w   Show objects that don't have pending writes
    R   Show objects that have outstanding reads
    r   Show objects that don't have outstanding reads
    S   Show objects that have slow work queued
    s   Show objects that don't have slow work queued

If neither side of a restriction pair is given, then both are implied.  For
example:

    keyctl add user fscache:objlist KB @s

shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data.  It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.

Signed-off-by: David Howells <dhowells@redhat.com>
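
The pairing rule above ("if neither side of a pair is given, both are implied") can be expressed as a short parser. This is an illustrative userspace sketch: struct objlist_filter and parse_objlist() are hypothetical names, and the case where no key exists at all (everything shown) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N_PAIRS 5

struct objlist_filter {
    bool show_key;        /* 'K' */
    bool show_aux;        /* 'A' */
    bool pos[N_PAIRS];    /* C, B, W, R, S: show objects that have ...   */
    bool neg[N_PAIRS];    /* c, b, w, r, s: show objects that don't ...  */
};

/* Parse a restriction string such as "KB" into a filter. */
static void parse_objlist(const char *s, struct objlist_filter *f)
{
    static const char upper[] = "CBWRS";
    static const char lower[] = "cbwrs";

    memset(f, 0, sizeof(*f));
    f->show_key = strchr(s, 'K') != NULL;
    f->show_aux = strchr(s, 'A') != NULL;

    for (int i = 0; i < N_PAIRS; i++) {
        f->pos[i] = strchr(s, upper[i]) != NULL;
        f->neg[i] = strchr(s, lower[i]) != NULL;
        /* Neither side given: both are implied. */
        if (!f->pos[i] && !f->neg[i])
            f->pos[i] = f->neg[i] = true;
    }
}
```

For the "KB" example, the filter shows keys but not aux data, selects busy objects only, and leaves the other four pairs unrestricted.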
trivial: Miscellaneous documentation typo fixes

Fix various typos in documentation txts.

Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
CacheFiles: Fix the documentation to use the correct credential pointer names

Adjust the CacheFiles documentation to use the correct names of the credential
pointers in task_struct.

The documentation was using names from the old versions of the credentials
patches.

Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CacheFiles: A cache that backs onto a mounted filesystem

Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.

CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling.  This is called cachefilesd and lives in
/sbin.  The source for the daemon can be downloaded from:

    http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

    http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services.  Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communicate with the daemon.  Only one thing may have this open at once,
and whilst it is open, a cache is at least partially in existence.  The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section.  This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.

============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

  - dnotify.

  - extended attributes (xattrs).

  - openat() and friends.

  - bmap() support on files in the filesystem (FIBMAP ioctl).

  - The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.

=============
CONFIGURATION
=============

The cache is configured by a script in /etc/cachefilesd.conf.  These commands
set up the cache ready for use.  The following script commands are available:

 (*) brun <N>%
 (*) bcull <N>%
 (*) bstop <N>%
 (*) frun <N>%
 (*) fcull <N>%
 (*) fstop <N>%

     Configure the culling limits.  Optional.  See the section on culling.
     The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.

     The commands beginning with a 'b' are file space (block) limits, those
     beginning with an 'f' are file count limits.

 (*) dir <path>

     Specify the directory containing the root of the cache.  Mandatory.

 (*) tag <name>

     Specify a tag to FS-Cache to use in distinguishing multiple caches.
     Optional.  The default is "CacheFiles".

 (*) debug <mask>

     Specify a numeric bitmask to control debugging in the kernel module.
     Optional.  The default is zero (all off).  The following values can be
     OR'd into the mask to collect various information:

         1   Turn on trace of function entry (_enter() macros)
         2   Turn on trace of function exit (_leave() macros)
         4   Turn on trace of internal debug points (_debug())

     This mask can also be set through sysfs, eg:

         echo 5 >/sys/module/cachefiles/parameters/debug

==================
STARTING THE CACHE
==================

The cache is started by running the daemon.  The daemon opens the cache device,
configures the cache and tells it to begin caching.  At that point the cache
binds to fscache and the cache becomes live.

The daemon is run as follows:

    /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]

The flags are:

 (*) -d

     Increase the debugging level.  This can be specified multiple times and
     is cumulative with itself.

 (*) -s

     Send messages to stderr instead of syslog.

 (*) -n

     Don't daemonise and go into background.

 (*) -f <configfile>

     Use an alternative configuration file rather than the default one.

===============
THINGS TO AVOID
===============

Do not mount other things within the cache as this will cause problems.  The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.

Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.

Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).

Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.

Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.

Do not chmod files in the cache.  The module creates things with minimal
permissions to prevent random users being able to access them directly.

=============
CACHE CULLING
=============

The cache may need culling occasionally to make space.  This involves
discarding objects from the cache that have been used less recently than
anything else.  Culling is based on the access time of data objects.  Empty
directories are culled if not in use.

Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem.  There are six
"limits":

 (*) brun
 (*) frun

     If the amount of free space and the number of available files in the cache
     rises above both these limits, then culling is turned off.

 (*) bcull
 (*) fcull

     If the amount of available space or the number of available files in the
     cache falls below either of these limits, then culling is started.

 (*) bstop
 (*) fstop

     If the amount of available space or the number of available files in the
     cache falls below either of these limits, then no further allocation of
     disk space or files is permitted until culling has raised things above
     these limits again.

These must be configured thusly:

    0 <= bstop < bcull < brun < 100
    0 <= fstop < fcull < frun < 100

Note that these are percentages of available space and available files, and do
_not_ appear as 100 minus the percentage displayed by the "df" program.

The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order.  A new scan of the cache is
started as soon as space is made in the table.  Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.

===============
CACHE STRUCTURE
===============

The CacheFiles module will create two directories in the directory it was
given:

 (*) cache/

 (*) graveyard/

The active cache objects all reside in the first directory.  The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard from which the daemon will actually delete them.

The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.

The module represents index objects as directories with the filename "I..." or
"J...".  Note that the "cache/" directory is itself a special index.

Data objects are represented as files if they have no children, or directories
if they do.  Their filenames all begin "D..." or "E...".  If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.

Special objects are similar to data objects, except their filenames begin
"S..." or "T...".

If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended.
Into this directory, if possible, will be placed the representations of the
child objects:

    INDEX     INDEX      INDEX                             DATA FILES
    ========= ========== ================================= ================
    cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
    cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
    cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
    cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry

If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the object
inside the last directory.  The names of the intermediate directories will have
'+' prepended:

    J1223/@23/+xy...z/+kl...m/Epqr

Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.

To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable.  The two versions of
object filenames indicate the encoding:

    OBJECT TYPE     PRINTABLE       ENCODED
    =============== =============== ===============
    Index           "I..."          "J..."
    Data            "D..."          "E..."
    Special         "S..."          "T..."

Intermediate directories are always "@" or "+" as appropriate.

Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs.
The latter is used to detect stale objects in the cache and update
or retire them.

Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).

==========================
SECURITY MODEL AND SELINUX
==========================

CacheFiles is implemented to deal properly with the LSM security features of
the Linux kernel and the SELinux facility.

One of the problems that CacheFiles faces is that it is generally acting on
behalf of a process, and running in that process's context, and that includes a
security context that is not appropriate for accessing the cache - either
because the files in the cache are inaccessible to that process, or because if
the process creates a file in the cache, that file may be inaccessible to other
processes.

The way CacheFiles works is to temporarily change the security context (fsuid,
fsgid and actor security label) that the process acts as - without changing the
security context of the process when it is the target of an operation performed
by some other process (so signalling and suchlike still work correctly).

When the CacheFiles module is asked to bind to its cache, it:

 (1) Finds the security label attached to the root cache directory and uses
     that as the security label with which it will create files.  By default,
     this is:

         cachefiles_var_t

 (2) Finds the security label of the process which issued the bind request
     (presumed to be the cachefilesd daemon), which by default will be:

         cachefilesd_t

     and asks LSM to supply a security ID as which it should act given the
     daemon's label.  By default, this will be:

         cachefiles_kernel_t

     SELinux transitions the daemon's security ID to the module's security ID
     based on a rule of this form in the policy.

         type_transition <daemon's-ID> kernel_t : process <module's-ID>;

     For instance:

         type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;

The module's security ID gives it permission to create, move and remove files
and directories in the cache, to find and access directories and files in the
cache, to set and access extended attributes on cache objects, and to read and
write files in the cache.

The daemon's security ID gives it only a very restricted set of permissions: it
may scan directories, stat files and erase files and directories.  It may
not read or write files in the cache, and so it is precluded from accessing the
data cached therein; nor is it permitted to create new files in the cache.

There are policy source files available in:

    http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2

and later versions.  In that tarball, see the files:

    cachefilesd.te
    cachefilesd.fc
    cachefilesd.if

They are built and installed directly by the RPM.

If a non-RPM based system is being used, then copy the above files to their own
directory and run:

    make -f /usr/share/selinux/devel/Makefile
    semodule -i cachefilesd.pp

You will need checkpolicy and selinux-policy-devel installed prior to the
build.

By default, the cache is located in /var/fscache, but if it is desirable that
it should be elsewhere, then either the above policy files must be altered, or
an auxiliary policy must be installed to label the alternate location of the
cache.

For instructions on how to add an auxiliary policy to enable the cache to be
located elsewhere when SELinux is in enforcing mode, please see:

    /usr/share/doc/cachefilesd-*/move-cache.txt

when the cachefilesd RPM is installed; alternatively, the document can be found
in the sources.

==================
A NOTE ON SECURITY
==================

CacheFiles makes use of the split security in the task_struct.
It allocates its own task_security structure, and redirects current->act_as to
point to it when it acts on behalf of another process, in that process's
context.

The reason it does this is that it calls vfs_mkdir() and suchlike rather than
bypassing security and calling inode ops directly.  Therefore the VFS and LSM
may deny the CacheFiles access to the cache data because under some
circumstances the caching code is running in the security context of whatever
process issued the original syscall on the netfs.

Furthermore, should CacheFiles create a file or directory, the security
parameters with which that object is created (UID, GID, security label) would
be derived from the process that issued the system call, thus potentially
preventing other processes from accessing the cache - including CacheFiles's
cache management daemon (cachefilesd).

What is required is to temporarily override the security of the process that
issued the system call.  We can't, however, just do an in-place change of the
security data as that affects the process as an object, not just as a subject.
This means it may lose signals or ptrace events for example, and affects what
the process looks like in /proc.

So CacheFiles makes use of a logical split in the security between the
objective security (task->sec) and the subjective security (task->act_as).  The
objective security holds the intrinsic security properties of a process and is
never overridden.  This is what appears in /proc, and is what is used when a
process is the target of an operation by some other process (SIGKILL for
example).

The subjective security holds the active security properties of a process, and
may be overridden.
This is not seen externally, and is used when a process
acts upon another object, for example SIGKILLing another process or opening a
file.

LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label.

This documentation is added by the patch to:

    Documentation/filesystems/caching/cachefiles.txt

Signed-Off-By: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
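
The culling limits documented in the CacheFiles commit above can be checked and applied as in the following sketch. It is a userspace model under assumptions: struct cull_limits, limits_valid(), update_culling() and may_allocate() are hypothetical names, and availability is passed in as whole percentages rather than being read from the filesystem.

```c
#include <assert.h>
#include <stdbool.h>

/* One set of limits, as percentages of available blocks or files. */
struct cull_limits {
    int stop;   /* bstop / fstop */
    int cull;   /* bcull / fcull */
    int run;    /* brun  / frun  */
};

/* Enforce 0 <= stop < cull < run < 100. */
static bool limits_valid(const struct cull_limits *l)
{
    return 0 <= l->stop && l->stop < l->cull &&
           l->cull < l->run && l->run < 100;
}

/*
 * Culling starts when EITHER resource falls below its cull limit and
 * stops only when BOTH resources rise above their run limits.
 */
static bool update_culling(bool culling,
                           const struct cull_limits *b,
                           const struct cull_limits *f,
                           int bavail, int favail)
{
    if (bavail < b->cull || favail < f->cull)
        return true;
    if (bavail > b->run && favail > f->run)
        return false;
    return culling;     /* between the limits: state is unchanged */
}

/* Allocation is refused whilst either resource is below its stop limit. */
static bool may_allocate(const struct cull_limits *b,
                         const struct cull_limits *f,
                         int bavail, int favail)
{
    return bavail >= b->stop && favail >= f->stop;
}
```

The gap between the cull and run limits gives hysteresis, so culling does not flap on and off when free space hovers around a single threshold.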
FS-Cache: Add and document asynchronous operation handling

Add and document asynchronous operation handling for use by FS-Cache's data
storage and retrieval routines.

The following documentation is added to:

    Documentation/filesystems/caching/operations.txt

    ================================
    ASYNCHRONOUS OPERATIONS HANDLING
    ================================

========
OVERVIEW
========

FS-Cache has an asynchronous operations handling facility that it uses for its
data storage and retrieval routines.  Its operations are represented by
fscache_operation structs, though these are usually embedded into some other
structure.

This facility is available to and expected to be used by the cache backends,
and FS-Cache will create operations and pass them off to the appropriate cache
backend for completion.

To make use of this facility, <linux/fscache-cache.h> should be #included.

===============================
OPERATION RECORD INITIALISATION
===============================

An operation is recorded in an fscache_operation struct:

    struct fscache_operation {
        union {
            struct work_struct fast_work;
            struct slow_work slow_work;
        };
        unsigned long flags;
        fscache_operation_processor_t processor;
        ...
    };

Someone wanting to issue an operation should allocate something with this
struct embedded in it.  They should initialise it by calling:

    void fscache_operation_init(struct fscache_operation *op,
                                fscache_operation_release_t release);

with the operation to be initialised and the release function to use.

The op->flags parameter should be set to indicate the CPU time provision and
the exclusivity (see the Parameters section).

The op->fast_work, op->slow_work and op->processor fields should be set as
appropriate for the CPU time provision (see the Parameters section).

FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
operation and waited for afterwards.

==========
PARAMETERS
==========

There are a number of parameters that can be set in the operation record's flag
parameter.
There are three options for the provision of CPU time in these
operations:

 (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD).  A thread
     may decide it wants to handle an operation itself without deferring it to
     another thread.

     This is, for example, used in read operations for calling readpages() on
     the backing filesystem in CacheFiles.  Although readpages() does an
     asynchronous data fetch, the determination of whether pages exist is done
     synchronously - and the netfs does not proceed until this has been
     determined.

     If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
     before submitting the operation, and the operating thread must wait for it
     to be cleared before proceeding:

         wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
                     fscache_wait_bit, TASK_UNINTERRUPTIBLE);

 (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
     will be given to keventd to process.  Such an operation is not permitted
     to sleep on I/O.

     This is, for example, used by CacheFiles to copy data from a backing fs
     page to a netfs page after the backing fs has read the page in.

     If this option is used, op->fast_work and op->processor must be
     initialised before submitting the operation:

         INIT_WORK(&op->fast_work, do_some_work);

 (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
     will be given to the slow work facility to process.  Such an operation is
     permitted to sleep on I/O.

     This is, for example, used by FS-Cache to handle background writes of
     pages that have just been fetched from a remote server.

     If this option is used, op->slow_work and op->processor must be
     initialised before submitting the operation:

         fscache_operation_init_slow(op, processor)

Furthermore, operations may be one of two types:

 (1) Exclusive (FSCACHE_OP_EXCLUSIVE).  Operations of this type may not run in
     conjunction with any other operation on the object being operated upon.

     An example of this is the attribute change operation, in which the file
     being written to may need truncation.

 (2) Shareable.  Operations of this type may be running simultaneously.  It's
     up to the operation implementation to prevent interference between other
     operations running at the same time.

=========
PROCEDURE
=========

Operations are used through the following procedure:

 (1) The submitting thread must allocate the operation and initialise it
     itself.  Normally this would be part of a more specific structure with the
     generic op embedded within.

 (2) The submitting thread must then submit the operation for processing using
     one of the following two functions:

         int fscache_submit_op(struct fscache_object *object,
                               struct fscache_operation *op);

         int fscache_submit_exclusive_op(struct fscache_object *object,
                                         struct fscache_operation *op);

     The first function should be used to submit non-exclusive ops and the
     second to submit exclusive ones.  The caller must still set the
     FSCACHE_OP_EXCLUSIVE flag.

     If successful, both functions will assign the operation to the specified
     object and return 0.  -ENOBUFS will be returned if the object specified is
     permanently unavailable.

     The operation manager will defer operations on an object that is still
     undergoing lookup or creation.  The operation will also be deferred if an
     operation of conflicting exclusivity is in progress on the object.

     If the operation is asynchronous, the manager will retain a reference to
     it, so the caller should put their reference to it by passing it to:

         void fscache_put_operation(struct fscache_operation *op);

 (3) If the submitting thread wants to do the work itself, and has marked the
     operation with FSCACHE_OP_MYTHREAD, then it should monitor
     FSCACHE_OP_WAITING as described above and check the state of the object if
     necessary (the object might have died whilst the thread was waiting).

     When it has finished doing its processing, it should call
     fscache_put_operation() on it.

 (4) The operation holds an effective lock upon the object, preventing other
     exclusive ops conflicting until it is released.  The operation can be
     enqueued for further immediate asynchronous processing by adjusting the
     CPU time provisioning option if necessary, eg:

         op->flags &= ~FSCACHE_OP_TYPE;
         op->flags |= FSCACHE_OP_FAST;

     and calling:

         void fscache_enqueue_operation(struct fscache_operation *op)

     This can be used to allow other things to have use of the worker thread
     pools.

=====================
ASYNCHRONOUS CALLBACK
=====================

When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation.  This should then get at the
container struct by using container_of():

    static void fscache_write_op(struct fscache_operation *_op)
    {
        struct fscache_storage *op =
            container_of(_op, struct fscache_storage, op);
        ...
    }

The caller holds a reference on the operation, and will invoke
fscache_put_operation() when the processor function returns.  The processor
function is at liberty to call fscache_enqueue_operation() or to take extra
references.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
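
The embedded-operation pattern and the re-provisioning step from the operations document above can be tried out in userspace. This is a sketch under assumptions: struct operation, struct storage_op and the OP_* flag values are hypothetical stand-ins, not the kernel's definitions, and container_of() is re-created here with offsetof().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical flag layout: low nibble selects the CPU time provision. */
#define OP_TYPE_MASK  0x0fUL
#define OP_FAST       0x01UL
#define OP_MYTHREAD   0x02UL
#define OP_EXCLUSIVE  0x10UL

/* Stand-in for the generic operation record. */
struct operation {
    unsigned long flags;
    void (*processor)(struct operation *op);
};

/* A more specific structure with the generic op embedded within. */
struct storage_op {
    int pages_written;
    struct operation op;
};

/* Processor callback: recover the container from the embedded op. */
static void write_op(struct operation *_op)
{
    struct storage_op *op = container_of(_op, struct storage_op, op);

    op->pages_written++;
}

/* Switch the CPU time provision whilst preserving the other flags. */
static unsigned long set_provision(unsigned long flags, unsigned long type)
{
    flags &= ~OP_TYPE_MASK;   /* clear the old provision bits */
    flags |= type;            /* set the new one */
    return flags;
}
```

Note that the new type is OR'd in directly; OR-ing in its complement would set every other bit in the flags word, including the exclusivity flag.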
FS-Cache: Object management state machine

Implement the cache object management state machine.

The following documentation is added to illuminate the working of this state
machine.  It will also be added as:

    Documentation/filesystems/caching/object.txt

    ====================================================
    IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
    ====================================================

==============
REPRESENTATION
==============

FS-Cache maintains an in-kernel representation of each object that a netfs is
currently interested in.  Such objects are represented by the fscache_cookie
struct and are referred to as cookies.

FS-Cache also maintains a separate in-kernel representation of the objects that
a cache backend is currently actively caching.  Such objects are represented by
the fscache_object struct.  The cache backends allocate these upon request, and
are expected to embed them in their own representations.  These are referred to
as objects.

There is a 1:N relationship between cookies and objects.  A cookie may be
represented by multiple objects - an index may exist in more than one cache -
or even by no objects (it may not be cached).

Furthermore, both cookies and objects are hierarchical.
The two hierarchies correspond, but the cookies tree is a superset of the
union of the object trees of multiple caches:

      NETFS INDEX TREE             :      CACHE 1     :      CACHE 2
                                   :                  :
                                   :   +-----------+  :
                          +----------->|  IObject  |  :
      +-----------+       |        :   +-----------+  :
      |  ICookie  |-------+        :         |        :
      +-----------+       |        :         |        :   +-----------+
            |             +------------------------------>|  IObject  |
            |                      :         |        :   +-----------+
            |                      :         V        :         |
            |                      :   +-----------+  :         |
            V             +----------->|  IObject  |  :         |
      +-----------+       |        :   +-----------+  :         |
      |  ICookie  |-------+        :         |        :         V
      +-----------+       |        :         |        :   +-----------+
            |             +------------------------------>|  IObject  |
      +-----+-----+                :         |        :   +-----------+
      |           |                :         |        :         |
      V           |                :         V        :         |
+-----------+     |                :   +-----------+  :         |
|  ICookie  |------------------------->|  IObject  |  :         |
+-----------+     |                :   +-----------+  :         |
      |           V                :         |        :         V
      |     +-----------+          :         |        :   +-----------+
      |     |  ICookie  |-------------------------------->|  IObject  |
      |     +-----------+          :         |        :   +-----------+
      V           |                :         V        :         |
+-----------+     |                :   +-----------+  :         |
|  DCookie  |------------------------->|  DObject  |  :         |
+-----------+     |                :   +-----------+  :         |
                  |                :                  :         |
          +-------+-------+        :                  :         |
          |               |        :                  :         |
          V               V        :                  :         V
    +-----------+   +-----------+  :                  :   +-----------+
    |  DCookie  |   |  DCookie  |------------------------>|  DObject  |
    +-----------+   +-----------+  :                  :   +-----------+
                                   :                  :

In the above illustration, ICookie and IObject represent indices and DCookie
and DObject represent data storage objects.  Indices may have representation in
multiple caches, but currently, non-index objects may not.  Objects of any type
may also be entirely unrepresented.

As far as the netfs API goes, the netfs is only actually permitted to see
pointers to the cookies.  The cookies themselves and any objects attached to
those cookies are hidden from it.

===============================
OBJECT MANAGEMENT STATE MACHINE
===============================

Within FS-Cache, each active object is managed by its own individual state
machine.  The state for an object is kept in the fscache_object struct, in
object->state.
A cookie may point to a set of objects that are in different
states.

Each state has an action associated with it that is invoked when the machine
wakes up in that state.  There are four logical sets of states:

 (1) Preparation: states that wait for the parent objects to become ready.  The
     representations are hierarchical, and it is expected that an object must
     be created or accessed with respect to its parent object.

 (2) Initialisation: states that perform lookups in the cache and validate
     what's found and that create on disk any missing metadata.

 (3) Normal running: states that allow netfs operations on objects to proceed
     and that update the state of objects.

 (4) Termination: states that detach objects from their netfs cookies, that
     delete objects from disk, that handle disk and system errors and that free
     up in-memory resources.

In most cases, transitioning between states is in response to signalled events.
When a state has finished processing, it will usually set the mask of events in
which it is interested (object->event_mask) and relinquish the worker thread.
Then when an event is raised (by calling fscache_raise_event()), if the event
is not masked, the object will be queued for processing (by calling
fscache_enqueue_object()).

PROVISION OF CPU TIME
---------------------

The work to be done by the various states is given CPU time by the threads of
the slow work facility (see Documentation/slow-work.txt).  This is used in
preference to the workqueue facility because:

 (1) Threads may be completely occupied for very long periods of time by a
     particular work item.  These state actions may be doing sequences of
     synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
     getxattr, truncate, unlink, rmdir, rename).

 (2) Threads may do little actual work, but may rather spend a lot of time
     sleeping on I/O.
     This means that single-threaded and 1-per-CPU-threaded
     workqueues don't necessarily have the right numbers of threads.


LOCKING SIMPLIFICATION
----------------------

Because only one worker thread may be operating on any particular object's
state machine at once, locking is simplified, particularly with respect to
disconnecting the netfs's representation of a cache object (fscache_cookie)
from the cache backend's representation (fscache_object) - which may be
requested from either end.


=================
THE SET OF STATES
=================

The object state machine has a set of states that it can be in. There are
preparation states in which the object sets itself up and waits for its parent
object to transit to a state that allows access to its children:

 (1) State FSCACHE_OBJECT_INIT.

     Initialise the object and wait for the parent object to become active. In
     the cache, it is expected that it will not be possible to look an object
     up from the parent object, until that parent object itself has been looked
     up.

There are initialisation states in which the object sets itself up and accesses
disk for the object metadata:

 (2) State FSCACHE_OBJECT_LOOKING_UP.

     Look up the object on disk, using the parent as a starting point.
     FS-Cache expects the cache backend to probe the cache to see whether this
     object is represented there, and if it is, to see if it's valid (coherency
     management).

     The cache should call fscache_object_lookup_negative() to indicate lookup
     failure for whatever reason, and should call fscache_obtained_object() to
     indicate success.

     At the completion of lookup, FS-Cache will let the netfs go ahead with
     read operations, no matter whether the file is yet cached. If not yet
     cached, read operations will be immediately rejected with ENODATA until
     the first known page is uncached - as up to that point there can be no
     data to be read out of the cache for that file that isn't currently also
     held in the pagecache.

 (3) State FSCACHE_OBJECT_CREATING.

     Create an object on disk, using the parent as a starting point. This
     happens if the lookup failed to find the object, or if the object's
     coherency data indicated what's on disk is out of date. In this state,
     FS-Cache expects the cache to create the object.

     The cache should call fscache_obtained_object() if creation completes
     successfully, fscache_object_lookup_negative() otherwise.

     At the completion of creation, FS-Cache will start processing write
     operations the netfs has queued for an object. If creation failed, the
     write ops will be transparently discarded, and nothing recorded in the
     cache.

There are some normal running states in which the object spends its time
servicing netfs requests:

 (4) State FSCACHE_OBJECT_AVAILABLE.

     A transient state in which pending operations are started, child objects
     are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
     lookup data is freed.

 (5) State FSCACHE_OBJECT_ACTIVE.

     The normal running state. In this state, requests the netfs makes will be
     passed on to the cache.

 (6) State FSCACHE_OBJECT_UPDATING.

     The state machine comes here to update the object in the cache from the
     netfs's records. This involves updating the auxiliary data that is used
     to maintain coherency.

And there are terminal states in which an object cleans itself up, deallocates
memory and potentially deletes stuff from disk:

 (7) State FSCACHE_OBJECT_LC_DYING.

     The object comes here if it is dying because of a lookup or creation
     error. This would be due to a disk error or system error of some sort.
     Temporary data is cleaned up, and the parent is released.

 (8) State FSCACHE_OBJECT_DYING.

     The object comes here if it is dying due to an error, because its parent
     cookie has been relinquished by the netfs or because the cache is being
     withdrawn.

     Any child objects waiting on this one are given CPU time so that they too
     can destroy themselves. This object waits for all its children to go away
     before advancing to the next state.

 (9) State FSCACHE_OBJECT_ABORT_INIT.

     The object comes to this state if it was waiting on its parent in
     FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
     so that the parent may proceed from the FSCACHE_OBJECT_DYING state.

(10) State FSCACHE_OBJECT_RELEASING.
(11) State FSCACHE_OBJECT_RECYCLING.

     The object comes to one of these two states when dying once it is rid of
     all its children, if it is dying because the netfs relinquished its
     cookie. In the first state, the cached data is expected to persist, and
     in the second it will be deleted.

(12) State FSCACHE_OBJECT_WITHDRAWING.

     The object transits to this state if the cache decides it wants to
     withdraw the object from service, perhaps to make space, but also due to
     error or just because the whole cache is being withdrawn.

(13) State FSCACHE_OBJECT_DEAD.

     The object transits to this state when the in-memory object record is
     ready to be deleted. The object processor shouldn't ever see an object in
     this state.


THE SET OF EVENTS
-----------------

There are a number of events that can be raised to an object state machine:

 (*) FSCACHE_OBJECT_EV_UPDATE

     The netfs requested that an object be updated. The state machine will ask
     the cache backend to update the object, and the cache backend will ask the
     netfs for details of the change through its cookie definition ops.

 (*) FSCACHE_OBJECT_EV_CLEARED

     This is signalled in two circumstances:

     (a) when an object's last child object is dropped and

     (b) when the last operation outstanding on an object is completed.

     This is used to proceed from the dying state.

 (*) FSCACHE_OBJECT_EV_ERROR

     This is signalled when an I/O error occurs during the processing of some
     object.

 (*) FSCACHE_OBJECT_EV_RELEASE
 (*) FSCACHE_OBJECT_EV_RETIRE

     These are signalled when the netfs relinquishes a cookie it was using.
     The event selected depends on whether the netfs asks for the backing
     object to be retired (deleted) or retained.

 (*) FSCACHE_OBJECT_EV_WITHDRAW

     This is signalled when the cache backend wants to withdraw an object.
     This means that the object will have to be detached from the netfs's
     cookie.

Because the withdrawing and releasing/retiring events are all handled by the
object state machine, it doesn't matter if there's a collision with both ends
trying to sever the connection at the same time. The state machine can just
pick which one it wants to honour, and that effects the other.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
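
The event handling described under OBJECT MANAGEMENT STATE MACHINE (a state
sets object->event_mask to the events it is interested in, and a raised event
queues the object only if it is not masked) can be illustrated with a small
userspace sketch. This is not the kernel implementation: struct toy_object,
toy_raise_event() and the TOY_EV_* constants are hypothetical names invented
for this illustration, the "queue" is just a counter, and treating a set bit in
event_mask as "interested" is an assumption drawn from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event numbers; the names only echo the events listed above. */
enum toy_event {
	TOY_EV_UPDATE,
	TOY_EV_CLEARED,
	TOY_EV_ERROR,
	TOY_EV_WITHDRAW,
	TOY_EV__COUNT
};

/* Toy stand-in for the per-object state machine bookkeeping. */
struct toy_object {
	unsigned long events;     /* events raised but not yet processed */
	unsigned long event_mask; /* events the current state is interested in */
	unsigned int queued;      /* how many times the object was queued */
};

/*
 * Record an event against the object.  The object is queued for processing
 * only if the event was not already pending and the current state has marked
 * it as interesting in event_mask.  Returns true if the object was queued.
 */
static bool toy_raise_event(struct toy_object *obj, enum toy_event ev)
{
	unsigned long bit;
	bool already_pending;

	assert(obj != 0);
	assert(ev >= 0 && ev < TOY_EV__COUNT);

	bit = 1UL << ev;
	already_pending = (obj->events & bit) != 0;
	obj->events |= bit;	/* the event is remembered even when masked */

	if (already_pending || (obj->event_mask & bit) == 0)
		return false;

	obj->queued++;		/* stands in for queuing to a worker thread */
	return true;
}
```

In this sketch a masked event is still recorded in obj->events, so a later
state that widens its event_mask could notice it. Whether and how the real
state machine re-examines pending events is not stated in the text above.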