For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no
effect in xfs_debug:
fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles':
fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect
The reason for that is that the various new xfs message functions have a
return value which is never used, and in case of the non-debug build
xfs_debug the macro evaluates to a plain 0 which produces the above
warnings. This can be fixed by turning xfs_debug into an inline function
instead of a macro, but in addition to that I've also changed all the
message helpers to return void as we never use their return values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
GCC 4.6 now warnings about variables set but not used. Fix the trivially
fixable warnings of this sort.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When we are short on memory, we want to expedite the cleaning of
dirty objects. Hence when we run short on memory, we need to kick
the AIL flushing into action to clean as many dirty objects as
quickly as possible. To implement this, sample the lsn of the log
item at the head of the AIL and use that as the push target for the
AIL flush.
Further, we keep items in the AIL that are dirty that are not
tracked any other way, so we can get objects sitting in the AIL that
don't get written back until the AIL is pushed. Hence to get the
filesystem to the idle state, we might need to push the AIL to flush
out any remaining dirty objects sitting in the AIL. This requires
the same push mechanism as the reclaim push.
This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
match the new xfs_ail_max_lsn() function introduced in this patch.
Similarly for xfs_trans_ail_push -> xfs_ail_push.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Similar to the xfssyncd, the per-filesystem xfsaild threads can be
converted to a global workqueue and run periodically by delayed
works. This makes sense for the AIL pushing because it uses
variable timeouts depending on the work that needs to be done.
By removing the xfsaild, we simplify the AIL pushing code and
remove the need to spread the code to implement the threading
and pushing across multiple files.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Background inode reclaim needs to run more frequently that the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.
Background inode reclaim is kicked by the act of marking inodes for
reclaim. When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically untill it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for
reclaim.
To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. This kick shoul
d not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.
To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.
These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have a performance impact on performance when
memory is not constrained or the number of concurrent threads
operating is <= to the number of CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
On of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. Thi can result in hundreds of queued flushes, most of
which simply burn CPU scanned and do no real work. This simply slows
down allocation at ENOSPC.
We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.
On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test
regressions.
This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can al be
managed by a global workqueue now they manage concurrency
effectively.
Introduce a new gglobal xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.
For laptop mode, we wait on completion for the sync works, so ensure
that the sync work queuing interface can flush and wait for work to
complete to enable the work queue infrastructure to replace the
current sequence number and wakeup that is used.
Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: stop using the page cache to back the buffer cache
xfs: register the inode cache shrinker before quotachecks
xfs: xfs_trans_read_buf() should return an error on failure
xfs: introduce inode cluster buffer trylocks for xfs_iflush
vmap: flush vmap aliases when mapping fails
xfs: preallocation transactions do not need to be synchronous
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_buf.c due to plug removal.
Now that the buffer cache has it's own LRU, we do not need to use
the page cache to provide persistent caching and reclaim
infrastructure. Convert the buffer cache to use alloc_pages()
instead of the page cache. This will remove all the overhead of page
cache management from setup and teardown of the buffers, as well as
needing to mark pages accessed as we find buffers in the buffer
cache.
By avoiding the page cache, we also remove the need to keep state in
the page_private(page) field for persistant storage across buffer
free/buffer rebuild and so all that code can be removed. This also
fixes the long-standing problem of not having enough bits in the
page_private field to track all the state needed for a 512
sector/64k page setup.
It also removes the need for page locking during reads as the pages
are unique to the buffer and nobody else will be attempting to
access them.
Finally, it removes the buftarg address space lock as a point of
global contention on workloads that allocate and free buffers
quickly such as when creating or removing large numbers of inodes in
parallel. This remove the 16TB limit on filesystem size on 32 bit
machines as the page index (32 bit) is no longer used for lookups
of metadata buffers - the buffer cache is now solely indexed by disk
address which is stored in a 64 bit field in the buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
During mount, we can do a quotacheck that involves a bulkstat pass
on all inodes. If there are more inodes in the filesystem than can
be held in memory, we require the inode cache shrinker to run to
ensure that we don't run out of memory.
Unfortunately, the inode cache shrinker is not registered until we
get to the end of the superblock setup process, which is after a
quotacheck is run if it is needed. Hence we need to register the
inode cache shrinker earlier in the mount process so that we don't
OOM during mount. This requires that we also initialise the syncd
work before we register the shrinker, so we nee dto juggle that
around as well.
While there, make sure that we have set up the block sizes in the
VFS superblock correctly before the quotacheck is run so that any
inodes that are cached as a result of the quotacheck have their
block size fields set up correctly.
Cc: stable@kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
There is an ABBA deadlock between synchronous inode flushing in
xfs_reclaim_inode and xfs_icluster_free. xfs_icluster_free locks the
buffer, then takes inode ilocks, whilst synchronous reclaim takes
the ilock followed by the buffer lock in xfs_iflush().
To avoid this deadlock, separate the inode cluster buffer locking
semantics from the synchronous inode flush semantics, allowing
callers to attempt to lock the buffer but still issue synchronous IO
if it can get the buffer. This requires xfs_iflush() calls that
currently use non-blocking semantics to pass SYNC_TRYLOCK rather
than 0 as the flags parameter.
This allows xfs_reclaim_inode to avoid the deadlock on the buffer
lock and detect the failure so that it can drop the inode ilock and
restart the reclaim attempt on the inode. This allows
xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
release it and hence defuse the deadlock situation. It also has the
pleasant side effect of avoiding IO in xfs_reclaim_inode when it
tries to next reclaim the inode as it is now marked stale.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
On 32 bit systems, vmalloc space is limited and XFS can chew through
it quickly as the vmalloc space is lazily freed. This can result in
failure to map buffers, even when there is apparently large amounts
of vmalloc space available. Hence, if we fail to map a buffer, purge
the aliases that have not yet been freed to hopefuly free up enough
vmalloc space to allow a retry to succeed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Preallocation and hole punch transactions are currently synchronous
and this is causing performance problems in some cases. The
transactions don't need to be synchronous as we don't need to
guarantee the preallocation is persistent on disk until a
fdatasync, fsync, sync operation occurs. If the file is opened
O_SYNC or O_DATASYNC, only then should the transaction be issued
synchronously.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (23 commits)
xfs: don't name variables "panic"
xfs: factor agf counter updates into a helper
xfs: clean up the xfs_alloc_compute_aligned calling convention
xfs: kill support/debug.[ch]
xfs: Convert remaining cmn_err() callers to new API
xfs: convert the quota debug prints to new API
xfs: rename xfs_cmn_err_fsblock_zero()
xfs: convert xfs_fs_cmn_err to new error logging API
xfs: kill xfs_fs_mount_cmn_err() macro
xfs: kill xfs_fs_repair_cmn_err() macro
xfs: convert xfs_cmn_err to xfs_alert_tag
xfs: Convert xlog_warn to new logging interface
xfs: Convert linux-2.6/ files to new logging interface
xfs: introduce new logging API.
xfs: zero proper structure size for geometry calls
xfs: enable delaylog by default
xfs: more sensible inode refcounting for ialloc
xfs: stop using xfs_trans_iget in the RT allocator
xfs: check if device support discard in xfs_ioc_trim()
xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
AppArmor: kill unused macros in lsm.c
AppArmor: cleanup generated files correctly
KEYS: Add an iovec version of KEYCTL_INSTANTIATE
KEYS: Add a new keyctl op to reject a key with a specified error code
KEYS: Add a key type op to permit the key description to be vetted
KEYS: Add an RCU payload dereference macro
AppArmor: Cleanup make file to remove cruft and make it easier to read
SELinux: implement the new sb_remount LSM hook
LSM: Pass -o remount options to the LSM
SELinux: Compute SID for the newly created socket
SELinux: Socket retains creator role and MLS attribute
SELinux: Auto-generate security_is_socket_class
TOMOYO: Fix memory leak upon file open.
Revert "selinux: simplify ioctl checking"
selinux: drop unused packet flow permissions
selinux: Fix packet forwarding checks on postrouting
selinux: Fix wrong checks for selinux_policycap_netpeer
selinux: Fix check for xfrm selinux context algorithm
ima: remove unnecessary call to ima_must_measure
IMA: remove IMA imbalance checking
...
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix build failure introduced by s/freezeable/freezable/
workqueue: add system_freezeable_wq
rds/ib: use system_wq instead of rds_ib_fmr_wq
net/9p: replace p9_poll_task with a work
net/9p: use system_wq instead of p9_mux_wq
xfs: convert to alloc_workqueue()
reiserfs: make commit_wq use the default concurrency level
ocfs2: use system_wq instead of ocfs2_quota_wq
ext4: convert to alloc_workqueue()
scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
misc/iwmc3200top: use system_wq instead of dedicated workqueues
i2o: use alloc_workqueue() instead of create_workqueue()
acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
fs/aio: aio_wq isn't used in memory reclaim path
input/tps6507x-ts: use system_wq instead of dedicated workqueue
cpufreq: use system_wq instead of dedicated workqueues
wireless/ipw2x00: use system_wq instead of dedicated workqueues
arm/omap: use system_wq in mailbox
workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new xfs_alert_tag() used a variable named "panic",
and that is to be avoided. Rename it.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The remaining functionality in debug.[ch] is effectively just assert
handling, conditional debug definitions and hex dumping. The hex
dumping and assert function can be moved into the new printk module,
while the rest can be moved into top-level header files. This allows
fs/xfs/support/debug.[ch] to be completely removed from the
codebase.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the files in fs/xfs/linux-2.6/ to use the new xfs_<level>
logging format that replaces the old Irix inherited cmn_err()
interfaces. While there, also convert naked printk calls to use the
relevant xfs logging function to standardise output format.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit 493f3358cb added this call to
xfs_fs_geometry() in order to avoid passing kernel stack data back
to user space:
+ memset(geo, 0, sizeof(*geo));
Unfortunately, one of the callers of that function passes the
address of a smaller data type, cast to fit the type that
xfs_fs_geometry() requires. As a result, this can happen:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
in: f87aca93
Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
Call Trace:
[<c12991ac>] ? panic+0x50/0x150
[<c102ed71>] ? __stack_chk_fail+0x10/0x18
[<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]
Fix this by fixing that one caller to pass the right type and then
copy out the subset it is interested in.
Note: This patch is an alternative to one originally proposed by
Eric Sandeen.
Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Most of the logging infrastructure in XFS is unneccessary and
designed around the infrastructure supplied by Irix rather than
Linux. To rationalise the logging interfaces, start by introducing
simple printk wrappers similar to the dev_printk() infrastructure.
Later patches will convert code to use this new interface.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Commit 493f3358cb added this call to
xfs_fs_geometry() in order to avoid passing kernel stack data back
to user space:
+ memset(geo, 0, sizeof(*geo));
Unfortunately, one of the callers of that function passes the
address of a smaller data type, cast to fit the type that
xfs_fs_geometry() requires. As a result, this can happen:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
in: f87aca93
Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
Call Trace:
[<c12991ac>] ? panic+0x50/0x150
[<c102ed71>] ? __stack_chk_fail+0x10/0x18
[<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]
Fix this by fixing that one caller to pass the right type and then
copy out the subset it is interested in.
Note: This patch is an alternative to one originally proposed by
Eric Sandeen.
Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Right now we, are relying on the fact that when we attempt to
actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP
and the user is informed that the device does not support discard.
However, in the case where the we do not hit any suitable free
extent to trim in FITRIM code, it will finish without any error.
This is very confusing, because it seems that FITRIM was successful
even though the device does not actually supports discard.
Solution: Check for the discard support before attempt to search for
free extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Right now we, are relying on the fact that when we attempt to
actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP
and the user is informed that the device does not support discard.
However, in the case where the we do not hit any suitable free
extent to trim in FITRIM code, it will finish without any error.
This is very confusing, because it seems that FITRIM was successful
even though the device does not actually supports discard.
Solution: Check for the discard support before attempt to search for
free extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
SELinux would like to implement a new labeling behavior of newly created
inodes. We currently label new inodes based on the parent and the creating
process. This new behavior would also take into account the name of the
new object when deciding the new label. This is not the (supposed) full path,
just the last component of the path.
This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations. We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly. This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook. If no such name
exists it is fine to pass NULL.
Signed-off-by: Eric Paris <eparis@redhat.com>
Convert from create[_singlethread]_workqueue() to alloc_workqueue().
* xfsdatad_workqueue and xfsconvertd_workqueue are identity converted.
Using higher concurrency limit might be useful but given the
complexity of workqueue usage in xfs, proceeding cautiously seems
better.
* xfs_mru_reap_wq is converted to non-ordered workqueue with max
concurrency of 1 as the work items don't require any specific
ordering and already have proper synchronization. It seems it was
singlethreaded to save worker threads, which is no longer a concern.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Hellwig <hch@infradead.org>
The extent size hint can be set to larger than an AG. This means
that the alignment process can push the range to be allocated
outside the bounds of the AG, resulting in assert failures or
corrupted bmbt records. Similarly, if the extsize is larger than the
maximum extent size supported, the alignment process will produce
extents that are too large to fit into the bmbt records, resulting
in a different type of assert/corruption failure.
Fix this by limiting extsize at the time іt is set firstly to be
less than MAXEXTLEN, then to be a maximum of half the size of the
AGs in the filesystem for non-realtime inodes. Realtime inodes do
not allocate out of AGs, so don't have to be restricted by the size
of AGs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem. This makes the check future proof for any newly
added flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
This patch simply allows XFS to handle the hole punching flag in fallocate
properly. I've tested this with a little program that does a bunch of random
hole punching with FL_KEEP_SIZE and without it to make sure it does the right
thing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We currently have a global error message buffer in cmn_err that is
protected by a spin lock that disables interrupts. Recently there
have been reports of NMI timeouts occurring when the console is
being flooded by SCSI error reports due to cmn_err() getting stuck
trying to print to the console while holding this lock (i.e. with
interrupts disabled). The NMI watchdog is seeing this CPU as
non-responding and so is triggering a panic. While the trigger for
the reported case is SCSI errors, pretty much anything that spams
the kernel log could cause this to occur.
Realistically the only reason that we have the intemediate message
buffer is to prepend the correct kernel log level prefix to the log
message. The only reason we have the lock is to protect the global
message buffer and the only reason the message buffer is global is
to keep it off the stack. Hence if we can avoid needing a global
message buffer we avoid needing the lock, and we can do this with a
small amount of cleanup and some preprocessor tricks:
1. clean up xfs_cmn_err() panic mask functionality to avoid
needing debug code in xfs_cmn_err()
2. remove the couple of "!" message prefixes that still exist that
the existing cmn_err() code steps over.
3. redefine CE_* levels directly to KERN_*
4. redefine cmn_err() and friends to use printk() directly
via variable argument length macros.
By doing this, we can completely remove the cmn_err() code and the
lock that is causing the problems, and rely solely on printk()
serialisation to ensure that we don't get garbled messages.
A series of followup patches is really needed to clean up all the
cmn_err() calls and related messages properly, but that results in a
series that is not easily back portable to enterprise kernels. Hence
this initial fix is only to address the direct problem in the lowest
impact way possible.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
If we get an IO error on a synchronous superblock write, we attach an
error release function to it so that when the last reference goes away
the release function is called and the buffer is invalidated and
unlocked. The buffer is left locked until the release function is
called so that other concurrent users of the buffer will be locked out
until the buffer error is fully processed.
Unfortunately, for the superblock buffer the filesyetm itself holds a
reference to the buffer which prevents the reference count from
dropping to zero and the release function being called. As a result,
once an IO error occurs on a sync write, the buffer will never be
unlocked and all future attempts to lock the buffer will hang.
To make matters worse, this problems is not unique to such buffers;
if there is a concurrent _xfs_buf_find() running, the lookup will grab
a reference to the buffer and then wait on the buffer lock, preventing
the reference count from ever falling to zero and hence unlocking the
buffer.
As such, the whole b_relse function implementation is broken because it
cannot rely on the buffer reference count falling to zero to unlock the
errored buffer. The synchronous write error path is the only path that
uses this callback - it is used to ensure that the synchronous waiter
gets the buffer error before the error state is cleared from the buffer
by the release function.
Given that the only sychronous buffer writes now go through xfs_bwrite
and the error path in question can only occur for a write of a dirty,
logged buffer, we can move most of the b_relse processing to happen
inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
In addition to that we make sure the error is not cleared in
xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
Given that xfs_bwrite keeps the buffer locked until it has waited for
it and checked the error this allows to reliably propagate the error
to the caller, and make sure that the buffer is reliably unlocked.
Given that xfs_buf_iodone_callbacks was the only instance of the
b_relse callback we can remove it entirely.
Based on earlier patches by Dave Chinner and Ajeet Yadav.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Allow manual discards from userspace using the FITRIM ioctl. This is not
intended to be run during normal workloads, as the freepsace btree walks
can cause large performance degradation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
To ensure the log is covered and the filesystem idles correctly, we
need to ensure that dummy transactions hit the disk and do not stay
pinned in memory. If the superblock is pinned in memory, it can't
be flushed so the log covering cannot make progress. The result is
dependent on timing - more oftent han not we continue to issues a
log covering transaction every 36s rather than idling after ~90s.
Fix this by making the log covering transaction synchronous. To
avoid additional log force from xfssyncd, make the log covering
transaction take the place of the existing log force in the xfssyncd
background sync process.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
This merge pulls the XFS master branch into the latest Linus master.
This results in a merge conflict whose best fix is not obvious.
I manually fixed the conflict, in "fs/xfs/xfs_iget.c".
Dave Chinner had done work that resulted in RCU freeing of inodes
separate from what Nick Piggin had done, and their results differed
slightly in xfs_inode_free(). The fix updates Nick's call_rcu()
with the use of VFS_I(), while incorporating needed updates to some
XFS inode fields implemented in Dave's series. Dave's RCU callback
function has also been removed.
Signed-off-by: Alex Elder <aelder@sgi.com>
When two concurrent unaligned, non-overlapping direct IOs are issued
to the same block, the direct Io layer will race to zero the block.
The result is that one of the concurrent IOs will overwrite data
written by the other IO with zeros. This is demonstrated by the
xfsqa test 240.
To avoid this problem, serialise all unaligned direct IOs to an
inode with a big hammer. We need a big hammer approach as we need to
serialise AIO as well, so we can't just block writes on locks.
Hence, the big hammer is calling xfs_ioend_wait() while holding out
other unaligned direct IOs from starting.
We don't bother trying to serialised aligned vs unaligned IOs as
they are overlapping IO and the result of concurrent overlapping IOs
is undefined - the result of either IO is a valid result so we let
them race. Hence we only penalise unaligned IO, which already has a
major overhead compared to aligned IO so this isn't a major problem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The buffered IO and direct IO write paths share a common set of
checks and limiting code prior to issuing the write. Factor that
into a common helper function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Complete the split of the different write IO paths by splitting the
buffered IO write path out of xfs_file_aio_write(). This makes the
different mechanisms of the write patchs easier to follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The current xfs_file_aio_write code is a mess of locking shenanigans
to handle the different locking requirements of buffered and direct
IO. Start to clean this up by disentangling the direct IO path from
the mess.
This also removes the failed direct IO fallback path to buffered IO.
XFS handles all direct IO cases without needing to fall back to
buffered IO, so we can safely remove this unused path. This greatly
simplifies the logic and locking needed in the write path.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We need to obtain the i_mutex, i_iolock and i_ilock during the read
and write paths. Add a set of wrapper functions to neatly
encapsulate the lock ordering and shared/exclusive semantics to make
the locking easier to follow and get right.
Note that this changes some of the exclusive locking serialisation in
that serialisation will occur against the i_mutex instead of the
XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
arguably more efficient to use the mutex for such serialisation than
the rw_sem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_file_aio_write() only returns the error from synchronous
flushing of the data and inode if error == 0. At the point where
error is being checked, it is guaranteed to be > 0. Therefore any
errors returned by the data or fsync flush will never be returned.
Fix the checks so we overwrite the current error once and only if an
error really occurred.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
The log grant ticket wait queues are currently protected by the log
grant lock. However, the queues are functionally independent from
each other, and operations on them only require serialisation
against other queue operations now that all of the other log
variables they use are atomic values.
Hence, we can make them independent of the grant lock by introducing
new locks just to protect the lists operations. because the lists
are independent, we can use a lock per list and ensure that reserve
and write head queuing do not contend.
To ensure forced shutdowns work correctly in conjunction with the
new fast paths, ensure that we check whether the log has been shut
down in the grant functions once we hold the relevant spin locks but
before we go to sleep. This is needed to co-ordinate correctly with
the wakeups that are issued on the ticket queues so we don't leave
any processes sleeping on the queues during a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
log->l_tail_lsn is currently protected by the log grant lock. The
lock is only needed for serialising readers against writers, so we
don't really need the lock if we make the l_tail_lsn variable an
atomic. Converting the l_tail_lsn variable to an atomic64_t means we
can start to peel back the grant lock from various operations.
Also, provide functions to safely crack an atomic LSN variable into
it's component pieces and to recombined the components into an
atomic variable. Use them where appropriate.
This also removes the need for explicitly holding a spinlock to read
the l_tail_lsn on 32 bit platforms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
The log grant queues are one of the few places left using sv_t
constructs for waiting. Given we are touching this code, we should
convert them to plain wait queues. While there, convert all the
other sv_t users in the log code as well.
Seeing as this removes the last users of the sv_t type, remove the
header file defining the wrapper and the fragments that still
reference it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare for switching the grant heads to atomic variables by
combining the two 32 bit values that make up the grant head into a
single 64 bit variable. Provide wrapper functions to combine and
split the grant heads appropriately for calculations and use them as
necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The grant write and reserve queues use a roll-your-own double linked
list, so convert it to a standard list_head structure and convert
all the list traversals to use list_for_each_entry(). We can also
get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty()
check to tell if the ticket is in a list or not.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The xfaild often tries to rest to wait for congestion to pass of for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further
make short sleeps uninterruptible as they occur when the xfsaild has
decided it needs to back off to allow some IO to complete and being
woken early is counter-productive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that the buffer reclaim infrastructure can handle different reclaim
priorities for different types of buffers, reconnect the hooks in the
XFS code that has been sitting dormant since it was ported to Linux. This
should finally give use reclaim prioritisation that is on a par with the
functionality that Irix provided XFS 15 years ago.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsibile for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.
The implementation introduces a b_lru_ref counter into the buffer.
This is currently set to 1 whenever the buffer is referenced and so is used to
determine if the buffer should be added to the LRU or not when freed.
Effectively it allows lazy LRU initialisation of the buffer so we do not need
to touch the LRU list and locks in xfs_buf_find().
Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is none zero
we re-add the buffer reference and add the inode to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter, if
released the LRU reference on the buffer. In the absence of a lookup
race, this will result in the buffer being freed.
This counting mechanism is used instead of a reference flag so that
it is simple to re-introduce buffer-type specific reclaim reference
counts to prioritise reclaim more effectively. We still have all
those hooks in the XFS code, so this will provide the infrastructure
to re-implement that functionality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the
buffer is stopping the tail of the log from being moved forward.
The system remains in this state until a periodic log force issued
by xfssyncd causes the buffer to be unpinned. The main problem is
that these are stale buffers, and are hence held locked until the
transaction/checkpoint that marked them state has been committed to
disk. When the filesystem gets into this state, only the xfssyncd
can cause the async transactions to be committed to disk and hence
unpin the inode buffer.
This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since the move to the new truncate sequence we call xfs_setattr to
truncate down excessively instanciated blocks. As shown by the testcase
in kernel.org BZ #22452 that doesn't work too well. Due to the confusion
of the internal inode size, and the VFS inode i_size it zeroes data that
it shouldn't.
But full blown truncate seems like overkill here. We only instanciate
delayed allocations in the write path, and given that we never released
the iolock we can't have converted them to real allocations yet either.
The only nasty case is pre-existing preallocation which we need to skip.
We already do this for page discard during writeback, so make the delayed
allocation block punching a generic function and call it from the failed
write path as well as xfs_aops_discard_page. The callers are
responsible for ensuring that partial blocks are not truncated away,
and that they hold the ilock.
Based on a fix originally from Christoph Hellwig. This version used
filesystem blocks as the range unit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be a rwlock as there are no read side
holders anymore. Convert it to a spin lock to reflect it's exclusive
nature.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There is
also a lot more write lock acquistions than there are read locks (4:1 ratio)
so the read locking is not really buying us much in the way of parallelism.
To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built
in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so enѕure that ever freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
lookup path, but also add a check for a zero inode number as well.
We canthen convert all the read locking lockups to use RCU read side locking
and hence remove all read side locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
The XFS iolock needs to be re-initialised to a new lock class before
it enters reclaim to prevent lockdep false positives. Unfortunately,
this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
state can be recycled and not re-initialised before being reused.
We need to re-initialise the lock state when transfering out of
XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
class as if the inode was just allocated. Hence we need a specific
lockdep class variable for the iolock so that both initialisations
use the same class.
While there, add a specific class for inodes in the reclaim state so
that it is easy to tell from lockdep reports what state the inode
was in that generated the report.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use a goto label to consolidate all block not found cases, and add a
tracepoint for them. Also clean up a few whitespace issues.
Based on an earlier patch from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Move the buffer locking into the callers as they need to do it
wether they call xfs_map_at_offset or not. Remove the b_bdev
assignment, which is already done by get_blocks. Remove the
duplicate extent type asserts in xfs_convert_page just before
calling xfs_map_at_offset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
After the last patches the code for overwrites is the same as for
delayed and unwritten extents except that it doesn't need to call
xfs_map_at_offset. Take care of that fact to simplify
xfs_vm_writepage.
The buffer loop now first checks the type of buffer and checks/sets
the ioend type, or continues to the next buffer if it's not
interesting to us. Only after that we validate the iomap and
perform the block mapping if needed, all in common code for the
cases where we have to do work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The all_bh flag is always set when entering the page clustering
machinery with a regular written extent, which means the check for
it is superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_map_blocks always calls xfs_bmapi with the XFS_BMAPI_ENTIRE
entire flag, which tells it to not cap the extent at the passed in
size, but just treat the size as an minimum to map. This means
xfs_probe_cluster is entirely useless as we'll always get the whole
extent back anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
No need to lock the extent map exclusive when performing an
overwrite, we know the extent map must already have been loaded by
get_blocks. Apply the non-blocking inode semantics to all mapping
types instead of just delayed allocations. Remove the handling of
not yet allocated blocks for the IO_UNWRITTEN case - if an extent is
marked as unwritten allocated in the buffer it must already have an
extent on disk.
Add asserts to verify all the assumptions above in debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Opencode the xfs_iomap code in it's two callers. The overlap of
passed flags already was minimal and will be further reduced in the
next patch.
As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
for I/O end processing are merged into a single set of flags, which
should be a bit more descriptive of the operation we perform.
Also improve the tracing by giving each caller it's own type set of
tracepoints.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Don't trylock the buffer. We are the only one ever locking it for a
regular file address space, and trylock was only copied from the
generic code which did it due to the old buffer based writeout in
jbd. Also make sure to only write out the buffer if the iomap
actually is valid, because we wouldn't have a proper mapping
otherwise. In practice we will never get an invalid mapping here as
the page lock guarantees truncate doesn't race with us, but better
be safe than sorry. Also make sure we allocate a new ioend when
crossing boundaries between mappings, just like we do for delalloc
and unwritten extents. Again this currently doesn't matter as the
I/O end handler only cares for the boundaries for unwritten extents,
but this makes the code fully correct and the same as for
delalloc/unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We'll never have BIO_EOPNOTSUPP set after calling submit_bio as this
can only happen for discards, and used to happen for barriers, none
of which is every submitted by xfs_submit_ioend_bio. Also remove
the loop around bio_alloc as it will never fail due to it's mempool
backing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we only refuse a "read-only" mapping for writing out
unwritten and delayed buffers, and refuse any other for overwrites.
Improve the checks to require delalloc mappings for delayed buffers,
and unwritten extent mappings for unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We now support mounting and using filesystems with 64-bit inodes
even when not mounted with the inode64 option (which now only
controls if we allocate new inodes in that space or not). Make sure
we always use large NFS file handles when exporting a filesystem
that may contain 64-bit inodes. Note that this only affects newly
generated file handles, any outstanding 32-bit file handle is still
accepted.
[hch: the comment and commit log are mine, the rest is from a patch
snipplet from Samuel]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
In commit 20cb52ebd1, titled
"xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and
uptodate buffers are not dirty. That asserts turns out to trigger a lot
when running fsx on filesystems with small block sizes. The reason for
that is that the assert is simply incorrect. !mapped and uptodate
just mean this buffer covers a hole, and whenever we do a set_page_dirty
we mark all blocks in the page dirty, no matter if they have data or
not. So remove the assert, and update the comment above the condition
to match reality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
XFS does not need it's inodes to actuall be hashed in the VFS inode
cache, but we require the inode to be marked hashed for the
writeback code to work.
Insted of using insert_inode_hash, which requires a second
inode_lock roundtrip after the partial merge of the inode
scalability patches in 2.6.37-rc simply use the new hlist_add_fake
helper to mark it hashed without requiring a lock or touching a
global cache line.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The delayed write buffer split trace currently issues a trace for
every buffer it scans. These buffers are not necessarily queued for
delayed write. Indeed, when buffers are pinned, there can be
thousands of traces of buffers that aren't actually queued for
delayed write and the ones that are are lost in the noise. Move the
trace point to record only buffers that are split out for IO to be
issued on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The walk fails to decrement the per-ag reference count when the
non-blocking walk fails to obtain the per-ag reclaim lock, leading
to an assert failure on debug kernels when unmounting a filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
al_hreq is copied from userland. If al_hreq.buflen is not properly aligned
then xfs_attr_list will ignore the last bytes of kbuf. These bytes are
unitialized. It leads to leaking of contents of kernel stack memory.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We promised to do this for 2.6.37, and the code looks stable enough to
keep that promise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks). There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.
The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion. The latter will lead to more seeky IO.
The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.
We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
that would skip some dirty inodes on congestion and page out others, which
is unfair in terms of LRU age.
Inspired by Christoph Hellwig. Thanks!
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>