Commit graph

41531 commits

Author SHA1 Message Date
Fan Li
f8b703da2c f2fs: fix to update cached_en of extent tree properly
In f2fs_lookup_extent_tree, et->cached_en was read and updated with only
read lock held,
it could cause __lookup_extent_tree within return entirely wrong
extent_node, if other
thread update et->cached_en just before __lookup_extent_tree return.

However, there are two things about this patch that need to be noticed:
1. It does no good to arrange the order of concurrent read/write, the result
would still
be random in such case.
2. It's built on this assumption: the mix up of reads and writes on a single
pointer would
not make the pointer partially wrong at any time. Please let me know if I'm
wrong, thx.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:45:06 -07:00
Junesung Lee
217940d4f0 f2fs: fix typo
Fix typo.

Signed-off-by: Junesung Lee <junesoung412@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-21 22:43:32 -07:00
Jaegeuk Kim
24928634f8 f2fs: check the node block address of newly allocated nid
This patch adds a routine which checks the block address of newly allocated nid.
If an nid has already allocated by other thread due to subtle data races, it
will result in filesystem corruption.
So, it needs to check whether its block address was already allocated or not
in prior to nid allocation as the last chance.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:14 -07:00
Jaegeuk Kim
a21c20f0c8 f2fs: go out for insert_inode_locked failure
We should not call unlock_new_inode when insert_inode_locked failed.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:13 -07:00
Jaegeuk Kim
5ee5293c32 f2fs: retry gc if one section is not successfully reclaimed
If FG_GC failed to reclaim one section, let's retry with another section
from the start, since we can get anoterh good candidate.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:12 -07:00
Jaegeuk Kim
2286c0205d f2fs: fix to cover lock_op for update_inode_page
Previously, update_inode_page is not called under f2fs_lock_op.
Instead we should call with f2fs_write_inode.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:11 -07:00
Jaegeuk Kim
2683446646 f2fs: reuse nids more aggressively
If we can reuse nids as many as possible, we can mitigate producing obsolete
node pages in the page cache.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:11 -07:00
Jaegeuk Kim
26d5859974 f2fs: avoid garbage collecting already moved node blocks
If node blocks were already moved, we don't need to move them again.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:10 -07:00
Jaegeuk Kim
740432f835 f2fs: handle failed bio allocation
As the below comment of bio_alloc_bioset, f2fs can allocate multiple bios at the
same time. So, we can't guarantee that bio is allocated all the time.

"
 *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
 *   able to allocate a bio. This is due to the mempool guarantees. To make this
 *   work, callers must never allocate more than 1 bio at a time from this pool.
 *   Callers that need to allocate more than 1 bio must always submit the
 *   previously allocated bio for IO before attempting to allocate a new one.
 *   Failure to do so can cause deadlocks under memory pressure.
"

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:09 -07:00
Jaegeuk Kim
a6db67f06f f2fs: increase the number of max hard links
This patch increases the number of maximum hard links for one file.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:08 -07:00
Jaegeuk Kim
798c1b16d1 f2fs: skip checkpoint if there is no dirty and prefree segments
We should avoid needless checkpoints when there is no dirty and prefree segment.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:07 -07:00
Chao Yu
31696580bf f2fs: shrink free_nids entries
This patch introduces __count_free_nids/try_to_free_nids and registers
them in slab shrinker for shrinking under memory pressure.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:06 -07:00
Chao Yu
206e61be29 f2fs: avoid clear valid page
In f2fs_delete_entry, if last dirent is remove from the dentry page,
we will try to punch that page since it has no valid date in it.

But truncate_hole which is used for punching could fail because of
no memory or IO error, if that happened, we'd better skip clearing
this valid dentry page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 09:00:06 -07:00
Jaegeuk Kim
315df8398e f2fs: do not write any node pages related to orphan inodes
We should not write node pages when deleting orphan inodes.
In order to do that, we can eaisly set POR_DOING flag earlier before entering
orphan inode routine.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-20 08:59:42 -07:00
Jan Kara
9181f8bf5a udf: Don't modify filesystem for read-only mounts
When read-write mount of a filesystem is requested but we find out we
can mount the filesystem only in read-only mode, we still modify
LVID in udf_close_lvid(). That is both unnecessary and contrary to
expectation that when we fall back to read-only mount we don't modify
the filesystem.

Make sure we call udf_close_lvid() only if we called udf_open_lvid() so
that filesystem gets modified only if we verified we are allowed to
write to it.

Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jan Kara <jack@suse.com>
2015-08-20 14:58:35 +02:00
kbuild test robot
18df8a87ba dlm: sctp_accept_from_sock() can be static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:23:09 -05:00
Marcelo Ricardo Leitner
00dcffaebf dlm: fix reconnecting but not sending data
There are cases on which lowcomms_connect_sock() is called directly,
which caused the CF_WRITE_PENDING flag to not bet set upon reconnect,
specially on send_to_sock() error handling. On this last, the flag was
already cleared and no further attempt on transmitting would be done.

As dlm tends to connect when it needs to transmit something, it makes
sense to always mark this flag right after the connect.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:22:21 -05:00
Marcelo Ricardo Leitner
acee4e527d dlm: replace BUG_ON with a less severe handling
BUG_ON() is a severe action for this case, specially now that DLM with
SCTP will use 1 socket per association. Instead, we can just close the
socket on this error condition and return from the function.

Also move the check to an earlier stage as it won't change and thus we
can abort as soon as possible.

Although this issue was reported when still using SCTP with 1-to-many
API, this cleanup wouldn't be that simple back then because we couldn't
close the socket and making sure such event would cease would be hard.
And actually, previous code was closing the association, yet SCTP layer
is still raising the new data event. Probably a bug to be fixed in SCTP.

Reported-by: <tan.hu@zte.com.cn>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:22:21 -05:00
Marcelo Ricardo Leitner
ee44b4bc05 dlm: use sctp 1-to-1 API
DLM is using 1-to-many API but in a 1-to-1 fashion. That is, it's not
needed but this causes it to use sctp_do_peeloff() to mimic an
kernel_accept() and this causes a symbol dependency on sctp module.

By switching it to 1-to-1 API we can avoid this dependency and also
reduce quite a lot of SCTP-specific code in lowcomms.c.

The caveat is that now DLM won't always use the same src port. It will
choose a random one, just like TCP code. This allows the peers to
attempt simultaneous connections, which now are handled just like for
TCP.

Even more sharing between TCP and SCTP code on DLM is possible, but it
is intentionally left for a later commit.

Note that for using nodes with this commit, you have to have at least
the early fixes on this patchset otherwise it will trigger some issues
on old nodes.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:22:20 -05:00
Marcelo Ricardo Leitner
356344c4c3 dlm: fix not reconnecting on connecting error handling
If we don't clear that bit, lowcomms_connect_sock() will not schedule
another attempt, and no further attempt will be done.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:22:19 -05:00
Marcelo Ricardo Leitner
0d737a8cfd dlm: fix race while closing connections
When a connection have issues DLM may need to close it.  Therefore we
should also cancel pending workqueues for such connection at that time,
and not just when dlm is not willing to use this connection anymore.

Also, if we don't clear CF_CONNECT_PENDING flag, the error handling
routines won't be able to re-connect as lowcomms_connect_sock() will
check for it.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:22:19 -05:00
Marcelo Ricardo Leitner
28926a0965 dlm: fix connection stealing if using SCTP
When using SCTP and accepting a new connection, DLM currently validates
if the peer trying to connect to it is one of the cluster nodes, but it
doesn't check if it already has a connection to it or not.

If it already had a connection, it will be overwritten, and the new one
will be used for writes, possibly causing the node to leave the cluster
due to communication breakage.

Still, one could DoS the node by attempting N connections and keeping
them open.

As said, but being explicit, both situations are only triggerable from
other cluster nodes, but are doable with only user-level perms.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-17 16:22:15 -05:00
Jann Horn
8ed1f0e22f fs/fuse: fix ioctl type confusion
fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd,
leading to a type confusion issue. Fix it by checking file->f_op.

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-16 12:35:44 -07:00
Theodore Ts'o
bdfe0cbd74 Revert "ext4: remove block_device_ejected"
This reverts commit 08439fec26.

Unfortunately we still need to test for bdi->dev to avoid a crash when a
USB stick is yanked out while a file system is mounted:

   usb 2-2: USB disconnect, device number 2
   Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
   JBD2: Error -5 detected when updating journal superblock for sdb1-8.
   BUG: unable to handle kernel paging request at 34beb000
   IP: [<c136ce88>] __percpu_counter_add+0x18/0xc0
   *pdpt = 0000000023db9001 *pde = 0000000000000000 
   Oops: 0000 [#1] SMP 
   CPU: 0 PID: 4083 Comm: umount Tainted: G     U     OE   4.1.1-040101-generic #201507011435
   Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
   task: ebf06b50 ti: ebebc000 task.ti: ebebc000
   EIP: 0060:[<c136ce88>] EFLAGS: 00010082 CPU: 0
   EIP is at __percpu_counter_add+0x18/0xc0
   EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
   ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
   CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
   Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
   Call Trace:
    [<c1160454>] account_page_dirtied+0x74/0x110
    [<c11e613f>] __set_page_dirty+0x3f/0xb0
    [<c11e6203>] mark_buffer_dirty+0x53/0xc0
    [<c124a0cb>] ext4_commit_super+0x17b/0x250
    [<c124ac71>] ext4_put_super+0xc1/0x320
    [<c11f04ba>] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [<c11cfeda>] ? evict_inodes+0xca/0xe0
    [<c11b925a>] generic_shutdown_super+0x6a/0xe0
    [<c10a1df0>] ? prepare_to_wait_event+0xd0/0xd0
    [<c1165a50>] ? unregister_shrinker+0x40/0x50
    [<c11b92f6>] kill_block_super+0x26/0x70
    [<c11b94f5>] deactivate_locked_super+0x45/0x80
    [<c11ba007>] deactivate_super+0x47/0x60
    [<c11d2b39>] cleanup_mnt+0x39/0x80
    [<c11d2bc0>] __cleanup_mnt+0x10/0x20
    [<c1080b51>] task_work_run+0x91/0xd0
    [<c1011e3c>] do_notify_resume+0x7c/0x90
    [<c1720da5>] work_notify
   Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 <64> 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
   EIP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
   CR2: 0000000034beb000
   ---[ end trace dd564a7bea834ecd ]---

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-16 10:03:57 -04:00
Theodore Ts'o
e294a5371b ext4: ratelimit the file system mounted message
The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 14:59:44 -04:00
Dan Carpenter
da0b5e40ab ext4: silence a format string false positive
Static checkers complain that the format string should be "%s".  It does
not make a difference for the current code.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 11:38:13 -04:00
Dan Carpenter
9810446836 ext4: simplify some code in read_mmp_block()
My static check complains because we have:

	if (!*bh)
		return -ENOMEM;
	if (*bh) {

The second check is unnecessary.

I've simplified this code by moving the "if (!*bh)" checks around.  Also
Andreas Dilger says we should probably print a warning if sb_getblk()
fails.

[ Restructured the code so that we print a warning message as well if
  the mmp block doesn't check out, and to print the error code to
  disambiguate between the error cases.  - TYT ]

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 11:30:31 -04:00
Eric Sandeen
c642dc9e1a ext4: don't manipulate recovery flag when freezing no-journal fs
At some point along this sequence of changes:

f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

ext4 started setting needs_recovery on filesystems without journals
when they are unfrozen.  This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.

(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).

To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.

Reported-by: Stu Mark <smark@datto.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-15 10:45:06 -04:00
Jaegeuk Kim
4c278394b0 f2fs: avoid a build warning
If F2FS_CHECK_FS is turned off, we can get a build warning for unused variable.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-14 16:02:15 -07:00
Chao Yu
8c14bfadea f2fs: handle error of f2fs_iget correctly
In recover_orphan_inode, whenever f2fs_iget fail, we will make kernel panic,
but it's not reasonable, because f2fs_iget can fail due to a lot of reasons
including out of memory.

So we change error handling method as below:
a) when finding no entry for the orphan inode, bug_on for catching bugs;
b) for other reasons, report it to caller.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-14 16:02:14 -07:00
Jaegeuk Kim
47e70ca46f f2fs: do not assign a new segment for dio under space shortage
If there is not enough free segment, we should not assign a new segment
explicitly. Otherwise, we can run out of free segment.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-14 16:02:13 -07:00
Kent Overstreet
b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Kent Overstreet
6cf66b4caf fs: use helper bio_add_page() instead of open coding on bi_io_vec
Call pre-defined helper bio_add_page() instead of open coding for
iterating through bi_io_vec[]. Doing that, it's possible to make some
parts in filesystems and mm/page_io.c simpler than before.

Acked-by: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:00 -06:00
Kent Overstreet
0e28997ec4 btrfs: remove bio splitting and merge_bvec_fn() calls
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:43 -06:00
Eric W. Biederman
4b75de8615 fs: Set the size of empty dirs to 0.
Before the make_empty_dir_inode calls were introduce into proc, sysfs,
and sysctl those directories when stated reported an i_size of 0.
make_empty_dir_inode started reporting an i_size of 2.  At least one
userspace application depended on stat returning i_size of 0.  So
modify make_empty_dir_inode to cause an i_size of 0 to be reported for
these directories.

Cc: stable@vger.kernel.org
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-08-12 15:28:45 -05:00
Chao Yu
decd36b6c4 f2fs: remove inmem radix tree
Previously, we use radix tree to index all registered page entries for
atomic file, but now we only use radix tree to see whether current page
is indexed or not, since the other user of radix tree is gone in commit
042b7816aa ("f2fs: remove unnecessary call to invalidate inmemory pages").

So in this patch, we try to use one more efficient way:
Introducing a macro ATOMIC_WRITTEN_PAGE, and setting it as page private
value to indicate page indexing status. By using this way, we can save
memory and lookup time.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-11 11:31:14 -07:00
Chao Yu
c15e8599ff f2fs: report EINVAL for unalignment direct IO
We run ltp testcase with f2fs and obtain a TFAIL in diotest4, the result in
detail is as fallow:

dio04

<<<test_start>>>
tag=dio04 stime=1432278894
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4    1  TPASS  :  Negative Offset
diotest4    2  TPASS  :  removed
diotest4    3  TFAIL  :  diotest4.c:129: write allows odd count.returns 1: Success
diotest4    4  TFAIL  :  diotest4.c:183: Odd count of read and write
diotest4    5  TPASS  :  Read beyond the file size
......

the result of ext4 with same environment:

dio04

<<<test_start>>>
tag=dio04 stime=1432259643
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4    1  TPASS  :  Negative Offset
diotest4    2  TPASS  :  removed
diotest4    3  TPASS  :  Odd count of read and write
diotest4    4  TPASS  :  Read beyond the file size
......

The reason is that when triggering DIO in f2fs, we will return zero value
in ->direct_IO if writer's buffer offset, file offset and transfer size is
not alignment to block size of filesystem, resulting in falling back into
buffered write instead of returning -EINVAL.

This patch fixes that problem by returning correct error number for above
case, and removing the judgement condition in check_direct_IO to make sure
the verification will be enabled for direct reader too.

Besides, Jaegeuk Kim pointed out that there is expectional cases we should
always make direct-io falling back into buffered write, such as dio in
encrypted file.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
[Chao Yu make small change and add detail description in commit message]
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-11 11:30:24 -07:00
Sasha Levin
9b81c84235 block: don't access bio->bi_error after bio_put()
Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
such as:

[521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
[521120.720638] Read of size 4 by task mount.ocfs2/9644
[521120.721212] =============================================================================
[521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
[521120.722968] -----------------------------------------------------------------------------
[521120.722968]
[521120.723915] Disabling lock debugging due to kernel taint
[521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
[521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
[521120.726037]
[521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff  ...6............
[521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
[521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00  ................
[521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff  ................
[521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff  ...........6....
[521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff  .s".....@..<....
[521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00  ................
[521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G    B           4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420
[521120.744274]  ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0
[521120.745025]  ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00
[521120.745908]  ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854
[521120.747063] Call Trace:
[521120.747520] dump_stack (lib/dump_stack.c:52)
[521120.748053] print_trailer (mm/slub.c:653)
[521120.748582] object_err (mm/slub.c:660)
[521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194)
[521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250)
[521120.753580] dio_bio_complete (fs/direct-io.c:478)
[521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291)
[521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322)
[521120.761658] blkdev_direct_IO (fs/block_dev.c:162)
[521120.762993] generic_file_read_iter (mm/filemap.c:1738)
[521120.767405] blkdev_read_iter (fs/block_dev.c:1649)
[521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434)
[521120.772126] vfs_read (fs/read_write.c:454)
[521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594)
[521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[521120.777375] Memory state around the buggy address:
[521120.778118]  ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.779211]  ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.781465]                          ^
[521120.782083]  ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.783717]  ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[521120.784818] ==================================================================

This patch fixes a few of those places that I caught while auditing the patch, but the
original patch should be audited further for more occurences of this issue since I'm
not too familiar with the code.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-11 11:34:32 -06:00
Dan Carpenter
72d4d0e489 quota: remove an unneeded condition
We know "ret" is zero here so we can remove this condition.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>
2015-08-11 10:01:24 +02:00
Chao Yu
6394328ab8 f2fs: report error of fill_zero
fill_zero can fail due to a lot of reason, but previously we do not handle
its return value, so its callers such as punch_hole/f2fs_zero_range may
report success, but actually can fail because of error occurs inside
fill_zero.

This patch fixes to report correct return value of fill_zero.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-10 12:26:34 -07:00
Linus Torvalds
a3ca013d88 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull RCU pathwalk fix from Al Viro:
 "Another racy use of nd->path.dentry in RCU mode"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  may_follow_link() should use nd->inode
2015-08-10 10:04:47 -07:00
Greg Kroah-Hartman
5d44f4b348 Merge 4.2-rc6 into char-misc-next
We want the fixes in Linus's tree in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09 16:28:09 -07:00
Linus Torvalds
af0b3152bb Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have a btrfs quota regression fix.

  I merged this one on Thursday and have run it through tests against
  current master.

  Normally I wouldn't have sent this while you were finalizing rc6, but
  I'm feeding mosquitoes in the adirondacks next week, so I wanted to
  get this one out before leaving.  I'll leave longer tests running and
  check on things during the week, but I don't expect any problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: qgroup: Fix a regression in qgroup reserved space.
2015-08-09 05:56:31 +03:00
Masahiro Yamada
e1c05067c3 treewide: fix typos in comment blocks
Looks like the word "contiguous" is often mistyped.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:46:24 +02:00
Geert Uytterhoeven
d41e36a0ab Btrfs: Spelling s/consitent/consistent/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:13:21 +02:00
Nik Nyby
d31e77177d ntfs: super.c: Fix error log
"transation" should be "transaction"

Signed-off-by: Nik Nyby <nikolas@gnu.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:06:35 +02:00
Geert Uytterhoeven
eaf593c38d freevxfs: Grammar s/an negative/a negative/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 13:59:24 +02:00
Masanari Iida
971bd8fa36 treewide: Fix typo in printk
This patch fix spelling typo inv various part of sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 13:58:05 +02:00
Stephen Smalley
e1832f2923 ipc: use private shmem or hugetlbfs inodes for shm segments.
The shm implementation internally uses shmem or hugetlbfs inodes for shm
segments.  As these inodes are never directly exposed to userspace and
only accessed through the shm operations which are already hooked by
security modules, mark the inodes with the S_PRIVATE flag so that inode
security initialization and permission checking is skipped.

This was motivated by the following lockdep warning:

  ======================================================
   [ INFO: possible circular locking dependency detected ]
   4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
  -------------------------------------------------------
   httpd/1597 is trying to acquire lock:
   (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
   but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
   which lock already depends on the new lock.
   the existing dependency chain (in reverse order) is:
   -> #3 (&mm->mmap_sem){++++++}:
        lock_acquire+0xc7/0x270
        __might_fault+0x7a/0xa0
        filldir+0x9e/0x130
        xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
        xfs_readdir+0x1b4/0x330 [xfs]
        xfs_file_readdir+0x2b/0x30 [xfs]
        iterate_dir+0x97/0x130
        SyS_getdents+0x91/0x120
        entry_SYSCALL_64_fastpath+0x12/0x76
   -> #2 (&xfs_dir_ilock_class){++++.+}:
        lock_acquire+0xc7/0x270
        down_read_nested+0x57/0xa0
        xfs_ilock+0x167/0x350 [xfs]
        xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
        xfs_attr_get+0xbd/0x190 [xfs]
        xfs_xattr_get+0x3d/0x70 [xfs]
        generic_getxattr+0x4f/0x70
        inode_doinit_with_dentry+0x162/0x670
        sb_finish_set_opts+0xd9/0x230
        selinux_set_mnt_opts+0x35c/0x660
        superblock_doinit+0x77/0xf0
        delayed_superblock_init+0x10/0x20
        iterate_supers+0xb3/0x110
        selinux_complete_init+0x2f/0x40
        security_load_policy+0x103/0x600
        sel_write_load+0xc1/0x750
        __vfs_write+0x37/0x100
        vfs_write+0xa9/0x1a0
        SyS_write+0x58/0xd0
        entry_SYSCALL_64_fastpath+0x12/0x76
  ...

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Morten Stevens <mstevens@fedoraproject.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Joseph Qi
32e5a2a2be ocfs2: fix shift left overflow
When using a large volume, for example 9T volume with 2T already used,
frequent creation of small files with O_DIRECT when the IO is not
cluster aligned may clear sectors in the wrong place.  This will cause
filesystem corruption.

This is because p_cpos is a u32.  When calculating the corresponding
sector it should be converted to u64 first, otherwise it may overflow.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Jan Kara
8f2f3eb59d fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()
fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked()
drops mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and thus the next
entry pointer we have cached may become stale and we dereference free
memory.

Fix the problem by first moving marks to free to a special private list
and then always free the first entry in the special list.  This method
is safe even when entries from the list can disappear once we drop the
lock.

Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Amanieu d'Antras
3ead7c52bd signalfd: fix information leak in signalfd_copyinfo
This function may copy the si_addr_lsb field to user mode when it hasn't
been initialized, which can leak kernel stack data to user mode.

Just checking the value of si_code is insufficient because the same
si_code value is shared between multiple signals.  This is solved by
checking the value of si_signo in addition to si_code.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:40 +03:00
Joseph Qi
209f7512d0 ocfs2: fix BUG in ocfs2_downconvert_thread_do_work()
The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
ocfs2_downconvert_thread_do_work can be triggered in the following case:

ocfs2dc has firstly saved osb->blocked_lock_count to local varibale
processed, and then processes the dentry lockres.  During the dentry
put, it calls iput and then deletes rw, inode and open lockres from
blocked list in ocfs2_mark_lockres_freeing.  And this causes the
variable `processed' to not reflect the number of blocked lockres to be
processed, which triggers the BUG.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:40 +03:00
Mel Gorman
4248b0da46 fs, file table: reinit files_stat.max_files after deferred memory initialisation
Dave Hansen reported the following;

	My laptop has been behaving strangely with 4.2-rc2.  Once I log
	in to my X session, I start getting all kinds of strange errors
	from applications and see this in my dmesg:

        	VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using.  This
patch recalculates file-max after deferred memory initialisation.  Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1:             files_stat.max_files = 6582781
4.2-rc2:         files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:40 +03:00
Qu Wenruo
c05f9429e1 btrfs: qgroup: Fix a regression in qgroup reserved space.
During the change to new btrfs extent-oriented qgroup implement, due to
it doesn't use the old __qgroup_excl_accounting() for exclusive extent,
it didn't free the reserved bytes.

The bug will cause limit function go crazy as the reserved space is
never freed, increasing limit will have no effect and still cause
EQOUT.

The fix is easy, just free reserved bytes for newly created exclusive
extent as what it does before.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-06 14:51:15 -07:00
Chao Yu
12a8343e99 f2fs: recover invalid/reserved block address for fsynced file
When testing with generic/101 in xfstests, error message outputed as below:

    --- tests/generic/101.out
    +++ results//generic/101.out.bad
    @@ -10,10 +10,14 @@
     File foo content after log replay:
     0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
     *
    -0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    +0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
     *
     0372000
    ...
    (Run 'diff -u tests/generic/101.out results/generic/101.out.bad'  to see the entire diff)

The test flow is like below:
1. pwrite foo -S 0xaa 0 64K
2. pwrite foo -S 0xbb 64K 61K
3. sync
4. truncate foo 64K
5. truncate foo 125K
6. fsync foo
7. flakey drop writes
8. umount

After this test, we expect the data of recovered file will have the first
64k of data filling with value 0xaa and the next 61k of data filling with
value 0x00 because we have fsynced it before dropping writes in dm.

In f2fs, during recovering, we will only recover the valid block address
in direct node page if it is marked as a fsynced dnode, but block address
which means invalid/reserved (with value NULL_ADDR/NEW_ADDR) will not be
recovered. So, the file recovered shows its incorrect data 0xbb in range of
[61k, 125k].

In this patch, we fix to recover invalid/reserved block during recover flow.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 22:16:41 -07:00
Fan Li
759af1c9c1 f2fs: use extent cache to optimize f2fs_reserve_block
In some cases, we only need the block address when we call
f2fs_reserve_block,
other fields of struct dnode_of_data aren't necessary.
We can try extent cache first for such cases in order to speed up the
process.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 22:15:42 -07:00
Partha Pratim Mukherjee
594069bc3d fs/char_dev.c: fix incorrect documentation for unregister_chrdev_region
The current documentation for unregister_chrdev_region says that it return
a range of device numbers which is incorrect.  Instead it unregister a
range of device numbers.  Fix the documentation to make this clear.

Signed-off-by: Partha Pratim Mukherjee <ppm.floss@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 13:49:35 -07:00
Greg Kroah-Hartman
f368ed6088 char: make misc_deregister a void function
With well over 200+ users of this api, there are a mere 12 users that
actually checked the return value of this function.  And all of them
really didn't do anything with that information as the system or module
was shutting down no matter what.

So stop pretending like it matters, and just return void from
misc_deregister().  If something goes wrong in the call, you will get a
WARNING splat in the syslog so you know how to fix up your driver.
Other than that, there's nothing that can go wrong.

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 10:35:49 -07:00
Chao Yu
e90c2d2850 f2fs: invalidate temporary meta page
To avoid meeting garbage data in next free node block at the end of warm
node chain when doing recovery, we will try to zero out that invalid block.

If the device is not support discard, our way for zeroing out block is:
grabbing a temporary zeroed page in meta inode, then, issue write request
with this page.

But, we forget to release that temporary page, so our memory usage will
increase without gaining any hit ratio benefit, so it's better to free it
for saving memory.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:19:21 -07:00
Chao Yu
470f00e968 f2fs: fix to release inode page correctly
In following call path, we will pass a locked and referenced ipage
pointer to get_new_data_page:
 - init_inode_metadata
  - make_empty_dir
   - get_new_data_page

There are two exit paths in get_new_data_page when error occurs:
1) grab_cache_page fails, ipage will not be released;
2) f2fs_reserve_block fails, ipage will be released in callee.

So, it's not consistent for error handling in get_new_data_page.

For f2fs_reserve_block, it's not very easy to change the rule
of error handling, since it's already complicated.

Here we deside to choose an easy way to fix this issue:
If any error occur in get_new_data_page, we will ensure releasing
ipage in this function.

The same issue is in f2fs_convert_inline_dir, fix that too.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:23 -07:00
Liu Xue
7a04f64d4d f2fs: unify f2fs_bug_on when check blocks and segment
Replace BUG_ON with f2fs_bug_on to deal with
block and segment validity check failed.

Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:18 -07:00
Chao Yu
f3f338caad f2fs: freeze filesystem when fail to update meta page due to IO error
In get_meta_page, we guarantee no failure for the returned page,
but sometimes, IO error from device will incur returning an
non-updated page.

Then, we still use this page as updated one, exception could happen
when using this kind of page.

So in this condition, we'd better freeze fs by making fs readonly and
and stop doing checkpoint.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:17 -07:00
Fan Li
5768dcdd7f f2fs: change the timing of f2fs_wait_on_page_writeback
some backing devices need pages to be stable during writeback. It doesn't
matter if
the page is completely overwritten or already uptodate, it needs to wait
before write.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:16 -07:00
Jaegeuk Kim
edb27deea7 f2fs: handle error cases in commit_inmem_pages
This patch adds to handle error cases in commit_inmem_pages.
If an error occurs, it stops to write the pages and return the error right
away.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:15 -07:00
Chao Yu
a6d494b6d8 f2fs: fix to build free nids from readaheaded nat pages
When there is no enough free nids in free nid cache, we will try to
readahead FREE_NID_PAGES:4 nat pages into page cache of meta_inode,
then, reading nat entries in nat page for adding free nids to free nid
cache.

But when traversing all nat pages we readaheaded in a circulation,
our exit condition is not set right, one more nat page will be scanned
without readaheading, resulting worse read performance.

This patch fixes to read the correct number nat pages to avoid bad
performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:14 -07:00
Chao Yu
e4e762723a f2fs: fix inline data/dentry stat number leak
If we clear inline data/dentry flag in handle_failed_inode, we will fail
to decline the stat count of inline data/dentry in f2fs_evict_inode due
to no flag in inode. So remove the wrong clearing.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:14 -07:00
Chao Yu
f4c9c743ac f2fs: convert inline data before set atomic/volatile flag
In f2fs_ioc_start_{atomic,volatile}_write, if we failed in converting
inline data, we will report error to user, but still remain atomic/volatile
flag in inode, it will impact further writes for this file. Fix it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:13 -07:00
Chao Yu
a5f64b6aa6 f2fs: fix to wait all atomic written pages writeback
This patch fixes the incorrect range (0, LONG_MAX) which is used
in ranged fsync. If we use LONG_MAX as the parameter for indicating
the end of file we want to synchronize, in 32-bits architecture
machine, these datas after 4GB offset may not be persisted in
storage after ->fsync returned.

Here, we alter LONG_MAX to LLONG_MAX to fix this issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:12 -07:00
Chao Yu
6a2905443c f2fs: skip writing in ->writepages when no dirty pages exist
When flushing comes from background, if there is no dirty page in the
mapping of inode, we'd better to skip seeking dirty page from mapping
for writebacking.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:11 -07:00
Tiezhu Yang
737f18992e f2fs: optimize f2fs_write_cache_pages
The if statement "goto continue_unlock" is exactly the same when
each if condition is true that is depended on the value of both
"step" and "is_cold_data(page)" are 0 or 1. That means when the
value of "step" equals to "is_cold_data(page)", the if condition
is true and the if statement "goto continue_unlock" appears only
once, so it can be optimized to reduce the duplicated code.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:10 -07:00
Chao Yu
55f57d2c42 f2fs: fix double lock in handle_failed_inode
In handle_failed_inode, there is a potential deadlock which can happen
in below call path:

- f2fs_create
 - f2fs_lock_op   down_read(cp_rwsem)
 - f2fs_add_link
  - __f2fs_add_link
   - init_inode_metadata
    - f2fs_init_security    failed
    - truncate_blocks    failed
 - handle_failed_inode
  - f2fs_truncate
   - truncate_blocks(..,true)
					- write_checkpoint
					 - block_operations
					  - f2fs_lock_all  down_write(cp_rwsem)
    - f2fs_lock_op   down_read(cp_rwsem)

So in this path, we pass parameter to f2fs_truncate to make sure
cp_rwsem in truncate_blocks will not be locked again.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:09 -07:00
Chao Yu
ecbaa4068f f2fs: reduce region of cp_rwsem covered in f2fs_do_collapse
In f2fs_do_collapse, region cp_rwsem covered is large, since it will be
held until all blocks are left shifted, so if we try to collapse small
area at the beginning of large file, checkpoint who want to grab writer's
lock of cp_rwsem will be delayed for long time.

In order to avoid this condition, altering to lock/unlock cp_rwsem each
shift operation.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:09 -07:00
Fan Li
0f825ee6e8 f2fs: add new interfaces for extent tree
Add a lookup and a insertion interface for extent tree.
The new lookup return the insert position and the prev/next
extents closest to the offset we lookup when find no match.
The new insertion uses above parameters to improve performance.

There are three possible insertions after the lookup in
f2fs_update_extent_tree, two of them insert parts of removed extent
back to tree, since no merge happens during this process, new insertion
skips the merge check in this scanario; the another insertion inserts a
new extent to tree, new insertion uses prev/next extent and insert
position to insert this extent directly, and save the time of searching
down the tree.

As long as tree remains unchanged between lookup and insertion, this
would work fine. And the new lookup would be useful when add
multi-blocks extent support for insertion interface.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:08 -07:00
Jaegeuk Kim
86531d6b84 f2fs: callers take care of the page from bio error
This patch changes for a caller to handle the page after its bio gets an error.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:07 -07:00
Chao Yu
727edac572 f2fs: use atomic_t to record hit ratio info of extent cache
Variables for recording extent cache ratio info were updated without
protection, this patch tries to alter them to atomic_t type for more
accurate stat.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:06 -07:00
Chao Yu
d5e8f6c980 f2fs: stat inline xattr inode number
This patch adds to stat the number of inline xattr inode for
showing in debugfs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:05 -07:00
Jaegeuk Kim
1b77c416e7 f2fs: use a page temporarily for encrypted gced page
That encrypted page is used temporarily, so we don't need to mark it accessed.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-05 08:08:04 -07:00
Linus Torvalds
9e91edcd1b Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields.

* 'for-4.2' of git://linux-nfs.org/~bfields/linux:
  nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateid
  nfsd: Fix a file leak on nfsd4_layout_setlease failure
  nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystem
2015-08-05 10:59:59 +02:00
Al Viro
aa65fa35ba may_follow_link() should use nd->inode
Now that we can get there in RCU mode, we shouldn't play with
nd->path.dentry->d_inode - it's not guaranteed to be stable.
Use nd->inode instead.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-04 23:23:50 -04:00
Chao Yu
8f46dcaea8 f2fs: expose f2fs_write_cache_pages
If there are gced dirty pages and normal dirty pages in the mapping
of one inode, we might writeback them alternately with discontinuous
block address, resulting in low performance.

This patch introduces f2fs_write_cache_pages with codes copied from
write_cache_pages in mm/page-writeback.c.

In this function, we refactor flow with two steps:
1) writeback all cold type pages.
2) writeback all non-cold type pages.

By using this method, f2fs will writeback dirty pages with the same
temperature in bunch mode, it makes writeouted block being with
more continuous address, so they can be merged as much as possible
in f2fs bio cache, and also it will reduce the chance of submiting
small IO from block layer.

Test environment: 8g nokia sd card (very old sd card, but it shows
better effect when testing with this patch, and with a 32g kingston
sd card, I didn't see much more improvement).

Test step:
1. touch testfile;
2. truncate -s 512K testfile;
3. write all pages with odd index;
4. trigger gc by ioctl;
5. write all pages with even index;
6. time fsync testfile.

before:
real	0m0.402s
user	0m0.000s
sys	0m0.000s

after:
real	0m0.143s
user	0m0.004s
sys	0m0.004s

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:59 -07:00
Chao Yu
037fe70c9a f2fs: correct return value of ->setxattr
This patch fixes to return correct error number of ->setxattr, which
is reported by xfstest tests/generic/026 as below:

generic/026      - output mismatch
    --- tests/generic/026.out
    +++ results/generic/026.out.bad
    @@ -4,6 +4,6 @@
     1 below acl max
     acl max
     1 above acl max
    -chacl: cannot set access acl on "largeaclfile": Argument list too long
    +chacl: cannot set access acl on "largeaclfile": Numerical result out of range
     use 16 aces
     use 17 aces
    ...
Ran: generic/026
Failures: generic/026
Failed 1 of 1 tests

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:59 -07:00
Chao Yu
bd936f8407 f2fs: cleanup write_orphan_inodes
Previously, since 'commit 4531929e39 ("f2fs: move grabing orphan
pages out of protection region")' was committed, in write_orphan_inodes(),
we will grab all meta page in a batch before we use them under spinlock,
so that we can avoid large time delay of grabbing meta pages under
spinlock.

Now, 'commit d6c67a4fee ("f2fs: revmove spin_lock for
write_orphan_inodes")' remove the spinlock in write_orphan_inodes,
so there is no issue we describe above, we'd better recover to move
the grab operation to original place for readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:59 -07:00
Chao Yu
5b3391244d f2fs: warm up cold page after mmaped write
With cost-benifit method, background gc will consider old section with
fewer valid blocks as candidate victim, these old blocks in section will
be treated as cold data, and laterly will be moved into cold segment.

But if the gcing page is attached by user through buffered or mmaped
write, we should reset the page as non-cold one, because this page may
have more opportunity for further updating.

So fix to add clearing code for the missed 'mmap' case.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:59 -07:00
Chao Yu
c1c1b58359 f2fs: add new ioctl F2FS_IOC_GARBAGE_COLLECT
When background gc is off, the only way to trigger gc is executing
a force gc in some operations who wants to grab space in disk.

The executing condition is limited: to execute force gc, we should
wait for the time when there is almost no more free section for LFS
allocation. This seems not reasonable for our user who wants to
control triggering gc by himself.

This patch introduces F2FS_IOC_GARBAGE_COLLECT interface for
triggering garbage collection by using ioctl. It provides our users
one more option to trigger gc.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:58 -07:00
Chao Yu
a28ef1f5ae f2fs: maintain extent cache in separated file
This patch moves extent cache related code from data.c into extent_cache.c
since extent cache is independent feature, and its codes are not relate to
others in data.c, it's better for us to maintain them in separated place.

There is no functionality change, but several small coding style fixes
including:
* rename __drop_largest_extent to f2fs_drop_largest_extent for exporting;
* rename misspelled word 'untill' to 'until';
* remove unneeded 'return' in the end of f2fs_destroy_extent_tree().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:58 -07:00
Fan Li
3c7df87dad f2fs: don't try to split extents shorter than F2FS_MIN_EXTENT_LEN
Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will
be kept in extent cache after split, extents already shorter than
F2FS_MIN_EXTENT_LEN don't need to try split at all.

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:58 -07:00
Chao Yu
90d4388ac2 f2fs: fix to update page flag
This patch fixes to update page flag (e.g. Uptodate/cold flag) in
->write_begin.

Otherwise, page will be non-uptodate when we try to write entire
page, and cold data flag in page will not be clean when gced page
is being rewritten.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:57 -07:00
Jaegeuk Kim
7023a1ad17 f2fs: shrink unreferenced extent_caches first
If an extent_tree entry has a zero reference count, we can drop it from the
cache in higher priority rather than currently referencing entries.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:57 -07:00
Chao Yu
bb96a8d51e f2fs: enhance multithread performance
In ->writepages, we use writepages mutex lock to serialize all block
address allocation and page submitting pairs from different inodes.
This method makes our delayed dirty pages of one inode being written
continously as many as possible.

But there is one problem that we did not submit current cached bio in
protection region of writepages mutex lock, so there is a small chance
that we submit the one of other thread's as below, resulting in
splitting more bios.

thread 1			thread 2
->writepages
  lock(writepages)
  ->write_cache_pages
  unlock(writepages)
				  lock(writepages)
				  ->write_cache_pages
  ->f2fs_submit_merged_bio
				    ->writepage
				  unlock(writepages)

fs_mark-6535  [002] ....  2242.270230: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5766152, size = 524288
fs_mark-6536  [000] ....  2242.270361: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767176, size = 4096
fs_mark-6536  [000] ....  2242.270370: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, NODE, sector = 8138112, size = 4096
fs_mark-6535  [002] ....  2242.270776: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767184, size = 516096

This may really increase time of block layer works, and may cause
larger IO lantency.

This patch moves the submitting operation into region of writepages
mutex lock to avoid bio splits when concurrently writebacking is
intensive.

my test environment: virtual machine,
intel cpu i5 2500, 8GB size memory, 4GB size ramdisk

time fs_mark  -t  16  -L  1  -s  524288  -S  1  -d  /mnt/f2fs/

before:
real	0m4.244s
user	0m0.088s
sys	0m12.336s

after:
real	0m3.822s
user	0m0.072s
sys	0m10.760s

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:57 -07:00
Chao Yu
741a7bea79 f2fs: restrict multimedia filename
When testing with fs_mark, some blocks were written out as cold
data which were mixed with warm data, resulting in splitting more
bios.

This is because fs_mark will create file with random filename as
below:

559551ee~~~~~~~~15Z29OCC05JCKQP60JQ42MKV
559551ee~~~~~~~~NZAZ6X8OA8LHIIP6XD0L58RM
559551ef~~~~~~~~B15YDSWAK789HPSDZKYTW6WM
559551f1~~~~~~~~2DAE5DPS79785BUNTFWBEMP3
559551f1~~~~~~~~1MYDY0BKSQCJPI32Q8C514RM
559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4
559551f3~~~~~~~~1WF42LPRTQJNPPGR3EINKMPE
559551f3~~~~~~~~8Y2NRK7CEPPAA02LY936PJPG

They are regarded as cold file since their filename are ended with
multimedia files' extension, but this should be wrong as we only
match the extension of filename, not the whole one.

In this patch, we try to fix the format of multimedia filename to:
"filename + '.' + extension", then we set cold file only its
filename matches the format.

So after this change, it will reduce the probability we set the
wrong cold file, also it helps a little for fs_mark's performance
on f2fs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:57 -07:00
Nicholas Krause
c1079892f4 f2fs: make the function check_dnode have a return type of bool and change it's name to is_alive
This makes the function check_dnode have a return type of bool
due to this particular function only ever returning either one
or zero as its return value and changes the name of the function
to is_alive in order to better explain this function's intended
work of checking if a dnode is still in use by the filesystem.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
[Jaegeuk Kim: change the return value check for the renamed function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:56 -07:00
Jaegeuk Kim
84bc926c07 f2fs: check the largest extent at look-up time
Because of the extent shrinker or other -ENOMEM scenarios, it cannot guarantee
that the largest extent would be cached in the tree all the time.

Instead of relying on extent_tree, we can simply check the cached one in extent
tree accordingly.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:56 -07:00
Jaegeuk Kim
3e72f72139 f2fs: use extent_cache by default
We don't need to handle the duplicate extent information.

The integrated rule is:
 - update on-disk extent with largest one tracked by in-memory extent_cache
 - destroy extent_tree for the truncation case
 - drop per-inode extent_cache by shrinker

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:56 -07:00
Jaegeuk Kim
7daaea256d f2fs: add noextent_cache mount option
This patch adds noextent_cache mount option.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:55 -07:00
Jaegeuk Kim
554df79e52 f2fs: shrink extent_cache entries
This patch registers shrinking extent_caches.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:55 -07:00
Jaegeuk Kim
1b38dc8e74 f2fs: shrink nat_cache entries
This patch registers shrinking nat_cache entries.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:55 -07:00
Jaegeuk Kim
2658e50de6 f2fs: introduce a shrinker for mounted fs
This patch introduces a shrinker targeting to reduce memory footprint consumed
by a number of in-memory f2fs data structures.

In addition, it newly adds:
 - sbi->umount_mutex to avoid data races on shrinker and put_super
 - sbi->shruinker_run_no to not revisit objects

Note that the basic implementation was copied from fs/ubifs/shrinker.c

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:54 -07:00
Jaegeuk Kim
244f4fc1c5 f2fs: set cached_en after checking finally
This patch relocates cached_en not only to be covered by spin_lock, but also
to set once after checking out completely.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:54 -07:00
Jaegeuk Kim
cbe91923a9 f2fs: update on-disk extents even under extent_cache
Previously, f2fs_update_extent_cache() updates in-memory extent_cache all the
time, and then finally preserves its up-to-date extent into on-disk one during
f2fs_evict_inode.

But, in the following scenario:

1. mount
2. open & write an extent X
3. f2fs_evict_inode; on-disk extent is X
4. open & update the extent X with Y
5. sync; trigger checkpoint
6. power-cut

after power-on, f2fs should serve extent Y, but we have an on-disk extent X.

This causes a failure on xfstests/311.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-04 14:09:54 -07:00