Commit graph

830 commits

Author SHA1 Message Date
Linus Torvalds
d34fc1adf0 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - DAX updates

 - OCFS2

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
  mm,fork: introduce MADV_WIPEONFORK
  x86,mpx: make mpx depend on x86-64 to free up VMA flag
  mm: add /proc/pid/smaps_rollup
  mm: hugetlb: clear target sub-page last when clearing huge page
  mm: oom: let oom_reap_task and exit_mmap run concurrently
  swap: choose swap device according to numa node
  mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  z3fold: use per-cpu unbuddied lists
  mm, swap: don't use VMA based swap readahead if HDD is used as swap
  mm, swap: add sysfs interface for VMA based swap readahead
  mm, swap: VMA based swap readahead
  mm, swap: fix swap readahead marking
  mm, swap: add swap readahead hit statistics
  mm/vmalloc.c: don't reinvent the wheel but use existing llist API
  mm/vmstat.c: fix wrong comment
  selftests/memfd: add memfd_create hugetlbfs selftest
  mm/shmem: add hugetlbfs support to memfd_create()
  mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
  mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
  ...
2017-09-06 20:49:49 -07:00
Jan Kara
397162ffa2 mm: remove nr_pages argument from pagevec_lookup{,_range}()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.

Just drop the argument.

Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
2b85a6171d ext4: use pagevec_lookup_range() in writeback code
Both occurences of pagevec_lookup() actually want only pages from a
given range.  Use pagevec_lookup_range() for the lookup.

Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Tahsin Erdogan
a6d0567604 ext4: backward compatibility support for Lustre ea_inode implementation
Original Lustre ea_inode feature did not have ref counts on xattr inodes
because there was always one parent that referenced it. New
implementation expects ref count to be initialized which is not true for
Lustre case. Handle this by detecting Lustre created xattr inode and set
its ref count to 1.

The quota handling of xattr inodes have also changed with deduplication
support. New implementation manually manages quotas to support sharing
across multiple users. A consequence is that, a referencing inode
incorporates the blocks of xattr inode into its own i_block field.

We need to know how a xattr inode was created so that we can reverse the
block charges during reference removal. This is handled by introducing a
EXT4_STATE_LUSTRE_EA_INODE flag. The flag is set on a xattr inode if
inode appears to have been created by Lustre. During xattr inode reference
removal, the manual quota uncharge is skipped if the flag is set.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24 14:25:02 -04:00
Miao Xie
c03b45b853 ext4, project: expand inode extra size if possible
When upgrading from old format, try to set project id
to old file first time, it will return EOVERFLOW, but if
that file is dirtied(touch etc), changing project id will
be allowed, this might be confusing for users, we could
try to expand @i_extra_isize here too.

Reported-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 01:00:49 -04:00
Miao Xie
cf0a5e818f ext4: restructure ext4_expand_extra_isize
Current ext4_expand_extra_isize just tries to expand extra isize, if
someone is holding xattr lock or some check fails, it will give up.
So rename its name to ext4_try_to_expand_extra_isize.

Besides that, we clean up unnecessary check and move some relative checks
into it.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:40:01 -04:00
Miao Xie
3b10fdc6d8 ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize
We should avoid the contention between the i_extra_isize update and
the inline data insertion, so move the xattr trylock in front of
i_extra_isize update.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:27:38 -04:00
Tahsin Erdogan
9699d4f91d ext4: make xattr inode reads faster
ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.

Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 00:07:01 -04:00
Eric Whitney
1e21196c8e ext4: correct comment references to ext4_ext_direct_IO()
Commit 914f82a32d "ext4: refactor direct IO code" deleted
ext4_ext_direct_IO(), but references to that function remain in
comments.  Update them to refer to ext4_direct_IO_write().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:26:40 -04:00
Tahsin Erdogan
407cd7fb83 ext4: change fast symlink test to not rely on i_blocks
ext4_inode_info->i_data is the storage area for 4 types of data:

  a) Extents data
  b) Inline data
  c) Block map
  d) Fast symlink data (symlink length < 60)

Extents data case is positively identified by EXT4_INODE_EXTENTS flag.
Inline data case is also obvious because of EXT4_INODE_INLINE_DATA
flag.

Distinguishing c) and d) however requires additional logic. This
currently relies on i_blocks count. After subtracting external xattr
block from i_blocks, if it is greater than 0 then we know that some
data blocks exist, so there must be a block map.

This logic got broken after ea_inode feature was added. That feature
charges the data blocks of external xattr inodes to the referencing
inode and so adds them to the i_blocks. To fix this, we could subtract
ea_inode blocks by iterating through all xattr entries and then check
whether remaining i_blocks count is zero. Besides being complicated,
this won't change the fact that the current way of distinguishing
between c) and d) is fragile.

The alternative solution is to test whether i_size is less than 60 to
determine fast symlink case. ext4_symlink() uses the same test to decide
whether to store the symlink in i_data. There is one caveat to address
before this can work though.

If an inode's i_nlink is zero during eviction, its i_size is set to
zero and its data is truncated. If system crashes before inode is removed
from the orphan list, next boot orphan cleanup may find the inode with
zero i_size. So, a symlink that had its data stored in a block may now
appear to be a fast symlink. The solution used in this patch is to treat
i_size = 0 as a non-fast symlink case. A zero sized symlink is not legal
so the only time this can happen is the mentioned scenario. This is also
logically correct because a i_size = 0 symlink has no data stored in
i_data.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-07-04 00:11:21 -04:00
Eric Biggers
63136858ae ext4: require key for truncate(2) of encrypted file
Currently, filesystems allow truncate(2) on an encrypted file without
the encryption key.  However, it's impossible to correctly handle the
case where the size being truncated to is not a multiple of the
filesystem block size, because that would require decrypting the final
block, zeroing the part beyond i_size, then encrypting the block.

As other modifications to encrypted file contents are prohibited without
the key, just prohibit truncate(2) as well, making it fail with ENOKEY.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-23 19:48:44 -04:00
Jan Kara
3abb1a0fc2 ext4: avoid unnecessary stalls in ext4_evict_inode()
These days inode reclaim calls evict_inode() only when it has no pages
in the mapping.  In that case it is not necessary to wait for transaction
commit in ext4_evict_inode() as there can be no pages waiting to be
committed.  So avoid unnecessary transaction waiting in that case.

We still have to keep the check for the case where ext4_evict_inode()
gets called from other paths (e.g. umount) where inode still can have
some page cache pages.

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 23:49:46 -04:00
Tahsin Erdogan
7a9ca53aea quota: add get_inode_usage callback to transfer multi-inode charges
Ext4 ea_inode feature allows storing xattr values in external inodes to
be able to store values that are bigger than a block in size. Ext4 also
has deduplication support for these type of inodes. With deduplication,
the actual storage waste is eliminated but the users of such inodes are
still charged full quota for the inodes as if there was no sharing
happening in the background.

This design requires ext4 to manually charge the users because the
inodes are shared.

An implication of this is that, if someone calls chown on a file that
has such references we need to transfer the quota for the file and xattr
inodes. Current dquot_transfer() function implicitly transfers one inode
charge. With ea_inode feature, we would like to transfer multiple inode
charges.

Add get_inode_usage callback which can interrogate the total number of
inodes that were charged for a given inode.

[ Applied fix from Colin King to make sure the 'ret' variable is
  initialized on the successful return path.  Detected by
  CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
2017-06-22 11:46:48 -04:00
Tahsin Erdogan
dec214d00e ext4: xattr inode deduplication
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.

The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.

The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.

The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.

[ Fixed up journal credits calculation to handle inline data and the
  rare case where an shared xattr block can get freed when two thread
  race on breaking the xattr block sharing. --tytso ]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:44:55 -04:00
Tahsin Erdogan
30a7eb970c ext4: cleanup transaction restarts during inode deletion
During inode deletion, the number of journal credits that will be
needed is hard to determine.  For that reason we have journal
extend/restart calls in several places.  Whenever a transaction is
restarted, filesystem must be in a consistent state because there is
no atomicity guarantee beyond a restart call.

Add ext4_xattr_ensure_credits() helper function which takes care of
journal extend/restart logic.  It also handles getting jbd2 write
access and dirty metadata calls.  This function is called at every
iteration of handling an ea_inode reference.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:42:09 -04:00
Tahsin Erdogan
02749a4c20 ext4: add ext4_is_quota_file()
IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
also uses it to check whether an inode is for a quota file. The
distinction currently doesn't matter because quota is disabled only
for the quota files. When we start disabling quota for other inodes
in the future, we will want to make the distinction clear.

Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
we are checking for quota files.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:31:25 -04:00
Tahsin Erdogan
0421a189bc ext4: modify ext4_xattr_ino_array to hold struct inode *
Tracking struct inode * rather than the inode number eliminates the
repeated ext4_xattr_inode_iget() call later. The second call cannot
fail in practice but still requires explanation when it wants to ignore
the return value. Avoid the trouble and make things simple.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 10:26:31 -04:00
Tahsin Erdogan
33d201e027 ext4: fix lockdep warning about recursive inode locking
Setting a large xattr value may require writing the attribute contents
to an external inode. In this case we may need to lock the xattr inode
along with the parent inode. This doesn't pose a deadlock risk because
xattr inodes are not directly visible to the user and their access is
restricted.

Assign a lockdep subclass to xattr inode's lock.

 ============================================
 WARNING: possible recursive locking detected
 4.12.0-rc1+ #740 Not tainted
 --------------------------------------------
 python/1822 is trying to acquire lock:
  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0

 but task is already holding lock:
  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&sb->s_type->i_mutex_key#15);
   lock(&sb->s_type->i_mutex_key#15);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by python/1822:
  #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50
  #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
  #2:  (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420
  #3:  (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0

 stack backtrace:
 CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 Call Trace:
  dump_stack+0x67/0x9e
  __lock_acquire+0x5f3/0x1750
  lock_acquire+0xb5/0x1d0
  down_write+0x2c/0x60
  ext4_xattr_set_entry+0x65a/0x7b0
  ext4_xattr_block_set+0x1b2/0x9b0
  ext4_xattr_set_handle+0x322/0x4f0
  ext4_xattr_set+0x144/0x1a0
  ext4_xattr_user_set+0x34/0x40
  __vfs_setxattr+0x66/0x80
  __vfs_setxattr_noperm+0x69/0x1c0
  vfs_setxattr+0xa2/0xb0
  setxattr+0x12e/0x150
  path_setxattr+0x87/0xb0
  SyS_setxattr+0xf/0x20
  entry_SYSCALL_64_fastpath+0x18/0xad

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-21 21:17:10 -04:00
Andreas Dilger
e50e5129f3 ext4: xattr-in-inode support
Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.

If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.

The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.

The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.

struct ext4_xattr_entry {
        __u8    e_name_len;     /* length of name */
        __u8    e_name_index;   /* attribute name index */
        __le16  e_value_offs;   /* offset in disk block of value */
        __le32  e_value_inum;   /* inode in which value is stored */
        __le32  e_value_size;   /* size of attribute value */
        __le32  e_hash;         /* hash value of name and value */
        char    e_name[0];      /* attribute name */
};

The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.

[ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]

Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2017-06-21 21:10:32 -04:00
Artem Blagodarenko
e08ac99fa2 ext4: add largedir feature
This INCOMPAT_LARGEDIR feature allows larger directories to be created
in ldiskfs, both with directory sizes over 2GB and and a maximum htree
depth of 3 instead of the current limit of 2. These features are needed
in order to exceed the current limit of approximately 10M entries in a
single directory.

This patch was originally written by Yang Sheng to support the Lustre server.

[ Bumped the credits needed to update an indexed directory -- tytso ]

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Yang Sheng <yang.sheng@intel.com>
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
2017-06-21 21:09:57 -04:00
Jan Kara
67a7d5f561 ext4: fix fdatasync(2) after extent manipulation operations
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

CC: stable@vger.kernel.org
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-29 13:24:55 -04:00
Jan Kara
a056bdaae7 ext4: fix data corruption for mmap writes
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.

Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.

Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:45:45 -04:00
Eric Biggers
c41d342b39 ext4: remove redundant check for encrypted file on dio write path
Currently we don't allow direct I/O on encrypted regular files, so in
such cases we return 0 early in ext4_direct_IO().  There was also an
additional BUG_ON() check in ext4_direct_IO_write(), but it can never be
hit because of the earlier check for the exact same condition in
ext4_direct_IO().  There was also no matching check on the read path,
which made the write path specific check seem very ad-hoc.

Just remove the unnecessary BUG_ON().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: David Gstir <david@sigma-star.at>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:20:31 -04:00
Eric Biggers
e5465795ca ext4: fix off-by-one error when writing back pages before dio read
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so
we need to subtract 1 from pos + count.

Note that 'count' is guaranteed to be nonzero since
ext4_file_read_iter() returns early when given a 0 count.

Fixes: 16c5468859 ("ext4: Allow parallel DIO reads")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:05:29 -04:00
Konstantin Khlebnikov
887a973061 ext4: keep existing extra fields when inode expands
ext4_expand_extra_isize() should clear only space between old and new
size.

Fixes: 6dd4ee7cab # v2.6.23
Cc: stable@vger.kernel.org
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:36:23 -04:00
Dan Williams
f5705aa8cf dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case
Tetsuo reports:

  fs/built-in.o: In function `xfs_file_iomap_end':
  xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
  fs/built-in.o: In function `xfs_file_iomap_begin':
  xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
  make: *** [vmlinux] Error 1
  $ grep DAX .config
  CONFIG_DAX=m
  # CONFIG_DEV_DAX is not set
  # CONFIG_FS_DAX is not set

When FS_DAX=n we can/must throw away the dax code in filesystems.
Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
nops in the FS_DAX=n case.

Cc: <linux-xfs@vger.kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Fixes: ef51042472 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-13 17:52:16 -07:00
Linus Torvalds
dd727dad37 Add GETFSMAP support; some performance improvements for very large
file systems and for random write workloads into a preallocated file;
 bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlkPYB8ACgkQ8vlZVpUN
 gaP1HwgApoMQGegtRIbCZKUzKBJ2S6vwIoPAMz62JuwngOyWygJ1T1TliKTitG04
 XvijKpUHtEggMO/ZsUOCoyr2LzJlpVvvrJZsavEubO12LKreYMpvNraZF1GACYTb
 lIZpdWkpcEz5WnPV/PXW/dEMcSMhnKe8tbmHXMyAouSC6a55F5Wp456KF/plqkHU
 zkWTCDbEOtHThzpL8cthUL71ji62I3Op5jn/qOfKCm6/JtUlw5pYjWkRUNqqjSQE
 uQqMpqLxI/VjOdEiBPxEF6A+ZudZmoBQKY15ibWCcHUPFOPqk4RdYz6VivRI7zrg
 KrrKcdFT29MtKnRfAAoJcc0nJ4e1Iw==
 =il74
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - add GETFSMAP support

 - some performance improvements for very large file systems and for
   random write workloads into a preallocated file

 - bug fixes and cleanups.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: cleanup write flags handling from jbd2_write_superblock()
  ext4: mark superblock writes synchronous for nobarrier mounts
  ext4: inherit encryption xattr before other xattrs
  ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
  ext4: avoid unnecessary transaction stalls during writeback
  ext4: preload block group descriptors
  ext4: make ext4_shutdown() static
  ext4: support GETFSMAP ioctls
  vfs: add common GETFSMAP ioctl definitions
  ext4: evict inline data when writing to memory map
  ext4: remove ext4_xattr_check_entry()
  ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
  ext4: merge ext4_xattr_list() into ext4_listxattr()
  ext4: constify static data that is never modified
  ext4: trim return value and 'dir' argument from ext4_insert_dentry()
  jbd2: fix dbench4 performance regression for 'nobarrier' mounts
  jbd2: Fix lockdep splat with generic/270 test
  mm: retry writepages() on ENOMEM when doing an data integrity writeback
2017-05-08 11:30:05 -07:00
Linus Torvalds
53ef7d0e20 libnvdimm for 4.12
* Region media error reporting: A libnvdimm region device is the parent
 to one or more namespaces. To date, media errors have been reported via
 the "badblocks" attribute attached to pmem block devices for namespaces
 in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
 or "btt-sector" mode this new interface reports media errors
 generically, i.e. independent of namespace modes or state. This
 subsequently allows userspace tooling to craft "ACPI 6.1 Section
 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
 submit them via the ioctl path for NVDIMM root bus devices.
 
 * Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
 a request from Linus and feedback from Christoph this allows for dax
 capable drivers to publish their own custom dax operations. This fixes
 the broken assumption that all dax operations are related to a
 persistent memory device, and makes it easier for other architectures
 and platforms to add customized persistent memory support.
 
 * 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
 available for storage appliance applications to manually trigger memory
 controllers to drain write-pending buffers that would otherwise be
 flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
 mechanism at a power loss event. Support for "locked" DIMMs is included
 to prevent namespaces from surfacing when the namespace label data area
 is locked. Finally, fixes for various reported deadlocks and crashes,
 also tagged for -stable.
 
 * ACPI / nfit driver updates: General updates of the nfit driver to add
 DSM command overrides, ACPI 6.1 health state flags support, DSM payload
 debug available by default, and various fixes.
 
 Acknowledgements that came after the branch was pushed:
 
 commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
 Tested-by: Yi Zhang <yizhan@redhat.com>
 
 commit 23f4984483 "libnvdimm: rework region badblocks clearing"
 Tested-by: Toshi Kani <toshi.kani@hpe.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
 W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
 J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
 nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
 Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
 ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
 zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
 BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
 59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
 Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
 WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
 TmJQ3Ly9t3o3/weHSzmn
 =Ox0s
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this has been in multiple -next releases. There were a few
  late breaking fixes and small features that got added in the last
  couple days, but the whole set has received a build success
  notification from the kbuild robot.

  Change summary:

   - Region media error reporting: A libnvdimm region device is the
     parent to one or more namespaces. To date, media errors have been
     reported via the "badblocks" attribute attached to pmem block
     devices for namespaces in "raw" or "memory" mode. Given that
     namespaces can be in "device-dax" or "btt-sector" mode this new
     interface reports media errors generically, i.e. independent of
     namespace modes or state.

     This subsequently allows userspace tooling to craft "ACPI 6.1
     Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
     requests and submit them via the ioctl path for NVDIMM root bus
     devices.

   - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
     by a request from Linus and feedback from Christoph this allows for
     dax capable drivers to publish their own custom dax operations.
     This fixes the broken assumption that all dax operations are
     related to a persistent memory device, and makes it easier for
     other architectures and platforms to add customized persistent
     memory support.

   - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
     available for storage appliance applications to manually trigger
     memory controllers to drain write-pending buffers that would
     otherwise be flushed automatically by the platform ADR
     (asynchronous-DRAM-refresh) mechanism at a power loss event.
     Support for "locked" DIMMs is included to prevent namespaces from
     surfacing when the namespace label data area is locked. Finally,
     fixes for various reported deadlocks and crashes, also tagged for
     -stable.

   - ACPI / nfit driver updates: General updates of the nfit driver to
     add DSM command overrides, ACPI 6.1 health state flags support, DSM
     payload debug available by default, and various fixes.

  Acknowledgements that came after the branch was pushed:

   - commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
     Tested-by: Yi Zhang <yizhan@redhat.com>

   - commit 23f4984483 "libnvdimm: rework region badblocks clearing"
     Tested-by: Toshi Kani <toshi.kani@hpe.com>"

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
  libnvdimm, pfn: fix 'npfns' vs section alignment
  libnvdimm: handle locked label storage areas
  libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
  brd: fix uninitialized use of brd->dax_dev
  block, dax: use correct format string in bdev_dax_supported
  device-dax: fix sysfs attribute deadlock
  libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
  libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
  libnvdimm: rework region badblocks clearing
  acpi, nfit: kill ACPI_NFIT_DEBUG
  libnvdimm: fix clear length of nvdimm_forget_poison()
  libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
  libnvdimm, region: sysfs trigger for nvdimm_flush()
  libnvdimm: fix phys_addr for nvdimm_clear_poison
  x86, dax, pmem: remove indirection around memcpy_from_pmem()
  block: remove block_device_operations ->direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  filesystem-dax: convert to dax_direct_access()
  Revert "block: use DAX for partition table reads"
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  ...
2017-05-05 18:49:20 -07:00
Linus Torvalds
a3719f34fd Merge branch 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
 "The branch contains changes to quota code so that it does not modify
  persistent flags in inode->i_flags (it was the only place in kernel
  doing that) and handle it inside filesystem's quotaon/off handlers
  instead.

  The branch also contains two UDF cleanups, a couple of reiserfs fixes
  and one fix for ext2 quota locking"

* 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: Improve comments in ext4_quota_{on|off}()
  udf: use kmap_atomic for memcpy copying
  udf: use octal for permissions
  quota: Remove dquot_quotactl_ops
  reiserfs: Remove i_attrs_to_sd_attrs()
  reiserfs: Remove useless setting of i_flags
  jfs: Remove jfs_get_inode_flags()
  ext2: Remove ext2_get_inode_flags()
  ext4: Remove ext4_get_inode_flags()
  quota: Stop setting IMMUTABLE and NOATIME flags on quota files
  jfs: Set flags on quota files directly
  ext2: Set flags on quota files directly
  reiserfs: Set flags on quota files directly
  ext4: Set flags on quota files directly
  reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
  reiserfs: Make cancel_old_flush() reliable
  ext2: Call dquot_writeback_dquots() with s_umount held
  reiserfs: avoid a -Wmaybe-uninitialized warning
2017-05-03 11:35:47 -07:00
Jan Kara
dddbd6ac8f ext4: avoid unnecessary transaction stalls during writeback
Currently ext4_writepages() submits all pages with transaction started.
When no page needs block allocation or extent conversion we can submit
all dirty pages in the inode while holding a single transaction handle
and when device is congested this can take significant amount of time.
Thus ext4_writepages() can block transaction commits for extended
periods of time.

Take for example a simple benchmark simulating PostgreSQL database
(pgioperf in mmtest). The benchmark runs 16 processes doing random reads
from a huge file, one process doing random writes to the huge file, and
one process doing sequential writes to a small files and frequently
running fsync. With unpatched kernel transaction commits take on average
~18s with standard deviation of ~41s, top 5 commit times are:

274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.

After this patch transaction commits take on average 0.1s with standard
deviation of 0.15s, top 5 commit times are:

0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s

[ Modified so we use an explicit do_map flag instead of relying on
  io_end not being allocated, the since io_end->inode is needed for I/O
  error handling. -- tytso ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 18:29:10 -04:00
Eric Biggers
7b4cc9787f ext4: evict inline data when writing to memory map
Currently the case of writing via mmap to a file with inline data is not
handled.  This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:

    mkfs.ext4 -O inline_data /dev/vdb
    mount /dev/vdb /mnt
    xfs_io -f /mnt/file \
	-c 'pwrite 0 1' \
	-c 'mmap -w 0 1m' \
	-c 'mwrite 0 1' \
	-c 'fsync'

	kernel BUG at fs/ext4/inode.c:2723!
	invalid opcode: 0000 [#1] SMP
	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
	task: ffff88003d3a8040 task.stack: ffffc90000300000
	RIP: 0010:ext4_writepages+0xc89/0xf8a
	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
	Call Trace:
	 ? _raw_spin_unlock+0x27/0x2a
	 ? kvm_clock_read+0x1e/0x20
	 do_writepages+0x23/0x2c
	 ? do_writepages+0x23/0x2c
	 __filemap_fdatawrite_range+0x80/0x87
	 filemap_write_and_wait_range+0x67/0x8c
	 ext4_sync_file+0x20e/0x472
	 vfs_fsync_range+0x8e/0x9f
	 ? syscall_trace_enter+0x25b/0x2d0
	 vfs_fsync+0x1c/0x1e
	 do_fsync+0x31/0x4a
	 SyS_fsync+0x10/0x14
	 do_syscall_64+0x69/0x131
	 entry_SYSCALL64_slow_path+0x25/0x25

We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be.  So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.

Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:10:50 -04:00
Dan Williams
fa5d932c32 ext2, ext4, xfs: retrieve dax_device for iomap operations
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Jan Kara
38eae95ddc ext4: Remove ext4_get_inode_flags()
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext4_get_inode_flags() call.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
David Howells
3209f68b3c statx: Include a mask for stx_attributes in struct statx
Include a mask in struct stat to indicate which bits of stx_attributes the
filesystem actually supports.

This would also be useful if we add another system call that allows you to
do a 'bulk attribute set' and pass in a statx struct with the masks
appropriately set to say what you want to set.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:06:00 -04:00
David Howells
99652ea56a ext4: Add statx support
Return enhanced file attributes from the Ext4 filesystem.  This includes
the following:

 (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

 (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.

This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.

Example output:

	[root@andromeda ~]# touch foo
	[root@andromeda ~]# chattr +ai foo
	[root@andromeda ~]# /tmp/test-statx foo
	statx(foo) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 08:12           Inode: 2101950     Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2016-02-11 17:08:29.031795451+0000
	Modify: 2016-02-11 17:08:29.031795451+0000
	Change: 2016-02-11 17:11:11.987790114+0000
	 Birth: 2016-02-11 17:08:29.031795451+0000
	Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:58 -04:00
Linus Torvalds
1c23de6308 Fix a memory leak on an error path, and two races when modifying
inodes relating to the inline_data and metadata checksum features.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAljXHNMACgkQ8vlZVpUN
 gaPwoggAiodb37DHZ/X6fnRr8314OJT8mRUbUK3aDagCRb0Kp9iFAwwpHIG8Gxw1
 akI7Jy8VWLC4EbHb9wzXFEO7wl/IBLq3t70Vid2cBR302gblhIIz6hkHrQ9RIlW3
 MH5sFhXiVq4WYPuxQFWS6ohg6/SYTwcgI9rXxEnkLVmOiG2Ov2/v4/wiflau8vgK
 fNYyncHSylwJ5QIaT8mUIawetlunEHO0Vz5AZNzkcMhkzUHxmRWvMtGWcvwukstb
 7vXZhN5HHB8RZ33qcdtuAaNBHwBmrU/acicIpsvL/jfkFWlJTS0PBRUvwxnPeebo
 G0xRDEIwpZoy5h8fxzIxqh+CQqg6QA==
 =/ycw
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a memory leak on an error path, and two races when modifying
  inodes relating to the inline_data and metadata checksum features"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix two spelling nits
  ext4: lock the xattr block before checksuming it
  jbd2: don't leak memory if setting up journal fails
  ext4: mark inode dirty after converting inline directory
2017-03-26 10:29:21 -07:00
Theodore Ts'o
d67d64f423 ext4: fix two spelling nits
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-03-25 17:33:31 -04:00
David Howells
a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were to ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Fabian Frederick
93407472a2 fs: add i_blocksize()
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
  Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Dave Jiang
11bac80004 mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Linus Torvalds
a27fcb0cd1 Changes since last update:
- Various cleanups
  - Livelock fixes for eofblocks scanning
  - Improved input verification for on-disk metadata
  - Fix races in the copy on write remap mechanism
  - Fix buffer io error timeout controls
  - Streamlining of directio copy on write
  - Asynchronous discard support
  - Fix asserts when splitting delalloc reservations
  - Don't bloat bmbt when right shifting extents
  - Inode alignment fixes for 32k block sizes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJYp85wAAoJEPh/dxk0SrTr5HgP/jcx/oI+ap/NaXMi1Q8K65mh
 C3gf27cgUxtdGnEO5KRUE1Jyscuu4ZpzugDdLQISwR55kesT5FU0xpgbsfiICc86
 dxLAhg8auwpTfHV+96Do2hfpO3IhYoBC2w5jo32+C+SaQUqTdPixncZukX89tjyP
 HOFLrQnpc336hCO2rv1Q9hSkD6IUCkSAtk+Dh1xMvbsmKFLGdmkTdqUQfl1U4YnV
 2S98k9QSRdiVyzj3lAGOy+IU9aTcPX/PptMEYaQZEaod5WWNjy91lQZNM6zRc4QW
 8P199yiH6CQa2vESO2SV72cJ40WihM1KQXqnrlJjAMGQ7mMGTGJcTwxhuZYUbDYZ
 cuk6bAUaijt/PzfmydJKlcH8vFerX4aU4CGkxPU0nph0iTR5kxYlIAMmFw2cdRzf
 Iar3SBb8Pc9jiNnEZMFsQ0Fd9hNk9rNoUSpKqm4FtSRocU6JjmpAdPqNYdTVKc2l
 2EY7JMo0xCaTVC1WT6sE2NsxsFvm0R7H6HHG2vMFIMNkhI24GRijIXH6dQlaGCQJ
 5oTHrSM7503qPlEQNsxF7zI02LpJT+duf+2ODw/FSjA1z/TWwOUYYUrPUOyQNdzP
 NrRnMa6LWsEehkuvz2FFko8PKXD55lTuUP1KdjigjqKp8Jzkc/PP+uvuwF5vUFfd
 pWRvE5m/NePWBZetbL3Q
 =Ga1F
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "Here are the XFS changes for 4.11. We aren't introducing any major
  features in this release cycle except for this being the first merge
  window I've managed on my own. :)

  Changes since last update:

   - Various cleanups

   - Livelock fixes for eofblocks scanning

   - Improved input verification for on-disk metadata

   - Fix races in the copy on write remap mechanism

   - Fix buffer io error timeout controls

   - Streamlining of directio copy on write

   - Asynchronous discard support

   - Fix asserts when splitting delalloc reservations

   - Don't bloat bmbt when right shifting extents

   - Inode alignment fixes for 32k block sizes"

* tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits)
  xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
  xfs: simplify xfs_rtallocate_extent
  xfs: tune down agno asserts in the bmap code
  xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment
  xfs: don't reserve blocks for right shift transactions
  xfs: fix len comparison in xfs_extent_busy_trim
  xfs: fix uninitialized variable in _reflink_convert_cow
  xfs: split indlen reservations fairly when under reserved
  xfs: handle indlen shortage on delalloc extent merge
  xfs: resurrect debug mode drop buffered writes mechanism
  xfs: clear delalloc and cache on buffered write failure
  xfs: don't block the log commit handler for discards
  xfs: improve busy extent sorting
  xfs: improve handling of busy extents in the low-level allocator
  xfs: don't fail xfs_extent_busy allocation
  xfs: correct null checks and error processing in xfs_initialize_perag
  xfs: update ctime and mtime on clone destinatation inodes
  xfs: allocate direct I/O COW blocks in iomap_begin
  xfs: go straight to real allocations for direct I/O COW writes
  xfs: return the converted extent in __xfs_reflink_convert_cow
  ...
2017-02-22 18:05:23 -08:00
Theodore Ts'o
0d06863f90 ext4: don't BUG when truncating encrypted inodes on the orphan list
Fix a BUG when the kernel tries to mount a file system constructed as
follows:

echo foo > foo.txt
mke2fs -Fq -t ext4 -O encrypt foo.img 100
debugfs -w foo.img << EOF
write foo.txt a
set_inode_field a i_flags 0x80800
set_super_value s_last_orphan 12
quit
EOF

root@kvm-xfstests:~# mount -o loop foo.img /mnt
[  160.238770] ------------[ cut here ]------------
[  160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
[  160.240106] invalid opcode: 0000 [#1] SMP
[  160.240106] Modules linked in:
[  160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G        W       4.10.0-rc3-00034-gcdd33b941b67 #227
[  160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[  160.240106] task: f4518000 task.stack: f47b6000
[  160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
[  160.240106] EFLAGS: 00010246 CPU: 0
[  160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
[  160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
[  160.240106]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
[  160.240106] Call Trace:
[  160.240106]  ext4_truncate+0x1e9/0x3e5
[  160.240106]  ext4_fill_super+0x286f/0x2b1e
[  160.240106]  ? set_blocksize+0x2e/0x7e
[  160.240106]  mount_bdev+0x114/0x15f
[  160.240106]  ext4_mount+0x15/0x17
[  160.240106]  ? ext4_calculate_overhead+0x39d/0x39d
[  160.240106]  mount_fs+0x58/0x115
[  160.240106]  vfs_kern_mount+0x4b/0xae
[  160.240106]  do_mount+0x671/0x8c3
[  160.240106]  ? _copy_from_user+0x70/0x83
[  160.240106]  ? strndup_user+0x31/0x46
[  160.240106]  SyS_mount+0x57/0x7b
[  160.240106]  do_int80_syscall_32+0x4f/0x61
[  160.240106]  entry_INT80_32+0x2f/0x2f
[  160.240106] EIP: 0xb76b919e
[  160.240106] EFLAGS: 00000246 CPU: 0
[  160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
[  160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
[  160.240106]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
[  160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
[  160.317241] ---[ end trace d6a773a375c810a5 ]---

The problem is that when the kernel tries to truncate an inode in
ext4_truncate(), it tries to clear any on-disk data beyond i_size.
Without the encryption key, it can't do that, and so it triggers a
BUG.

E2fsck does *not* provide this service, and in practice most file
systems have their orphan list processed by e2fsck, so to avoid
crashing, this patch skips this step if we don't have access to the
encryption key (which is the case when processing the orphan list; in
all other cases, we will have the encryption key, or the kernel
wouldn't have allowed the file to be opened).

An open question is whether the fact that e2fsck isn't clearing the
bytes beyond i_size causing problems --- and if we've lived with it
not doing it for so long, can we drop this from the kernel replay of
the orphan list in all cases (not just when we don't have the key for
encrypted inodes).

Addresses-Google-Bug: #35209576

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-14 11:31:15 -05:00
Theodore Ts'o
0db1ff222d ext4: add shutdown bit and check for it
Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-05 01:28:48 -05:00
Theodore Ts'o
eb5efbcb76 ext4: fix inline data error paths
The write_end() function must always unlock the page and drop its ref
count, even on an error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-02-04 23:04:00 -05:00
Christoph Hellwig
8ff6daa17b iomap: constify struct iomap_ops
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-30 16:32:25 -08:00
Jan Kara
3b136499e9 ext4: fix data corruption in data=journal mode
ext4_journalled_write_end() did not propely handle all the cases when
generic_perform_write() did not copy all the data into the target page
and could mark buffers with uninitialized contents as uptodate and dirty
leading to possible data corruption (which would be quickly fixed by
generic_perform_write() retrying the write but still). Fix the problem
by carefully handling the case when the page that is written to is not
uptodate.

CC: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-27 14:35:38 -05:00
Theodore Ts'o
01daf94525 ext4: propagate error values from ext4_inline_data_truncate()
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-22 19:35:49 -05:00
Theodore Ts'o
b907f2d519 ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores
There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-11 22:14:49 -05:00
Linus Torvalds
80eabba702 Merge branch 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block
Pull fs meta data unmap optimization from Jens Axboe:
 "A series from Jan Kara, providing a more efficient way for unmapping
  meta data from in the buffer cache than doing it block-by-block.

  Provide a general helper that existing callers can use"

* 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block:
  fs: Remove unmap_underlying_metadata
  fs: Add helper to clean bdev aliases under a bh and use it
  ext2: Use clean_bdev_aliases() instead of iteration
  ext4: Use clean_bdev_aliases() instead of iteration
  direct-io: Use clean_bdev_aliases() instead of handmade iteration
  fs: Provide function to unmap metadata for a range of blocks
2016-12-14 17:09:00 -08:00