btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic:

  kernel BUG at mm/page-writeback.c:2659!
  RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
  Call Trace:
   __process_pages_contig+0x25a/0x350
   ? extent_clear_unlock_delalloc+0x43/0x70
   submit_compressed_extents+0x359/0x4d0
   normal_work_helper+0x15a/0x330
   process_one_work+0x1f5/0x3f0
   worker_thread+0x2d/0x3d0
   ? rescuer_thread+0x340/0x340
   kthread+0x111/0x130
   ? kthread_create_on_node+0x60/0x60
   ret_from_fork+0x1f/0x30

This is happening because the page is not locked when doing
clear_page_dirty_for_io.  Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.

This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do

  actual_end = min_t(u64, i_size_read(inode), end + 1);

which is problematic because i_size_read() can evaluate to different
values in between checking and assigning.  So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.

I confirmed this was what was happening by installing a debug kernel
that had

  actual_end = min_t(u64, i_size_read(inode), end + 1);
  if (actual_end > end + 1) {
          printk(KERN_ERR "KABOOM\n");
          actual_end = end + 1;
  }

and installing it onto 500 boxes of the tier that had been seeing the
problem regularly.  Last night I got my debug message and no panic,
confirming what I expected.
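For illustration only, here is a minimal userspace sketch of the same
single-read versus double-read pattern.  struct fake_inode,
actual_end_racy and actual_end_fixed are illustrative stand-ins, not
btrfs code; only the shape of the race is the same.

  #include <stdint.h>

  #define barrier() __asm__ __volatile__("" ::: "memory")

  struct fake_inode {
          uint64_t i_size;        /* may grow concurrently (truncate/fallocate) */
  };

  /*
   * Racy: nothing stops the compiler from loading ino->i_size twice,
   * once for the compare and once for the result.  If i_size grows in
   * between, the returned value can exceed end + 1, which is what
   * happened in compress_file_range even through min_t()'s temporaries.
   */
  static uint64_t actual_end_racy(const struct fake_inode *ino, uint64_t end)
  {
          return ino->i_size < end + 1 ? ino->i_size : end + 1;
  }

  /*
   * Fixed: load i_size exactly once into a local and compute the min on
   * that copy.  The barrier()s stop the compiler from rematerializing
   * the local by re-reading ino->i_size, mirroring the patch below.
   */
  static uint64_t actual_end_fixed(const struct fake_inode *ino, uint64_t end)
  {
          uint64_t i_size;

          barrier();
          i_size = ino->i_size;
          barrier();
          return i_size < end + 1 ? i_size : end + 1;
  }

Compiling both helpers with gcc -O2 -S and comparing how often i_size is
loaded reproduces, in miniature, the single- versus double-read
difference shown in the assembly below.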
[ dsterba: the assembly confirms a tiny race window:

    mov    0x20(%rsp),%rax
    cmp    %rax,0x48(%r15)          # read
    movl   $0x0,0x18(%rsp)
    mov    %rax,%r12
    mov    %r14,%rax
    cmovbe 0x48(%r15),%r12          # eval

  Where r15 is inode and 0x48 is the offset of i_size.

  The original fix was to revert 62b3762271, which would do an
  intermediate assignment and also avoid the double evaluation, but it
  is not future-proof should the compiler merge the stores and call
  i_size_read anyway.

  There's a patch adding READ_ONCE to i_size_read, but that's not being
  applied at the moment and we need to fix the bug.  Instead, emulate
  READ_ONCE with two barrier()s, which is effectively what READ_ONCE
  does.

  The assembly confirms single evaluation:

    mov    0x48(%rbp),%rax          # read once
    mov    0x20(%rsp),%rcx
    mov    $0x20,%edx
    cmp    %rax,%rcx
    cmovbe %rcx,%rax
    mov    %rax,(%rsp)
    mov    %rax,%rcx
    mov    %r14,%rax

  Where 0x48(%rbp) is inode->i_size stored to %rax. ]
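As a companion sketch, a userspace stand-in for READ_ONCE shows why a
single annotated read makes the barrier pair unnecessary.  READ_ONCE_U64
and fake_inode are illustrative substitutes, not the kernel's READ_ONCE
implementation or the pending i_size_read patch mentioned above.

  #include <stdint.h>

  /* Userspace stand-in for the kernel's READ_ONCE() on a 64-bit field. */
  #define READ_ONCE_U64(x) (*(const volatile uint64_t *)&(x))

  struct fake_inode {
          uint64_t i_size;
  };

  /* The volatile access forces exactly one load, so no barriers are needed. */
  static uint64_t actual_end_read_once(const struct fake_inode *ino, uint64_t end)
  {
          uint64_t i_size = READ_ONCE_U64(ino->i_size);

          return i_size < end + 1 ? i_size : end + 1;
  }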
("btrfs: Remove isize local variable in compress_file_range") CC: stable@vger.kernel.org # v5.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>
parent 0cab7acc4a
commit d98da49977

1 changed file with 14 additions and 1 deletion
fs/btrfs/inode.c

@@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
         u64 start = async_chunk->start;
         u64 end = async_chunk->end;
         u64 actual_end;
+        u64 i_size;
         int ret = 0;
         struct page **pages = NULL;
         unsigned long nr_pages;
@@ -488,7 +489,19 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
         inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
                         SZ_16K);
 
-        actual_end = min_t(u64, i_size_read(inode), end + 1);
+        /*
+         * We need to save i_size before now because it could change in between
+         * us evaluating the size and assigning it.  This is because we lock and
+         * unlock the page in truncate and fallocate, and then modify the i_size
+         * later on.
+         *
+         * The barriers are to emulate READ_ONCE, remove that once i_size_read
+         * does that for us.
+         */
+        barrier();
+        i_size = i_size_read(inode);
+        barrier();
+        actual_end = min_t(u64, i_size, end + 1);
 again:
         will_compress = 0;
         nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;