16cdcec736
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which was reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which was reported by Itaru Kitayama, by updating the space
  cache inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed
  items balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason.

Changelog V1 -> V2:
- Break up the global rb-tree and use a list to manage the delayed nodes,
  which are created for every directory and file and are used to manage the
  delayed directory name index items and the delayed inode item.
- Introduce a worker to deal with the delayed nodes.

Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
  manage the delayed nodes that are created for every file/directory. One is
  used to manage all the delayed nodes that have delayed items. The other is
  used to manage the delayed nodes which are waiting to be dealt with by the
  work thread.
- Every delayed node has two rb-trees. One is used to manage the directory
  name indexes which are going to be inserted into the b+ tree, and the other
  is used to manage the directory name indexes which are going to be deleted
  from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker handles
  the delayed directory name index insertions and deletions and the delayed
  inode updates.
  When the number of delayed items is beyond the lower limit, we create works
  for some delayed nodes, insert them into the work queue of the worker, and
  then go back.
  When the number of delayed items is beyond the upper bound, we create works
  for all the delayed nodes that haven't been dealt with, insert them into the
  work queue of the worker, and then wait until the number of untreated items
  is below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
  the information into the delayed inserting rb-tree.
  Then we check the number of the delayed items and do delayed items balance.
  (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we search
  for it in the inserting rb-tree first. If we find it, we just drop it. If
  not, we add its key into the delayed deleting rb-tree.
  As with the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
  inode in the delayed node. The worker will flush it into the b+ tree after
  dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it. In this
  way, we can cache more delayed items and merge more inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
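The insertion, deletion, and balance rules from the Implementation list in the commit message can be illustrated with a small user-space sketch. This is not the patch's code: the two rb-trees are replaced by plain arrays to keep it short, and every name here (`delayed_node_sketch`, `delayed_insert`, `delayed_delete`, `balance_policy`) and both threshold values are hypothetical. The commit message gives neither real function names nor real limits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64		/* hypothetical capacity for the sketch */

/* Stand-in for one delayed node; the rb-trees are replaced by arrays. */
struct delayed_node_sketch {
	uint64_t ins[MAX_PENDING];	/* indexes waiting to be inserted */
	size_t nr_ins;
	uint64_t del[MAX_PENDING];	/* indexes waiting to be deleted */
	size_t nr_del;
};

/* Queue an insertion; returns false if the sketch's array is full. */
static bool delayed_insert(struct delayed_node_sketch *n, uint64_t index)
{
	if (n->nr_ins == MAX_PENDING)
		return false;
	n->ins[n->nr_ins++] = index;
	return true;
}

/*
 * Queue a deletion. If the index is still pending insertion, the two
 * operations cancel out and nothing has to reach the b+ tree.
 */
static bool delayed_delete(struct delayed_node_sketch *n, uint64_t index)
{
	for (size_t i = 0; i < n->nr_ins; i++) {
		if (n->ins[i] == index) {
			/* drop it by moving the last entry into its slot */
			n->ins[i] = n->ins[--n->nr_ins];
			return true;
		}
	}
	if (n->nr_del == MAX_PENDING)
		return false;
	n->del[n->nr_del++] = index;
	return true;
}

enum balance_action { BALANCE_NONE, BALANCE_ASYNC, BALANCE_WAIT };

/* Hypothetical thresholds; the real values are not stated in the text. */
#define SKETCH_LOWER_LIMIT 16
#define SKETCH_UPPER_BOUND 48

/* Decide what the caller does after queueing a delayed item. */
static enum balance_action balance_policy(size_t nr_items)
{
	if (nr_items > SKETCH_UPPER_BOUND)
		return BALANCE_WAIT;	/* queue all nodes, then wait */
	if (nr_items > SKETCH_LOWER_LIMIT)
		return BALANCE_ASYNC;	/* queue some nodes, return at once */
	return BALANCE_NONE;
}
```

In this model, deleting an index that was just queued for insertion leaves both arrays empty, which is the case that saves two b+ tree operations.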
180 lines
4.8 KiB
C
/*
 * Copyright (C) 2007 Oracle. All rights reserved.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public
 * License v2 as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public
 * License along with this program; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 02111-1307, USA.
 */

#ifndef __BTRFS_I__
#define __BTRFS_I__

#include "extent_map.h"
#include "extent_io.h"
#include "ordered-data.h"
#include "delayed-inode.h"

/* in memory btrfs inode */
struct btrfs_inode {
	/* which subvolume this inode belongs to */
	struct btrfs_root *root;

	/* key used to find this inode on disk. This is used by the code
	 * to read in roots of subvolumes
	 */
	struct btrfs_key location;

	/* the extent_tree has caches of all the extent mappings to disk */
	struct extent_map_tree extent_tree;

	/* the io_tree does range state (DIRTY, LOCKED etc) */
	struct extent_io_tree io_tree;

	/* special utility tree used to record which mirrors have already been
	 * tried when checksums fail for a given block
	 */
	struct extent_io_tree io_failure_tree;

	/* held while logging the inode in tree-log.c */
	struct mutex log_mutex;

	/* used to order data wrt metadata */
	struct btrfs_ordered_inode_tree ordered_tree;

	/* for keeping track of orphaned inodes */
	struct list_head i_orphan;

	/* list of all the delalloc inodes in the FS. There are times we need
	 * to write all the delalloc pages to disk, and this list is used
	 * to walk them all.
	 */
	struct list_head delalloc_inodes;

	/*
	 * list for tracking inodes that must be sent to disk before a
	 * rename or truncate commit
	 */
	struct list_head ordered_operations;

	/* node for the red-black tree that links inodes in subvolume root */
	struct rb_node rb_node;

	/* the space_info for where this inode's data allocations are done */
	struct btrfs_space_info *space_info;

	/* full 64 bit generation number, struct inode doesn't have a big
	 * enough field for this.
	 */
	u64 generation;

	/* sequence number for NFS changes */
	u64 sequence;

	/*
	 * transid of the trans_handle that last modified this inode
	 */
	u64 last_trans;

	/*
	 * log transid when this inode was last modified
	 */
	u64 last_sub_trans;

	/*
	 * transid that last logged this inode
	 */
	u64 logged_trans;

	/* total number of bytes pending delalloc, used by stat to calc the
	 * real block usage of the file
	 */
	u64 delalloc_bytes;

	/* total number of bytes that may be used for this inode for
	 * delalloc
	 */
	u64 reserved_bytes;

	/*
	 * the size of the file stored in the metadata on disk. data=ordered
	 * means the in-memory i_size might be larger than the size on disk
	 * because not all the blocks are written yet.
	 */
	u64 disk_i_size;

	/* flags field from the on disk inode */
	u32 flags;

	/*
	 * if this is a directory then index_cnt is the counter for the index
	 * number for new files that are created
	 */
	u64 index_cnt;

	/* the start of block group preferred for allocations. */
	u64 block_group;

	/* the fsync log has some corner cases that mean we have to check
	 * directories to see if any unlinks have been done before
	 * the directory was logged. See tree-log.c for all the
	 * details
	 */
	u64 last_unlink_trans;

	/*
	 * Counters to keep track of the number of extent items we may use due
	 * to delalloc and such. outstanding_extents is the number of extent
	 * items we think we'll end up using, and reserved_extents is the number
	 * of extent items we've reserved metadata for.
	 */
	atomic_t outstanding_extents;
	atomic_t reserved_extents;

	/*
	 * ordered_data_close is set by truncate when a file that used
	 * to have good data has been truncated to zero. When it is set
	 * the btrfs file release call will add this inode to the
	 * ordered operations list so that we make sure to flush out any
	 * new data the application may have written before commit.
	 *
	 * yes, it's silly to have a single bitflag, but we might grow more
	 * of these.
	 */
	unsigned ordered_data_close:1;
	unsigned orphan_meta_reserved:1;
	unsigned dummy_inode:1;

	/*
	 * always compress this one file
	 */
	unsigned force_compress:4;

	/* delayed dir index items and delayed inode update for this inode */
	struct btrfs_delayed_node *delayed_node;

	struct inode vfs_inode;
};

extern unsigned char btrfs_filetype_table[];

/* get the btrfs_inode that embeds the given VFS inode */
static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
{
	return container_of(inode, struct btrfs_inode, vfs_inode);
}

/* set the in-memory i_size and the recorded on-disk size together */
static inline void btrfs_i_size_write(struct inode *inode, u64 size)
{
	i_size_write(inode, size);
	BTRFS_I(inode)->disk_i_size = size;
}

#endif
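BTRFS_I() relies on container_of to step back from the embedded vfs_inode member to the btrfs_inode that contains it. The following is a minimal user-space sketch of that pattern, not kernel code: every `*_sketch` name and `i_size_write_both` are hypothetical stand-ins, the structs are cut down to two fields, and the macro is a plain `offsetof` version without the kernel macro's type checking.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the kernel's container_of (no type checking). */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct inode_sketch {			/* stand-in for struct inode */
	uint64_t i_size;
};

struct btrfs_inode_sketch {		/* stand-in for struct btrfs_inode */
	uint64_t disk_i_size;
	struct inode_sketch vfs_inode;	/* embedded, as in the header */
};

/* Recover the enclosing structure from a pointer to its embedded member. */
static struct btrfs_inode_sketch *BTRFS_I_sketch(struct inode_sketch *inode)
{
	return container_of_sketch(inode, struct btrfs_inode_sketch, vfs_inode);
}

/* Mirrors btrfs_i_size_write(): update both sizes together. */
static void i_size_write_both(struct inode_sketch *inode, uint64_t size)
{
	inode->i_size = size;	/* the kernel calls i_size_write() here */
	BTRFS_I_sketch(inode)->disk_i_size = size;
}
```

Because vfs_inode is embedded by value rather than pointed to, the conversion is a constant pointer adjustment and needs no lookup.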