Commit graph

1372 commits

Author SHA1 Message Date
Yan, Zheng
fc2744aa12 ceph: fix race between page writeback and truncate
The client can receive truncate request from MDS at any time.
So the page writeback code need to get i_size, truncate_seq and
truncate_size atomically

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:47 -07:00
Yan, Zheng
3803da4963 ceph: reset iov_len when discarding cap release messages
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:47 -07:00
Yan, Zheng
bb137f84d1 ceph: fix cap release race
ceph_encode_inode_release() can race with ceph_open() and release
caps wanted by open files. So it should call __ceph_caps_wanted()
to get the wanted caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:46 -07:00
790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00
Jie Liu
46a1c2c7ae vfs: export lseek_execute() to modules
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().

To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:27 +04:00
9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Dan Carpenter
6af8652849 ceph: tidy ceph_mdsmap_decode() a little
I introduced a new temporary variable "info" instead of
"m->m_info[mds]".  Also I reversed the if condition and pulled
everything in one indent level.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:02 -07:00
Emil Goode
c213b50b7d ceph: improve error handling in ceph_mdsmap_decode
This patch makes the following improvements to the error handling
in the ceph_mdsmap_decode function:

- Add a NULL check for return value from kcalloc
- Make use of the variable err

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-01 09:52:02 -07:00
Jim Schutt
4d1bf79aff ceph: fix up comment for ceph_count_locks() as to which lock to hold
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-07-01 09:52:01 -07:00
Jeff Layton
1c8c601a8c locks: protect most of the file_lock handling with i_lock
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.

->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.

Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.

For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the  blocked_list is done without dropping the
lock in between.

On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:42 +04:00
Al Viro
77acfa29e1 [readdir] convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:41 +04:00
8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Lukas Czerner
569d39fc3e ceph: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in ceph_invalidatepage().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-05-21 23:58:48 -04:00
Lukas Czerner
d47992f86b mm: change invalidatepage prototype to accept length
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
2013-05-21 23:17:23 -04:00
Jim Schutt
39be95e9c8 ceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
while holding a lock, but it's spoiled because ceph_pagelist_addpage()
always calls kmap(), which might sleep.  Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
    . . .
[13439.376225] Call Trace:
[13439.378757]  [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
[13439.384353]  [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491]  [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035]  [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
[13439.403775]  [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
[13439.409277]  [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622]  [<ffffffff81196748>] ? igrab+0x28/0x70
[13439.420610]  [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584]  [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499]  [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646]  [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363]  [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250]  [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
[13439.461534]  [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432]  [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
[13439.473934]  [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464]  [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
[13439.486492]  [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
[13439.493190]  [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
    . . .
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the number
of encoded locks is stable, copy that into a ceph_pagelist.

[elder@inktank.com: abbreviated the stack info a bit.]

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17 12:45:48 -05:00
Jim Schutt
c420276a53 ceph: add cpu_to_le32() calls when encoding a reconnect capability
In his review, Alex Elder mentioned that he hadn't checked that
num_fcntl_locks and num_flock_locks were properly decoded on the
server side, from a le32 over-the-wire type to a cpu type.
I checked, and AFAICS it is done; those interested can consult
    Locker::_do_cap_update()
in src/mds/Locker.cc and src/include/encoding.h in the Ceph server
code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe
that also happens correctly, by virtue of having been declared
__le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17 12:45:43 -05:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Alex Elder
812164f8c3 ceph: use ceph_create_snap_context()
Now that we have a library routine to create snap contexts, use it.

This is part of:
    http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:09 -07:00
Alex Elder
406e2c9f92 libceph: kill off osd data write_request parameters
In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or the
out_data.  Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:58 -07:00
Randy Dunlap
ac7f29bf2e ceph: fix printk format warnings in file.c
Fix printk format warnings by using %zd for 'ssize_t' variables:

fs/ceph/file.c:751:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]
fs/ceph/file.c:762:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc:	ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:57 -07:00
Yan, Zheng
1ac0fc8adf ceph: fix race between writepages and truncate
ceph_writepages_start() reads inode->i_size in two places. It can get
different values between successive read, because truncate can change
inode->i_size at any time. The race can lead to mismatch between data
length of osd request and pages marked as writeback. When osd request
finishes, it clear writeback page according to its data length. So
some pages can be left in writeback state forever. The fix is only
read inode->i_size once, save its value to a local variable and use
the local variable when i_size is needed.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:18:55 -07:00
Yan, Zheng
03d254edeb ceph: apply write checks in ceph_aio_write
copy write checks in __generic_file_aio_write to ceph_aio_write.
To make these checks cover sync write path.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:18:54 -07:00
Yan, Zheng
37505d5768 ceph: take i_mutex before getting Fw cap
There is deadlock as illustrated bellow. The fix is taking i_mutex
before getting Fw cap reference.

      write                    truncate                 MDS
---------------------     --------------------      --------------
get Fw cap
                          lock i_mutex
lock i_mutex (blocked)
                          request setattr.size  ->
                                                <-   revoke Fw cap

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:53 -07:00
Alex Elder
26be88087a libceph: change how "safe" callback is used
An osd request currently has two callbacks.  They inform the
initiator of the request when we've received confirmation for the
target osd that a request was received, and when the osd indicates
all changes described by the request are durable.

The only time the second callback is used is in the ceph file system
for a synchronous write.  There's a race that makes some handling of
this case unsafe.  This patch addresses this problem.  The error
handling for this callback is also kind of gross, and this patch
changes that as well.

In ceph_sync_write(), if a safe callback is requested we want to add
the request on the ceph inode's unsafe items list.  Because items on
this list must have their tid set (by ceph_osd_start_request()), the
request added *after* the call to that function returns.  The
problem with this is that there's a race between starting the
request and adding it to the unsafe items list; the request may
already be complete before ceph_sync_write() even begins to put it
on the list.

To address this, we change the way the "safe" callback is used.
Rather than just calling it when the request is "safe", we use it to
notify the initiator the bounds (start and end) of the period during
which the request is *unsafe*.  So the initiator gets notified just
before the request gets sent to the osd (when it is "unsafe"), and
again when it's known the results are durable (it's no longer
unsafe).  The first call will get made in __send_request(), just
before the request message gets sent to the messenger for the first
time.  That function is only called by __send_queued(), which is
always called with the osd client's request mutex held.

We then have this callback function insert the request on the ceph
inode's unsafe list when we're told the request is unsafe.  This
will avoid the race because this call will be made under protection
of the osd client's request mutex.  It also nicely groups the setup
and cleanup of the state associated with managing unsafe requests.

The name of the "safe" callback field is changed to "unsafe" to
better reflect its new purpose.  It has a Boolean "unsafe" parameter
to indicate whether the request is becoming unsafe or is now safe.
Because the "msg" parameter wasn't used, we drop that.

This resolves the original problem reportedin:
    http://tracker.ceph.com/issues/4706

Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:52 -07:00
Alex Elder
7d7d51ce14 ceph: let osd client clean up for interrupted request
In ceph_sync_write(), if a safe callback is supplied with a request,
and an error is returned by ceph_osdc_wait_request(), a block of
code is executed to remove the request from the unsafe writes list
and drop references to capabilities acquired just prior to a call to
ceph_osdc_wait_request().

The only function used for this callback is sync_write_commit(),
and it does *exactly* what that block of error handling code does.

Now in ceph_osdc_wait_request(), if an error occurs (due to an
interupt during a wait_for_completion_interruptible() call),
complete_request() gets called, and that calls the request's
safe_callback method if it's defined.

So this means that this cleanup activity gets called twice in this
case, which is erroneous (and in fact leads to a crash).

Fix this by just letting the osd client handle the cleanup in
the event of an interrupt.

This resolves one problem mentioned in:
    http://tracker.ceph.com/issues/4706

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01 21:18:51 -07:00
Yan, Zheng
0b93267252 ceph: fix symlink inode operations
add getattr/setattr and xattrs related methods.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:18:50 -07:00
Sam Lang
a84cd29335 ceph: Use pseudo-random numbers to choose mds
We don't need to use up entropy to choose an mds,
so use prandom_u32() to get a pseudo-random number.

Also, we don't need to choose a random mds if only
one mds is available, so add special casing for the
common case.

Fixes http://tracker.ceph.com/issues/3579

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:18:49 -07:00
Alex Elder
90af36022a libceph: add, don't set data for a message
Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing.  Currently only one data item is
ever added to a message, but that's about to change.

This resolves:
    http://tracker.ceph.com/issues/2770

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:34 -07:00
Alex Elder
a4ce40a9a7 libceph: combine initializing and setting osd data
This ends up being a rather large patch but what it's doing is
somewhat straightforward.

Basically, this is replacing two calls with one.  The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters.  In place of those two will be a single function that
initializes the op directly.

That means we sort of fan out a set of the needed functions:
    - extent ops with pages data
    - extent ops with pagelist data
    - extent ops with bio list data
and
    - class ops with page data for receiving a response

We also have define another one, but it's only used internally:
    - class ops with pagelist data for request parameters

Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields.  All the osd ops refer to them for
their data.  For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:23 -07:00
Alex Elder
c99d2d4abb libceph: specify osd op by index in request
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the the osd request's array.

So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize.  This better hides the details the
op structure (and faciltates moving the data pointers they use).

Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope.  Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:15 -07:00
Alex Elder
8c042b0df9 libceph: add data pointers in osd op structures
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message.  Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.

Add a ceph_osd_data pointer to each of those structures, and assign
it to point to eithre the incoming or the outgoing data structure in
the osd message.  The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.

Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient.  Add some assertions to verify
pointers are always set the way they're expected to be.

This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.

This is the first in a series of patches that resolve:
    http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:14 -07:00
Alex Elder
79528734f3 libceph: keep source rather than message osd op array
An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.

In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.

As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops.  And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.

This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used.  The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2.  So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.

The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.

Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.

It does not yet add per-op data, that's coming soon.

This resolves:
    http://tracker.ceph.com/issues/4656

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:12 -07:00
Alex Elder
87060c1089 libceph: a few more osd data cleanups
These are very small changes that make use osd_data local pointers
as shorthands for structures being operated on.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:10 -07:00
Alex Elder
43bfe5de9f libceph: define osd data initialization helpers
Define and use functions that encapsulate the initializion of a
ceph_osd_data structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:06 -07:00
Alex Elder
e5975c7c8e ceph: build osd request message later for writepages
Hold off building the osd request message in ceph_writepages_start()
until just before it will be submitted to the osd client for
execution.

We'll still create the request and allocate the page pointer array
after we learn we have at least one page to write.  A local variable
will be used to keep track of the allocated array of pages.  Wait
until just before submitting the request for assigning that page
array pointer to the request message.

Create ands use a new function osd_req_op_extent_update() whose
purpose is to serve this one spot where the length value supplied
when an osd request's op was initially formatted might need to get
changed (reduced, never increased) before submitting the request.

Previously, ceph_writepages_start() assigned the message header's
data length because of this update.  That's no longer necessary,
because ceph_osdc_build_request() will recalculate the right
value to use based on the content of the ops in the request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:02 -07:00
Alex Elder
02ee07d300 libceph: hold off building osd request
Defer building the osd request until just before submitting it in
all callers except ceph_writepages_start().  (That caller will be
handed in the next patch.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:01 -07:00
Alex Elder
88486957f9 ceph: kill ceph alloc_page_vec()
There is a helper function alloc_page_vec() that, despite its
generic sounding name depends heavily on an osd request structure
being populated with certain information.

There is only one place this function is used, and it ends up
being a bit simpler to just open code what it does, so get
rid of the helper.

The real motivation for this is deferring building the of the osd
request message, and this is a step in that direction.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:00 -07:00
Alex Elder
94fe8420bf ceph: define ceph_writepages_osd_request()
Mostly for readability, define ceph_writepages_osd_request() and
use it to allocate the osd request for ceph_writepages_start().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:59 -07:00
Alex Elder
acead002b2 libceph: don't build request in ceph_osdc_new_request()
This patch moves the call to ceph_osdc_build_request() out of
ceph_osdc_new_request() and into its caller.

This is in order to defer formatting osd operation information into
the request message until just before request is started.

The only unusual (ab)user of ceph_osdc_build_request() is
ceph_writepages_start(), where the final length of write request may
change (downward) based on the current inode size or the oldest
snapshot context with dirty data for the inode.

The remaining callers don't change anything in the request after has
been built.

This means the ops array is now supplied by the caller.  It also
means there is no need to pass the mtime to ceph_osdc_new_request()
(it gets provided to ceph_osdc_build_request()).  And rather than
passing a do_sync flag, have the number of ops in the ops array
supplied imply adding a second STARTSYNC operation after the READ or
WRITE requested.

This and some of the patches that follow are related to having the
messenger (only) be responsible for filling the content of the
message header, as described here:
    http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:58 -07:00
Alex Elder
25d71cb92d ceph: use page_offset() in ceph_writepages_start()
There's one spot in ceph_writepages_start() that open-codes what
page_offset() does safely.  Use the macro so we don't have to worry
about wrapping.

This resolves:
    http://tracker.ceph.com/issues/4648

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:53 -07:00
Alex Elder
3bf53337af ceph: set up page array mempool with correct size
In create_fs_client() a memory pool is set up be used for arrays of
pages that might be needed in ceph_writepages_start() if memory is
tight.  There are two problems with the way it's initialized:
    - The size provided is the number of pages we want in the
      array, but it should be the number of bytes required for
      that many page pointers.
    - The number of pages computed can end up being 0, while we
      will always need at least one page.

This patch fixes both of these problems.

This resolves the two simple problems defined in:
    http://tracker.ceph.com/issues/4603

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:17:50 -07:00
Sage Weil
27859f9773 libceph: wrap auth ops in wrapper functions
Use wrapper functions that check whether the auth op exists so that callers
do not need a bunch of conditional checks.  Simplifies the external
interface.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:14 -07:00
Sage Weil
0bed9b5c52 libceph: add update_authorizer auth method
Currently the messenger calls out to a get_authorizer con op, which will
create a new authorizer if it doesn't yet have one.  In the meantime, when
we rotate our service keys, the authorizer doesn't get updated.  Eventually
it will be rejected by the server on a new connection attempt and get
invalidated, and we will then rebuild a new authorizer, but this is not
ideal.

Instead, if we do have an authorizer, call a new update_authorizer op that
will verify that the current authorizer is using the latest secret.  If it
is not, we will build a new one that does.  This avoids the transient
failure.

This fixes one of the sorry sequence of events for bug

	http://tracker.ceph.com/issues/4282

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01 21:17:13 -07:00
Henry C Chang
022f3e2ee2 ceph: fix buffer pointer advance in ceph_sync_write
We should advance the user data pointer by _len_ instead of _written_.
_len_ is the data length written in each iteration while _written_ is the
accumulated data length we have writtent out.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Tested-by: Sage Weil <sage@inktank.com>
2013-05-01 21:17:08 -07:00
Yan, Zheng
2f276c5111 ceph: use i_release_count to indicate dir's completeness
Current ceph code tracks directory's completeness in two places.
ceph_readdir() checks i_release_count to decide if it can set the
I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
flag. This indirection introduces locking complexity.

This patch adds a new variable i_complete_count to ceph_inode_info.
Set i_release_count's value to it when marking a directory complete.
By comparing the two variables, we know if a directory is complete

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01 21:17:07 -07:00
Alex Elder
ebf18f4709 ceph: only set message data pointers if non-empty
Change it so we only assign outgoing data information for messages
if there is outgoing data to send.

This then allows us to add a few more (currently commented-out)
assertions.

This is related to:
    http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:41 -07:00
Alex Elder
27fa83852b libceph: isolate other message data fields
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure.  Use the
new functions in the osd client and mds client.

This partially resolves:
    http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:40 -07:00
Alex Elder
f1baeb2b9f libceph: set page info with byte length
When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:39 -07:00
Alex Elder
02afca6ca0 libceph: isolate message page field manipulation
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure.  Use this new function in the osd client and mds
client.

Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that).  At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once.  (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)

Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.

This partially resolves:
    http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:38 -07:00
Alex Elder
e0c594878e libceph: record byte count not page count
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:36 -07:00
Alex Elder
0fff87ec79 libceph: separate read and write data
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.

Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.

This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data.  It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.

This resolves:
    http://tracker.ceph.com/issues/4127

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:27 -07:00
Alex Elder
2ac2b7a6d4 libceph: distinguish page and bio requests
An osd request uses either pages or a bio list for its data.  Use a
union to record information about the two, and add a data type
tag to select between them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:25 -07:00
Alex Elder
2794a82a11 libceph: separate osd request data info
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:24 -07:00
Alex Elder
153e5167e0 libceph: don't assign page info in ceph_osdc_new_request()
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields.  The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.

Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller.  As a result, the page_align parameter is no
longer used, so get rid of it.

Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way).  So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request().  Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.

Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it.  Move the assignment of the page alignment
down with the others there as well.

This and the next few patches are preparation work for:
    http://tracker.ceph.com/issues/4127

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:23 -07:00
Alex Elder
3a42b6c43e ceph: simplify ceph_sync_write() page_align calculation
(This is being reposted.  The first one had a problem because it
erroneously added a similar change elsewhere; that change has been
dropped.)

The next patch in this series points out that the calculation for
the number of pages in an osd request is getting done twice.  It
is not obvious, but the result of both calculations is identical.
This patch simplifies one of them--as a separate step--to make
it clear that the transformation in the next patch is valid.

In ceph_sync_write() there is some magic that computes page_align
for an osd request.  But a little analysis shows it can be
simplified.

First, we have:
 	io_align = pos & ~PAGE_MASK;
which is used here:
	page_align = (pos - io_align + buf_align) & ~PAGE_MASK;

Note (pos - io_align) simply rounds "pos" down to the nearest multiple
of the page size.

We also have:
 	buf_align = (unsigned long)data & ~PAGE_MASK;

Adding buf_align to that rounded-down "pos" value will stay within
the same page; the result will just be offset by the page offset for
the "data" pointer.  The final mask therefore leaves just the value
of "buf_align".

One more simplification.  Note that the result of calc_pages_for()
is invariant of which page the offset starts in--the only thing that
matters is the offset within the starting page.  We will have
put the proper page offset to use into "page_align", so just use
that in calculating num_pages.

This resolves:
    http://tracker.ceph.com/issues/4166

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:22 -07:00
Alex Elder
cf7b7e1492 ceph: use calc_pages_for() in start_read()
There's a spot that computes the number of pages to allocate for a
page-aligned length by just shifting it.  Use calc_pages_for()
instead, to be consistent with usage everywhere else.  The result
is the same.

The reason for this is to make it clearer in an upcoming patch that
this calculation is duplicated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:21 -07:00
Alex Elder
54ae0756e3 libceph: no need for alignment for mds message
Currently, incoming mds messages never use page data, which means
there is no need to set the page_alignment field in the message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:20 -07:00
Alex Elder
53ded495c6 libceph: define mds_alloc_msg() method
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client.  Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.

This and the next patch resolve:
    http://tracker.ceph.com/issues/4322

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:19 -07:00
Alex Elder
41766f87f5 libceph: rename ceph_calc_object_layout()
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.

Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.

Change the function so it takes a pool number rather than the full
file layout structure.  Only update the ceph_pg if the pool is found
in the osd map.  Get rid of few useless lines of code from the
function while there.

Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:17 -07:00
Alex Elder
ec02a2f2ff libceph: kill ceph_msg->pagelist_count
The pagelist_count field is never actually used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:16:16 -07:00
Yan, Zheng
3f99969f42 ceph: acquire i_mutex in __ceph_do_pending_vmtruncate
make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call safely.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:11 -07:00
Yan, Zheng
6070e0c1e2 ceph: don't early drop Fw cap
ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
cap dirty before data is copied to page cache and inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces inode size update race. If
ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. So just remove the
optimization.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:16:10 -07:00
Sage Weil
7971bd92ba ceph: revert commit 22cddde104
commit 22cddde104 breaks the atomicity of write operation, it also
introduces a deadlock between write and truncate.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Conflicts:
	fs/ceph/addr.c
2013-05-01 21:15:58 -07:00
Yan, Zheng
a8673d61ad ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag
commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().

This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:14:33 -07:00
Yan, Zheng
964266cce9 ceph: set mds_want according to cap import message
MDS ignores cap update message if migrate_seq mismatch, so when
receiving a cap import message with higher migrate_seq, set mds_want
according to the cap import message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:14:32 -07:00
Yan, Zheng
d40ee0dcc1 ceph: queue cap release when trimming cap
So the client will later send cap release message to MDS

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:14:31 -07:00
Yan, Zheng
8a03449700 ceph: fix LSSNAP regression
commit 6e8575faa8 makes parse_reply_info_extra() return -EIO for LSSNAP

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 21:14:30 -07:00
Alex Elder
d4b515fa10 libceph: distinguish page array and pagelist count
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list.  Currently only one or the
other is used at a time, but that will be changing soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:14:28 -07:00
Eric W. Biederman
7f78e03513 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 19:36:31 -08:00
1cf0209c43 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "A few groups of patches here.  Alex has been hard at work improving
  the RBD code, layout groundwork for understanding the new formats and
  doing layering.  Most of the infrastructure is now in place for the
  final bits that will come with the next window.

  There are a few changes to the data layout.  Jim Schutt's patch fixes
  some non-ideal CRUSH behavior, and a set of patches from me updates
  the client to speak a newer version of the protocol and implement an
  improved hashing strategy across storage nodes (when the server side
  supports it too).

  A pair of patches from Sam Lang fix the atomicity of open+create
  operations.  Several patches from Yan, Zheng fix various mds/client
  issues that turned up during multi-mds torture tests.

  A final set of patches expose file layouts via virtual xattrs, and
  allow the policies to be set on directories via xattrs as well
  (avoiding the awkward ioctl interface and providing a consistent
  interface for both kernel mount and ceph-fuse users)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
  libceph: add support for HASHPSPOOL pool flag
  libceph: update osd request/reply encoding
  libceph: calculate placement based on the internal data types
  ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
  ceph: update "ceph_features.h"
  libceph: decode into cpu-native ceph_pg type
  libceph: rename ceph_pg -> ceph_pg_v1
  rbd: pass length, not op for osd completions
  rbd: move rbd_osd_trivial_callback()
  libceph: use a do..while loop in con_work()
  libceph: use a flag to indicate a fault has occurred
  libceph: separate non-locked fault handling
  libceph: encapsulate connection backoff
  libceph: eliminate sparse warnings
  ceph: eliminate sparse warnings in fs code
  rbd: eliminate sparse warnings
  libceph: define connection flag helpers
  rbd: normalize dout() calls
  rbd: barriers are hard
  rbd: ignore zero-length requests
  ...
2013-02-28 17:43:09 -08:00
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Sage Weil
1b83bef24c libceph: update osd request/reply encoding
Use the new version of the encoding for osd requests and replies.  In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request.  Update the rbd and fs/ceph users
appropriately.

The main changes are:
 - we keep pointers into the request memory for fields we need to update
   each time the request is sent out over the wire
 - we keep information about the result in an array in the request struct
   where the users can easily get at it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:50 -08:00
Sage Weil
2169aea649 libceph: calculate placement based on the internal data types
Instead of using the old ceph_object_layout struct, update our internal
ceph_calc_object_layout method to use the ceph_pg type.  This allows us to
pass the full 32-bit precision of the pgid.seed to the callers.  It also
allows some callers to avoid reaching into the request structures for the
struct ceph_object_layout fields.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:37 -08:00
Sage Weil
4f6a7e5ee1 ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features.
These have been present in ceph.git since v0.42, Feb 2012.  Require these
features to simplify support; nobody is running older userspace.

Note that the new request and reply encoding is still not in place, so the new
code is not yet functional.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:25 -08:00
Sage Weil
5b191d9914 libceph: decode into cpu-native ceph_pg type
Always decode data into our cpu-native ceph_pg type that has the correct
field widths.  Limit any remaining uses of ceph_pg_v1 to dealing with the
legacy protocol.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:01:57 -08:00
Sage Weil
12979354a1 libceph: rename ceph_pg -> ceph_pg_v1
Rename the old version this type to distinguish it from the new version.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:01:41 -08:00
Namjae Jeon
94e07a7590 fs: encode_fh: return FILEID_INVALID if invalid fid_type
This patch is a follow up on below patch:

[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcb

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:10 -05:00
Sage Weil
79f9f99ad1 ceph: prepopulate inodes only when request is aborted
If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache.  However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:08 -05:00
94f2f14234 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
 "This set of changes starts with a few small enhnacements to the user
  namespace.  reboot support, allowing more arbitrary mappings, and
  support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
  user namespace root.

  I do my best to document that if you care about limiting your
  unprivileged users that when you have the user namespace support
  enabled you will need to enable memory control groups.

  There is a minor bug fix to prevent overflowing the stack if someone
  creates way too many user namespaces.

  The bulk of the changes are a continuation of the kuid/kgid push down
  work through the filesystems.  These changes make using uids and gids
  typesafe which ensures that these filesystems are safe to use when
  multiple user namespaces are in use.  The filesystems converted for
  3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs.  The
  changes for these filesystems were a little more involved so I split
  the changes into smaller hopefully obviously correct changes.

  XFS is the only filesystem that remains.  I was hoping I could get
  that in this release so that user namespace support would be enabled
  with an allyesconfig or an allmodconfig but it looks like the xfs
  changes need another couple of days before it they are ready."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
  cifs: Enable building with user namespaces enabled.
  cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
  cifs: Convert struct cifs_sb_info to use kuids and kgids
  cifs: Modify struct smb_vol to use kuids and kgids
  cifs: Convert struct cifsFileInfo to use a kuid
  cifs: Convert struct cifs_fattr to use kuid and kgids
  cifs: Convert struct tcon_link to use a kuid.
  cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
  cifs: Convert from a kuid before printing current_fsuid
  cifs: Use kuids and kgids SID to uid/gid mapping
  cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
  cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
  cifs: Override unmappable incoming uids and gids
  nfsd: Enable building with user namespaces enabled.
  nfsd: Properly compare and initialize kuids and kgids
  nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
  nfsd: Modify nfsd4_cb_sec to use kuids and kgids
  nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
  nfsd: Convert nfsxdr to use kuids and kgids
  nfsd: Convert nfs3xdr to use kuids and kgids
  ...
2013-02-25 16:00:49 -08:00
Alex Elder
2c3dd4ff59 ceph: eliminate sparse warnings in fs code
Fix the causes for sparse warnings reported in the ceph file system
code.  Here there are only two (and they're sort of silly but
they're easy to fix).

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:14 -06:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Sage Weil
92a49fb0f7 ceph: fix statvfs fr_size
Different versions of glibc are broken in different ways, but the short of
it is that for the time being, frsize should == bsize, and be used as the
multiple for the blocks, free, and available fields.  This mirrors what is
done for NFS.  The previous reporting of the page size for frsize meant
that newer glibc and df would report a very small value for the fs size.

Fixes http://tracker.ceph.com/issues/3793.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-02-22 15:31:00 -08:00
Alex Elder
9e0eb85d58 ceph: remove a few bogus declarations
There are three ceph page vector functions declared in
"fs/ceph/super.h" that don't belong there.  They're
probably left over from some long-ago code reorganization.

They're properly declared in "include/linux/ceph/libceph.h"
so just delete the ones in "super.h".

This and the next few commits resolve:
    http://tracker.ceph.com/issues/4053

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:03 -06:00
Alex Elder
0eb40bf65e libceph: update ceph_mds_state_name() and ceph_mds_op_name()
Update ceph_mds_state_name() and ceph_mds_op_name() to include the
newly-added definitions in "ceph_fs.h", and to match its counterpart
in the user space code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:34 -06:00
Alex Elder
a3bea47e8b ceph: kill ceph_osdc_new_request() "num_reply" parameter
The "num_reply" parameter to ceph_osdc_new_request() is never
used inside that function, so get rid of it.

Note that ceph_sync_write() passes 2 for that argument, while all
other callers pass 1.  It doesn't matter, but perhaps someone should
verify this doesn't indicate a problem.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:39 -06:00
Alex Elder
2480882611 ceph: kill ceph_osdc_writepages() "flags" parameter
There is only one caller of ceph_osdc_writepages(), and it always
passes 0 as its "flags" argument.  Get rid of that argument and
replace its use in ceph_osdc_writepages() with 0.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:35 -06:00
Alex Elder
fbf8685fb1 ceph: kill ceph_osdc_writepages() "dosync" parameter
There is only one caller of ceph_osdc_writepages(), and it always
passes 0 as its "dosync" argument.  Get rid of that argument and
replace its use in ceph_osdc_writepages() with 0.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:28 -06:00
Alex Elder
87f979d390 ceph: kill ceph_osdc_writepages() "nofail" parameter
There is only one caller of ceph_osdc_writepages(), and it always
passes the value true as its "nofail" argument.  Get rid of that
argument and replace its use in ceph_osdc_writepages() with the
constant value true.

This and a number of cleanup patches that follow resolve:
    http://tracker.ceph.com/issues/4126

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:19:22 -06:00
Sage Weil
695b711933 ceph: implement hidden per-field ceph.*.layout.* vxattrs
Allow individual fields of the layout to be fetched via getxattr.
The ceph.dir.layout.* vxattr with "disappear" if the exists_cb
indicates there no dir layout set.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:26:17 -08:00
Sage Weil
1f08f2b056 ceph: add ceph.dir.layout vxattr
This virtual xattr will only appear when there is a dir layout policy
set on the directory.  It can be set via setxattr and removed via
removexattr (implemented by the MDS).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:26:13 -08:00
Sage Weil
32ab0bd78d ceph: change ceph.file.layout.* implementation, content
Implement a new method to generate the ceph.file.layout vxattr using
the new framework.

Use 'stripe_unit' instead of 'chunk_size'.

Include pool name, either as a string or as an integer.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:26:10 -08:00
Sage Weil
b65917dd27 ceph: fix listxattr handling for vxattrs
Only include vxattrs in the result if they are not hidden and exist
(as determined by the exists_cb callback).

Note that the buffer size we return when 0 is passed in always includes
vxattrs that *might* exist, forming an upper bound.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:26:06 -08:00
Sage Weil
0bee82fb4b ceph: fix getxattr vxattr handling
Change the vxattr handling for getxattr so that vxattrs are checked
prior to any xattr content, and never after.  Enforce vxattr existence
via the exists_cb callback.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:26:03 -08:00
Sage Weil
f36e447296 ceph: add exists_cb to vxattr struct
Allow for a callback to dynamically determine if a vxattr exists for
the given inode.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:25:58 -08:00
Sage Weil
d421acb1ad ceph: pass ceph.* removexattrs through to MDS
If we do not explicitly recognized a vxattr (e.g., as readonly), pass
the request through to the MDS and deal with it there.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:25:55 -08:00
Sage Weil
3adf654ddb ceph: pass unhandled ceph.* setxattrs through to MDS
If we do not specifically understand a setxattr on a ceph.* virtual
xattr, send it through to the MDS.  This allows us to implement new
functionality via the MDS without direct support on the client side.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:25:51 -08:00
Sage Weil
8860147a01 ceph: support hidden vxattrs
Add ability to flag virtual xattrs as hidden, such that you can
getxattr them but they do not appear in listxattr.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:25:47 -08:00
Sage Weil
39b648d9ec ceph: remove 'ceph.layout' virtual xattr
This has been deprecated since v3.3, 114fc474.  Kill it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2013-02-13 18:25:43 -08:00
Eric W. Biederman
bd2bae6a66 ceph: Convert kuids and kgids before printing them.
Before printing kuid and kgids values convert them into
the initial user namespace.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-12 03:19:27 -08:00
Eric W. Biederman
ff3d004662 ceph: Convert struct ceph_mds_request to use kuid_t and kgid_t
Hold the uid and gid for a pending ceph mds request using the types
kuid_t and kgid_t.  When a request message is finally created convert
the kuid_t and kgid_t values into uids and gids in the initial user
namespace.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-12 03:19:26 -08:00
Eric W. Biederman
ab871b903e ceph: Translate inode uid and gid attributes to/from kuids and kgids.
- In fill_inode() transate uids and gids in the initial user namespace
  into kuids and kgids stored in inode->i_uid and inode->i_gid.

- In ceph_setattr() if they have changed convert inode->i_uid and
  inode->i_gid into initial user namespace uids and gids for
  transmission.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-12 03:19:25 -08:00
Eric W. Biederman
05cb11c17e ceph: Translate between uid and gids in cap messages and kuids and kgids
- Make the uid and gid arguments of send_cap_msg() used to compose
  ceph_mds_caps messages of type kuid_t and kgid_t.

- Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg()
  through variables of type kuid_t and kgid_t.

- Modify struct ceph_cap_snap to store uids and gids in types kuid_t
  and kgid_t.  This allows capturing inode->i_uid and inode->i_gid in
  ceph_queue_cap_snap() without loss and pssing them to
  __ceph_flush_snaps() where they are removed from struct
  ceph_cap_snap and passed to send_cap_msg().

- In handle_cap_grant translate uid and gids in the initial user
  namespace stored in struct ceph_mds_cap into kuids and kgids
  before setting inode->i_uid and inode->i_gid.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-12 03:19:24 -08:00
Alex Elder
969e5aa3b0 Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing 2013-01-30 07:54:34 -06:00
Alex Elder
e8afad656c libceph: pass length to ceph_calc_file_object_mapping()
ceph_calc_file_object_mapping() takes (among other things) a "file"
offset and length, and based on the layout, determines the object
number ("bno") backing the affected portion of the file's data and
the offset into that object where the desired range begins.  It also
computes the size that should be used for the request--either the
amount requested or something less if that would exceed the end of
the object.

This patch changes the input length parameter in this function so it
is used only for input.  That is, the argument will be passed by
value rather than by address, so the value provided won't get
updated by the function.

The value would only get updated if the length would surpass the
current object, and in that case the value it got updated to would
be exactly that returned in *oxlen.

Only one of the two callers is affected by this change.  Update
ceph_calc_raw_layout() so it records any updated value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:04 -06:00
Yan, Zheng
390306c38d ceph: check mds_wanted for imported cap
The MDS may have incorrect wanted caps after importing caps. So the
client should check the value mds has and send cap update if necessary.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:38 -06:00
Yan, Zheng
66f58691c5 ceph: allocate cap_release message when receiving cap import
When client wants to release an imported cap, it's possible there
is no reserved cap_release message in corresponding mds session.
so __queue_cap_release causes kernel panic.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:38 -06:00
Yan, Zheng
395c312b9c ceph: allow revoking duplicated caps issued by non-auth MDS
Allow revoking duplicated caps issued by non-auth MDS if these caps
are also issued by auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:38 -06:00
Yan, Zheng
8a92a119b2 ceph: move dirty inode to migrating list when clearing auth caps
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:37 -06:00
Sam Lang
6e8575faa8 ceph: Check for created flag in response from mds
The mds now sends back a created inode if the create request
performed the create.  If the file already existed, no inode is
returned in the reply.  This allows ceph to set the created flag
in atomic_open so that permissions are properly checked in the case
that the file wasn't created by the create call to the mds.

To ensure compability with previous kernels, a feature for sending
back the inode in the create reply was added, so that the mds will
only send back the inode if the client indicates it supports the
feature.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:36 -06:00
Sam Lang
79aec9844d ceph: Check for err on mds request in atomic_open
The error returned by ceph_mdsc_do_request includes errors sending the
request, errors on timeout, or any errors coming from the mds.  If
ceph_mdsc_do_request returns an error, the reply struct will most likely
be bogus.  We need to bail out and propogate the error instead of
overwriting it.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:36 -06:00
Kees Cook
1b6a78a522 fs/ceph: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Sage Weil <sage@inktank.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Sage Weil <sage@inktank.com>
2013-01-11 11:39:04 -08:00
40889e8d9f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few different groups of commits here.  The largest is
  Alex's ongoing work to enable the coming RBD features (cloning,
  striping).  There is some cleanup in libceph that goes along with it.

  Cyril and David have fixed some problems with NFS reexport (leaking
  dentries and page locks), and there is a batch of patches from Yan
  fixing problems with the fs client when running against a clustered
  MDS.  There are a few bug fixes mixed in for good measure, many of
  which will be going to the stable trees once they're upstream.

  My apologies for the late pull.  There is still a gremlin in the rbd
  map/unmap code and I was hoping to include the fix for that as well,
  but we haven't been able to confirm the fix is correct yet; I'll send
  that in a separate pull once it's nailed down."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
  rbd: get rid of rbd_{get,put}_dev()
  libceph: register request before unregister linger
  libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
  libceph: init event->node in ceph_osdc_create_event()
  libceph: init osd->o_node in create_osd()
  libceph: report connection fault with warning
  libceph: socket can close in any connection state
  rbd: don't use ENOTSUPP
  rbd: remove linger unconditionally
  rbd: get rid of RBD_MAX_SEG_NAME_LEN
  libceph: avoid using freed osd in __kick_osd_requests()
  ceph: don't reference req after put
  rbd: do not allow remove of mounted-on image
  libceph: Unlock unprocessed pages in start_read() error path
  ceph: call handle_cap_grant() for cap import message
  ceph: Fix __ceph_do_pending_vmtruncate
  ceph: Don't add dirty inode to dirty list if caps is in migration
  ceph: Fix infinite loop in __wake_requests
  ceph: Don't update i_max_size when handling non-auth cap
  bdi_register: add __printf verification, fix arg mismatch
  ...
2012-12-20 14:00:13 -08:00
Cyril Roelandt
f6af75dac3 ceph: fix dentry reference leak in ceph_encode_fh()
dput() was not called in the error path.

Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18 15:02:11 -08:00
Andrew Morton
965c8e59cf lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.  Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:12 -08:00
David Zafman
8884d53dd6 libceph: Unlock unprocessed pages in start_read() error path
Function start_read() can get an error before processing all pages.
It must not only release the remaining pages, but unlock them too.

This fixes http://tracker.newdream.net/issues/3370

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:09 -06:00
Yan, Zheng
0e5e1774a9 ceph: call handle_cap_grant() for cap import message
If client sends cap message that requests new max size during
exporting caps, the exporting MDS will drop the message quietly.
So the client may wait for the reply that updates the max size
forever. call handle_cap_grant() for cap import message can
avoid this issue.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-13 08:13:08 -06:00
Yan, Zheng
a85f50b6ef ceph: Fix __ceph_do_pending_vmtruncate
we should set i_truncate_pending to 0 after page cache is truncated
to i_truncate_size

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-13 08:13:08 -06:00
Yan, Zheng
0685235ffd ceph: Don't add dirty inode to dirty list if caps is in migration
Add dirty inode to cap_dirty_migrating list instead, this can avoid
ceph_flush_dirty_caps() entering infinite loop.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-13 08:13:08 -06:00
Yan, Zheng
ed75ec2cd1 ceph: Fix infinite loop in __wake_requests
__wake_requests() will enter infinite loop if we use it to wake
requests in the session->s_waiting list. __wake_requests() deletes
requests from the list and __do_request() adds requests back to
the list.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-13 08:13:07 -06:00
Yan, Zheng
5e62ad3015 ceph: Don't update i_max_size when handling non-auth cap
The cap from non-auth mds doesn't have a meaningful max_size value.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-13 08:13:07 -06:00
Joe Perches
d2cc4dde92 bdi_register: add __printf verification, fix arg mismatch
__printf is useful to verify format and arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:07 -06:00
Sage Weil
83aff95eb9 libceph: remove 'osdtimeout' option
This would reset a connection with any OSD that had an outstanding
request that was taking more than N seconds.  The idea was that if the
OSD was buggy, the client could compensate by resending the request.

In reality, this only served to hide server bugs, and we haven't
actually seen such a bug in quite a while.  Moreover, the userspace
client code never did this.

More importantly, often the request is taking a long time because the
OSD is trying to recover, or overloaded, and killing the connection
and retrying would only make the situation worse by giving the OSD
more work to do.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:06 -06:00
Cyril Roelandt
cfc84c9f73 ceph: fix dentry reference leak in ceph_encode_fh().
dput() was not called in the error path.

Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-12-13 08:13:06 -06:00
Sage Weil
22cddde104 ceph: Fix i_size update race
ceph_aio_write() has an optimization that marks cap EPH_CAP_FILE_WR
dirty before data is copied to page cache and inode size is updated.
If ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. The fix is move
ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call
__ceph_mark_dirty_caps() after inode size is updated.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-05 11:07:23 -08:00
Yan, Zheng
4d1d0534f5 ceph: Hold caps_list_lock when adjusting caps_{use, total}_count
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-04 03:08:24 -08:00
35fd3dc58d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes form Sage Weil:
 "There are two fixes in the messenger code, one that can trigger a NULL
  dereference, and one that error in refcounting (extra put).  There is
  also a trivial fix that in the fs client code that is triggered by NFS
  reexport."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix dentry reference leak in encode_fh()
  libceph: avoid NULL kref_put when osd reset races with alloc_msg
  rbd: reset BACKOFF if unable to re-queue
2012-10-29 08:49:25 -07:00
David Zafman
52eb5a900a ceph: fix dentry reference leak in encode_fh()
Call to d_find_alias() needs a corresponding dput()

This fixes http://tracker.newdream.net/issues/3271

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-29 08:17:10 -07:00
David Zafman
b000056a5a ceph: Fix NULL ptr crash in strlen()
set_request_path_attr() checks for NULL ptr before calling strlen()

This fixes http://tracker.newdream.net/issues/3404

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-26 16:35:07 -05:00
David Zafman
0f9831a893 ceph: fix dentry reference leak in encode_fh()
Call to d_find_alias() needs a corresponding dput()

This fixes http://tracker.newdream.net/issues/3271

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-26 16:34:53 -05:00
Hugh Dickins
35c2a7f490 tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
	u64 inum = fid->raw[2];
which is unhelpfully reported as at the end of shmem_alloc_inode():

BUG: unable to handle kernel paging request at ffff880061cd3000
IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Call Trace:
 [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
 [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
 [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
 [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6

Right, tmpfs is being stupid to access fid->raw[2] before validating that
fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
fall at the end of a page, and the next page not be present.

But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
could oops in the same way: add the missing fh_len checks to those.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Sage Weil <sage@inktank.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09 23:33:55 -04:00
Konstantin Khlebnikov
0b173bc4da mm: kill vma flag VM_CAN_NONLINEAR
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:17 +09:00
7035cdf36d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "The bulk of this pull is a series from Alex that refactors and cleans
  up the RBD code to lay the groundwork for supporting the new image
  format and evolving feature set.  There are also some cleanups in
  libceph, and for ceph there's fixed validation of file striping
  layouts and a bugfix in the code handling a shrinking MDS cluster."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
  ceph: avoid 32-bit page index overflow
  ceph: return EIO on invalid layout on GET_DATALOC ioctl
  rbd: BUG on invalid layout
  ceph: propagate layout error on osd request creation
  libceph: check for invalid mapping
  ceph: convert to use le32_add_cpu()
  ceph: Fix oops when handling mdsmap that decreases max_mds
  rbd: update remaining header fields for v2
  rbd: get snapshot name for a v2 image
  rbd: get the snapshot context for a v2 image
  rbd: get image features for a v2 image
  rbd: get the object prefix for a v2 rbd image
  rbd: add code to get the size of a v2 rbd image
  rbd: lay out header probe infrastructure
  rbd: encapsulate code that gets snapshot info
  rbd: add an rbd features field
  rbd: don't use index in __rbd_add_snap_dev()
  rbd: kill create_snap sysfs entry
  rbd: define rbd_dev_image_id()
  rbd: define some new format constants
  ...
2012-10-08 06:38:18 +09:00
Alex Elder
6285bc2312 ceph: avoid 32-bit page index overflow
A pgoff_t is defined (by default) to have type (unsigned long).  On
architectures such as i686 that's a 32-bit type.  The ceph address
space code was attempting to produce 64 bit offsets by shifting a
page's index by PAGE_CACHE_SHIFT, but the result was not what was
desired because the shift occurred before the result got promoted
to 64 bits.

Fix this by converting all uses of page->index used in this way to
use the page_offset() macro, which ensures the 64-bit result has the
intended value.

This fixes http://tracker.newdream.net/issues/3112

Reported-by:  Mohamed Pakkeer <pakkeer.mohideen@realimage.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-10-03 10:51:18 -05:00
Sage Weil
457712a0bc ceph: return EIO on invalid layout on GET_DATALOC ioctl
If the user calls GET_DATALOC on a file with an invalid (e.g.,
zeroed) layout, return EIO to userland.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-03 10:51:17 -05:00
Kirill A. Shutemov
8c0a853770 fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super().  We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths.  E.g.  on my machine exit_group() of a last process in IPC
namespace takes 0.07538s.  rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Sage Weil
6816282dab ceph: propagate layout error on osd request creation
If we are creating an osd request and get an invalid layout, return
an EINVAL to the caller.  We switch up the return to have an error
code instead of NULL implying -ENOMEM.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-01 17:20:00 -05:00
Wei Yongjun
b905a7f8b7 ceph: convert to use le32_add_cpu()
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:54 -05:00
Yan, Zheng
3e8f43a089 ceph: Fix oops when handling mdsmap that decreases max_mds
When i >= newmap->m_max_mds, ceph_mdsmap_get_addr(newmap, i) return
NULL. Passing NULL to memcmp() triggers oops.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 14:30:54 -05:00
Alex Elder
c98f533c94 ceph: let path portion of mount "device" be optional
A recent change to /sbin/mountall causes any trailing '/' character
in the "device" (or fs_spec) field in /etc/fstab to be stripped.  As
a result, an entry for a ceph mount that intends to mount the root
of the name space ends up with now path portion, and the ceph mount
option processing code rejects this.

That is, an entry in /etc/fstab like:
    cephserver:port:/ /mnt ceph defaults 0 0
provides to the ceph code just "cephserver:port:" as the "device,"
and that gets rejected.

Although this is a bug in /sbin/mountall, we can have the ceph mount
code support an empty/nonexistent path, interpreting it to mean the
root of the name space.

RFC 5952 offers recommendations for how to express IPv6 addresses,
and recommends the usage found in RFC 3986 (which specifies the
format for URI's) for representing both IPv4 and IPv6 addresses that
include port numbers.  (See in particular the definition of
"authority" found in the Appendix of RFC 3986.)

According to those standards, no host specification will ever
contain a '/' character.  As a result, it is sufficient to scan a
provided "device" from an /etc/fstab entry for the first '/'
character, and if it's found, treat that as the beginning of the
path.  If no '/' character is present, we can treat the entire
string as the monitor host specification(s), and assume the path
to be the root of the name space.  We'll still require a ':' to
separate the host portion from the (possibly empty) path portion.

This means that we can more formally define how ceph will interpret
the "device" it's provided when processing a mount request:

    "device" will look like:
        <server_spec>[,<server_spec>...]:[<path>]
    where
        <server_spec> is <ip>[:<port>]
        <path> is optional, but if present must begin with '/'

This addresses http://tracker.newdream.net/issues/2919

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-01 14:30:49 -05:00
Al Viro
2744c171db ceph: don't abuse d_delete() on failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:20 -04:00
Sage Weil
45f2e081f5 ceph: avoid divide by zero in __validate_layout()
If "l->stripe_unit" is zero the the mod on the next line will cause a
divide by zero bug.  This comes from the copy_from_user() in
ceph_ioctl_set_layout_policy().  Passing 0 is valid, though (it means
"do not change") so avoid the % check in that case.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-21 15:55:28 -07:00
Sage Weil
6c5e50fa61 ceph: tolerate (and warn on) extraneous dentry from mds
If the MDS gives us a dentry and we weren't prepared to handle it,
WARN_ON_ONCE instead of crashing.

Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-21 15:55:25 -07:00
Sage Weil
d1c338a509 libceph: delay debugfs initialization until we learn global_id
The debugfs directory includes the cluster fsid and our unique global_id.
We need to delay the initialization of the debug entry until we have
learned both the fsid and our global_id from the monitor or else the
second client can't create its debugfs entry and will fail (and multiple
client instances aren't properly reflected in debugfs).

Reported by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-08-20 10:03:15 -07:00
Sage Weil
5ef50c3bec ceph: simplify+fix atomic_open
The initial ->atomic_open op was carried over from the old intent code,
which was incomplete and didn't really work.  Replace it with a fresh
method.  In particular:

 * always attempt to do an atomic open+lookup, both for the create case
   and for lookups of existing files.
 * fix symlink handling by returning 1 to the VFS so that we can follow
   the link to its destination. This fixes a longstanding ceph bug (#2392).

Signed-off-by: Sage Weil <sage@inktank.com>
2012-08-02 09:11:19 -07:00
a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
cc8362b1f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...
2012-07-31 14:35:28 -07:00
Alex Elder
aa711ee340 ceph: define snap counts as u32 everywhere
There are two structures in which a count of snapshots are
maintained:

    struct ceph_snap_context {
	...
        u32 num_snaps;
	...
    }
and
    struct ceph_snap_realm {
	...
        u32 num_prior_parent_snaps;   /*  had prior to parent_since */
	...
        u32 num_snaps;
	...
    }

These fields never take on negative values (e.g., to hold special
meaning), and so are really inherently unsigned.  Furthermore they
take their value from over-the-wire or on-disk formatted 32-bit
values.

So change their definition to have type u32, and change some spots
elsewhere in the code to account for this change.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:47 -07:00
Alan Cox
21ec6ffa46 ceph: fix potential double free
We re-run the loop but we don't re-set the attrs pointer back to NULL.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:35 -07:00
Sage Weil
a53aab645c ceph: close old con before reopening on mds reconnect
When we detect a mds session reset, close the old ceph_connection before
reopening it.  This ensures we clean up the old socket properly and keep
the ceph_connection state correct.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 18:15:32 -07:00
Sage Weil
1fe60e51a3 libceph: move feature bits to separate header
This is simply cleanup that will keep things more closely synced with the
userland code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-30 16:23:22 -07:00
Jan Kara
3ca9c3bd8a ceph: Push file_update_time() into ceph_page_mkwrite()
CC: Sage Weil <sage@newdream.net>
CC: ceph-devel@vger.kernel.org
Acked-by: Sage Weil <sage@newdream.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:45 +04:00
Sage Weil
8842b3be96 ceph: clean up useless d_parent checks
d_parent is never NULL, and IS_ROOT() is the proper way to check for a
(non-self-referential) parent.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:54 -07:00
David Howells
9249e17fe0 VFS: Pass mount flags to sget()
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:34 +04:00
Al Viro
ebfc3b49a7 don't pass nameidata to ->create()
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:47 +04:00
Al Viro
00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00
Al Viro
0b728e1911 stop passing nameidata * to ->d_revalidate()
Just the lookup flags.  Die, bastard, die...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:14 +04:00
Al Viro
e45198a6ac make finish_no_open() return int
namely, 1 ;-)  That's what we want to return from ->atomic_open()
instances after finish_no_open().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:33:45 +04:00
Al Viro
30d9049474 kill struct opendata
Just pass struct file *.  Methods are happier that way...
There's no need to return struct file * from finish_open() now,
so let it return int.  Next: saner prototypes for parts in
namei.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:33:39 +04:00
Al Viro
d95852777b make ->atomic_open() return int
Change of calling conventions:
old		new
NULL		1
file		0
ERR_PTR(-ve)	-ve

Caller *knows* that struct file *; no need to return it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:33:35 +04:00
Al Viro
47237687d7 ->atomic_open() prototype change - pass int * instead of bool *
... and let finish_open() report having opened the file via that sucker.
Next step: don't modify od->filp at all.

[AV: FILE_CREATE was already used by cifs; Miklos' fix folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:33:31 +04:00
Miklos Szeredi
2d83bde9a1 ceph: implement i_op->atomic_open()
Add an ->atomic_open implementation which replaces the atomic lookup+open+create
operation implemented via ->lookup and ->create operations.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:33:16 +04:00
Miklos Szeredi
3819219b59 ceph: remove unused arg from ceph_lookup_open()
What was the purpose of this?

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:33:16 +04:00
Sage Weil
b7a9e5dd40 libceph: set peer name on con_open, not init
The peer name may change on each open attempt, even when the connection is
reused.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-05 21:14:35 -07:00
Yan, Zheng
61600ef848 ceph: check PG_Private flag before accessing page->private
I got lots of NULL pointer dereference Oops when compiling kernel on ceph.
The bug is because the kernel page migration routine replaces some pages
in the page cache with new pages, these new pages' private can be non-zero.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 28c0254ede)
2012-06-20 07:43:48 -05:00
Sage Weil
9a64e8e0ac Linux 3.5-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK
 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH
 dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m
 Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO
 QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u
 vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4=
 =CcWn
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc1'

Linux 3.5-rc1

Conflicts:
	net/ceph/messenger.c
2012-06-15 12:32:04 -07:00
Alex Elder
1bfd89f4e6 libceph: fully initialize connection in con_init()
Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init().  Rearrange the arguments so the connection pointer
is first.  Hide the byte-swapping of the peer entity number inside
ceph_con_init()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-06 09:23:54 -05:00
1193755ac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs changes from Al Viro.
 "A lot of misc stuff.  The obvious groups:
   * Miklos' atomic_open series; kills the damn abuse of
     ->d_revalidate() by NFS, which was the major stumbling block for
     all work in that area.
   * ripping security_file_mmap() and dealing with deadlocks in the
     area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
     general.
   * ->encode_fh() switched to saner API; insane fake dentry in
     mm/cleancache.c gone.
   * assorted annotations in fs (endianness, __user)
   * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
   * ->update_time() work from Josef.
   * other bits and pieces all over the place.

  Normally it would've been in two or three pull requests, but
  signal.git stuff had eaten a lot of time during this cycle ;-/"

Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
  nfs: don't open in ->d_revalidate
  vfs: retry last component if opening stale dentry
  vfs: nameidata_to_filp(): don't throw away file on error
  vfs: nameidata_to_filp(): inline __dentry_open()
  vfs: do_dentry_open(): don't put filp
  vfs: split __dentry_open()
  vfs: do_last() common post lookup
  vfs: do_last(): add audit_inode before open
  vfs: do_last(): only return EISDIR for O_CREAT
  vfs: do_last(): check LOOKUP_DIRECTORY
  vfs: do_last(): make ENOENT exit RCU safe
  vfs: make follow_link check RCU safe
  vfs: do_last(): use inode variable
  vfs: do_last(): inline walk_component()
  vfs: do_last(): make exit RCU safe
  vfs: split do_lookup()
  Btrfs: move over to use ->update_time
  fs: introduce inode operation ->update_time
  reiserfs: get rid of resierfs_sync_super
  reiserfs: mark the superblock as dirty a bit later
  ...
2012-06-01 10:34:35 -07:00
Alex Elder
15d9882c33 libceph: embed ceph messenger structure in ceph_client
A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-01 08:37:56 -05:00
Yan, Zheng
28c0254ede ceph: check PG_Private flag before accessing page->private
I got lots of NULL pointer dereference Oops when compiling kernel on ceph.
The bug is because the kernel page migration routine replaces some pages
in the page cache with new pages, these new pages' private can be non-zero.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-06-01 08:06:29 -05:00
08615d7d85 Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:

 - the "misc" tree - stuff from all over the map

 - checkpatch updates

 - fatfs

 - kmod changes

 - procfs

 - cpumask

 - UML

 - kexec

 - mqueue

 - rapidio

 - pidns

 - some checkpoint-restore feature work.  Reluctantly.  Most of it
   delayed a release.  I'm still rather worried that we don't have a
   clear roadmap to completion for this work.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches)
  kconfig: update compression algorithm info
  c/r: prctl: add ability to set new mm_struct::exe_file
  c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
  c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
  syscalls, x86: add __NR_kcmp syscall
  fs, proc: introduce /proc/<pid>/task/<tid>/children entry
  sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
  aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
  eventfd: change int to __u64 in eventfd_signal()
  fs/nls: add Apple NLS
  pidns: make killed children autoreap
  pidns: use task_active_pid_ns in do_notify_parent
  rapidio/tsi721: add DMA engine support
  rapidio: add DMA engine support for RIO data transfers
  ipc/mqueue: add rbtree node caching support
  tools/selftests: add mq_perf_tests
  ipc/mqueue: strengthen checks on mqueue creation
  ipc/mqueue: correct mq_attr_ok test
  ipc/mqueue: improve performance of send/recv
  selftests: add mq_open_tests
  ...
2012-05-31 18:10:18 -07:00
Xi Wang
a3860c1c5d introduce SIZE_MAX
ULONG_MAX is often used to check for integer overflow when calculating
allocation size.  While ULONG_MAX happens to work on most systems, there
is no guarantee that `size_t' must be the same size as `long'.

This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
portability and readability for allocation size validation.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Acked-by: Alex Elder <elder@dreamhost.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31 17:49:26 -07:00
af56e0aa35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "There are some updates and cleanups to the CRUSH placement code, a bug
  fix with incremental maps, several cleanups and fixes from Josh Durgin
  in the RBD block device code, a series of cleanups and bug fixes from
  Alex Elder in the messenger code, and some miscellaneous bounds
  checking and gfp cleanups/fixes."

Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
networking people preferring "unsigned int" over just "unsigned".

* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
  libceph: fix pg_temp updates
  libceph: avoid unregistering osd request when not registered
  ceph: add auth buf in prepare_write_connect()
  ceph: rename prepare_connect_authorizer()
  ceph: return pointer from prepare_connect_authorizer()
  ceph: use info returned by get_authorizer
  ceph: have get_authorizer methods return pointers
  ceph: ensure auth ops are defined before use
  ceph: messenger: reduce args to create_authorizer
  ceph: define ceph_auth_handshake type
  ceph: messenger: check return from get_authorizer
  ceph: messenger: rework prepare_connect_authorizer()
  ceph: messenger: check prepare_write_connect() result
  ceph: don't set WRITE_PENDING too early
  ceph: drop msgr argument from prepare_write_connect()
  ceph: messenger: send banner in process_connect()
  ceph: messenger: reset connection kvec caller
  libceph: don't reset kvec in prepare_write_banner()
  ceph: ignore preferred_osd field
  ceph: fully initialize new layout
  ...
2012-05-30 11:17:19 -07:00
Sage Weil
c862868bb4 ceph: move encode_fh to new API
Use parent_inode has a flag for whether nfsd wants a connectable fh, but
generate one opportunistically so that we can take advantage of the
additional info in there.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-29 23:28:33 -04:00
Al Viro
b0b0382bb4 ->encode_fh() API change
pass inode + parent's inode or NULL instead of dentry + bool saying
whether we want the parent or not.

NOTE: that needs ceph fix folded in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-29 23:28:33 -04:00
Alex Elder
8f43fb5389 ceph: use info returned by get_authorizer
Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
a3530df33e ceph: have get_authorizer methods return pointers
Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value.  This is to pave the way for making use of the
returned value in an upcoming patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
a255651d4c ceph: ensure auth ops are defined before use
In the create_authorizer method for both the mds and osd clients,
the auth_client->ops pointer is blindly dereferenced.  There is no
obvious guarantee that this pointer has been assigned.  And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.

Add checks in both routines to make sure they are defined (non-null)
before use.  Add similar checks in a few other spots in these files
while we're at it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:13 -05:00
Alex Elder
74f1869f76 ceph: messenger: reduce args to create_authorizer
Make use of the new ceph_auth_handshake structure in order to reduce
the number of arguments passed to the create_authorizor method in
ceph_auth_client_ops.  Use a local variable of that type as a
shorthand in the get_authorizer method definitions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder
6c4a19158b ceph: define ceph_auth_handshake type
The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers."  Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.

Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Sage Weil
c047be0934 ceph: ignore preferred_osd field
Old users may not expect EINVAL, and there is no clear user-visibile
behavior change now that we ignore it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-05-16 14:28:28 -05:00
Sage Weil
702aeb1f88 ceph: fully initialize new layout
When we are setting a new layout, fully initialize the structure:
 - zero it out
 - always set preferred_osd to -1

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-05-16 14:28:27 -05:00
Sage Weil
e49bf4c51c ceph: refactor SETLAYOUT and SETDIRLAYOUT ioctl checks into common helper
Both of these methods perform similar checks; move that code to a helper
so that we can ensure the checks are consistent.

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:34:35 -07:00
Sage Weil
3469ac1aa3 ceph: drop support for preferred_osd pgs
This was an ill-conceived feature that has been removed from Ceph.  Do
this gracefully:

 - reject attempts to specify a preferred_osd via the ioctl
 - stop exposing this information via virtual xattrs
 - always fill in -1 for requests, in case we talk to an older server
 - don't calculate preferred_osd placements/pgids

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:33:36 -07:00
56b59b429b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates for 3.4-rc1 from Sage Weil:
 "Alex has been busy.  There are a range of rbd and libceph cleanups,
  especially surrounding device setup and teardown, and a few critical
  fixes in that code.  There are more cleanups in the messenger code,
  virtual xattrs, a fix for CRC calculation/checks, and lots of other
  miscellaneous stuff.

  There's a patch from Amon Ott to make inos behave a bit better on
  32-bit boxes, some decode check fixes from Xi Wang, and network
  throttling fix from Jim Schutt, and a couple RBD fixes from Josh
  Durgin.

  No new functionality, just a lot of cleanup and bug fixing."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
  rbd: move snap_rwsem to the device, rename to header_rwsem
  ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
  libceph: isolate kmap() call in write_partial_msg_pages()
  libceph: rename "page_shift" variable to something sensible
  libceph: get rid of zero_page_address
  libceph: only call kernel_sendpage() via helper
  libceph: use kernel_sendpage() for sending zeroes
  libceph: fix inverted crc option logic
  libceph: some simple changes
  libceph: small refactor in write_partial_kvec()
  libceph: do crc calculations outside loop
  libceph: separate CRC calculation from byte swapping
  libceph: use "do" in CRC-related Boolean variables
  ceph: ensure Boolean options support both senses
  libceph: a few small changes
  libceph: make ceph_tcp_connect() return int
  libceph: encapsulate some messenger cleanup code
  libceph: make ceph_msgr_wq private
  libceph: encapsulate connection kvec operations
  libceph: move prepare_write_banner()
  ...
2012-03-28 10:01:29 -07:00
Alex Elder
3489b42a72 ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
In ceph_vxattrcb_file_layout(), there is a check to determine
whether a preferred PG should be formatted into the output buffer.
That check assumes that a preferred PG number of 0 indicates "no
preference," but that is wrong.  No preference is indicated by a
negative (specifically, -1) PG number.

In addition, if that condition yields true, the preferred value
is formatted into a sized buffer, but the size consumed by the
earlier snprintf() call is not accounted for, opening up the
possibilty of a buffer overrun.

Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
the time displayed did not include leading 0's, which led to
erroneous (sub-second portion of) time values being shown.

This fixes these three issues:
    http://tracker.newdream.net/issues/2155
    http://tracker.newdream.net/issues/2156
    http://tracker.newdream.net/issues/2157

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:52 -05:00
Alex Elder
cffaba15cd ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature.  For all those that don't offer this, add
a new option so that they do.

Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder
18fa8b3fea ceph: make ceph_setxattr() and ceph_removexattr() more alike
This patch just rearranges a few bits of code to make more
portions of ceph_setxattr() and ceph_removexattr() identical.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
3ce6cd1233 ceph: avoid repeatedly computing the size of constant vxattr names
All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time.  So there's no need to compute their length every
time any file's attribute is listed.

Record the length of each string and use it when needed to determine
the space need to represent them.  In addition, compute the
aggregate size of strings in each table just once at initialization
time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
aa4066ed7b ceph: encode type in vxattr callback routines
The names of the callback functions used for virtual extended
attributes are based only on the last component of the attribute
name.  Because of the way these are defined, this precludes allowing
a single (lowest) attribute name for different callbacks, dependent
on the type of file being operated on.  (For example, it might be
nice to support both "ceph.dir.layout" and "ceph.file.layout".)

Just change the callback names to avoid this problem.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
881a5fa200 ceph: drop "_cb" from name of struct ceph_vxattr_cb
A struct ceph_vxattr_cb does not represent a callback at all, but
rather a virtual extended attribute itself.  Drop the "_cb" suffix
from its name to reflect that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
eb78808446 ceph: use macros to normalize vxattr table definitions
Entries in the ceph virtual extended attribute tables all follow a
distinct pattern in their definition.  Enforce this pattern through
the use of a macro.

Also, a null name field signals the end of the table, so make that
be the first field in the ceph_vxattr_cb structure.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
2289190719 ceph: use a symbolic name for "ceph." extended attribute namespace
Use symbolic constants to define the top-level prefix for "ceph."
extended attribute names.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
06476a69d8 ceph: pass inode rather than table to ceph_match_vxattr()
All callers of ceph_match_vxattr() determine what to pass as the
first argument by calling ceph_inode_vxattrs(inode).  Just do that
inside ceph_match_vxattr() itself, changing it to take an inode
rather than the vxattr pointer as its first argument.

Also ensure the function works correctly for an empty table (i.e.,
containing only a terminating null entry).

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
b829c1954d ceph: don't null-terminate xattr values
For some reason, ceph_setxattr() allocates an extra byte in which a
'\0' is stored past the end of an extended attribute value.  This is
not needed, and is potentially misleading, so get rid of it.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Xi Wang
80834312a4 ceph: fix overflow check in build_snap_context()
The overflow check for a + n * b should be (n > (ULONG_MAX - a) / b),
rather than (n > ULONG_MAX / b - a).

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Xi Wang
810339ec2f ceph: avoid panic with mismatched symlink sizes in fill_inode()
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size
do not match.

Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:45 -05:00
Amon Ott
a661fc5611 ceph: use 2 instead of 1 as fallback for 32-bit inode number
The root directory of the Ceph mount has inode number 1, so falling back
to 1 always creates a collision. 2 is unused on my test systems and seems
less likely to collide.

Signed-off-by: Amon Ott <ao@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
1ce208a6ce ceph: don't reset s_cap_ttl to zero
Avoid the need to check for a special zero s_cap_ttl value by just
using (jiffies - 1) as the value assigned to indicate "sometime in
the past."

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Al Viro
48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
6c073a7ee2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix safety of rbd_put_client()
  rbd: fix a memory leak in rbd_get_client()
  ceph: create a new session lock to avoid lock inversion
  ceph: fix length validation in parse_reply_info()
  ceph: initialize client debugfs outside of monc->mutex
  ceph: change "ceph.layout" xattr to be "ceph.file.layout"
2012-02-02 15:47:33 -08:00
Alex Elder
d8fb02abdc ceph: create a new session lock to avoid lock inversion
Lockdep was reporting a possible circular lock dependency in
dentry_lease_is_valid().  That function needs to sample the
session's s_cap_gen and and s_cap_ttl fields coherently, but needs
to do so while holding a dentry lock.  The s_cap_lock field was
being used to protect the two fields, but that can't be taken while
holding a lock on a dentry within the session.

In most cases, the s_cap_gen and s_cap_ttl fields only get operated
on separately.  But in three cases they need to be updated together.
Implement a new lock to protect the spots updating both fields
atomically is required.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:19 -08:00
Xi Wang
32852a81bc ceph: fix length validation in parse_reply_info()
"len" is read from network and thus needs validation.  Otherwise, given
a bogus "len" value, p+len could be an out-of-bounds pointer, which is
used in further parsing.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:11 -08:00
Alex Elder
114fc47492 ceph: change "ceph.layout" xattr to be "ceph.file.layout"
The virtual extended attribute named "ceph.layout" is meaningful
only for regular files.  Change its name to be "ceph.file.layout" to
more directly reflect that in the ceph xattr namespace.  Preserve
the old "ceph.layout" name for the time being (until we decide it's
safe to get rid of it entirely).

Add a missing initializer for "readonly" in the terminating entry.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-02-02 12:48:52 -08:00
1a52bb0b68 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: ensure prealloc_blob is in place when removing xattr
  rbd: initialize snap_rwsem in rbd_add()
  ceph: enable/disable dentry complete flags via mount option
  vfs: export symbol d_find_any_alias()
  ceph: always initialize the dentry in open_root_dentry()
  libceph: remove useless return value for osd_client __send_request()
  ceph: avoid iput() while holding spinlock in ceph_dir_fsync
  ceph: avoid useless dget/dput in encode_fh
  ceph: dereference pointer after checking for NULL
  crush: fix force for non-root TAKE
  ceph: remove unnecessary d_fsdata conditional checks
  ceph: Use kmemdup rather than duplicating its implementation

Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)
2012-01-13 10:29:21 -08:00
Alex Elder
83eb26af0d ceph: ensure prealloc_blob is in place when removing xattr
In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes
are marked dirty, all attributes recorded in its rb_tree index are
formatted into a "blob" buffer.  The target buffer is recorded in
ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and
be of sufficient size to hold the attributes.

The extended attributes are marked dirty in two cases: when a new
attribute is added to the inode; or when one is removed.  In the
former case work is done to ensure the prealloc_blob buffer is
properly set up, but in the latter it is not.

Change the logic in ceph_removexattr() so it matches what is
done in ceph_setxattr().  Note that this is done in a way that
keeps the two blocks of code nearly identical, in anticipation
of a subsequent patch that encapsulates some of this logic into
one or more helper routines.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12 11:00:51 -08:00
Sage Weil
a40dc6cc2e ceph: enable/disable dentry complete flags via mount option
Enable/disable use of the dentry dir 'complete' flag via a mount option.
This lets the admin control whether ceph uses the dcache to satisfy
negative lookups or readdir when it has the entire directory contents in
its cache.

This is purely a performance optimization; correctness is guaranteed
whether it is enabled or not.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12 11:00:40 -08:00
Alex Elder
d46cfba536 ceph: always initialize the dentry in open_root_dentry()
When open_root_dentry() gets a dentry via d_obtain_alias() it does
not get initialized.  If the dentry obtained came from the cache,
this is OK.  But if not, the result is an improperly initialized
dentry.

To fix this, call ceph_init_dentry() regardless of which path
produced the dentry.  That function returns immediately for a dentry
that is already initialized, it is safe to use either way.

(Credit to Sage, who suggested this fix.)

Signed-off-by: Alex Elder <aelder@sgi.com>
2012-01-11 16:28:25 -08:00
Sage Weil
2ff179e650 ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph_mdsc_put_request() can call iput(), which can sleep.  Don't do that.

Fixes: #1812
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:57:02 -08:00
Sage Weil
ee6b1baf67 ceph: avoid useless dget/dput in encode_fh
Nothing we do here sleeps, so just do it under d_lock and avoid the dget/
dput entirely.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:57:00 -08:00
Yehuda Sadeh
b8cd952b51 ceph: dereference pointer after checking for NULL
moved dereference after BUG_ON

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-01-10 08:56:59 -08:00
Sage Weil
3d8eb7a94e ceph: remove unnecessary d_fsdata conditional checks
We now set d_fsdata unconditionally on all dentries prior to setting up
the d_ops, so all of these checks are unnecessary.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:56:56 -08:00
Al Viro
3c5184ef12 ceph: d_alloc_root() may fail
... and ceph_init_dentry(NULL) will oops

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09 16:36:12 -05:00
Al Viro
34c80b1d93 vfs: switch ->show_options() to struct dentry *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:19:54 -05:00
Al Viro
5706b27dea ceph: propagate umode_t
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:55:16 -05:00
Al Viro
dba19c6064 get rid of open-coded S_ISREG(), etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:55:12 -05:00
Al Viro
1a67aafb5f switch ->mknod() to umode_t
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:54 -05:00
Al Viro
4acdaf27eb switch ->create() to umode_t
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:53 -05:00
Al Viro
18bb1db3e7 switch vfs_mkdir() and ->mkdir() to umode_t
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:53 -05:00
Al Viro
6b520e0565 vfs: fix the stupidity with i_dentry in inode destructors
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Sage Weil
a4d46363ce ceph: disable use of dcache for readdir etc.
Ceph attempts to use the dcache to satisfy negative lookups and readdir
when the entire directory contents are in cache.  Disable this behavior
until lingering bugs in this code are shaken out; we'll re-enable these
hooks once things are fully stable.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-29 08:05:14 -08:00
Yehuda Sadeh
9d5a09e659 ceph: add missing spin_unlock at ceph_mdsc_build_path()
one of the paths was missing spin_unlock

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-12-13 11:59:53 -08:00
Sage Weil
6a82c47aa8 ceph: fix SEEK_CUR, SEEK_SET regression
Commit 06222e491e got the if wrong so that
it always evaluates as true.  This is semantically harmless, but makes
SEEK_CUR and SEEK_SET needlessly query the server.

Rewrite the if to explicitly enumerate the cases we DO need a valid i_size
to make this code less fragile.

Reported-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-13 09:19:26 -08:00
Sage Weil
be655596b3 ceph: use i_ceph_lock instead of i_lock
We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction.  That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.

Changing the list lock ordering would be a painful process.

However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-07 10:46:44 -08:00
Sage Weil
2151937d7c ceph: fix rasize reporting by ceph_show_options
Fix typo.

Reported-by: mowang da <whooya.xxl@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-02 09:27:54 -08:00
Sage Weil
774ac21da7 ceph: initialize root dentry
Set up d_fsdata on the root dentry.  This fixes a NULL pointer dereference
in ceph_d_prune on umount.  It also means we can eventually strip out all
of the conditional checks on d_fsdata because it is now set unconditionally
(prior to setting up the d_ops).

Fix the ceph_d_prune debug print while we're here.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-11 09:50:17 -08:00
Sage Weil
15a2015fbc ceph: fix iput race when queueing inode work
If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.

That is, instead of

	if (queue_work(...))
		ihold();

we need to do

	ihold();
	if (!queue_work(...))
		iput();

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 22:06:31 -07:00
H Hartley Sweeten
0c6d4b4e22 ceph/super.c: quiet sparse noise
Quiet the sparse noise:

warning: symbol 'create_fs_client' was not declared. Should it be static?
warning: symbol 'destroy_fs_client' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:12 -07:00
H Hartley Sweeten
7fd7d101ff ceph/mds_client.c: quiet sparse noise
Quiet the following sparse noise:

warning: symbol 'get_nonsnap_parent' was not declared. Should it be static?
warning: symbol 'done_closing_sessions' was not declared. Should it be static?

Local functions don't need external visability. Make them static.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:11 -07:00
Sage Weil
c6ffe10015 ceph: use new D_COMPLETE dentry flag
We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy.  Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-05 21:10:10 -07:00
Sage Weil
b58dc4100b ceph: clear parent D_COMPLETE flag when on dentry prune
When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag
on the parent dentry.  Do this for the live and snapshotted namespaces. Do
not bother for the .snap dir contents, since we do not cache that.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-03 09:23:49 -07:00
Miklos Szeredi
bfe8684869 filesystems: add set_nlink()
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Sage Weil
3395734067 libceph: fix double-free of page vector
ceph_release_page_vector() kfrees the vector; we shouldn't do it here too.

Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:17 -07:00
Amon Ott
3310f7541f ceph: fix 32-bit ino numbers
Fix 32-bit ino generation to not always be 1.

Signed-off-by: Amon Ott <a.ott@m-privacy.de>
2011-10-25 16:10:17 -07:00
Greg Farnum
a35eca958a ceph: let the set_layout ioctl set single traits
Previously we were validating the passed-in stripe unit, object size,
and stripe count against each other (and not testing most other stuff).
Instead, make sure that the composed previous layout and new values are valid,
and only send the new values to the MDS. This lets users change the
pool without setting the whole layout, for instance.

Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
2011-10-25 16:10:16 -07:00
Sage Weil
83eaea22bd Revert "ceph: don't truncate dirty pages in invalidate work thread"
This reverts commit c9af9fb68e.

We need to block and truncate all pages in order to reliably invalidate
them.  Otherwise, we could:

 - have some uptodate pages in the cache
 - queue an invalidate
 - write(2) locks some pages
 - invalidate_work skips them
 - write(2) only overwrites part of the page
 - page now dirty and uptodate
 -> partial leakage of invalidated data

It's not entirely clear why we started skipping locked pages in the first
place.  I just ran this through fsx and didn't see any problems.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Noah Watkins
80db8bea6a ceph: replace leading spaces with tabs
Trivial formatting fix.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Sage Weil
b61c27636f libceph: don't complain on msgpool alloc failures
The pool allocation failures are masked by the pool; there is no need to
spam the console about them.  (That's the whole point of having the pool
in the first place.)

Mark msg allocations whose failure is safely handled as such.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
6ab00d465a libceph: create messenger with client
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
6a8ea4706a ceph: document ioctls
...after some prodding by Christoph.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
0d66a487c1 ceph: implement (optional) max read size
The 'rsize' mount option limits the maximum size of an individual
read(ahead) operation that is sent off to an OSD.  This is distinct from
'rasize', which controls the size of the readahead window.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
83817e35cb ceph: rename rsize -> rasize
It controls readahead.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
7c272194e6 ceph: make readpages fully async
When we get a ->readpages() aop, submit async reads for all page ranges
in the provided page list.  Lock the pages immediately, so that VFS/MM
will block until the reads complete.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:14 -07:00
0d20fbbe82 Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client
* 'for-linus' of git://ceph.newdream.net/git/ceph-client:
  libceph: fix leak of osd structs during shutdown
  ceph: fix memory leak
  ceph: fix encoding of ino only (not relative) paths
  libceph: fix msgpool
2011-09-09 15:48:34 -07:00
Noah Watkins
259a187ade ceph: fix memory leak
kfree does not clean up indirect allocations in
ceph_fs_client and ceph_options (e.g. snapdir_name).

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-22 13:06:59 -07:00
Sage Weil
795858dbd2 ceph: fix encoding of ino only (not relative) paths
A 'path' consists of a starting ino and relative component.  Encode even
when there is no relative component.  This is primarily needed by the
NFS reexport code.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-15 13:03:56 -07:00
ba5b56cb3e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: document unlocked d_parent accesses
  ceph: explicitly reference rename old_dentry parent dir in request
  ceph: document locking for ceph_set_dentry_offset
  ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
  ceph: protect d_parent access in ceph_d_revalidate
  ceph: protect access to d_parent
  ceph: handle racing calls to ceph_init_dentry
  ceph: set dir complete frag after adding capability
  rbd: set blk_queue request sizes to object size
  ceph: set up readahead size when rsize is not passed
  rbd: cancel watch request when releasing the device
  ceph: ignore lease mask
  ceph: fix ceph_lookup_open intent usage
  ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
  ceph: fix bad parent_inode calc in ceph_lookup_open
  ceph: avoid carrying Fw cap during write into page cache
  libceph: don't time out osd requests that haven't been received
  ceph: report f_bfree based on kb_avail rather than diffing.
  ceph: only queue capsnap if caps are dirty
  ceph: fix snap writeback when racing with writes
  ...
2011-07-26 13:38:50 -07:00
Sage Weil
d79698da32 ceph: document unlocked d_parent accesses
For the most part we don't care about racing with rename when directing
MDS requests; either the old or new parent is fine.  Document that, and
do some minor cleanup.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:31:26 -07:00
Sage Weil
41b02e1f9b ceph: explicitly reference rename old_dentry parent dir in request
We carry a pin on the parent directory for the rename source and dest
dentries.  For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:31:14 -07:00
Sage Weil
4f17726452 ceph: document locking for ceph_set_dentry_offset
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:31:08 -07:00
Sage Weil
e5f86dc377 ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
Have caller pass in a safely-obtained reference to the parent directory
for calculating a dentry's hash valud.

While we're here, simpify the flow through ceph_encode_fh() so that there
is a single exit point and cleanup.

Also fix a bug with the dentry hash calculation: calculate the hash for the
dentry we were given, not its parent.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:30:55 -07:00
Sage Weil
bf1c6aca96 ceph: protect d_parent access in ceph_d_revalidate
Protect d_parent with d_lock.  Carry a reference.  Simplify the flow so
that there is a single exit point and cleanup.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:30:43 -07:00
Sage Weil
5f21c96dd5 ceph: protect access to d_parent
d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode.  Also take a reference and drop it in the caller to avoid
a use-after-free.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:30:29 -07:00
Sage Weil
48d0cbd124 ceph: handle racing calls to ceph_init_dentry
The ->lookup() and prepopulate_readdir() callers are working with unhashed
dentries, so we don't have to worry.  The export.c callers, though, need
to initialize something they got back from d_obtain_alias() and are
potentially racing with other callers.  Make sure we don't return unless
the dentry is properly initialized (by us or someone else).

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:30:15 -07:00
Sage Weil
dfabbed6fd ceph: set dir complete frag after adding capability
Curretly ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issue cap on a new
directory.  That means we clear the just-set bit.  Move the check that sets
the flag to after the cap is added/updated.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:30:02 -07:00
Yehuda Sadeh
e985222743 ceph: set up readahead size when rsize is not passed
This should improve the default read performance, as without it
readahead is practically disabled.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-07-26 11:29:14 -07:00
Sage Weil
2f90b852e3 ceph: ignore lease mask
The lease mask is no longer used (and it changed a while back).  Instead,
use a non-zero duration to indicate that there is a lease being issued.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:28:25 -07:00
Sage Weil
468640e32c ceph: fix ceph_lookup_open intent usage
We weren't properly calling lookup_instantiate_filp when setting up the
lookup intent, which could lead to file leakage on errors.  So:

 - use separate helper for the hidden snapdir translation, immediately
   following the mds request
 - use ceph_finish_lookup for the final dentry/return value dance in the
   exit path
 - lookup_instantiate_filp on success

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:28:11 -07:00
Sage Weil
9bae113a08 ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
We only need to put these on the directory unsafe list if they have
side effects that fsync(2) should flush out.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:27:59 -07:00
Sage Weil
acda765788 ceph: fix bad parent_inode calc in ceph_lookup_open
We were always getting NULL here because the intent file f_dentry is always
NULL at this point, which means we were always passing NULL to
ceph_mdsc_do_request.  In reality, this was fine, since this isn't
currently ever a write operation that needs to get strung on the dir's
unsafe list.

Use the dir explicitly, and only pass it if this open has side-effects that
a dir fsync should flush.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:27:48 -07:00
Sage Weil
d8de9ab63a ceph: avoid carrying Fw cap during write into page cache
The generic_file_aio_write call may block on balance_dirty_pages while we
flush data to the OSDs.  If we hold a reference to the FILE_WR cap during
that interval revocation by the MDS (e.g., to do a stat(2)) may be very
slow.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:27:34 -07:00
Greg Farnum
8f04d42276 ceph: report f_bfree based on kb_avail rather than diffing.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
2011-07-26 11:27:06 -07:00
Sage Weil
e77dc3e9c0 ceph: only queue capsnap if caps are dirty
We used to go into this branch if i_wrbuffer_ref_head was non-zero.  This
was an ancient check from before we were careful about dealing with all
kinds of caps (and not just dirty pages).  It is cleaner to only queue a
capsnap if there is an actual dirty cap.  If we are racing with...
something...we will end up here with ci->i_wrbuffer_refs but no dirty
caps.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:26:41 -07:00
Sage Weil
af0ed569d7 ceph: fix snap writeback when racing with writes
There are two problems that come up when we try to queue a capsnap while a
write is in progress:

 - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap
   with dirty == 0.  That will crash later in __ceph_flush_snaps().  Or
   on the FILE_WR cap if a write is in progress.
 - We may not have i_head_snapc set, which causes problems pretty quickly.
   Look to the snaprealm in this case.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:26:31 -07:00
Sage Weil
9cfa1098dc ceph: use flag bit for at_end readdir flag
This saves us a word of memory per file.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:26:18 -07:00
Sage Weil
4918b6d140 ceph: add F_SYNC file flag to force sync (non-O_DIRECT) io
This allows us to force IO through the sync path which you normally only
get when multiple clients are reading/writing to the same file or by
mounting with -o sync.  Among other things, this lets test programs verify
correctness with a single mount.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:26:07 -07:00
Sage Weil
252c6728de ceph: add flags field to file_info
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:25:27 -07:00
Josef Bacik
02c24a8218 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2.  For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:59 -04:00
Josef Bacik
06222e491e fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly.  In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the properly due-dilligence is done.  For example
in NFS/CIFS we need to make sure the file size is update properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:58 -04:00
Al Viro
b85fd6bdc9 don't open-code parent_ino() in assorted ->readdir()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:54 -04:00
Al Viro
a127e0af59 ceph: LOOKUP_OPEN is set only when it's the last component
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:59 -04:00
Al Viro
8a5e929dd2 don't transliterate lower bits of ->intent.open.flags to FMODE_...
->create() instances are much happier that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:52 -04:00
Al Viro
10556cb21a ->permission() sanitizing: don't pass flags to ->permission()
not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:24 -04:00
Al Viro
2830ba7f34 ->permission() sanitizing: don't pass flags to generic_permission()
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:22 -04:00
Al Viro
178ea73521 kill check_acl callback of generic_permission()
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:16 -04:00
Al Viro
1b71fe2efa ceph analog of cifs build_path_from_dentry() race fix
... unfortunately, cifs bug got copied.  Fix is essentially the same.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-16 23:43:58 -04:00
Sage Weil
d7f124f129 ceph: fix sync and dio writes across stripe boundaries
We were iterating across stripe boundaries properly, but not moving the
write buffer pointer forward.  This caused us to rewrite the same data
after the break.  Fix by adjusting the data pointer forward, and
recalculating the io and buffer alignment after the break.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-13 16:26:22 -07:00
Sage Weil
773e9b4426 ceph: fix page alignment corrections
dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1
 dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1

Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-13 16:26:10 -07:00
Sage Weil
0c1f91f271 ceph: unwind canceled flock state
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.

Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:36:45 -07:00
Sage Weil
0e98728fa3 ceph: fix ENOENT logic in striped_read
Getting ENOENT is equivalent to reading 0 bytes.  Make that correction
before setting up the hit_stripe and was_short flags.

Fixes the following case:
 dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
 dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct

Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:16 -07:00
Sage Weil
c3cd62839a ceph: fix short sync reads from the OSD
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer.  For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.

Fix by trimming by i_size, and the unconditionally zeroing the trailing
range.

Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:14 -07:00
Sage Weil
70b666c3b4 ceph: use ihold when we already have an inode ref
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock.  This avoids adding new and unnecessary
locking dependencies.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:11 -07:00
Sage Weil
db3540522e ceph: fix cap flush race reentrancy
In e9964c10 we change cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure.  It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.

Instead, move inodes with dirty to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant).  This is cleaner and more
robust.

Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:12 -07:00
Sage Weil
45e3d3eeb6 ceph: avoid inode lookup on nfs fh reconnect
If we get the inode from the MDS, we have a reference in req; don't do a
fresh lookup.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:06 -07:00
Sage Weil
3c454cf216 ceph: use LOOKUPINO to make unconnected nfs fh more reliable
If we are unable to locate an inode by ino, ask the MDS using the new
LOOKUPINO command.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:05 -07:00
Sage Weil
9d6fcb081a ceph: check return value for start_request in writepages
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil
6b4a3b517a ceph: remove useless check
rc is only ever 0 or negative in this method.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil
da39822c65 ceph: fix broken comparison in readdir loop
Both off and fi->offset are unsigned, so the difference is always >= 0.
Compare them directly instead of the sign of the difference.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:04 -07:00
Sage Weil
3540303f87 ceph: fix rare potential cap leak
If we grab new_cap, retake the lock, and find we already have a cap now
for the given mds, release new_cap.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil
ae59808301 ceph: use snprintf for dirstat content
We allocate a buffer for rstats if the dirstat option is enabled.  Use
snprintf.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil
1b36698577 libceph: remove unused variable
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:24:17 -07:00
Sage Weil
3b66378034 ceph: take reference on mds request r_unsafe_dir
We put ourselves on an inode list for the parent directory of metadata
operations so that an fsync on the directory will wait for metadata updates
to commit to disk.  We weren't holding a reference to that directory,
however, and under certain workloads (fsstress in this case) the directory
can go away.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:20:07 -07:00
Henry C Chang
d3d0720d4a ceph: do not use i_wrbuffer_ref as refcount for Fb cap
We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:48 -07:00
Henry C Chang
a26a185d27 ceph: fix list_add in ceph_put_snap_realm
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:36 -07:00
Henry C Chang
7d8e18a69d ceph: print debug message before put mds session
The mds session, s, could be freed during ceph_put_mds_session.
Move dout before ceph_put_mds_session.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:34 -07:00
Sage Weil
fca65b4ad7 ceph: do not call __mark_dirty_inode under i_lock
The __mark_dirty_inode helper now takes i_lock as of 250df6ed.  Fix the
one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-04 12:56:45 -07:00
Henry C Chang
8c71897be2 ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
We should unlock the page and return -ENOMEM if ceph_osdc_new_request
failed.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:12 -07:00
Sage Weil
3772d26d87 ceph: use ihold() when i_lock is held
See 0444d76ae6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:08 -07:00
42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
50f3515828 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: Create a new key type "ceph".
  libceph: Get secret from the kernel keys api when mounting with key=NAME.
  ceph: Move secret key parsing earlier.
  libceph: fix null dereference when unregistering linger requests
  ceph: unlock on error in ceph_osdc_start_request()
  ceph: fix possible NULL pointer dereference
  ceph: flush msgr_wq during mds_client shutdown
2011-03-30 09:46:09 -07:00
Tommi Virtanen
8323c3aa74 ceph: Move secret key parsing earlier.
This makes the base64 logic be contained in mount option parsing,
and prepares us for replacing the homebew key management with the
kernel key retention service.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:16 -07:00
Dave Chinner
0444d76ae6 fs: don't use igrab() while holding i_lock
Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph‥

If we are already holding the i_lock, we have a reference to the
inode so we can safely use ihold() to gain an extra reference. This
avoids hangs due to lock recursion on the i_lock now that the
inode_lock is gone and igrab() uses the i_lock itself.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Ryan Mallon <ryan@bluewatersys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-29 07:50:34 -07:00
Sage Weil
ef550f6f4f ceph: flush msgr_wq during mds_client shutdown
The release method for mds connections uses a backpointer to the
mds_client, so we need to flush the workqueue of any pending work (and
ceph_connection references) prior to freeing the mds_client.  This fixes
an oops easily triggered under UML by

 while true ; do mount ... ; umount ... ; done

Also fix an outdated comment: the flush in ceph_destroy_client only flushes
OSD connections out.  This bug is basically an artifact of the ceph ->
ceph+libceph conversion.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-25 13:27:48 -07:00
Sage Weil
147851d2dc ceph: rename dentry_release -> d_release, fix comment
Just for consistency's sake.  Fix obsolete comment too.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:26 -07:00
Henry C Chang
49bcb93236 ceph: add request to the tail of unsafe write list
In sync_write_wait(), we assume that the newest request is at the
tail of unsafe write list. We should maintain the semantics here.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:25 -07:00
Henry C Chang
78a255654f ceph: remove request from unsafe list if it is canceled/timed out
This fixes the list corruption warning like this:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81()
Hardware name: X8DTU
list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130).
Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan]
Pid: 10977, comm: smbd Tainted: G        W  2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1
Call Trace:
[<ffffffff8105753c>] warn_slowpath_common+0x7c/0x94
[<ffffffff810575ab>] warn_slowpath_fmt+0x41/0x43
[<ffffffff812351a3>] __list_add+0x68/0x81
[<ffffffffa014799d>] ceph_aio_write+0x614/0x8a2 [ceph]
[<ffffffff8111d2a0>] do_sync_write+0xe8/0x125
[<ffffffff81075a1f>] ? autoremove_wake_function+0x0/0x39
[<ffffffff811f21ec>] ? selinux_file_permission+0x5c/0xb3
[<ffffffff811e8521>] ? security_file_permission+0x16/0x18
[<ffffffff8111d864>] vfs_write+0xae/0x10b
[<ffffffff8111d91b>] sys_pwrite64+0x5a/0x76
[<ffffffff81012d32>] system_call_fastpath+0x16/0x1b
---[ end trace 08573eb9f07ff6f4 ]---

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:24 -07:00
Sage Weil
80456f8672 ceph: move readahead default to fs/ceph from libceph
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:23 -07:00
Yehuda Sadeh
ad1fee96cb ceph: add ino32 mount option
The ino32 mount option forces the ceph fs to report 32 bit
ino values.  This is useful for 64 bit kernels with 32 bit userspace.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-03-21 12:24:22 -07:00
Sage Weil
21f3b5f1bb ceph: remove debugfs debug cruft
Whoops!

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:20 -07:00
Sage Weil
09adc80c61 ceph: preserve I_COMPLETE across rename
d_move puts the renamed dentry at the end of d_subdirs, screwing with our
cached dentry directory offsets.  We were just clearing I_COMPLETE to avoid
any possibility of trouble.  However, assigning the renamed dentry an
offset at the end of the directory (to match it's new d_subdirs position)
is sufficient to maintain correct behavior and hold onto I_COMPLETE.

This is especially important for workloads like rsync, which renames files
into place.  Before, we would lose I_COMPLETE and do MDS lookups for each
file.  With this patch we only talk to the MDS on create and rename.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-15 09:14:03 -07:00
Al Viro
0eb980e317 ceph: fix d_revalidate oopsen on NFS exports
can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10 03:44:05 -05:00
Sage Weil
455cec0abf ceph: no .snap inside of snapped namespace
Otherwise you can do things like

# mkdir .snap/foo
# cd .snap/foo/.snap
# ls
<badness>

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:25:09 -08:00
Sage Weil
16a8b70a5a ceph: do not clear I_COMPLETE from d_release
First, this was racy anyway: d_release isn't called until well after the
dentry is unhashed.  Second, this runs afoul of the recent dcache change
that clears d_parent prior to calling d_release (949854d0), causing a NULL
pointer dereference.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 10:09:52 -08:00
Sage Weil
b545cc1505 ceph: do not set I_COMPLETE
Do not set the I_COMPLETE flag on directories until we resolve races with
dcache pruning.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 10:09:51 -08:00
Sage Weil
9bde178d05 Revert "ceph: keep reference to parent inode on ceph_dentry"
This reverts commit 97d79b403e.

This fails to account for d_parent changes due to rename or disconnected
dentries due to submounts or NFS reexports.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 10:09:50 -08:00
8bd89ca220 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: keep reference to parent inode on ceph_dentry
  ceph: queue cap_snaps once per realm
  libceph: fix socket write error handling
  libceph: fix socket read error handling
2011-02-21 15:01:38 -08:00
Yehuda Sadeh
97d79b403e ceph: keep reference to parent inode on ceph_dentry
When creating a new dentry we now hold a reference to the parent
inode in the ceph_dentry.  This is required due to the new RCU
changes from 949854d0, which set dentry->d_parent to NULL in d_kill before
calling the ->release() callback.  If/when that behavior is changed, we can
revert this hack.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-02-19 19:59:14 -08:00
Sage Weil
e8e1ba96b2 ceph: queue cap_snaps once per realm
We were forming a dirty list, and then queueing cap_snaps for each realm
_and_ its children, regardless of whether the children were already in the
dirty list.  This meant we did it twice for some realms.  Which in turn
meant we corrupted mdsc->snap_flush_list when the cap_snap was re-added to
the list it was already on, and could trigger an infinite loop.

We were also using recursion to do reach all the children, a no-no when
stack is limited.

Instead, (re)queue any children on the dirty list, avoiding processing
anything twice and avoiding any recursion.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-02-04 20:45:58 -08:00
b12ece7d85 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: avoid picking MDS that is not active
  ceph: avoid immediate cap check after import
  ceph: fix flushing of caps vs cap import
  ceph: fix erroneous cap flush to non-auth mds
  ceph: fix cap_wanted_delay_{min,max} mount option initialization
  ceph: fix xattr rbtree search
  ceph: fix getattr on directory when using norbytes
2011-01-28 12:12:58 +10:00
Sage Weil
d66bbd441c ceph: avoid picking MDS that is not active
Ignore replication or auth frag data if it indicates an MDS that is not
active.  This can happen if the MDS shuts down and the client has stale
data about the namespace distribution across the MDS cluster.  If that's
the case, fall back to directing the request based on the auth cap (which
should always be accurate).

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:16:37 -08:00
Sage Weil
7e57b81c76 ceph: avoid immediate cap check after import
The NODELAY flag avoids the heuristics that delay cap (issued/wanted)
release.  There's no reason for that after we import a cap, and it kills
whatever benefit we get from those delays.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:26 -08:00
Sage Weil
088b3f5e9e ceph: fix flushing of caps vs cap import
If we are mid-flush and a cap is migrated to another node, we need to
resend the cap flush message to the new MDS, and do so with the original
flush_seq to avoid leaking across a sync boundary.  Previously we didn't
redo the flush (we only flushed newly dirty data), which would cause a
later sync to hang forever.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:25 -08:00
Sage Weil
24be0c4810 ceph: fix erroneous cap flush to non-auth mds
The int flushing is global and not clear on each iteration of the loop,
which can cause a second flush of caps to any MDSs with ids greater than
the auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:24 -08:00
Sage Weil
50aac4fec5 ceph: fix cap_wanted_delay_{min,max} mount option initialization
These were initialized to 0 instead of the default, fallout from the RBD
refactor in 3d14c5d2b6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:22 -08:00
Sage Weil
17db143fc0 ceph: fix xattr rbtree search
Fix xattr name comparison in rbtree search for strings that share a prefix.
The *name argument is null terminated, but the xattr name is not, so we
need to use strncmp, but that means adjusting for the case where name is
a prefix of xattr->name.

The corresponding case in __set_xattr() already handles this properly
(although in that case *name is also not null terminated).

Reported-by: Sergiy Kibrik <sakib@meta.ua>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 15:50:11 -08:00
Yehuda Sadeh
1c1266bb91 ceph: fix getattr on directory when using norbytes
The norbytes mount option was broken, and when doing getattr
on a directory it return the rbytes instead of the number of
entities. This commit fixes it.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 15:50:06 -08:00
a170315420 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup when trying to mount inexistent image
  net/ceph: make ceph_msgr_wq non-reentrant
  ceph: fsc->*_wq's aren't used in memory reclaim path
  ceph: Always free allocated memory in osdmap_decode()
  ceph: Makefile: Remove unnessary code
  ceph: associate requests with opening sessions
  ceph: drop redundant r_mds field
  ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
  ceph: add dir_layout to inode
2011-01-13 10:25:24 -08:00
Tejun Heo
01e6acc4ea ceph: fsc->*_wq's aren't used in memory reclaim path
fsc->*_wq's aren't depended upon during memory reclaim.  Convert to
alloc_workqueue() w/o WQ_MEM_RECLAIM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Tracey Dent
582c86e690 ceph: Makefile: Remove unnessary code
Remove the if and else conditional because the code is in mainline and there
is no need in it being there.

Also, Changed Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
 Documentation/kbuild/makefiles.txt.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
dc69e2e9fc ceph: associate requests with opening sessions
Associate request with sessions that aren't yep open.  This makes the
debugfs mdsc request list more informative.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
4af25fdda6 ceph: drop redundant r_mds field
The r_mds field is redundant, since we can find the same information at
r_session->s_mds, and when r_session is NULL then r_mds is meaningless.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
14303d20f3 ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS.  This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests.  This will result in a forward
and degrade performance.  It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil
6c0f3af72c ceph: add dir_layout to inode
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function.  This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:12 -08:00
Nick Piggin
b74c79e993 fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
34286d6662 fs: rcu-walk aware d_revalidate method
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin
fb045adb99 fs: dcache reduce branches in lookup path
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin
fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin
b5c84bf6f6 fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin
2fd6b7f507 fs: dcache scale subdirs
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin
da5029563a fs: dcache scale d_unhashed
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin
b7ab39f631 fs: dcache scale dentry refcount
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Henry C Chang
b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil
92cf765237 ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined.  It isn't for those callers, so check!

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:53:48 -08:00
Henry C Chang
ab226e21ad ceph: fix direct-io on non-page-aligned buffers
The user buffer may be 512-byte aligned, not page-aligned.  We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15 20:46:16 -08:00
Sage Weil
1cd275f609 ceph: fix ioctl magic
The ioctl magic was inadvertently changed in 571dba52.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-06 09:45:22 -08:00
Herb Shiu
a5b10629ed ceph: Behave better when handling file lock replies.
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu
637ae8d547 ceph: pass lock information by struct file_lock instead of as individual params.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu
25933abdd8 ceph: Handle file locks in replies from the MDS.
Previously the kernel client incorrectly assumed everything was a directory.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:27 -08:00
Sage Weil
884ea89276 ceph: avoid possible null deref in readdir after dir llseek
last may be NULL, but we dereference it in the else branch without
checking.  Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:15:31 -08:00
76db8ac45f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix readdir EOVERFLOW on 32-bit archs
  ceph: fix frag offset for non-leftmost frags
  ceph: fix dangling pointer
  ceph: explicitly specify page alignment in network messages
  ceph: make page alignment explicit in osd interface
  ceph: fix comment, remove extraneous args
  ceph: fix update of ctime from MDS
  ceph: fix version check on racing inode updates
  ceph: fix uid/gid on resent mds requests
  ceph: fix rdcache_gen usage and invalidate
  ceph: re-request max_size if cap auth changes
  ceph: only let auth caps update max_size
  ceph: fix open for write on clustered mds
  ceph: fix bad pointer dereference in ceph_fill_trace
  ceph: fix small seq message skipping
  Revert "ceph: update issue_seq on cap grant"
2010-11-19 15:32:22 -08:00
Sage Weil
3105c19c45 ceph: fix readdir EOVERFLOW on 32-bit archs
One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback.  Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.

Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-18 09:15:07 -08:00
Arnd Bergmann
451a3c24b0 BKL: remove extraneous #include <smp_lock.h>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-17 08:59:32 -08:00
Sage Weil
7b88dadc13 ceph: fix frag offset for non-leftmost frags
We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2.  This fixes readdir on
fragmented (large) directories.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-11 16:48:59 -08:00
Sage Weil
a1629c3b24 ceph: fix dangling pointer
Clear fi->last_name when it's freed.  The only caller is rewinddir() (or
equivalent lseek).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-11 15:24:06 -08:00
Sage Weil
b7495fc2ff ceph: make page alignment explicit in osd interface
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched.  This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO).  We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:43:12 -08:00
Sage Weil
e98b6fed84 ceph: fix comment, remove extraneous args
The offset/length arguments aren't used.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:24:53 -08:00
Sage Weil
d8672d64b8 ceph: fix update of ctime from MDS
The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.

This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 09:24:34 -08:00
Sage Weil
8bd59e0188 ceph: fix version check on racing inode updates
We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have.  The
exception is when an MDS sense unstable information, in which case we
always update.

The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied.  Fixed and clarified the comment.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 09:23:12 -08:00
Sage Weil
cb4276cca4 ceph: fix uid/gid on resent mds requests
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid.  Put that information in the
request struct on request setup.

This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Sage Weil
cd045cb42a ceph: fix rdcache_gen usage and invalidate
We used to use rdcache_gen to indicate whether we "might" have cached
pages.  Now we just look at the mapping to determine that.  However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages.  That can happen
at any time (presumably when we carry FILE_CACHE).  We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate.  If they are
equal, we should invalidate.  On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen.  Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1.  (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Sage Weil
feb4cc9bb4 ceph: re-request max_size if cap auth changes
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests.  This fixes potential
hangs on writes that extend a file and race with an cap migration between
MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:23 -08:00
Sage Weil
912a9b0319 ceph: only let auth caps update max_size
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap.  Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:21 -08:00
Sage Weil
7421ab8041 ceph: fix open for write on clustered mds
Normally when we open a file we already have a cap, and simply update the
wanted set.  However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS.  Only
reuse existing caps if we are opening for read or the existing cap is auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:07:15 -08:00
Sage Weil
d8b16b3d1c ceph: fix bad pointer dereference in ceph_fill_trace
We dereference *in a few lines down, but only set it on rename.  It is
apparently pretty rare for this to trigger, but I have been hitting it
with a clustered MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 08:40:43 -08:00
Al Viro
a7f9fb205a convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:18 -04:00
Sage Weil
2f56f56ad9 Revert "ceph: update issue_seq on cap grant"
This reverts commit d91f2438d8.

The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths.  By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.

The larger question is what workload/problem made me think it should be
updated here...

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-27 21:05:54 -07:00
Wu Fengguang
1b430beee5 writeback: remove nonblocking/encountered_congestion references
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks).  There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.

The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion.  The latter will lead to more seeky IO.

The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.

We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
   that would skip some dirty inodes on congestion and page out others, which
   is unfair in terms of LRU age.

Inspired by Christoph Hellwig. Thanks!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
Sage Weil
efa4c1206e ceph: do not carry i_lock for readdir from dcache
We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicationg the vfs locking
scalability work.  Since we don't actually need it here anyway, remove
it.

We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:27 -07:00
Julia Lawall
61413c2f59 fs/ceph/xattr.c: Use kmemdup
Convert a sequence of kmalloc and memcpy to use kmemdup.

The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@

  a =
-  \(kmalloc\|kzalloc\)(len,flag)
+  kmemdup(arg,len,flag)
  <... when != a
  if (a == NULL || ...) S
  ...>
- memcpy(a,arg,len+1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:26 -07:00
Greg Farnum
571dba52a3 ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:23 -07:00
Randy Dunlap
6f453ed6c0 ceph: fix debugfs warnings
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:

fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:21 -07:00
Sage Weil
496e59553c ceph: switch from BKL to lock_flocks()
Switch from using the BKL explicitly to the new lock_flocks() interface.
Eventually this will turn into a spinlock.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:18 -07:00
Greg Farnum
fca4451acf ceph: preallocate flock state without locks held
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held.  Preallocate space without
the lock, and retry if the lock state changes out from underneath us.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:17 -07:00
Sage Weil
18a38193ef ceph: use mapping->nrpages to determine if mapping is empty
This is simpler and faster.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:15 -07:00
Sage Weil
93afd449aa ceph: only invalidate on check_caps if we actually have pages
The i_rdcache_gen value only implies we MAY have cached pages; actually
check the mapping to see if it's worth bothering with an invalidate.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:15 -07:00
Sage Weil
4c32f5dda5 ceph: do not hide .snap in root directory
Snaps in the root directory are now supported by the MDS, and harmless on
older versions.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:14 -07:00
Yehuda Sadeh
3d14c5d2b6 ceph: factor out libceph from Ceph file system
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup is correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:28 -07:00
Yehuda Sadeh
ae1533b62b ceph-rbd: osdc support for osd call and rollback operations
This will be used for rbd snapshots administration.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:37:25 -07:00
Yehuda Sadeh
68b4476b0b ceph: messenger and osdc changes for rbd
Allow the messenger to send/receive data in a bio.  This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.

We can now have trailing variable sized data for osd
ops.  Also osd ops encoding is more modular.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:18 -07:00
Yehuda Sadeh
3499e8a5d4 ceph: refactor osdc requests creation functions
The osd requests creation are being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:36:01 -07:00
Yehuda Sadeh
7669a2c95e ceph: lookup pool in osdmap by name
Implement a pool lookup by name.  This will be used by rbd.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:35:36 -07:00
Sage Weil
d91f2438d8 ceph: update issue_seq on cap grant
We need to update the issue_seq on any grant operation, be it via an MDS
reply or a separate grant message.  The update in the grant path was
missing.  This broke cap release for inodes in which the MDS sent an
explicit grant message that was not soon after followed by a successful
MDS reply on the same inode.

Also fix the signedness on seq locals.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:01:50 -07:00
Greg Farnum
21b559de56 ceph: send cap release message early on failed revoke.
If an MDS tries to revoke caps that we don't have, we want to send
releases early since they probably contain the caps message the MDS
is looking for.

Previously, we only sent the messages if we didn't have the inode either. But
in a multi-mds system we can retain the inode after dropping all caps for
a single MDS.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V
bba0cd0e3d ceph: Update max_len with minimum required size
encode_fh on error should update max_len with minimum required
size, so that caller can redo the call with the reallocated buffer.
This is required with open by handle patch series

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V
92923dcbfc ceph: Fix return value of encode_fh function
encode_fh function should return 255 on error as done by other file
system to indicate EOVERFLOW. Also max_len is in sizeof(u32) units
and not in bytes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Sage Weil
6bc18876ba ceph: avoid null deref in osd request error path
If we interrupt an osd request, we call __cancel_request, but it wasn't
verifying that req->r_osd was non-NULL before dereferencing it.  This could
cause a crash if osds were flapping and we aborted a request on said osd.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Henry C Chang
936aeb5c4a ceph: fix list_add usage on unsafe_writes list
Fix argument order.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Sage Weil
be4f104dfd ceph: select CRYPTO
We select CRYPTO_AES, but not CRYPTO.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-17 12:30:31 -07:00
Sage Weil
a43fb73101 ceph: check mapping to determine if FILE_CACHE cap is used
See if the i_data mapping has any pages to determine if the FILE_CACHE
capability is currently in use, instead of assuming it is any time the
rdcache_gen value is set (i.e., issued -> used).

This allows the MDS RECALL_STATE process work for inodes that have cached
pages.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-17 09:54:31 -07:00
Sage Weil
e835124c2b ceph: only send one flushsnap per cap_snap per mds session
Sending multiple flushsnap messages is problematic because we ignore
the response if the tid doesn't match, and the server may only respond to
each one once.  It's also a waste.

So, skip cap_snaps that are already on the flushing list, unless the caller
tells us to resend (because we are reconnecting).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-17 08:03:08 -07:00
Sage Weil
ae00d4f37f ceph: fix cap_snap and realm split
The cap_snap creation/queueing relies on both the current i_head_snapc
_and_ the i_snap_realm pointers being correct, so that the new cap_snap
can properly reference the old context and the new i_head_snapc can be
updated to reference the new snaprealm's context.  To fix this, we:

 - move inodes completely to the new (split) realm so that i_snap_realm
   is correct, and
 - generate the new snapc's _before_ queueing the cap_snaps in
   ceph_update_snap_trace().

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-16 16:26:51 -07:00
Sage Weil
cfc0bf6640 ceph: stop sending FLUSHSNAPs when we hit a dirty capsnap
Stop sending FLUSHSNAP messages when we hit a capsnap that has dirty_pages
or is still writing.  We'll send the newer capsnaps only after the older
ones complete.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-14 15:50:59 -07:00
Sage Weil
8bef9239ee ceph: correctly set 'follows' in flushsnap messages
The 'follows' should match the seq for the snap context for the given snap
cap, which is the context under which we have been dirtying and writing
data and metadata.  The snapshot that _contains_ those updates thus
_follows_ that context's seq #.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-14 15:45:44 -07:00
Sage Weil
467c525109 ceph: fix dn offset during readdir_prepopulate
When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbered our just-set offset.  This can cause the readdir result offsets
to get out of sync with the server.  Add an argument to the helper so
that it does not.

This bug was introduced by 1cd3935bed.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-13 11:40:36 -07:00
Sage Weil
a77d9f7dce ceph: fix file offset wrapping at 4GB on 32-bit archs
Cast the value before shifting so that we don't run out of bits with a
32-bit unsigned long.  This fixes wrapping of high file offsets into the
low 4GB of a file on disk, and the subsequent data corruption for large
files.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:55:25 -07:00
Sage Weil
3612abbd5d ceph: fix reconnect encoding for old servers
Fix the reconnect encoding to encode the cap record when the MDS does not
have the FLOCK capability (i.e., pre v0.22).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:52:47 -07:00
Yehuda Sadeh
3d4401d9d0 ceph: fix pagelist kunmap tail
A wrong parameter was passed to the kunmap.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:52:47 -07:00
Sage Weil
ca04d9c3ec ceph: fix null pointer deref on anon root dentry release
When we release a root dentry, particularly after a splice, the parent
(actually our) inode was evaluating to NULL and was getting dereferenced
by ceph_snap().  This is reproduced by something as simple as

 mount -t ceph monhost:/a/b mnt
 mount -t ceph monhost:/a mnt2
 ls mnt2

A splice_dentry() would kill the old 'b' inode's root dentry, and we'd
crash while releasing it.

Fix by checking for both the ROOT and NULL cases explicitly.  We only need
to invalidate the parent dir when we have a correct parent to invalidate.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:52:47 -07:00
Dan Carpenter
b545787dbb ceph: fix get_ticket_handler() error handling
get_ticket_handler() returns a valid pointer or it returns
ERR_PTR(-ENOMEM) if kzalloc() fails.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-26 09:26:50 -07:00
Sage Weil
e072f8aa35 ceph: don't BUG on ENOMEM during mds reconnect
We are in a position to return an error; do that instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-26 09:26:37 -07:00
Dan Carpenter
f44c3890d9 ceph: ceph_mdsc_build_path() returns an ERR_PTR
ceph_mdsc_build_path() returns an ERR_PTR but this code is set up to
handle NULL returns.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-26 09:24:28 -07:00
Alan Cox
ad8453ab0a ceph: Fix warnings
Just scrubbing some warnings so I can see real problem ones in the build
noise. For 32bit we need to coax gcc politely into believing we really
honestly intend to the casts. Using (u64)(unsigned long) means we cast from
a pointer to a type of the right size and then extend it. This stops the
warning spew.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-25 12:02:14 -07:00
Dan Carpenter
ac1f12ef56 ceph: ceph_get_inode() returns an ERR_PTR
ceph_get_inode() returns an ERR_PTR and it doesn't return a NULL.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-25 12:01:54 -07:00
Sage Weil
36e21687e6 ceph: initialize fields on new dentry_infos
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-24 16:24:19 -07:00
Sage Weil
7d8cb26d7d ceph: maintain i_head_snapc when any caps are dirty, not just for data
We used to use i_head_snapc to keep track of which snapc the current epoch
of dirty data was dirtied under.  It is used by queue_cap_snap to set up
the cap_snap.  However, since we queue cap snaps for any dirty caps, not
just for dirty file data, we need to keep a valid i_head_snapc anytime
we have dirty|flushing caps.  This fixes a NULL pointer deref in
queue_cap_snap when writing back dirty caps without data (e.g.,
snaptest-authwb.sh).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-24 16:24:18 -07:00
Henry C Chang
07a27e226d ceph: fix osd request lru adjustment when sending request
Fix argument order.  We want to move the item to the end of the list, not
change the position of the head.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 21:34:27 -07:00
Sage Weil
124514918b ceph: don't improperly set dir complete when holding EXCL cap
If we hold the EXCL cap, we cannot trust the dir stats from the MDS (num
files, subdirs) and must not incorrectly conclude that the directory is
empty.  If we do, we get can bad results from lookup (bad ENOENT) and
bad readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 21:33:32 -07:00
Michael Rubin
679ceace84 mm: exporting account_page_dirty
This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.

Signed-off-by: Michael Rubin <mrubin@google.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:51 -07:00
Sage Weil
eb6bb1c5bd ceph: direct requests in snapped namespace based on nonsnap parent
When making a request in the virtual snapdir or a snapped portion of the
namespace, we should choose the MDS based on the first nonsnap parent (and
its caps).  If that is not the best place, we will get forward hints to
find the right MDS in the cluster.  This fixes ESTALE errors when using
the .snap directory and namespace with multiple MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:48 -07:00
Sage Weil
ed32604448 ceph: queue cap snap writeback for realm children on snap update
When a realm is updated, we need to queue writeback on inodes in that
realm _and_ its children.  Otherwise, if the inode gets cowed on the
server, we can get a hang later due to out-of-sync cap/snap state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:47 -07:00
Sage Weil
4a625be472 ceph: include dirty xattrs state in snapped caps
When we snapshot dirty metadata that needs to be written back to the MDS,
include dirty xattr metadata.  Make the capsnap reference the encoded
xattr blob so that it will be written back in the FLUSHSNAP op.

Also fix the capsnap creation guard to include dirty auth or file bits,
not just tests specific to dirty file data or file writes in progress
(this fixes auth metadata writeback).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:46 -07:00
Sage Weil
082afec92d ceph: fix xattr cap writeback
We should include the xattr metadata blob in the cap update message any
time we are flushing dirty state, NOT just when we are also dropping the
cap.  This fixes async xattr writeback.

Also, clean up the code slightly to avoid duplicating the bit test.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:41 -07:00
Sage Weil
f3c60c5918 ceph: fix multiple mds session shutdown
The use of a completion when waiting for session shutdown during umount is
inappropriate, given the complexity of the condition.  For multiple MDS's,
this resulted in the umount thread spinning, often preventing the session
close message from being processed in some cases.

Switch to a waitqueue and defined a condition helper.  This cleans things
up nicely.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:04:43 -07:00
Yehuda Sadeh
e56fa10e92 ceph: generalize mon requests, add pool op support
Generalize the current statfs synchronous requests, and support pool_ops.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-10 14:41:25 -07:00
Sage Weil
0eb6cd49f6 ceph: only queue async writeback on cap revocation if there is dirty data
Normally, if the Fb cap bit is being revoked, we queue an async writeback.
If there is no dirty data but we still hold the cap, this leaves the
client sitting around doing nothing until the cap timeouts expire and the
cap is released on its own (as it would have been without the revocation).

Instead, only queue writeback if the bit is actually used (i.e., we have
dirty data).  If not, we can reply to the revocation immediately.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-05 13:53:40 -07:00
Sage Weil
e9d1774431 ceph: do not ignore osd_idle_ttl mount option
Actually apply the mount option to the mount_args struct.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 12:56:57 -07:00
Sage Weil
52dfb8ac0e ceph: constify dentry_operations
This makes checkpatch happy.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 10:25:30 -07:00
Sage Weil
213c99ee0c ceph: whitespace cleanup
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 10:25:11 -07:00
Greg Farnum
40819f6fb2 ceph: add flock/fcntl lock support
Implement flock inode operation to support advisory file locking.  All
lock/unlock operations are synchronous with the MDS.  Lock state is
sent when reconnecting to a recovering MDS to restore the shared lock
state.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 16:10:53 -07:00
Greg Farnum
fbaad9797a ceph: define on-wire types, constants for file locking support
Define the MDS operations and data types for doing file advisory locking
with the MDS.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:54 -07:00
Greg Farnum
c6f3fdc592 ceph: add CEPH_FEATURE_FLOCK to the supported feature bits
This informs the server that we will accept v2 client_caps format and v2
client_reconnect format messages.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:51 -07:00
Sage Weil
20cb34ae9e ceph: support v2 reconnect encoding
Encode either old or v2 encoding of client_reconnect message, depending on
whether the peer has the FLOCK feature bit.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:50 -07:00
Sage Weil
ce1fbc8dd6 ceph: support v2 client_caps encoding
Add support for v2 encoding of MClientCaps, which includes a flock blob.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:49 -07:00
Sage Weil
cbbfe49905 ceph: move AES iv definition to shared header
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:31 -07:00
Sage Weil
73a7e693f9 ceph: fix decoding of pool snap info
The pool info contains a vector for snap_info_t, not snap ids.  This fixes
the broken decoding, which would declare teh update corrupt when a pool
snapshot was created.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 11:10:07 -07:00
Sage Weil
2d9c98ae97 ceph: make ->sync_fs not wait if wait==0
The ->sync_fs() super op only needs to wait if wait is true.  Otherwise,
just get some dirty cap writeback started.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
b8cd07e78e ceph: warn on missing snap realm
Well, this Shouldn't Happen, so it would be helpful to know the caller when
it does.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
effcb9ed43 ceph: print useful error message when crush rule not found
Include the crush_ruleset in the error message.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
a8b763a9b3 ceph: use %pU to print uuid (fsid)
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
f0b18d9f22 ceph: sync header defs with server code
Define ROLLBACK op, IFLOCK inode lock (for advisory file locking).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
5cd068c200 ceph: clean up header guards
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
9688f19a18 ceph: strip misleading/obsolete version, feature info
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
6a2593823a ceph: specify supported features in super.h
Specify the supported/required feature bits in super.h client code instead
of using the definitions from the shared kernel/userspace headers (which
will go away shortly).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
c309f0ab26 ceph: clean up fsid mount option
Specify the fsid mount option in hex, not via the major/minor u64 hackery we had
before.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
e0f9f9ee8f ceph: remove unused 'monport' mount option
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Greg Farnum
e55b71f802 ceph: handle ESTALE properly; on receipt send to authority if it wasn't
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Greg Farnum
2bc50259fa ceph: add ceph_get_cap_for_mds function.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
154f42c2c3 ceph: connect to export targets on cap export
When we get a cap EXPORT message, make sure we are connected to all export
targets to ensure we can handle the matching IMPORT.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
cb170a2215 ceph: connect to export targets if mds is laggy
If an MDS we are talking to may have failed, we need to open sessions to
its potential export targets to ensure that any in-progress migration that
may have involved some of our caps is properly handled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
ed0552a1a2 ceph: introduce helper to connect to mds export targets
There are a few cases where we need to open sessions with a given mds's
potential export targets.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
796d6955a5 ceph: only set num_pages in calc_layout
Setting it elsewhere is unnecessary and more fragile.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Yehuda Sadeh
37151668ba ceph: do caps accounting per mds_client
Caps related accounting is now being done per mds client instead
of just being global. This prepares ground work for a later revision
of the caps preallocated reservation list.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
0deb01c999 ceph: track laggy state of mds from mdsmap
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Yehuda Sadeh
cd84db6e40 ceph: code cleanup
Mainly fixing minor issues reported by sparse.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
ca81f3f6bd ceph: skip if no auth cap in flush_snaps
If we have a capsnap but no auth cap (e.g. because it is migrating to
another mds), bail out and do nothing for now.  Do NOT remove the capsnap
from the flush list.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
3b454c4945 ceph: simplify caps revocation, fix for multimds
The caps revocation should either initiate writeback, invalidateion, or
call check_caps to ack or do the dirty work.  The primary question is
whether we can get away with only checking the auth cap or whether all
caps need to be checked.

The old code was doing...something else.  At the very least, revocations
from non-auth MDSs could break by triggering the "check auth cap only"
case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
38e8883ee3 ceph: simplify add_cap_releases
No functional change, aside from more useful debug output.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
ee6b272b9c ceph: drop unused argument
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
2962507ca2 ceph: perform lazy reads when file mode and caps permit
If the file mode is marked as "lazy," perform cached/buffered reads when
the caps permit it.  Adjust the rdcache_gen and invalidation logic
accordingly so that we manage our cache based on the FILE_CACHE -or-
FILE_LAZYIO cap bits.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
33caad324b ceph: perform lazy writes when file mode and caps permit
If we have marked a file as "lazy" (using the ceph ioctl), perform buffered
writes when the MDS caps allow it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
8c6e9229fc ceph: add LAZYIO ioctl to mark a file description for lazy consistency
Allow an application to mark a file descriptor for lazy file consistency
semantics, allowing buffered reads and writes when multiple clients are
accessing the same file.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
84d9509234 ceph: request FILE_LAZYIO cap when LAZY file mode is set
Also clean up the file flags -> file mode -> wanted caps functions while
we're at it.  This resyncs this file with userspace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:38 -07:00
Yehuda Sadeh
03066f2345 ceph: use complete_all and wake_up_all
This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-27 13:11:17 -07:00
Robert P. J. Day
25848b3ec6 ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-24 21:36:07 -07:00
Sage Weil
1dadcce358 ceph: fix dentry lease release
When we embed a dentry lease release notification in a request, invalidate
our lease so we don't think we still have it.  Otherwise we can get all
sorts of incorrect client behavior when multiple clients are interacting
with the same part of the namespace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 13:54:21 -07:00
Sage Weil
8c696737aa ceph: fix leak of dentry in ceph_init_dentry() error path
If we fail to allocate a ceph_dentry_info, don't leak the dn reference.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:07 -07:00
Sage Weil
bc4fdca857 ceph: fix pg_mapping leak on pg_temp updates
Free the ceph_pg_mapping structs when they are removed from the pg_temp
rbtree.  Also fix a leak in the __insert_pg_mapping() error path.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:06 -07:00
Sage Weil
252af52146 ceph: fix d_release dop for snapdir, snapped dentries
We need to set the d_release dop for snapdir and snapped dentries so that
the ceph_dentry_info struct gets released.  We also use the dcache to
cache readdir results when possible, which only works if we know when
dentries are dropped from the cache.  Since we don't use the dcache for
readdir in the hidden snapdir, avoid that case in ceph_dentry_release.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:05 -07:00
Sage Weil
a0dff78dab ceph: avoid dcache readdir for snapdir
We should always go to the MDS for readdir on the hidden snapdir.  The
set of snapshots can change at any time; the client can't trust its cache
for that.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-22 13:50:45 -07:00
Sage Weil
e979cf5039 ceph: do not include cap/dentry releases in replayed messages
Strip the cap and dentry releases from replayed messages.  They can
cause the shared state to get out of sync because they were generated
(with the request message) earlier, and no longer reflect the current
client state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:18 -07:00
Sage Weil
01a92f174f ceph: reuse request message when replaying against recovering mds
Replayed rename operations (after an mds failure/recovery) were broken
because the request paths were regenerated from the dentry names, which
get mangled when d_move() is called.

Instead, resend the previous request message when replaying completed
operations.  Just make sure the REPLAY flag is set and the target ino is
filled in.

This fixes problems with workloads doing renames when the MDS restarts,
where the rename operation appears to succeed, but on mds restart then
fails (leading to client confusion, app breakage, etc.).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:17 -07:00
Sage Weil
f91d3471cc ceph: fix creation of ipv6 sockets
Use the address family from the peer address instead of assuming IPv4.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-09 15:00:20 -07:00
Sage Weil
39139f64e1 ceph: fix parsing of ipv6 addresses
Check for brackets around the ipv6 address to avoid ambiguity with the port
number.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-09 15:00:18 -07:00
Sage Weil
d06dbaf6c2 ceph: fix printing of ipv6 addrs
The buffer was too small.  Make it bigger, use snprintf(), put brackets
around the ipv6 address to avoid mixing it up with the :port, and use the
ever-so-handy %pI[46] formats.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08 16:49:53 -07:00
Dan Carpenter
b0bbb0be8f ceph: add kfree() to error path
We leak a "pi" on this error path.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08 08:03:24 -07:00
Sage Weil
22b1de06c9 ceph: fix leak of mon authorizer
Fix leak of a struct ceph_buffer on umount.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 15:36:49 -07:00
Sage Weil
ed98adad3d ceph: fix message revocation
A message can be on a queue (pending or sent), or out_msg (sending), or
both.  We were assuming that if it's not on a queue it couldn't be out_msg,
but that was false in the case of lossy connections like the OSD.  Fix
ceph_con_revoke() to treat these cases independently.  Also, fix the
out_kvec_is_message check to only trigger if we are currently sending
_this_ message.

This fixes a GPF in tcp_sendpage, triggered by OSD restarts.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 12:16:23 -07:00
Sage Weil
153a10939e ceph: fix crush device 'out' threshold to 1.0, not 0.1
Fix a typo that made any OSD weighted between 0.1 and 1.0 effectively
weighted as 1.0 (fully in).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 09:44:17 -07:00
Sage Weil
443b3760a0 ceph: fix caps usage accounting for import (non-reserved) case
We need to increase the total and used counters when allocating a new cap
in the non-reserved (cap import) case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-29 09:31:56 -07:00
Sage Weil
ec97f88ba6 ceph: only release clean, unused caps with mds requests
We can drop caps with an mds request.  Ensure we only drop unused AND
clean caps, since the MDS doesn't support cap writeback in that context,
nor do we track it.  If caps are dirty, and the MDS needs them back, we
it will revoke and we will flush in the normal fashion.

This fixes a possibly loss of metadata.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-29 09:31:55 -07:00
Sage Weil
a1a31e7342 ceph: fix crush CHOOSE_LEAF when type is already a leaf
We may not recurse for CHOOSE_LEAF if we start with a leaf node.  When
that happens, the out2 vector needs to be filled in with the result.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 12:58:14 -07:00
Sage Weil
55bda7aacd ceph: fix crush recursion
There was a longstanding problem with recursion through intervening
bucket types on complex hierarchies.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 12:55:48 -07:00
Yehuda Sadeh
bfaf148eb2 ceph: fix caps debugfs entry
The ceph client structure was not set correctly.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 09:47:36 -07:00
Sage Weil
17c688c3df ceph: delay umount until all mds requests drop inode+dentry refs
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structing and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache.  If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.

This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct.  I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-21 16:11:50 -07:00
Sage Weil
d69ed05a80 ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate
Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing.  (Also, report the error code.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-21 16:04:10 -07:00
Sage Weil
cebc5be6b6 ceph: fix crush map update decoding
If the incremental osdmap has a new crush map, advance the position after
decoding so that we can parse the rest of the osdmap properly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-17 10:22:48 -07:00
Sage Weil
ae32be3134 ceph: fix message memory leak, uninitialized variable
We need to properly initialize skip, as not all alloc_msg op instances
set it.

Also, BUG if someone says skip but also allocates a message.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Sage Weil
4a32f93d29 ceph: fix map handler error path
Don't leak message if we receive an unexpected message type.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Yehuda Sadeh
0cf5537b15 ceph: some endianity fixes
Fix some problems that came up with sparse.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Sage Weil
2b2300d62e ceph: try to send partial cap release on cap message on missing inode
If we have enough memory to allocate a new cap release message, do so, so
that we can send a partial release message immediately.  This keeps us from
making the MDS wait when the cap release it needs is in a partially full
release message.

If we fail because of ENOMEM, oh well, they'll just have to wait a bit
longer.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:30:25 -07:00
Sage Weil
3d7ded4d81 ceph: release cap on import if we don't have the inode
If we get an IMPORT that give us a cap, but we don't have the inode, queue
a release (and try to send it immediately) so that the MDS doesn't get
stuck waiting for us.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:30:07 -07:00
Sage Weil
9dbd412f56 ceph: fix misleading/incorrect debug message
Nothing is released here: the caps message is simply ignored in this case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:29:59 -07:00
Jeff Mahoney
00d5643e7c ceph: fix atomic64_t initialization on ia64
bdi_seq is an atomic_long_t but we're using ATOMIC_INIT, which causes
 build failures on ia64. This patch fixes it to use ATOMIC_LONG_INIT.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:29:50 -07:00
Sage Weil
1e5ea23df1 ceph: fix lease revocation when seq doesn't match
If the client revokes a lease with a higher seq than what we have, keep
the mds's seq, so that it honors our release.  Otherwise, we can hang
indefinitely.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-04 10:05:40 -07:00
Sage Weil
558d3499bd ceph: fix f_namelen reported by statfs
We were setting f_namelen in kstatfs to PATH_MAX instead of NAME_MAX.
That disagrees with ceph_lookup behavior (which checks against NAME_MAX),
and also makes the pjd posix test suite spit out ugly errors because with
can't clean up its temporary files.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-01 16:56:03 -07:00
Yehuda Sadeh
205475679a ceph: fix memory leak in statfs
Freeing the statfs request structure when required.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-01 16:56:02 -07:00
Henry C Chang
13a4214cd9 ceph: fix d_subdirs ordering problem
We misused list_move_tail() to order the dentry in d_subdirs.
This will screw up the d_subdirs order.

This bug can be reliably reproduced by:
1. mount ceph fs.
2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git
3. Run autogen.sh in ceph directory.
(Note: Errors only occur at the first time you run autogen.sh.)

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-01 16:55:55 -07:00
b612a05537 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: clean up on forwarded aborted mds request
  ceph: fix leak of osd authorizer
  ceph: close out mds, osd connections before stopping auth
  ceph: make lease code DN specific
  fs/ceph: Use ERR_CAST
  ceph: renew auth tickets before they expire
  ceph: do not resend mon requests on auth ticket renewal
  ceph: removed duplicated #includes
  ceph: avoid possible null dereference
  ceph: make mds requests killable, not interruptible
  sched: add wait_for_completion_killable_timeout
2010-05-30 08:56:39 -07:00
Sage Weil
2a8e5e3637 ceph: clean up on forwarded aborted mds request
If an mds request is aborted (timeout, SIGKILL), it is left registered to
keep our state in sync with the mds.  If we get a forward notification,
though, we know the request didn't succeed and we can unregister it
safely.  We were trying to resend it, but then bailing out (and not
unregistering) in __do_request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:42:05 -07:00
Sage Weil
79494d1b9b ceph: fix leak of osd authorizer
Release the ceph_authorizer when releasing osd state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:42:04 -07:00
Sage Weil
a922d38fd1 ceph: close out mds, osd connections before stopping auth
The auth module (part of the mon_client) is needed to free any
ceph_authorizer(s) used by the mds and osd connections.  Flush the msgr
workqueue before stopping monc to ensure that the destroy_authorizer
auth op is available when those connections are closed out.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:42:03 -07:00
Sage Weil
dd1c905736 ceph: make lease code DN specific
The lease code includes a mask in the CEPH_LOCK_* namespace, but that
namespace is changing, and only one mask (formerly _DN == 1) is used, so
hard code for that value for now.

If we ever extend this code to handle leases over different data types we
can extend it accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:42 -07:00
Julia Lawall
7e34bc524e fs/ceph: Use ERR_CAST
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
the returned value is the same as the type of the enclosing function.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:41 -07:00
Sage Weil
a41359fa35 ceph: renew auth tickets before they expire
We were only requesting renewal after our tickets expire; do so before
that.  Most of the low-level logic for this was already there; just use
it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:39 -07:00
Sage Weil
09c4d6a7d4 ceph: do not resend mon requests on auth ticket renewal
We only want to send pending mon requests when we successfully
authenticate.  If we are already authenticated, like when we renew our
ticket, there is no need to resend pending requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:38 -07:00
Andrea Gelmini
984c76908e ceph: removed duplicated #includes
fs/ceph/auth.c: linux/slab.h is included more than once.
fs/ceph/super.h: linux/slab.h is included more than once.

Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:37 -07:00
Sage Weil
e95e9a7ae4 ceph: avoid possible null dereference
ac->ops may be null; use protocol id in error message instead.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:36 -07:00
Sage Weil
aa91647c89 ceph: make mds requests killable, not interruptible
The underlying problem is that many mds requests can't be restarted.  For
example, a restarted create() would return -EEXIST if the original request
succeeds.  However, we do not want a hung MDS to hang the client too.  So,
use the _killable wait_for_completion variants to abort on SIGKILL but
nothing else.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:35 -07:00
Christoph Hellwig
7ea8085910 drop unused dentry argument to ->fsync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:05:02 -04:00
6e188240eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
  ceph: reuse mon subscribe message instead of allocated anew
  ceph: avoid resending queued message to monitor
  ceph: Storage class should be before const qualifier
  ceph: all allocation functions should get gfp_mask
  ceph: specify max_bytes on readdir replies
  ceph: cleanup pool op strings
  ceph: Use kzalloc
  ceph: use common helper for aborted dir request invalidation
  ceph: cope with out of order (unsafe after safe) mds reply
  ceph: save peer feature bits in connection structure
  ceph: resync headers with userland
  ceph: use ceph. prefix for virtual xattrs
  ceph: throw out dirty caps metadata, data on session teardown
  ceph: attempt mds reconnect if mds closes our session
  ceph: clean up send_mds_reconnect interface
  ceph: wait for mds OPEN reply to indicate reconnect success
  ceph: only send cap releases when mds is OPEN|HUNG
  ceph: dicard cap releases on mds restart
  ceph: make mon client statfs handling more generic
  ceph: drop src address(es) from message header [new protocol feature]
  ...
2010-05-24 07:37:52 -07:00
Sage Weil
240ed68eb5 ceph: reuse mon subscribe message instead of allocated anew
Use the same message, allocated during startup.  No need to reallocate a
new one each time around (and potentially ENOMEM).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-21 16:26:11 -07:00
Christoph Hellwig
8018ab0574 sanitize vfs_fsync calling conventions
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsynmc_range.

The next step will be removig the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:21 -04:00
Al Viro
3981f2e2a0 ceph: should use deactivate_locked_super() on failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:13 -04:00
Sage Weil
970690012c ceph: avoid resending queued message to monitor
The auth_reply handler will (re)send any pending requests.  For the
initial mon authenticate phase, that's correct, but when a auth ticket
renewal races with an in-flight request, we may resend a request message
that is already in flight.  Avoid this by revoking the message before
sending it.

We should also avoid resending requests at all during ticket renewal; that
will come soon.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-21 15:01:22 -07:00
Tobias Klauser
9e32789f63 ceph: Storage class should be before const qualifier
The C99 specification states in section 6.11.5:

The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-21 15:01:21 -07:00
Yehuda Sadeh
34d23762d9 ceph: all allocation functions should get gfp_mask
This is essential, as for the rados block device we'll need
to run in different contexts that would need flags that
are other than GFP_NOFS.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:42 -07:00
Sage Weil
23804d91f1 ceph: specify max_bytes on readdir replies
Specify max bytes in request to bound size of reply.  Add associated
mount option with default value of 512 KB.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:41 -07:00
Sage Weil
366837706b ceph: cleanup pool op strings
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:41 -07:00
Julia Lawall
cffe7b6d8c ceph: Use kzalloc
Use kzalloc rather than the combination of kmalloc and memset.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x,size,flags;
statement S;
@@

-x = kmalloc(size,flags);
+x = kzalloc(size,flags);
 if (x == NULL) S
-memset(x, 0, size);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:40 -07:00
Sage Weil
167c9e352d ceph: use common helper for aborted dir request invalidation
We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay.  Use common helper to avoid duplicate code.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:40 -07:00
Sage Weil
85792d0dd6 ceph: cope with out of order (unsafe after safe) mds reply
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:39 -07:00
Sage Weil
aba558e28a ceph: save peer feature bits in connection structure
These are used for adjusting behavior, such as conditionally encoding a
newer message format.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:38 -07:00
Sage Weil
ca9d93a292 ceph: resync headers with userland
Notable changes include pool op defines and types, FLOCK feature bit, and
new CMPXATTR osd ops.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:38 -07:00
Sage Weil
1a75627896 ceph: use ceph. prefix for virtual xattrs
Drop the 'user.' prefix and use just 'ceph.' for fs virtual xattrs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:37 -07:00
Sage Weil
6c99f2545d ceph: throw out dirty caps metadata, data on session teardown
The remove_session_caps() helper is called when an MDS closes out our
session (either normally, or as a result of a failed reconnect), and when
we tear down state for umount.  If we remove the last cap, and there are
no cap migrations in progress, then there is little hope of us flushing
out that data to the mds (without heroic efforts to reconnect and flush).

So, to avoid leaving inodes pinned (due to dirty state) and crashing after
umount, throw out dirty caps state and unpin the inodes.  Print a warning
to the console so we know something was lost.

NOTE: Although we drop wrbuffer refs, we don't actually mark pages clean;
maybe a truncate should be queued?

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:37 -07:00
Sage Weil
7e70f0ed9f ceph: attempt mds reconnect if mds closes our session
Currently, if our session is closed (due to a timeout, or explicit close,
or whatever), we just sit there doing nothing unless/until the MDS
restarts, at which point we try to reconnect.

Change client to attempt an immediate reconnect if our session is closed.

Note that currently the MDS doesn't support this, and our attempt will
fail.  We'll get a session CLOSE, our caps and dirty cap state will be
dropped, and the client will be free to attempt to reconnect.  That's
clearly not as nice as a successful reconnect, but it at least allows us
to try to carry on, and in the future the MDS will support a reconnect
and we will fare better.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:36 -07:00
Sage Weil
34b6c855fa ceph: clean up send_mds_reconnect interface
Pass a ceph_mds_session, since the caller has it.

Remove the dead code for sending empty reconnects.  It used to be used
when the MDS contacted _us_ to solicit a reconnect, and we could reply
saying "go away, I have no session."  Now we only send reconnects based
on the mds map, and only when we do in fact have an open session.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:35 -07:00
Sage Weil
29790f26ab ceph: wait for mds OPEN reply to indicate reconnect success
We used to infer reconnect success by watching the MDS state, essentially
assuming that hearing nothing meant things were ok.  That wasn't
particularly reliable.  Instead, the MDS replies with an explicit OPEN
message to indicate success.

Strictly speaking, this is a protocol change, but it is a backwards
compatible one that does not break new clients + old servers or old
clients + new servers.  At least not yet.

Drop unused @all argument from kick_requests while we're at it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:35 -07:00
Sage Weil
aab53dd9e8 ceph: only send cap releases when mds is OPEN|HUNG
On OPENING we shouldn't have any caps (or releases).
On CLOSING, we should wait until we succeed (and throw it all out), or
don't (and are OPEN again).
On RECONNECTING we can wait until we are OPEN.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:34 -07:00
Sage Weil
e01a594646 ceph: dicard cap releases on mds restart
If the MDS restarts, the expire caps state is no longer shared, and can be
thrown out.  Caps state will be rebuilt on the MDS during the reconnect
process that follows.  Zero out any release messages and adjust the
release counter accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:33 -07:00
Yehuda Sadeh
f8c76f6f25 ceph: make mon client statfs handling more generic
This is being done so that we could reuse the statfs
infrastructure with other requests that return values.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:33 -07:00
Sage Weil
dbad185d49 ceph: drop src address(es) from message header [new protocol feature]
The CEPH_FEATURE_NOSRCADDR protocol feature avoids putting the full source
address in each message header (twice).  This patch switches the client to
the new scheme, and _requires_ this feature on the server.  The server
will support both the old and new schemes.  That means an old client will
work with a new server, but a new client will not work with an old server.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:32 -07:00
Dan Carpenter
a5ee751c15 ceph: cleanup: remove unused assignement
We don't ever use "dirty" so we can remove it.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:32 -07:00
Sage Weil
0f8605f2bd ceph: clean up cap release loop vs spinlock
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:31 -07:00
Sage Weil
31e0cf8f6a ceph: name bdi ceph-%d instead of major:minor
The bdi_setup_and_register() helper doesn't help us since we bdi_init() in
create_client() and bdi_register() only when sget() succeeds.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:30 -07:00
Sage Weil
56b7cf9581 ceph: skip mds sync on forced unmount
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:30 -07:00
Sage Weil
b736b3d9d0 ceph: adjust masked struct_v variable names
Reported-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:29 -07:00
Sage Weil
6e19a16ef2 ceph: clean up mount options, ->show_options()
Ensure all options are included in /proc/mounts.  Some cleanup.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:29 -07:00
Sage Weil
1cd3935bed ceph: set dn offset when spliced
We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry().  Notably, we should NOT assign an
offset when a dentry is first created and is still null.

BUG if we try to splice a non-null dentry (we shouldn't).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:28 -07:00
Sage Weil
1b7facc41b ceph: don't clobber i_max_offset on already complete dir
This can screw up offsets assigned to new dentries and break dcache
readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:27 -07:00
Sage Weil
e8a7498715 ceph: skip set_dentry_offset work if directory not I_COMPLETE
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:27 -07:00
Sage Weil
f1f2765fae ceph: set next_offset on readdir finish
Set next_offset to 2 (always 2!), not 0, on readdir finish.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:26 -07:00
Henry C Chang
bddfa3cc18 ceph: listxattr should compare version by >=
If the version hasn't changed, don't rebuild the index.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:26 -07:00
Sage Weil
a6424e48c8 ceph: fix xattr dangling pointer / double free
If we use the xattr_blob, clear the pointer so we don't release the memory
at the bottom of the fuction.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:25 -07:00
Sage Weil
9dd4658db1 ceph: close messenger race
Simplify messenger locking, and close race between ceph_con_close() setting
the CLOSED bit and con_work() checking the bit, then taking the mutex.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:25 -07:00
Sage Weil
4f48280ee1 ceph: name msgpools; useful error messages
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:24 -07:00
Sage Weil
8c6efb58a5 ceph: fix memory leak due to possible dentry init race
Free dentry_info in error path.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:23 -07:00
Sage Weil
559c1e0073 ceph: include auth method in error messages
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:23 -07:00
Sage Weil
f26e681d52 ceph: osdtimeout=0 for now timeout
Allow the osd reset timeout to be disabled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:22 -07:00
Dan Carpenter
0d509c949a ceph: d_obtain_alias() returns ERR_PTR()
d_obtain_alias() doesn't return NULL, it returns an ERR_PTR().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:22 -07:00
Yehuda Sadeh
c473ad927e ceph: wake up mount thread when getting osdmap
Now that the mount thread waits for the osdmap, it needs
to be awaken.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-05-17 15:25:21 -07:00
Huang Weiyi
1bb71637d0 ceph: remove unused #includes
Remove unused #include's in
  fs/ceph/super.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:21 -07:00
Sage Weil
6822d00b54 ceph: wait for both monmap and osdmap when opening session
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-05-17 15:25:20 -07:00
Sage Weil
6f2bc3ff4c ceph: clean up connection reset
Reset out_keepalive_pending and peer_global_seq, and drop unused var.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:20 -07:00
Sage Weil
bb257664f7 ceph: simplify ceph_msg_new
We only need to pass in front_len.  Callers can attach any other payload
pieces (middle, data) as they see fit.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:19 -07:00
Sage Weil
a79832f26b ceph: make ceph_msg_new return NULL on failure; clean up, fix callers
Returning ERR_PTR(-ENOMEM) is useless extra work.  Return NULL on failure
instead, and fix up the callers (about half of which were wrong anyway).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:18 -07:00
Sage Weil
d52f847a84 ceph: rewrite msgpool using mempool_t
Since we don't need to maintain large pools of messages, we can just
use the standard mempool_t.  We maintain a msgpool 'wrapper' because we
need the mempool_t* in the alloc function, and mempool gives us only
pool_data.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:18 -07:00
Cheng Renquan
640ef79d27 ceph: use ceph_sb_to_client instead of ceph_client
ceph_sb_to_client and ceph_client are really identical, we need to dump
one; while function ceph_client is confusing with "struct ceph_client",
ceph_sb_to_client's definition is more clear; so we'd better switch all
call to ceph_sb_to_client.

  -static inline struct ceph_client *ceph_client(struct super_block *sb)
  -{
  -	return sb->s_fs_info;
  -}

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:17 -07:00
Cheng Renquan
2d06eeb877 ceph: handle kzalloc() failure
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:16 -07:00
Sage Weil
7c315c552c ceph: drop unnecessary msgpool for mon_client subscribe_ack
Preallocate a single message to reuse instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:16 -07:00
Sage Weil
6694d6b95c ceph: drop unnecessary msgpool for mon_client auth_reply
Preallocate a single reply message that we can reuse instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:15 -07:00
Sage Weil
3143edd3a1 ceph: clean up statfs
Avoid unnecessary msgpool.  Preallocate reply.  Fix use-after-free race.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:15 -07:00
Sage Weil
6f46cb2935 ceph: fix theoretically possible double-put on connection
This would only trigger if we bailed out before resetting r_con_filling_msg
because the server reply was corrupt (oversized).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:14 -07:00
Dan Carpenter
c7708075f1 ceph: cleanup: remove dead code
"xattr" is never NULL here.  We took care of that in the previous
if statement block.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:14 -07:00
Sage Weil
104648ad3f ceph: reduce build_path debug output
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:13 -07:00
Yehuda Sadeh
31459fe4b2 ceph: use __page_cache_alloc and add_to_page_cache_lru
Following Nick Piggin patches in btrfs, pagecache pages should be
allocated with __page_cache_alloc, so they obey pagecache memory
policies.

Also, using add_to_page_cache_lru instead of using a private
pagevec where applicable.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:12 -07:00
Stephen Rothwell
f553069e5d ceph: update for removal of kref_set
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:12 -07:00
Sage Weil
21b667f69b ceph: simplify page setup for incoming data
Drop largely useless helper __prepare_pages(), and simplify sanity checks.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:11 -07:00
Sage Weil
81a6cf2d30 ceph: invalidate affected dentry leases on aborted requests
If we abort a request, we return to caller, but the request may still
complete.  And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request.  A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').

Ensure we invalidate affected dentry leases (as well dir I_COMPLETE) so
we don't have valid (but incorrect) leases.  Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:45 -07:00
Sage Weil
b4556396fa ceph: fix race between aborted requests and fill_trace
When we abort requests we need to prevent fill_trace et al from doing
anything that relies on locks held by the VFS caller.  This fixes a race
between the reply handler and the abort code, ensuring that continue
holding the dir mutex until the reply handler completes.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:45 -07:00
Sage Weil
e1518c7c0a ceph: clean up mds reply, error handling
We would occasionally BUG out in the reply handler because r_reply was
nonzero, due to a race with ceph_mdsc_do_request temporarily setting
r_reply to an ERR_PTR value.  This is unnecessary, messy, and also wrong
in the EIO case.

Clean up by consistently using r_err for errors and r_reply for messages.
Also fix the abort logic to trigger consistently for all errors that return
to the caller early (e.g., EIO from timeout case).  If an abort races with
a reply, use the result from the reply.

Also fix locking for r_err, r_reply update in the reply handler.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:44 -07:00
Sage Weil
e84346b726 ceph: preserve seq # on requeued messages after transient transport errors
If the tcp connection drops and we reconnect to reestablish a stateful
session (with the mds), we need to resend previously sent (and possibly
received) messages with the _same_ seq # so that they can be dropped on
the other end if needed.  Only assign a new seq once after the message is
queued.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 21:20:38 -07:00
Sage Weil
f818a73674 ceph: fix cap removal races
The iterate_session_caps helper traverses the session caps list and tries
to grab an inode reference.  However, the __ceph_remove_cap was clearing
the inode backpointer _before_ removing itself from the session list,
causing a null pointer dereference.

Clear cap->ci under protection of s_cap_lock to avoid the race, and to
tightly couple the list and backpointer state.  Use a local flag to
indicate whether we are releasing the cap, as cap->session may be modified
by a racing thread in iterate_session_caps.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 20:56:31 -07:00
Sage Weil
45c6ceb547 ceph: zero unused message header, footer fields
We shouldn't leak any prior memory contents to other parties.  And random
data, particularly in the 'version' field, can cause problems down the
line.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 15:17:40 -07:00
Sage Weil
9abf82b8bc ceph: fix locking for waking session requests after reconnect
The session->s_waiting list is protected by mdsc->mutex, not s_mutex.  This
was causing (rare) s_waiting list corruption.

Fix errors paths too, while we're here.  A more thorough cleanup of this
function is coming soon.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:57 -07:00
Sage Weil
d85b705663 ceph: resubmit requests on pg mapping change (not just primary change)
OSD requests need to be resubmitted on any pg mapping change, not just when
the pg primary changes.  Resending only when the primary changes results in
occasional 'hung' requests during osd cluster recovery or rebalancing.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:56 -07:00
Sage Weil
04d000eb35 ceph: fix open file counting on snapped inodes when mds returns no caps
It's possible the MDS will not issue caps on a snapped inode, in which case
an open request may not __ceph_get_fmode(), botching the open file
counting.  (This is actually a server bug, but the client shouldn't BUG out
in this case.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:55 -07:00
Sage Weil
0ceed5db32 ceph: unregister osd request on failure
The osd request wasn't being unregistered when the osd returned a failure
code, even though the result was returned to the caller.  This would cause
it to eventually time out, and then crash the kernel when it tried to
resend the request using a stale page vector.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:18 -07:00
Sage Weil
54ad023ba8 ceph: don't use writeback_control in writepages completion
The ->writepages writeback_control is not still valid in the writepages
completion.  We were touching it solely to adjust pages_skipped when there
was a writeback error (EIO, ENOSPC, EPERM due to bad osd credentials),
causing an oops in the writeback code shortly thereafter.  Updating
pages_skipped on error isn't correct anyway, so let's just rip out this
(clearly broken) code to pass the wbc to the completion.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-05 21:31:40 -07:00
Sage Weil
5dfc589a84 ceph: unregister bdi before kill_anon_super releases device name
Unregister and destroy the bdi in put_super, after mount is r/o, but before
put_anon_super releases the device name.

For symmetry, bdi_destroy in destroy_client (we bdi_init in create_client).

Only set s_bdi if bdi_register succeeds, since we use it to decide whether
to bdi_unregister.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-04 16:14:46 -07:00
Sage Weil
b0930f8d38 ceph: remove bad auth_x kmem_cache
It's useless, since our allocations are already a power of 2.  And it was
allocated per-instance (not globally), which caused a name collision when
we tried to mount a second file system with auth_x enabled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil
7ff899da02 ceph: fix lockless caps check
The __ variant requires caller to hold i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil
ea1409f961 ceph: clear dir complete, invalidate dentry on replayed rename
If a rename operation is resent to the MDS following an MDS restart, the
client does not get a full reply (containing the resulting metadata) back.
In that case, a ceph_rename() needs to compensate by doing anything useful
that fill_inode() would have, like d_move().

It also needs to invalidate the dentry (to workaround the vfs_rename_dir()
bug) and clear the dir complete flag, just like fill_trace().

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil
5c6a2cdb4f ceph: fix direct io truncate offset
truncate_inode_pages_range wants the end offset to align with the last byte
in a page.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil
ae18756b9f ceph: discard incoming messages with bad seq #
We can get old message seq #'s after a tcp reconnect for stateful sessions
(i.e., the MDS).  If we get a higher seq #, that is an error, and we
shouldn't see any bad seq #'s for stateless (mon, osd) connections.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:24 -07:00
Sage Weil
684be25c52 ceph: fix seq counting for skipped messages
Increment in_seq even when the message is skipped for some reason.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:24 -07:00
Sage Weil
d45d0d970f ceph: add missing #includes
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:24 -07:00
Sage Weil
0b0c06d147 ceph: fix leaked spinlock during mds reconnect
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:23 -07:00
Sage Weil
c8f16584ac ceph: print more useful version info on module load
Decouple the client version from the server side.  Print relevant protocol
and map version info instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:23 -07:00
Sage Weil
91dee39eeb ceph: fix snap realm splits
The snap realm split was checking i_snap_realm, not the list_head, to
determine if an inode belonged in the new realm.  The check always failed,
which meant we always moved the inode, corrupting the old realm's list and
causing various crashes.

Also wait to release old realm reference to avoid possibility of use after
free.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:23 -07:00
Sage Weil
c10f5e12ba ceph: clear dir complete on d_move
d_move() reorders the d_subdirs list, breaking the readdir result caching.
Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on
rename.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:22 -07:00
96e35b40c0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: use separate class for ceph sockets' sk_lock
  ceph: reserve one more caps space when doing readdir
  ceph: queue_cap_snap should always queue dirty context
  ceph: fix dentry reference leak in dcache readdir
  ceph: decode v5 of osdmap (pool names) [protocol change]
  ceph: fix ack counter reset on connection reset
  ceph: fix leaked inode ref due to snap metadata writeback race
  ceph: fix snap context reference leaks
  ceph: allow writeback of snapped pages older than 'oldest' snapc
  ceph: fix dentry rehashing on virtual .snap dir
2010-04-14 18:45:31 -07:00
Sage Weil
a6a5349d17 ceph: use separate class for ceph sockets' sk_lock
Use a separate class for ceph sockets to prevent lockdep confusion.
Because ceph sockets only get passed kernel pointers, there is no
dependency from sk_lock -> mmap_sem.  If we share the same class as other
sockets, lockdep detects a circular dependency from

	mmap_sem (page fault) -> fs mutex -> sk_lock -> mmap_sem

because dependencies are noted from both ceph and user contexts.  Using
a separate class prevents the sk_lock(ceph) -> mmap_sem dependency and
makes lockdep happy.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-13 14:07:07 -07:00
Yehuda Sadeh
e1e4dd0caa ceph: reserve one more caps space when doing readdir
We were missing space for the directory cap.  The result was a BUG at
fs/ceph/caps.c:2178.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-13 12:28:54 -07:00
Sage Weil
fc837c8f04 ceph: queue_cap_snap should always queue dirty context
This simplifies the calling convention, and fixes a bug where we queue a
capsnap with a context other than i_head_snapc (the one that matches the
dirty pages).  The result was a BUG at fs/ceph/caps.c:2178 on writeback
completion when a capsnap matching the writeback snapc could not be found.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-13 12:28:31 -07:00
Sage Weil
f5b066287c ceph: fix dentry reference leak in dcache readdir
When filldir returned an error (e.g. buffer full for a large directory),
we would leak a dentry reference, causing an oops on umount.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-12 14:25:51 -07:00
Sage Weil
2844a76a25 ceph: decode v5 of osdmap (pool names) [protocol change]
Teach the client to decode an updated format for the osdmap.  The new
format includes pool names, which will be useful shortly.  Get this change
in earlier rather than later.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-09 15:50:58 -07:00
Sage Weil
0e0d5e0c4b ceph: fix ack counter reset on connection reset
If in_seq_acked isn't reset along with in_seq, we don't ack received
messages until we reach the old count, consuming gobs memory on the other
end of the connection and introducing a large delay when those messages
are eventually deleted.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-02 16:07:19 -07:00
Sage Weil
819ccbfa44 ceph: fix leaked inode ref due to snap metadata writeback race
We create a ceph_cap_snap if there is dirty cap metadata (for writeback to
mds) OR dirty pages (for writeback to osd).  It is thus possible that the
metadata has been written back to the MDS but the OSD data has not when
the cap_snap is created.  This results in a cap_snap with dirty(caps) == 0.
The problem is that cap writeback to the MDS isn't necessary, and a
FLUSHSNAP cap op gets no ack from the MDS.  This leaves the cap_snap
attached to the inode along with its inode reference.

Fix the problem by dropping the cap_snap if it becomes 'complete' (all
pages written out) and dirty(caps) == 0 in ceph_put_wrbuffer_cap_refs().

Also, BUG() in __ceph_flush_snaps() if we encounter a cap_snap with
dirty(caps) == 0.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-01 09:34:38 -07:00
Sage Weil
6298a33757 ceph: fix snap context reference leaks
The get_oldest_context() helper takes a reference to the returned snap
context, but most callers weren't dropping that reference.  Fix them.

Also drop the unused locked __get_oldest_context() variant.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-01 09:34:37 -07:00
Sage Weil
80e755fede ceph: allow writeback of snapped pages older than 'oldest' snapc
On snap deletion, we don't regenerate ceph_cap_snaps for inodes with dirty
pages because deletion does not affect metadata writeback.  However, we
did run into problems when we went to write back the pages because the
'oldest' snapc is determined by the oldest cap_snap, and that may be the
newer snapc that reflects the deletion.  This caused confusion and an
infinite loop in ceph_update_writeable_page().

Change the snapc checks to allow writeback of any snapc that is equal to
OR older than the 'oldest' snapc.

When there are no cap_snaps, we were also using the realm's latest snapc
for writeback, which complicates ceph_put_wrbufffer_cap_refs().  Instead,
use i_head_snapc, the most snapc used for the most recent ('head') data.
This makes the writeback snapc (ceph_osd_request.r_snapc) _always_ match a
capsnap or i_head_snapc.

Also, in writepags_finish(), drop the snapc referenced by the _page_
and do not assume it matches the request snapc (it may not anymore).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-01 09:34:36 -07:00
Sage Weil
9358c6d4c0 ceph: fix dentry rehashing on virtual .snap dir
If a lookup fails on the magic .snap directory, we bind it to a magic
snap directory inode in ceph_lookup_finish().  That code assumes the dentry
is unhashed, but a recent server-side change started returning NULL leases
on lookup failure, causing the .snap dentry to be hashed and NULL by
ceph_fill_trace().

This causes dentry hash chain corruption, or a dies when d_rehash()
includes
	BUG_ON(!d_unhashed(entry));

So, avoid processing the NULL dentry lease if it the dentry matches the
snapdir name in ceph_fill_trace().  That allows the lookup completion to
properly bind it to the snapdir inode.  BUG there if dentry is hashed to
be sure.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-30 13:55:22 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Sage Weil
94aa8ae13d ceph: fix use after free on mds __unregister_request
There was a use after free in __unregister_request that would trigger
whenever the request map held the last reference.  This appears to have
triggered an oops during 'umount -f' when requests are being torn down.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-28 21:23:56 -07:00
Sage Weil
393f662096 ceph: fix possible double-free of mds request reference
Clear pointer to mds request after dropping the reference to
ensure we don't drop it again, as there is at least one error
path through this function that does not reset fi->last_readdir
to a new value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:06 -07:00
Sage Weil
d96d60498f ceph: fix session check on mds reply
Fix a broken check that a reply came back from the same MDS we sent the
request to.  I don't think a case that actually triggers this would ever
come up in practice, but it's clearly wrong and easy to fix.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:05 -07:00
Dan Carpenter
4736b009b8 ceph: handle kmalloc() failure
Return ERR_PTR(-ENOMEM) if kmalloc() fails.  We handle allocation
failures the same way later in the function.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:04 -07:00
Sage Weil
9c423956b8 ceph: propagate mds session allocation failures to caller
Return error to original caller if register_session() fails.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:04 -07:00
Sage Weil
8f883c24de ceph: make write_begin wait propagate ERESTARTSYS
Currently, if the wait_event_interruptible is interrupted, we
return EAGAIN unconditionally and loop, such that we aren't, in
fact, interruptible.  So, propagate ERESTARTSYS if we get it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:03 -07:00