24114 Commits

Author SHA1 Message Date
Mikulas Patocka
26e15787b9 vmscan: fix shrinker callback bug in fs/super.c
commit 09f363c736 upstream.

The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26 09:08:40 -08:00
Jeff Layton
0d2c754e96 nfs: when attempting to open a directory, fall back on normal lookup (try #5)
commit 1788ea6e3b upstream.

commit d953126 changed how nfs_atomic_lookup handles an -EISDIR return
from an OPEN call. Prior to that patch, that caused the client to fall
back to doing a normal lookup. When that patch went in, the code began
returning that error to userspace. The d_revalidate codepath however
never had the corresponding change, so it was still possible to end up
with a NULL ctx->state pointer after that.

That patch caused a regression. When we attempt to open a directory that
does not have a cached dentry, that open now errors out with EISDIR. If
you attempt the same open with a cached dentry, it will succeed.

Fix this by reverting the change in nfs_atomic_lookup and allowing
attempts to open directories to fall back to a normal lookup

Also, add a NFSv4-specific f_ops->open routine that just returns
-ENOTDIR. This should never be called if things are working properly,
but if it ever is, then the dprintk may help in debugging.

To facilitate this, a new file_operations field is also added to the
nfs_rpc_ops struct.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26 09:08:36 -08:00
Dan Carpenter
b8988229c7 hfs: add sanity check for file name length
commit bc5b8a9003 upstream.

On a corrupted file system the ->len field could be wrong leading to
a buffer overflow.

Reported-and-acked-by: Clement LECIGNE <clement.lecigne@netasq.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-21 14:35:22 -08:00
Al Viro
edfaaac6ea VFS: we need to set LOOKUP_JUMPED on mountpoint crossing
commit a3fbbde70a upstream.

Mountpoint crossing is similar to following procfs symlinks - we do
not get ->d_revalidate() called for dentry we have arrived at, with
unpleasant consequences for NFS4.

Simple way to reproduce the problem in mainline:

    cat >/tmp/a.c <<'EOF'
    #include <unistd.h>
    #include <fcntl.h>
    #include <stdio.h>
    main()
    {
            struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1};
            if (fcntl(0, F_SETLK, &fl))
                    perror("setlk");
    }
    EOF
    cc /tmp/a.c -o /tmp/test

then on nfs4:

    mount --bind file1 file2
    /tmp/test < file1		# ok
    /tmp/test < file2		# spews "setlk: No locks available"...

What happens is the missing call of ->d_revalidate() after mountpoint
crossing and that's where NFS4 would issue OPEN request to server.

The fix is simple - treat mountpoint crossing the same way we deal with
following procfs-style symlinks.  I.e.  set LOOKUP_JUMPED...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:44:36 -08:00
Dan McGee
5a1c73b64b VFS: fix statfs() automounter semantics regression
commit 5c8a0fbba5 upstream.

No one in their right mind would expect statfs() to not work on a
automounter managed mount point. Fix it.

[ I'm not sure about the "no one in their right mind" part.  It's not
  mounted, and you didn't ask for it to be mounted.  But nobody will
  really care, and this probably makes it match previous semantics, so..
      - Linus ]

This mirrors the fix made to the quota code in 815d405cef.

Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:44:32 -08:00
Tejun Heo
6022736e9b block: make gendisk hold a reference to its queue
commit f992ae801a upstream.

The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
 RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
 Call Trace:
  [<ffffffff810d2845>] lock_acquire+0x95/0x140
  [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
  [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
  [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
  [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
  [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
  [<ffffffff811c40df>] blkdev_put+0x5f/0x190
  [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
  [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
  [<ffffffff8119003a>] deactivate_super+0x4a/0x70
  [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
  [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
  [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue.  If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi and bdev_inode_switch_bdi() ends up
dereferencing already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:44:30 -08:00
Eric Sandeen
87556ecf5d ext4: fix race in xattr block allocation path
commit 6d6a435190 upstream.

Ceph users reported that when using Ceph on ext4, the filesystem
would often become corrupted, containing inodes with incorrect
i_blocks counters.

I managed to reproduce this with a very hacked-up "streamtest"
binary from the Ceph tree.

Ceph is doing a lot of xattr writes, to out-of-inode blocks.
There is also another thread which does sync_file_range and close,
of the same files.  The problem appears to happen due to this race:

sync/flush thread               xattr-set thread
-----------------               ----------------

do_writepages                   ext4_xattr_set
ext4_da_writepages              ext4_xattr_set_handle
mpage_da_map_blocks             ext4_xattr_block_set
        set DELALLOC_RESERVE
                                ext4_new_meta_blocks
                                        ext4_mb_new_blocks
                                                if (!i_delalloc_reserved_flag)
                                                        vfs_dq_alloc_block
ext4_get_blocks
	down_write(i_data_sem)
        set i_delalloc_reserved_flag
	...
	up_write(i_data_sem)
                                        if (i_delalloc_reserved_flag)
                                                vfs_dq_alloc_block_nofail


In other words, the sync/flush thread pops in and sets
i_delalloc_reserved_flag on the inode, which makes the xattr thread
think that it's in a delalloc path in ext4_new_meta_blocks(),
and add the block for a second time, after already having added
it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks

The real problem is that we shouldn't be using the DELALLOC_RESERVED
state flag, and instead we should be passing
EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of
using an inode state flag.  We'll fix this for now with using
i_data_sem to prevent this race, but this is really not the right way
to fix things.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:54 -08:00
Yongqiang Yang
6ff60a6de3 ext4: let ext4_page_mkwrite stop started handle in failure
commit fcbb551582 upstream.

The started journal handle should be stopped in failure case.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:53 -08:00
Theodore Ts'o
caf61680a4 ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entry
commit 5930ea6438 upstream.

ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode.  However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself.  Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:52 -08:00
Darrick J. Wong
523d90b566 ext4: ext4_mkdir should dirty dir_block with newly created directory inode
commit f9287c1f2d upstream.

ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir").  Fix the incorrect
association.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:52 -08:00
Darrick J. Wong
76e72d1136 ext4: ext4_rename should dirty dir_bh with the correct directory
commit bcaa992975 upstream.

When ext4_rename performs a directory rename (move), dir_bh is a
buffer that is modified to update the '..' link in the directory being
moved (old_inode).  However, ext4_handle_dirty_metadata is called with
the old parent directory inode (old_dir) and dir_bh, which is
incorrect because dir_bh does not belong to the parent inode.  Fix
this error.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:50 -08:00
Theodore Ts'o
305b7dd039 ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodes
commit 1cd9f0976a upstream.

This doesn't make much sense, and it exposes a bug in the kernel where
attempts to create a new file in an append-only directory using
O_CREAT will fail (but still leave a zero-length file).  This was
discovered when xfstests #79 was generalized so it could run on all
file systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:50 -08:00
Jiri Kosina
47c5d5e0bb binfmt_elf: fix PIE execution with randomization disabled
commit a3defbe5c3 upstream.

The case of address space randomization being disabled in runtime through
randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
in case the randomization has been disabled at runtime prior to calling
exec().

Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.

Based on original patch by H.J. Lu and Josh Boyer.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: H.J. Lu <hongjiu.lu@intel.com>
Cc: <stable@kernel.org>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:41 -08:00
Andy Whitcroft
4f0d844bad readlinkat: ensure we return ENOENT for the empty pathname for normal lookups
commit 1fa1e7f615 upstream.

Since the commit below which added O_PATH support to the *at() calls, the
error return for readlink/readlinkat for the empty pathname has switched
from ENOENT to EINVAL:

  commit 65cfc67223
  Author: Al Viro <viro@zeniv.linux.org.uk>
  Date:   Sun Mar 13 15:56:26 2011 -0400

    readlinkat(), fchownat() and fstatat() with empty relative pathnames

This is both unexpected for userspace and makes readlink/readlinkat
inconsistant with all other interfaces; and inconsistant with our stated
return for these pathnames.

As the readlinkat call does not have a flags parameter we cannot use the
AT_EMPTY_PATH approach used in the other calls.  Therefore expose whether
the original path is infact entry via a new user_path_at_empty() path
lookup function.  Use this to determine whether to default to EINVAL or
ENOENT for failures.

Addresses http://bugs.launchpad.net/bugs/817187

[akpm@linux-foundation.org: remove unused getname_flags()]
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:35 -08:00
Andrew Morton
9d6af07ac7 /proc/self/numa_maps: restore "huge" tag for hugetlb vmas
commit fc360bd9cd upstream.

The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm:
use walk_page_range() instead of custom page table walking code").

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:29 -08:00
Bryan Schumaker
ac155955e6 vfs: add "device" tag to /proc/self/mountstats
commit a877ee03ac upstream.

nfsiostat was failing to find mounted filesystems on kernels after
2.6.38 because of changes to show_vfsstat() by commit
c7f404b40a.  This patch adds back the
"device" tag before the nfs server entry so scripts can parse the
mountstats file correctly.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:16 -08:00
J. Bruce Fields
3a1f47cef0 nfsd4: ignore WANT bits in open downgrade
commit c30e92df30 upstream.

We don't use WANT bits yet--and sending them can probably trigger a
BUG() further down.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:16 -08:00
J. Bruce Fields
68251a1702 nfsd4: fix open downgrade, again
commit 3d02fa29de upstream.

Yet another open-management regression:

	- nfs4_file_downgrade() doesn't remove the BOTH access bit on
	  downgrade, so the server's idea of the stateid's access gets
	  out of sync with the client's.  If we want to keep an O_RDWR
	  open in this case, we should do that in the file_put_access
	  logic rather than here.
	- We forgot to convert v4 access to an open mode here.

This logic has proven too hard to get right.  In the future we may
consider:
	- reexamining the lock/openowner relationship (locks probably
	  don't really need to take their own references here).
	- adding open upgrade/downgrade support to the vfs.
	- removing the atomic operations.  They're redundant as long as
	  this is all under some other lock.

Also, maybe some kind of additional static checking would help catch
O_/NFS4_SHARE_ACCESS confusion.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:16 -08:00
J. Bruce Fields
fdbc365c5c nfsd4: permit read opens of executable-only files
commit a043226bc1 upstream.

A client that wants to execute a file must be able to read it.  Read
opens over nfs are therefore implicitly allowed for executable files
even when those files are not readable.

NFSv2/v3 get this right by using a passed-in NFSD_MAY_OWNER_OVERRIDE on
read requests, but NFSv4 has gotten this wrong ever since
dc730e1737 "nfsd4: fix owner-override on
open", when we realized that the file owner shouldn't override
permissions on non-reclaim NFSv4 opens.

So we can't use NFSD_MAY_OWNER_OVERRIDE to tell nfsd_permission to allow
reads of executable files.

So, do the same thing we do whenever we encounter another weird NFS
permission nit: define yet another NFSD_MAY_* flag.

The industry's future standardization on 128-bit processors will be
motivated primarily by the need for integers with enough bits for all
the NFSD_MAY_* flags.

Reported-by: Leonardo Borda <leonardoborda@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:15 -08:00
J. Bruce Fields
c86f82f0ef nfsd4: fix seqid_mutating_error
commit 576163005d upstream.

The set of errors here does *not* agree with the set of errors specified
in the rfc!

While we're there, turn this macros into a function, for the usual
reasons, and move it to the one place where it's actually used.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:15 -08:00
J. Bruce Fields
ab10f211c7 nfsd4: stop using nfserr_resource for transitory errors
commit 3e77246393 upstream.

The server is returning nfserr_resource for both permanent errors and
for errors (like allocation failures) that might be resolved by retrying
later.  Save nfserr_resource for the former and use delay/jukebox for
the latter.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:14 -08:00
Bernd Schubert
b44ecd56e9 nfsd4: Remove check for a 32-bit cookie in nfsd4_readdir()
commit 832023bffb upstream.

Fan Yong <yong.fan@whamcloud.com> noticed setting
FMODE_32bithash wouldn't work with nfsd v4, as
nfsd4_readdir() checks for 32 bit cookies. However, according to RFC 3530
cookies have a 64 bit type and cookies are also defined as u64 in
'struct nfsd4_readdir'. So remove the test for >32-bit values.

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:14 -08:00
Jeff Layton
c56f5c1b3e nfs: don't try to migrate pages with active requests
commit 2da9565235 upstream.

nfs_find_and_lock_request will take a reference to the nfs_page and
will then put it if the req is already locked. It's possible though
that the reference will be the last one. That put then can kick off
a whole series of reference puts:

nfs_page
   nfs_open_context
      dentry
          inode

If the inode ends up being deleted, then the VFS will call
truncate_inode_pages. That function will try to take the page lock, but
it was already locked when migrate_page was called. The code
deadlocks.

Fix this by simply refusing the migration request if PagePrivate is
already set, indicating that the page is already associated with an
active read or write request.

We've had a customer test a backported version of this patch and
the preliminary results seem good.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:14 -08:00
Jeff Layton
9c8b701a3d nfs: don't redirty inode when ncommit == 0 in nfs_commit_unstable_pages
commit 3236c3e1ad upstream.

commit 420e3646 allowed the kernel to reduce the number of unnecessary
commit calls by skipping the commit when there are a large number of
outstanding pages.

However, the current test in nfs_commit_unstable_pages does not handle
the edge condition properly. When ncommit == 0, then that means that the
kernel doesn't need to do anything more for the inode. The current test
though in the WB_SYNC_NONE case will return true, and the inode will end
up being marked dirty. Once that happens the inode will never be clean
until there's a WB_SYNC_ALL flush.

Fix this by immediately returning from nfs_commit_unstable_pages when
ncommit == 0.

Mike noticed this problem initially in RHEL5 (2.6.18-based kernel) which
has a backported version of 420e3646. The inode cache there was growing
very large. The inode cache was unable to be shrunk since the inodes
were all marked dirty. Calling sync() would essentially "fix" the
problem -- the WB_SYNC_ALL flush would result in the inodes all being
marked clean.

What I'm not clear on is how big a problem this is in mainline kernels
as the writeback code there is very different. Either way, it seems
incorrect to re-mark the inode dirty in this case.

Reported-by: Mike McLean <mikem@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:03 -08:00
Peng Tao
53e34d2425 SUNRPC/NFS: make rpc pipe upcall generic
commit c1225158a8 upstream.

The same function is used by idmap, gss and blocklayout code. Make it
generic.

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:03 -08:00
Trond Myklebust
ddd5310cb9 Revert "NFS: Ensure that writeback_single_inode() calls write_inode() when syncing"
commit 59b7c05fff upstream.

This reverts commit b80c3cb628.

The reverted commit was rendered obsolete by a VFS fix: commit
5547e8aac6 (writeback: Update dirty flags in
two steps). We now no longer need to worry about writeback_single_inode()
missing our marking the inode for COMMIT in 'do_writepages()' call.

Reverting this patch, fixes a performance regression in which the inode
would continuously get queued to the dirty list, causing the writeback
code to unnecessarily try to send a COMMIT.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:02 -08:00
Peng Tao
9c348b7654 pnfsblock: fix writeback deadlock
commit 7542274519 upstream.

We should check if the sector is already initialized before
trying to grab the page from page cache. Otherwise when two
pages of the same block are written back by two threads each
calling from writepage_locked, it can cause deadlock like bellow.

 [ 1080.972099] INFO: task kswapd0:25 blocked for more than 120 seconds.
 [ 1080.972377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [ 1080.972812] kswapd0         D ffff88000c4926c0     0    25      2 0x00000000
 [ 1080.972816]  ffff88000df276b0 0000000000000046 ffff88000df27640 ffffffff81013ba7
 [ 1080.972821]  ffff88000c492310 ffff88000df27fd8 ffff88000df27fd8 00000000001d3440
 [ 1080.972824]  ffff88000c378000 ffff88000c492310 ffff8800175d3d40 ffff880017fc75a8
 [ 1080.972828] Call Trace:
 [ 1080.972860]  [<ffffffff81013ba7>] ? read_tsc+0x9/0x19
 [ 1080.972877]  [<ffffffff810e0b23>] ? lock_page+0x2b/0x2b
 [ 1080.972899]  [<ffffffff81475a1d>] io_schedule+0x63/0x7e
 [ 1080.972902]  [<ffffffff810e0b31>] sleep_on_page+0xe/0x12
 [ 1080.972905]  [<ffffffff81475fe8>] __wait_on_bit_lock+0x46/0x8f
 [ 1080.972916]  [<ffffffff810822d7>] ? lock_release_holdtime.part.7+0x6b/0x72
 [ 1080.972919]  [<ffffffff810e0af6>] __lock_page+0x66/0x68
 [ 1080.972928]  [<ffffffff81072705>] ? autoremove_wake_function+0x3d/0x3d
 [ 1080.972932]  [<ffffffff810e0b1f>] lock_page+0x27/0x2b
 [ 1080.972934]  [<ffffffff810e0bcf>] find_lock_page+0x34/0x57
 [ 1080.972937]  [<ffffffff810e1738>] find_or_create_page+0x34/0x8a
 [ 1080.972947]  [<ffffffffa034245b>] bl_write_pagelist+0x205/0x6da [blocklayoutdriver]
 [ 1080.972951]  [<ffffffffa034145d>] ? bl_free_lseg+0x38/0x38 [blocklayoutdriver]
 [ 1080.972995]  [<ffffffffa02e27b9>] ? nfs_write_rpcsetup+0x118/0x123 [nfs]
 [ 1080.973033]  [<ffffffffa030246b>] pnfs_generic_pg_writepages+0x10b/0x1f4 [nfs]
 [ 1080.973089]  [<ffffffffa02deaae>] nfs_pageio_doio+0x1a/0x43 [nfs]
 [ 1080.973098]  [<ffffffffa02df035>] nfs_pageio_complete+0x16/0x2d [nfs]
 [ 1080.973108]  [<ffffffffa02e2d8f>] nfs_writepage_locked+0xa0/0xbf [nfs]
 [ 1080.973119]  [<ffffffffa02e36a1>] nfs_writepage+0x16/0x2b [nfs]
 [ 1080.973122]  [<ffffffff810e8762>] ? clear_page_dirty_for_io+0x87/0x9a
 [ 1080.973133]  [<ffffffff810efc5b>] shrink_page_list+0x39b/0x6c8
 [ 1080.973139]  [<ffffffff810f03bb>] shrink_inactive_list+0x22c/0x39e
 [ 1080.973144]  [<ffffffff810822d7>] ? lock_release_holdtime.part.7+0x6b/0x72
 [ 1080.973148]  [<ffffffff810f0c33>] shrink_zone+0x445/0x588
 [ 1080.973152]  [<ffffffff810f1a11>] balance_pgdat+0x2c2/0x56b
 [ 1080.973170]  [<ffffffff81254208>] ? __bitmap_weight+0x34/0x80
 [ 1080.973175]  [<ffffffff810f1f78>] kswapd+0x2be/0x2fa
 [ 1080.973179]  [<ffffffff810726c8>] ? __init_waitqueue_head+0x4b/0x4b
 [ 1080.973183]  [<ffffffff810f1cba>] ? balance_pgdat+0x56b/0x56b
 [ 1080.973187]  [<ffffffff81071f69>] kthread+0xa8/0xb0
 [ 1080.973200]  [<ffffffff814806b4>] kernel_thread_helper+0x4/0x10
 [ 1080.973205]  [<ffffffff81071ec1>] ? __init_kthread_worker+0x5a/0x5a
 [ 1080.973210]  [<ffffffff814806b0>] ? gs_change+0x13/0x13
 [ 1080.973213] no locks held by kswapd0/25.

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:43:00 -08:00
Peng Tao
e799f152fb pnfsblock: fix NULL pointer dereference
commit e6d05a757c upstream.

bl_add_page_to_bio returns error pointer. bio should be reset to
NULL in failure cases as the out path always calls bl_submit_bio.

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:59 -08:00
Peng Tao
01e4e1d236 pnfs: recoalesce when ld read pagelist fails
commit 9b7eecdcfe upstream.

For pnfs pagelist read failure, we need to pg_recoalesce and resend IO to
mds.

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:58 -08:00
Peng Tao
626a6d5541 pnfs: recoalesce when ld write pagelist fails
commit 8ce160c5ef upstream.

For pnfs pagelist write failure, we need to pg_recoalesce and resend IO to
mds.

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:58 -08:00
Peng Tao
bef05def50 pnfs: make _set_lo_fail generic
commit 1b0ae06877 upstream.

file layout and block layout both use it to set mark layout io failure
bit. So make it generic.

Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:57 -08:00
Peng Tao
f135656f64 pnfsblock: add missing rpc_put_mount and path_put
commit 760383f1ee upstream.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:56 -08:00
Jim Rees
7ab7073b8b pnfsblock: fix size of upcall message
commit fdc17abbc4 upstream.

Make the status field explicitly 32 bits.  "...it's unlikely that the kernel
and userspace would differ on the size of an int here, but it might be a
good idea to go ahead and make that explicitly 32 bits in case we end up
dealing with more exotic arches at some point in the future."

Suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:55 -08:00
Jim Rees
029bc65941 pnfsblock: fix return code confusion
commit 516f2e24fa upstream.

Always return PTR_ERR, not NULL, from nfs4_blk_get_deviceinfo and
nfs4_blk_decode_device.

Check for IS_ERR, not NULL, in bl_set_layoutdriver when calling
nfs4_blk_get_deviceinfo.

Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:55 -08:00
Nelson Elhage
cc1548ae44 epoll: fix spurious lockdep warnings
commit d8805e633e upstream.

epoll can acquire recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd.  This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings.  Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.

Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:

--------------------8<--------------------

int main(void) {
   int e1, e2;
   struct epoll_event evt = {
       .events = EPOLLIN
   };

   e1 = epoll_create1(0);
   e2 = epoll_create1(0);
   epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
   return 0;
}
--------------------8<--------------------

Reported-by: Paul Bolle <pebolle@tiscali.nl>
Tested-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Nelson Elhage <nelhage@nelhage.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:34 -08:00
Pavel Shilovsky
436318f9b4 CIFS: Fix DFS handling in cifs_get_file_info
commit 42274bb22a upstream.

We should call cifs_all_info_to_fattr in rc == 0 case only.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:10 -08:00
Pavel Shilovsky
405a281820 CIFS: Fix incorrect max RFC1002 write size value
commit 94443f4340 upstream.

..the length field has only 17 bits.

Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11 09:42:10 -08:00
Linus Torvalds
480082968a Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: revert to using a kthread for AIL pushing
  xfs: force the log if we encounter pinned buffers in .iop_pushbuf
  xfs: do not update xa_last_pushed_lsn for locked items
2011-10-14 17:06:39 +12:00
Linus Torvalds
b2f9452bd5 Merge branch 'btrfs-3.0' of git://github.com/chrismason/linux
* 'btrfs-3.0' of git://github.com/chrismason/linux:
  Btrfs: make sure not to defrag extents past i_size
  Btrfs: fix recursive auto-defrag
2011-10-13 18:20:40 +12:00
Christoph Hellwig
0030807c66 xfs: revert to using a kthread for AIL pushing
Currently we have a few issues with the way the workqueue code is used to
implement AIL pushing:

 - it accidentally uses the same workqueue as the syncer action, and thus
   can be prevented from running if there are enough sync actions active
   in the system.
 - it doesn't use the HIGHPRI flag to queue at the head of the queue of
   work items

At this point I'm not confident enough in getting all the workqueue flags and
tweaks right to provide a perfectly reliable execution context for AIL
pushing, which is the most important piece in XFS to make forward progress
when the log fills.

Revert back to use a kthread per filesystem which fixes all the above issues
at the cost of having a task struct and stack around for each mounted
filesystem.  In addition this also gives us much better ways to diagnose
any issues involving hung AIL pushing and removes a small amount of code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11 11:02:49 -05:00
Christoph Hellwig
17b38471c3 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
We need to check for pinned buffers even in .iop_pushbuf given that inode
items flush into the same buffers that may be pinned directly due operations
on the unlinked inode list operating directly on buffers.  To do this add a
return value to .iop_pushbuf that tells the AIL push about this and use
the existing log force mechanisms to unpin it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11 11:02:48 -05:00
Christoph Hellwig
bc6e588a89 xfs: do not update xa_last_pushed_lsn for locked items
If an item was locked we should not update xa_last_pushed_lsn and thus skip
it when restarting the AIL scan as we need to be able to lock and write it
out as soon as possible.  Otherwise heavy lock contention might starve AIL
pushing too easily, especially given the larger backoff once we moved
xa_last_pushed_lsn all the way to the target lsn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-10-11 11:02:48 -05:00
Chris Mason
f7f43cc841 Btrfs: make sure not to defrag extents past i_size
The btrfs file defrag code will loop through the extents and
force COW on them.  But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.

The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented.  defrag will end up hitting the same extents again and
again.

In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.

The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-11 11:45:55 -04:00
Li Zefan
2a0f7f5769 Btrfs: fix recursive auto-defrag
Follow those steps:

  # mount -o autodefrag /dev/sda7 /mnt
  # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
  # sync
  # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc

and then it'll go into a loop: writeback -> defrag -> writeback ...

It's because writeback writes [8K, 200K] and then writes [0, 8K].

I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-10 15:43:34 -04:00
Linus Torvalds
65112dccf8 Merge git://git.samba.org/sfrench/cifs-2.6
* git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] Fix first time message on mount, ntlmv2 upgrade delayed to 3.2
2011-10-10 14:53:11 +12:00
Steve French
9d1e397b7b [CIFS] Fix first time message on mount, ntlmv2 upgrade delayed to 3.2
Microsoft has a bug with ntlmv2 that requires use of ntlmssp, but
we didn't get the required information on when/how to use ntlmssp to
old (but once very popular) legacy servers (various NT4 fixpacks
for example) until too late to merge for 3.1.  Will upgrade
to NTLMv2 in NTLMSSP in 3.2

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2011-10-07 20:17:56 -05:00
Linus Torvalds
7fd21be75d Merge branch 'btrfs-3.0' of git://github.com/chrismason/linux
* 'btrfs-3.0' of git://github.com/chrismason/linux:
  Btrfs: force a page fault if we have a shorty copy on a page boundary
2011-10-03 12:17:44 -07:00
Josef Bacik
b6316429af Btrfs: force a page fault if we have a shorty copy on a page boundary
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing.  It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in.  The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else.  This will force a readpage if we end up
doing a short copy.  Alexandre could reproduce this easily with ceph and reports
it fixes his problem.  I also wrote a reproducer that no longer hangs my box
with this patch.  Thanks,

Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-30 15:23:54 -04:00
Linus Torvalds
b6c8069d35 vfs: remove LOOKUP_NO_AUTOMOUNT flag
That flag no longer makes sense, since we don't look up automount points
as eagerly any more.  Additionally, it turns out that the NO_AUTOMOUNT
handling was buggy to begin with: it would avoid automounting even for
cases where we really *needed* to do the automount handling, and could
return ENOENT for autofs entries that hadn't been instantiated yet.

With our new non-eager automount semantics, one discussion has been
about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
newfstatat() and fstatat64() system calls), but it's probably not worth
it: you can always force at least directory automounting by simply
adding the final '/' to the filename, which works for *all* of the stat
family system calls, old and new.

So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
result of our bad default behavior.

Acked-by: Ian Kent <raven@themaw.net>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-27 08:12:33 -07:00
Trond Myklebust
815d405cef VFS: Fix the remaining automounter semantics regressions
The concensus seems to be that system calls such as stat() etc should
not trigger an automount.  Neither should the l* versions.

This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
that _should_ trigger an automount on the last path element.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
  LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
  force automounting for their own reasons   - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-26 19:16:46 -07:00