[ Upstream commit 950271d7cc ]
Currently the tun_napi_alloc_frags() function returns -ENOMEM when the
number of iovs exceeds MAX_SKB_FRAGS + 1. However this is inappropriate,
we should use -EMSGSIZE instead of -ENOMEM.
The following distinctions are matters:
1. the caller need to drop the bad packet when -EMSGSIZE is returned,
which means meeting a persistent failure.
2. the caller can try again when -ENOMEM is returned, which means
meeting a transient failure.
Fixes: 90e33d4594 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/1608864736-24332-1-git-send-email-wangyunjian@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4614792eeb ]
The CPTS driver registers PTP PHC clock when first netif is going up and
unregister it when all netif are down. Now ethtool will show:
- PTP PHC clock index 0 after boot until first netif is up;
- the last assigned PTP PHC clock index even if PTP PHC clock is not
registered any more after all netifs are down.
This patch ensures that -1 is returned by ethtool when PTP PHC clock is not
registered any more.
Fixes: 8a2c9a5ab4 ("net: ethernet: ti: cpts: rework initialization/deinitialization")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20201224162405.28032-1-grygorii.strashko@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4ae2bb8164 ]
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2d57b4f142 ]
Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:
1. Calling netdev_set_num_tc while netif_set_xps_queue:
- netif_set_xps_queue uses dev->tc_num as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->tc_num is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.
- netdev_set_num_tc sets dev->tc_num.
If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
is set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps; triggering an oops.
2. Calling netif_set_xps_queue while netdev_set_num_tc is running:
2.1. netdev_set_num_tc starts by resetting the xps queues,
dev->tc_num isn't updated yet.
2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.
2.3. netdev_set_num_tc updates dev->tc_num.
2.4. Later accesses to the map lead to out of bound accesses and
oops.
A similar issue can be found with netdev_reset_tc.
One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.
Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fb25038586 ]
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.
Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1ad58225db ]
Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:
1. Calling netdev_set_num_tc while netif_set_xps_queue:
- netif_set_xps_queue uses dev->tc_num as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->tc_num is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.
- netdev_set_num_tc sets dev->tc_num.
If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
is set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps; triggering an oops.
2. Calling netif_set_xps_queue while netdev_set_num_tc is running:
2.1. netdev_set_num_tc starts by resetting the xps queues,
dev->tc_num isn't updated yet.
2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.
2.3. netdev_set_num_tc updates dev->tc_num.
2.4. Later accesses to the map lead to out of bound accesses and
oops.
A similar issue can be found with netdev_reset_tc.
One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.
Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.
Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit de33212f76 ]
virtnet_set_channels can recursively call cpus_read_lock if CONFIG_XPS
and CONFIG_HOTPLUG are enabled.
The path is:
virtnet_set_channels - calls get_online_cpus(), which is a trivial
wrapper around cpus_read_lock()
netif_set_real_num_tx_queues
netif_reset_xps_queues_gt
netif_reset_xps_queues - calls cpus_read_lock()
This call chain and potential deadlock happens when the number of TX
queues is reduced.
This commit the removes netif_set_real_num_[tr]x_queues calls from
inside the get/put_online_cpus section, as they don't require that it
be held.
Fixes: 47be24796c ("virtio-net: fix the set affinity bug when CPU IDs are not consecutive")
Signed-off-by: Jeff Dike <jdike@akamai.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20201223025421.671-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1f45dc2206 ]
Commit f9c6cea0b3 ("ibmvnic: Skip fatal error reset after passive init")
says "If the passive
CRQ initialization occurs before the FATAL reset task is processed,
the FATAL error reset task would try to access a CRQ message queue
that was freed, causing an oops. The problem may be most likely to
occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
process will automatically issue a change MTU request.
Fix this by not processing fatal error reset if CRQ is passively
initialized after client-driven CRQ initialization fails."
Even with this commit, we still see similar kernel crashes. In order
to completely solve this problem, we'd better continue the fatal error
reset, capture the kernel crash, and try to fix it from that end.
Fixes: f9c6cea0b3 ("ibmvnic: Skip fatal error reset after passive init")
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Link: https://lore.kernel.org/r/20201219214034.21123-1-ljp@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1385ae5c30 ]
All the buffers and registers are already set up appropriately for an
MTU slightly above 1500, so we just need to expose this to the
networking stack. AFAICT, there's no need to implement .ndo_change_mtu
when the receive buffers are always set up to support the max_mtu.
This fixes several warnings during boot on our mpc8309-board with an
embedded mv88e6250 switch:
mv88e6085 mdio@e0102120:10: nonfatal error -34 setting MTU 1500 on port 0
...
mv88e6085 mdio@e0102120:10: nonfatal error -34 setting MTU 1500 on port 4
ucc_geth e0102000.ethernet eth1: error -22 setting MTU to 1504 to include DSA overhead
The last line explains what the DSA stack tries to do: achieving an MTU
of 1500 on-the-wire requires that the master netdevice connected to
the CPU port supports an MTU of 1500+the tagging overhead.
Fixes: bfcb813203 ("net: dsa: configure the MTU for switch ports")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e925e0cd2a ]
ugeth is the netdiv_priv() part of the netdevice. Accessing the memory
pointed to by ugeth (such as done by ucc_geth_memclean() and the two
of_node_puts) after free_netdev() is thus use-after-free.
Fixes: 80a9fad8e8 ("ucc_geth: fix module removal")
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3f48fab62b ]
Issue:
Flow control frame used to pause GoP(MAC) was delivered to the CPU
and created a load on the CPU. Since XOFF/XON frames are used only
by MAC, these frames should be dropped inside MAC.
Fix:
According to 802.3-2012 - IEEE Standard for Ethernet pause frame
has unique destination MAC address 01-80-C2-00-00-01.
Add TCAM parser entry to track and drop pause frames by destination MAC.
Fixes: 3f518509de ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Stefan Chulski <stefanc@marvell.com>
Link: https://lore.kernel.org/r/1608229817-21951-1-git-send-email-stefanc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3ac874fa84 ]
When removing VFs for PF added to bridge there was
an error I40E_AQ_RC_EINVAL. It was caused by not properly
resetting and reinitializing PF when adding/removing VFs.
Changed how reset is performed when adding/removing VFs
to properly reinitialize PFs VSI.
Fixes: fc60861e9b ("i40e: start up in VEPA mode by default")
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c6c75deda8 ]
Commit 1fde6f21d9 ("proc: fix /proc/net/* after setns(2)") only forced
revalidation of regular files under /proc/net/
However, /proc/net/ is unusual in the sense of /proc/net/foo handlers
take netns pointer from parent directory which is old netns.
Steps to reproduce:
(void)open("/proc/net/sctp/snmp", O_RDONLY);
unshare(CLONE_NEWNET);
int fd = open("/proc/net/sctp/snmp", O_RDONLY);
read(fd, &c, 1);
Read will read wrong data from original netns.
Patch forces lookup on every directory under /proc/net .
Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain
Fixes: 1da4d377f9 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cedd1862be ]
Commit 436e980e2e ("kbuild: don't hardcode depmod path") stopped
hard-coding the path of depmod, but in the process caused trouble for
distributions that had that /sbin location, but didn't have it in the
PATH (generally because /sbin is limited to the super-user path).
Work around it for now by just adding /sbin to the end of PATH in the
depmod.sh script.
Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3684566384 ]
Some graphic card has very big memory on chip, such as 32G bytes.
In the following case, it will cause overflow:
pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
ret = gen_pool_add(pool, 0x1000000, SZ_32G, NUMA_NO_NODE);
va = gen_pool_alloc(pool, SZ_4G);
The overflow occurs in gen_pool_alloc_algo_owner():
....
size = nbits << order;
....
The @nbits is "int" type, so it will overflow.
Then the gen_pool_avail() will return the wrong value.
This patch converts some "int" to "unsigned long", and
changes the compare code in while.
Link: https://lkml.kernel.org/r/20201229060657.3389-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Reported-by: Shi Jiasheng <jiasheng.shi@iluvatar.ai>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 96d86e6a80 ]
RQF_PREEMPT is used for two different purposes in the legacy IDE code:
1. To mark power management requests.
2. To mark requests that should preempt another request. An (old)
explanation of that feature is as follows: "The IDE driver in the Linux
kernel normally uses a series of busywait delays during its
initialization. When the driver executes these busywaits, the kernel
does nothing for the duration of the wait. The time spent in these
waits could be used for other initialization activities, if they could
be run concurrently with these waits.
More specifically, busywait-style delays such as udelay() in module
init functions inhibit kernel preemption because the Big Kernel Lock is
held, while yielding APIs such as schedule_timeout() allow
preemption. This is true because the kernel handles the BKL specially
and releases and reacquires it across reschedules allowed by the
current thread.
This IDE-preempt specification requires that the driver eliminate these
busywaits and replace them with a mechanism that allows other work to
proceed while the IDE driver is initializing."
Since I haven't found an implementation of (2), do not set the PREEMPT flag
for sense requests. This patch causes sense requests to be postponed while
a drive is suspended instead of being submitted to ide_queue_rq().
If it would ever be necessary to restore the IDE PREEMPT functionality,
that can be done by introducing a new flag in struct ide_request.
Link: https://lore.kernel.org/r/20201209052951.16136-4-bvanassche@acm.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 01341fbd0d ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 31784cff7e ]
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_interruptible. This is needed for perf_event_open to be
converted (with no semantic changes) from working on a mutex to
wroking on a rwsem.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87k0tybqfy.fsf@x220.int.ebiederm.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0f9368b5bf ]
In preparation for converting exec_update_mutex to a rwsem so that
multiple readers can execute in parallel and not deadlock, add
down_read_killable_nested. This is needed so that kcmp_lock
can be converted from working on a mutexes to working on rw_semaphores.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87o8jabqh3.fsf@x220.int.ebiederm.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 78af4dc949 ]
Syzbot reported a lock inversion involving perf. The sore point being
perf holding exec_update_mutex() for a very long time, specifically
across a whole bunch of filesystem ops in pmu::event_init() (uprobes)
and anon_inode_getfile().
This then inverts against procfs code trying to take
exec_update_mutex.
Move the permission checks later, such that we need to hold the mutex
over less code.
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5d069dbe8a ]
Jan Kara's analysis of the syzbot report (edited):
The reproducer opens a directory on FUSE filesystem, it then attaches
dnotify mark to the open directory. After that a fuse_do_getattr() call
finds that attributes returned by the server are inconsistent, and calls
make_bad_inode() which, among other things does:
inode->i_mode = S_IFREG;
This then confuses dnotify which doesn't tear down its structures
properly and eventually crashes.
Avoid calling make_bad_inode() on a live inode: switch to a private flag on
the fuse inode. Also add the test to ops which the bad_inode_ops would
have caught.
This bug goes back to the initial merge of fuse in 2.6.14...
Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7b6b51234d upstream
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Cc: Daniel Baluta <daniel.baluta@gmail.com>
Cc: Daniel Baluta <daniel.baluta@oss.nxp.com>
Cc: <Stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts stable commit baad618d07.
This commit is adding lines to spinand_write_to_cache_op, wheras the upstream
commit 868cbe2a6d that this was supposed to
backport was touching spinand_read_from_cache_op.
It causes a crash on writing OOB data by attempting to write to read-only
kernel memory.
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 252bd12563 ]
If emergency system shutdown is called, like by thermal shutdown,
a dm device could be alive when the block device couldn't process
I/O requests anymore. In this state, the handling of I/O errors
by new dm I/O requests or by those already in-flight can lead to
a verity corruption state, which is a misjudgment.
So, skip verity work in response to I/O error when system is shutting
down.
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 618de0f4ef ]
The PCM hw_params core function tries to clear up the PCM buffer
before actually using for avoiding the information leak from the
previous usages or the usage before a new allocation. It performs the
memset() with runtime->dma_bytes, but this might still leave some
remaining bytes untouched; namely, the PCM buffer size is aligned in
page size for mmap, hence runtime->dma_bytes doesn't necessarily cover
all PCM buffer pages, and the remaining bytes are exposed via mmap.
This patch changes the memory clearance to cover the all buffer pages
if the stream is supposed to be mmap-ready (that guarantees that the
buffer size is aligned in page size).
Reviewed-by: Lars-Peter Clausen <lars@metafoo.de>
Link: https://lore.kernel.org/r/20201218145625.2045-3-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ba8ea8e7dd ]
can_stop_idle_tick() checks whether the do_timer() duty has been taken over
by a CPU on boot. That's silly because the boot CPU always takes over with
the initial clockevent device.
But even if no CPU would have installed a clockevent and taken over the
duty then the question whether the tick on the current CPU can be stopped
or not is moot. In that case the current CPU would have no clockevent
either, so there would be nothing to keep ticking.
Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20201206212002.725238293@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fc6b6a872d ]
Internally, UBD treats each physical IO segment as a separate command to
be submitted in the execution pipe. If the pipe returns a transient
error after a few segments have already been written, UBD will tell the
block layer to requeue the request, but there is no way to reclaim the
segments already submitted. When a new attempt to dispatch the request
is done, those segments already submitted will get duplicated, causing
the WARN_ON below in the best case, and potentially data corruption.
In my system, running a UML instance with 2GB of RAM and a 50M UBD disk,
I can reproduce the WARN_ON by simply running mkfs.fvat against the
disk on a freshly booted system.
There are a few ways to around this, like reducing the pressure on
the pipe by reducing the queue depth, which almost eliminates the
occurrence of the problem, increasing the pipe buffer size on the host
system, or by limiting the request to one physical segment, which causes
the block layer to submit way more requests to resolve a single
operation.
Instead, this patch modifies the format of a UBD command, such that all
segments are sent through a single element in the communication pipe,
turning the command submission atomic from the point of view of the
block layer. The new format has a variable size, depending on the
number of elements, and looks like this:
+------------+-----------+-----------+------------
| cmd_header | segment 0 | segment 1 | segment ...
+------------+-----------+-----------+------------
With this format, we push a pointer to cmd_header in the submission
pipe.
This has the advantage of reducing the memory footprint of executing a
single request, since it allow us to merge some fields in the header.
It is possible to reduce even further each segment memory footprint, by
merging bitmap_words and cow_offset, for instance, but this is not the
focus of this patch and is left as future work. One issue with the
patch is that for a big number of segments, we now perform one big
memory allocation instead of multiple small ones, but I wasn't able to
trigger any real issues or -ENOMEM because of this change, that wouldn't
be reproduced otherwise.
This was tested using fio with the verify-crc32 option, and by running
an ext4 filesystem over this UBD device.
The original WARN_ON was:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 refcount_warn_saturate+0x13f/0x141
refcount_t: underflow; use-after-free.
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.5.0-rc6-00002-g2a5bb2cf75c8 #346
Stack:
6084eed0 6063dc77 00000009 6084ef60
00000000 604b8d9f 6084eee0 6063dcbc
6084ef40 6006ab8d e013d780 1c00000000
Call Trace:
[<600a0c1c>] ? printk+0x0/0x94
[<6004a888>] show_stack+0x13b/0x155
[<6063dc77>] ? dump_stack_print_info+0xdf/0xe8
[<604b8d9f>] ? refcount_warn_saturate+0x13f/0x141
[<6063dcbc>] dump_stack+0x2a/0x2c
[<6006ab8d>] __warn+0x107/0x134
[<6008da6c>] ? wake_up_process+0x17/0x19
[<60487628>] ? blk_queue_max_discard_sectors+0x0/0xd
[<6006b05f>] warn_slowpath_fmt+0xd1/0xdf
[<6006af8e>] ? warn_slowpath_fmt+0x0/0xdf
[<600acc14>] ? raw_read_seqcount_begin.constprop.0+0x0/0x15
[<600619ae>] ? os_nsecs+0x1d/0x2b
[<604b8d9f>] refcount_warn_saturate+0x13f/0x141
[<6048bc8f>] refcount_sub_and_test.constprop.0+0x2f/0x37
[<6048c8de>] blk_mq_free_request+0xf1/0x10d
[<6048ca06>] __blk_mq_end_request+0x10c/0x114
[<6005ac0f>] ubd_intr+0xb5/0x169
[<600a1a37>] __handle_irq_event_percpu+0x6b/0x17e
[<600a1b70>] handle_irq_event_percpu+0x26/0x69
[<600a1bd9>] handle_irq_event+0x26/0x34
[<600a1bb3>] ? handle_irq_event+0x0/0x34
[<600a5186>] ? unmask_irq+0x0/0x37
[<600a57e6>] handle_edge_irq+0xbc/0xd6
[<600a131a>] generic_handle_irq+0x21/0x29
[<60048f6e>] do_IRQ+0x39/0x54
[...]
---[ end trace c6e7444e55386c0f ]---
Cc: Christopher Obbard <chris.obbard@collabora.com>
Reported-by: Martyn Welch <martyn@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Tested-by: Christopher Obbard <chris.obbard@collabora.com>
Acked-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit edf7ddbf1c ]
Missing calls to mntget() (or equivalently, too many calls to mntput())
are hard to detect because mntput() delays freeing mounts using
task_work_add(), then again using call_rcu(). As a result, mnt_count
can often be decremented to -1 without getting a KASAN use-after-free
report. Such cases are still bugs though, and they point to real
use-after-frees being possible.
For an example of this, see the bug fixed by commit 1b0b9cc8d3
("vfs: fsmount: add missing mntget()"), discussed at
https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u.
This bug *should* have been trivial to find. But actually, it wasn't
found until syzkaller happened to use fchdir() to manipulate the
reference count just right for the bug to be noticeable.
Address this by making mntput_no_expire() issue a WARN if mnt_count has
become negative.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>