gcc-7 flags the use of integer math inside of a condition
as a potential bug:
fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format':
fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
There is already a helper function for testing the di_forkoff
field for zero, so let's use that instead to shut up the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The t_lsn is not used anymore and the t_commit_lsn is used as a tmp
storage for the checkpoint sequence number only in the current code.
And the start/commit lsn are tracked as a transaction group tag in
the xfs_cil_ctx instead of a single transaction, so remove them from
the xfs_trans structure and their users to match with the design.
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS_HSIZE is an extremly confusing way to calculate the size of handle_t.
Given that handle_t always only had two sizes, and one of them isn't
even covered by XFS_HSIZE to start with just remove the macro and use
a constant sizeof expression.
Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If a transaction log reservation overrun occurs, the ticket data
associated with the reservation is dumped in xfs_log_commit_cil().
This occurs long after the transaction items and details have been
removed from the transaction and effectively lost. This limited set
of ticket data provides very little information to support debugging
transaction overruns based on the typical report.
To improve transaction log reservation overrun reporting, create a
helper to dump transaction details such as log items, log vector
data, etc., as well as the underlying ticket data for the
transaction. Move the overrun detection from xfs_log_commit_cil() to
xlog_cil_insert_items() so it occurs prior to migration of the
logged items to the CIL. Call the new helper such that it is able to
dump this transaction data before it is lost.
Also, warn on overrun to provide callstack context for the offending
transaction and include a few additional messages from
xlog_cil_insert_items() to display the reservation consumed locally
for overhead such as log vector headers, split region headers and
the context ticket. This provides a complete general breakdown of
the reservation consumption of a transaction when/if it happens to
overrun the reservation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Transaction reservation overrun detection currently occurs too late
to print useful information about the offending transaction.
Ideally, the transaction data is printed before the associated log
items are moved from the transaction to the CIL, which occurs in
xlog_cil_insert_items(), such that details of the items logged by
the transaction are available for analysis.
Refactor xlog_cil_insert_items() to facilitate moving tx overrun
detection to this function. Update the function to track each bit of
extra log reservation stolen from the transaction (i.e., such as for
the CIL context ticket) and perform the log item migration as the
last operation before the CIL lock is released. This creates a
context where the transaction reservation consumption has been fully
calculated when the log items are moved to the CIL. This patch makes
no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xlog_print_tic_res() pre-dates delayed logging and the committed
items list (CIL) and thus retains some factoring warts, such as hard
coded function names in the output and the fact that it induces a
shutdown.
In preparation for more detailed logging of regular transaction
overrun situations, refactor xlog_print_tic_res() to be slightly
more generic. Reword some of the warning messages and pull the
shutdown into the callers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While configurable at runtime, the DEBUG mode assert failure
behavior is usually either desired or not for a particular
situation. For example, developers using kernel modules may prefer
for fatal asserts to remain disabled across module reloads while QE
engineers doing broad regression testing may prefer to have fatal
asserts enabled on boot to facilitate data collection for bug
reports.
To provide a compromise/convenience for developers, create a Kconfig
option that sets the default value of the DEBUG mode 'bug_on_assert'
sysfs tunable. The default behavior remains to trigger kernel BUGs
on assert failures to preserve existing behavior across kernel
configuration updates with DEBUG mode enabled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In DEBUG mode, assert failures unconditionally trigger a kernel BUG.
This is useful in diagnostic situations to panic a system and
collect detailed state information at the time of a failure.
This can also cause problems in cases where DEBUG mode code is
desired but it is preferable not trigger kernel BUGs on assert
failure. For example, during development of new code or during
certain xfstests tests that intentionally cause corruption and test
the kernel for survival (but otherwise may expect to trigger assert
failures).
To provide additional flexibility, create the
<sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert
failure behavior at runtime. This tunable is only available in DEBUG
mode and is enabled by default to preserve existing default
behavior. When disabled, assert failures in DEBUG mode result in
kernel warnings.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.
Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.
Note that since bunmapi can fail to unmap the entire range, we must also
teach the deferred unmap code to roll into a new transaction whenever we
get low on reservation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, the dir2 leaf block getdents function uses a complex state
tracking mechanism to create a shadow copy of the block mappings and
then uses the shadow copy to schedule readahead. Since the read and
readahead functions are perfectly capable of reading the mappings
themselves, we can tear all that out in favor of a simpler function that
simply keeps pushing the readahead window further out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:
- Quotacheck populates a local delwri queue with the physical dquot
buffers.
- Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
dirties all of the dquots.
- Reclaim kicks in and attempts to flush a dquot whose buffer is
already queud on the quotacheck queue. The flush succeeds but
queueing to the reclaim delwri queue fails as the backing buffer is
already queued. The flush unlock is now deferred to I/O completion
of the buffer from the quotacheck queue.
- The dqadjust bulkstat continues and dirties the recently flushed
dquot once again.
- Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
the flush lock to update the backing buffers with the in-core
recalculated values. It deadlocks on the redirtied dquot as the
flush lock was already acquired by reclaim, but the buffer resides
on the local delwri queue which isn't submitted until the end of
quotacheck.
This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.
Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.
Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores
quotacheck behavior to as before the regression was introduced.
Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
At Linux v3.5, packet processing can be done in process context of ALSA
PCM application as well as software IRQ context for OHCI 1394. Below is
an example of the callgraph (some calls are omitted).
ioctl(2) with e.g. HWSYNC
(sound/core/pcm_native.c)
->snd_pcm_common_ioctl1()
->snd_pcm_hwsync()
->snd_pcm_stream_lock_irq
(sound/core/pcm_lib.c)
->snd_pcm_update_hw_ptr()
->snd_pcm_udpate_hw_ptr0()
->struct snd_pcm_ops.pointer()
(sound/firewire/*)
= Each handler on drivers in ALSA firewire stack
(sound/firewire/amdtp-stream.c)
->amdtp_stream_pcm_pointer()
(drivers/firewire/core-iso.c)
->fw_iso_context_flush_completions()
->struct fw_card_driver.flush_iso_completion()
(drivers/firewire/ohci.c)
= flush_iso_completions()
->struct fw_iso_context.callback.sc
(sound/firewire/amdtp-stream.c)
= in_stream_callback() or out_stream_callback()
->...
->snd_pcm_stream_unlock_irq
When packet queueing error occurs or detecting invalid packets in
'in_stream_callback()' or 'out_stream_callback()', 'snd_pcm_stop_xrun()'
is called on local CPU with disabled IRQ.
(sound/firewire/amdtp-stream.c)
in_stream_callback() or out_stream_callback()
->amdtp_stream_pcm_abort()
->snd_pcm_stop_xrun()
->snd_pcm_stream_lock_irqsave()
->snd_pcm_stop()
->snd_pcm_stream_unlock_irqrestore()
The process is stalled on the CPU due to attempt to acquire recursive lock.
[ 562.630853] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 562.630861] 2-...: (1 GPs behind) idle=37d/140000000000000/0 softirq=38323/38323 fqs=7140
[ 562.630862] (detected by 3, t=15002 jiffies, g=21036, c=21035, q=5933)
[ 562.630866] Task dump for CPU 2:
[ 562.630867] alsa-source-OXF R running task 0 6619 1 0x00000008
[ 562.630870] Call Trace:
[ 562.630876] ? vt_console_print+0x79/0x3e0
[ 562.630880] ? msg_print_text+0x9d/0x100
[ 562.630883] ? up+0x32/0x50
[ 562.630885] ? irq_work_queue+0x8d/0xa0
[ 562.630886] ? console_unlock+0x2b6/0x4b0
[ 562.630888] ? vprintk_emit+0x312/0x4a0
[ 562.630892] ? dev_vprintk_emit+0xbf/0x230
[ 562.630895] ? do_sys_poll+0x37a/0x550
[ 562.630897] ? dev_printk_emit+0x4e/0x70
[ 562.630900] ? __dev_printk+0x3c/0x80
[ 562.630903] ? _raw_spin_lock+0x20/0x30
[ 562.630909] ? snd_pcm_stream_lock+0x31/0x50 [snd_pcm]
[ 562.630914] ? _snd_pcm_stream_lock_irqsave+0x2e/0x40 [snd_pcm]
[ 562.630918] ? snd_pcm_stop_xrun+0x16/0x70 [snd_pcm]
[ 562.630922] ? in_stream_callback+0x3e6/0x450 [snd_firewire_lib]
[ 562.630925] ? handle_ir_packet_per_buffer+0x8e/0x1a0 [firewire_ohci]
[ 562.630928] ? ohci_flush_iso_completions+0xa3/0x130 [firewire_ohci]
[ 562.630932] ? fw_iso_context_flush_completions+0x15/0x20 [firewire_core]
[ 562.630935] ? amdtp_stream_pcm_pointer+0x2d/0x40 [snd_firewire_lib]
[ 562.630938] ? pcm_capture_pointer+0x19/0x20 [snd_oxfw]
[ 562.630943] ? snd_pcm_update_hw_ptr0+0x47/0x3d0 [snd_pcm]
[ 562.630945] ? poll_select_copy_remaining+0x150/0x150
[ 562.630947] ? poll_select_copy_remaining+0x150/0x150
[ 562.630952] ? snd_pcm_update_hw_ptr+0x10/0x20 [snd_pcm]
[ 562.630956] ? snd_pcm_hwsync+0x45/0xb0 [snd_pcm]
[ 562.630960] ? snd_pcm_common_ioctl1+0x1ff/0xc90 [snd_pcm]
[ 562.630962] ? futex_wake+0x90/0x170
[ 562.630966] ? snd_pcm_capture_ioctl1+0x136/0x260 [snd_pcm]
[ 562.630970] ? snd_pcm_capture_ioctl+0x27/0x40 [snd_pcm]
[ 562.630972] ? do_vfs_ioctl+0xa3/0x610
[ 562.630974] ? vfs_read+0x11b/0x130
[ 562.630976] ? SyS_ioctl+0x79/0x90
[ 562.630978] ? entry_SYSCALL_64_fastpath+0x1e/0xad
This commit fixes the above bug. This assumes two cases:
1. Any error is detected in software IRQ context of OHCI 1394 context.
In this case, PCM substream should be aborted in packet handler. On the
other hand, it should not be done in any process context. TO distinguish
these two context, use 'in_interrupt()' macro.
2. Any error is detect in process context of ALSA PCM application.
In this case, PCM substream should not be aborted in packet handler
because PCM substream lock is acquired. The task to abort PCM substream
should be done in ALSA PCM core. For this purpose, SNDRV_PCM_POS_XRUN is
returned at 'struct snd_pcm_ops.pointer()'.
Suggested-by: Clemens Ladisch <clemens@ladisch.de>
Fixes: e9148dddc3c7("ALSA: firewire-lib: flush completed packets when reading PCM position")
Cc: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Add ioctl support to IPoIB device driver. For now, this
ioctl will support timestamp get and set.
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Enable PTP for IPoIB rdma_netdev and add the ability
to get the time stamping parameters using ethtool.
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add the ndo that supports change mtu for IPoIB.
The callback called from the ipoib ULP driver, that gives the ability to
change the SW and HW resources accordingly in the lower driver.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The mtu extra space that kept for the HW is specific for each link type,
and it is different in mlx5e and mlx5i modules.
Now it is kept in the priv structures, set by the mlx5e/mlx5i driver
accordingly.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add function that sets the default values for ipoib, setting/clearing
abilities that IPoIB doesn't support, like RQ size in this case.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Updating the carrier involves specific HW setting, each profile should
use its own function for that.
Both IPoIB and VF representor don't need carrier update function, since
VF representor has only a logical link to VF and IPoIB manages its own
link via ib_core upper layer.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Port flow control supported only for ethernet ports,
therefore, prevent any call if the port type differs from
MLX5_CAP_PORT_TYPE_ETH.
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
IPoIB netdevice driver was only introduced in previous kernel release
and it is growing in terms of features and LOC, move it to a separate
directory.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The MCLK for DA7219 does not change in this platform, but is
currently being configured everytime as part of the platform_clock
event handler for DAPM. The upshot of this is that we have
unnecessary calls to this function, and it also means that if
a stream hasn't yet been started, DA7219 driver does not have the
correct MCLK rates programmed and so the HP detection feature does
not operate as expected.
This patch rectifies this issue by moving the sysclk call to
codec_init function so it's only called once at initialisation.
Signed-off-by: Adam Thomson <Adam.Thomson.Opensource@diasemi.com>
Acked-by: Sathyanarayana Nujella <sathyanarayana.nujella@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Currently when HP detection procedure runs for certain MCLK
frequencies, when PLL is bypassed, the procedure will incorrectly
report Lineout instead of Headphones due to timing incosistencies.
To avoid this problem, the PLL is temporarily enabled (if currently
bypassed and MCLK present) to provide consistent timings for the
procedure, regardless of MCLK frequency.
Signed-off-by: Adam Thomson <Adam.Thomson.Opensource@diasemi.com>
Acked-by: Sathyanarayana Nujella <sathyanarayana.nujella@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
This patch adds the deepbuffer device which can be opened with a bigger
buffer size. The application can disable interrupts and sleep for longer
duration.
Signed-off-by: Subhransu S. Prusty <subhransu.s.prusty@intel.com>
Acked-By: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
DMA buffer size for gateway copier will be calculated based on:
For host DMA copier:
Input buffer size (ibs) for output direction (playback)
Output buffer size (obs) for input direction (capture)
For link DMA copier:
IBS for input direction (capture)
OBS for output direction (playback)
Update the driver to use the above.
Signed-off-by: Ramesh Babu <ramesh.babu@intel.com>
Signed-off-by: Subhransu S. Prusty <subhransu.s.prusty@intel.com>
Acked-By: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
With this patch, the dma buffer size is fetched from topology binary. This
buffer size is applicable for gateway copier modules.
Now that we can configure DSP dma buffer size, the device can support deep
buffer playback. DSP fetches large buffer and can result fewer wakes,
which helps in power reduction.
Signed-off-by: Ramesh Babu <ramesh.babu@intel.com>
Signed-off-by: Subhransu S. Prusty <subhransu.s.prusty@intel.com>
Acked-By: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Use a specific flag for SAI and I2S interfaces,
instead of common flag.
Signed-off-by: olivier moysan <olivier.moysan@st.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
This commit updates the CP110 system controller driver to add the
definition for a missing clock.
The SDIO clock is dedicated driving the SDHCI interface and its frequency
is 400MHz (2/5 of PLL source clock).
The SDIO interface should be bound to this clock and not the core clock
as in the older code.
Using the wrong clock lead to a maximum SDHCI frequency of 250 Mhz, while
the HW really supports up to 400 Mhz.
This patch also fixes the NAND clock relationship documentation.
Signed-off-by: Konstantin Porotchkin <kostap@marvell.com>
[gregory.clement@free-electrons.com:
- use sdio instead of emmc to name the clock]
Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
The initial intent when the binding of the cp110 system controller was to
have one flat node. The idea being that what is currently a clock-only
driver in drivers would become a MFD driver, exposing the clock, GPIO and
pinctrl functionality. However, after taking a step back, this would lead
to a messy binding. Indeed, a single node would be a GPIO controller,
clock controller, pinmux controller, and more.
This patch adopts a more classical solution of a top-level syscon node
with sub-nodes for the individual devices. The main benefit will be to
have each functional block associated to its own sub-node where we can
put its own properties.
The introduction of the Armada 7K/8K is still in the early stage so the
plan is to remove the old binding. However, we don't want to break the
device tree compatibility for the few devices already in the field. For
this we still keep the support of the legacy compatible string with a big
warning in the kernel about updating the device tree.
Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Using the *-clock-output-names property was a convenient way to have a
unique name for each clock even when there are multiple cp110 blocks
as we can find on Armada 8K.
However it has some drawbacks: the main one being a stronger link than
necessary between the driver and the device tree. For example the clock
name can't be changed, removed or moved. It is still the early stage of
introduction of the Armada 7K/8K and the hardware is still not totally
documented, especially for the clock part. By removing the use of
*-clock-output-names it will be easier to add new clocks without breaking
the compatibility.
The name of each clock is now created by using its physical address as a
prefix (as it was done for the platform device names). Thanks to this we
have an automatic way to compute a unique name.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Document the device tree binding for the pin controllers found on the
Armada 7K and Armada 8K SoCs.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Document the device tree binding for the pin controllers found on the
Armada 7K and Armada 8K SoCs.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
This patch updates the documentation according to the change made in the
patch "clk: mvebu: cp110: add sdio clock to cp-110 system controller"
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
This patch updates the documentation according to the change made in the
patch "pinctrl: dt-bindings: cp110: introduce a new binding".
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
This patch updates the documentation according to the change made in the
patch "clk: mvebu: cp110: do not depend anymore of the
*-clock-output-names": the clock names are no more part of the binding.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
The dm-zoned device mapper target provides transparent write access
to zoned block devices (ZBC and ZAC compliant block devices).
dm-zoned hides to the device user (a file system or an application
doing raw block device accesses) any constraint imposed on write
requests by the device, equivalent to a drive-managed zoned block
device model.
Write requests are processed using a combination of on-disk buffering
using the device conventional zones and direct in-place processing for
requests aligned to a zone sequential write pointer position.
A background reclaim process implemented using dm_kcopyd_copy ensures
that conventional zones are always available for executing unaligned
write requests. The reclaim process overhead is minimized by managing
buffer zones in a least-recently-written order and first targeting the
oldest buffer zones. Doing so, blocks under regular write access (such
as metadata blocks of a file system) remain stored in conventional
zones, resulting in no apparent overhead.
dm-zoned implementation focus on simplicity and on minimizing overhead
(CPU, memory and storage overhead). For a 14TB host-managed disk with
256 MB zones, dm-zoned memory usage per disk instance is at most about
3 MB and as little as 5 zones will be used internally for storing metadata
and performing buffer zone reclaim operations. This is achieved using
zone level indirection rather than a full block indirection system for
managing block movement between zones.
dm-zoned primary target is host-managed zoned block devices but it can
also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.
Zoned block devices can be formatted and checked for use with the dm-zoned
target using the dmzadm utility available at:
https://github.com/hgst/dm-zoned-tools
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to cleanup the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When copyying blocks to host-managed zoned block devices, writes must be
sequential. However, dm_kcopyd_copy() does not guarantee this as writes
are issued in the completion order of reads, and reads may complete out
of order despite being issued sequentially.
Fix this by introducing the DM_KCOPYD_WRITE_SEQ feature flag. This can
be specified when calling dm_kcopyd_copy() and should be set
automatically if one of the destinations is a host-managed zoned block
device. For a split job, the master job maintains the write position at
which writes must be issued. This is checked with the pop() function
which is modified to not return any write I/O sub job that is not at the
correct write position.
When DM_KCOPYD_WRITE_SEQ is specified for a job, errors cannot be
ignored and the flag DM_KCOPYD_IGNORE_ERROR is ignored, even if
specified by the user.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add support for zoned block devices by allowing host-managed zoned block
device mapped targets, the remapping of REQ_OP_ZONE_RESET and the post
processing (reply remapping) of REQ_OP_ZONE_REPORT.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
With the development of file system support for zoned block devices
(e.g. f2fs), having dm-flakey support these devices is interesting
to improve testing.
Add host-aware and host-managed zoned block devices support to in
dm-flakey. The target type feature is set to DM_TARGET_ZONED_HM to
indicate support for host-managed models. Also add hooks for remapping
of REQ_OP_ZONE_RESET and REQ_OP_ZONE_REPORT bios. Additionally, in the
bio completion path, (backward) remapping of a zone report reply is
added.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A target driver support zoned block devices and exposing it as such may
receive REQ_OP_ZONE_REPORT request for the user to determine the mapped
device zone configuration. To process properly such request, the target
driver may need to remap the zone descriptors provided in the report
reply. The helper function dm_remap_zone_report() does this generically
using only the target start offset and length and the start offset
within the target device.
dm_remap_zone_report() will remap the start sector of all zones
reported. If the report includes sequential zones, the write pointer
position of these zones will also be remapped.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A REQ_OP_ZONE_REPORT bio is not a medium access command. Its number of
sectors indicates the maximum size allowed for the report reply size and
not an amount of sectors accessed from the device. REQ_OP_ZONE_REPORT
bios should thus not be split depending on the target device maximum I/O
length but passed as-is. Note that it is the responsability of the
target to remap and format the report reply.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The REQ_OP_ZONE_RESET bio has no payload and zero sectors. Its position
is the only information used to indicate the zone to reset on the
device. Due to its zero length, this bio is not cloned and sent to the
target through the non-flush case in __split_and_process_bio(). Add an
additional case in that function to call __split_and_process_non_flush()
without checking the clone info size.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
1) Introduce DM_TARGET_ZONED_HM feature flag:
The target drivers currently available will not operate correctly if a
table target maps onto a host-managed zoned block device.
To avoid problems, introduce the new feature flag DM_TARGET_ZONED_HM to
allow a target to explicitly state that it supports host-managed zoned
block devices. This feature is checked for all targets in a table if
any of the table's block devices are host-managed.
Note that as host-aware zoned block devices are backward compatible with
regular block devices, they can be used by any of the current target
types. This new feature is thus restricted to host-managed zoned block
devices.
2) Check device area zone alignment:
If a target maps to a zoned block device, check that the device area is
aligned on zone boundaries to avoid problems with REQ_OP_ZONE_RESET
operations (resetting a partially mapped sequential zone would not be
possible). This also facilitates the processing of zone report with
REQ_OP_ZONE_REPORT bios.
3) Check block devices zone model compatibility
When setting the DM device's queue limits, several possibilities exists
for zoned block devices:
1) The DM target driver may want to expose a different zone model
(e.g. host-managed device emulation or regular block device on top of
host-managed zoned block devices)
2) Expose the underlying zone model of the devices as-is
To allow both cases, the underlying block device zone model must be set
in the target limits in dm_set_device_limits() and the compatibility of
all devices checked similarly to the logical block size alignment. For
this last check, introduce validate_hardware_zoned_model() to check that
all targets of a table have the same zone model and that the zone size
of the target devices are equal.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer refactored Damien's original work to simplify the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Using pr_<level> is the more common logging style.
Standardize style and use new macro DM_FMT.
Use no_printk in DMDEBUG macros when CONFIG_DM_DEBUG is not #defined.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The big-endian IV (plain64be) is needed to map images from extracted
disks that are used in some external (on-chip FDE) disk encryption
drives, e.g.: data recovery from external USB/SATA drives that support
"internal" encryption.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Report the event numbers for all the devices, so that the user doesn't
have to ask them one by one. The event number is reported after the
name field in the dm_name_list structure.
The location of the next record is specified in the dm_name_list->next
field, that means that we can put the new data after the end of name and
it is backward compatible with the old code. The old code just skips
the event number without interpreting it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This ioctl will record the current global event number in the structure
dm_file, so that next select or poll call will wait until new events
arrived since this ioctl.
The DM_DEV_ARM_POLL ioctl has the same effect as closing and reopening
the handle.
Using the DM_DEV_ARM_POLL ioctl is optional - if the userspace is OK
with closing and reopening the /dev/mapper/control handle after select
or poll, there is no need to re-arm via ioctl.
Usage:
1. open the /dev/mapper/control device
2. send the DM_DEV_ARM_POLL ioctl
3. scan the event numbers of all devices we are interested in and process
them
4. call select, poll or epoll on the handle (it waits until some new event
happens since the DM_DEV_ARM_POLL ioctl)
5. go to step 2
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add the ability to poll on the /dev/mapper/control device. The select
or poll function waits until any event happens on any dm device since
opening the /dev/mapper/control device. When select or poll returns the
device as readable, we must close and reopen the device to wait for new
dm events.
Usage:
1. open the /dev/mapper/control device
2. scan the event numbers of all devices we are interested in and process
them
3. call select, poll or epoll on the handle (it waits until some new event
happens since opening the device)
4. close the /dev/mapper/control handle
5. go to step 1
The next commit allows to re-arm the polling without closing and
reopening the device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>