In commit f2f57f51b5 ("io_uring/af_unix: disable sending io_uring over
sockets") a new .h file was added to the include list, which broke the
crc generation checks with the following error:
function symbol 'int put_cmsg(struct msghdr*, int, int, int, void*)' changed
CRC changed from 0x31108fe3 to 0xd66fe827
Fix this by only including the .h file if the crc checker is not being
run.
Bug: 161946584
Fixes: f2f57f51b5 ("io_uring/af_unix: disable sending io_uring over sockets")
Change-Id: Ie7a6d5627f169a0fea3eac2b43024cff977b8360
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Export sysctl_sched_min_granularity and
sysctl_sched_idle_min_granularity. In the vendor module, it will use
several static function in GKI, while we do not want to export these
static functions, which will need to make them not static, we copied
them to the vendor module, so we need the export the symbols used in
those static functions. For example, sysctl_sched_min_granularity
and sysctl_sched_idle_min_granularity are referred in sched_slice(),
and they are only used as read-only.
Bug: 316276520
Change-Id: I976d0a1f3a70e8e60099e55fdd3cc99a90053fbb
Signed-off-by: Rick Yiu <rickyiu@google.com>
Setting the PARKMODE_DISABLE_HS bit in the DWC3_USB3_GUCTL1.
When this bit is set to '1' all HS bus instances in park mode are disabled
For some USB wifi devices, if enable this feature it will reduce the
performance. Therefore, add an option for disabling HS park mode by
device-tree.
In Synopsys's dwc3 data book:
In a few high speed devices when an IN request is sent within 900ns of the
ACK of the previous packet, these devices send a NAK. When connected to
these devices, if required, the software can disable the park mode if you
see performance drop in your system. When park mode is disabled,
pipelining of multiple packet is disabled and instead one packet at a time
is requested by the scheduler. This allows up to 12 NAKs in a micro-frame
and improves performance of these slow devices.
Bug: 300024866
Acked-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Signed-off-by: Stanley Chang <stanley_chang@realtek.com>
Link: https://lore.kernel.org/r/20230419020044.15475-1-stanley_chang@realtek.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: William Wu <william.wu@rock-chips.com>
(cherry picked from commit d21a797a3e)
Change-Id: I43ee416e54779a073a0ba4057edf4be8bd7886de
Signed-off-by: Kever Yang <kever.yang@rock-chips.com>
This reverts commit 75b5016ce3 which is
commit 5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94 upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: Ifdc5474de6e7488eb24959df1697b61b2e3e7880
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit b5ca945612 which is
commit e03781879a0d524ce3126678d50a80484a513c4b upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: Iecbd6b6537bd4cd2d178d0afbdc7557e521429c5
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
As we reserve only 1GB of memory for the MMIO region don't prepopulate
the entire remaining address space with MMIO as this is prone to failure.
Instead, let the MMIO regions to be created lazily on the fault path and
keep only the RAM regions prepopulated.
Bug: 307805059
Test: Boot pKVM with CONFIG_ARM64_16K_PAGES
Change-Id: I6327f42eb17c6588335a1e04736393c9032114ab
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Changes in 6.1.68
vdpa/mlx5: preserve CVQ vringh index
hrtimers: Push pending hrtimers away from outgoing CPU earlier
i2c: designware: Fix corrupted memory seen in the ISR
netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
zstd: Fix array-index-out-of-bounds UBSAN warning
tg3: Move the [rt]x_dropped counters to tg3_napi
tg3: Increment tx_dropped in tg3_tso_bug()
kconfig: fix memory leak from range properties
drm/amdgpu: correct chunk_ptr to a pointer to chunk.
x86: Introduce ia32_enabled()
x86/coco: Disable 32-bit emulation by default on TDX and SEV
x86/entry: Convert INT 0x80 emulation to IDTENTRY
x86/entry: Do not allow external 0x80 interrupts
x86/tdx: Allow 32-bit emulation by default
dt: dt-extract-compatibles: Handle cfile arguments in generator function
dt: dt-extract-compatibles: Don't follow symlinks when walking tree
platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
of: dynamic: Fix of_reconfig_get_state_change() return value documentation
platform/x86: wmi: Skip blocks with zero instances
ipv6: fix potential NULL deref in fib6_add()
octeontx2-pf: Add missing mutex lock in otx2_get_pauseparam
octeontx2-af: Check return value of nix_get_nixlf before using nixlf
hv_netvsc: rndis_filter needs to select NLS
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
r8152: Add RTL8152_INACCESSIBLE checks to more loops
r8152: Add RTL8152_INACCESSIBLE to r8156b_wait_loading_flash()
r8152: Add RTL8152_INACCESSIBLE to r8153_pre_firmware_1()
r8152: Add RTL8152_INACCESSIBLE to r8153_aldps_en()
mlxbf-bootctl: correctly identify secure boot with development keys
platform/mellanox: Add null pointer checks for devm_kasprintf()
platform/mellanox: Check devm_hwmon_device_register_with_groups() return value
arcnet: restoring support for multiple Sohard Arcnet cards
octeontx2-pf: consider both Rx and Tx packet stats for adaptive interrupt coalescing
net: stmmac: fix FPE events losing
xsk: Skip polling event check for unbound socket
octeontx2-af: fix a use-after-free in rvu_npa_register_reporters
i40e: Fix unexpected MFS warning message
iavf: validate tx_coalesce_usecs even if rx_coalesce_usecs is zero
net: bnxt: fix a potential use-after-free in bnxt_init_tc
tcp: fix mid stream window clamp.
ionic: fix snprintf format length warning
ionic: Fix dim work handling in split interrupt mode
ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()
net: atlantic: Fix NULL dereference of skb pointer in
net: hns: fix wrong head when modify the tx feature when sending packets
net: hns: fix fake link up on xge port
octeontx2-af: Adjust Tx credits when MCS external bypass is disabled
octeontx2-af: Fix mcs sa cam entries size
octeontx2-af: Fix mcs stats register address
octeontx2-af: Add missing mcs flr handler call
octeontx2-af: Update Tx link register range
dt-bindings: interrupt-controller: Allow #power-domain-cells
netfilter: nft_exthdr: add boolean DCCP option matching
netfilter: nf_tables: fix 'exist' matching on bigendian arches
netfilter: nf_tables: bail out on mismatching dynset and set expressions
netfilter: nf_tables: validate family when identifying table via handle
netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
tcp: do not accept ACK of bytes we never sent
bpf: sockmap, updating the sg structure should also update curr
psample: Require 'CAP_NET_ADMIN' when joining "packets" group
drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
mm/damon/sysfs: eliminate potential uninitialized variable warning
tee: optee: Fix supplicant based device enumeration
RDMA/hns: Fix unnecessary err return when using invalid congest control algorithm
RDMA/irdma: Do not modify to SQD on error
RDMA/irdma: Add wait for suspend on SQD
arm64: dts: rockchip: Expand reg size of vdec node for RK3328
arm64: dts: rockchip: Expand reg size of vdec node for RK3399
ASoC: fsl_sai: Fix no frame sync clock issue on i.MX8MP
RDMA/rtrs-srv: Do not unconditionally enable irq
RDMA/rtrs-clt: Start hb after path_up
RDMA/rtrs-srv: Check return values while processing info request
RDMA/rtrs-srv: Free srv_mr iu only when always_invalidate is true
RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
RDMA/rtrs-clt: Fix the max_send_wr setting
RDMA/rtrs-clt: Remove the warnings for req in_use check
RDMA/bnxt_re: Correct module description string
RDMA/irdma: Refactor error handling in create CQP
RDMA/irdma: Fix UAF in irdma_sc_ccq_get_cqe_info()
hwmon: (acpi_power_meter) Fix 4.29 MW bug
ASoC: codecs: lpass-tx-macro: set active_decimator correct default value
hwmon: (nzxt-kraken2) Fix error handling path in kraken2_probe()
ASoC: wm_adsp: fix memleak in wm_adsp_buffer_populate
RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
RDMA/irdma: Avoid free the non-cqp_request scratch
drm/bridge: tc358768: select CONFIG_VIDEOMODE_HELPERS
arm64: dts: imx8mq: drop usb3-resume-missing-cas from usb
arm64: dts: imx8mp: imx8mq: Add parkmode-disable-ss-quirk on DWC3
ARM: dts: imx6ul-pico: Describe the Ethernet PHY clock
tracing: Fix a warning when allocating buffered events fails
scsi: be2iscsi: Fix a memleak in beiscsi_init_wrb_handle()
ARM: imx: Check return value of devm_kasprintf in imx_mmdc_perf_init
ARM: dts: imx7: Declare timers compatible with fsl,imx6dl-gpt
ARM: dts: imx28-xea: Pass the 'model' property
riscv: fix misaligned access handling of C.SWSP and C.SDSP
md: introduce md_ro_state
md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()
iommu: Avoid more races around device probe
rethook: Use __rcu pointer for rethook::handler
kprobes: consistent rcu api usage for kretprobe holder
ASoC: amd: yc: Fix non-functional mic on ASUS E1504FA
io_uring/af_unix: disable sending io_uring over sockets
nvme-pci: Add sleep quirk for Kingston drives
io_uring: fix mutex_unlock with unreferenced ctx
ALSA: usb-audio: Add Pioneer DJM-450 mixer controls
ALSA: pcm: fix out-of-bounds in snd_pcm_state_names
ALSA: hda/realtek: Enable headset on Lenovo M90 Gen5
ALSA: hda/realtek: add new Framework laptop to quirks
ALSA: hda/realtek: Add Framework laptop 16 to quirks
ring-buffer: Test last update in 32bit version of __rb_time_read()
nilfs2: fix missing error check for sb_set_blocksize call
nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
cgroup_freezer: cgroup_freezing: Check if not frozen
checkstack: fix printed address
tracing: Always update snapshot buffer size
tracing: Disable snapshot buffer when stopping instance tracers
tracing: Fix incomplete locking when disabling buffered events
tracing: Fix a possible race when disabling buffered events
packet: Move reference count in packet_sock to atomic_long_t
r8169: fix rtl8125b PAUSE frames blasting when suspended
regmap: fix bogus error on regcache_sync success
platform/surface: aggregator: fix recv_buf() return value
hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
mm: fix oops when filemap_map_pmd() without prealloc_pte
powercap: DTPM: Fix missing cpufreq_cpu_put() calls
md/raid6: use valid sector values to determine if an I/O should wait on the reshape
arm64: dts: mediatek: mt7622: fix memory node warning check
arm64: dts: mediatek: mt8183-kukui-jacuzzi: fix dsi unnecessary cells properties
arm64: dts: mediatek: cherry: Fix interrupt cells for MT6360 on I2C7
arm64: dts: mediatek: mt8173-evb: Fix regulator-fixed node names
arm64: dts: mediatek: mt8195: Fix PM suspend/resume with venc clocks
arm64: dts: mediatek: mt8183: Fix unit address for scp reserved memory
arm64: dts: mediatek: mt8183: Move thermal-zones to the root node
arm64: dts: mediatek: mt8183-evb: Fix unit_address_vs_reg warning on ntc
binder: fix memory leaks of spam and pending work
coresight: etm4x: Make etm4_remove_dev() return void
coresight: etm4x: Remove bogous __exit annotation for some functions
hwtracing: hisi_ptt: Add dummy callback pmu::read()
misc: mei: client.c: return negative error code in mei_cl_write
misc: mei: client.c: fix problem of return '-EOVERFLOW' in mei_cl_write
LoongArch: BPF: Don't sign extend memory load operand
LoongArch: BPF: Don't sign extend function return value
ring-buffer: Force absolute timestamp on discard of event
tracing: Set actual size after ring buffer resize
tracing: Stop current tracer when resizing buffer
parisc: Reduce size of the bug_table on 64-bit kernel by half
parisc: Fix asm operand number out of range build error in bug table
arm64: dts: mediatek: add missing space before {
arm64: dts: mt8183: kukui: Fix underscores in node names
perf: Fix perf_event_validate_size()
x86/sev: Fix kernel crash due to late update to read-only ghcb_version
gpiolib: sysfs: Fix error handling on failed export
drm/amdgpu: fix memory overflow in the IB test
drm/amd/amdgpu: Fix warnings in amdgpu/amdgpu_display.c
drm/amdgpu: correct the amdgpu runtime dereference usage count
drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10
drm/amdgpu: Add EEPROM I2C address support for ip discovery
drm/amdgpu: Remove redundant I2C EEPROM address
drm/amdgpu: Decouple RAS EEPROM addresses from chips
drm/amdgpu: Add support for RAS table at 0x40000
drm/amdgpu: Remove second moot switch to set EEPROM I2C address
drm/amdgpu: Return from switch early for EEPROM I2C address
drm/amdgpu: simplify amdgpu_ras_eeprom.c
drm/amdgpu: Add I2C EEPROM support on smu v13_0_6
drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
usb: gadget: f_hid: fix report descriptor allocation
serial: 8250_dw: Add ACPI ID for Granite Rapids-D UART
parport: Add support for Brainboxes IX/UC/PX parallel cards
cifs: Fix non-availability of dedup breaking generic/304
Revert "xhci: Loosen RPM as default policy to cover for AMD xHC 1.1"
smb: client: fix potential NULL deref in parse_dfs_referrals()
usb: typec: class: fix typec_altmode_put_partner to put plugs
ARM: PL011: Fix DMA support
serial: sc16is7xx: address RX timeout interrupt errata
serial: 8250: 8250_omap: Clear UART_HAS_RHR_IT_DIS bit
serial: 8250: 8250_omap: Do not start RX DMA on THRI interrupt
serial: 8250_omap: Add earlycon support for the AM654 UART controller
devcoredump: Send uevent once devcd is ready
x86/CPU/AMD: Check vendor in the AMD microcode callback
USB: gadget: core: adjust uevent timing on gadget unbind
cifs: Fix flushing, invalidation and file size with copy_file_range()
cifs: Fix flushing, invalidation and file size with FICLONE
MIPS: kernel: Clear FPU states when setting up kernel threads
KVM: s390/mm: Properly reset no-dat
KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
MIPS: Loongson64: Reserve vgabios memory on boot
MIPS: Loongson64: Handle more memory types passed from firmware
MIPS: Loongson64: Enable DMA noncoherent support
netfilter: nft_set_pipapo: skip inactive elements during set walk
riscv: Kconfig: Add select ARM_AMBA to SOC_STARFIVE
drm/i915/display: Drop check for doublescan mode in modevalid
drm/i915/lvds: Use REG_BIT() & co.
drm/i915/sdvo: stop caching has_hdmi_monitor in struct intel_sdvo
drm/i915: Skip some timing checks on BXT/GLK DSI transcoders
Linux 6.1.68
Change-Id: I0a824071a80b24dc4a2e0077f305b7cac42235b8
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance. __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.
The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare). Since it is only a replacement operation for each element,
this process is also O(n).
Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes. The frequency of
each action is difficult to calculate. The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level. If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound. However, based on the test data, it appears that
the actual time complexity is likely to be O(n).
As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and the this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
[1] https://github.com/kdlucas/byte-unixbench/tree/master
Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d2406291483775ecddaee929231a39c70c08fda2
https://git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
[surenb: open-coded vma_iter_clear_gfp(), vma_iter_bulk_store();
replaced vma_next() with mas_find()]
Bug: 308042511
Change-Id: I42d6620e8ce6a0b16211c231a9b72ba16ba9c0d2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree. They duplicate a maple tree in Depth-First Search
(DFS) pre-order traversal. It uses memcopy() to copy nodes in the source
tree and allocate new child nodes in non-leaf nodes. The new node is
exactly the same as the source node except for all the addresses stored in
it. It will be faster than traversing all elements in the source tree and
inserting them one by one into the new tree. The time complexity of these
two functions is O(n).
The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.
Analysis of the average time complexity of this algorithm:
For simplicity, let's assume that the maximum branching factor of all
non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a
full tree.
Under the given conditions, if there is a maple tree with n elements, the
number of its leaves is n/16. From bottom to top, the number of nodes in
each level is 1/16 of the number of nodes in the level below. So the
total number of nodes in the entire tree is given by the sum of n/16 +
n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has log(n)
terms with base 16. According to the formula for the sum of a geometric
series, the sum of this series can be calculated as (n-1)/15. Each node
has only one parent node pointer, which can be considered as an edge. In
total, there are (n-1)/15-1 edges.
This algorithm consists of two operations:
1. Traversing all nodes in DFS order.
2. For each node, making a copy and performing necessary modifications
to create a new node.
For the first part, DFS traversal will visit each edge twice. Let
T(ascend) represent the cost of taking one step downwards, and T(descend)
represent the cost of taking one step upwards. And both of them are
constants (although mas_ascend() may not be, as it contains a loop, but
here we ignore it and treat it as a constant). So the time spent on the
first part can be represented as ((n-1)/15-1) * (T(ascend) + T(descend)).
For the second part, each node will be copied, and the cost of copying a
node is denoted as T(copy_node). For each non-leaf node, it is necessary
to reallocate all child nodes, and the cost of this operation is denoted
as T(dup_alloc). The behavior behind memory allocation is complex and not
specific to the maple tree operation. Here, we assume that the time
required for a single allocation is constant. Since the size of a node is
fixed, both of these symbols are also constants. We can calculate that
the time spent on the second part is ((n-1)/15) * T(copy_node) + ((n-1)/15
- n/16) * T(dup_alloc).
Adding both parts together, the total time spent by the algorithm can be
represented as:
((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) -
n/16 * T(dup_alloc) - (T(ascend) + T(descend))
Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)
Let C2 = T(dup_alloc)
Let C3 = T(ascend) + T(descend)
Finally, the expression can be simplified as:
((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3).
This is a linear function, so the average time complexity is O(n).
Link: https://lkml.kernel.org/r/20231027033845.90608-4-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit fd32e4e9b7646510ee9010e0d5f8b8857d48a6f7
https://git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
Bug: 308042511
Change-Id: I385759a1184a202498e086458b572c203616b9b4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Patch series "Introduce __mt_dup() to improve the performance of fork()", v7.
This series introduces __mt_dup() to improve the performance of fork().
During the duplication process of mmap, all VMAs are traversed and
inserted one by one into the new maple tree, causing the maple tree to be
rebalanced multiple times. Balancing the maple tree is a costly
operation. To duplicate VMAs more efficiently, mtree_dup() and __mt_dup()
are introduced for the maple tree. They can efficiently duplicate a maple
tree.
Here are some algorithmic details about {mtree,__mt}_dup(). We perform a
DFS pre-order traversal of all nodes in the source maple tree. During
this process, we fully copy the nodes from the source tree to the new
tree. This involves memory allocation, and when encountering a new node,
if it is a non-leaf node, all its child nodes are allocated at once.
This idea was originally from Liam R. Howlett's Maple Tree Work email,
and I added some of my own ideas to implement it. Some previous
discussions can be found in [1]. For a more detailed analysis of the
algorithm, please refer to the logs for patch [3/10] and patch [10/10].
There is a "spawn" in byte-unixbench[2], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and the this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
Thanks to Liam and Matthew for the review.
This patch (of 10):
Add two helpers:
1. mt_free_one(), used to free a maple node.
2. mt_attr(), used to obtain the attributes of maple tree.
Link: https://lkml.kernel.org/r/20231027033845.90608-1-zhangpeng.00@bytedance.com
Link: https://lkml.kernel.org/r/20231027033845.90608-2-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4f2267b58a22d972be98edef8e6b3c7a67c9fb91
https://git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
Bug: 308042511
Change-Id: Ib9b13dee357ac4c85668901c20a3c370fbdd08da
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
This reverts commit 38d3216032 which is
commit 8d91f3f8ae upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: I67705a9f4b04cf051f9b0f1a3494284ae0122da2
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit 8b01195be4 which is
commit 477865af60b2117ceaa1d558e03559108c15c78c upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: I3e7316f074393f1b84e8d6a7a845060c268366b0
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Adding the following symbols:
- __drmm_crtc_alloc_with_planes
Bug: 275278929
Change-Id: I41b6069612d44214f474ed82ee2a4b07ca739302
Signed-off-by: Ken Huang <kenbshuang@google.com>
The structures that define hyp events must be packed so they match
their format definitions in the tracefs file
hyp/events/hyp/<event>/format.
Bug: 299430621
Change-Id: Ia7e1a686744d5c9c3f8a21881f03228c8acecade
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
From pKVM point of view, unknown SMCs are simply forwarded, we can't
consider them invalid or not. This was probably a typo following a copy
of the host_hcall event.
Bug: 299430621
Change-Id: Ieb53f985a5187a8b5a9feb4a95982b15cdc1b04a
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
If we return the error, there's no way to recover the status as of now, since
fsck does not fix the xattr boundary issue.
Bug: 305658663
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 50a472bbc79ff9d5a88be8019a60e936cadf9f13
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I55060a4eede3f5f85066aba22a6ab7155517e5c4
(cherry picked from commit 70113b9d489050d3e7a6f28e0cd6e43f104fc132)
(cherry picked from commit 2c1f3789d609bd549f14c019b6c7b311bfd2fa64)
When this pKVM module ops has been introduced, the documentation has
been omitted.
Bug: 308373293
Change-Id: I9e471414e72a1ee04c132de4ed95d77e815ae8c9
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
This reverts commit 38d3216032 which is
commit 8d91f3f8ae upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: I67705a9f4b04cf051f9b0f1a3494284ae0122da2
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit 8b01195be4 which is
commit 477865af60b2117ceaa1d558e03559108c15c78c upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: I3e7316f074393f1b84e8d6a7a845060c268366b0
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
I think this is a pretty rare occurrence, but for consistency handle
faults with the VMA lock held the same way that we handle other faults
with the VMA lock held.
Link: https://lkml.kernel.org/r/20231006195318.4087158-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4a68fef16df9d88d528094116f8bbd2dbfa62089)
Bug: 293665307
Change-Id: I69cec218c8a1fe14df3268722e6b1be6dffe7978
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Most file-backed faults are already handled through ->map_pages(), but if
we need to do I/O we'll come this way. Since filemap_fault() is now safe
to be called under the VMA lock, we can handle these faults under the VMA
lock now.
Link: https://lkml.kernel.org/r/20231006195318.4087158-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 12214eba1992642eee5813a9cc9f626e5b2d1815)
Bug: 293665307
Change-Id: Iee48af98b866d88d88ec01143eb26389ab373b6b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
If the page is not currently present in the page tables, we need to call
the page fault handler to find out which page we're supposed to COW, so we
need to both check that there is already an anon_vma and that the fault
handler doesn't need the mmap_lock.
Link: https://lkml.kernel.org/r/20231006195318.4087158-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4de8c93a4751e10737b6af65db42c743228c67a6)
Bug: 293665307
Change-Id: If749a6f8fcf69d83bbf872c1d45865d1b1b77ea0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
There are many implementations of ->fault and some of them depend on
mmap_lock being held. All vm_ops that implement ->map_pages() end up
calling filemap_fault(), which I have audited to be sure it does not rely
on mmap_lock. So (for now) key off ->map_pages existing as a flag to
indicate that it's safe to call ->fault while only holding the vma lock.
Link: https://lkml.kernel.org/r/20231006195318.4087158-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4ed4379881aa62588aba6442a9f362a8cf7624e6)
Bug: 293665307
Change-Id: Ifb5ab3df5d05fb182d0cb52820fa24e28e2d6496
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
It is usually safe to call wp_page_copy() under the VMA lock. The only
unsafe situation is when no anon_vma has been allocated for this VMA, and
we have to look at adjacent VMAs to determine if their anon_vma can be
shared. Since this happens only for the first COW of a page in this VMA,
the majority of calls to wp_page_copy() do not need to fall back to the
mmap_sem.
Add vmf_anon_prepare() as an alternative to anon_vma_prepare() which will
return RETRY if we currently hold the VMA lock and need to allocate an
anon_vma. This lets us drop the check in do_wp_page().
Link: https://lkml.kernel.org/r/20231006195318.4087158-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 164b06f238b986317131e6b61b2f22aabcbc2cc0)
[surenb: resolved merge conflicts due to folio/page differences]
Bug: 293665307
Change-Id: I39bdc247b375bd3dae8078b52c60fd4ce12e1850
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Patch series "Handle more faults under the VMA lock", v2.
At this point, we're handling the majority of file-backed page faults
under the VMA lock, using the ->map_pages entry point. This patch set
attempts to expand that for the following siutations:
- We have to do a read. This could be because we've hit the point in
the readahead window where we need to kick off the next readahead,
or because the page is simply not present in cache.
- We're handling a write fault. Most applications don't do I/O by writes
to shared mmaps for very good reasons, but some do, and it'd be nice
to not make that slow unnecessarily.
- We're doing a COW of a private mapping (both PTE already present
and PTE not-present). These are two different codepaths and I handle
both of them in this patch set.
There is no support in this patch set for drivers to mark themselves as
being VMA lock friendly; they could implement the ->map_pages
vm_operation, but if they do, they would be the first. This is probably
something we want to change at some point in the future, and I've marked
where to make that change in the code.
There is very little performance change in the benchmarks we've run;
mostly because the vast majority of page faults are handled through the
other paths. I still think this patch series is useful for workloads that
may take these paths more often, and just for cleaning up the fault path
in general (it's now clearer why we have to retry in these cases).
This patch (of 6):
Drop the VMA lock instead of the mmap_lock if that's the one which
is held.
Link: https://lkml.kernel.org/r/20231006195318.4087158-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231006195318.4087158-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 5d74b2ab2c15d596c470bae6626f345d5575a9d0)
Bug: 293665307
Change-Id: Ife2d11ab12fb428868cd44751784cf731fbffe62
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
For modules to reuse default dma_map_ops implementations they need to be
exported. Export the following functions:
dma_direct_alloc
dma_direct_free
dma_common_mmap
dma_common_get_sgtable
dma_direct_get_required_mask
Bug: 151050914
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia77b797fcd909fce01da7431bfbde282dc70b3b3
(cherry picked from commit fd31496dae939c7bf2ef874e08d4bf8c6ab738b3)
Signed-off-by: Qian-Hao Huang <qhhuang@google.com>
(cherry picked from commit cdc9f6ef94)
Sub-page block support is still unusable even with previous commits if
interlaced PLAIN pclusters exist. Such pclusters can be found if the
fragment feature is enabled.
This commit tries to handle "the head part" of interlaced PLAIN
pclusters first: it was once explained in commit fdffc091e6 ("erofs:
support interlaced uncompressed data for compressed files").
It uses a unique way for both shifted and interlaced PLAIN pclusters.
As an added bonus, PLAIN pclusters larger than the block size is also
supported now for the upcoming large lclusters.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-5-hsiangkao@linux.alibaba.com
Bug: 318378021
Change-Id: I3d50132664f8754f56d62744420060108ed0da4f
(cherry picked from commit 192351616a9dde686492bcb9d1e4895a1411a527
https: //git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Previously, the block size always equaled to PAGE_SIZE, therefore
`lclusterbits` couldn't be less than 12.
Since sub-page compressed blocks are now considered, `lobits` for
a lcluster in each pack cannot always be `lclusterbits` as before.
Otherwise, there is no enough room for the special value
`Z_EROFS_VLE_DI_D0_CBLKCNT`.
To support smaller block sizes, `lobits` for each compacted lcluster is
now calculated as:
lobits = max(lclusterbits, ilog2(Z_EROFS_VLE_DI_D0_CBLKCNT) + 1)
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-4-hsiangkao@linux.alibaba.com
Bug: 318378021
Change-Id: Iacd89e2b33ddf39ea40b90e88a2bf99bb5a83b31
(cherry picked from commit 8d2517aaeea3ab8651bb517bca8f3c8664d318ea
https: //git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
[dhavale: resolved conflicts in zmap.c due to older naming of constants
and updated commit message also to use the older names]
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Currently, compressed sizes are recorded in pages using `pclusterpages`,
However, for tailpacking pclusters, `tailpacking_size` is used instead.
This approach doesn't work when dealing with sub-page blocks. To address
this, let's switch them to the unified `pclustersize` in bytes.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-3-hsiangkao@linux.alibaba.com
Bug: 318378021
Change-Id: Ia8c50a7b4adcf6cd161b1d6f8bfc5a7fd3371079
(cherry picked from commit 54ed3fdd66055d073cb1cd2c6c65bbc0683c40cf
https: //git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Add a basic I/O submission path first to support sub-page blocks:
- Temporary short-lived pages will be used entirely;
- In-place I/O pages can be used partially, but compressed pages need
to be able to be mapped in contiguous virtual memory.
As a start, currently cache decompression is explicitly disabled for
sub-page blocks, which will be supported in the future.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-2-hsiangkao@linux.alibaba.com
Bug: 318378021
Change-Id: Ib2cb6120805ab479a450580fc8774af131271791
(cherry picked from commit 192351616a9dde686492bcb9d1e4895a1411a527
https: //git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Currently EROFS can map another compressed buffer for inplace
decompression, that was used to handle the cases that some pages of
compressed data are actually not in-place I/O.
However, like most simple LZ77 algorithms, LZ4 expects the compressed
data is arranged at the end of the decompressed buffer and it
explicitly uses memmove() to handle overlapping:
__________________________________________________________
|_ direction of decompression --> ____ |_ compressed data _|
Although EROFS arranges compressed data like this, it typically maps two
individual virtual buffers so the relative order is uncertain.
Previously, it was hardly observed since LZ4 only uses memmove() for
short overlapped literals and x86/arm64 memmove implementations seem to
completely cover it up and they don't have this issue. Juhyung reported
that EROFS data corruption can be found on a new Intel x86 processor.
After some analysis, it seems that recent x86 processors with the new
FSRM feature expose this issue with "rep movsb".
Let's strictly use the decompressed buffer for lz4 inplace
decompression for now. Later, as an useful improvement, we could try
to tie up these two buffers together in the correct order.
Reported-and-tested-by: Juhyung Park <qkrwngud825@gmail.com>
Closes: https://lore.kernel.org/r/CAD14+f2AVKf8Fa2OO1aAUdDNTDsVzzR6ctU_oJSmTyd6zSYR2Q@mail.gmail.com
Fixes: 0ffd71bcc3 ("staging: erofs: introduce LZ4 decompression inplace")
Fixes: 598162d050 ("erofs: support decompress big pcluster for lz4 backend")
Cc: stable <stable@vger.kernel.org> # 5.4+
Tested-by: Yifan Zhao <zhaoyifan@sjtu.edu.cn>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206045534.3920847-1-hsiangkao@linux.alibaba.com
Bug: 318378021
Change-Id: Ifd2981320f9f79b27bc7484d8906501a2fa05359
(cherry picked from commit 3c12466b6b7bf1e56f9b32c366a3d83d87afb4de
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
Signed-off-by: Sandeep Dhavale <dhavale@google.com>