[ Upstream commit bce22552f9 ]
Now that the backlog manages the reschedule() logic correctly we can drop
the partial fix to reschedule from recvmsg hook.
Rescheduling on recvmsg hook was added to address a corner case where we
still had data in the backlog state but had nothing to kick it and
reschedule the backlog worker to run and finish copying data out of the
state. This had a couple of limitations. First, it required user space to
kick it, introducing an unnecessary EBUSY and retry. Second, it only
handled the ingress case, so egress redirects would still hang.
With the correct fix, which pushes the reschedule logic down to where the
ENOMEM error occurs, we can drop this fix.
Fixes: bec217197b ("skmsg: Schedule psock work if the cached skb exists on the psock")
Change-Id: Ibf8b70dbeca5122c2ef954504dbe44724456899e
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 1e4e379ccd)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 29173d07f7 ]
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.
The flow for Cilium's common use case with a stream parser is,
tcp_read_sock()
sk_psock_verdict_recv
ret = bpf_prog_run_pin_on_cpu()
sk_psock_verdict_apply(sock, skb, ret)
// if system is under memory pressure or app is slow we may
// need to queue skb. Do this queuing through ingress_skb and
// then kick timer to wake up handler
skb_queue_tail(ingress_skb, skb)
schedule_work(work);
The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, it's possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because it's possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.
Now for the trouble: we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when it's the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.
To fix this we could do a simple schedule_work(), but while under memory
pressure it makes sense to back off a little instead of retrying
repeatedly. So instead, convert schedule_work() to schedule_delayed_work()
and add backoff logic to reschedule from the backlog queue on errors. It's
not obvious what a good backoff is, so use '1'.
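The retry behaviour described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not the
kernel code: struct backlog, try_send(), backlog_run(), backlog_drain() and
the tick counter are hypothetical stand-ins for the psock backlog state, the
send path and the delayed work. Only the delay of 1 is taken from this commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BACKOFF_TICKS 1 /* mirrors the '1' delay picked in this commit */

/* Hypothetical model of the per-psock backlog state. */
struct backlog {
	size_t len;       /* total bytes in the cached skb */
	size_t off;       /* bytes already sent, i.e. where to restart */
	int fail_budget;  /* how many sends will still fail (simulated pressure) */
	long next_run;    /* tick at which the worker is scheduled, -1 if idle */
	int runs;         /* number of times the worker ran */
};

/* Simulated send: sends up to 'chunk' bytes or fails with -ENOMEM. */
static int try_send(struct backlog *b, size_t chunk)
{
	size_t left = b->len - b->off;

	if (b->fail_budget > 0) {
		b->fail_budget--;
		return -ENOMEM;
	}
	if (chunk > left)
		chunk = left;
	b->off += chunk;
	return (int)chunk;
}

/* One worker invocation at tick 'now'. On error, keep the offset and
 * reschedule with a delay instead of waiting for an external kick. */
static void backlog_run(struct backlog *b, long now, size_t chunk)
{
	assert(b->next_run == now);
	b->runs++;
	b->next_run = -1;

	while (b->off < b->len) {
		int ret = try_send(b, chunk);

		if (ret < 0) {
			b->next_run = now + BACKOFF_TICKS;
			return;
		}
	}
}

/* Drive the model until the worker goes idle; returns the final tick. */
static long backlog_drain(struct backlog *b, size_t chunk)
{
	long now = 0;

	b->next_run = 0;
	while (b->next_run >= 0) {
		now = b->next_run;
		backlog_run(b, now, chunk);
	}
	return now;
}
```

In this model the worker always reaches the end of the cached data without
any outside event, which is the property the stall described above lacked.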
To test, we observed some flakes while running the NGINX compliance test
with sockmap; we attributed these failed tests to this bug and subsequent issue.
From on-list discussion: this commit
bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")
was intended to address a similar race, but missed a couple of cases.
Most obviously, it only accounted for receiving traffic on the local socket,
so if redirecting into another socket we could still get an sk_buff stuck
here. Next, it missed the case where copied=0 in the recv() handler, where
we wouldn't kick the scheduler. Also, it's sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to the user. It results in an extra syscall and requires the app
to actually handle the EAGAIN correctly.
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Change-Id: I61dbe914b0abf5f0f7e16f95d246c8e4fa0f5afa
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 9f4d7efb33)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 78fa0d61d9 ]
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.
This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,
skb_linearize()
__pskb_pull_tail()
pskb_expand_head()
BUG_ON(skb_shared(skb))
Because we incremented the users refcnt from sk_psock_verdict_recv(), we
reach the BUG_ON() with refcnt > 1 and trip it.
To fix, let's simply pass ownership of the sk_buff through the read_skb
call. Then we can drop the consume from the read_skb handlers and assume
the verdict recv does any required kfree.
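The difference between the two ownership schemes can be shown with a tiny
reference-counting model. This is a hedged user-space sketch: struct buf and
all buf_*() and read_hook_*() helpers are hypothetical, and buf_linearize()
returns -1 where the kernel would hit BUG_ON(skb_shared(skb)).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an sk_buff: only the reference count matters. */
struct buf {
	int users;
};

static struct buf *buf_alloc(void)
{
	struct buf *b = malloc(sizeof(*b));

	if (b)
		b->users = 1;
	return b;
}

static void buf_get(struct buf *b) { b->users++; }

/* Drop one reference and free on the last one; returns 1 if freed. */
static int buf_put(struct buf *b)
{
	assert(b->users > 0);
	if (--b->users > 0)
		return 0;
	free(b);
	return 1;
}

/* Models the linearize step: a shared buffer must not be expanded.
 * Reports -1 instead of crashing so the failure can be observed. */
static int buf_linearize(const struct buf *b)
{
	return b->users > 1 ? -1 : 0;
}

/* Old scheme: the read hook consumes its reference after the actor
 * returns, so the actor takes an extra one to keep the buffer alive. */
static int read_hook_shared(struct buf *b, struct buf **kept)
{
	int ret;

	buf_get(b);             /* actor's extra reference */
	ret = buf_linearize(b); /* users == 2: this is the failing case */
	*kept = b;
	buf_put(b);             /* hook's consume */
	return ret;
}

/* New scheme: ownership moves to the actor, no extra reference needed. */
static int read_hook_owned(struct buf *b, struct buf **kept)
{
	int ret = buf_linearize(b); /* users == 1: safe */

	*kept = b;                  /* actor now owns the only reference */
	return ret;
}
```

With ownership handed over, the buffer is never shared while the actor works
on it, so the linearize step cannot observe a count above one.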
Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.
[ 106.536188] ------------[ cut here ]------------
[ 106.536197] kernel BUG at net/core/skbuff.c:1693!
[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.542255] Call Trace:
[ 106.542383] <IRQ>
[ 106.542487] __pskb_pull_tail+0x4b/0x3e0
[ 106.542681] skb_ensure_writable+0x85/0xa0
[ 106.542882] sk_skb_pull_data+0x18/0x20
[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[ 106.543536] ? migrate_disable+0x66/0x80
[ 106.543871] sk_psock_verdict_recv+0xe2/0x310
[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
[ 106.544561] tcp_read_skb+0x7b/0x120
[ 106.544740] tcp_data_queue+0x904/0xee0
[ 106.544931] tcp_rcv_established+0x212/0x7c0
[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
[ 106.545326] tcp_v4_rcv+0xe70/0xf60
[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
[ 106.545744] ip_local_deliver_finish+0xa7/0x150
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Change-Id: I0dadf18f695e4305ba1043a7fbec7ef3f58baba7
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 4ae2af3e59)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Add an override for the platform specific callback that sets up the
xhci_vendor_ops, allowing the vendor to store the xhci_vendor_ops and
overwrite them when xhci_plat_probe() is invoked.
This change depends on the commit in this patch series
("usb: host: add xhci hooks for USB offload"). The vendor needs
to invoke xhci_plat_register_vendor_ops() to register the vendor specific
vendor_ops. The registered vendor_ops will then overwrite the vendor_ops
inside xhci_plat_priv in xhci_vendor_init() during xhci-plat-hcd probe.
Change-Id: I8030fe3bd274615f5926f19014c3a3e066ca9dba
Signed-off-by: Howard Yen <howardyen@google.com>
Bug: 175358363
Link: https://lore.kernel.org/r/20210119101044.1637023-1-howardyen@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Signed-off-by: JaeHun Jung <jh0801.jung@samsung.com>
Export symbols for xhci hooks usage:
xhci_ring_free
- Allow xhci hook to free xhci_ring.
xhci_get_slot_ctx
- Allow xhci hook to get slot_ctx from the xhci_container_ctx
for getting the slot_ctx information to know which slot is
offloading and compare the context in remote subsystem memory
if needed.
xhci_get_ep_ctx
- Allow xhci hook to get ep_ctx from the xhci_container_ctx for
getting the ep_ctx information to know which ep is offloading and
comparing the context in remote subsystem memory if needed.
Export below xhci symbols for vendor modules to manage additional secondary rings.
These will be used to manage the secondary ring for usb audio offload.
xhci_segment_free
- Free a segment struct.
xhci_remove_stream_mapping
- Free for sram
xhci_link_segments
- Make the prev segment point to the next segment.
xhci_initialize_ring_info
- Initialize a ring struct.
xhci_check_trb_in_td_math
- Check TRB math for validation.
xhci_address_device
- Issue an address device command
xhci_bus_suspend
xhci_bus_resume
- Suspend and resume for power scenario
Change-Id: I2d99bded67024b2a7c625f934567e39ac03a6e5f
Signed-off-by: Howard Yen <howardyen@google.com>
Bug: 175358363
Bug: 183761108
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Signed-off-by: Daehwan Jung <dh10.jung@samsung.com>
Signed-off-by: JaeHun Jung <jh0801.jung@samsung.com>
To enable support for USB offload, define "offload" in the usb controller
node of the device tree. The "offload" value can be used to determine which
type of offload has been enabled in the SoC.
For example:
&usbdrd_dwc3 {
...
/* support usb offloading, 0: disabled, 1: audio */
offload = <1>;
...
};
There are several vendor_ops introduced by this patch:
xhci_vendor_ops - function callbacks for vendor specific operations
{
@vendor_init:
- called for vendor init process during xhci-plat-hcd
probe.
@vendor_cleanup:
- called for vendor cleanup process during xhci-plat-hcd
remove.
@is_usb_offload_enabled:
- called to check if usb offload enabled.
@queue_irq_work:
- called to queue vendor specific irq work.
@alloc_dcbaa:
- called when allocating vendor specific dcbaa during
memory initialization.
@free_dcbaa:
- called to free vendor specific dcbaa when cleaning up the
memory.
@alloc_transfer_ring:
- called when vendor specific transfer ring allocation is required
@free_transfer_ring:
- called to free vendor specific transfer ring
@sync_dev_ctx:
- called when synchronization for device context is required
@usb_offload_skip_urb:
- skip urb control for offloading
@alloc_container_ctx:
@free_container_ctx:
- called to alloc and free vendor specific container context
}
The xhci hooks are named with the prefix "xhci_vendor_" followed by the
name of the op in xhci_vendor_ops.
For example, the vendor_init op will be invoked by the xhci_vendor_init() hook,
the is_usb_offload_enabled op will be invoked by
xhci_vendor_is_usb_offload_enabled(), and so on.
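The naming scheme for the hooks can be sketched as an ops table plus thin
wrappers. This is a reduced user-space model, not the driver code: struct
xhci_model, struct vendor_ops_model, the model_vendor_*() wrappers and the
demo_*() callbacks are hypothetical, and the NULL checks in the wrappers are
an assumption about how a missing vendor would be handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct xhci_model;

/* Reduced, hypothetical version of the ops table: two callbacks only. */
struct vendor_ops_model {
	int (*vendor_init)(struct xhci_model *xhci);
	bool (*is_usb_offload_enabled)(struct xhci_model *xhci);
};

struct xhci_model {
	const struct vendor_ops_model *ops; /* NULL when no vendor registered */
	int init_calls;
};

/* Hook wrappers: "vendor_" prefix plus op name; no-ops without a vendor. */
static int model_vendor_init(struct xhci_model *xhci)
{
	assert(xhci != NULL);
	if (xhci->ops && xhci->ops->vendor_init)
		return xhci->ops->vendor_init(xhci);
	return 0;
}

static bool model_vendor_is_usb_offload_enabled(struct xhci_model *xhci)
{
	assert(xhci != NULL);
	if (xhci->ops && xhci->ops->is_usb_offload_enabled)
		return xhci->ops->is_usb_offload_enabled(xhci);
	return false;
}

/* Example vendor implementation registered through the ops table. */
static int demo_init(struct xhci_model *xhci)
{
	xhci->init_calls++;
	return 0;
}

static bool demo_offload(struct xhci_model *xhci)
{
	(void)xhci;
	return true;
}

static const struct vendor_ops_model demo_ops = {
	.vendor_init = demo_init,
	.is_usb_offload_enabled = demo_offload,
};
```

Core code only ever calls the wrappers, so a controller without registered
vendor ops behaves as if offload were disabled.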
Change-Id: Ib7f6952e6d44a2fcfe9d19a78f1d9f5093417613
Signed-off-by: Howard Yen <howardyen@google.com>
Bug: 175358363
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Signed-off-by: Puma Hsu <pumahsu@google.com>
Signed-off-by: J. Avila <elavila@google.com>
Signed-off-by: Daehwan Jung <dh10.jung@samsung.com>
Signed-off-by: JaeHun Jung <jh0801.jung@samsung.com>
Set KMI_GENERATION=9 for 6/16 KMI update
function symbol 'struct block_device* I_BDEV(struct inode*)' changed
CRC changed from 0xb3d19fd2 to 0xc8597fa
function symbol 'void __ClearPageMovable(struct page*)' changed
CRC changed from 0x66921e4f to 0xb4e74d22
function symbol 'void __SetPageMovable(struct page*, const struct movable_operations*)' changed
CRC changed from 0x2b34667d to 0xe8b6d861
... 4484 omitted; 4487 symbols have only CRC changes
type 'struct request' changed
byte size changed from 312 to 320
member 'u64 alloc_time_ns' was added
19 members ('u64 start_time_ns' .. 'u64 android_kabi_reserved1') changed
offset changed by 64
type 'struct bio' changed
byte size changed from 152 to 160
member 'u64 bi_iocost_cost' was added
12 members ('struct bio_crypt_ctx* bi_crypt_context' .. 'struct bio_vec bi_inline_vecs[0]') changed
offset changed by 64
type 'enum cpuhp_state' changed
enumerator 'CPUHP_AP_ARM_SDEI_STARTING' (116) was removed
enumerator 'CPUHP_AP_ARM_VFP_STARTING' value changed from 117 to 116
enumerator 'CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING' value changed from 118 to 117
enumerator 'CPUHP_AP_PERF_ARM_HW_BREAKPOINT_STARTING' value changed from 119 to 118
enumerator 'CPUHP_AP_PERF_ARM_ACPI_STARTING' value changed from 120 to 119
enumerator 'CPUHP_AP_PERF_ARM_STARTING' value changed from 121 to 120
enumerator 'CPUHP_AP_PERF_RISCV_STARTING' value changed from 122 to 121
enumerator 'CPUHP_AP_ARM_L2X0_STARTING' value changed from 123 to 122
enumerator 'CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING' value changed from 124 to 123
enumerator 'CPUHP_AP_ARM_ARCH_TIMER_STARTING' value changed from 125 to 124
enumerator 'CPUHP_AP_ARM_GLOBAL_TIMER_STARTING' value changed from 126 to 125
enumerator 'CPUHP_AP_JCORE_TIMER_STARTING' value changed from 127 to 126
enumerator 'CPUHP_AP_ARM_TWD_STARTING' value changed from 128 to 127
enumerator 'CPUHP_AP_QCOM_TIMER_STARTING' value changed from 129 to 128
enumerator 'CPUHP_AP_TEGRA_TIMER_STARTING' value changed from 130 to 129
enumerator 'CPUHP_AP_ARMADA_TIMER_STARTING' value changed from 131 to 130
enumerator 'CPUHP_AP_MARCO_TIMER_STARTING' value changed from 132 to 131
enumerator 'CPUHP_AP_MIPS_GIC_TIMER_STARTING' value changed from 133 to 132
enumerator 'CPUHP_AP_ARC_TIMER_STARTING' value changed from 134 to 133
enumerator 'CPUHP_AP_RISCV_TIMER_STARTING' value changed from 135 to 134
enumerator 'CPUHP_AP_CLINT_TIMER_STARTING' value changed from 136 to 135
enumerator 'CPUHP_AP_CSKY_TIMER_STARTING' value changed from 137 to 136
enumerator 'CPUHP_AP_TI_GP_TIMER_STARTING' value changed from 138 to 137
enumerator 'CPUHP_AP_HYPERV_TIMER_STARTING' value changed from 139 to 138
enumerator 'CPUHP_AP_KVM_STARTING' value changed from 140 to 139
enumerator 'CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING' value changed from 141 to 140
enumerator 'CPUHP_AP_KVM_ARM_VGIC_STARTING' value changed from 142 to 141
enumerator 'CPUHP_AP_KVM_ARM_TIMER_STARTING' value changed from 143 to 142
enumerator 'CPUHP_AP_DUMMY_TIMER_STARTING' value changed from 144 to 143
enumerator 'CPUHP_AP_ARM_XEN_STARTING' value changed from 145 to 144
enumerator 'CPUHP_AP_ARM_CORESIGHT_STARTING' value changed from 146 to 145
enumerator 'CPUHP_AP_ARM_CORESIGHT_CTI_STARTING' value changed from 147 to 146
enumerator 'CPUHP_AP_ARM64_ISNDEP_STARTING' value changed from 148 to 147
enumerator 'CPUHP_AP_SMPCFD_DYING' value changed from 149 to 148
enumerator 'CPUHP_AP_X86_TBOOT_DYING' value changed from 150 to 149
enumerator 'CPUHP_AP_ARM_CACHE_B15_RAC_DYING' value changed from 151 to 150
enumerator 'CPUHP_AP_ONLINE' value changed from 152 to 151
enumerator 'CPUHP_TEARDOWN_CPU' value changed from 153 to 152
enumerator 'CPUHP_AP_ONLINE_IDLE' value changed from 154 to 153
enumerator 'CPUHP_AP_SCHED_WAIT_EMPTY' value changed from 155 to 154
enumerator 'CPUHP_AP_SMPBOOT_THREADS' value changed from 156 to 155
enumerator 'CPUHP_AP_X86_VDSO_VMA_ONLINE' value changed from 157 to 156
enumerator 'CPUHP_AP_IRQ_AFFINITY_ONLINE' value changed from 158 to 157
enumerator 'CPUHP_AP_BLK_MQ_ONLINE' value changed from 159 to 158
enumerator 'CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS' value changed from 160 to 159
enumerator 'CPUHP_AP_X86_INTEL_EPB_ONLINE' value changed from 161 to 160
enumerator 'CPUHP_AP_PERF_ONLINE' value changed from 162 to 161
enumerator 'CPUHP_AP_PERF_X86_ONLINE' value changed from 163 to 162
enumerator 'CPUHP_AP_PERF_X86_UNCORE_ONLINE' value changed from 164 to 163
enumerator 'CPUHP_AP_PERF_X86_AMD_UNCORE_ONLINE' value changed from 165 to 164
enumerator 'CPUHP_AP_PERF_X86_AMD_POWER_ONLINE' value changed from 166 to 165
enumerator 'CPUHP_AP_PERF_X86_RAPL_ONLINE' value changed from 167 to 166
enumerator 'CPUHP_AP_PERF_X86_CQM_ONLINE' value changed from 168 to 167
enumerator 'CPUHP_AP_PERF_X86_CSTATE_ONLINE' value changed from 169 to 168
enumerator 'CPUHP_AP_PERF_X86_IDXD_ONLINE' value changed from 170 to 169
enumerator 'CPUHP_AP_PERF_S390_CF_ONLINE' value changed from 171 to 170
enumerator 'CPUHP_AP_PERF_S390_SF_ONLINE' value changed from 172 to 171
enumerator 'CPUHP_AP_PERF_ARM_CCI_ONLINE' value changed from 173 to 172
enumerator 'CPUHP_AP_PERF_ARM_CCN_ONLINE' value changed from 174 to 173
enumerator 'CPUHP_AP_PERF_ARM_HISI_CPA_ONLINE' value changed from 175 to 174
enumerator 'CPUHP_AP_PERF_ARM_HISI_DDRC_ONLINE' value changed from 176 to 175
enumerator 'CPUHP_AP_PERF_ARM_HISI_HHA_ONLINE' value changed from 177 to 176
enumerator 'CPUHP_AP_PERF_ARM_HISI_L3_ONLINE' value changed from 178 to 177
enumerator 'CPUHP_AP_PERF_ARM_HISI_PA_ONLINE' value changed from 179 to 178
enumerator 'CPUHP_AP_PERF_ARM_HISI_SLLC_ONLINE' value changed from 180 to 179
enumerator 'CPUHP_AP_PERF_ARM_HISI_PCIE_PMU_ONLINE' value changed from 181 to 180
enumerator 'CPUHP_AP_PERF_ARM_HNS3_PMU_ONLINE' value changed from 182 to 181
enumerator 'CPUHP_AP_PERF_ARM_L2X0_ONLINE' value changed from 183 to 182
enumerator 'CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE' value changed from 184 to 183
enumerator 'CPUHP_AP_PERF_ARM_QCOM_L3_ONLINE' value changed from 185 to 184
enumerator 'CPUHP_AP_PERF_ARM_APM_XGENE_ONLINE' value changed from 186 to 185
enumerator 'CPUHP_AP_PERF_ARM_CAVIUM_TX2_UNCORE_ONLINE' value changed from 187 to 186
enumerator 'CPUHP_AP_PERF_ARM_MARVELL_CN10K_DDR_ONLINE' value changed from 188 to 187
enumerator 'CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE' value changed from 189 to 188
enumerator 'CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE' value changed from 190 to 189
enumerator 'CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE' value changed from 191 to 190
enumerator 'CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE' value changed from 192 to 191
enumerator 'CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE' value changed from 193 to 192
enumerator 'CPUHP_AP_PERF_POWERPC_HV_GPCI_ONLINE' value changed from 194 to 193
enumerator 'CPUHP_AP_PERF_CSKY_ONLINE' value changed from 195 to 194
enumerator 'CPUHP_AP_WATCHDOG_ONLINE' value changed from 196 to 195
enumerator 'CPUHP_AP_WORKQUEUE_ONLINE' value changed from 197 to 196
enumerator 'CPUHP_AP_RANDOM_ONLINE' value changed from 198 to 197
enumerator 'CPUHP_AP_RCUTREE_ONLINE' value changed from 199 to 198
enumerator 'CPUHP_AP_BASE_CACHEINFO_ONLINE' value changed from 200 to 199
enumerator 'CPUHP_AP_ONLINE_DYN' value changed from 201 to 200
enumerator 'CPUHP_AP_ONLINE_DYN_END' value changed from 231 to 230
enumerator 'CPUHP_AP_MM_DEMOTION_ONLINE' value changed from 232 to 231
enumerator 'CPUHP_AP_X86_HPET_ONLINE' value changed from 233 to 232
enumerator 'CPUHP_AP_X86_KVM_CLK_ONLINE' value changed from 234 to 233
enumerator 'CPUHP_AP_ACTIVE' value changed from 235 to 234
enumerator 'CPUHP_ANDROID_RESERVED_1' value changed from 236 to 235
enumerator 'CPUHP_ANDROID_RESERVED_2' value changed from 237 to 236
enumerator 'CPUHP_ANDROID_RESERVED_3' value changed from 238 to 237
enumerator 'CPUHP_ANDROID_RESERVED_4' value changed from 239 to 238
enumerator 'CPUHP_ONLINE' value changed from 240 to 239
type 'struct task_struct' changed
byte size changed from 4736 to 4800
104 members ('const struct cred* ptracer_cred' .. 'struct thread_struct thread') changed
offset changed by 384
type 'struct platform_driver' changed
byte size changed from 240 to 248
member 'void(* remove_new)(struct platform_device*)' was added
8 members ('void(* shutdown)(struct platform_device*)' .. 'u64 android_kabi_reserved1') changed
offset changed by 64
type 'struct tipc_bearer' changed
member 'u16 encap_hlen' was added
type 'struct posix_cputimers_work' changed
byte size changed from 24 to 72
member 'struct mutex mutex' was added
member 'unsigned int scheduled' changed
offset changed by 384
type 'struct binder_alloc' changed
member 'struct vm_area_struct* vma' was added
member 'unsigned long vma_addr' was removed
type 'struct usb_udc' changed
byte size changed from 1000 to 952
member 'struct mutex connect_lock' was removed
type 'enum kvm_pgtable_prot' changed
enumerator 'KVM_PGTABLE_PROT_PXN' (32) was added
enumerator 'KVM_PGTABLE_PROT_UXN' (64) was added
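The size and offset figures in the report above are consistent with sizes
being given in bytes and offsets in bits: adding one u64 grows a struct by
8 bytes and shifts later members by 64. A minimal sketch of that arithmetic,
using hypothetical miniature structs (struct req_old and struct req_new are
not the real struct request layout):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

static_assert(CHAR_BIT == 8, "the bit arithmetic below assumes 8-bit bytes");

/* Hypothetical miniature of a struct before a u64 member is inserted. */
struct req_old {
	uint64_t deadline;
	uint64_t start_time_ns;
	uint64_t io_start_time_ns;
};

/* The same struct after the insertion. */
struct req_new {
	uint64_t deadline;
	uint64_t alloc_time_ns; /* newly added member */
	uint64_t start_time_ns;
	uint64_t io_start_time_ns;
};

/* Offset shift of start_time_ns in bits, as an ABI diff would report it. */
static long start_time_shift_bits(void)
{
	return CHAR_BIT * ((long)offsetof(struct req_new, start_time_ns) -
			   (long)offsetof(struct req_old, start_time_ns));
}

/* Growth of the whole struct in bytes. */
static long size_growth_bytes(void)
{
	return (long)sizeof(struct req_new) - (long)sizeof(struct req_old);
}
```

Any module built against the old layout would read the wrong member at the
old offset, which is why such changes require a new KMI generation.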
Bug: 287162457
Change-Id: Ic3aad43bd3a6083cf91e71e79ece713bef0e8172
Signed-off-by: Carlos Llamas <cmllamas@google.com>
commit d1d8875c8c upstream.
[ cmllamas: clean forward port from commit 015ac18be7 ("binder: fix
UAF of alloc->vma in race with munmap()") in 5.10 stable. It is needed
in mainline after the revert of commit a43cfc87ca ("android: binder:
stop saving a pointer to the VMA") as pointed out by Liam. The commit
log and tags have been tweaked to reflect this. ]
In commit 720c241924 ("ANDROID: binder: change down_write to
down_read") binder assumed the mmap read lock is sufficient to protect
alloc->vma inside binder_update_page_range(). This used to be accurate
until commit dd2283f260 ("mm: mmap: zap pages with read mmap_sem in
munmap"), which now downgrades the mmap_lock after detaching the vma
from the rbtree in munmap(). Then it proceeds to teardown and free the
vma with only the read lock held.
This means that accesses to alloc->vma in binder_update_page_range() now
will race with vm_area_free() in munmap() and can cause a UAF as shown
in the following KASAN trace:
==================================================================
BUG: KASAN: use-after-free in vm_insert_page+0x7c/0x1f0
Read of size 8 at addr ffff16204ad00600 by task server/558
CPU: 3 PID: 558 Comm: server Not tainted 5.10.150-00001-gdc8dcf942daa #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2a0
show_stack+0x18/0x2c
dump_stack+0xf8/0x164
print_address_description.constprop.0+0x9c/0x538
kasan_report+0x120/0x200
__asan_load8+0xa0/0xc4
vm_insert_page+0x7c/0x1f0
binder_update_page_range+0x278/0x50c
binder_alloc_new_buf+0x3f0/0xba0
binder_transaction+0x64c/0x3040
binder_thread_write+0x924/0x2020
binder_ioctl+0x1610/0x2e5c
__arm64_sys_ioctl+0xd4/0x120
el0_svc_common.constprop.0+0xac/0x270
do_el0_svc+0x38/0xa0
el0_svc+0x1c/0x2c
el0_sync_handler+0xe8/0x114
el0_sync+0x180/0x1c0
Allocated by task 559:
kasan_save_stack+0x38/0x6c
__kasan_kmalloc.constprop.0+0xe4/0xf0
kasan_slab_alloc+0x18/0x2c
kmem_cache_alloc+0x1b0/0x2d0
vm_area_alloc+0x28/0x94
mmap_region+0x378/0x920
do_mmap+0x3f0/0x600
vm_mmap_pgoff+0x150/0x17c
ksys_mmap_pgoff+0x284/0x2dc
__arm64_sys_mmap+0x84/0xa4
el0_svc_common.constprop.0+0xac/0x270
do_el0_svc+0x38/0xa0
el0_svc+0x1c/0x2c
el0_sync_handler+0xe8/0x114
el0_sync+0x180/0x1c0
Freed by task 560:
kasan_save_stack+0x38/0x6c
kasan_set_track+0x28/0x40
kasan_set_free_info+0x24/0x4c
__kasan_slab_free+0x100/0x164
kasan_slab_free+0x14/0x20
kmem_cache_free+0xc4/0x34c
vm_area_free+0x1c/0x2c
remove_vma+0x7c/0x94
__do_munmap+0x358/0x710
__vm_munmap+0xbc/0x130
__arm64_sys_munmap+0x4c/0x64
el0_svc_common.constprop.0+0xac/0x270
do_el0_svc+0x38/0xa0
el0_svc+0x1c/0x2c
el0_sync_handler+0xe8/0x114
el0_sync+0x180/0x1c0
[...]
==================================================================
To prevent the race above, revert back to taking the mmap write lock
inside binder_update_page_range(). One might expect an increase of mmap
lock contention. However, binder already serializes these calls via top
level alloc->mutex. Also, there was no performance impact shown when
running the binder benchmark tests.
Fixes: c0fd210178 ("Revert "android: binder: stop saving a pointer to the VMA"")
Fixes: dd2283f260 ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/all/20230518144052.xkj6vmddccq4v66b@revolver
Cc: <stable@vger.kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Change-Id: I4215750a81e94bccf5340e4d79f7b26bb039c573
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Acked-by: Todd Kjos <tkjos@google.com>
Link: https://lore.kernel.org/r/20230519195950.1775656-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 931ea1ed31)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 0fa53349c3 upstream.
Bring back the original lockless design in binder_alloc to determine
whether the buffer setup has been completed by the ->mmap() handler.
However, this time use smp_load_acquire() and smp_store_release() to
wrap all the ordering in a single macro call.
Also, add comments to make it evident that binder uses alloc->vma to
determine when the binder_alloc has been fully initialized. In these
scenarios acquiring the mmap_lock is not required.
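The publish/observe pattern described above can be sketched in user space
with C11 atomics, which offer release and acquire orderings analogous to
smp_store_release() and smp_load_acquire(). Everything here is a hypothetical
model: struct alloc_model, struct vma_model and the alloc_*() helpers are not
the binder code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct vma_model {
	unsigned long start, end;
};

struct alloc_model {
	size_t buffer_size;              /* set up by the mmap handler */
	_Atomic(struct vma_model *) vma; /* non-NULL means fully initialized */
};

/* Writer side (models the ->mmap() handler): publish vma last, with
 * release ordering, so the earlier store to buffer_size is visible to
 * any reader that observes the non-NULL pointer. */
static void alloc_publish(struct alloc_model *a, struct vma_model *vma)
{
	assert(vma != NULL);
	a->buffer_size = vma->end - vma->start;
	atomic_store_explicit(&a->vma, vma, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static struct vma_model *alloc_get_vma(struct alloc_model *a)
{
	return atomic_load_explicit(&a->vma, memory_order_acquire);
}

/* Returns the buffer size, or 0 if setup has not completed yet. */
static size_t alloc_ready_size(struct alloc_model *a)
{
	return alloc_get_vma(a) ? a->buffer_size : 0;
}
```

The reader never takes a lock: a NULL pointer means "not ready", and a
non-NULL pointer guarantees the fields written before the publish are visible.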
Fixes: a43cfc87ca ("android: binder: stop saving a pointer to the VMA")
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org
Change-Id: I2a8040417790b6b82bf44e838146fd68403fdb51
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20230502201220.1756319-3-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d7cee853bc)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit c0fd210178 upstream.
This reverts commit a43cfc87ca.
This patch fixed an issue reported by syzkaller in [1]. However, this
turned out to be only a band-aid in binder. The root cause, as bisected
by syzkaller, was fixed by commit 5789151e48 ("mm/mmap: undo ->mmap()
when mas_preallocate() fails"). We no longer need the patch for binder.
Reverting such patch allows us to have a lockless access to alloc->vma
in specific cases where the mmap_lock is not required. This approach
avoids the contention that caused a performance regression.
[1] https://lore.kernel.org/all/0000000000004a0dbe05e1d749e0@google.com
[cmllamas: resolved conflicts with rework of alloc->mm and removal of
binder_alloc_set_vma() also fixed comment section]
Fixes: a43cfc87ca ("android: binder: stop saving a pointer to the VMA")
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org
Change-Id: I208b4ebf832790eb155d52ec3115e1e6c58f6f80
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20230502201220.1756319-2-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 72a94f8c14)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 35a089b5d7 ]
Checking the bearer min mtu with tipc_udp_mtu_bad() only works for
IPv4 UDP bearer, and IPv6 UDP bearer has a different value for the
min mtu. This patch checks with encap_hlen + TIPC_MIN_BEARER_MTU
for min mtu, which works for both IPv4 and IPv6 UDP bearer.
Note that tipc_udp_mtu_bad() is still used to check media min mtu
in __tipc_nl_media_set(), as m->mtu currently is only used by the
IPv4 UDP bearer as its default mtu value.
Fixes: 682cd3cf94 ("tipc: confgiure and apply UDP bearer MTU on running links")
Change-Id: I384afae6ffa9c43f72c1cda34ad2f1dd611fc675
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit f215b62f59)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 56077b56cd ]
When doing link mtu negotiation, a malicious peer may send an Activate msg
with a very small mtu, e.g. 4 in Shuang's testing. Without a check for
the minimum mtu, l->mtu will be set to 4 in tipc_link_proto_rcv(), and then
n->links[bearer_id].mtu is set to 4294967228, which is an overflow of
'4 - INT_H_SIZE - EMSG_OVERHEAD' in tipc_link_mss().
With tipc_link.mtu = 4, tipc_link_xmit() kept printing the warning:
tipc: Too large msg, purging xmit list 1 5 0 40 4!
tipc: Too large msg, purging xmit list 1 15 0 60 4!
And with tipc_link_entry.mtu 4294967228, a huge skb was allocated in
named_distribute(), and when purging it in tipc_link_xmit(), a crash
was even caused:
general protection fault, probably for non-canonical address 0x2100001011000dd: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.3.0.neta #19
RIP: 0010:kfree_skb_list_reason+0x7e/0x1f0
Call Trace:
<IRQ>
skb_release_data+0xf9/0x1d0
kfree_skb_reason+0x40/0x100
tipc_link_xmit+0x57a/0x740 [tipc]
tipc_node_xmit+0x16c/0x5c0 [tipc]
tipc_named_node_up+0x27f/0x2c0 [tipc]
tipc_node_write_unlock+0x149/0x170 [tipc]
tipc_rcv+0x608/0x740 [tipc]
tipc_udp_recv+0xdc/0x1f0 [tipc]
udp_queue_rcv_one_skb+0x33e/0x620
udp_unicast_rcv_skb.isra.72+0x75/0x90
__udp4_lib_rcv+0x56d/0xc20
ip_protocol_deliver_rcu+0x100/0x2d0
This patch fixes it by checking the new mtu against tipc_bearer_min_mtu(),
and not updating mtu if it is too small.
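The wrap-around and the guard can be reproduced with plain unsigned
arithmetic. This is a sketch under assumptions: LINK_OVERHEAD = 72 is only
inferred from the numbers quoted above (4 - 72 wraps to 4294967228), and
link_mss() and negotiate_mtu() are hypothetical models whose min_mtu
parameter stands in for the result of tipc_bearer_min_mtu().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed combined value of INT_H_SIZE + EMSG_OVERHEAD, inferred from the
 * figures in the commit message rather than taken from the source. */
#define LINK_OVERHEAD 72u

/* Models the unchecked computation: unsigned subtraction wraps around. */
static uint32_t link_mss(uint32_t mtu)
{
	return mtu - LINK_OVERHEAD;
}

/* Models the fix: ignore a peer-proposed mtu below the bearer minimum,
 * otherwise take the smaller of the current and proposed values. */
static uint32_t negotiate_mtu(uint32_t cur_mtu, uint32_t peer_mtu,
			      uint32_t min_mtu)
{
	assert(min_mtu > LINK_OVERHEAD); /* a sane minimum covers the headers */
	if (peer_mtu < min_mtu)
		return cur_mtu;          /* too small: keep the current mtu */
	return peer_mtu < cur_mtu ? peer_mtu : cur_mtu;
}
```

Once the proposed value is rejected before it is stored, the subtraction is
never evaluated on an mtu smaller than the overhead.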
Fixes: ed193ece26 ("tipc: simplify link mtu negotiation")
Reported-by: Shuang Li <shuali@redhat.com>
Change-Id: I95f28cbfaf6dc4899e0695ba6168c7c58737f06b
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 259683001d)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 3ae6d66b60 ]
As different media may require different min mtu, and even the
same media with different net family requires different min mtu,
add tipc_bearer_min_mtu() to calculate min mtu accordingly.
This API will be used to check the new mtu when doing the link
mtu negotiation in the next patch.
Change-Id: I960cf07506388294eb6028938025e1073a2c4be5
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 56077b56cd ("tipc: do not update mtu if msg_max is too small in mtu negotiation")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 735c64ea88)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 5c5a7680e6 ]
struct platform_driver::remove returning an integer made driver authors
expect that returning an error code was proper error handling. However
the driver core ignores the error and continues to remove the device
because there is nothing the core could do anyhow and reentering the
remove callback again is only calling for trouble.
So this is a source of errors, typically yielding resource leaks in the
error path.
As there are too many platform drivers to neatly convert them all to
return void in a single go, do it in several steps after this patch:
a) Convert all drivers to implement .remove_new() returning void instead
of .remove() returning int;
b) Change struct platform_driver::remove() to return void and so make
it identical to .remove_new();
c) Change all drivers back to .remove() now with the better prototype;
d) drop struct platform_driver::remove_new().
While this eventually touches all drivers twice, steps a) and c) can be
done one driver after another, which reduces coordination efforts
immensely and simplifies review.
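The transition state after step a) can be modelled as a driver struct that
carries both callbacks. This is a hedged user-space sketch: struct
pdev_model, struct pdrv_model, core_remove() and the two example callbacks
are hypothetical, and counting the ignored error stands in for whatever
diagnostic the real core would emit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct pdev_model {
	int released;
};

/* Hypothetical reduction of the driver struct during the transition. */
struct pdrv_model {
	int (*remove)(struct pdev_model *dev);      /* legacy, result ignored */
	void (*remove_new)(struct pdev_model *dev); /* step a) replacement */
};

/* Models the core's remove path: prefer remove_new. A legacy error code
 * is only counted and removal continues regardless. */
static int core_remove(const struct pdrv_model *drv, struct pdev_model *dev)
{
	int ignored_errors = 0;

	assert(drv != NULL && dev != NULL);
	if (drv->remove_new)
		drv->remove_new(dev);
	else if (drv->remove && drv->remove(dev) != 0)
		ignored_errors++;
	return ignored_errors;
}

/* Legacy driver: bails out early with an error and leaks its resource. */
static int legacy_remove(struct pdev_model *dev)
{
	(void)dev;
	return -EBUSY;
}

/* Converted driver: cannot report failure, so it always releases. */
static void new_remove(struct pdev_model *dev)
{
	dev->released = 1;
}
```

The legacy callback's early return leaves the resource held even though the
device goes away, which is the leak pattern the void prototype rules out.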
Change-Id: I7da6828a301462bad53470cf94db94d55ac51d37
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20221209150914.3557650-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Stable-dep-of: 17955aba78 ("ASoC: fsl_micfil: Fix error handler with pm_runtime_enable")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 9d3ac384cb)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit d2c48b2387 ]
Running a preempt-rt (v6.2-rc3-rt1) based kernel on an Ampere Altra
triggers:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 0, irqs_disabled(): 128, non_block: 0, pid: 24, name: cpuhp/0
preempt_count: 0, expected: 0
RCU nest depth: 0, expected: 0
3 locks held by cpuhp/0/24:
#0: ffffda30217c70d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x5c/0x248
#1: ffffda30217c7120 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x5c/0x248
#2: ffffda3021c711f0 (sdei_list_lock){....}-{3:3}, at: sdei_cpuhp_up+0x3c/0x130
irq event stamp: 36
hardirqs last enabled at (35): [<ffffda301e85b7bc>] finish_task_switch+0xb4/0x2b0
hardirqs last disabled at (36): [<ffffda301e812fec>] cpuhp_thread_fun+0x21c/0x248
softirqs last enabled at (0): [<ffffda301e80b184>] copy_process+0x63c/0x1ac0
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 0 PID: 24 Comm: cpuhp/0 Not tainted 5.19.0-rc3-rt5-[...]
Hardware name: WIWYNN Mt.Jade Server [...]
Call trace:
dump_backtrace+0x114/0x120
show_stack+0x20/0x70
dump_stack_lvl+0x9c/0xd8
dump_stack+0x18/0x34
__might_resched+0x188/0x228
rt_spin_lock+0x70/0x120
sdei_cpuhp_up+0x3c/0x130
cpuhp_invoke_callback+0x250/0xf08
cpuhp_thread_fun+0x120/0x248
smpboot_thread_fn+0x280/0x320
kthread+0x130/0x140
ret_from_fork+0x10/0x20
sdei_cpuhp_up() is called in the STARTING hotplug section,
which runs with interrupts disabled. Use a CPUHP_AP_ONLINE_DYN entry
instead to execute the cpuhp cb later, with preemption enabled.
SDEI originally got its own cpuhp slot to allow interacting
with perf. It got superseded by pNMI and this early slot is not
relevant anymore. [1]
Some SDEI calls (e.g. SDEI_1_0_FN_SDEI_PE_MASK) take actions on the
calling CPU. It is checked that preemption is disabled for them.
_ONLINE cpuhp cb are executed in the 'per CPU hotplug thread'.
Preemption is enabled in those threads, but their cpumask is limited
to 1 CPU.
Move 'WARN_ON_ONCE(preemptible())' statements so that SDEI cpuhp cb
don't trigger them.
Also add a check for the SDEI_1_0_FN_SDEI_PRIVATE_RESET SDEI call
which acts on the calling CPU.
[1]:
https://lore.kernel.org/all/5813b8c5-ae3e-87fd-fccc-94c9cd08816d@arm.com/
Suggested-by: James Morse <james.morse@arm.com>
Change-Id: I9f73aadd24096d8298b5ae8f26f955e9f6ee2b9a
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Link: https://lore.kernel.org/r/20230216084920.144064-1-pierre.gondois@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit a8267bc8de)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 31088f6f79 ]
typeof is (still) a GNU extension, which means that it cannot be used when
building ISO C (e.g. -std=c99). It should therefore be avoided in uapi
headers in favour of the ISO-friendly __typeof__.
Unfortunately this issue could not be detected by
CONFIG_UAPI_HEADER_TEST=y as the __ALIGN_KERNEL() macro is not expanded in
any uapi header.
This matters from a userspace perspective, not a kernel one. uapi
headers and their contents are expected to be usable in a variety of
situations, and in particular when building ISO C applications (with
-std=c99 or similar).
This particular problem can be reproduced by trying to use the
__ALIGN_KERNEL macro directly in application code, say:
#include <linux/const.h>
int align(int x, int a)
{
	return __ALIGN_KERNEL(x, a);
}
and trying to build that with -std=c99.
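For comparison, a minimal standalone sketch of the ISO-friendly spelling.
DEMO_ALIGN and DEMO_ALIGN_MASK are hypothetical stand-ins modelled on the
kernel macro, not copies of the uapi header; with GCC or Clang this should
build under -std=c99 because __typeof__ stays available there while plain
typeof does not.

```c
/* Hypothetical stand-ins modelled on __ALIGN_KERNEL(); names are ours. */
#define DEMO_ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
#define DEMO_ALIGN(x, a)		DEMO_ALIGN_MASK(x, (__typeof__(x))(a) - 1)

/* Round x up to a multiple of a; a is assumed to be a power of two. */
static int demo_align(int x, int a)
{
	return DEMO_ALIGN(x, a);
}
```

Replacing __typeof__ with typeof in DEMO_ALIGN is expected to make the same
file fail to build with -std=c99, which is the failure reported here.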
Link: https://lkml.kernel.org/r/20230411092747.3759032-1-kevin.brodsky@arm.com
Fixes: a79ff731a1 ("netfilter: xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()")
Change-Id: I05462cdee00da59617f3dfb875c233a246f7d2f6
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reported-by: Ruben Ayrapetyan <ruben.ayrapetyan@arm.com>
Tested-by: Ruben Ayrapetyan <ruben.ayrapetyan@arm.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Tested-by: Petr Vorel <pvorel@suse.cz>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit ef9f854103)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f7abf14f00 upstream.
For some unknown reason the introduction of the timer_wait_running callback
failed to fix up posix CPU timers, which went unnoticed for almost four years.
Marco reported recently that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.
Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:
1) If not enabled, the expiry happens in hard interrupt context so
spin waiting on the remote CPU is reasonably time bound.
Implement an empty stub function for that case.
2) If enabled, the expiry happens in task work before returning to user
space or guest mode. The expired timers are marked as firing and moved
from the timer queue to a local list head with sighand lock held. Once
the timers are moved, sighand lock is dropped and the expiry happens in
fully preemptible context. That means the expiring task can be scheduled
out, migrated, interrupted etc. So spin waiting on it is more than
suboptimal.
The timer wheel has a timer_wait_running() mechanism for RT, which uses
a per CPU timer-base expiry lock which is held by the expiry code and the
task waiting for the timer function to complete blocks on that lock.
This does not work in the same way for posix CPU timers as there is no
timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry lock
can be used too in a slightly different way:
- Add a mutex to struct posix_cputimers_work. This struct is per task
and used to schedule the expiry task work from the timer interrupt.
- Add a task_struct pointer to struct cpu_timer which is used to store
the task which runs the expiry. That's filled in when the task
moves the expired timers to the local expiry list. That's not
affecting the size of the k_itimer union as there are bigger union
members already.
- Let the task take the expiry mutex around the expiry function
- Let the waiter acquire a task reference with rcu_read_lock() held and
block on the expiry mutex
This avoids spin-waiting on a task which might not even be on a CPU and
works nicely for RT too.
Fixes: ec8f954a40 ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Reported-by: Marco Elver <elver@google.com>
Change-Id: Ic069585c15bc968dec3c2b99cc70256f56a70b32
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bccf9fe296)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Over the lifetime of the kernel, new arm64 cpucaps need to be added to
handle errata and other fun stuff. So reserve 20 spots for us to use in
the future as this is an ABI-stable structure that we can not increase
over time without major problems.
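A minimal sketch of the idea, assuming the capability count sizes a structure
that modules can see. All names and the number of reserved slots below are
illustrative and are not the arm64 cpucaps list.

```c
/* Hypothetical capability numbering; names and counts are illustrative. */
enum demo_cap {
	DEMO_CAP_FEATURE_A,
	DEMO_CAP_FEATURE_B,
	/* Reserved slots: rename one when a new capability is needed. */
	DEMO_CAP_RESERVED_0,
	DEMO_CAP_RESERVED_1,
	DEMO_CAP_RESERVED_2,
	DEMO_NCAPS		/* must not change once the ABI is frozen */
};

#define DEMO_BITS_PER_LONG	(8 * sizeof(unsigned long))

/* Bitmap sized by DEMO_NCAPS; growing the enum would change its size. */
struct demo_cap_bitmap {
	unsigned long bits[(DEMO_NCAPS + DEMO_BITS_PER_LONG - 1) /
			   DEMO_BITS_PER_LONG];
};

static void demo_set_cap(struct demo_cap_bitmap *map, enum demo_cap cap)
{
	map->bits[cap / DEMO_BITS_PER_LONG] |= 1UL << (cap % DEMO_BITS_PER_LONG);
}

static int demo_has_cap(const struct demo_cap_bitmap *map, enum demo_cap cap)
{
	return (map->bits[cap / DEMO_BITS_PER_LONG] >>
		(cap % DEMO_BITS_PER_LONG)) & 1UL;
}
```

Turning DEMO_CAP_RESERVED_0 into a real capability later keeps DEMO_NCAPS, and
so the size of struct demo_cap_bitmap, unchanged.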
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I37bdac374e2570f61ab54919712fd62c7e541e67
This is a placeholder to work around the NXP iMX8QM A53 cache coherency issue.
The full patch is still under review upstream.
The patch adds a new cpucap, which breaks KMI, and the KMI freeze date is
coming, so use a placeholder here to update KMI before the freeze.
According to NXP errata document[1] i.MX8QuadMax SoC suffers from
serious cache coherence issue. It was also mentioned in initial
support[2] for imx8qm mek machine.
Following is excerpt from NXP IMX8_1N94W "Mask Set Errata" document
Rev. 5, 3/2023. Just in case it gets lost somehow.
"ERR050104: Arm/A53: Cache coherency issue"
Description
Some maintenance operations exchanged between the A53 and A72
core clusters, involving some Translation Look-aside Buffer
Invalidate (TLBI) and Instruction Cache (IC) instructions can
be corrupted. The upper bits, above bit-35, of ARADDR and ACADDR
buses within the Arm A53 sub-system have been incorrectly connected.
Therefore ARADDR and ACADDR address bits above bit-35 should not
be used.
Workaround
The following software instructions are required to be downgraded
to TLBI VMALLE1IS: TLBI ASIDE1, TLBI ASIDE1IS, TLBI VAAE1,
TLBI VAAE1IS, TLBI VAALE1, TLBI VAALE1IS, TLBI VAE1, TLBI VAE1IS,
TLBI VALE1, TLBI VALE1IS
The following software instructions are required to be downgraded
to TLBI VMALLS12E1IS: TLBI IPAS2E1IS, TLBI IPAS2LE1IS
The following software instructions are required to be downgraded
to TLBI ALLE2IS: TLBI VAE2IS, TLBI VALE2IS.
The following software instructions are required to be downgraded
to TLBI ALLE3IS: TLBI VAE3IS, TLBI VALE3IS.
The following software instructions are required to be downgraded
to TLBI VMALLE1IS when the Force Broadcast (FB) bit [9] of the
Hypervisor Configuration Register (HCR_EL2) is set:
TLBI ASIDE1, TLBI VAAE1, TLBI VAALE1, TLBI VAE1, TLBI VALE1
The following software instruction is required to be downgraded
to IC IALLUIS: IC IVAU, Xt
Specifically for the IC IVAU, Xt downgrade, setting SCTLR_EL1.UCI
to 0 will disable EL0 access to this instruction. Any attempt to
execute from EL0 will generate an EL1 trap, where the downgrade to
IC IALLUIS can be implemented.
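The unconditional downgrades listed in the excerpt can be written down as a
lookup. This is only a sketch of the table in the text: enum demo_op and
demo_downgrade() are hypothetical names, nothing here issues real
instructions, and the HCR_EL2.FB case is left out because the operations it
names already appear in the first list.

```c
/* Hypothetical enum; only the operations named in the erratum text. */
enum demo_op {
	TLBI_ASIDE1, TLBI_ASIDE1IS, TLBI_VAAE1, TLBI_VAAE1IS, TLBI_VAALE1,
	TLBI_VAALE1IS, TLBI_VAE1, TLBI_VAE1IS, TLBI_VALE1, TLBI_VALE1IS,
	TLBI_IPAS2E1IS, TLBI_IPAS2LE1IS,
	TLBI_VAE2IS, TLBI_VALE2IS,
	TLBI_VAE3IS, TLBI_VALE3IS,
	IC_IVAU,
	/* Replacements that do not carry an address. */
	TLBI_VMALLE1IS, TLBI_VMALLS12E1IS, TLBI_ALLE2IS, TLBI_ALLE3IS,
	IC_IALLUIS
};

/* Return the operation to issue instead of op, per the lists above. */
static enum demo_op demo_downgrade(enum demo_op op)
{
	switch (op) {
	case TLBI_ASIDE1: case TLBI_ASIDE1IS:
	case TLBI_VAAE1: case TLBI_VAAE1IS:
	case TLBI_VAALE1: case TLBI_VAALE1IS:
	case TLBI_VAE1: case TLBI_VAE1IS:
	case TLBI_VALE1: case TLBI_VALE1IS:
		return TLBI_VMALLE1IS;
	case TLBI_IPAS2E1IS: case TLBI_IPAS2LE1IS:
		return TLBI_VMALLS12E1IS;
	case TLBI_VAE2IS: case TLBI_VALE2IS:
		return TLBI_ALLE2IS;
	case TLBI_VAE3IS: case TLBI_VALE3IS:
		return TLBI_ALLE3IS;
	case IC_IVAU:
		return IC_IALLUIS;
	default:
		return op;	/* already free of address bits above bit 35 */
	}
}
```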
[1] https://www.nxp.com/docs/en/errata/IMX8_1N94W.pdf
[2] commit 307fd14d4b ("arm64: dts: imx: add imx8qm mek support")
Bug: 284762900
Link: https://lore.kernel.org/linux-arm-kernel/20230420112952.28340-1-iivanov@suse.de/
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8dd50b369412de73b608805d1b5bb8424ea23280
FEAT_XNX allows PXN and UXN attributes to be specified on stage-2 entries.
Make this usable from pKVM by exposing two new kvm_pgtable_prot entries
for each of them.
No functional changes intended.
Bug: 264070847
Change-Id: I47d861fa64ba511370b182f4609fe1c27695a949
Signed-off-by: Quentin Perret <qperret@google.com>
Nothing currently prevents the donation of an MMIO region to the
hypervisor for backing e.g. guest stage-2 page-tables, tracing buffers,
hyp vm and vcpu metadata, or any other donation to EL2. However, the
only confirmed use-cases for MMIO donations are protecting the IOMMU
registers and vendor module usage.
Restrict the donation of MMIO regions to these two paths only by
introducing a new helper function.
Bug: 264070847
Change-Id: I914508fb3e3547fcfabca8557bdf7948cb796099
Signed-off-by: Quentin Perret <qperret@google.com>
We've historically disallowed state changes for MMIO pages -- the host
had sole ownership of all of them. However, changing the state of those
pages has clearly become a goal both to support vendor extensions to
the hypervisor, as well as to support device assignment in the longer
term. To pave the way towards this support, let's allow certain state
transitions for MMIO pages.
Bug: 264070847
Change-Id: I9803b572c90d8a694c3d43a0ee0d7b4f4124fe4a
Signed-off-by: Quentin Perret <qperret@google.com>
We now allow donations of MMIO ranges, let's also allow modules to
change host stage-2 permissions.
Bug: 264070847
Change-Id: Ia72678bb27559d9a7963dbc5ffb5a101efcbbad2
Signed-off-by: Quentin Perret <qperret@google.com>
There shouldn't be any reason to ever need allocating from the host
stage-2 pool during mem aborts now that the base page-table structure
is pinned. To prevent future regressions in this area, introduce a new
sanity check that will warn when hyp_page_alloc() is used from the
wrong paths.
Bug: 264070847
Change-Id: I7a7c606fe01558790e4ffcd3534f8976caf48bd0
Signed-off-by: Quentin Perret <qperret@google.com>
The MMIO register space for IOMMUs controlled by the hypervisor is
currently unmapped from the host stage-2, and we rely on the host abort
path to not accidentally map them. However, this approach becomes
increasingly difficult to maintain as we introduce support for donating
MMIO regions and not just memory -- nothing prevents the host from
donating a protected MMIO register to another entity for example.
Now that MMIO donations are possible, let's use the proper
host-donate-hyp machinery to implement this. As a nice side effect, this
guarantees the host stage-2 page-table is annotated with hyp ownership
for those IOMMU regions, which guarantees the core range alignment
feature in the host mem abort path will do the right thing without
requiring a second pass in the IOMMU code. This also turns the host
stage-2 PTEs into "non-default" entries, hence avoiding issues with the
coalescing code looking forward.
Bug: 264070847
Change-Id: I1fad1b1be36f3b654190a912617e780141945a8f
Signed-off-by: Quentin Perret <qperret@google.com>
We now support donations of MMIO ranges to the hypervisor. Make sure to
update the donation logic to correctly map these pages with device
mappings.
Bug: 264070847
Change-Id: I36558f05ed47d1e3dc06e4e24151241474b4ff77
Signed-off-by: Quentin Perret <qperret@google.com>
We're now guaranteed by construction to not require structural changes
to the host stage-2 page-table from the host memory abort path, so let's
use the low-level __host_stage2_idmap() function directly instead of the
higher-level wrapper that attempts page recycling when running out of
memory.
Bug: 264070847
Change-Id: I2db34777386931bfb3f93ea3b3e51e1e2a10ea79
Signed-off-by: Quentin Perret <qperret@google.com>
Now that the host stage-2 page-table is entirely pre-populated in
__pkvm_init_finalize(), we know that by the end of this function, the
structure of the page-table will remain stable until the host calls into
the hypervisor to request e.g. a page-table change (by e.g. running a
guest). This does not necessarily mean that no host mem aborts will
occur -- there may be null PTEs in the host stage-2 due to collapsed
block mappings from fix_host_ownership() for example -- but all those
aborts should be trivially handled without requiring structural changes
to the page-table. This has the nice side effect of guaranteeing that
host_mem_abort() will not allocate from the host stage-2 pool. In order
to ensure this desirable property is retained for the lifetime of the
system even in the presence of the coalescing feature, let's 'pin' the
structure of the page-table as-is by taking an additional reference
from each table entry.
Bug: 264070847
Change-Id: If870d7485cc38f6ad714901e710287911f111897
Signed-off-by: Quentin Perret <qperret@google.com>
We will soon need to use kvm_pte_follow() from outside pgtable.c, so
move it to the header file as static inline.
Bug: 264070847
Change-Id: I319dff1b352a4acd8d9a5cc74acb5f1758be358f
Signed-off-by: Quentin Perret <qperret@google.com>
We will soon attempt to avoid any memory allocations from the host mem
abort path. In order to pave the way towards supporting this, let's
pre-populate the host stage-2 for the entire address space using as many
block mappings as possible. Some of these mappings may need to be
collapsed shortly after from fix_host_ownership() for example, so this
doesn't guarantee the absence of memory aborts altogether, but helps
getting the structure of the page-table in the right shape early on.
Bug: 264070847
Change-Id: Ib3ce25c893f779437ce473d64e08e8876870556c
Signed-off-by: Quentin Perret <qperret@google.com>
The fix_host_ownership() path walks the hypervisor's stage-1 page-table
to adjust the host's stage-2 accordingly. However, this is done before
the hyp stage-1 refcount has been fixed up, and before the hyp percpu
fixmap has been created. This all works right now as we start off with
an empty host stage-2, so none of the changes require the usage of the
fixmap for e.g. CMOs.
To prepare the ground for doing fix_host_ownership() with a non-empty
page-table, finalize the hyp stage-1 upfront.
Bug: 264070847
Change-Id: I6aff3ac2f835be3fb3fba7660540c0a9b99c097d
Signed-off-by: Quentin Perret <qperret@google.com>
When recycling host stage-2 page-table pages, we currently blindly
unmap all 'non-moveable' regions. To prepare the ground for allowing the
mapping of those regions with non-default attributes, let's switch to
using the recently introduced kvm_pgtable_stage2_reclaim_leaf() helper
which will only reclaim pages containing PTEs with default attributes.
Bug: 264070847
Change-Id: I4a441a20abe84d2405efcfa403908078c10be841
Signed-off-by: Quentin Perret <qperret@google.com>
We will soon improve the mechanism by which the host's stage-2
page-table pages are recycled whenever its pool runs out of pages. To
prepare the ground for this, introduce a new helper function in the
page-table code that allows reclaiming leaf pages that don't hold counted
PTEs.
Bug: 264070847
Change-Id: Ie172bf11f2980e45bc908002368759f74f42d195
Signed-off-by: Quentin Perret <qperret@google.com>
Enable CONFIG_BLK_CGROUP_IOCOST to help control IO resources.
Bug: 188749221
Bug: 285074916
Change-Id: I611b3ff5929d0a998fa6241967887803636b7588
(cherry picked from commit 19316b4889)
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Expose usb device state to userland as the information is useful in
detecting non-compliant setups and diagnosing enumeration failures.
For example:
- End-to-end signal integrity issues: the device would fail port reset
repeatedly and thus be stuck in POWERED state.
- Charge-only cables (missing D+/D- lines): the device would never enter
POWERED state as the HC would not see any pullup.
What's the status quo?
We do have error logs such as "Cannot enable. Maybe the USB cable is bad?"
to flag potential setup issues, but there's no good way to expose them to
userspace.
Why add a sysfs entry in struct usb_port instead of struct usb_device?
The struct usb_device is not added to the system with device_add() until it's in
ADDRESS state hence we would miss the first two states. The struct
usb_port is a better place to keep the information because its life
cycle is longer than the struct usb_device that is attached to the port.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306042228.e532af6e-oliver.sang@intel.com
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Change-Id: Ib78d4c7b4b1db402828c92dc792838a1015f0f2c
Signed-off-by: Roy Luo <royluo@google.com>
Message-ID: <20230608015913.1679984-1-royluo@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(Backport conflicts: the adjacent sysfs entry is different in
ABI documentation)
Bug: 285199434
(cherry picked from commit 83cb2604f6
https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git/ usb-testing)
Change-Id: I1a0da6686e57be05ef10ae98892599eb37074014
Signed-off-by: Roy Luo <royluo@google.com>
Add symbol list for oplus in android/abi_gki_aarch64_oplus
1 function symbol(s) added
'int public_key_verify_signature(const struct public_key*, const struct public_key_signature*)'
Bug: 286993971
Change-Id: I748437d61b46b6ee3736b3c7df36ab7249b187f6
Signed-off-by: zuoyonghua <zuoyonghua@oppo.com>
Presently, when a report is processed, its proposed size, provided by
the user of the API (as Report Size * Report Count) is compared against
the subsystem default HID_MAX_BUFFER_SIZE (16k). However, some
low-level HID drivers allocate a reduced amount of memory to their
buffers (e.g. UHID only allocates UHID_DATA_MAX (4k) buffers), rendering
this check inadequate in some cases.
In these circumstances, if the received report ends up being smaller
than the proposed report size, the remainder of the buffer is zeroed.
That is, the space between sizeof(csize) (size of the current report)
and the rsize (size proposed i.e. Report Size * Report Count), which can
be handled up to HID_MAX_BUFFER_SIZE (16k). Meaning that memset()
shoots straight past the end of the buffer boundary and starts zeroing
out in-use values, often resulting in calamity.
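The overrun and one way to bound it can be sketched in userspace. This is not
the patch itself: demo_zero_report_tail() and its parameters are hypothetical,
and the only assumption is that the caller knows the real allocation size of
the buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero the part of a report buffer that the received data did not fill.
 * buf_size is the real allocation size, csize the bytes received and
 * rsize the size proposed by the report descriptor. All names are ours.
 * Returns the number of bytes zeroed.
 */
static size_t demo_zero_report_tail(unsigned char *buf, size_t buf_size,
				    size_t csize, size_t rsize)
{
	/* Never trust rsize beyond what was actually allocated. */
	if (rsize > buf_size)
		rsize = buf_size;
	if (csize >= rsize)
		return 0;

	memset(buf + csize, 0, rsize - csize);
	return rsize - csize;
}
```

Without the clamp, a 4k buffer with an rsize of up to 16k would have memset()
writing up to 12k past its end.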
This is an Android specific patch which essentially achieves the same
goal as the recently reverted upstream commits b1a37ed00d ("HID:
core: Provide new max_buffer_size attribute to over-ride the default")
and 1c5d422124 ("HID: uhid: Over-ride the default maximum data buffer
value with our own") only it does so in an ABI friendly (albeit more
hacky) way.
Bug: 260007429
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I1f56673bb67b63ab14b58634bfe74a04b0758e3d
This reverts commit 52ace503ecf894ec2f63b8137f181868ea61d95a.
The issue that required the revert is fixed by:
0257d9908d ("maple_tree: make maple state reusable after mas_empty_area()")
Bug: 281094761
Change-Id: I97b45525689097d0c1369f81a994d50f0662c9c2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR is introduced for hosts that
implement a different interrupt topology from the UFSHCI 4.0 spec.
Some hosts raise a per hw queue interrupt in addition to
CQES (traditional) when ESI is disabled.
Enabling this quirk disables CQES and uses only the per hw queue
interrupt.
Bug: 267974767
Link: https://lore.kernel.org/all/20230612085817.12275-2-powen.kao@mediatek.com/
Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com>
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Change-Id: I42b24f668ed501bc6c7511898d5b90e8d9fd1492
[ Upstream commit 8fe72b76db ]
There was a bug where this code forgot to unlock the tdev->mutex if the
kzalloc() failed. Fix this issue, by moving the allocation outside the
lock.
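The shape of the fix can be sketched in userspace as follows. struct demo_dev,
demo_lock(), demo_unlock() and demo_write() are hypothetical stand-ins, the
lock is modelled as a counter rather than a real mutex, and simulate_oom
exists only to force the failure path.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device guarded by a mutex. */
struct demo_dev {
	int lock_depth;		/* 1 while the modelled mutex is held */
	char *message;
};

static void demo_lock(struct demo_dev *d)   { d->lock_depth++; }
static void demo_unlock(struct demo_dev *d) { d->lock_depth--; }

/* simulate_oom forces the allocation to fail, for demonstration only. */
static int demo_write(struct demo_dev *d, const char *src, size_t len,
		      int simulate_oom)
{
	char *msg = simulate_oom ? NULL : malloc(len + 1);

	/* Failing here needs no unlock: the lock has not been taken yet. */
	if (!msg)
		return -ENOMEM;
	memcpy(msg, src, len);
	msg[len] = '\0';

	demo_lock(d);
	free(d->message);	/* replace any previous message */
	d->message = msg;
	demo_unlock(d);
	return 0;
}
```

Because the only early return happens before demo_lock(), no error path can
leave the lock held.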
Bug: 275340532
Fixes: 2d1e952a2b ("mailbox: mailbox-test: Fix potential double-free in mbox_test_message_write()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Lee Jones <lee@kernel.org>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 7d233f9359)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I7a4a1bf06abbb2092aceb72610e3f894b2bfbf0f