linux

mirror of https://github.com/hardkernel/linux.git synced 2026-05-18 01:48:55 +09:00

Author	SHA1	Message	Date
Li Yang	a019ab4067	arm64: defconfig: Enable FSL_EDMA driver Enables the FSL EDMA driver by default. This also works around an issue that imx-i2c driver keeps deferring the probe because of the DMA is not ready. And currently the DMA engine framework can not correctly tell if the DMA channels will truly become available later (it will never be available if the DMA driver is not enabled). This will cause indefinite messages like below: [ 3.335829] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported [ 3.344455] ina2xx 0-0040: power monitor ina220 (Rshunt = 1000 uOhm) [ 3.350917] lm90 0-004c: 0-004c supply vcc not found, using dummy regulator [ 3.362089] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported [ 3.370741] ina2xx 0-0040: power monitor ina220 (Rshunt = 1000 uOhm) [ 3.377205] lm90 0-004c: 0-004c supply vcc not found, using dummy regulator [ 3.388455] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported ..... Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org>	2019-06-18 14:32:43 +08:00
Bjorn Andersson	73786fea02	arm64: dts: qcom: qcs404-evb: Enable PCIe Enable the PCIe PHY and controller found on the QCS404 EVB. Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>	2019-06-17 23:23:45 -07:00
Bjorn Andersson	431f64642c	arm64: dts: qcom: qcs404: Add PCIe related nodes The QCS404 has a PCIe2 PHY and a Qualcomm PCIe controller, define these to for the platform. Reviewed-by: Niklas Cassel <niklas.cassel@linaro.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>	2019-06-17 23:23:45 -07:00
David S. Miller	13091aa305	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 20:20:36 -07:00
Gal Pressman	529254340c	RDMA/efa: Fix success return value in case of error Existing code would mistakenly return success in case of error instead of a proper return value. Fixes: `e9c6c53730` ("RDMA/efa: Add common command handlers") Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:35:21 -04:00
Mike Marciniszyn	942a899335	IB/hfi1: Handle port down properly in pio The call to sc_buffer_alloc currently returns NULL (no buffer) or a buffer descriptor. There is a third case when the port is down. Currently that returns NULL and this prevents the caller from properly handling the sc_buffer_alloc() failure. A verbs code link test after the call is racy so the indication needs to come from the state check inside the allocation routine to be valid. Fix by encoding the ECOMM failure like SDMA. IS_ERR_OR_NULL() tests are added at all call sites. For verbs send, this needs to treat any error by returning a completion without any MMIO copy. Fixes: `7724105686` ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Mike Marciniszyn	099a884ba4	IB/hfi1: Handle wakeup of orphaned QPs for pio Once a send context is taken down due to a link failure, any QPs waiting for pio credits will stay on the waitlist indefinitely. Fix by wakeing up all QPs linked to piowait list. Fixes: `7724105686` ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Mike Marciniszyn	f972775b1c	IB/hfi1: Wakeup QPs orphaned on wait list after flush Once an SDMA engine is taken down due to a link failure, any waiting QPs that do not have outstanding descriptors in the ring will stay on the dmawait list as long as the port is down. Since there is no timer running, they will stay there for a long time. The fix is to wake up all iowaits linked to dmawait. The send engine will build and post packets that get flushed back. Fixes: `7724105686` ("IB/hfi1: add driver files") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Mike Marciniszyn	4bb02e9572	IB/hfi1: Use aborts to trigger RC throttling SDMA and pio flushes will cause a lot of packets to be transmitted after a link has gone down, using a lot of CPU to retransmit packets. Fix for RC QPs by recognizing the flush status and: - Forcing a timer start - Putting the QP into a "send one" mode Fixes: `7724105686` ("IB/hfi1: add driver files") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Mike Marciniszyn	9755f72496	IB/hfi1: Create inline to get extended headers This paves the way for another patch that reacts to a flush sdma completion for RC. Fixes: `81cd3891f0` ("IB/hfi1: Add support for 16B Management Packets") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Mike Marciniszyn	3230f4a8d4	IB/hfi1: Silence txreq allocation warnings The following warning can happen when a memory shortage occurs during txreq allocation: [10220.939246] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) [10220.939246] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0018.C4.072020161249 07/20/2016 [10220.939247] cache: mnt_cache, object size: 384, buffer size: 384, default order: 2, min order: 0 [10220.939260] Workqueue: hfi0_0 _hfi1_do_send [hfi1] [10220.939261] node 0: slabs: 1026568, objs: 43115856, free: 0 [10220.939262] Call Trace: [10220.939262] node 1: slabs: 820872, objs: 34476624, free: 0 [10220.939263] dump_stack+0x5a/0x73 [10220.939265] warn_alloc+0x103/0x190 [10220.939267] ? wake_all_kswapds+0x54/0x8b [10220.939268] __alloc_pages_slowpath+0x86c/0xa2e [10220.939270] ? __alloc_pages_nodemask+0x2fe/0x320 [10220.939271] __alloc_pages_nodemask+0x2fe/0x320 [10220.939273] new_slab+0x475/0x550 [10220.939275] ___slab_alloc+0x36c/0x520 [10220.939287] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939299] ? __get_txreq+0x54/0x160 [hfi1] [10220.939310] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939312] __slab_alloc+0x40/0x61 [10220.939323] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939325] kmem_cache_alloc+0x181/0x1b0 [10220.939336] hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939348] ? hfi1_verbs_send_dma+0x386/0xa10 [hfi1] [10220.939359] ? find_prev_entry+0xb0/0xb0 [hfi1] [10220.939371] hfi1_do_send+0x1d9/0x3f0 [hfi1] [10220.939372] process_one_work+0x171/0x380 [10220.939374] worker_thread+0x49/0x3f0 [10220.939375] kthread+0xf8/0x130 [10220.939377] ? max_active_store+0x80/0x80 [10220.939378] ? kthread_bind+0x10/0x10 [10220.939379] ret_from_fork+0x35/0x40 [10220.939381] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) The shortage is handled properly so the message isn't needed. Silence by adding the no warn option to the slab allocation. Fixes: `45842abbb2` ("staging/rdma/hfi1: move txreq header code") Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Mike Marciniszyn	cf131a8196	IB/hfi1: Avoid hardlockup with flushlist_lock Heavy contention of the sde flushlist_lock can cause hard lockups at extreme scale when the flushing logic is under stress. Mitigate by replacing the item at a time copy to the local list with an O(1) list_splice_init() and using the high priority work queue to do the flushes. Fixes: `7724105686` ("IB/hfi1: add driver files") Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2019-06-17 21:15:40 -04:00
Gustavo A. R. Silva	f0553dcb97	tracepoint: Use struct_size() in kmalloc() One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct tp_probes { ... struct tracepoint_func probes[0]; }; instance = kmalloc(sizeof(sizeof(struct tp_probes) + sizeof(struct tracepoint_func) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, probes, count) GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2019-06-17 21:13:32 -04:00
Suraj Jitindar Singh	84b028243e	KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode The hcall H_SET_DAWR is used by a guest to set the data address watchpoint register (DAWR). This hcall is handled in the host in kvmppc_h_set_dawr() which can be called in either real mode on the guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S, or in virtual mode when called from kvmppc_pseries_do_hcall() in book3s_hv.c. The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in the vcpu struct accordingly and then also writes the respective values into the DAWR and DAWRX registers directly. It is necessary to write the registers directly here when calling the function in real mode since the path to re-enter the guest won't do this. However when in virtual mode the host DAWR and DAWRX values have already been restored, and so writing the registers would overwrite these. Additionally there is no reason to write the guest values here as these will be read from the vcpu struct and written to the registers appropriately the next time the vcpu is run. This also avoids the case when handling h_set_dawr for a nested guest where the guest hypervisor isn't able to write the DAWR and DAWRX registers directly and must rely on the real hypervisor to do this for it when it calls H_ENTER_NESTED. Fixes: `c1fe190c06` ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2019-06-18 10:21:19 +10:00
Michael Neuling	fabb2efcf0	KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr() Commit `c1fe190c06` ("powerpc: Add force enable of DAWR on P9 option") screwed up some assembler and corrupted a pointer in r3. This resulted in crashes like the below: BUG: Kernel NULL pointer dereference at 0x000013bf Faulting instruction address: 0xc00000000010b044 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3 NIP: c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4 REGS: c00000179b397710 TRAP: 0300 Not tainted (5.2.0-rc4+) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 42244842 XER: 00000000 CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0 GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0 ... NIP kvmppc_h_set_dabr+0x50/0x68 LR kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv] Call Trace: 0xc0000017f11c0000 (unreliable) kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] kvmppc_vcpu_run+0x34/0x48 [kvm] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] kvm_vcpu_ioctl+0x460/0x850 [kvm] do_vfs_ioctl+0xe4/0xb40 ksys_ioctl+0xc4/0x110 sys_ioctl+0x28/0x80 system_call+0x5c/0x70 Instruction dump: 4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff 4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6 Fix the bug by only changing r3 when we are returning immediately. Fixes: `c1fe190c06` ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2019-06-18 10:19:22 +10:00
Mika Westerberg	000dd5316e	PCI: Do not poll for PME if the device is in D3cold PME polling does not take into account that a device that is directly connected to the host bridge may go into D3cold as well. This leads to a situation where the PME poll thread reads from a config space of a device that is in D3cold and gets incorrect information because the config space is not accessible. Here is an example from Intel Ice Lake system where two PCIe root ports are in D3cold (I've instrumented the kernel to log the PMCSR register contents): [ 62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff [ 62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff Since 0xffff is interpreted so that PME is pending, the root ports will be runtime resumed. This repeats over and over again essentially blocking all runtime power management. Prevent this from happening by checking whether the device is in D3cold before its PME status is read. Fixes: `71a83bd727` ("PCI/PM: add runtime PM support to PCIe port") Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Cc: 3.6+ <stable@vger.kernel.org> # v3.6+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2019-06-18 01:40:41 +02:00
Mika Westerberg	c2bf1fc212	PCI: Add missing link delays required by the PCIe spec Currently Linux does not follow PCIe spec regarding the required delays after reset. A concrete example is a Thunderbolt add-in-card that consists of a PCIe switch and two PCIe endpoints: +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller +-01.0-[04-36]-- DS hotplug port +-02.0-[37]----00.0 xHCI controller \-04.0-[38-6b]-- DS hotplug port The root port (1b.0) and the PCIe switch downstream ports are all PCIe gen3 so they support 8GT/s link speeds. We wait for the PCIe hierarchy to enter D3cold (runtime): pcieport 0000:00:1b.0: power state changed by ACPI to D3cold When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the PCIe switch is put to reset and its power is re-applied. This means that we must follow the rules in PCIe 4.0 section 6.6.1. For the PCIe gen3 ports we are dealing with here, the following applies: With a Downstream Port that supports Link speeds greater than 5.0 GT/s, software must wait a minimum of 100 ms after Link training completes before sending a Configuration Request to the device immediately below that Port. Software can determine when Link training completes by polling the Data Link Layer Link Active bit or by setting up an associated interrupt (see Section 6.7.3.3). Translating this into the above topology we would need to do this (DLLLA stands for Data Link Layer Link Active): pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0 pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0 pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0 I've instrumented the kernel with additional logging so we can see the actual delays the kernel performs: pcieport 0000:00:1b.0: power state changed by ACPI to D0 pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms pcieport 0000:00:1b.0: waking up bus pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60) ... pcieport 0000:00:1b.0: PME# disabled pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) ... pcieport 0000:01:00.0: PME# disabled pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) ... pcieport 0000:02:00.0: PME# disabled pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) ... pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407) pcieport 0000:02:01.0: PME# disabled pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) ... pcieport 0000:02:02.0: PME# disabled pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) ... pcieport 0000:02:04.0: PME# disabled pcieport 0000:02:01.0: PME# enabled pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms pcieport 0000:02:04.0: PME# enabled pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000) ... thunderbolt 0000:03:00.0: PME# disabled xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000) ... xhci_hcd 0000:37:00.0: PME# disabled For the switch upstream port (01:00.0) we wait for 100ms but not taking into account the DLLLA requirement. We then wait 10ms for D3hot -> D0 transition of the root port and the two downstream hotplug ports. This means that we deviate from what the spec requires. Performing the same check for system sleep (s2idle) transitions we can see following when resuming from s2idle: pcieport 0000:00:1b.0: power state changed by ACPI to D0 pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60) ... pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) ... pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0) pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0) pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff) pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1) pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60) pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0) pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60) pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60) pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0) pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1) pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60) pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001) pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0) pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702) pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001) pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00) pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1) pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400) pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151) pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00) pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020) pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161) pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402) pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1) pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802) pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407) pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302) pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020) pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020) pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020) pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407) pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407) pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407) xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000) ... thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000) This is even worse. None of the mandatory delays are performed. If this would be S3 instead of s2idle then according to PCI FW spec 3.2 section 4.6.8. there is a specific _DSM that allows the OS to skip the delays but this platform does not provide the _DSM and does not go to S3 anyway so no firmware is involved that could already handle these delays. In this particular Intel Coffee Lake platform these delays are not actually needed because there is an additional delay as part of the ACPI power resource that is used to turn on power to the hierarchy but since that additional delay is not required by any of standards (PCIe, ACPI) it is not present in the Intel Ice Lake, for example where missing the mandatory delays causes pciehp to start tearing down the stack too early (links are not yet trained). For this reason, change the PCIe portdrv PM resume hooks so that they perform the mandatory delays before the downstream component gets resumed. We perform the delays before port services are resumed because otherwise pciehp might find that the link is not up (even if it is just training) and tears-down the hierarchy. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2019-06-18 01:40:41 +02:00
David S. Miller	f97252a8c3	Merge branch 'UDP-GSO-audit-tests' Fred Klassen says: ==================== UDP GSO audit tests Updates to UDP GSO selftests ot optionally stress test CMSG subsytem, and report the reliability and performance of both TX Timestamping and ZEROCOPY messages. ==================== Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:30:38 -07:00
Fred Klassen	4ffc37f5c0	net/udpgso_bench.sh test fails on error Ensure that failure on any individual test results in an overall failure of the test script. Signed-off-by: Fred Klassen <fklassen@appneta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:30:37 -07:00
Fred Klassen	ade90d69ff	net/udpgso_bench.sh add UDP GSO audit tests Audit tests count the total number of messages sent and compares with total number of CMSG received on error queue. Example: udp gso zerocopy timestamp audit udp rx: 1599 MB/s 1166414 calls/s udp tx: 1615 MB/s 27395 calls/s 27395 msg/s udp rx: 1634 MB/s 1192261 calls/s udp tx: 1633 MB/s 27699 calls/s 27699 msg/s udp rx: 1633 MB/s 1191358 calls/s udp tx: 1631 MB/s 27678 calls/s 27678 msg/s Summary over 4.000 seconds... sum udp tx: 1665 MB/s 82772 calls (27590/s) 82772 msgs (27590/s) Tx Timestamps: 82772 received 0 errors Zerocopy acks: 82772 received Errors are thrown if CMSG count does not equal send count, example: Summary over 4.000 seconds... sum tcp tx: 7451 MB/s 493706 calls (123426/s) 493706 msgs (123426/s) ./udpgso_bench_tx: Unexpected number of Zerocopy completions: 493706 expected 493704 received Also reduce individual test time from 4 to 3 seconds so that overall test time does not increase significantly. v3: Enhancements as per Willem de Bruijn <willemb@google.com> - document -P option for TCP audit Signed-off-by: Fred Klassen <fklassen@appneta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:30:37 -07:00
Fred Klassen	79ebc3c260	net/udpgso_bench_tx: options to exercise TX CMSG This enhancement adds options that facilitate load testing with additional TX CMSG options, and to optionally print results of various send CMSG operations. These options are especially useful in isolating situations where error-queue messages are lost when combined with other CMSG operations (e.g. SO_ZEROCOPY). New options: -a - count all CMSG messages and match to sent messages -T - add TX CMSG that requests TX software timestamps -H - similar to -T except request TX hardware timestamps -P - call poll() before reading error queue -v - print detailed results v2: Enhancements as per Willem de Bruijn <willemb@google.com> - Updated control and buffer parameters for recvmsg - poll() parameter cleanup - fail on bad audit results - remove TOS options - improved reporting v3: Enhancements as per Willem de Bruijn <willemb@google.com> - add SOF_TIMESTAMPING_OPT_TSONLY to eliminate MSG_TRUNC - general code cleanup Signed-off-by: Fred Klassen <fklassen@appneta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:30:37 -07:00
Linus Torvalds	29f785ff76	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "MS_MOVE regression fix + breakage in fsmount(2) (also introduced in this cycle, along with fsmount(2) itself). I'm still digging through the piles of mail, so there might be more fixes to follow, but these two are obvious and self-contained, so there's no point delaying those..." * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/namespace: fix unprivileged mount propagation vfs: fsmount: add missing mntget()	2019-06-17 16:28:28 -07:00
David S. Miller	4bd366cece	Merge branch 'net-ipv4-remove-erroneous-advancement-of-list-pointer' Florian Westphal says: ==================== net: ipv4: remove erroneous advancement of list pointer Tariq reported a soft lockup on net-next that Mellanox was able to bisect to `2638eb8b50` ("net: ipv4: provide __rcu annotation for ifa_list"). While reviewing above patch I found a regression when addresses have a lifetime specified. Second patch extends rtnetlink.sh to trigger crash (without first patch applied). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:27:43 -07:00
Florian Westphal	3cfa148826	selftests: rtnetlink: add addresses with fixed life time This exercises kernel code path that deal with addresses that have a limited lifetime. Without previous fix, this triggers following crash on net-next: BUG: KASAN: null-ptr-deref in check_lifetime+0x403/0x670 Read of size 8 at addr 0000000000000010 by task kworker [..] Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:27:42 -07:00
Florian Westphal	40008e9211	net: ipv4: remove erroneous advancement of list pointer Causes crash when lifetime expires on an adress as garbage is dereferenced soon after. This used to look like this: for (ifap = &ifa->ifa_dev->ifa_list; ifap != NULL; ifap = &(ifap)->ifa_next) { if (ifap == ifa) ... but this was changed to: struct in_ifaddr tmp; ifap = &ifa->ifa_dev->ifa_list; tmp = rtnl_dereference(ifap); while (tmp) { tmp = rtnl_dereference(tmp->ifa_next); // Bogus if (rtnl_dereference(ifap) == ifa) { ... ifap = &tmp->ifa_next; // Can be NULL tmp = rtnl_dereference(*ifap); // Dereference } } Remove the bogus assigment/list entry skip. Fixes: `2638eb8b50` ("net: ipv4: provide __rcu annotation for ifa_list") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:27:42 -07:00
Arnd Bergmann	78fe8a28fb	net: dsa: sja1105: fix ptp link error Due to a reversed dependency, it is possible to build the lower ptp driver as a loadable module and the actual driver using it as built-in, causing a link error: drivers/net/dsa/sja1105/sja1105_spi.o: In function `sja1105_static_config_upload': sja1105_spi.c:(.text+0x6f0): undefined reference to `sja1105_ptp_reset' drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x2d4): undefined reference to `sja1105et_ptp_cmd' drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x604): undefined reference to `sja1105pqrs_ptp_cmd' drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_remove': sja1105_main.c:(.text+0x8d4): undefined reference to `sja1105_ptp_clock_unregister' drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_rxtstamp_work': sja1105_main.c:(.text+0x964): undefined reference to `sja1105_tstamp_reconstruct' drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_setup': sja1105_main.c:(.text+0xb7c): undefined reference to `sja1105_ptp_clock_register' drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_port_deferred_xmit': sja1105_main.c:(.text+0x1fa0): undefined reference to `sja1105_ptpegr_ts_poll' sja1105_main.c:(.text+0x1fc4): undefined reference to `sja1105_tstamp_reconstruct' drivers/net/dsa/sja1105/sja1105_main.o:(.rodata+0x5b0): undefined reference to `sja1105_get_ts_info' Change the Makefile logic to always build the ptp module the same way as the rest. Another option would be to just add it to the same module and remove the exports, but I don't know if there was a good reason to keep them separate. Fixes: `bb77f36ac2` ("net: dsa: sja1105: Add support for the PTP clock") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:25:29 -07:00
Arnd Bergmann	c63d1e5c2d	net: stmmac: fix unused-variable warning When building without CONFIG_OF, we get a harmless build warning: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_phy_setup': drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:973:22: error: unused variable 'node' [-Werror=unused-variable] struct device_node *node = priv->plat->phy_node; Reword it so we always use the local variable, by making it the fwnode pointer instead of the device_node. Fixes: `74371272f9` ("net: stmmac: Convert to phylink and remove phylib logic") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-17 16:24:12 -07:00
Linus Torvalds	da0f382029	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...	2019-06-17 15:55:34 -07:00
Mitch Williams	efa14c3985	iavf: allow null RX descriptors In some circumstances, the hardware can hand us a null receive descriptor, with no data attached but otherwise valid. Unfortunately, the driver was ill-equipped to handle such an event, and would stop processing packets at that point. To fix this, use the Descriptor Done bit instead of the size to determine whether or not a descriptor is ready to be processed. Add some checks to allow for unused buffers. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Paul Greenwalt	68dfe6348f	iavf: add call to iavf_[add\|del]_cloud_filter Add call to iavf_add_cloud_filter and iavf_del_cloud_filter from iavf_process_aq_command to clear aq_required IAVF_FLAG_AQ_ADD_CLOUD_FILTER and IAVF_FLAG_AQ_DEL_CLOUD_FILTER bits. aq_required IAVF_FLAG_AQ_DEL_CLOUD_FILTER bit is being set in iavf_down and iavf_delete_clsflower, and are never cleared. aq_required IAVF_FLAG_AQ_ADD_CLOUD_FILTER bit is being set in iavf_handle_reset and iavf_configure_clsflower, and are never cleared. Since the aq_required is not zero, iavf_watchdog_task is setting the queue_delayed_work to 20 msec instead of the longer delay. Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Jakub Pawlak	b66c7bc1cd	iavf: Refactor init state machine Cleanup of init state machine, move state specific code to separate functions and rewrite the iavf_init_task() function. Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Jan Sokolowski	bac8486116	iavf: Refactor the watchdog state machine Refactor the watchdog state machine implementation. Add the additional state __IAVF_COMM_FAILED to process the PF communication fails. Prepare the watchdog state machine to integrate with init state machine. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Jakub Pawlak	fdd4044ffd	iavf: Remove timer for work triggering, use delaying work instead Remove the watchdog timer, instead declare watchdog task as delayed work and use dedicated workqueue to service driver tasks. The dedicated driver workqueue iavf_wq is common for all driver instances. Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Jakub Pawlak	b476b0030e	iavf: Move commands processing to the separate function Move the commands processing outside the watchdog_task() function. This reduce length and complexity of the function which is mainly designed to process the watchdog state machine. Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Avinash Dayanand	16e00c25ac	iavf: Fix the math for valid length for ADq enable There was a calculation error in virtchnl regarding the valid length which was fixed recently and a corresponding change needs to go into the code while we enable ADq. Signed-off-by: Avinash Dayanand <avinash.dayanand@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:26 -07:00
Aleksandr Loktionov	f0a48fb441	iavf: Change GFP_KERNEL to GFP_ATOMIC in kzalloc() iavf_add_vlan() is being called in atomic context so kzalloc() needs GFP_ATOMIC. This patch fixes it. Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:25 -07:00
Mitch Williams	88ec7308ea	iavf: wait longer for close to complete On some hardware/driver/architecture combinations, it may take longer than 200msec for all close operations to be completed, causing a spurious error message to be logged. Increase the timeout value to 500msec to avoid this erroneous error. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:25 -07:00
Mitch Williams	168d91cf2a	iavf: use signed variable The counter variable in iavf_clean_tx_irq starts out negative and climbs to 0. So allocating it as u16 is actually a really bad idea that just happens to work because the value underflows and overflows consistently on most architectures. Replace the u16 with an int so signed math works as expected. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:25 -07:00
Akeem G Abodunrin	c2417a7b0e	iavf: Create VLAN tag elements starting from the first element This patch changes how VLAN tag are being populated and programmed into the HW - Instead of start adding VF VLAN tag from the last member of the element list, start from the first member of the list, until number of allowed VLAN tags is exhausted in the HW. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2019-06-17 15:39:25 -07:00
Daniel T. Lee	4d18f6de6a	samples: bpf: refactor header include path Currently, header inclusion in each file is inconsistent. For example, "libbpf.h" header is included as multiple ways. #include "bpf/libbpf.h" #include "libbpf.h" Due to commit `b552d33c80` ("samples/bpf: fix include path in Makefile"), $(srctree)/tools/lib/bpf/ path had been included during build, path "bpf/" in header isn't necessary anymore. This commit removes path "bpf/" in header inclusion. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:28:36 +02:00
Daniel T. Lee	fa206dccd8	samples: bpf: remove unnecessary include options in Makefile Due to recent change of include path at commit `b552d33c80` ("samples/bpf: fix include path in Makefile"), some of the previous include options became unnecessary. This commit removes duplicated include options in Makefile. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:28:36 +02:00
Daniel Borkmann	32b88d3743	Merge branch 'bpf-libbpf-btf-defined-maps' Andrii Nakryiko says: ==================== This patch set implements initial version (as discussed at LSF/MM2019 conference) of a new way to specify BPF maps, relying on BTF type information, which allows for easy extensibility, preserving forward and backward compatibility. See details and examples in description for patch #6. [0] contains an outline of follow up extensions to be added after this basic set of features lands. They are useful by itself, but also allows to bring libbpf to feature-parity with iproute2 BPF loader. That should open a path forward for BPF loaders unification. Patch #1 centralizes commonly used min/max macro in libbpf_internal.h. Patch #2 extracts .BTF and .BTF.ext loading loging from elf_collect(). Patch #3 simplifies elf_collect() error-handling logic. Patch #4 refactors map initialization logic into user-provided maps and global data maps, in preparation to adding another way (BTF-defined maps). Patch #5 adds support for map definitions in multiple ELF sections and deprecates bpf_object__find_map_by_offset() API which doesn't appear to be used anymore and makes assumption that all map definitions reside in single ELF section. Patch #6 splits BTF intialization from sanitization/loading into kernel to preserve original BTF at the time of map initialization. Patch #7 adds support for BTF-defined maps. Patch #8 adds new test for BTF-defined map definition. Patches #9-11 convert test BPF map definitions to use BTF way. [0] https://lore.kernel.org/bpf/CAEf4BzbfdG2ub7gCi0OYqBrUoChVHWsmOntWAkJt47=FE+km+A@mail.gmail.com/ v1->v2: - more BTF-sanity checks in parsing map definitions (Song); - removed confusing usage of "attribute", switched to "field; - split off elf_collect() refactor from btf loading refactor (Song); - split selftests conversion into 3 patches (Stanislav): 1. test already relying on BTF; 2. tests w/ custom types as key/value (so benefiting from BTF); 3. all the rest tests (integers as key/value, special maps w/o BTF support). - smaller code improvements (Song); rfc->v1: - error out on unknown field by default (Stanislav, Jakub, Lorenz); ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:11:40 +02:00
Andrii Nakryiko	df0b779259	selftests/bpf: convert tests w/ custom values to BTF-defined maps Convert a bulk of selftests that have maps with custom (not integer) key and/or value. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:43 +02:00
Andrii Nakryiko	f654407481	selftests/bpf: switch BPF_ANNOTATE_KV_PAIR tests to BTF-defined maps Switch tests that already rely on BTF to BTF-defined map definitions. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:43 +02:00
Andrii Nakryiko	9e3d709c47	selftests/bpf: add test for BTF-defined maps Add file test for BTF-defined map definition. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:42 +02:00
Andrii Nakryiko	abd29c9314	libbpf: allow specifying map definitions using BTF This patch adds support for a new way to define BPF maps. It relies on BTF to describe mandatory and optional attributes of a map, as well as captures type information of key and value naturally. This eliminates the need for BPF_ANNOTATE_KV_PAIR hack and ensures key/value sizes are always in sync with the key/value type. Relying on BTF, this approach allows for both forward and backward compatibility w.r.t. extending supported map definition features. By default, any unrecognized attributes are treated as an error, but it's possible relax this using MAPS_RELAX_COMPAT flag. New attributes, added in the future will need to be optional. The outline of the new map definition (short, BTF-defined maps) is as follows: 1. All the maps should be defined in .maps ELF section. It's possible to have both "legacy" map definitions in `maps` sections and BTF-defined maps in .maps sections. Everything will still work transparently. 2. The map declaration and initialization is done through a global/static variable of a struct type with few mandatory and extra optional fields: - type field is mandatory and specified type of BPF map; - key/value fields are mandatory and capture key/value type/size information; - max_entries attribute is optional; if max_entries is not specified or initialized, it has to be provided in runtime through libbpf API before loading bpf_object; - map_flags is optional and if not defined, will be assumed to be 0. 3. Key/value fields should be a pointer to a type describing key/value. The pointee type is assumed (and will be recorded as such and used for size determination) to be a type describing key/value of the map. This is done to save excessive amounts of space allocated in corresponding ELF sections for key/value of big size. 4. As some maps disallow having BTF type ID associated with key/value, it's possible to specify key/value size explicitly without associating BTF type ID with it. Use key_size and value_size fields to do that (see example below). Here's an example of simple ARRAY map defintion: struct my_value { int x, y, z; }; struct { int type; int max_entries; int key; struct my_value value; } btf_map SEC(".maps") = { .type = BPF_MAP_TYPE_ARRAY, .max_entries = 16, }; This will define BPF ARRAY map 'btf_map' with 16 elements. The key will be of type int and thus key size will be 4 bytes. The value is struct my_value of size 12 bytes. This map can be used from C code exactly the same as with existing maps defined through struct bpf_map_def. Here's an example of STACKMAP definition (which currently disallows BTF type IDs for key/value): struct { __u32 type; __u32 max_entries; __u32 map_flags; __u32 key_size; __u32 value_size; } stackmap SEC(".maps") = { .type = BPF_MAP_TYPE_STACK_TRACE, .max_entries = 128, .map_flags = BPF_F_STACK_BUILD_ID, .key_size = sizeof(__u32), .value_size = PERF_MAX_STACK_DEPTH * sizeof(struct bpf_stack_build_id), }; This approach is naturally extended to support map-in-map, by making a value field to be another struct that describes inner map. This feature is not implemented yet. It's also possible to incrementally add features like pinning with full backwards and forward compatibility. Support for static initialization of BPF_MAP_TYPE_PROG_ARRAY using pointers to BPF programs is also on the roadmap. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:41 +02:00
Andrii Nakryiko	063183bf04	libbpf: split initialization and loading of BTF Libbpf does sanitization of BTF before loading it into kernel, if kernel doesn't support some of newer BTF features. This removes some of the important information from BTF (e.g., DATASEC and VAR description), which will be used for map construction. This patch splits BTF processing into initialization step, in which BTF is initialized from ELF and all the original data is still preserved; and sanitization/loading step, which ensures that BTF is safe to load into kernel. This allows to use full BTF information to construct maps, while still loading valid BTF into older kernels. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:41 +02:00
Andrii Nakryiko	db48814bd2	libbpf: identify maps by section index in addition to offset To support maps to be defined in multiple sections, it's important to identify map not just by offset within its section, but section index as well. This patch adds tracking of section index. For global data, we record section index of corresponding .data/.bss/.rodata ELF section for uniformity, and thus don't need a special value of offset for those maps. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:40 +02:00
Andrii Nakryiko	bf82927125	libbpf: refactor map initialization User and global data maps initialization has gotten pretty complicated and unnecessarily convoluted. This patch splits out the logic for global data map and user-defined map initialization. It also removes the restriction of pre-calculating how many maps will be initialized, instead allowing to keep adding new maps as they are discovered, which will be used later for BTF-defined map definitions. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:39 +02:00
Andrii Nakryiko	01b29d1dc9	libbpf: streamline ELF parsing error-handling Simplify ELF parsing logic by exiting early, as there is no common clean up path to execute. That makes it unnecessary to track when err was set and when it was cleared. It also reduces nesting in some places. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2019-06-18 00:10:39 +02:00

... 222 223 224 225 226 ...

858756 Commits