Commit Graph

858756 Commits

Author SHA1 Message Date
Li Yang
a019ab4067 arm64: defconfig: Enable FSL_EDMA driver
Enables the FSL EDMA driver by default.  This also works around an issue
that imx-i2c driver keeps deferring the probe because of the DMA is not
ready.  And currently the DMA engine framework can not correctly tell
if the DMA channels will truly become available later (it will never be
available if the DMA driver is not enabled).

This will cause indefinite messages like below:
[    3.335829] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported
[    3.344455] ina2xx 0-0040: power monitor ina220 (Rshunt = 1000 uOhm)
[    3.350917] lm90 0-004c: 0-004c supply vcc not found, using dummy regulator
[    3.362089] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported
[    3.370741] ina2xx 0-0040: power monitor ina220 (Rshunt = 1000 uOhm)
[    3.377205] lm90 0-004c: 0-004c supply vcc not found, using dummy regulator
[    3.388455] imx-i2c 2180000.i2c: can't get pinctrl, bus recovery not supported
.....

Signed-off-by: Li Yang <leoyang.li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
2019-06-18 14:32:43 +08:00
Bjorn Andersson
73786fea02 arm64: dts: qcom: qcs404-evb: Enable PCIe
Enable the PCIe PHY and controller found on the QCS404 EVB.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2019-06-17 23:23:45 -07:00
Bjorn Andersson
431f64642c arm64: dts: qcom: qcs404: Add PCIe related nodes
The QCS404 has a PCIe2 PHY and a Qualcomm PCIe controller, define these
to for the platform.

Reviewed-by: Niklas Cassel <niklas.cassel@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2019-06-17 23:23:45 -07:00
David S. Miller
13091aa305 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 20:20:36 -07:00
Gal Pressman
529254340c RDMA/efa: Fix success return value in case of error
Existing code would mistakenly return success in case of error instead
of a proper return value.

Fixes: e9c6c53730 ("RDMA/efa: Add common command handlers")
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:35:21 -04:00
Mike Marciniszyn
942a899335 IB/hfi1: Handle port down properly in pio
The call to sc_buffer_alloc currently returns NULL (no buffer) or
a buffer descriptor.

There is a third case when the port is down.  Currently that
returns NULL and this prevents the caller from properly handling the
sc_buffer_alloc() failure.  A verbs code link test after the call is
racy so the indication needs to come from the state check inside the allocation
routine to be valid.

Fix by encoding the ECOMM failure like SDMA.   IS_ERR_OR_NULL() tests
are added at all call sites.  For verbs send, this needs to treat any
error by returning a completion without any MMIO copy.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
099a884ba4 IB/hfi1: Handle wakeup of orphaned QPs for pio
Once a send context is taken down due to a link failure, any QPs waiting
for pio credits will stay on the waitlist indefinitely.

Fix by wakeing up all QPs linked to piowait list.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
f972775b1c IB/hfi1: Wakeup QPs orphaned on wait list after flush
Once an SDMA engine is taken down due to a link failure, any waiting QPs
that do not have outstanding descriptors in the ring will stay
on the dmawait list as long as the port is down.

Since there is no timer running, they will stay there for a long time.

The fix is to wake up all iowaits linked to dmawait. The send engine
will build and post packets that get flushed back.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
4bb02e9572 IB/hfi1: Use aborts to trigger RC throttling
SDMA and pio flushes will cause a lot of packets to be transmitted
after a link has gone down, using a lot of CPU to retransmit
packets.

Fix for RC QPs by recognizing the flush status and:
- Forcing a timer start
- Putting the QP into a "send one" mode

Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
9755f72496 IB/hfi1: Create inline to get extended headers
This paves the way for another patch that reacts to a
flush sdma completion for RC.

Fixes: 81cd3891f0 ("IB/hfi1: Add support for 16B Management Packets")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
3230f4a8d4 IB/hfi1: Silence txreq allocation warnings
The following warning can happen when a memory shortage
occurs during txreq allocation:

[10220.939246] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[10220.939246] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0018.C4.072020161249 07/20/2016
[10220.939247]   cache: mnt_cache, object size: 384, buffer size: 384, default order: 2, min order: 0
[10220.939260] Workqueue: hfi0_0 _hfi1_do_send [hfi1]
[10220.939261]   node 0: slabs: 1026568, objs: 43115856, free: 0
[10220.939262] Call Trace:
[10220.939262]   node 1: slabs: 820872, objs: 34476624, free: 0
[10220.939263]  dump_stack+0x5a/0x73
[10220.939265]  warn_alloc+0x103/0x190
[10220.939267]  ? wake_all_kswapds+0x54/0x8b
[10220.939268]  __alloc_pages_slowpath+0x86c/0xa2e
[10220.939270]  ? __alloc_pages_nodemask+0x2fe/0x320
[10220.939271]  __alloc_pages_nodemask+0x2fe/0x320
[10220.939273]  new_slab+0x475/0x550
[10220.939275]  ___slab_alloc+0x36c/0x520
[10220.939287]  ? hfi1_make_rc_req+0x90/0x18b0 [hfi1]
[10220.939299]  ? __get_txreq+0x54/0x160 [hfi1]
[10220.939310]  ? hfi1_make_rc_req+0x90/0x18b0 [hfi1]
[10220.939312]  __slab_alloc+0x40/0x61
[10220.939323]  ? hfi1_make_rc_req+0x90/0x18b0 [hfi1]
[10220.939325]  kmem_cache_alloc+0x181/0x1b0
[10220.939336]  hfi1_make_rc_req+0x90/0x18b0 [hfi1]
[10220.939348]  ? hfi1_verbs_send_dma+0x386/0xa10 [hfi1]
[10220.939359]  ? find_prev_entry+0xb0/0xb0 [hfi1]
[10220.939371]  hfi1_do_send+0x1d9/0x3f0 [hfi1]
[10220.939372]  process_one_work+0x171/0x380
[10220.939374]  worker_thread+0x49/0x3f0
[10220.939375]  kthread+0xf8/0x130
[10220.939377]  ? max_active_store+0x80/0x80
[10220.939378]  ? kthread_bind+0x10/0x10
[10220.939379]  ret_from_fork+0x35/0x40
[10220.939381] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)

The shortage is handled properly so the message isn't needed. Silence by
adding the no warn option to the slab allocation.

Fixes: 45842abbb2 ("staging/rdma/hfi1: move txreq header code")
Cc: <stable@vger.kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Mike Marciniszyn
cf131a8196 IB/hfi1: Avoid hardlockup with flushlist_lock
Heavy contention of the sde flushlist_lock can cause hard lockups at
extreme scale when the flushing logic is under stress.

Mitigate by replacing the item at a time copy to the local list with
an O(1) list_splice_init() and using the high priority work queue to
do the flushes.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Cc: <stable@vger.kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17 21:15:40 -04:00
Gustavo A. R. Silva
f0553dcb97 tracepoint: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct tp_probes {
	...
        struct tracepoint_func probes[0];
};

instance = kmalloc(sizeof(sizeof(struct tp_probes) +
			sizeof(struct tracepoint_func) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, probes, count) GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-17 21:13:32 -04:00
Suraj Jitindar Singh
84b028243e KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode
The hcall H_SET_DAWR is used by a guest to set the data address
watchpoint register (DAWR). This hcall is handled in the host in
kvmppc_h_set_dawr() which can be called in either real mode on the
guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S,
or in virtual mode when called from kvmppc_pseries_do_hcall() in
book3s_hv.c.

The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in
the vcpu struct accordingly and then also writes the respective values
into the DAWR and DAWRX registers directly. It is necessary to write
the registers directly here when calling the function in real mode
since the path to re-enter the guest won't do this. However when in
virtual mode the host DAWR and DAWRX values have already been
restored, and so writing the registers would overwrite these.
Additionally there is no reason to write the guest values here as
these will be read from the vcpu struct and written to the registers
appropriately the next time the vcpu is run.

This also avoids the case when handling h_set_dawr for a nested guest
where the guest hypervisor isn't able to write the DAWR and DAWRX
registers directly and must rely on the real hypervisor to do this for
it when it calls H_ENTER_NESTED.

Fixes: c1fe190c06 ("powerpc: Add force enable of DAWR on P9 option")
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-18 10:21:19 +10:00
Michael Neuling
fabb2efcf0 KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
Commit c1fe190c06 ("powerpc: Add force enable of DAWR on P9 option")
screwed up some assembler and corrupted a pointer in r3. This resulted
in crashes like the below:

  BUG: Kernel NULL pointer dereference at 0x000013bf
  Faulting instruction address: 0xc00000000010b044
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3
  NIP:  c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4
  REGS: c00000179b397710 TRAP: 0300   Not tainted  (5.2.0-rc4+)
  MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42244842  XER: 00000000
  CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0
  GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff
  GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0
  ...
  NIP kvmppc_h_set_dabr+0x50/0x68
  LR  kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv]
  Call Trace:
    0xc0000017f11c0000 (unreliable)
    kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
    kvmppc_vcpu_run+0x34/0x48 [kvm]
    kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
    kvm_vcpu_ioctl+0x460/0x850 [kvm]
    do_vfs_ioctl+0xe4/0xb40
    ksys_ioctl+0xc4/0x110
    sys_ioctl+0x28/0x80
    system_call+0x5c/0x70
  Instruction dump:
  4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff
  4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6

Fix the bug by only changing r3 when we are returning immediately.

Fixes: c1fe190c06 ("powerpc: Add force enable of DAWR on P9 option")
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reported-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-18 10:19:22 +10:00
Mika Westerberg
000dd5316e PCI: Do not poll for PME if the device is in D3cold
PME polling does not take into account that a device that is directly
connected to the host bridge may go into D3cold as well. This leads to a
situation where the PME poll thread reads from a config space of a
device that is in D3cold and gets incorrect information because the
config space is not accessible.

Here is an example from Intel Ice Lake system where two PCIe root ports
are in D3cold (I've instrumented the kernel to log the PMCSR register
contents):

  [   62.971442] pcieport 0000:00:07.1: Check PME status, PMCSR=0xffff
  [   62.971504] pcieport 0000:00:07.0: Check PME status, PMCSR=0xffff

Since 0xffff is interpreted so that PME is pending, the root ports will
be runtime resumed. This repeats over and over again essentially
blocking all runtime power management.

Prevent this from happening by checking whether the device is in D3cold
before its PME status is read.

Fixes: 71a83bd727 ("PCI/PM: add runtime PM support to PCIe port")
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: 3.6+ <stable@vger.kernel.org> # v3.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-18 01:40:41 +02:00
Mika Westerberg
c2bf1fc212 PCI: Add missing link delays required by the PCIe spec
Currently Linux does not follow PCIe spec regarding the required delays
after reset. A concrete example is a Thunderbolt add-in-card that
consists of a PCIe switch and two PCIe endpoints:

  +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
                                  +-01.0-[04-36]-- DS hotplug port
                                  +-02.0-[37]----00.0 xHCI controller
                                  \-04.0-[38-6b]-- DS hotplug port

The root port (1b.0) and the PCIe switch downstream ports are all PCIe
gen3 so they support 8GT/s link speeds.

We wait for the PCIe hierarchy to enter D3cold (runtime):

  pcieport 0000:00:1b.0: power state changed by ACPI to D3cold

When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
PCIe switch is put to reset and its power is re-applied. This means that
we must follow the rules in PCIe 4.0 section 6.6.1.

For the PCIe gen3 ports we are dealing with here, the following applies:

  With a Downstream Port that supports Link speeds greater than 5.0
  GT/s, software must wait a minimum of 100 ms after Link training
  completes before sending a Configuration Request to the device
  immediately below that Port. Software can determine when Link training
  completes by polling the Data Link Layer Link Active bit or by setting
  up an associated interrupt (see Section 6.7.3.3).

Translating this into the above topology we would need to do this (DLLLA
stands for Data Link Layer Link Active):

  pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
  pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
  pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0

I've instrumented the kernel with additional logging so we can see the
actual delays the kernel performs:

  pcieport 0000:00:1b.0: power state changed by ACPI to D0
  pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
  pcieport 0000:00:1b.0: waking up bus
  pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
  pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
  ...
  pcieport 0000:00:1b.0: PME# disabled
  pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:01:00.0: PME# disabled
  pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:00.0: PME# disabled
  pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:01.0: PME# disabled
  pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:02.0: PME# disabled
  pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:04.0: PME# disabled
  pcieport 0000:02:01.0: PME# enabled
  pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
  pcieport 0000:02:04.0: PME# enabled
  pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
  thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
  ...
  thunderbolt 0000:03:00.0: PME# disabled
  xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
  ...
  xhci_hcd 0000:37:00.0: PME# disabled

For the switch upstream port (01:00.0) we wait for 100ms but not taking
into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
transition of the root port and the two downstream hotplug ports. This
means that we deviate from what the spec requires.

Performing the same check for system sleep (s2idle) transitions we can
see following when resuming from s2idle:

  pcieport 0000:00:1b.0: power state changed by ACPI to D0
  pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
  ...
  pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  ...
  pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
  pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
  pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
  pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
  pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
  pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
  pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
  pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
  pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
  pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
  pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
  pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
  pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
  pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
  pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
  pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
  pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
  pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
  pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
  pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
  pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
  pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
  pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
  pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
  pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
  pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
  pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
  xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
  ...
  thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)

This is even worse. None of the mandatory delays are performed. If this
would be S3 instead of s2idle then according to PCI FW spec 3.2 section
4.6.8.  there is a specific _DSM that allows the OS to skip the delays
but this platform does not provide the _DSM and does not go to S3 anyway
so no firmware is involved that could already handle these delays.

In this particular Intel Coffee Lake platform these delays are not
actually needed because there is an additional delay as part of the ACPI
power resource that is used to turn on power to the hierarchy but since
that additional delay is not required by any of standards (PCIe, ACPI)
it is not present in the Intel Ice Lake, for example where missing the
mandatory delays causes pciehp to start tearing down the stack too early
(links are not yet trained).

For this reason, change the PCIe portdrv PM resume hooks so that they
perform the mandatory delays before the downstream component gets
resumed. We perform the delays before port services are resumed because
otherwise pciehp might find that the link is not up (even if it is just
training) and tears-down the hierarchy.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-18 01:40:41 +02:00
David S. Miller
f97252a8c3 Merge branch 'UDP-GSO-audit-tests'
Fred Klassen says:

====================
UDP GSO audit tests

Updates to UDP GSO selftests ot optionally stress test CMSG
subsytem, and report the reliability and performance of both
TX Timestamping and ZEROCOPY messages.
====================

Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:30:38 -07:00
Fred Klassen
4ffc37f5c0 net/udpgso_bench.sh test fails on error
Ensure that failure on any individual test results in an overall
failure of the test script.

Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:30:37 -07:00
Fred Klassen
ade90d69ff net/udpgso_bench.sh add UDP GSO audit tests
Audit tests count the total number of messages sent and compares
with total number of CMSG received on error queue. Example:

    udp gso zerocopy timestamp audit
    udp rx:   1599 MB/s  1166414 calls/s
    udp tx:   1615 MB/s    27395 calls/s  27395 msg/s
    udp rx:   1634 MB/s  1192261 calls/s
    udp tx:   1633 MB/s    27699 calls/s  27699 msg/s
    udp rx:   1633 MB/s  1191358 calls/s
    udp tx:   1631 MB/s    27678 calls/s  27678 msg/s
    Summary over 4.000 seconds...
    sum udp tx:   1665 MB/s      82772 calls (27590/s)      82772 msgs (27590/s)
    Tx Timestamps:               82772 received                 0 errors
    Zerocopy acks:               82772 received

Errors are thrown if CMSG count does not equal send count,
example:

    Summary over 4.000 seconds...
    sum tcp tx:   7451 MB/s     493706 calls (123426/s)     493706 msgs (123426/s)
    ./udpgso_bench_tx: Unexpected number of Zerocopy completions:    493706 expected    493704 received

Also reduce individual test time from 4 to 3 seconds so that
overall test time does not increase significantly.

v3: Enhancements as per Willem de Bruijn <willemb@google.com>
    - document -P option for TCP audit

Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:30:37 -07:00
Fred Klassen
79ebc3c260 net/udpgso_bench_tx: options to exercise TX CMSG
This enhancement adds options that facilitate load testing with
additional TX CMSG options, and to optionally print results of
various send CMSG operations.

These options are especially useful in isolating situations
where error-queue messages are lost when combined with other
CMSG operations (e.g. SO_ZEROCOPY).

New options:
    -a - count all CMSG messages and match to sent messages
    -T - add TX CMSG that requests TX software timestamps
    -H - similar to -T except request TX hardware timestamps
    -P - call poll() before reading error queue
    -v - print detailed results

v2: Enhancements as per Willem de Bruijn <willemb@google.com>
    - Updated control and buffer parameters for recvmsg
    - poll() parameter cleanup
    - fail on bad audit results
    - remove TOS options
    - improved reporting

v3: Enhancements as per Willem de Bruijn <willemb@google.com>
    - add SOF_TIMESTAMPING_OPT_TSONLY to eliminate MSG_TRUNC
    - general code cleanup

Signed-off-by: Fred Klassen <fklassen@appneta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:30:37 -07:00
Linus Torvalds
29f785ff76 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "MS_MOVE regression fix + breakage in fsmount(2) (also introduced in
  this cycle, along with fsmount(2) itself).

  I'm still digging through the piles of mail, so there might be more
  fixes to follow, but these two are obvious and self-contained, so
  there's no point delaying those..."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namespace: fix unprivileged mount propagation
  vfs: fsmount: add missing mntget()
2019-06-17 16:28:28 -07:00
David S. Miller
4bd366cece Merge branch 'net-ipv4-remove-erroneous-advancement-of-list-pointer'
Florian Westphal says:

====================
net: ipv4: remove erroneous advancement of list pointer

Tariq reported a soft lockup on net-next that Mellanox was able to
bisect to 2638eb8b50 ("net: ipv4: provide __rcu annotation for ifa_list").

While reviewing above patch I found a regression when addresses have a
lifetime specified.

Second patch extends rtnetlink.sh to trigger crash
(without first patch applied).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:27:43 -07:00
Florian Westphal
3cfa148826 selftests: rtnetlink: add addresses with fixed life time
This exercises kernel code path that deal with addresses that have
a limited lifetime.

Without previous fix, this triggers following crash on net-next:
 BUG: KASAN: null-ptr-deref in check_lifetime+0x403/0x670
 Read of size 8 at addr 0000000000000010 by task kworker [..]

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:27:42 -07:00
Florian Westphal
40008e9211 net: ipv4: remove erroneous advancement of list pointer
Causes crash when lifetime expires on an adress as garbage is
dereferenced soon after.

This used to look like this:

 for (ifap = &ifa->ifa_dev->ifa_list;
      *ifap != NULL; ifap = &(*ifap)->ifa_next) {
          if (*ifap == ifa) ...

but this was changed to:

struct in_ifaddr *tmp;

ifap = &ifa->ifa_dev->ifa_list;
tmp = rtnl_dereference(*ifap);
while (tmp) {
   tmp = rtnl_dereference(tmp->ifa_next); // Bogus
   if (rtnl_dereference(*ifap) == ifa) {
     ...
   ifap = &tmp->ifa_next;		// Can be NULL
   tmp = rtnl_dereference(*ifap);	// Dereference
   }
}

Remove the bogus assigment/list entry skip.

Fixes: 2638eb8b50 ("net: ipv4: provide __rcu annotation for ifa_list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:27:42 -07:00
Arnd Bergmann
78fe8a28fb net: dsa: sja1105: fix ptp link error
Due to a reversed dependency, it is possible to build
the lower ptp driver as a loadable module and the actual
driver using it as built-in, causing a link error:

drivers/net/dsa/sja1105/sja1105_spi.o: In function `sja1105_static_config_upload':
sja1105_spi.c:(.text+0x6f0): undefined reference to `sja1105_ptp_reset'
drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x2d4): undefined reference to `sja1105et_ptp_cmd'
drivers/net/dsa/sja1105/sja1105_spi.o:(.data+0x604): undefined reference to `sja1105pqrs_ptp_cmd'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_remove':
sja1105_main.c:(.text+0x8d4): undefined reference to `sja1105_ptp_clock_unregister'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_rxtstamp_work':
sja1105_main.c:(.text+0x964): undefined reference to `sja1105_tstamp_reconstruct'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_setup':
sja1105_main.c:(.text+0xb7c): undefined reference to `sja1105_ptp_clock_register'
drivers/net/dsa/sja1105/sja1105_main.o: In function `sja1105_port_deferred_xmit':
sja1105_main.c:(.text+0x1fa0): undefined reference to `sja1105_ptpegr_ts_poll'
sja1105_main.c:(.text+0x1fc4): undefined reference to `sja1105_tstamp_reconstruct'
drivers/net/dsa/sja1105/sja1105_main.o:(.rodata+0x5b0): undefined reference to `sja1105_get_ts_info'

Change the Makefile logic to always build the ptp module
the same way as the rest. Another option would be to
just add it to the same module and remove the exports,
but I don't know if there was a good reason to keep them
separate.

Fixes: bb77f36ac2 ("net: dsa: sja1105: Add support for the PTP clock")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:25:29 -07:00
Arnd Bergmann
c63d1e5c2d net: stmmac: fix unused-variable warning
When building without CONFIG_OF, we get a harmless build warning:

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_phy_setup':
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:973:22: error: unused variable 'node' [-Werror=unused-variable]
  struct device_node *node = priv->plat->phy_node;

Reword it so we always use the local variable, by making it the
fwnode pointer instead of the device_node.

Fixes: 74371272f9 ("net: stmmac: Convert to phylink and remove phylib logic")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:24:12 -07:00
Linus Torvalds
da0f382029 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...
2019-06-17 15:55:34 -07:00
Mitch Williams
efa14c3985 iavf: allow null RX descriptors
In some circumstances, the hardware can hand us a null receive
descriptor, with no data attached but otherwise valid. Unfortunately,
the driver was ill-equipped to handle such an event, and would stop
processing packets at that point.

To fix this, use the Descriptor Done bit instead of the size to
determine whether or not a descriptor is ready to be processed. Add some
checks to allow for unused buffers.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Paul Greenwalt
68dfe6348f iavf: add call to iavf_[add|del]_cloud_filter
Add call to iavf_add_cloud_filter and iavf_del_cloud_filter from
iavf_process_aq_command to clear aq_required
IAVF_FLAG_AQ_ADD_CLOUD_FILTER and IAVF_FLAG_AQ_DEL_CLOUD_FILTER bits.

aq_required IAVF_FLAG_AQ_DEL_CLOUD_FILTER bit is being set in
iavf_down and iavf_delete_clsflower, and are never cleared.

aq_required IAVF_FLAG_AQ_ADD_CLOUD_FILTER bit is being set in
iavf_handle_reset and iavf_configure_clsflower, and are never
cleared.

Since the aq_required is not zero, iavf_watchdog_task is setting the
queue_delayed_work to 20 msec instead of the longer delay.

Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Jakub Pawlak
b66c7bc1cd iavf: Refactor init state machine
Cleanup of init state machine, move state specific
code to separate functions and rewrite the
iavf_init_task() function.

Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Jan Sokolowski
bac8486116 iavf: Refactor the watchdog state machine
Refactor the watchdog state machine implementation.
Add the additional state __IAVF_COMM_FAILED to process
the PF communication fails. Prepare the watchdog state machine
to integrate with init state machine.

Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Jakub Pawlak
fdd4044ffd iavf: Remove timer for work triggering, use delaying work instead
Remove the watchdog timer, instead declare watchdog task
as delayed work and use dedicated workqueue to service driver
tasks. The dedicated driver workqueue iavf_wq is common
for all driver instances.

Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Jakub Pawlak
b476b0030e iavf: Move commands processing to the separate function
Move the commands processing outside the watchdog_task()
function. This reduce length and complexity of the function
which is mainly designed to process the watchdog state machine.

Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Avinash Dayanand
16e00c25ac iavf: Fix the math for valid length for ADq enable
There was a calculation error in virtchnl regarding the valid
length which was fixed recently and a corresponding change needs
to go into the code while we enable ADq.

Signed-off-by: Avinash Dayanand <avinash.dayanand@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:26 -07:00
Aleksandr Loktionov
f0a48fb441 iavf: Change GFP_KERNEL to GFP_ATOMIC in kzalloc()
iavf_add_vlan() is being called in atomic context
so kzalloc() needs GFP_ATOMIC. This patch fixes it.

Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:25 -07:00
Mitch Williams
88ec7308ea iavf: wait longer for close to complete
On some hardware/driver/architecture combinations, it may take longer
than 200msec for all close operations to be completed, causing a
spurious error message to be logged.

Increase the timeout value to 500msec to avoid this erroneous error.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:25 -07:00
Mitch Williams
168d91cf2a iavf: use signed variable
The counter variable in iavf_clean_tx_irq starts out negative and climbs
to 0. So allocating it as u16 is actually a really bad idea that just
happens to work because the value underflows and overflows consistently
on most architectures.

Replace the u16 with an int so signed math works as expected.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:25 -07:00
Akeem G Abodunrin
c2417a7b0e iavf: Create VLAN tag elements starting from the first element
This patch changes how VLAN tag are being populated and programmed into
the HW - Instead of start adding VF VLAN tag from the last member of the
element list, start from the first member of the list, until number of
allowed VLAN tags is exhausted in the HW.

Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-17 15:39:25 -07:00
Daniel T. Lee
4d18f6de6a samples: bpf: refactor header include path
Currently, header inclusion in each file is inconsistent.
For example, "libbpf.h" header is included as multiple ways.

    #include "bpf/libbpf.h"
    #include "libbpf.h"

Due to commit b552d33c80 ("samples/bpf: fix include path
in Makefile"), $(srctree)/tools/lib/bpf/ path had been included
during build, path "bpf/" in header isn't necessary anymore.

This commit removes path "bpf/" in header inclusion.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:28:36 +02:00
Daniel T. Lee
fa206dccd8 samples: bpf: remove unnecessary include options in Makefile
Due to recent change of include path at commit b552d33c80
("samples/bpf: fix include path in Makefile"), some of the
previous include options became unnecessary.

This commit removes duplicated include options in Makefile.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:28:36 +02:00
Daniel Borkmann
32b88d3743 Merge branch 'bpf-libbpf-btf-defined-maps'
Andrii Nakryiko says:

====================
This patch set implements initial version (as discussed at LSF/MM2019
conference) of a new way to specify BPF maps, relying on BTF type information,
which allows for easy extensibility, preserving forward and backward
compatibility. See details and examples in description for patch #6.

[0] contains an outline of follow up extensions to be added after this basic
set of features lands. They are useful by itself, but also allows to bring
libbpf to feature-parity with iproute2 BPF loader. That should open a path
forward for BPF loaders unification.

Patch #1 centralizes commonly used min/max macro in libbpf_internal.h.
Patch #2 extracts .BTF and .BTF.ext loading loging from elf_collect().
Patch #3 simplifies elf_collect() error-handling logic.
Patch #4 refactors map initialization logic into user-provided maps and global
data maps, in preparation to adding another way (BTF-defined maps).
Patch #5 adds support for map definitions in multiple ELF sections and
deprecates bpf_object__find_map_by_offset() API which doesn't appear to be
used anymore and makes assumption that all map definitions reside in single
ELF section.
Patch #6 splits BTF intialization from sanitization/loading into kernel to
preserve original BTF at the time of map initialization.
Patch #7 adds support for BTF-defined maps.
Patch #8 adds new test for BTF-defined map definition.
Patches #9-11 convert test BPF map definitions to use BTF way.

[0] https://lore.kernel.org/bpf/CAEf4BzbfdG2ub7gCi0OYqBrUoChVHWsmOntWAkJt47=FE+km+A@mail.gmail.com/

v1->v2:
- more BTF-sanity checks in parsing map definitions (Song);
- removed confusing usage of "attribute", switched to "field;
- split off elf_collect() refactor from btf loading refactor (Song);
- split selftests conversion into 3 patches (Stanislav):
  1. test already relying on BTF;
  2. tests w/ custom types as key/value (so benefiting from BTF);
  3. all the rest tests (integers as key/value, special maps w/o BTF support).
- smaller code improvements (Song);

rfc->v1:
- error out on unknown field by default (Stanislav, Jakub, Lorenz);
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:11:40 +02:00
Andrii Nakryiko
df0b779259 selftests/bpf: convert tests w/ custom values to BTF-defined maps
Convert a bulk of selftests that have maps with custom (not integer) key
and/or value.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:43 +02:00
Andrii Nakryiko
f654407481 selftests/bpf: switch BPF_ANNOTATE_KV_PAIR tests to BTF-defined maps
Switch tests that already rely on BTF to BTF-defined map definitions.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:43 +02:00
Andrii Nakryiko
9e3d709c47 selftests/bpf: add test for BTF-defined maps
Add file test for BTF-defined map definition.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:42 +02:00
Andrii Nakryiko
abd29c9314 libbpf: allow specifying map definitions using BTF
This patch adds support for a new way to define BPF maps. It relies on
BTF to describe mandatory and optional attributes of a map, as well as
captures type information of key and value naturally. This eliminates
the need for BPF_ANNOTATE_KV_PAIR hack and ensures key/value sizes are
always in sync with the key/value type.

Relying on BTF, this approach allows for both forward and backward
compatibility w.r.t. extending supported map definition features. By
default, any unrecognized attributes are treated as an error, but it's
possible relax this using MAPS_RELAX_COMPAT flag. New attributes, added
in the future will need to be optional.

The outline of the new map definition (short, BTF-defined maps) is as follows:
1. All the maps should be defined in .maps ELF section. It's possible to
   have both "legacy" map definitions in `maps` sections and BTF-defined
   maps in .maps sections. Everything will still work transparently.
2. The map declaration and initialization is done through
   a global/static variable of a struct type with few mandatory and
   extra optional fields:
   - type field is mandatory and specified type of BPF map;
   - key/value fields are mandatory and capture key/value type/size information;
   - max_entries attribute is optional; if max_entries is not specified or
     initialized, it has to be provided in runtime through libbpf API
     before loading bpf_object;
   - map_flags is optional and if not defined, will be assumed to be 0.
3. Key/value fields should be **a pointer** to a type describing
   key/value. The pointee type is assumed (and will be recorded as such
   and used for size determination) to be a type describing key/value of
   the map. This is done to save excessive amounts of space allocated in
   corresponding ELF sections for key/value of big size.
4. As some maps disallow having BTF type ID associated with key/value,
   it's possible to specify key/value size explicitly without
   associating BTF type ID with it. Use key_size and value_size fields
   to do that (see example below).

Here's an example of simple ARRAY map defintion:

struct my_value { int x, y, z; };

struct {
	int type;
	int max_entries;
	int *key;
	struct my_value *value;
} btf_map SEC(".maps") = {
	.type = BPF_MAP_TYPE_ARRAY,
	.max_entries = 16,
};

This will define BPF ARRAY map 'btf_map' with 16 elements. The key will
be of type int and thus key size will be 4 bytes. The value is struct
my_value of size 12 bytes. This map can be used from C code exactly the
same as with existing maps defined through struct bpf_map_def.

Here's an example of STACKMAP definition (which currently disallows BTF type
IDs for key/value):

struct {
	__u32 type;
	__u32 max_entries;
	__u32 map_flags;
	__u32 key_size;
	__u32 value_size;
} stackmap SEC(".maps") = {
	.type = BPF_MAP_TYPE_STACK_TRACE,
	.max_entries = 128,
	.map_flags = BPF_F_STACK_BUILD_ID,
	.key_size = sizeof(__u32),
	.value_size = PERF_MAX_STACK_DEPTH * sizeof(struct bpf_stack_build_id),
};

This approach is naturally extended to support map-in-map, by making a value
field to be another struct that describes inner map. This feature is not
implemented yet. It's also possible to incrementally add features like pinning
with full backwards and forward compatibility. Support for static
initialization of BPF_MAP_TYPE_PROG_ARRAY using pointers to BPF programs
is also on the roadmap.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:41 +02:00
Andrii Nakryiko
063183bf04 libbpf: split initialization and loading of BTF
Libbpf does sanitization of BTF before loading it into kernel, if kernel
doesn't support some of newer BTF features. This removes some of the
important information from BTF (e.g., DATASEC and VAR description),
which will be used for map construction. This patch splits BTF
processing into initialization step, in which BTF is initialized from
ELF and all the original data is still preserved; and
sanitization/loading step, which ensures that BTF is safe to load into
kernel. This allows to use full BTF information to construct maps, while
still loading valid BTF into older kernels.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:41 +02:00
Andrii Nakryiko
db48814bd2 libbpf: identify maps by section index in addition to offset
To support maps to be defined in multiple sections, it's important to
identify map not just by offset within its section, but section index as
well. This patch adds tracking of section index.

For global data, we record section index of corresponding
.data/.bss/.rodata ELF section for uniformity, and thus don't need
a special value of offset for those maps.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:40 +02:00
Andrii Nakryiko
bf82927125 libbpf: refactor map initialization
User and global data maps initialization has gotten pretty complicated
and unnecessarily convoluted. This patch splits out the logic for global
data map and user-defined map initialization. It also removes the
restriction of pre-calculating how many maps will be initialized,
instead allowing to keep adding new maps as they are discovered, which
will be used later for BTF-defined map definitions.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:39 +02:00
Andrii Nakryiko
01b29d1dc9 libbpf: streamline ELF parsing error-handling
Simplify ELF parsing logic by exiting early, as there is no common clean
up path to execute. That makes it unnecessary to track when err was set
and when it was cleared. It also reduces nesting in some places.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-18 00:10:39 +02:00