resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. When reading a bandwidth
counter, mbm_overflow_count() must be used to correct for any possible
overflow.
mbm_overflow_count() is architecture specific, its behaviour should
be part of resctrl_arch_rmid_read().
Move the mbm_overflow_count() calls into resctrl_arch_rmid_read().
This allows the resctrl filesystems's prev_msr to be removed in
favour of the architecture private version.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-18-james.morse@arm.com
resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a hardware register. Currently the function
returns the MBM values in chunks directly from hardware.
To convert this to bytes, some correction and overflow calculations
are needed. These depend on the resource and domain structures.
Overflow detection requires the old chunks value. None of this
is available to resctrl_arch_rmid_read(). MPAM requires the
resource and domain structures to find the MMIO device that holds
the registers.
Pass the resource and domain to resctrl_arch_rmid_read(). This makes
rmid_dirty() too big. Instead merge it with its only caller, and the
name is kept as a local variable.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-17-james.morse@arm.com
__rmid_read() selects the specified eventid and returns the counter
value from the MSR. The error handling is architecture specific, and
handled by the callers, rdtgroup_mondata_show() and __mon_event_count().
Error handling should be handled by architecture specific code, as
a different architecture may have different requirements. MPAM's
counters can report that they are 'not ready', requiring a second
read after a short delay. This should be hidden from resctrl.
Make __rmid_read() the architecture specific function for reading
a counter. Rename it resctrl_arch_rmid_read() and move the error
handling into it.
A read from a counter that hardware supports but resctrl does not
now returns -EINVAL instead of -EIO from the default case in
__mon_event_count(). It isn't possible for user-space to see this
change as resctrl doesn't expose counters it doesn't support.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-16-james.morse@arm.com
DMA allocated buffers are a precious resource. If there is no need for
DMA allocations, then it might be worth to use non-dma allocated
buffers.
After testing the gs_usb driver with and without DMA allocation, there
does not seem to be a significant change in latency or CPU utilization
either way. Therefore, DMA allocation is not necessary and removed.
Internal buffers used within urbs were managed and freed manually.
These buffers are no longer needed to be managed by the driver. The
URB_FREE_BUFFER flag, allows for the buffers in question to be
automatically freed.
Co-developed-by: Rhett Aultman <rhett.aultman@samsara.com>
Signed-off-by: Rhett Aultman <rhett.aultman@samsara.com>
Signed-off-by: Vasanth Sadhasivan <vasanth.sadhasivan@samsara.com>
Link: https://lore.kernel.org/all/20220920154724.861093-2-rhett.aultman@samsara.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Introduce a label to free the allocated candev in case of an error and
make use of if. Fix a memory leak if the extended bit timing cannot be
read. Extend the error messages to print the number of the failing
channel and the symbolic error name.
Link: https://lore.kernel.org/all/20220921193902.575416-4-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Convert the driver to use usb_control_msg_{send,recv}() instead of
usb_control_msg(). These functions allow the data to be placed on the
stack. This makes the driver a lot easier as we don't have to deal
with dynamically allocated memory.
Link: https://lore.kernel.org/all/20220921193902.575416-3-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
On busy networks the CAN controller might receive CAN frames directly
after starting it but before the timecounter is setup. This will lead
to NULL pointer deref while converting the converting the CAN frame's
timestamp with the timecounter.
Close the race window by setting up the timecounter before starting
the CAN controller.
Fixes: 45dfa45f52 ("can: gs_usb: add RX and TX hardware timestamp support")
Link: https://lore.kernel.org/all/20220921081329.385509-1-mkl@pengutronix.de
Cc: John Whittington <git@jbrengineering.co.uk
Tested-by: John Whittington <git@jbrengineering.co.uk>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The struct timecounter::cycle_last is a 64 bit variable, read by
timecounter_cyc2time(), and written by timecounter_read(). On 32 bit
architectures this is not atomic.
Add a spinlock to protect access to struct timecounter::cycle_last. In
the gs_usb_timestamp_read() callback the lock is dropped to execute a
sleeping synchronous USB transfer. This is safe, as the variable we
want to protect is accessed during this call.
Fixes: 45dfa45f52 ("can: gs_usb: add RX and TX hardware timestamp support")
Link: https://lore.kernel.org/all/20220920100416.959226-3-mkl@pengutronix.de
Cc: John Whittington <git@jbrengineering.co.uk>
Tested-by: John Whittington <git@jbrengineering.co.uk>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The 2nd argument of usb_control_msg_recv() is the "endpoint",
usb_control_msg_recv() will internally convert the endpoint into a
pipe with usb_rcvctrlpipe().
In gs_usb_get_timestamp() not the endpoint "0" is passed, but the
pipe. This worked by accident as endpoint is a __u8 and the lowest 8
bits of the pipe are 0. Fix this copy/paste error by using the correct
endpoint of "0".
Fixes: 45dfa45f52 ("can: gs_usb: add RX and TX hardware timestamp support")
Link: https://lore.kernel.org/all/20220920100416.959226-2-mkl@pengutronix.de
Cc: John Whittington <git@jbrengineering.co.uk>
Tested-by: John Whittington <git@jbrengineering.co.uk>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Now, register_netdevice_notifier() and register_pernet_subsys() are both
after can_proto_register(). It can create CAN_BCM socket and process socket
once can_proto_register() successfully, so it is possible missing notifier
event or proc node creation because notifier or bcm proc directory is not
registered or created yet. Although this is a low probability scenario, it
is not impossible.
Move register_pernet_subsys() and register_netdevice_notifier() to the
front of can_proto_register(). In addition, register_pernet_subsys() and
register_netdevice_notifier() may fail, check their results are necessary.
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/all/823cff0ebec33fa9389eeaf8b8ded3217c32cb38.1663206163.git.william.xuanziyang@huawei.com
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
In preparation for reducing the use of ksize(), record the actual
allocation size for later memcpy(). This avoids copying extra
(uninitialized!) bytes into the patch buffer when the requested
allocation size isn't exactly the size of a kmalloc bucket.
Additionally, fix potential future issues where runtime bounds checking
will notice that the buffer was allocated to a smaller value than
returned by ksize().
Fixes: 757885e94a ("x86, microcode, amd: Early microcode patch loading support for AMD")
Suggested-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/lkml/CA+DvKQ+bp7Y7gmaVhacjv9uF6Ar-o4tet872h4Q8RPYPJjcJQA@mail.gmail.com/
When GEM is in SGMII mode and disabled as a wakeup source, the power
management controller can power down the entire full power domain(FPD)
if none of the FPD devices are in use.
Incase of FPD off, there are below ethernet link up issues on non-wakeup
suspend/resume. To fix it add phy_exit() in suspend and phy_init() in the
resume path which reinitializes PS GTR SGMII lanes.
$ echo +20 > /sys/class/rtc/rtc0/wakealarm
$ echo mem > /sys/power/state
After resume:
$ ifconfig eth0 up
xilinx-psgtr fd400000.phy: lane 0 (type 10, protocol 5): PLL lock timeout
phy phy-fd400000.phy.0: phy poweron failed --> -110
xilinx-psgtr fd400000.phy: lane 0 (type 10, protocol 5): PLL lock timeout
SIOCSIFFLAGS: Connection timed out
phy phy-fd400000.phy.0: phy poweron failed --> -110
Fixes: 8b73fa3ae0 ("net: macb: Added ZynqMP-specific initialization")
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur says:
====================
net: lan966x: Add mqprio and taprio support
Add support for offloading QoS features with tc command to lan966x. The
offloaded QoS features are mqprio and taprio.
v1->v2:
- fix compilation warning
- rename lan966x_taprio_enable/disable to lan966x_taprio_add/del
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lan966x switch supports time-based egress shaping in hardware
according to IEEE 802.1Qbv. Add support for TAS configuration on
egress port of lan966x switch.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement mqprio qdisc support using tc command.
The HW supports 8 priority queues from highest (7) to lowest (0).
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a define for the number of priority queues on lan966x. Because there
will be more checks for this, so instead of using hardcoded value all
over the place add a define for this.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ->ring_idx_mask variable is a u64 so static checkers, Smatch in
this case, complain if the BIT() is not also a u64.
drivers/gpu/drm/virtio/virtgpu_ioctl.c:50 virtio_gpu_fence_event_create()
warn: should '(1 << ring_idx)' be a 64 bit type?
Fixes: cd7f5ca335 ("drm/virtio: implement context init: add virtio_gpu_fence_event")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chia-I Wu <olvaffe@gmail.com>
Link: http://patchwork.freedesktop.org/patch/msgid/YygN7jY0GdUSQSy0@kili
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
This change adds support for not enabling carrier during TUNSETIFF
interface creation by specifying the IFF_NO_CARRIER flag.
Our tests make heavy use of tun interfaces. In some scenarios, the test
process creates the interface but another process brings it up after the
interface is discovered via netlink notification. In that case, it is
not possible to create a tun/tap interface with carrier off without it
racing against the bring up. Immediately setting carrier off via
TUNSETCARRIER is still too late.
Signed-off-by: Patrick Rohr <prohr@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Read link status from SGMII PCS for in-band managed 2500Base-X and
1000Base-X connection on a MAC port of the MT7531. This is needed to
get the SFP cage working which is connected to SGMII interface of
port 5 of the MT7531 switch IC on the Bananapi BPi-R3 board.
While at it also handle an_complete for both the autoneg and the
non-autoneg codepath.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson says:
====================
net: phy: Add support for rate matching
This adds support for phy rate matching: when a phy adapts between
differing phy interface and link speeds. It was originally submitted as
part of [1], which is considered "v1" of this series.
Several past discussions [2-4] around adding rate adaptation provide
some context.
Although in earlier versions of this series, userspace could disable
rate matching, now it is only possible to determine the current rate
adaptation type. Disabling or otherwise configuring rate adaptation has
been left for future work. However, because currently only
RATE_MATCH_PAUSE is implemented, it is possible to disable rate
adaptation by modifying the advertisement appropriately.
[1] https://lore.kernel.org/netdev/20220715215954.1449214-1-sean.anderson@seco.com/T/#t
[2] https://lore.kernel.org/netdev/1579701573-6609-1-git-send-email-madalin.bucur@oss.nxp.com/
[3] https://lore.kernel.org/netdev/1580137671-22081-1-git-send-email-madalin.bucur@oss.nxp.com/
[4] https://lore.kernel.org/netdev/20200116181933.32765-1-olteanv@gmail.com/
Changes in v6:
- Don't announce that we've enabled pause frames for rate adaptation
- Merry Christmas
- Rename rate adaptation to rate matching
- Reword documentation, (hopefully) taking into account feedback
Changes in v5:
- Break off patch "net: phy: Add 1000BASE-KX interface mode" for
separate submission.
- Document phy_rate_adaptation_to_str
- Drop patch "Add some helpers for working with mac caps"; it has been
incorperated into the autonegotiation patch.
- Move phylink_cap_from_speed_duplex to this commit
- Rebase onto net-next/master
- Remove unnecessary comma
Changes in v4:
- Export phy_rate_adaptation_to_str
- Remove phylink_interface_max_speed, which was accidentally added
- Split off the LS1046ARDB 1G fix
Changes in v3:
- Add phylink_cap_from_speed_duplex to look up the mac capability
corresponding to the interface's speed.
- Document MAC_(A)SYM_PAUSE
- Include RATE_ADAPT_CRS; it's a few lines and it doesn't hurt.
- Modify link settings directly in phylink_link_up, instead of doing
things more indirectly via link_*.
- Move unused defines to next commit (where they will be used)
- Remove "Support differing link/interface speed/duplex". It has been
rendered unnecessary due to simplification of the rate adaptation
patches. Thanks Russell!
- Rewrite cover letter to better reflect the opinions of the developers
involved
Changes in v2:
- Add (read-only) ethtool support for rate adaptation
- Add comments clarifying the register defines
- Add locking to phy_get_rate_adaptation
- Always use the rate adaptation setting to determine the interface
speed/duplex (instead of sometimes using the interface mode).
- Determine the interface speed and max mac speed directly instead of
guessing based on the caps.
- Move part of commit message to cover letter, as it gives a good
overview of the whole series, and allows this patch to focus more on
the specifics.
- Reorder variables in aqr107_read_rate
- Use int/defines instead of enum to allow for use in ioctls/netlink
- Use the phy's rate adaptation setting to determine whether to use its
link speed/duplex or the MAC's speed/duplex with MLO_AN_INBAND.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for rate matching for phys similar to the AQR107. We
assume that all phys using aqr107_read_status support rate matching.
However, it could be possible to determine support based on the firmware
revision if there are phys discovered which do not support rate
matching. However, as rate matching is advertised in the datasheets for
these phys, I suspect it is supported most boards.
Despite the name, the "config" registers are updated with the current
rate matching method (if any). Because they appear to be updated
automatically, I don't know if these registers can be used to disable
rate matching.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we are using CPU_PM_CPU_IDLE_ENTER_PARAM() for all SBI HSM
suspend types so retentive suspend types are also treated non-retentive
and kernel will do redundant additional work for these states.
The BIT[31] of SBI HSM suspend types allows us to differentiate between
retentive and non-retentive suspend types so we should use this BIT
to call appropriate CPU_PM_CPU_IDLE_ENTER_xyz() macro.
Fixes: 6abf32f1d9 ("cpuidle: Add RISC-V SBI CPU idle driver")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20220718084553.2056169-1-apatel@ventanamicro.com/
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
These are documented in the AQR115 register reference. I haven't tested
them, but perhaps they'll be useful to someone.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for adjusting the advertisement for pause-based rate
matching. This may result in a lossy link, since the final link settings
are not adjusted. Asymmetric pause support is necessary. It would be
possible for a MAC supporting only symmetric pause to use pause-based rate
adaptation, but only if pause reception was enabled as well.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the phy is configured to use pause-based rate matching, ensure that
the link is full duplex with pause frame reception enabled. As
suggested, if pause-based rate matching is enabled by the phy, then
pause reception is unconditionally enabled.
The interface duplex is determined based on the rate matching type.
When rate matching is enabled, so is the speed. We assume the maximum
interface speed is used. This is only relevant for MLO_AN_PHY. For
MLO_AN_INBAND, the MAC/PCS's view of the interface speed will be used.
Although there are no RATE_ADAPT_CRS phys in-tree, it has been added for
comparison (and the implementation is quite simple).
Co-developed-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for rate matching (also known as rate adaptation) to
the phy subsystem. The general idea is that the phy interface runs at
one speed, and the MAC throttles the rate at which it sends packets to
the link speed. There's a good overview of several techniques for
achieving this at [1]. This patch adds support for three: pause-frame
based (such as in Aquantia phys), CRS-based (such as in 10PASS-TS and
2BASE-TL), and open-loop-based (such as in 10GBASE-W).
This patch makes a few assumptions and a few non assumptions about the
types of rate matching available. First, it assumes that different phys
may use different forms of rate matching. Second, it assumes that phys
can use rate matching for any of their supported link speeds (e.g. if a
phy supports 10BASE-T and XGMII, then it can adapt XGMII to 10BASE-T).
Third, it does not assume that all interface modes will use the same
form of rate matching. Fourth, it does not assume that all phy devices
will support rate matching (even if some do). Relaxing or strengthening
these (non-)assumptions could result in a different API. For example, if
all interface modes were assumed to use the same form of rate matching,
then a bitmask of interface modes supportting rate matching would
suffice.
For some better visibility into the process, the current rate matching
mode is exposed as part of the ethtool ksettings. For the moment, only
read access is supported. I'm not sure what userspace might want to
configure yet (disable it altogether, disable just one mode, specify the
mode to use, etc.). For the moment, since only pause-based rate
adaptation support is added in the next few commits, rate matching can
be disabled altogether by adjusting the advertisement.
802.3 calls this feature "rate adaptation" in clause 49 (10GBASE-R) and
"rate matching" in clause 61 (10PASS-TL and 2BASE-TS). Aquantia also calls
this feature "rate adaptation". I chose "rate matching" because it is
shorter, and because Russell doesn't think "adaptation" is correct in this
context.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we call phylink_caps_to_linkmodes directly from
phylink_get_linkmodes, it is difficult to re-use this functionality in
MAC drivers. This is because MAC drivers must then work with an ethtool
linkmode bitmap, instead of with mac capabilities. Instead, let the
caller of phylink_get_linkmodes do the conversion. To reflect this
change, rename the function to phylink_get_capabilities.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is convenient for MAC drivers. They can use it to add or
remove particular link modes based on capabilities (such as if half
duplex is not supported for a particular interface mode).
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This documents the possible MLO_PAUSE_* settings which can result from
different combinations of MAC_(A)SYM_PAUSE. Special note is paid to
settings which can result from user configuration (MLO_PAUSE_AN). The
autonegotiation results are more-or-less a direct consequence of IEEE
802.3 Table 28B-2.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kmalloc's API family is critical for mm, with one nature that it will
round up the request size to a fixed one (mostly power of 2). Say
when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so in worst case, there is around 50% memory
space waste.
The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems with objects that have longer life
time.
We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:
[ 26.062145] kmalloc-2k 814056KB 814056KB
From debug we found there are huge number of 'struct iova_magazine',
whose size is 1032 bytes (1024 + 8), so each allocation will waste
1016 bytes. Though the issue was solved by giving the right (bigger)
size of RAM, it is still nice to optimize the size (either use a
kmalloc friendly size or create a dedicated slab for it).
And from lkml archive, there was another crash kernel OOM case [1]
back in 2019, which seems to be related with the similar slab waste
situation, as the log is similar:
[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB
The crash kernel only has 256M memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernel [2], these logs are just picked to demo the problem, also
a patch changing its size to 1024 bytes has been merged)
So add an way to track each kmalloc's memory waste info, and
leverage the existing SLUB debug framework (specifically
SLUB_STORE_USER) to show its call stack of original allocation,
so that user can evaluate the waste situation, identify some hot
spots and optimize accordingly, for a better utilization of memory.
The waste info is integrated into existing interface:
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
'kmalloc-4k' after boot is:
126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
__kmem_cache_alloc_node+0x11f/0x4e0
__kmalloc_node+0x4e/0x140
ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
ixgbe_probe+0x165f/0x1d20 [ixgbe]
local_pci_probe+0x78/0xc0
work_for_cpu_fn+0x26/0x40
...
which means in 'kmalloc-4k' slab, there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each
and 233856 bytes in total), from ixgbe_alloc_q_vector().
And when system starts some real workload like multiple docker
instances, there could are more severe waste.
[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Since the most that's mixed into the pool is sizeof(long)*2, don't
credit more than that many bytes of entropy.
Fixes: e3e33fc2ea ("random: do not use input pool from hard IRQs")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
If a hwrng source does not provide an entropy estimate, it currently
does not contribute at all to the CRNG. In order to help fix this, in
case add_hwgenerator_randomness() is called with the entropy parameter
set to zero, go to sleep until one reseed interval has passed.
While the hwrng thread currently only runs under conditions where this
is non-zero, this change is not harmful and prepares for future updates
to the hwrng core.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Mix in randomness from hw-rng sources more frequently during early
boot, approximately once for every rng reseed.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Prior to 5.6, when /dev/random was opened with O_NONBLOCK, it would
return -EAGAIN if there was no entropy. When the pools were unified in
5.6, this was lost. The post 5.6 behavior of blocking until the pool is
initialized, and ignoring O_NONBLOCK in the process, went unnoticed,
with no reports about the regression received for two and a half years.
However, eventually this indeed did break somebody's userspace.
So we restore the old behavior, by returning -EAGAIN if the pool is not
initialized. Unlike the old /dev/random, this can only occur during
early boot, after which it never blocks again.
In order to make this O_NONBLOCK behavior consistent with other
expectations, also respect users reading with preadv2(RWF_NOWAIT) and
similar.
Fixes: 30c08efec8 ("random: make /dev/random be almost like /dev/urandom")
Reported-by: Guozihua <guozihua@huawei.com>
Reported-by: Zhongguohua <zhongguohua1@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>