drm_mode_config_cleanup is idempotent, so no harm in calling this
twice. This allows us to gradually switch drivers over by removing
explicit drm_mode_config_cleanup calls.
With this step it's now also possible that (at least for simple
drivers) automatic resource cleanup can be done correctly without a
drm_driver->release hook. Therefore allow this now in
devm_drm_dev_init().
Also with drmm_ explicit drm_driver->release hooks are kinda not the
best option: Drivers can always just register their current release
hook with drmm_add_action, but even better they could split them up to
simplify the unwinding for the driver load failure case. So deprecate
that hook to discourage future users.
v2: Fixup the example in the kerneldoc too.
v3:
- For paranoia, double check that minor->dev == dev in the release
hook, because I botched the pointer math in the drmm library.
- Call drm_mode_config_cleanup when drmm_add_action fails, we'd be
missing some mutex_destroy and ida_cleanup otherwise (Laurent)
v4: Add a drmm_add_action_or_reset (like devm_ has) to encapsulate this
pattern (Noralf).
v5: Fix oversight in the new drmm_add_action_or_reset macro (Noralf)
v4: Review from Sam:
- drmm_mode_config_init wrapper (also suggested by Thomas)
- improve commit message, explain better why ->relase is deprecated
v5:
- Make drmm_ the main function, with the old one as compat wrapper
(Sam)
- Add FIXME comments to drm_mode_config_cleanup/init() that drivers
shouldn't use these anymore.
- Move drmm_add_action_or_reset helper to an earlier patch.
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: "Noralf Trønnes" <noralf@tronnes.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Noralf Trønnes <noralf@tronnes.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-27-daniel.vetter@ffwll.ch
Nothing special here, except that this is the first time that we
automatically clean up something that's initialized with an explicit
driver call. But the cleanup was done at the very end of the release
sequence for all drivers, and that's still the case. At least without
more uses of drmm_ through explicit driver calls.
Also for this one we need drmm_kcalloc, so lets add those.
The motivation here is to allow us to remove the explicit calls to
drm_dev_fini() from all drivers.
v2: Sort includes (Laurent)
v3: Motivate the change in the commit message better (Sam)
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-25-daniel.vetter@ffwll.ch
The cleanup here is somewhat tricky, since we can't tell apart the
allocated minor index from 0. So register a cleanup action first, and
if the index allocation fails, unregister that cleanup action again to
avoid bad mistakes.
The kdev for the minor already handles NULL, so no problem there.
Hence add drmm_remove_action() to the drm_managed library.
v2: Make pointer math around void ** consistent with what Laurent
suggested.
v3: Use drmm_add_action_or_reset and remove drmm_remove_action. Noticed
because of some questions from Thomas. This also means we need to move
the drmm_add_action_or_reset helper earlier in the series.
v4: Uh ... fix slightly embarrassing bug CI spotted.
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200324203936.3330994-1-daniel.vetter@ffwll.ch
There is one one corner case at dma_fence_signal_locked
which will raise the NULL pointer problem just like below.
->dma_fence_signal
->dma_fence_signal_locked
->test_and_set_bit
here trigger dma_fence_release happen due to the zero of fence refcount.
->dma_fence_put
->dma_fence_release
->drm_sched_fence_release_scheduled
->call_rcu
here make the union fled “cb_list” at finished fence
to NULL because struct rcu_head contains two pointer
which is same as struct list_head cb_list
Therefore, to hold the reference of finished fence at drm_sched_process_job
to prevent the null pointer during finished fence dma_fence_signal
[ 732.912867] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 732.914815] #PF: supervisor write access in kernel mode
[ 732.915731] #PF: error_code(0x0002) - not-present page
[ 732.916621] PGD 0 P4D 0
[ 732.917072] Oops: 0002 [#1] SMP PTI
[ 732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE 5.4.0-rc7 #1
[ 732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
[ 732.938569] Call Trace:
[ 732.939003] <IRQ>
[ 732.939364] dma_fence_signal+0x29/0x50
[ 732.940036] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[ 732.940996] drm_sched_process_job+0x34/0xa0 [gpu_sched]
[ 732.941910] dma_fence_signal_locked+0x85/0x100
[ 732.942692] dma_fence_signal+0x29/0x50
[ 732.943457] amdgpu_fence_process+0x99/0x120 [amdgpu]
[ 732.944393] sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
v2: hold the finished fence at drm_sched_process_job instead of
amdgpu_fence_process
v3: resume the blank line
Signed-off-by: Yintian Tao <yttao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
With this we can drop the final kfree from the release function.
I also noticed that the unwind code is wrong, after drm_dev_init the
drm_device owns the v3d allocation, so the kfree(v3d) is a double-free.
Reorder the setup to fix this issue.
After a bit more prep in drivers and drm core v3d should be able to
switch over to devm_drm_dev_init, which should clean this up further.
Acked-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Eric Anholt <eric@anholt.net>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-11-daniel.vetter@ffwll.ch
With this we can drop the final kfree from the release function.
The mock device in the selftests needed it's pci_device split
up from the drm_device. In the future we could simplify this again
by allocating the pci_device as a managed allocation too.
v2: I overlooked that i915_driver_destroy is also called in the
unwind code of the error path. There we need a drm_dev_put.
Similar for the mock object.
Now the problem with that is that the drm_driver->release callbacks
for both the real driver and the mock one assume everything has been
set up. Hence going through that path for a partially set up driver
will result in issues. Quickest fix is to disable the ->release() hook
until the driver is fully initialized, and keep the onion unwinding.
Long term would be cleanest to move everything over to drmm_ release
actions, but that's a lot of work for a big driver like i915. Plus
more core work needed first anyway.
v3: Fix i915_drm pointer wrangling in mock_gem_device. Also switch
over to start using drm_dev_put() to clean up even on the error path.
Aside I think the current error path is leaking the allocation.
v4: more fixes for intel-gfx-ci, some if it damage from v3 :-/
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Andi Shyti <andi.shyti@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Abdiel Janulgue <abdiel.janulgue@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-9-daniel.vetter@ffwll.ch
We have lots of these. And the cleanup code tends to be of dubious
quality. The biggest wrong pattern is that developers use devm_, which
ties the release action to the underlying struct device, whereas
all the userspace visible stuff attached to a drm_device can long
outlive that one (e.g. after a hotunplug while userspace has open
files and mmap'ed buffers). Give people what they want, but with more
correctness.
Mostly copied from devres.c, with types adjusted to fit drm_device and
a few simplifications - I didn't (yet) copy over everything. Since
the types don't match code sharing looked like a hopeless endeavour.
For now it's only super simplified, no groups, you can't remove
actions (but kfree exists, we'll need that soon). Plus all specific to
drm_device ofc, including the logging. Which I didn't bother to make
compile-time optional, since none of the other drm logging is compile
time optional either.
One tricky bit here is the chicken&egg between allocating your
drm_device structure and initiliazing it with drm_dev_init. For
perfect onion unwinding we'd need to have the action to kfree the
allocation registered before drm_dev_init registers any of its own
release handlers. But drm_dev_init doesn't know where exactly the
drm_device is emebedded into the overall structure, and by the time it
returns it'll all be too late. And forcing drivers to be able clean up
everything except the one kzalloc is silly.
Work around this by having a very special final_kfree pointer. This
also avoids troubles with the list head possibly disappearing from
underneath us when we release all resources attached to the
drm_device.
v2: Do all the kerneldoc at the end, to avoid lots of fairly pointless
shuffling while getting everything into shape.
v3: Add static to add/del_dr (Neil)
Move typo fix to the right patch (Neil)
v4: Enforce contract for drmm_add_final_kfree:
Use ksize() to check that the drm_device is indeed contained somewhere
in the final kfree(). Because we need that or the entire managed
release logic blows up in a pile of use-after-frees. Motivated by a
discussion with Laurent.
v5: Review from Laurent:
- %zu instead of casting size_t
- header guards
- sorting of includes
- guarding of data assignment if we didn't allocate it for a NULL
pointer
- delete spurious newline
- cast void* data parameter correctly in ->release call, no idea how
this even worked before
v6: Review from Sam
- Add the kerneldoc for the managed sub-struct back in, even if it
doesn't show up in the generated html somehow.
- Explain why __always_inline.
- Fix bisectability around the final kfree() in drm_dev_relase(). This
is just interim code which will disappear again.
- Some whitespace polish.
- Add debug output when drmm_add_action or drmm_kmalloc fail.
v7: My bisectability fix wasn't up to par as noticed by smatch.
v8: Remove unecessary {} around if else
v9: Use kstrdup_const, which requires kfree_const and introducing a free_dr()
helper (Thomas).
v10: kfree_const goes boom on the plain "kmalloc" assignment, somehow
we need to wrap that in kstrdup_const() too!! Also renumber revision
log, I somehow reset it midway thruh.
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Neil Armstrong <narmstrong@baylibre.com
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200324124540.3227396-1-daniel.vetter@ffwll.ch
For two reasons:
- The driver core clears this already for us after we're unloaded in
__device_release_driver().
- It's way too late, the drm_device ->release callback might massively
outlive the underlying physical device, since a drm_device can be
kept alive by open drm_file or well really anything else userspace
is still hanging onto. So if we clear this ourselves, we should
clear it in the pci ->remove callback, not in the drm_device
->release callback.
Looking at git history this was fixed in the driver core with
commit 0998d06310
Author: Hans de Goede <hdegoede@redhat.com>
Date: Wed May 23 00:09:34 2012 +0200
device-core: Ensure drvdata = NULL when no driver is bound
v2: Cite the core fix in the commit message (Chris).
v3: Fix commit message and unused variable warning (Jani).
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-3-daniel.vetter@ffwll.ch
1) SRIOV guest KMD doesn't care training buffer
2) if we resered training buffer that will overlap with IP discovery
reservation because training buffer is at vram_size - 0x8000 and
IP discovery is at ()vram_size - 0x10000 => vram_size -1)
v2: squash in warning fix from Nirmoy
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is one one corner case at dma_fence_signal_locked
which will raise the NULL pointer problem just like below.
->dma_fence_signal
->dma_fence_signal_locked
->test_and_set_bit
here trigger dma_fence_release happen due to the zero of fence refcount.
->dma_fence_put
->dma_fence_release
->drm_sched_fence_release_scheduled
->call_rcu
here make the union fled “cb_list” at finished fence
to NULL because struct rcu_head contains two pointer
which is same as struct list_head cb_list
Therefore, to hold the reference of finished fence at drm_sched_process_job
to prevent the null pointer during finished fence dma_fence_signal
[ 732.912867] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 732.914815] #PF: supervisor write access in kernel mode
[ 732.915731] #PF: error_code(0x0002) - not-present page
[ 732.916621] PGD 0 P4D 0
[ 732.917072] Oops: 0002 [#1] SMP PTI
[ 732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE 5.4.0-rc7 #1
[ 732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
[ 732.938569] Call Trace:
[ 732.939003] <IRQ>
[ 732.939364] dma_fence_signal+0x29/0x50
[ 732.940036] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[ 732.940996] drm_sched_process_job+0x34/0xa0 [gpu_sched]
[ 732.941910] dma_fence_signal_locked+0x85/0x100
[ 732.942692] dma_fence_signal+0x29/0x50
[ 732.943457] amdgpu_fence_process+0x99/0x120 [amdgpu]
[ 732.944393] sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
v2: hold the finished fence at drm_sched_process_job instead of
amdgpu_fence_process
v3: resume the blank line
Signed-off-by: Yintian Tao <yttao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
events at vsartup for DCN")' introduces a new way of pageflip
completion handling for DCN, and some trouble.
The current implementation introduces a race condition, which
can cause pageflip completion events to be sent out one vblank
too early, thereby confusing userspace and causing flicker:
prepare_flip_isr():
1. Pageflip programming takes the ddev->event_lock.
2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
3. Releases ddev->event_lock.
--> Deadline for surface address regs double-buffering passes on
target pipe.
4. dc_commit_updates_for_stream() MMIO programs the new pageflip
into hw, but too late for current vblank.
=> pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
in current vblank due to missing the double-buffering deadline
by a tiny bit.
5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
dm_dcn_crtc_high_irq() gets called.
6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
pageflip has been completed/will complete in this vblank and
sends out pageflip completion event to userspace and resets
pflip_status = AMDGPU_FLIP_NONE.
=> Flip completion event sent out one vblank too early.
This behaviour has been observed during my testing with measurement
hardware a couple of time.
The commit message says that the extra flip event code was added to
dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
in case the pflip irq doesn't fire, because the "DCH HUBP" component
is clock gated and doesn't fire pflip irqs in that state. Also that
this clock gating may happen if no planes are active. This suggests
that the problem addressed by that commit can't happen if planes
are active.
The proposed solution is therefore to only execute the extra pflip
completion code iff the count of active planes is zero and otherwise
leave pflip completion handling to the pflip irq handler, for a
more race-free experience.
Note that i don't know if this fixes the problem the original commit
tried to address, as i don't know what the test scenario was. It
does fix the observed too early pageflip events though and points
out the problem introduced.
Fixes: 16f17eda8b ("drm/amd/display: Send vblank and user events at vsartup for DCN")
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>