We now have two implementations for vmapping a whole object, one for
dma-buf and one for the ringbuffer. If we couple the mapping into the
obj->pages lifetime, then we can reuse an obj->mapping for both and at
the same time couple it into the shrinker. There is a third vmapping
routine in the cmdparser that maps only a range within the object, for
the time being that is left alone, but will eventually use these routines
in order to cache the mapping between invocations.
v2: Mark the failable kmalloc() as __GFP_NOWARN (vsyrjala)
v3: Call unpin_vmap from the right dmabuf unmapper
v4: Rename vmap to map as we don't wish to imply the type of mapping
involved, just that it contiguously maps the object into kernel space.
Add kerneldoc and lockdep annotations
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Dave Gordon <david.s.gordon@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1460113874-17366-4-git-send-email-chris@chris-wilson.co.uk
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
In order to simplify future patches, extract the
lazy_coherency optimisation our of the engine->get_seqno() vfunc into
its own callback.
v2: Rename the barrier to engine->irq_seqno_barrier to try and better
reflect that the barrier is only required after the user interrupt before
reading the seqno (to ensure that the seqno update lands in time as we
do not have strict seqno-irq ordering on all platforms).
Reviewed-by: Dave Gordon <david.s.gordon@intel.com> [#v2]
v3: Comments for hangcheck paranoia. Mika wanted to keep the extra
barrier inside the hangcheck, just in case. I can argue that it doesn't
provide a barrier against anything, but the side-effects of applying the
barrier may prevent a false declaration of a hung GPU.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Dave Gordon <david.s.gordon@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1460195877-20520-2-git-send-email-chris@chris-wilson.co.uk
Currently for the case where there is enough space at the end of Ring
buffer for accommodating only the base request, the wrapround is done
immediately and as a result the base request gets added at the start
of Ring buffer. But there may not be enough free space at the beginning
to accommodate the base request, as before the wraparound, the wait was
effectively done for the reserved_size free space from the start of
Ring buffer. In such a case there is a potential of Ring buffer overflow,
the instructions at the head of Ring (ACTHD) can get overwritten.
Since the base request can fit in the remaining space, there is no need
to wraparound immediately. The wraparound will anyway happen later when
the reserved part starts getting used.
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Akash Goel <akash.goel@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1457688402-10411-1-git-send-email-akash.goel@intel.com
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org
At BXT DSI, PIPE registers are inactive. So we can't get the
PIPE's mode parameters from them. The possible option is
retriving them from the PORT registers.
The required changes are added for BXT in intel_dsi_get_config
(encoder->get_config).
v2: Addressed the Jani's comments
-removed the redundant call to encoder->get_config
-read bpp from port register
-removed retrival of src_size from encoder->get_config
v3: pipe_config->pipe_bpp is fixed
Jani's review comments addressed:
Few horizontal timing parameters dropped from the patch to make
progress, as there seems to be some disagreement on
best/feasible/possible options.
Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Previously Reviewed at: https://lists.freedesktop.org/archives/intel-gfx/2016-April/091737.html
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1460019967-26501-2-git-send-email-ramalingam.c@intel.com
This patch sets the invert bit for hpd detection for each port
based on VBT configuration. Since each AOB can be designed to
depend on invert bit or not, it is expected if an AOB requires
invert bit, the user will set respective bit in VBT.
v2: Separated VBT parsing from the rest of the logic. (Jani)
v3: Moved setting invert bit logic to bxt_hpd_irq_setup()
and changed its logic to avoid looping twice. (Ville)
v4: Changed the logic to mask out the bits first and then
set them to remove need of temporary variable. (Ville)
v5: Moved defines to existing set of defines for the register
and added required breaks. (Ville)
Signed-off-by: Sivakumar Thulasimani <sivakumar.thulasimani@intel.com>
Signed-off-by: Durgadoss R <durgadoss.r@intel.com>
Signed-off-by: Shubhangi Shrivastava <shubhangi.shrivastava@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
[Jani: fixed some checkpatch noise, added kernel-doc.]
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1459420907-11383-2-git-send-email-shubhangi.shrivastava@intel.com
Extract the GPLL reference frequency from CCK and use it in the
GPU freq<->opcode conversions on VLV/CHV. This eliminates all the
assumptions we have about which divider is used for which czclk
frequency.
Note that unlike most clocks from CCK, the GPLL ref clock is a divided
down version of the CZ clock rather than the HPLL clock. CZ clock itself
is a divided down version of the HPLL clock though, so in effect it just
gets divided down twice.
While at it, throw in a few comments explaining the remaining constants
for anyone who later wants to compare this to the spreadsheets.
v2: Add slow/fast notes for CHV clocks (Imre)
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1457120584-26080-2-git-send-email-ville.syrjala@linux.intel.com
Reviewed-by: Imre Deak <imre.deak@intel.com> (v1)
After a suspend-resume cycle, the resumed kernel has no idea what the
booted kernel may have done to the GuC before replacing itself with the
resumed image. In particular, it may have already loaded the GuC with
firmware, which will then cause this kernel's attempt to (re)load the
firmware to fail (GuC program memory is write-once!). The symptoms
(GuC firmware reload fails after hibernation) are further described
in the Bugzilla reference below.
So let's *always* reset the GuC just before (re)loading the firmware;
the hardware should then be in a well-known state, and we may even
avoid some of the issues arising from unpredictable timing.
Also added some more fields & values to the definition of the GUC_STATUS
register, which is the key diagnostic indicator if the GuC load fails.
Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
Reviewed-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Cc: Alex Dai <yu.dai@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94390
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Due to timing issues in the HW, some of the status bits required for GuC
authentication occasionally don't get set; when that happens, the GuC
cannot be initialized and we will be left with a wedged GPU. The W/A
suggested is to perform a soft reset of the GuC and attempt to reload
the F/W again for few times before giving up.
As the failure is dependent on timing, tests performed by triggering
manual full gpu reset (i915_wedged) showed that we could sometimes hit
this after several thousand iterations, but sometimes tests ran even
longer without any issues. Reset and reload mechanism proved helpful
when we indeed hit f/w load failure, so it is better to include this
to improve driver stability.
This change implements the following WAs,
WaEnableuKernelHeaderValidFix:skl,bxt
WaEnableGuCBootHashCheckNotSet:skl,bxt
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
Reviewed-by: Alex Dai <yu.dai@intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Doing a lot of work in the interrupt handler introduces huge
latencies to the system as a whole.
Most dramatic effect can be seen by running an all engine
stress test like igt/gem_exec_nop/all where, when the kernel
config is lean enough, the whole system can be brought into
multi-second periods of complete non-interactivty. That can
look for example like this:
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
Modules linked in: [redacted for brevity]
CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183
Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
Workqueue: i915 gen6_pm_rps_work [i915]
task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
RSP: 0000:ffff88014f403f38 EFLAGS: 00000206
RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
Call Trace:
<IRQ>
[<ffffffff8104a716>] irq_exit+0x86/0x90
[<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
[<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
<EOI>
[<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
[<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
[<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
[<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
[<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
[<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
[<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
[<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
[<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
[<ffffffff8105ab29>] process_one_work+0x139/0x350
[<ffffffff8105b186>] worker_thread+0x126/0x490
[<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
[<ffffffff8105fa64>] kthread+0xc4/0xe0
[<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
[<ffffffff814f351f>] ret_from_fork+0x3f/0x70
[<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
I could not explain, or find a code path, which would explain
a +20 second lockup, but from some instrumentation it was
apparent the interrupts off proportion of time was between
10-25% under heavy load which is quite bad.
When a interrupt "cliff" is reached, which was >~320k irq/s on
my machine, the whole system goes into a terrible state of the
above described multi-second lockups.
By moving the GT interrupt handling to a tasklet in a most
simple way, the problem above disappears completely.
Testing the effect on sytem-wide latencies using
igt/gem_syslatency shows the following before this patch:
gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us
This shows that the unrelated process is experiencing huge
delays in its wake-up latency. After the patch the results
look like this:
gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
gem_syslatency: cycles=841683, latency mean=56.914us max=1667us
Showing a huge improvement in the unrelated process wake-up
latency. It also shows an approximate halving in the number
of total empty batches submitted during the test. This may
not be worrying since the test puts the driver under
a very unrealistic load with ncpu threads doing empty batch
submission to all GPU engines each.
Another benefit compared to the hard-irq handling is that now
work on all engines can be dispatched in parallel since we can
have up to number of CPUs active tasklets. (While previously
a single hard-irq would serially dispatch on one engine after
another.)
More interesting scenario with regards to throughput is
"gem_latency -n 100" which shows 25% better throughput and
CPU usage, and 14% better dispatch latencies.
I did not find any gains or regressions with Synmark2 or
GLbench under light testing. More benchmarking is certainly
required.
v2:
* execlists_lock should be taken as spin_lock_bh when
queuing work from userspace now. (Chris Wilson)
* uncore.lock must be taken with spin_lock_irq when
submitting requests since that now runs from either
softirq or process context.
v3:
* Expanded commit message with more testing data;
* converted missed locking sites to _bh;
* added execlist_lock comment. (Chris Wilson)
v4:
* Mention dispatch parallelism in commit. (Chris Wilson)
* Do not hold uncore.lock over MMIO reads since the block
is already serialised per-engine via the tasklet itself.
(Chris Wilson)
* intel_lrc_irq_handler should be static. (Chris Wilson)
* Cancel/sync the tasklet on GPU reset. (Chris Wilson)
* Document and WARN that tasklet cannot be active/pending
on engine cleanup. (Chris Wilson/Imre Deak)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Imre Deak <imre.deak@intel.com>
Testcase: igt/gem_exec_nop/all
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1459768316-6670-1-git-send-email-tvrtko.ursulin@linux.intel.com