Lenovo X12 Detachable Gen 2 unresponsive under light load

Bug #2076361 reported by Gustavo Niemeyer
This bug affects 1 person
Affects                           Status       Importance  Assigned to     Milestone
linux (Ubuntu)                    New          High        AaronMa
  Noble                           In Progress  High        AaronMa
linux-oem-6.8 (Ubuntu)            New          Undecided   Unassigned
  Noble                           In Progress  Undecided   Unassigned
linux-signed (Ubuntu)             New          Undecided   Andy Whitcroft
  Noble                           New          Undecided   Unassigned
linux-signed-lowlatency (Ubuntu)  New          Undecided   Andy Whitcroft
  Noble                           New          Undecided   Unassigned

Bug Description

SRU Justification:
==============

[Impact]
Encoding some video files with ffmpeg makes the system completely
unresponsive for as long as the process executes.

[Fix]
Enable Wa_14019159160 and Wa_16019325821 for MTL.

[Test]
Tested on hardware: the system works fine when running the same ffmpeg
script to encode.

[Where problems could occur]
It may break the Intel i915 driver.
=========================

I've been using a Lenovo X12 Detachable Gen 2 model from 2024 [1] to encode some video files with ffmpeg, and the system becomes completely unresponsive for as long as the process executes. Although small, this is a pretty good device hardware-wise, and while ffmpeg is executing the system has plenty of RAM (32GB total, 10GB+ left), almost all CPUs are idle, no IO wait.

[1] Intel Core Ultra 164U variant at: https://psref.lenovo.com/syspool/Sys/PDF/Think_Tablets/ThinkPad_X12_Detachable_Gen_2/ThinkPad_X12_Detachable_Gen_2_Spec.pdf

Things I've tried:

1) Lowering the priority of ffmpeg with renice
2) Lowering the priority of ffmpeg with ionice
3) Using a single thread in ffmpeg
4) Switching to the lowlatency kernel, with all recommended fiddling
5) Switching to the OEM kernel (6.8.0-1010-oem)

None of these makes any difference to the complete lack of responsiveness. The system becomes so unresponsive that typed characters do not show up at first, and then appear repeated in long sequences all at once.
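
For reference, attempts 1 to 3 in the list above can be sketched roughly as follows. This is only an illustration, not the reporter's actual script: the helper name `deprioritize_cmds` and the `INPUT`/`OUTPUT` placeholders are assumptions, and the exact priority values used in the report are not stated.

```shell
#!/bin/sh
# Hypothetical helper: print the commands for attempts 1-3 for a given
# ffmpeg PID. It only prints them, so it is safe to run anywhere.
deprioritize_cmds() {
    pid="$1"
    # 1) lowest CPU scheduling priority for the running process
    echo "renice -n 19 -p $pid"
    # 2) idle IO scheduling class for the running process
    echo "ionice -c 3 -p $pid"
    # 3) restart the encode with a single thread (INPUT/OUTPUT are placeholders)
    echo "ffmpeg -threads 1 -i INPUT OUTPUT"
}
```

All three only affect CPU and IO scheduling of the ffmpeg process itself, which is consistent with them having no effect if the stall originates in the GPU.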

In addition to the attached information, some details about the moment the problem happens:

top - 16:27:49 up 3:26, 1 user, load average: 1.13, 0.70, 0.68
Tasks: 397 total, 1 running, 396 sleeping, 0 stopped, 0 zombie
%Cpu0 : 2.7 us, 0.0 sy, 0.0 ni, 56.3 id, 41.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 3.3 us, 0.3 sy, 0.0 ni, 96.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 3.0 us, 0.0 sy, 0.0 ni, 97.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.7 us, 1.0 sy, 6.3 ni, 2.7 id, 89.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 3.7 us, 0.0 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 4.0 us, 0.3 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 3.3 us, 0.0 sy, 0.0 ni, 96.3 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 3.7 us, 0.0 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 4.3 us, 0.0 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 3.7 us, 0.0 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 4.0 us, 0.0 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 3.7 us, 0.0 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 3.0 us, 0.3 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 4.0 us, 0.0 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31537.6 total, 9596.6 free, 9784.3 used, 15292.1 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 21753.3 avail Mem

Linux 6.8.0-39-lowlatency (x12) 08/08/2024 _x86_64_ (14 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
           2.87 0.04 0.26 7.48 0.00 89.35
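
As a rough aid for reading the per-CPU snapshot above, the following sketch filters `top` output for CPUs whose iowait (`wa`) exceeds a threshold. The function name `high_iowait_cpus` and the threshold are illustrative assumptions; it expects lines in the same `%CpuN : ... wa, ...` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read top's per-CPU lines on stdin and print
# "CpuN <wa%>" for every CPU whose iowait is above the given threshold.
high_iowait_cpus() {
    awk -v limit="$1" -F',' '
        /^%Cpu/ {
            split($1, head, ":")        # head[1] is e.g. "%Cpu0 "
            gsub(/[% ]/, "", head[1])   # strip "%" and spaces -> "Cpu0"
            for (i = 1; i <= NF; i++) {
                if ($i ~ / wa *$/) {    # the field that carries iowait
                    n = split($i, parts, " ")
                    wa = parts[n - 1]
                    if (wa + 0 > limit + 0) print head[1], wa
                }
            }
        }'
}
```

Fed the snapshot above with a threshold of 10, this would list only Cpu0 (41.0) and Cpu3 (89.3); every other CPU is idle with essentially no iowait.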

It may also be worth mentioning that the ffmpeg process is using hardware encoding/decoding.

Thanks for any help on this.

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: linux-image-6.8.0-39-lowlatency 6.8.0-39.39.1
ProcVersionSignature: Ubuntu 6.8.0-39.39.1-lowlatency 6.8.8
Uname: Linux 6.8.0-39-lowlatency x86_64
ApportVersion: 2.28.1-0ubuntu3
Architecture: amd64
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Thu Aug 8 17:23:32 2024
InstallationDate: Installed on 2024-06-25 (44 days ago)
InstallationMedia: Ubuntu 24.04 LTS "Noble Numbat" - Release amd64 (20240424)
SourcePackage: linux-signed-lowlatency
UpgradeStatus: No upgrade log present (probably fresh install)

Gustavo Niemeyer (niemeyer) wrote :
Changed in linux-signed (Ubuntu):
assignee: nobody → Andy Whitcroft (apw)
description: updated
AaronMa (mapengyu) wrote :

Hi niemeyer,

I tested ffmpeg encoding with hevc_vaapi and h264_vaapi, but could not reproduce the issue on a ThinkPad X1 with an Intel(R) Core(TM) Ultra 7 165U.
My environment:
Ubuntu 24.04 LTS
ffmpeg version 6.1.1-3ubuntu5
6.8.0-40-generic

$ ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i INPUT -vf 'scale_vaapi=format=p010' -c:v hevc_vaapi -profile 2 -b:v 15M output.mp4

The attached dmesg log shows no obvious error.
Was it recorded after the issue was reproduced?

Could you check the following status while encoding,
and attach the dmesg output captured after the issue is reproduced?

$ sudo apt install intel-gpu-tools lm-sensors

Now open 3 terminal windows to show the status of the system:
1, $ sudo watch -n1 sensors
2, $ watch -n 1 "grep \"^[c]pu MHz\" /proc/cpuinfo"
3, $ sudo intel_gpu_top

Then start encoding with the command above.
My results are as follows:
intel_gpu_top shows high usage of the video engines, around 40%.
The CPU temperature is around 60 degrees and CPU usage is around 30%.
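
To capture the CPU frequency check from window 2 as a single summary rather than watching it live, something like the following sketch could be used. The function name `cpu_mhz_range` is an assumption; it reads `/proc/cpuinfo` by default, or any file in the same format passed as the first argument.

```shell
#!/bin/sh
# Hypothetical helper: print the lowest and highest "cpu MHz" values
# found in a cpuinfo-style file, as "min max" rounded to whole MHz.
cpu_mhz_range() {
    awk -F': *' '/^cpu MHz/ {
        v = $2 + 0
        if (n == 0 || v < min) min = v
        if (n == 0 || v > max) max = v
        n++
    } END { if (n) printf "%.0f %.0f\n", min, max }' "${1:-/proc/cpuinfo}"
}
```

Running `cpu_mhz_range` repeatedly while encoding gives a quick view of whether the cores are pinned at a low frequency or scaling freely.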

Gustavo Niemeyer (niemeyer) wrote :

Thanks for the reply, Aaron.

The dmesg.log file is already attached above.

Your ffmpeg example is on the simple side compared to the pipeline that creates the issue for me, where ffmpeg has multiple inputs and a pipeline that touches both the GPU and the CPU. For simple things it works okay here as well.

intel_gpu_top spikes to 100% on both Render/3D and Compute with very brief interruptions, probably while it moves the frame for processing by the CPU. Blitter, Video, and VideoEnhance remain at 0%.

/proc/cpuinfo shows CPU comfortably alternating between 400MHz and 4GHz, apparently with plenty to spare and confirming the previous output provided by iostat and top in the original report above.

sensors output shows the temperature for all CPUs comfortably under 70 degrees at all times, with almost all of them under 60.

AaronMa (mapengyu) wrote :

Hi niemeyer,

The issue is probably caused by the 100% usage of Render/3D.
Ideally, the encoder should use only Video/VideoEnhance.

Could you share your ffmpeg script?
I'd like to find a way to reproduce this issue.

AaronMa (mapengyu) wrote :

100% usage of Render/3D means copy, vpp or display are stuck,
since multiple inputs are set, could it work when encoding a single input each time?

Maybe there is an unsupported video format input by Intel GPU decoder/encoder.

Otherwise, the ffmpeg parameters may be requesting a specific filter or scaling operation that runs on the render engine.

AaronMa (mapengyu) wrote :

Or try installing the non-free driver to support more video formats:

$ sudo apt install intel-media-va-driver-non-free

Gustavo Niemeyer (niemeyer) wrote :

> 100% usage of Render/3D means copy, vpp or display are stuck,

The issue is certainly related to the 100% render/compute, but this looks like a symptom and a consequence rather than a cause.

> since multiple inputs are set, could it work when encoding a single input each time?

Yeah, there are several ways to work around it. But to be clear, the concern here is not that ffmpeg is slow, but rather that whatever is happening is taking the entire system down instead of gracefully degrading. Even keyboard input stops working.

> Maybe there is an unsupported video format input by Intel GPU decoder/encoder.

No, the input/output encoding/decoding is fine. It's using H265/HEVC, well supported by the chip.

I've done some more testing and bisected the problem to the following, with left.mp4 and right.mp4 being 4K videos.

This works fine, with Video fluctuating between 20 and 30%, Render close to zero:

```
ffmpeg -y \
        -vaapi_device /dev/dri/renderD128 \
        -hwaccel vaapi -hwaccel_output_format vaapi -i left.mp4 \
        -hwaccel vaapi -hwaccel_output_format vaapi -i right.mp4 \
        -filter_complex "
                [0:v]hwdownload,format=nv12,copy[v0];
                [1:v]hwdownload,format=nv12,copy[v1];
                [v0][v1]hstack=shortest=true,format=nv12,hwupload[out];
                " \
        -c:v hevc_vaapi -map "[out]" slowtest.mp4
```

This causes the issue, with Render at 100%, Video at 0%, system unresponsive:

```
ffmpeg -y \
        -vaapi_device /dev/dri/renderD128 \
        -hwaccel vaapi -hwaccel_output_format vaapi -i left.mp4 \
        -hwaccel vaapi -hwaccel_output_format vaapi -i right.mp4 \
        -filter_complex "
                [0:v]hwdownload,format=nv12,copy,format=nv12,hwupload[v0];
                [1:v]hwdownload,format=nv12,copy,format=nv12,hwupload[v1];
                [v0][v1]hstack_vaapi=shortest=true[out];
                " \
        -c:v hevc_vaapi -map "[out]" out.mp4
```

The key difference between these two pipelines is whether hstack happens in the GPU or the CPU. When it works, two frames are downloaded from the GPU, stacked together, and the merged frame is then re-uploaded for encoding. When it breaks, the two frames are uploaded individually and stacked in the GPU instead.
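
To make that difference easy to toggle when reproducing, the two filter graphs can be generated from a single switch. This is only a sketch built from the two commands above; the function name `build_filter` and its `cpu`/`gpu` argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the -filter_complex graph for one variant.
#   cpu: download both frames, stack with hstack on the CPU, upload the result
#   gpu: re-upload each frame, then stack on the GPU with hstack_vaapi
build_filter() {
    download='hwdownload,format=nv12,copy'
    upload='format=nv12,hwupload'
    case "$1" in
        cpu)
            printf '%s' "[0:v]${download}[v0];[1:v]${download}[v1];"
            printf '%s\n' "[v0][v1]hstack=shortest=true,${upload}[out]" ;;
        gpu)
            printf '%s' "[0:v]${download},${upload}[v0];[1:v]${download},${upload}[v1];"
            printf '%s\n' "[v0][v1]hstack_vaapi=shortest=true[out]" ;;
        *)
            echo "usage: build_filter cpu|gpu" >&2
            return 1 ;;
    esac
}
```

The output can be passed to either command above as `-filter_complex "$(build_filter gpu)"`, keeping all other options identical so that the stacking location is the only variable.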

The problem may be related to the amount of memory used in the GPU, since when the stacking happens in the GPU there is twice as much data. Even then, instead of degrading gracefully, the system ground to a halt with the GPU misbehaving, keyboard interrupts going wild, etc.

Anything else I can do to help diagnose why that's the case or whether we can address it?

AaronMa (mapengyu) wrote :

Hi Niemeyer,

I can reproduce the issue with your ffmpeg script.

The issue is caused by an Intel MTL (Meteor Lake) hardware issue and is fixed by the workaround Wa_16019325821 [1], which is included in v6.10-rc1.

I have built a test kernel image; please try it.
Once you confirm that it is fixed by [1], I will SRU it to the Noble kernel.

[1] https://patchwork.freedesktop.org/series/130335/

AaronMa (mapengyu) wrote :
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: nobody → AaronMa (mapengyu)
Changed in linux (Ubuntu Noble):
assignee: nobody → AaronMa (mapengyu)
importance: Undecided → High
Gustavo Niemeyer (niemeyer) wrote :

Can confirm the fix works here too. Compute jumps to 100%, but the system remains fully usable and the process moves on slowly.

Thanks again for your help, Aaron.

AaronMa (mapengyu)
description: updated
AaronMa (mapengyu) wrote :

Thanks, Niemeyer,

The SRU is in progress.

description: updated
Changed in linux (Ubuntu Noble):
status: New → In Progress
Changed in linux-oem-6.8 (Ubuntu Noble):
status: New → In Progress