Black screen on boot - Oops: 0002 [#1] SMP on Lenovo T400 laptop with radeon

Bug #820746 reported by Mynk on 2011-08-04
This bug affects 2 people
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

When I boot with the latest kernel -

# uname -a # I am guessing this to be *-10 as I am booted from *-8
Linux mayankr-T400 2.6.38-10-generic #42-Ubuntu SMP Mon Apr 11 03:31:50 UTC 2011 i686 i686 i386 GNU/Linux

I get a black screen. Trying to switch between consoles (Ctrl+Alt+F1, then back to F7) gives me an Oops.

I have isolated the relevant log from /var/log/kern.log and attached it as Aug4_kern.log.

# cat version.log
Ubuntu 2.6.38-8.42-generic 2.6.38.2 # I have booted from this version as 2.6.38-10 gives the black screen

The Oops footprint for your convenience -

Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354151] BUG: unable to handle kernel paging request at fa501ffc
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354200] IP: [<f89159e8>] r600_cp_start+0x48/0x380 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354263] *pde = 31ae7067 *pte = 00000000
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354294] Oops: 0002 [#1] SMP
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354321] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1/alignment_offset
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354380] Modules linked in: joydev snd_hda_codec_conexant thinkpad_acpi snd_hda_intel snd_hda_codec pcmcia snd_hwdep arc4 snd_pcm snd_seq_midi iwlagn snd_rawmidi iwlcore radeon(+) mac80211 snd_seq_midi_event i915 cfg80211 snd_seq snd_timer ttm snd_seq_device yenta_socket pcmcia_rsrc snd pcmcia_core psmouse drm_kms_helper drm serio_raw tpm_tis nvram soundcore i2c_algo_bit snd_page_alloc tpm video tpm_bios lp parport usb_storage uas firewire_ohci ahci firewire_core libahci crc_itu_t e1000e
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354775]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354787] Pid: 515, comm: modprobe Not tainted 2.6.38-10-generic #46-Ubuntu LENOVO 2768GZ9/2768GZ9
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354854] EIP: 0060:[<f89159e8>] EFLAGS: 00010287 CPU: 1
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354906] EIP is at r600_cp_start+0x48/0x380 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354937] EAX: 00000000 EBX: f6f8d000 ECX: ffffffff EDX: fa501ffc
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354971] ESI: 00000000 EDI: 00000010 EBP: f1999d18 ESP: f1999d00
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355005] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355036] Process modprobe (pid: 515, ti=f1998000 task=f19c3280 task.ti=f1998000)
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355077] Stack:
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355091] 59413760 f6f8d000 00000911 f6f8d000 00000911 00000011 f1999d2c f8916065
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355159] f6f8d000 00000000 f45d581c f1999d48 f8918b3f f8911621 f895d820 f1999d48
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355230] f6f8d000 00002000 f1999d60 f8919c11 c16b99bd f6f8d000 00002000 f6f8d000
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355298] Call Trace:
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355335] [<f8916065>] r600_cp_resume+0x345/0x580 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355389] [<f8918b3f>] r600_startup+0xbf/0x170 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355441] [<f8911621>] ? r600_pcie_gart_init+0x61/0x70 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355495] [<f8919c11>] r600_init+0x181/0x2b0 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355545] [<f88be701>] radeon_device_init+0x381/0x400 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355595] [<f88bcf50>] ? radeon_switcheroo_can_switch+0x0/0x50 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355652] [<f88bfbb4>] radeon_driver_load_kms+0x94/0x160 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355699] [<f85c6694>] drm_get_pci_dev+0x144/0x2b0 [drm]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355737] [<c102d978>] ? default_spin_lock_flags+0x8/0x10
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355770] [<c102d978>] ? default_spin_lock_flags+0x8/0x10
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355820] [<f8946d08>] ? radeon_pci_probe+0x2c/0x94 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355872] [<f8946d08>] ? radeon_pci_probe+0x2c/0x94 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355923] [<f8946d6b>] radeon_pci_probe+0x8f/0x94 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355957] [<c12924e7>] local_pci_probe+0x47/0xb0
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.355987] [<c1293968>] pci_device_probe+0x68/0x90
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c133440d>] really_probe+0x4d/0x150
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c133c67b>] ? pm_runtime_barrier+0x4b/0xb0
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c13346ac>] driver_probe_device+0x3c/0x60
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1334751>] __driver_attach+0x81/0x90
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c13346d0>] ? __driver_attach+0x0/0x90
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1333808>] bus_for_each_dev+0x48/0x70
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1293370>] ? pci_device_remove+0x0/0xf0
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c13342be>] driver_attach+0x1e/0x20
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c13346d0>] ? __driver_attach+0x0/0x90
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1333ed8>] bus_add_driver+0xb8/0x250
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1293370>] ? pci_device_remove+0x0/0xf0
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1334996>] driver_register+0x66/0x110
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1292a85>] __pci_register_driver+0x45/0xb0
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<f85c6ac6>] drm_pci_init+0x96/0xc0 [drm]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<f85bf044>] drm_init+0x54/0x70 [drm]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<f89830b7>] radeon_init+0xb7/0x1000 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1001255>] do_one_initcall+0x35/0x170
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<f8983000>] ? radeon_init+0x0/0x1000 [radeon]
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c1088afb>] sys_init_module+0xdb/0x230
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] [<c150a194>] syscall_call+0x7/0xb
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] Code: c6 0f 85 fd 02 00 00 8b bb a4 07 00 00 85 ff 0f 8e d6 02 00 00 8b 83 94 07 00 00 8d 14 85 00 00 00 00 83 c0 01 03 93 8c 07 00 00 <c7> 02 00 44 05 c0 8b 93 a4 07 00 00 23 83 b4 07 00 00 83 ab a0
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] EIP: [<f89159e8>] r600_cp_start+0x48/0x380 [radeon] SS:ESP 0068:f1999d00
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] CR2: 00000000fa501ffc
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] ---[ end trace 8a26ed967ec47429 ]---
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.587160] EXT4-fs (sda8): re-mounted. Opts: errors=remount-ro,user_xattr
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.334726] EXT3-fs: barriers not enabled
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.335168] kjournald starting. Commit interval 5 seconds
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.335819] EXT3-fs (sda7): using internal journal
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.335823] EXT3-fs (sda7): mounted filesystem with ordered data mode
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.439036] ppdev: user-space parallel port driver
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.450801] type=1400 audit(1312430120.789:8): apparmor="STATUS" operation="profile_load" name="/usr/lib/cups/backend/cups-pdf" pid=972 comm="apparmor_parser"
Aug 4 09:25:20 mayankr-T400 kernel: [ 11.453698] type=1400 audit(1312430120.793:9): apparmor="STATUS" operation="profile_load" name="/usr/sbin/cupsd" pid=972 comm="apparmor_parser"

I don't see an option to attach another file (why?). I will upload lspci-vnvn.log as suggested if needed; I am assuming apport already did that.

Thanks,
Mayank

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-2.6.38-8-generic 2.6.38-8.42
ProcVersionSignature: Ubuntu 2.6.38-8.42-generic 2.6.38.2
Uname: Linux 2.6.38-8-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: CONEXANT Analog [CONEXANT Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: mayankr 1502 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xfc620000 irq 50'
   Mixer name : 'Conexant CX20561 (Hermosa)'
   Components : 'HDA:14f15051,17aa211c,00100000 HDA:14f12c06,17aa2122,00100000'
   Controls : 16
   Simple ctrls : 8
Card29.Amixer.info:
 Card hw:29 'ThinkPadEC'/'ThinkPad Console Audio Control at EC reg 0x30, fw 7VHT16WW-1.06'
   Mixer name : 'ThinkPad EC 7VHT16WW-1.06'
   Components : ''
   Controls : 1
   Simple ctrls : 1
Card29.Amixer.values:
 Simple mixer control 'Console',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
Date: Thu Aug 4 10:03:51 2011
HibernationDevice: RESUME=UUID=e11e3cd6-3571-4ff0-8bd2-936410ff335c
InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
MachineType: LENOVO 2768GZ9
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-8-generic root=UUID=e407c972-f9c6-456c-91e5-794d12f49acc ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-2.6.38-8-generic N/A
 linux-backports-modules-2.6.38-8-generic N/A
 linux-firmware 1.52
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: yes
  Hard blocked: yes
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/19/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 7UET86WW (3.16 )
dmi.board.name: 2768GZ9
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: 50009928
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr7UET86WW(3.16):bd04/19/2010:svnLENOVO:pn2768GZ9:pvrThinkPadT400:rvnLENOVO:rn2768GZ9:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 2768GZ9
dmi.product.version: ThinkPad T400
dmi.sys.vendor: LENOVO

Mynk (mr-mynk) wrote :

Ok I see that lspci details are attached.

I am able to reproduce this problem at will. The laptop won't boot with the latest kernel. I have to boot from the previous kernel.

Kindly note that the bug report was taken when booted from the older version of the kernel.

Hope this helps,
Mayank

Seth Forshee (sforshee) on 2011-08-04
tags: added: regression-update

I don't know much about radeon or video drivers, but I am curious and took a stab at this issue using the radeon.ko objdump disassembly provided by the bug reporter, who happens to be my friend. (Please try the change to r600_cp_resume in r600.c recommended at the end of this report to see whether it resolves the Oops.)

The Oops is caused by an invalid ring write pointer being used to store a value into the radeon ring buffer.
The write pointer read from the radeon control register (via the r100_mm_rreg function in radeon.h) comes back with an incorrect, seemingly negative, value on resume. It looks like we may have to add a retry in r600_cp_resume to make it work.

RCA enclosed below:
------------------

Mapping the Oops to the disassembly, it's clear that the fault was triggered by the store in this function:
static inline void radeon_ring_write(struct radeon_device *rdev, uint32_t v)
{
#if DRM_DEBUG_CODE
        if (rdev->cp.count_dw <= 0) {
                DRM_ERROR("radeon: writting more dword to ring than expected !\n");
        }
#endif
        rdev->cp.ring[rdev->cp.wptr++] = v; /* Oopses here: rdev->cp.wptr appears to be negative */
        rdev->cp.wptr &= rdev->cp.ptr_mask;
        rdev->cp.count_dw--;
        rdev->cp.ring_free_dw--;
}

Let's now map the above to the EIP:

The EIP at the time of the Oops was r600_cp_start+0x48.
From the objdump disassembly, it maps to:

r600_cp_start: (0x709e8):
  709e8: c7 02 00 44 05 c0 movl $0xc0054400,(%edx)

It also exactly matches the Oops hex dump:
Aug 4 09:25:20 mayankr-T400 kernel: [ 10.356006] Code: c6 0f 85 fd 02 00 00 8b bb a4 07 00 00 85 ff 0f 8e d6 02 00 00 8b 83 94 07 00 00 8d 14 85 00 00 00 00 83 c0 01 03 93 8c 07 00 00 <c7> 02 00 44 05 c0 8b 93 a4 07 00 00 23 83 b4 07 00 00 83 ab a0

The byte <c7> in angle brackets marks the start of the faulting instruction, and the bytes that follow match the instruction at the EIP above (c7 02 00 44 05 c0).
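For readers who want to check the byte match by hand, here is a minimal C sketch that decodes the six bytes marked in the Code: line. It only handles the plain register-indirect form of opcode C7 /0 (mov imm32 to memory), and the struct and function names are made up for this illustration; they are not from the kernel or from any disassembler library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded fields of a "mov $imm32, (reg)" instruction (opcode 0xC7 /0). */
struct mov_imm32 {
    uint8_t  mod;  /* ModRM bits 7:6; 0 = register-indirect, no displacement */
    uint8_t  reg;  /* ModRM bits 5:3; must be 0 for the MOV form of C7 */
    uint8_t  rm;   /* ModRM bits 2:0; 0=EAX 1=ECX 2=EDX 3=EBX ... */
    uint32_t imm;  /* 32-bit immediate, stored little-endian in the stream */
};

/* Returns 0 on success, -1 if the bytes are not the simple form handled here. */
int decode_mov_imm32(const uint8_t *code, struct mov_imm32 *out)
{
    assert(code != NULL && out != NULL);

    if (code[0] != 0xC7)
        return -1;

    out->mod = code[1] >> 6;
    out->reg = (code[1] >> 3) & 7;
    out->rm  = code[1] & 7;

    /* rm 4 (SIB byte) and rm 5 (disp32) are outside this sketch. */
    if (out->mod != 0 || out->reg != 0 || out->rm == 4 || out->rm == 5)
        return -1;

    out->imm = (uint32_t)code[2]
             | (uint32_t)code[3] << 8
             | (uint32_t)code[4] << 16
             | (uint32_t)code[5] << 24;
    return 0;
}
```

Feeding it c7 02 00 44 05 c0 gives rm = 2 (EDX in the x86 register numbering) and an immediate of 0xc0054400, which agrees with the objdump line above.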

Working back from the disassembly to the source makes the faulting EIP evident.
From the objdump, EBX holds the radeon_device pointer rdev.

The remaining ring buffer dword count, rdev->cp.count_dw, is 16 and is held in EDI.

The write pointer index rdev->cp.wptr for the radeon ring buffer is held in EAX.
The Oops shows EAX as 0.

The ring buffer address that triggered the Oops is in EDX, which is the target of the store above.
The EDX value from the Oops is 0xfa501ffc, which is also the address that took the page fault:

Aug 4 09:25:20 mayankr-T400 kernel: [ 10.354151] BUG: unable to handle kernel paging request at fa501ffc

movl $0xc0054400, (%edx)
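The error code in the Oops header (Oops: 0002) tells the same story as this store. A small sketch, using the x86 page-fault error code bit layout (bit 0: protection violation vs. page not present, bit 1: write access, bit 2: user mode); the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three low bits of an x86 page-fault error code. */
struct pf_error {
    bool protection; /* bit 0: true = protection violation, false = page not present */
    bool write;      /* bit 1: true = faulting access was a write */
    bool user;       /* bit 2: true = fault happened in user mode */
};

void decode_pf_error(uint32_t code, struct pf_error *out)
{
    assert(out != NULL);
    out->protection = (code & 0x1u) != 0;
    out->write      = (code & 0x2u) != 0;
    out->user       = (code & 0x4u) != 0;
}
```

For 0x0002 this gives a kernel-mode write to a page that is not present, which fits the *pte = 00000000 line in the Oops.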

The "C" call for the above store is

    radeon_ring_write(rdev, PACKET3(PACKET3_ME_INITIALIZE, 5));

from r600_cp_start. The PACKET3(PACKET3_ME_INITIALIZE, 5) macro evaluates to 0xc0054400.
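As a cross-check of that value, here is a sketch of how a type-3 packet header could be assembled. The bit layout (packet type 3 in bits 31:30, count in bits 29:16, opcode in bits 15:8) and the opcode value 0x44 for ME_INITIALIZE are assumptions on my part rather than something copied from the driver headers; they are only trusted here because they reproduce the 0xc0054400 seen in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

#define PKT_TYPE3           3u     /* assumed packet type value */
#define PKT3_ME_INITIALIZE  0x44u  /* assumed opcode for ME_INITIALIZE */

/* Build a type-3 packet header from an opcode and a count field. */
uint32_t packet3(uint32_t opcode, uint32_t count)
{
    assert(opcode <= 0xFFu);    /* opcode field is 8 bits wide in this layout */
    assert(count <= 0x3FFFu);   /* count field is 14 bits wide in this layout */
    return (PKT_TYPE3 << 30) | (count << 16) | (opcode << 8);
}
```

packet3(PKT3_ME_INITIALIZE, 5) yields 0xC0000000 | 0x00050000 | 0x00004400 = 0xc0054400.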
So we can now be sure that an incorrect radeon ring write pointer was read from the register in r600_cp_resume,
just before the call to r600_cp_start:

 rdev->cp.rptr = RREG32(CP_RB_RPTR);
 rdev->cp.wptr = RREG32(CP_RB_WPTR);

Now from the assembly, the value of the write pointer is stored in EAX at the time of the panic:

Taking a few i...


Just to avoid confusion: rdev->cp.wptr is unsigned, so by "negative" I mean that it was ~0, i.e. 0xffffffffU, when read in r600_cp_resume.
The increment (add $1, %eax) executed before the faulting store and wrapped it back to 0, which is why EAX, the ring buffer write pointer index, shows as 0 in the Oops. Since the increment happened before the fault, the value on resume had to be ~0 or 0xffffffffU, which is an invalid write pointer for the radeon ring buffer.
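The wrap-around can be reproduced with plain 32-bit arithmetic. This sketch mirrors the compiled sequence (scale the index by 4, add the ring base, post-increment the index). The ring base 0xfa502000 is not in the logs; it is inferred here as the faulting address plus 4, so treat it as an assumption. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct ring_store {
    uint32_t addr;      /* 32-bit address the store targets (EDX in the Oops) */
    uint32_t next_wptr; /* index after the post-increment (EAX in the Oops) */
};

/* Model ring[wptr++] = v on a 32-bit kernel, where the ring holds dwords. */
struct ring_store ring_store_target(uint32_t ring_base, uint32_t wptr)
{
    struct ring_store s;

    assert((ring_base & 3u) == 0);    /* dword-aligned base */
    s.addr = ring_base + wptr * 4u;   /* unsigned math wraps modulo 2^32 */
    s.next_wptr = wptr + 1u;          /* 0xffffffff + 1 wraps to 0 */
    return s;
}
```

With ring_base = 0xfa502000 and wptr = 0xffffffff the store lands at 0xfa501ffc, one dword below the start of the ring, and the incremented index is 0, matching EDX and EAX in the Oops.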

So we retry until we get sane values for the read and write pointers of the radeon ring buffer on resume in r600_cp_resume. I have a strong hunch that this will fix the Oops, as it looks like a timing issue with reading registers on resume (maybe the device isn't ready yet when resume tries to fetch the read and write indexes).

Mynk (mr-mynk) wrote :

Adding the objdump referred to in the update. I will try the suggestions and report back.

Mynk (mr-mynk) wrote :

Thanks Karthick,

Your suggestion works -

T400:~/linux-2.6.38/drivers/gpu/drm/radeon# diff r600.c.orig r600.c
2221a2222,2234
> /*
> * Re-read the read and write if the value returned isn't sane. before calling r600_cp_start
> */
> do {
> rdev->cp.rptr = RREG32(CP_RB_RPTR);
> mdelay(15);
> } while((int)rdev->cp.rptr < 0);
>
> do {
> rdev->cp.wptr = RREG32(CP_RB_WPTR);
> mdelay(15);
> } while( (int)rdev->cp.wptr < 0 );
>

If I boot without this fix I run into the Oops. With the patched module it works fine.

If a patch would help, kindly let me know what steps to use to produce it (which directory to run diff from) and I can upload it.

I hope this fix is reviewed and makes it into the next release.

Thanks,
Mayank

Great news!
Attach your r600.c and I can make a proper patch, which you can retest before we pitch
for it to be included in the next release.

Mynk (mr-mynk) wrote :

Attaching the r600.c as requested.

Herton R. Krzesinski (herton) wrote :

a-r-karthick: thanks for your analysis.

This invalid register read seems very similar to one that was already worked around in the r100_cp_init function. Take a look at this change:

commit 9e5786bd14cb9ffe29ebe66d41cedf03311b0d30
Author: Dave Airlie <email address hidden>
Date: Wed Mar 31 13:38:56 2010 +1000

    drm/radeon/kms: add sanity check to wptr.

    If we resume in a bad way, we'll get 0xffffffff in wptr, and then
    oops with no console. This just adds a sanity check so that we can
    avoid the oops and hopefully get more details out of people's systems.

    Signed-off-by: Dave Airlie <email address hidden>

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 138ddd4..c8f4b03 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -744,6 +744,8 @@ int r100_cp_init(struct radeon_device *rdev, unsigned ring_size)
        udelay(10);
        rdev->cp.rptr = RREG32(RADEON_CP_RB_RPTR);
        rdev->cp.wptr = RREG32(RADEON_CP_RB_WPTR);
+ /* protect against crazy HW on resume */
+ rdev->cp.wptr &= rdev->cp.ptr_mask;
        /* Set cp mode to bus mastering & enable cp*/
        WREG32(RADEON_CP_CSQ_MODE,
               REG_SET(RADEON_INDIRECT2_START, indirect2_start) |

a-r-karthick: Can you raise this issue upstream, on the <email address hidden> mailing list? You can test and send a similar patch for review, also using ptr_mask.

Herton R. Krzesinski (herton) wrote :

I mean, r600_cp_resume also seems to need the same workaround that is already present in r100_cp_init.

Herton R. Krzesinski (herton) wrote :

And this isn't a regression: the update from 2.6.38-8.42 doesn't touch this area. The issue seems to happen by "luck"; some code shuffle or something else (timing?) may have made it more likely to happen on the newer kernel.

@Herton: Good point regarding the similar fix in r100.c, which I wasn't aware of.
I didn't reproduce this myself; @mynk (reporter and buddy) did, as I don't have any hardware with a radeon chipset :)

I debugged this with the objdump disassembly and the Oops information. The result seems to exactly match the fix and the comment in r100.c that corrects the write index.

Masking the write pointer with the ring buffer ptr_mask, as in r100.c, makes sense on resume even if the result doesn't match the exact write index (it would write to the last slot before rolling back to 0 in the ring buffer), whereas I was trying to re-read with a delay. The fact that the retry with mdelay worked for him on resume implies that a re-read was indeed fetching the right values.

But my patch was causing a boot-time lockup, because it did the same thing for the ring buffer read index, and it seems that the read index returned by the hardware is always uninitialized, or ~0U, during boot.
So maybe a mask for the read index makes sense too, but the fact that it's acceptable to issue a read to the iommu at an invalid offset before it's corrected in the next read pass, which patches it with the ring ptr_mask, suggests that it's not a deal breaker.

So let's play it safe and use the same fix in r600.c as in r100.c:

@Mayank: Please revert the last patch and apply the patch below on top of your original unpatched r600.c.
(also attached)

@Herton: Once Mayank retests with the patch and confirms that it works, I can push for it upstream, quoting this bug as the reference. I am surprised that the bug still exists in 3.0 as well. And I don't believe it is a regression. It could be that we have simply been lucky with resume, since this is related to the hardware returning an invalid index on resume.

--- r600_orig.c 2011-08-05 13:39:25.833427436 -0700
+++ r600.c 2011-08-05 14:50:32.037670946 -0700
@@ -2218,6 +2218,8 @@

    rdev->cp.rptr = RREG32(CP_RB_RPTR);
    rdev->cp.wptr = RREG32(CP_RB_WPTR);
+ /* protect against crazy HW on resume */
+ rdev->cp.wptr &= rdev->cp.ptr_mask;

    r600_cp_start(rdev);
    rdev->cp.ready = true;
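To see what the one-line mask does to a bogus read, here is a small sketch. It assumes ptr_mask is the ring size in dwords minus one (power-of-two ring); the helper names and the ring size used in the note are hypothetical, not taken from this machine.

```c
#include <assert.h>
#include <stdint.h>

/* ptr_mask for a ring of ring_dwords entries (assumed to be a power of two). */
uint32_t ring_ptr_mask(uint32_t ring_dwords)
{
    assert(ring_dwords != 0 && (ring_dwords & (ring_dwords - 1u)) == 0);
    return ring_dwords - 1u;
}

/* Same operation as the patch: clamp a raw register read into the ring. */
uint32_t sanitize_wptr(uint32_t raw_wptr, uint32_t ptr_mask)
{
    return raw_wptr & ptr_mask;
}
```

For a ring of 0x40000 dwords, a raw read of 0xffffffff becomes 0x3ffff, which is inside the ring but is simply the last slot, not necessarily where the hardware pointer really is. A sane read such as 0x10 passes through unchanged.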

Herton R. Krzesinski (herton) wrote :

@a-r-karthick: yes, once it has been tested please submit it upstream, thank you.

Mynk (mr-mynk) wrote :

@a-r-karthick: the patch works fine. I saw some errors in r600 that you might want to check. Attaching the log.

@Mynk: Thanks for your efforts in testing the patch. I know you spent sleepless nights getting this verified amid other preemptions.
Regarding the error log on resume, it's a different issue altogether. It has to do with the fact that radeon_pcie_gart_enable never really happened. (The GART is the graphics address remapping table, an iommu for PCI Express devices.)
So on device startup, GPU acceleration was disabled for your hardware, which also releases the gart table (radeon_gart_table_vram_free is invoked on gart finalize, which frees the vram object).

Aug 8 05:34:29 mayankr-T400 kernel: [ 10.481258] [drm:radeon_ring_write] *ERROR* radeon: writting more dword to ring than expected !
Aug 8 05:34:29 mayankr-T400 kernel: [ 10.626140] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xFFFFFFFF)
Aug 8 05:34:29 mayankr-T400 kernel: [ 10.626146] radeon 0000:01:00.0: disabling GPU acceleration

However, the radeon driver doesn't record this when it disables GPU acceleration in r600_startup (no flags are set).
So during resume, when it tries to re-enable the gart table, it finds an empty vram object in r600_pcie_gart_enable.
It then fails the resume, but since your ring and GPU were initialized anyway and suspend doesn't touch them, you were not impacted and were just left with a "Resume failed" message.
I think this has to do with the fact that your hardware returns invalid (~0U) values for the ring buffer write index during initialization. We now AND that value with the ring buffer ptr_mask, which effectively leaves us with a single slot of write space at the tail, or reduced ring buffer write space, for the GPU acceleration feature to be enabled.
So I guess our quirk for your hardware lets you _live_ with broken/crazy hardware :)
Otherwise you would be Oopsing as before. If you want to enable GPU acceleration, we could retry a finite number of times on receiving invalid ring buffer write index values, as in my original patch, with the expectation that a subsequent retry works. But I guess it's not worth it, and it makes sense to live without GPU acceleration for your seemingly broken graphics chipset.
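If someone does want to try the retry idea, a bounded version would avoid the boot-time lockup seen with the original unbounded loops. This is a rough sketch, not driver code: read_reg stands in for RREG32(CP_RB_WPTR), fake_read is a test stub, and all names are hypothetical. On giving up it falls back to 0 so that the ring is treated as empty, which is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Read the write pointer up to max_tries times and accept the first value
 * that lies inside the ring. Returns true if a sane value was read; on
 * failure *out is set to 0 and false is returned.
 */
bool read_wptr_bounded(reg_read_fn read_reg, void *ctx, uint32_t ptr_mask,
                       unsigned max_tries, uint32_t *out)
{
    assert(read_reg != NULL && out != NULL);

    for (unsigned i = 0; i < max_tries; i++) {
        uint32_t v = read_reg(ctx);
        if (v <= ptr_mask) {
            *out = v;
            return true;
        }
        /* A real driver would wait here (e.g. mdelay) before re-reading. */
    }
    *out = 0;
    return false;
}

/* Stub register: returns ~0 for the first bad_reads calls, then good. */
struct fake_reg {
    unsigned bad_reads;
    uint32_t good;
};

uint32_t fake_read(void *ctx)
{
    struct fake_reg *r = ctx;

    if (r->bad_reads > 0) {
        r->bad_reads--;
        return 0xffffffffu;
    }
    return r->good;
}
```

With a stub that misbehaves twice and a limit of five tries, the third read is accepted; with a stub that never recovers, the function gives up after five reads instead of hanging.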

I also believe that we could set a flag in rdev->flags, e.g. rdev->flags |= RADEON_IS_GART_DISABLED, when r600_startup disables the gart and frees the gart vram object, then check this flag when pcie_gart_enable fails on resume, clear it with rdev->flags &= ~RADEON_IS_GART_DISABLED, and continue with the resume instead of failing it.

But I don't think it's a big deal, and we can treat it as benign for now for the reasons mentioned above.

So to cut it short, let's now pull the trigger and push the patch upstream :)

Brad Figg (brad-figg) on 2011-08-15
Changed in linux (Ubuntu):
status: New → Confirmed
Brad Figg (brad-figg) wrote :

@a-r-karthick,

I don't see this patch in Linus' tree. Has this been submitted upstream?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete

@brad-figg

Yes, we did mail lkml and dri-devel. They acked it back then and proposed to resolve it by just resetting the ring buffer write index to 0 on resume, to be safe.

But I am not sure it was submitted upstream, as they had asked for confirmation. I am sure the patch would have worked, but I only debugged the issue; it was reproduced by @mynk on his hardware, as it is specific to his setup. Since I didn't have access to the hardware and couldn't reproduce it locally, I debugged it with the objdump alone. Maybe you should pitch for it upstream if it hasn't been merged.

Here is the mail that was sent back in response to our submission:

From c564bc8e6d449216d74ee134d5bf470221f79e8d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michel=20D=C3=A4nzer?= <email address hidden>
Date: Thu, 8 Sep 2011 11:09:39 +0200
Subject: [PATCH] drm/radeon: Don't read from CP ring write pointer registers.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

"
The patch below is what I had in mind. Does this fix the problem above?
"

Apparently this doesn't always work reliably, e.g. at resume time.

Just initialize to 0, so the ring is considered empty.

Tested with hibernation on Sumo and Cayman cards.

Should fix https://bugs.launchpad.net/ubuntu/+source/linux/+bug/820746/ .

Signed-off-by: Michel Dänzer <email address hidden>
---
 drivers/gpu/drm/radeon/evergreen.c | 4 ++--
 drivers/gpu/drm/radeon/ni.c | 12 ++++++------
 drivers/gpu/drm/radeon/r100.c | 6 ++----
 drivers/gpu/drm/radeon/r600.c | 4 ++--
 4 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index 15bd047..f2bd90a 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -1378,7 +1378,8 @@ int evergreen_cp_resume(struct radeon_device *rdev)
       /* Initialize the ring buffer's read and write pointers */
       WREG32(CP_RB_CNTL, tmp | RB_RPTR_WR_ENA);
       WREG32(CP_RB_RPTR_WR, 0);
- WREG32(CP_RB_WPTR, 0);
+ rdev->cp.wptr = 0;
+ WREG32(CP_RB_WPTR, rdev->cp.wptr);

       /* set the wb address wether it's enabled or not */
       WREG32(CP_RB_RPTR_ADDR,
@@ -1403,7 +1404,6 @@ int evergreen_cp_resume(struct radeon_device *rdev)
       WREG32(CP_DEBUG, (1 << 27) | (1 << 28));

       rdev->cp.rptr = RREG32(CP_RB_RPTR);
- rdev->cp.wptr = RREG32(CP_RB_WPTR);

       evergreen_cp_start(rdev);
       rdev->cp.ready = true;
diff --git a/drivers/gpu/drm/radeon/ni.c b/drivers/gpu/drm/radeon/ni.c
index 559dbd4..e3489ee 100644
--- a/drivers/gpu/drm/radeon/ni.c
+++ b/drivers/gpu/drm/radeon/ni.c
@@ -1182,7 +1182,8 @@ int cayman_cp_resume(struct radeon_device *rdev)

       /* Initialize the ring buffer's read and write pointers */
       WREG32(CP_RB0_CNTL, tmp | RB_RPTR_WR_ENA);
- WREG32(CP_RB0_WPTR, 0);
+ rdev->cp.wptr = 0;
+ WREG32(CP_RB0_WPTR, rdev->cp.wptr);

       /* set the wb address wether it's enabled or not */
       WREG32(CP_RB0_RPTR_ADDR, (rdev->wb.gpu_addr + RADEON_WB_CP_RPTR_OFFSET) & 0xFFFFFFFC);
@@ -1202,7 +1203,6 @@ int cayman_cp_resume(struct radeon_device *rdev)
       WREG32(CP_RB0_BASE...


tags: added: patch
Brad Figg (brad-figg) wrote :

@a-r-karthick,

I'd be happy to build some kernels with that patch applied if someone is willing to test them. I'll do it first thing tomorrow.

Mynk (mr-mynk) wrote :

I have changed my setup, and I don't remember seeing this issue on the linux 3.0.x kernel. I will need to check whether I can reproduce the problem. If I can, I am willing to test the patched kernels.

Brad Figg (brad-figg) wrote :

@a-r-karthick,

The patch seems to be corrupt when I try to apply it. Can you attach the patch email as you received it to this bug, and I'll see if I can apply that. Thanks.

Patch for this bug that also addresses other chipsets affected on resume.

Brad Figg (brad-figg) wrote :

This patch was picked up as part of an upstream stable release. This commit first appeared in 3.0.0-13.21.

Changed in linux (Ubuntu):
status: Incomplete → Fix Released