Comment 59 for bug 91966

Jamie Lokier (jamie-shareable) wrote :

Found it!
I'll put y'all out of your misery. The workaround is:

    Edit /etc/default/acpi-support, to comment out "POST_VIDEO=true".

This is hardly a general solution for Gutsy (then again, acpi-support seems to be a seething mass of hacks upon hacks already, it would seem to fit in somewhere...).
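
If you'd rather script the edit than do it by hand, something like the following should work. It's only a sketch: it assumes the setting appears at the start of a line exactly as POST_VIDEO=true, and the function name comment_out_post_video is made up for illustration (it's not part of acpi-support). It takes the file path as an argument, so you can try it on a copy first.

```shell
#!/bin/sh
# Sketch: comment out POST_VIDEO=true in an acpi-support style config file.
# comment_out_post_video is a hypothetical helper name, not part of acpi-support.
# Assumption: the setting sits at the start of a line, spelled POST_VIDEO=true.
comment_out_post_video() {
    conf="$1"
    # Keep a backup next to the file, then rewrite the file with the
    # matching line prefixed by '#'. All other lines pass through unchanged.
    cp "$conf" "$conf.bak" &&
    sed 's/^POST_VIDEO=true/#POST_VIDEO=true/' "$conf.bak" > "$conf"
}

# Usage (as root, on the real file):
#   comment_out_post_video /etc/default/acpi-support
```

The original is kept as a .bak file beside it, so the change is easy to undo.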

Basically, it's not suspend/resume which causes the page table corruption, it's the "vbetool post" command. You can cause identical effects without suspending, using this command from inside a terminal window:

    sudo sh -c 'chvt 1; sleep 2; vbetool post; vbetool vbestate restore </var/lib/acpi-support/vbestate; sleep 2; chvt 7'

If you run that command after booting and starting X, from a terminal window, then when it finally returns to showing X, the screen artifact will be present, and it may be causing more serious random memory corruption behind the scenes (which hasn't been noticed so far).

If you don't run that command during suspend/resume (by editing /etc/default/acpi-support), then there's no screen artifact or page table corruption.

The other vbetool commands ("vbetool vbemode get" and "vbetool vbestate restore") don't seem to affect the page table *in my tests*. The vbestate restore could in principle clobber the page table; my tests are no guarantee that it doesn't.

vbetool vbestate restore is still required, unfortunately, by ACPI resume, to restore the text console. It does *not* seem to be required for X to resume; only the text console fails to display if you disable this command.

Perhaps the cleanest fix is either (1) to have "vbetool post" save and restore the page table around POST, or (2) for the kernel to implement memory protection when executing VBE calls and detect such changes and update its own structures. Is the page table a generic AGP page table, independent of the video driver, which the kernel tracks?

Now that I have a fix which works on my laptop, I'm happy. (Just comment out POST_VIDEO in /etc/default/acpi-support). Still, it will be nice when the bug is fixed in the general distribution, without a laptop-specific workaround.

Peter Clifton wrote:
> [snip] what is the reason for the corrupt page tables?
> Urm... pass on that one ;)
> Jamie? Does the page table look the same before suspend? (Or when the bug is not showing?)

At first I thought I saw the screen corruption after boot, without suspending, but when I tried it today I was surprised to see no corruption after a fresh boot (and power cycle), and then it appeared after suspend/resume... I guess I was wrong about it appearing straight away, then.

(The reason I didn't respond earlier was because I didn't reboot my laptop for a couple of weeks, as I had work on it I didn't want to lose track of. It's nice to see that suspend/resume are reliable enough to use them several times a day for a couple of weeks now! Earlier versions of this driver often crashed during suspend, or during switch to console. It seems reliable now.)

Then I suspected one of the steps in the ACPI process, and looked in the scripts in /etc/acpi/{suspend,resume}.d for clues. Then I systematically did some tests with your intel_reg_dumper, and here's what I found....

Attached is an archive containing 11 dumps from your modified intel_reg_dumper, including the page tables. The dumps were taken in sequence, following a single boot. The boot was done after a power cycle, removing the battery and mains, to be sure of the state.

The X driver is Gutsy's current, xserver-xorg-video-intel version 2:2.1.1-0ubuntu9.

intel_reg_dump_01_after_boot: After a power cycle (removed battery and mains), booting through Ubuntu's splash screen and eventually to running X, logging in, and switching to dual head with TMDS-1 = 1680x1050+0+0 and LVDS = 1280x800+400+1050. This will show the artifact, and also demonstrates nicely that suspend/resume works fine with dual head.

intel_reg_dump_02_console_and_back: Switched to a text console, and back to X on console 7, then dumped registers again. Some changes, for the curious. I wanted to get closer to the state which is restored by suspend/resume, as that always switches to the console and back.

intel_reg_dump_03_vberestore_and_back: Switched to console 1, ran "vbetool vbestate restore </var/lib/acpi-support/vbestate", then switched back to X, then did this dump. This restores the console VBE state which was saved during the boot sequence. Suspend/resume uses this same state during resume (it doesn't store the state during suspend, it always uses the boot time state). There's no change to the page tables, but there are some curious register changes which are hopefully harmless. (The TV control knobs are all zeroed, which might be related to TV suspend/resume bugs if anyone's interested.)

intel_reg_dump_04_suspend_resume: Did ACPI suspend/resume started by gnome-power-manager, then did this dump after the resume. The console before and after was the X console. This ran through all the scripts in /etc/acpi/suspend.d and /etc/acpi/resume.d as usual. All ACPI scripts are as distributed in Gutsy, but I changed /etc/default/acpi-support to comment out (disabling) SAVE_VBE_STATE and POST_VIDEO. That means "vbetool post" and "vbetool vbestate restore" were *not* run, and the only video-related thing done by the ACPI scripts was to change to a text console, and do "vbetool dpms off/on", which seems to be harmless.

intel_reg_dump_05_console_and_back: Switched to a text console, and back to X, then did this dump.

intel_reg_dump_06_console_vberestore_and_back: Switched to the console, did "vbetool post; vbetool vbestate restore </var/lib/acpi-support/vbestate", then switched back to X, then did this dump. The restored state was whatever was saved at boot time (suspend doesn't save the state). This restores the text console, otherwise it doesn't show after resume. X does show after resume without this.

intel_reg_dump_07_console: Switched to a text console, then did this dump *while at the text console*. All earlier dumps were inside X. You can see how page table entries are freed when switching to the console, by pointing them to the "scratch" page at 0x837814001 (the actual value varies from boot to boot). Alas, I didn't think to get a register dump of the console before vberestore (when the text console doesn't show).

intel_reg_dump_08_console_vbepost_vberestore: At the text console, did "vbetool post" then "vbetool vbestate restore </var/lib/acpi-support/vbestate", then did this dump. Two things are clear from this: 1) "vbetool post" doesn't do anything interesting to the Intel registers that isn't already restored by "vbetool vbestate restore", so it's not required at least on my laptop; 2) "vbetool post", by itself, corrupts the page table, storing 0x83ffbe001 in spurious places (causing the visible artifact), and in *many* but *not all* places which previously contained the "scratch" page, 0x837814001.
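
A quick way to see the spread of a particular value through a dump is to count the lines that mention it. The sketch below assumes the dumps are plain text with the values written in hex, one entry per line; I'm not asserting that's the exact layout of the attached files. count_value is a made-up helper name.

```shell
#!/bin/sh
# Sketch: count how many lines of a text dump mention a given hex value.
# count_value is a hypothetical helper; the dump layout is an assumption.
# Note this counts matching lines, not individual occurrences on a line.
count_value() {
    # -F: fixed string, -i: ignore hex digit case, -c: print count of lines.
    # grep exits non-zero when nothing matches (it still prints 0), so the
    # trailing "|| true" keeps the function from reporting failure then.
    grep -c -i -F -- "$2" "$1" || true
}

# Usage: compare how often the scratch page and the spurious value appear
#   count_value intel_reg_dump_07_console 0x837814001
#   count_value intel_reg_dump_08_console_vbepost_vberestore 0x83ffbe001
```

Running it against dumps 07 and 08 for both values would show how many "scratch" entries were overwritten, if the layout assumption holds.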

intel_reg_dump_09_console_vberestore: At the text console, did "vbetool vbestate restore </var/lib/acpi-support/vbestate" and did this dump, just to see if there were changes. There were, but nothing significant.

intel_reg_dump_10_console_vberestore: Same as previous.

intel_reg_dump_11_back_to_X: Switched back to the X console, then did this dump. The difference between dumps 06 and 11 shows no important register differences (I think), but the page table corruption caused by "vbetool post" is clear. The artifact (part row duplicated between two parts of the screen) is now visible on screen, and wasn't before.
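
To compare any two of the dumps, a plain diff filtered down to just the changed lines is usually enough. This is a minimal sketch assuming the dumps are text files with one register or page table entry per line; changed_lines is a made-up helper name, and the file names in the usage comment are the ones from the list above.

```shell
#!/bin/sh
# Sketch: show only the lines that differ between two text dumps.
# changed_lines is a hypothetical helper name.
changed_lines() {
    # diff exits 1 when the files differ, which is the expected case here;
    # grep keeps only the "<" (first file) and ">" (second file) lines.
    # "|| true" stops identical files from counting as an error.
    diff "$1" "$2" | grep '^[<>]' || true
}

# Usage:
#   changed_lines intel_reg_dump_06_console_vberestore_and_back intel_reg_dump_11_back_to_X
```

Lines starting with "<" come from the first dump and ">" from the second, so an entry that changed appears as a pair.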