ami-6836dc01 8.04 32 bit AMI kernel lock bug

Bug #705562 reported by pwolanin on 2011-01-20
This bug affects 2 people
Affects: linux (Ubuntu) | Importance: Medium | Assigned to: Unassigned
Affects: linux (Ubuntu Hardy) | Importance: Undecided | Assigned to: Stefan Bader

Bug Description

SRU Justification:

Impact: On i386, PGDs are stored in a linked list. For this, two elements of struct page are (mis-)used: to get a backwards pointer, the private field is assigned a pointer to the index field of the previous struct page. The main problem was that list_add and list_del operations were accidentally done twice, which leads to accesses to innocent struct pages (after the first list operation).

Fix: This is a bit more than needed to fix the bug itself, but it brings our code closer to the shape of upstream (strictly speaking there is only a 2.6.18 upstream, but that code did not do the double list access).

Testcase: Running a 32bit domU (on a 64bit Hardy dom0, though that should not matter) with the xen kernel and doing a lot of process starts (as the aslr qa regression test does) would quite soon crash, because the destructor of a PTE (which incidentally is stored in index) was suddenly overwritten.

---

For months we have been working around a bug in ami-6836dc01, but
this does not seem to be reported anywhere. Is this a known issue?

When we use ruby/puppet (from the Canonical repo) on an instance with
this AMI (e.g. a c1.medium), or in some cases when using Java
applications, the instance gets locked up.

Our work-around is using kernel 2.6.27-22-xen instead - the person
who created the fixed AMI used this method:

- launch instance of ami-7e28ca17 (instance #1)
- modprobe loop on instance #1
- copy up creds, jdk and ec2-ami-tools to /dev/shm on instance #1
- launch instance of ami-69d73000
  (canonical-beta-us/ubuntu-intrepid-beta2-20090226-i386.manifest.xml)
  to grab kernel modules from (instance #2)
- tar.gz /lib/modules/2.6.27-22-xen on instance #2
- scp to instance #1 and untar in /lib/modules
- rm -rf the old /lib/modules/2.6.24-10-xen dir on instance #1
- edit quick-bundle script on instance #1 to hard-code AKI to
  aki-20c12649, ARI to ari-21c12648 (the AKI and ARI from instance #2)
- hard-code manifest name, bucket to whatever
- run pre-clean script on instance #1
- run quick-bundle script on instance #1

The console output from a locked instance is attached.

pwolanin (pwolanin) wrote :
Scott Moser (smoser) on 2011-03-17
affects: ubuntu → linux-meta (Ubuntu)
Changed in linux-meta (Ubuntu):
importance: Undecided → Medium
Scott Moser (smoser) wrote :

Hi,
  Thank you for taking the time to open a bug report.
  Just for clarity, the issue here is with:
     us-east-1 ami-7e28ca17 canonical ubuntu-hardy-8.04-i386-server-20091130
  And the bug opener has found that they can fix the issue by using the jaunty kernels, which were never officially released by Canonical. (aki-20c12649 canonical-beta-us/vmlinuz-2.6.27-22-xen-i386-us.manifest.xml).

The ami itself for hardy is not the newest release, but the kernel it uses is the newest released kernel for hardy (ubuntu-kernels-us/ubuntu-hardy-i386-linux-image-2.6.24-10-xen-v-2.6.24-10-kernel.img.manifest.xml).

To help us debug the issue, could you please:
a.) give any information you have on how you can reproduce this issue
b.) try to reproduce on one of the hardy daily build AMIs. The latest daily build image of hardy uses a substantially newer kernel. I would suggest trying to reproduce on:
  us-east-1 ami-22cc3e4b canonical ubuntu-hardy-daily-i386-server-20110314
which uses kernel:
  us-east-1 aki-7e15e617 canonical ubuntu-hardy-i386-linux-image-2.6.24-28-xen-v-2.6.24-28.86-kernel

Scott Moser (smoser) wrote :

Just for reference, there is information in [private] bug 730765 relevant to this bug.

Scott Moser (smoser) wrote :

This represents the following 3 patches (attached to bug 730765), but applied
to the hardy kernel:
  http://lkml.org/lkml/2010/9/18/288
  http://lkml.org/lkml/2010/9/18/282
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3588fe2e3f36543664beafedd3bb6dc3ffa896c5

Note that the third patch stacks over the change made in the first.

It might be useful to just build a kernel with these applied and test.

Joseph Salisbury (jsalisbury) wrote :

@pwolanin

I am also available to assist with this bug. I will touch base with Scott on the current status. Just let me know if you have any questions in the meantime.

pwolanin (pwolanin) wrote :

Thanks - looking forward to being able to test a fix.

Joseph Salisbury (jsalisbury) wrote :

@pwolanin

I'm going to build a kernel with the patches listed in comment #4 and test it on EC2. Does the instance need to be c1.medium, or can it be a t1.micro instance?

pwolanin (pwolanin) wrote :

We've seen the problem most consistently on c1.medium, so I'd suggest using that as a test bed.

Joseph Salisbury (jsalisbury) wrote :

@pwolanin

Just curious, do you have a specific requirement to run Hardy vs. Lucid? Is Hardy a requirement for your applications?

pwolanin (pwolanin) wrote :

We began using 8.04 due to the LTS commitment, and before 10.04 was out. We have several hundred running instances now.

We expect to start moving some instances to 10.04 within a couple months, but there are issues like PHP version (5.2 vs. 5.3) that complicate that move and/or require us to build additional packages.

Joseph Salisbury (jsalisbury) wrote :

@pwolanin, I'm still in the process of building a Hardy kernel for EC2 that contains the three patches. I ran into an issue with the second of the three patches. I'll let you know once I have a patched kernel.

Joseph Salisbury (jsalisbury) wrote :

I looked some more at why the second patch was failing. The patches won't apply to the Hardy kernel without rework; they were pulled from a future release of the kernel, and this part of the kernel has changed quite a bit since Hardy.

pwolanin (pwolanin) wrote :

Will the basic method of the patch work (just need to find the correct lines to change), or are additional supporting changes from later kernels needed too?

Joseph Salisbury (jsalisbury) wrote :

Unfortunately, the basic method of applying these patches will not work against Hardy. As you mention, additional changes are required, since the patches were pulled from upstream git commits that are well after Hardy.

While the latest code has only one place that actually registers the interrupt, there are several places in the old code, which would need to be addressed.

Stefan Bader (smb) wrote :

As mentioned before, the Hardy code is quite different from upstream. Even more so because both the xen kernels we build as custom-binaries and the package provided by the server-team are not built from the code as it is in the git tree, but from code that has additional patches applied (for our xen kernels those are in the git tree under debian/binary-custom.d/xen/patchset/...). That modifies the code in such a way that the files which seem xen related in the unmodified tree are actually not used at all.

So after a bit of rework, I came up with the following patch, which seems to make the interrupt handling use an edge-triggered model; the resulting kernels boot as dom0 and domU. I am not really sure this is already enough to solve the problem, but the whole code is old enough that some issues may have been introduced later. So I would appreciate it if someone could try it and tell me how it compares to the old kernel.

I have kernel packages prepared (http://people.canonical.com/~smb/lp705562/). If there is something missing, please let me know. Oh, as I took the kernel version currently in proposed as a base, I added some lrm and lum packages. Those are basically what currently is in proposed. I have made no changes there.

Stefan Bader (smb) wrote :

Actually adding the patch now. The other version was the patch against Hardy git to get the patch in. But patches of patches are a bit confusing.

Tim Gardner (timg-tpi) on 2011-03-30
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
status: New → In Progress
Changed in linux (Ubuntu Hardy):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
status: New → In Progress
Changed in linux (Ubuntu):
status: In Progress → Invalid
assignee: Stefan Bader (stefan-bader-canonical) → nobody
Jim Salem (jcsalem) wrote :

I'm really glad to see a patch was created. However, we really need this packaged up in a 32-bit Amazon AMI (or at least a kernel build that we can use to create our own AMI).

We could probably build our own kernel but without knowing how the Amazon AMI was built I'm concerned we'd build it incorrectly.

Any ETA on an AMI?

Stefan Bader (smb) wrote :

Actually it is already possible to get things up, but I have to admit that there was no good description of how. So what you have to do is take a current daily AMI build

http://uec-images.ubuntu.com/hardy/current/

and boot it using --kernel with a matching pv-grub AKI as described here (this is about 10.04, but those work with the daily hardy AMIs as well):

https://lists.ubuntu.com/archives/ubuntu-cloud/2010-December/000466.html

Having booted you can fetch the kernel package from

http://people.canonical.com/~smb/lp705562/

install it with dpkg and reboot. At the moment the kernel ABI version in the AMI is 29, which is the same as the test kernels', so you only need the linux-image package. If that changes, you will probably need some of the other packages and will have to check /boot/grub/menu.lst to make sure the right kernel is booted.

I hope this helps to get a testing ground up. When you check cat /proc/interrupts you should see irq-level change into irq-edge with the patched kernel.

I was able to bundle an AMI for testing with this patched kernel. Unfortunately we are still experiencing the same behavior as before with the original kernel. Attached is the console output from one of the servers I reproduced the bug on. This was a c1.medium.

Stefan Bader (smb) wrote :

That does not really look like anything that would be interrupt related at all. Joey, just to be sure: were these the only two oops messages that came up (nothing scrolling off the screen)? Both look rather like they are related to releasing some mmap space, one after forking and the other when closing a file. But I need to have a closer look.

Stefan Bader (smb) wrote :

So the why is clearer, just not the how. The crash happens because, on releasing memory, there are pages with the foreign bit set (meaning they came from a special allocator). The code section in question is special to the xen patch and takes an element of the page structure as a function pointer to the destructor. This pointer (0xc1b19960) is outside the in-kernel addresses (maybe completely wrong) and causes a page fault on the instruction fetch.

Now the "only" thing left is to find out how this happens... Meanwhile, is there some reasonably easy way of triggering this at will?

The output I attached cuts off mid-stream, I believe, because the server hard locks and stops logging anything to the console. I am currently trying to reproduce the output, but the servers keep locking up before spitting out anything useful.

Currently I am able to reproduce the bug by launching one of our managed web servers in production on a c1.medium. A server I can reliably reproduce this on runs:
- apache2
- php5
- ruby 1.8
- puppet
(If you'd like specific package versions I'd be happy to private message them to you, let me know)

There are php/ruby/puppet management scripts running in cron in addition to apache. Within ~5-20 minutes the server will lock up and stop responding to any services.

Stefan Bader (smb) wrote :

I fear just installing those won't help me much as long as nothing is really being done with them. Ok, I hope this will work as intended. The problem is that while it is clear that the pointer used to call the destructor in free_hot_cold_page() is wrong, that does not help much in explaining how it got there.

Looking at the code, there are only about four places that use this foreign page flag: gnttab, netback, pageattr_64-xen and pgtable_32-xen. Now netback should not be used as we are a domU, and pageattr_64 should not matter as the instance is 32bit. Not really sure about gnttab. But there are other places using page->index, and maybe something goes wrong there (though the likely candidate, pgtable_32, only uses the flag for pte pages, and those should not get used otherwise).

So what I did is rename the index element of the page structure and convert all users to access it indirectly through function calls, then add an independent element to additionally store the destructor, and finally a check, whenever index is set, of whether this would compromise the destructor.

I added v2 kernel images to my people page. If things go as I hope, those should emit a warning whenever the destructor seems to get overwritten. It will probably still lead to a crash later, because the page should not have had a destructor at all... Unfortunately I cannot verify whether I did it right, as I cannot trigger the problem myself.

Sorry for not getting back to you sooner on this, the past few days have been a little bumpy operationally.

I was able to package a new ami to test your v2 kernel image with. I have only run one server so far with it, but it has taken a little longer to crash with output. Attached is the entire console output I receive on this server when it crashes.

This is the same environment that I have been testing the other kernel images in. Just some more info about it in case it helps us debug this:

- The server has apache running and is receiving traffic. It primarily serves PHP code which may query an external MySQL server.
- There are PHP cron scripts that run to manage the server
- Periodic Nagios checks are run that check the health/status of the server. These are perl/php/bash
- Puppet runs periodically to check to ensure configuration consistency
- SVN is frequently run checking repositories for updates
- rsync is frequently run
- A network filesystem is mounted using the Gluster client
- No EBS volumes are attached

There is a lot going on, so I'm sure it's going to be a challenge to narrow this down. If there is anything else I can describe about the system that would help us, let me know.

I was able to get another server to produce some more output. Attached is the kern.log I was able to grab before it hard locked.

Stefan Bader (smb) wrote :

*sigh* So instead of hitting the place prepared to catch the corruption seen last time, it just crashes at various other places... Well, not completely true: the first one could be a corrupted list, while the second rather looks like a deadlock.

Apart from that (which I need to think about), the one thing I would not normally have in any testing is the use of a cluster client and a network fs. Maybe you could help me with a quick recipe for setting up a similar environment.

One other thing I just thought of: I am able to reproduce the bug more quickly when load is applied to the server. When running an idle server in development, I noticed it took substantially longer for the bug to surface. When I launch one in production with more load applied, it seems to happen within an hour.

I can send you some of our Gluster packages and configurations so you can get that setup easily. Please let me know a good place to share these with you.

Stefan Bader (smb) wrote :

I completely agree that this has to be related to some workload. I begin to suspect that there is some use-after-free going on in code that is less commonly used; it seems to strike various places and cause odd behavior. I wonder whether the stack trace of the lockup message is sensible. The only code that has a sigd_enq2 would be net/atm (some network protocol, as far as I understand). Would that make sense to you? For sharing configurations you could send them to me (<email address hidden>) directly.

I have sent you the configuration files directly. One thing that is odd about this entire situation: I have not seen this issue on other instance sizes in AWS. The other 32 bit type, m1.small, does not experience the same problem, and none of our 64 bit instances have this issue either; it is confined to the c1.medium type. I'm curious what is so different between the m1.small and c1.medium that would be triggering this issue.

Stefan Bader (smb) wrote :

Yesterday I tried to get some coverage on the glusterfs part. Lacking any good publicly reachable sources, I ran two m1.small instances as servers and a c1.medium as the client. (The main difference between c1.medium and the other 32bit instance types seems to be that it is the only one with 2 vcpus; 64bit instances usually have them too, but there are enough differences in their internals to matter.)

Lacking any better test, I ran bonnie++ (without the byte access tests) over the glusterfs. It seemed slow, but I saw no problems. Still, this could be due to the access pattern being different. I wonder: how hard would it be to experimentally provide the fs as an NFS share? That might help rule out the whole glusterfs part, or show that it likely is the trigger and the usage in my test just was not hitting the right parts.

Besides that, if possible, could I see the lsmod output of a correctly set up instance? I think I saw messages about xfs somewhere, so I also ran the bonnie++ test on a disk reformatted to xfs, still without triggering problems.

It may be possible for us to set up an NFS export to test this; it will take a little bit of time, so I will need to get back to you on that. In the meantime, below is the lsmod output on the instance I have been using for testing. We do use XFS for these file-systems. If you need additional information please let me know.

~# lsmod
Module Size Used by
dm_crypt 16772 0
crypto_blkcipher 20868 1 dm_crypt
sha256_generic 16256 0
iptable_filter 6656 0
ip_tables 15504 1 iptable_filter
x_tables 18692 1 ip_tables
fbcon 43040 0
tileblit 6656 1 fbcon
font 12288 1 fbcon
bitblit 9728 1 fbcon
softcursor 5888 1 bitblit
uvesafb 30436 0
xfs 551336 0
binfmt_misc 12552 0
dm_multipath 20364 0
dm_mod 58824 2 dm_crypt,dm_multipath
scsi_dh 10884 1 dm_multipath
scsi_mod 149964 1 scsi_dh
ipv6 256804 18
af_packet 21376 2
loop 19084 0
fuse 56604 3
ext3 127880 2
jbd 53268 1 ext3

I am able to reproduce this error on a server that is not running GFS at all. Attached is the output of the server when it crashed. The server is running the following services:

-puppet
-mysql
-php management scripts
-nagios checks

It seems that the commonalities between the two servers affected are:

-puppet
-php management scripts
-nagios checks

Stefan Bader (smb) wrote :

The crash itself seems to be back to the initial problem (calling a function pointer which is bad), with a minor variation of hitting a non-execute area. The process affected here is puppetd but again, this could be coincidence.

From what the preceding dmesg output is telling, it seems that two xfs volumes are getting mounted and unmounted. I am not sure why, but the second time this happens, it looks like device-mapper and the xfs module had to be reloaded. And after mounting and unmounting the two volumes, something caused the clocksource to be changed.

As a next experiment, I would play around with loops involving mount/unmount of xfs and maybe also have a look at changing the clocksource around.

Stefan Bader (smb) wrote :

Sorry for getting back a bit late. I have been playing with xfs getting mounted and unmounted, and also with writing to it in between. This did not trigger the bug; neither did removing the xfs module in between.
Not sure this will lead anywhere, but as the dmesg seemed to indicate that at least some modules were loaded again (so they must have been unloaded), maybe you could create a little script like

while true; do
  date
  lsmod
  sleep 60
done >modules.log

to track what modules are loaded at which point in time.

Stefan Bader (smb) wrote :

Maybe I have finally found a reproducer, and maybe also a solution for 8.04 (Hardy). If there is time/still interest, would it be possible to try the kernel I dropped at http://people.canonical.com/~smb/lp705562/?

Stefan, thank you for keeping at this. I will test this and get back to you.

Thanks,
Joey

Tim Gardner (timg-tpi) on 2011-06-22
Changed in linux (Ubuntu Hardy):
status: In Progress → Fix Committed

Stefan, so far everything has been going great in our testing. We are rolling this out on a larger scale this week, I will let you know how things go.

Thank you!

Stefan Bader (smb) on 2011-06-29
description: updated
description: updated
Herton R. Krzesinski (herton) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-hardy' to 'verification-done-hardy'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-hardy
Stefan Bader (smb) wrote :

Running ./test-kernel-aslr-collisions.py -v
Running test: './test-kernel-aslr-collisions.py' distro: 'Ubuntu 8.04' kernel: '
2.6.24-29.91 (Ubuntu 2.6.24-29.91-xen)' arch: 'i386' uid: 1000/1000 SUDO_USER: '')
Build helper tools ... (gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)) ok
...
Check if stack crashes into mmap in 100,000 execs (amd64 only?) (LP: #504164) ...
...
----------------------------------------------------------------------
Ran 5 tests in 983.613s

This would have crashed before. Joey, if you could give feedback for results with the official proposed kernel on your side, that would be great. Thanks.

We have been running this in production for ~2 weeks now and have begun rolling this out on the rest of our servers. It has resolved the problems that we had previously been seeing with random crashes, soft/hard lockups.

Thank you!

Stefan Bader (smb) wrote :

Does "this" mean the 2.6.24-29.91 version or the kernel that I provided on my people page? The -proposed verification tries to get feedback on the kernel for the next pending update, which is different from my test kernel: there are at least some debug statements missing, and I think the final change was different, too.

I have been using the one from your people page. I will attempt to bundle up the proposed one soon and start testing. Will get back to you.

Just to keep things updated:

I've bundled up a new AMI for testing using the proposed kernel (2.6.24-29.91). We have launched a few servers in production using this kernel and will update in a couple days with the status. (So far so good though.)

I can confirm that the proposed kernel does solve the problem. We have been running it for ~36 hours with no problems. In the past the problem has been triggered within 24 hours.

Stefan Bader (smb) wrote :

Thank you Joey. Marking verification done.

tags: added: verification-done-hardy
removed: verification-needed-hardy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.24-29.91

---------------
linux (2.6.24-29.91) hardy-proposed; urgency=low

  [Steve Conklin]

  * Release Tracking Bug
    - LP: #801636

  [Andy Whitcroft]

  * custom binaries need VERSION_SIGNATURE updated during prepare
    - LP: #794698

  [Stefan Bader]

  * (config) Disable COMPAT_VDSO for i386 Xen kernels
    - LP: #794715
  * XEN: Add yield points to blktap and blkback
    - LP: #791212
    - CVE-2010-4247
  * xen: Fix memory corruption caused by double free
    - LP: #705562

  [Upstream Kernel Changes]

  * agp: fix arbitrary kernel memory writes, CVE-2011-2022
    - LP: #788684
    - CVE-2011-2022
  * agp: fix OOM and buffer overflow
    - LP: #791918
    - CVE-2011-1746
  * tty: icount changeover for other main devices, CVE-2010-4076,
    CVE-2010-4077
    - LP: #794034
    - CVE-2010-4077
  * fs/partitions/efi.c: corrupted GUID partition tables can cause kernel
    oops
    - LP: #795418
    - CVE-2011-1577
  * Fix corrupted OSF partition table parsing
    - LP: #796606
    - CVE-2011-1163
  * proc: avoid information leaks to non-privileged processes
    - LP: #799906
    - CVE-2011-0726
  * proc: protect mm start_code/end_code in /proc/pid/stat
    - LP: #799906
    - CVE-2011-0726
  * sctp: Fix a race between ICMP protocol unreachable and connect()
    - LP: #799828
    - CVE-2010-4526
  * xen: blkback, blktap: Fix potential resource leak
    - LP: #800254
 -- Steve Conklin <email address hidden> Fri, 24 Jun 2011 10:59:11 -0500

Changed in linux (Ubuntu Hardy):
status: Fix Committed → Fix Released