DB deadlock on reboot prevents UEC from working, temporarily - 403 Forbidden errors

Bug #444352 reported by Thierry Carrez on 2009-10-06
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Eucalyptus
Invalid
Undecided
chris grzegorczyk
Release Notes for Ubuntu
Undecided
Mathias Gug
eucalyptus (Ubuntu)
Low
Unassigned
Karmic
Low
Dustin Kirkland 

Bug Description

Installed 20091006, works alright.
Rebooted the cloud/cluster controller at 09:46
Started an instance (i-45410877) at 09:48:06
The instance stayed at pending (without an IP address assigned), then terminated.
cloud-output.log shows an apparent DB deadlock condition.

Next one started couldn't get a reservation ("Trying to allocate an address which is already pending")
Next one started worked.

See cloud-output.log that shows the first two tries. Let me know if you need more.

==================
Karmic release notes:

UEC might refuse to serve the first requests received after boot

After a UEC boot the system may not answer correctly, returning for example "403 Forbidden" errors to client requests. This is cause by a database deadlock condition which is automatically cleared after some retries. To workaround this issue you can restart eucalyptus after boot on the cluster controller by running "sudo restart eucalyptus" (444352).

==================

Thierry Carrez (ttx) wrote :
Changed in eucalyptus (Ubuntu):
importance: Undecided → Medium
Thierry Carrez (ttx) wrote :

Dustin: could you run the logs through Dan and/or reproduce-confirm the issue ?

Changed in eucalyptus (Ubuntu):
assignee: nobody → Dustin Kirkland (kirkland)
Dustin Kirkland  (kirkland) wrote :

Thierry-

I can confirm this issue. Note that I only see this problem with our UEC images. I can boot/run the Eucalyptus images okay.

I'll debug this further with Dan tomorrow. In the mean time, I think we need to some investigation on our UEC images in parallel.

Is there another package/project we need to add to this bug to ensure that we get the proper coverage from Scott and Chuck, or whoever is working on the UEC images?

:-Dustin

Changed in eucalyptus (Ubuntu):
status: New → Confirmed
Thierry Carrez (ttx) wrote :

Given the nature of the bug (and how early it shows up in the instance run start), I would be very surprised if it was imagecontent-specific. Maybe imagesize-specific, but even in that case I'd be surprised.

Maybe you tested a UEC image first, and it failed, then tried a eucalyptus one, and it worked ? In my testing, the DB recovered by itself after a couple minutes.

Thierry Carrez (ttx) wrote :

15:09:19 Startup
15:09:53 "Error using or obtaining a database connection, will try till 5>5 more times before reloading."
15:09:59 "Error using or obtaining a database connection, will try till 4>5 more times before reloading."
15:10:05 "Error using or obtaining a database connection, will try till 3>5 more times before reloading."
15:10:07 "APPARENT DEADLOCK!!!"
15:10:10 "APPARENT DEADLOCK!!!"
15:10:21 Running euca-describe-availability-zones verbose fails on 403:
15:10:21 "Error using or obtaining a database connection, will try till 2>5 more times before reloading."
15:10:21 "com.eucalyptus.ws.AuthenticationException: User authentication failed."
15:10:34 Running euca-describe-availability-zones verbose fails on 403:
15:10:34 "Error using or obtaining a database connection, will try till 1>5 more times before reloading."
15:10:34 "com.eucalyptus.ws.AuthenticationException: User authentication failed."
15:10:53 Running euca-describe-availability-zones verbose fails on 403:
15:10:53 "Resetting hard"
15:10:53 "com.eucalyptus.ws.AuthenticationException: User authentication failed."
15:18:57 Running "euca-describe-availability-zones verbose" works

So this seems to be linked to the number of tries rather than a timeout. The DB is wedged on reboot, and it takes 6 tries (2 of which are being done by the system itself) before its working again.

Changed in eucalyptus (Ubuntu):
importance: Medium → High
status: Confirmed → Triaged
summary: - DB deadlock on reboot prevents EMI from being started
+ DB deadlock on reboot prevents UEC from working, temporarily

Can't reproduce it on a stop/start eucalyptus. My hint is that we wedge something when we reboot that we don't wedge on regular stop.

I can reproduce it every time on reboot. It takes 5 DB connections before the system "resets hard". Sometimes the system by itself will do enough connections so that it will be reset by the time you try your command. Otherwise trying your command 5 times in a row is sufficient to make it work.

Look for "DEADLOCK" and "Resetting" in cloud-output.log.

Thierry Carrez (ttx) wrote :

From my limited upstart knowledge, what we do on regular stop that we appear not doing on reboot is... stopping eucalyptus.
However adding "stop on runlevel [!23]" doesn't seem to be sufficient, so maybe it's a red herring.

Thierry Carrez (ttx) wrote :

Lowering priority since its transient.
Probably related to networking not being up before Eucalyptus starts, since apparently it only hits DHCP-based configs

Changed in eucalyptus (Ubuntu Karmic):
assignee: Dustin Kirkland (kirkland) → Thierry Carrez (ttx)
importance: High → Medium
Thierry Carrez (ttx) wrote :

Further testing seems to invalidate the "eucalyptus started before network" theory.

I added a loop to wait for the external IP address to show up before starting up eucalyptus. It does *not* prevent the DEADLOCK issues.

I added a loop to delay eucalyptus startup by 60 seconds arbitrarily, checked that networking and everything else was up even before any of eucalyptus starts: It does *not* prevent the DEADLOCK issues.

Thierry Carrez (ttx) wrote :

More testing:

Trying with static networking setup does *not* prevent DEADLOCK issues.

It's probably not linked to the stop procedure either, since I stop eucalyptus cleanly before rebooting.

Thierry Carrez (ttx) wrote :

Finally I tested disabling eucalyptus start at boot:

(From a running system)
Stopping eucalyptus.
Clearing logs.
Starting eucalyptus manually
Testing: PASS. no deadlock.
Stopping eucalyptus.
Rebooting, waiting for system to be fully up.
Starting eucalyptus manually
Testing: FAIL. deadlock issue.

See resulting logs: one successful startup, one failed startup. They look very much the same we hit the DB error.

Thierry Carrez (ttx) wrote :

So to summarize:
- To trigger the bug, reboot your cluster and try running euca-describe-availability-zones verbose once eucalyptus is ready. It fails, and makes DEADLOCK errors appear in the logs.
- After a few tries, eucalyptus does a "hard reset" on connections and autofixes itself.
- This occurs for me every time, the first time I start eucalyptus after system startup.
- Stopping and restarting eucalyptus on a running system does not trigger the error. It only happens at first eucalyptus startup after a boot.
- This does not depend on the boot order, since I can reproduce by starting up manually after boot is fully complete. It does not depend on static / DHCP networking either, I can reproduce in both cases.
- It does not depend on the way eucalyptus is stopped, since I reproduce it by manually stopping eucalyptus.

It seems like a condition that is only found after a fresh boot is triggering the issue. Something that starting and stopping eucalyptus once will fix. I tried cleaning up /var/run/eucalyptus and /tmp/ and reproduce, but that doesn't come from there. I also tried cleaning up the 169.254.169.254/32 address, but that doesn't trigger it either.

Changed in eucalyptus (Ubuntu Karmic):
assignee: Thierry Carrez (ttx) → nobody
Thierry Carrez (ttx) wrote :

I think we need some confirmation here.
Anyone with a UEC setup, please do:

$ sudo stop eucalyptus
$ sudo rm /var/log/eucalyptus/*
$ sudo reboot
...
# wait for reboot complete and port 8443 to show up on netstat
$ . eucarc
$ euca-describe-availability-zones verbose
$ grep Reset /var/log/eucalyptus/cloud-output.log
$ grep DEAD /var/log/eucalyptus/cloud-output.log
$ sudo stop eucalyptus
$ sudo rm /var/log/eucalyptus/*
$ sudo start eucalyptus
# wait for port 8443 to show up on netstat
$ euca-describe-availability-zones verbose
$ grep Reset /var/log/eucalyptus/cloud-output.log
$ grep DEAD /var/log/eucalyptus/cloud-output.log

Dustin Kirkland  (kirkland) wrote :

Thierry,

I ran all of these.

After reboot, I got a few Reset messages in cloud-output.log, like this:
11:21:56 WARN DatabaseUtil | -> Resetting: z8kfsx84g4e6c2if50wl|6e905272

But after clearing the logs, and warm restart of eucalyptus, port 8443 came up, my availability zones were shown, and there are no Reset or DEAD messages in the logs.

:-Dustin

Dustin Kirkland  (kirkland) wrote :

Thierry asked me to lower the priority, since I'm not reproducing this problem now.

:-Dustin

Changed in eucalyptus (Ubuntu Karmic):
importance: Medium → Low
Nick Barcet (nijaba) wrote :

I ran into the same problem rebooting after and apt-get upgrade. Had to restart the services manually to be able to log in again... Enclosing relevant part of the log.

Thierry Carrez (ttx) wrote :

Too late to nail and fix for release, especially that deep in the code.

Changed in eucalyptus (Ubuntu Karmic):
assignee: nobody → Thierry Carrez (ttx)
status: Triaged → Won't Fix
Thierry Carrez (ttx) wrote :

Proposed Title:
Upon reboot, UEC might refuse to serve the first requests
Text:
On specific UEC systems we observed a database deadlock condition on reboot that prevents a UEC cloud to serve the first 1 to 6 requests it receives, until the deadlock is automatically cleared (444352).

Thierry Carrez (ttx) wrote :

Maybe that's clearer:

Proposed Title:
UEC might refuse to serve the first requests received after boot
Text:
On specific UEC systems we observed a database deadlock condition after boot that prevents a UEC cloud to answer correctly to the first requests it receives, usually triggering "503 Forbidden" answers. After some tries the deadlock condition is automatically cleared. To workaround this issue you can restart eucalyptus after boot on the cluster controller by running "sudo restart eucalyptus" (444352).

Dustin Kirkland  (kirkland) wrote :

Hmm, I'm now experiencing this problem on my cluster, and it particularly sucks. :-(

:-Dustin

Dustin Kirkland  (kirkland) wrote :

After waiting ~5 minutes, it seems to have become unwedged, and working again. Looks like I need to reboot, and walk away for ~10 minutes, get some coffee, and come back.

:-Dustin

Dustin Kirkland  (kirkland) wrote :

Minor change also required to powernap, to test for the writability of /var/cache/powerwake/ethers, before attempting to update the cache. Committed upstream.

Now that I've patched both of these, Cloud Power Management once again works as expected!

:-Dustin

Dustin Kirkland  (kirkland) wrote :

Please disregard that last comment... Posted to the wrong bug. Sorry.

:-Dustin

Changed in ubuntu-release-notes:
assignee: nobody → Dustin Kirkland (kirkland)
assignee: Dustin Kirkland (kirkland) → Thierry Carrez (ttx)
Mathias Gug (mathiaz) on 2009-10-28
Changed in ubuntu-release-notes:
assignee: Thierry Carrez (ttx) → Mathias Gug (mathiaz)
status: New → In Progress
Mathias Gug (mathiaz) on 2009-10-28
description: updated
Changed in ubuntu-release-notes:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2009-10-28
description: updated
Steve Langasek (vorlon) wrote :

Documented at <https://wiki.ubuntu.com/KarmicKoala/ReleaseNotes#UEC%20may%20refuse%20to%20serve%20the%20first%20requests%20received%20after%20startup>:

On startup a UEC system may not process requests correctly, for instance returning "403 Forbidden" errors in response to some client requests. This is caused by a database deadlock condition which is automatically cleared after some retries. To workaround this issue you can restart eucalyptus on the cluster controller by running "sudo restart eucalyptus" after boot. (444352)

Changed in ubuntu-release-notes:
status: Fix Committed → Fix Released
Mathias Gug (mathiaz) on 2009-11-02
summary: - DB deadlock on reboot prevents UEC from working, temporarily
+ DB deadlock on reboot prevents UEC from working, temporarily - 403
+ Forbidden errors
Dustin Kirkland  (kirkland) wrote :

I'm seeing this quite a bit now in my UEC. I'm suggesting that we should probably identify a fix and roll it out in an SRU.

Thierry, if you disagree, feel free to close it won't-fix again.

:-Dustin

Changed in eucalyptus (Ubuntu Karmic):
milestone: none → karmic-updates
status: Won't Fix → Triaged
Mathias Gug (mathiaz) on 2009-11-05
tags: added: uec
Thierry Carrez (ttx) on 2009-11-09
Changed in eucalyptus (Ubuntu Karmic):
assignee: Thierry Carrez (ttx) → Dustin Kirkland (kirkland)
Changed in eucalyptus:
assignee: nobody → chris grzegorczyk (chris-grze)
chris grzegorczyk (chris-grze) wrote :

Is there a way to reproduce this problem?

c

tags: added: iso-testing
Thierry Carrez (ttx) on 2010-01-20
Changed in eucalyptus (Ubuntu Karmic):
status: Triaged → Won't Fix
Changed in eucalyptus:
status: New → Invalid
Dustin Kirkland  (kirkland) wrote :

The DB deadlock in Eucalyptus itself has been fixed, but there is a remaining upstart/init issue, Bug #503180. I'm marking this bug a duplicate of that one, at this point.

Ultimately, eucalyptus-cloud upstart script depends on net, and "completes successfully" before it has an IP address. We will solve this when we have an "ip_address_obtained" event that we can depend upon.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6.2~bzr1136-0ubuntu3

---------------
eucalyptus (1.6.2~bzr1136-0ubuntu3) lucid; urgency=low

  * eucalyptus-cc.eucalyptus-cc-publication.upstart,
    eucalyptus-common.eucalyptus.upstart, eucalyptus-network.upstart,
    eucalyptus-sc.eucalyptus-sc-publication.upstart,
    eucalyptus-walrus.eucalyptus-walrus-publication.upstart, rules: rework
    our eucalyptus starting condition to depend on a new upstart emitted
    signal, eucalyptus-network-is-ready, which is only fired once the
    network interface providing the default route is in fact up, and
    listening on a real ip address, LP: #503180
  * debian/eucalyptus-common.eucalyptus.upstart:
    - don't respawn eucalyptus-cloud
    - ensure that the iptables module gets loaded soon enough, otherwise
      much bad behavior happens in various nasty ways, most notably, a
      wedged database which renders the CLC non responsive on restart/reboot,
      LP: #503180 and dupes, LP: #444352, #444946, #467521, #478573, #480048
 -- Dustin Kirkland <email address hidden> Tue, 02 Feb 2010 17:13:52 -0800

Changed in eucalyptus (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers