Ubuntu
eucalyptus package

DB deadlock on reboot prevents UEC from working, temporarily - 403 Forbidden errors

Bug #444352 reported by Thierry Carrez on 2009-10-06

This bug report is a duplicate of: Bug #503180: [SRU] eucalyptus-cloud doesn't reply to requests (eucalyptus doesn't work after reboot or services restart issues due to upstart networking behavior). Edit Remove

This bug affects 3 people

	Status	Importance	Assigned to	Milestone
Eucalyptus	Invalid	Undecided	chris grzegorczyk
Release Notes for Ubuntu	Fix Released	Undecided	Mathias Gug
eucalyptus (Ubuntu)	Fix Released	Low	Unassigned
Karmic	Won't Fix	Low	Dustin Kirkland 	Ubuntu karmic-updates

Bug Description

Installed 20091006, works alright.
Rebooted the cloud/cluster controller at 09:46
Started an instance (i-45410877) at 09:48:06
The instance stayed at pending (without an IP address assigned), then terminated.
cloud-output.log shows an apparent DB deadlock condition.

Next one started couldn't get a reservation ("Trying to allocate an address which is already pending")
Next one started worked.

See cloud-output.log that shows the first two tries. Let me know if you need more.

==================
Karmic release notes:

UEC might refuse to serve the first requests received after boot

After a UEC boot the system may not answer correctly, returning for example "403 Forbidden" errors to client requests. This is cause by a database deadlock condition which is automatically cleared after some retries. To workaround this issue you can restart eucalyptus after boot on the cluster controller by running "sudo restart eucalyptus" (444352).

==================

See original description

Tags:

Related branches

lp:~ubuntu-core-dev/eucalyptus/ubuntu

lp:ubuntu/lucid/eucalyptus

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-06:

cloud-output.log Edit (191.6 KiB, text/html)

Changed in eucalyptus (Ubuntu):
importance:	Undecided → Medium

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-06:

Dustin: could you run the logs through Dan and/or reproduce-confirm the issue ?

Changed in eucalyptus (Ubuntu):
assignee:	nobody → Dustin Kirkland (kirkland)

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-06:

Thierry-

I can confirm this issue. Note that I only see this problem with our UEC images. I can boot/run the Eucalyptus images okay.

I'll debug this further with Dan tomorrow. In the mean time, I think we need to some investigation on our UEC images in parallel.

Is there another package/project we need to add to this bug to ensure that we get the proper coverage from Scott and Chuck, or whoever is working on the UEC images?

:-Dustin

Changed in eucalyptus (Ubuntu):
status:	New → Confirmed

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-07:

Given the nature of the bug (and how early it shows up in the instance run start), I would be very surprised if it was imagecontent-specific. Maybe imagesize-specific, but even in that case I'd be surprised.

Maybe you tested a UEC image first, and it failed, then tried a eucalyptus one, and it worked ? In my testing, the DB recovered by itself after a couple minutes.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-08:

15:09:19 Startup
15:09:53 "Error using or obtaining a database connection, will try till 5>5 more times before reloading."
15:09:59 "Error using or obtaining a database connection, will try till 4>5 more times before reloading."
15:10:05 "Error using or obtaining a database connection, will try till 3>5 more times before reloading."
15:10:07 "APPARENT DEADLOCK!!!"
15:10:10 "APPARENT DEADLOCK!!!"
15:10:21 Running euca-describe-availability-zones verbose fails on 403:
15:10:21 "Error using or obtaining a database connection, will try till 2>5 more times before reloading."
15:10:21 "com.eucalyptus.ws.AuthenticationException: User authentication failed."
15:10:34 Running euca-describe-availability-zones verbose fails on 403:
15:10:34 "Error using or obtaining a database connection, will try till 1>5 more times before reloading."
15:10:34 "com.eucalyptus.ws.AuthenticationException: User authentication failed."
15:10:53 Running euca-describe-availability-zones verbose fails on 403:
15:10:53 "Resetting hard"
15:10:53 "com.eucalyptus.ws.AuthenticationException: User authentication failed."
15:18:57 Running "euca-describe-availability-zones verbose" works

So this seems to be linked to the number of tries rather than a timeout. The DB is wedged on reboot, and it takes 6 tries (2 of which are being done by the system itself) before its working again.

Changed in eucalyptus (Ubuntu):
importance:	Medium → High
status:	Confirmed → Triaged
summary:	- DB deadlock on reboot prevents EMI from being started + DB deadlock on reboot prevents UEC from working, temporarily

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-08: Re: DB deadlock on reboot prevents UEC from working, temporarily

Can't reproduce it on a stop/start eucalyptus. My hint is that we wedge something when we reboot that we don't wedge on regular stop.

I can reproduce it every time on reboot. It takes 5 DB connections before the system "resets hard". Sometimes the system by itself will do enough connections so that it will be reset by the time you try your command. Otherwise trying your command 5 times in a row is sufficient to make it work.

Look for "DEADLOCK" and "Resetting" in cloud-output.log.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-08:

From my limited upstart knowledge, what we do on regular stop that we appear not doing on reboot is... stopping eucalyptus.
However adding "stop on runlevel [!23]" doesn't seem to be sufficient, so maybe it's a red herring.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-08:

Lowering priority since its transient.
Probably related to networking not being up before Eucalyptus starts, since apparently it only hits DHCP-based configs

Changed in eucalyptus (Ubuntu Karmic):
assignee:	Dustin Kirkland (kirkland) → Thierry Carrez (ttx)
importance:	High → Medium

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-09:

Further testing seems to invalidate the "eucalyptus started before network" theory.

I added a loop to wait for the external IP address to show up before starting up eucalyptus. It does *not* prevent the DEADLOCK issues.

I added a loop to delay eucalyptus startup by 60 seconds arbitrarily, checked that networking and everything else was up even before any of eucalyptus starts: It does *not* prevent the DEADLOCK issues.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-09:

#10

More testing:

Trying with static networking setup does *not* prevent DEADLOCK issues.

It's probably not linked to the stop procedure either, since I stop eucalyptus cleanly before rebooting.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-09:

#11

logs.tar.gz Edit (69.1 KiB, application/x-tar)

Finally I tested disabling eucalyptus start at boot:

(From a running system)
Stopping eucalyptus.
Clearing logs.
Starting eucalyptus manually
Testing: PASS. no deadlock.
Stopping eucalyptus.
Rebooting, waiting for system to be fully up.
Starting eucalyptus manually
Testing: FAIL. deadlock issue.

See resulting logs: one successful startup, one failed startup. They look very much the same we hit the DB error.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-09:

#12

So to summarize:
- To trigger the bug, reboot your cluster and try running euca-describe-availability-zones verbose once eucalyptus is ready. It fails, and makes DEADLOCK errors appear in the logs.
- After a few tries, eucalyptus does a "hard reset" on connections and autofixes itself.
- This occurs for me every time, the first time I start eucalyptus after system startup.
- Stopping and restarting eucalyptus on a running system does not trigger the error. It only happens at first eucalyptus startup after a boot.
- This does not depend on the boot order, since I can reproduce by starting up manually after boot is fully complete. It does not depend on static / DHCP networking either, I can reproduce in both cases.
- It does not depend on the way eucalyptus is stopped, since I reproduce it by manually stopping eucalyptus.

It seems like a condition that is only found after a fresh boot is triggering the issue. Something that starting and stopping eucalyptus once will fix. I tried cleaning up /var/run/eucalyptus and /tmp/ and reproduce, but that doesn't come from there. I also tried cleaning up the 169.254.169.254/32 address, but that doesn't trigger it either.

Changed in eucalyptus (Ubuntu Karmic):
assignee:	Thierry Carrez (ttx) → nobody

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-09:

#13

I think we need some confirmation here.
Anyone with a UEC setup, please do:

$ sudo stop eucalyptus
$ sudo rm /var/log/eucalyptus/*
$ sudo reboot
...
# wait for reboot complete and port 8443 to show up on netstat
$ . eucarc
$ euca-describe-availability-zones verbose
$ grep Reset /var/log/eucalyptus/cloud-output.log
$ grep DEAD /var/log/eucalyptus/cloud-output.log
$ sudo stop eucalyptus
$ sudo rm /var/log/eucalyptus/*
$ sudo start eucalyptus
# wait for port 8443 to show up on netstat
$ euca-describe-availability-zones verbose
$ grep Reset /var/log/eucalyptus/cloud-output.log
$ grep DEAD /var/log/eucalyptus/cloud-output.log

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-09:

#14

Thierry,

I ran all of these.

After reboot, I got a few Reset messages in cloud-output.log, like this:
11:21:56 WARN DatabaseUtil | -> Resetting: z8kfsx84g4e6c2if50wl|6e905272

But after clearing the logs, and warm restart of eucalyptus, port 8443 came up, my availability zones were shown, and there are no Reset or DEAD messages in the logs.

:-Dustin

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-09:

#15

Thierry asked me to lower the priority, since I'm not reproducing this problem now.

:-Dustin

Changed in eucalyptus (Ubuntu Karmic):
importance:	Medium → Low

Revision history for this message

Nick Barcet (nijaba) wrote on 2009-10-19:

#16

cloud-output.log Edit (512.7 KiB, text/plain)

I ran into the same problem rebooting after and apt-get upgrade. Had to restart the services manually to be able to log in again... Enclosing relevant part of the log.

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-19:

#17

Too late to nail and fix for release, especially that deep in the code.

Changed in eucalyptus (Ubuntu Karmic):
assignee:	nobody → Thierry Carrez (ttx)
status:	Triaged → Won't Fix

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-22:

#18

Proposed Title:
Upon reboot, UEC might refuse to serve the first requests
Text:
On specific UEC systems we observed a database deadlock condition on reboot that prevents a UEC cloud to serve the first 1 to 6 requests it receives, until the deadlock is automatically cleared (444352).

Revision history for this message

Thierry Carrez (ttx) wrote on 2009-10-22:

#19

Maybe that's clearer:

Proposed Title:
UEC might refuse to serve the first requests received after boot
Text:
On specific UEC systems we observed a database deadlock condition after boot that prevents a UEC cloud to answer correctly to the first requests it receives, usually triggering "503 Forbidden" answers. After some tries the deadlock condition is automatically cleared. To workaround this issue you can restart eucalyptus after boot on the cluster controller by running "sudo restart eucalyptus" (444352).

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-22:

#20

Hmm, I'm now experiencing this problem on my cluster, and it particularly sucks. :-(

:-Dustin

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-22:

#21

After waiting ~5 minutes, it seems to have become unwedged, and working again. Looks like I need to reboot, and walk away for ~10 minutes, get some coffee, and come back.

:-Dustin

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-22:

#22

Minor change also required to powernap, to test for the writability of /var/cache/powerwake/ethers, before attempting to update the cache. Committed upstream.

Now that I've patched both of these, Cloud Power Management once again works as expected!

:-Dustin

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-10-22:

#23

Please disregard that last comment... Posted to the wrong bug. Sorry.

:-Dustin

Dustin Kirkland  (kirkland) on 2009-10-28

Changed in ubuntu-release-notes:
assignee:	nobody → Dustin Kirkland (kirkland)
assignee:	Dustin Kirkland (kirkland) → Thierry Carrez (ttx)

Mathias Gug (mathiaz) on 2009-10-28

Changed in ubuntu-release-notes:
assignee:	Thierry Carrez (ttx) → Mathias Gug (mathiaz)
status:	New → In Progress

Mathias Gug (mathiaz) on 2009-10-28

description:	updated
Changed in ubuntu-release-notes:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2009-10-28

description:

updated

Revision history for this message

Steve Langasek (vorlon) wrote on 2009-10-28:

#24

Documented at <https://wiki.ubuntu.com/KarmicKoala/ReleaseNotes#UEC%20may%20refuse%20to%20serve%20the%20first%20requests%20received%20after%20startup>:

On startup a UEC system may not process requests correctly, for instance returning "403 Forbidden" errors in response to some client requests. This is caused by a database deadlock condition which is automatically cleared after some retries. To workaround this issue you can restart eucalyptus on the cluster controller by running "sudo restart eucalyptus" after boot. (444352)

Changed in ubuntu-release-notes:
status:	Fix Committed → Fix Released

Mathias Gug (mathiaz) on 2009-11-02

summary:

- DB deadlock on reboot prevents UEC from working, temporarily
+ DB deadlock on reboot prevents UEC from working, temporarily - 403
+ Forbidden errors

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2009-11-02:

#25

I'm seeing this quite a bit now in my UEC. I'm suggesting that we should probably identify a fix and roll it out in an SRU.

Thierry, if you disagree, feel free to close it won't-fix again.

:-Dustin

Changed in eucalyptus (Ubuntu Karmic):
milestone:	none → karmic-updates
status:	Won't Fix → Triaged

Mathias Gug (mathiaz) on 2009-11-05

tags:

added: uec

Thierry Carrez (ttx) on 2009-11-09

Changed in eucalyptus (Ubuntu Karmic):
assignee:	Thierry Carrez (ttx) → Dustin Kirkland (kirkland)

chris grzegorczyk (chris-grze) on 2009-11-17

Changed in eucalyptus:
assignee:	nobody → chris grzegorczyk (chris-grze)

Revision history for this message

chris grzegorczyk (chris-grze) wrote on 2009-11-17:

#26

Is there a way to reproduce this problem?

Ubuntu QA Website (ubuntuqa) on 2009-11-17

tags:

added: iso-testing

Thierry Carrez (ttx) on 2010-01-20

Changed in eucalyptus (Ubuntu Karmic):
status:	Triaged → Won't Fix

chris grzegorczyk (chris-grze) on 2010-01-21

Changed in eucalyptus:
status:	New → Invalid

Revision history for this message

Dustin Kirkland  (kirkland) wrote on 2010-01-29:

#27

The DB deadlock in Eucalyptus itself has been fixed, but there is a remaining upstart/init issue, Bug #503180. I'm marking this bug a duplicate of that one, at this point.

Ultimately, eucalyptus-cloud upstart script depends on net, and "completes successfully" before it has an IP address. We will solve this when we have an "ip_address_obtained" event that we can depend upon.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2010-02-03:

#28

This bug was fixed in the package eucalyptus - 1.6.2~bzr1136-0ubuntu3

---------------
eucalyptus (1.6.2~bzr1136-0ubuntu3) lucid; urgency=low

  * eucalyptus-cc.eucalyptus-cc-publication.upstart,
    eucalyptus-common.eucalyptus.upstart, eucalyptus-network.upstart,
    eucalyptus-sc.eucalyptus-sc-publication.upstart,
    eucalyptus-walrus.eucalyptus-walrus-publication.upstart, rules: rework
    our eucalyptus starting condition to depend on a new upstart emitted
    signal, eucalyptus-network-is-ready, which is only fired once the
    network interface providing the default route is in fact up, and
    listening on a real ip address, LP: #503180
  * debian/eucalyptus-common.eucalyptus.upstart:
    - don't respawn eucalyptus-cloud
    - ensure that the iptables module gets loaded soon enough, otherwise
      much bad behavior happens in various nasty ways, most notably, a
      wedged database which renders the CLC non responsive on restart/reboot,
      LP: #503180 and dupes, LP: #444352, #444946, #467521, #478573, #480048
-- Dustin Kirkland <email address hidden> Tue, 02 Feb 2010 17:13:52 -0800

Changed in eucalyptus (Ubuntu):
status:	Triaged → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

Duplicate of bug #503180 Remove

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntueucalyptus package

DB deadlock on reboot prevents UEC from working, temporarily - 403 Forbidden errors

Bug Description

Related branches

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
eucalyptus package