Ubuntu

[SRU] eucalyptus-cloud doesn't reply to requests (eucalyptus doesn't work after reboot or services restart issues due to upstart networking behavior)

Reported by Mathias Gug on 2010-01-05
60
This bug affects 10 people
Affects Status Importance Assigned to Milestone
eucalyptus (Ubuntu)
High
Dustin Kirkland 
Karmic
High
Dustin Kirkland 
Lucid
High
Dustin Kirkland 

Bug Description

Using 1.6.2~bzr1120-0ubuntu1, I can't get the credentials:

ubuntu@uec-cc:~$ sudo euca_conf --get-credentials mycreds.zip
ERROR: you need to be on the CLC host and the CLC needs to be running.

A wget on the register url also times out:

ubuntu@uec-cc:~$ wget -T 10 -t 1 -O - --no-check-certificate https://127.0.0.1:8443/register
--2010-01-04 20:39:34-- https://127.0.0.1:8443/register
Connecting to 127.0.0.1:8443... connected.
WARNING: cannot verify 127.0.0.1's certificate, issued by `/C=US/ST=CA/L=Santa Barbara/O=db/OU=Eucalyptus/CN=localhost':
  Self-signed certificate encountered.
WARNING: certificate common name `localhost' doesn't match requested host name `127.0.0.1'.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Giving up.

The eucalyptus-cloud process is running a listening on port 8443.

I can see the following errors in /var/log/eucalyptus/cloud-error.log:

0:36:55 [log:653891498@qtp-1693378617-9] ERROR /register
java.lang.RuntimeException: javax.persistence.PersistenceException: org.hibernate.exception.JDBCConnectionException: Cannot open connection
        at com.eucalyptus.util.TxHandle.<init>(TxHandle.java:46)
        at com.eucalyptus.util.EntityWrapper.<init>(EntityWrapper.java:98)
        at com.eucalyptus.util.EntityWrapper.<init>(EntityWrapper.java:91)
        at edu.ucsb.eucalyptus.util.EucalyptusProperties.getSystemConfiguration(EucalyptusProperties.java:117)
        at edu.ucsb.eucalyptus.admin.server.Registration.getRegistrationId(Registration.java:199)
        at edu.ucsb.eucalyptus.admin.server.Registration.doGet(Registration.java:210)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:389)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
Caused by: javax.persistence.PersistenceException: org.hibernate.exception.JDBCConnectionException: Cannot open connection
        at org.hibernate.ejb.AbstractEntityManagerImpl.throwPersistenceException(AbstractEntityManagerImpl.java:614)
        at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:41)
        at com.eucalyptus.util.TxHandle.<init>(TxHandle.java:40)
        ... 24 more
Caused by: org.hibernate.exception.JDBCConnectionException: Cannot open connection
        at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:97)
        at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
        at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:52)
        at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:449)
        at org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:167)
        at org.hibernate.jdbc.JDBCContext.connection(JDBCContext.java:142)
        at org.hibernate.transaction.JDBCTransaction.begin(JDBCTransaction.java:85)
        at org.hibernate.impl.SessionImpl.beginTransaction(SessionImpl.java:1353)
        at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:38)
        ... 25 more
Caused by: java.sql.SQLException: Connection is broken: java.net.SocketException: Connection timed out
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.jdbcConnection.getAutoCommit(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.logicalcobwebs.proxool.WrappedConnection.invoke(WrappedConnection.java:162)
        at $Proxy27.getAutoCommit(Unknown Source)
        at org.hibernate.connection.ProxoolConnectionProvider.getConnection(ProxoolConnectionProvider.java:81)
        at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:446)
        ... 30 more

== SRU ==

IMPACT: Users of 9.10 UEC will often experience long, non-deterministic delays (10-20 minutes in many cases) when restarting eucalyptus services (which includes reboots, package upgrades, service restarts). This is highly inconvenient, yielding UEC unusable until database network connections reset.

HOW FIXED: The fix consists of two trivial iptables calls being added to the eucalyptus upstart script. Upstream had these calls in there init scripts, but were inadvertently dropped when porting Eucalyptus to upstart. These iptables commands will ensure that the iptables kernel module (and most importantly, the ip connection tracker) is loaded and active before Eucalyptus comes up. WIthout said ip connection tracker, Eucalyptus will often establish a connection to the database, then iptables is loaded and connections are mangled, breaking the connection to the database. The user will see the problem in any one of a number of disguising ways (front end not working, api tools not responding, etc). All of these problems are due to an inaccessible database. After a while (10-20 minutes), Eucalyptus will reset the database connection. With this fix, the above problems should never happen. Eucalyptus should be back up and running within 1-2 minutes of boot (if not immediately).

MINIMAL PATCH:
diff -u eucalyptus-1.6~bzr931/debian/eucalyptus-common.eucalyptus.upstart eucalyptus-1.6~bzr931/debian/eucalyptus-common.eucalyptus.upstart
--- eucalyptus-1.6~bzr931/debian/eucalyptus-common.eucalyptus.upstart
+++ eucalyptus-1.6~bzr931/debian/eucalyptus-common.eucalyptus.upstart
@@ -11,6 +11,10 @@
        # Check if installed
        [ -f /usr/sbin/euca_conf ] || { stop; exit 0; }

+ # Ensure that the iptables module gets loaded here
+ iptables -t nat -L -n >/dev/null
+ iptables -L -n > /dev/null
+
        mkdir -p /var/run/eucalyptus/net
        chown eucalyptus:eucalyptus /var/run/eucalyptus /var/run/eucalyptus/net

REPRODUCING THE BUG:
Reboot your UEC (or sudo restart eucalyptus). If restarting eucalyptus takes a *long* time, you are experiencing one symptom of this bug. Once upstart thinks that eucalyptus is up, try: $(sudo wget --no-check-certificate https://localhost:8443) If this takes a long time, or fails to work, you are experiencing a symptom of this bug. Note that the problem is inherently due to a race condition, and therefore may not be immediately reproducible. Try rebooting/restarting a few times, and you will likely hit it.

REGRESSION POTENTIAL:
I cannot see any possible regression potential. The iptables modules will be loaded eventually. This patch just ensures that they get loaded before Eucalyptus tries to start services.

Mathias Gug (mathiaz) on 2010-01-05
Changed in eucalyptus (Ubuntu):
importance: Undecided → High
Thierry Carrez (ttx) wrote :

I ran into this one as well. I had to stop eucalyptus / start eucalyptus to get a running system.

Changed in eucalyptus (Ubuntu):
status: New → Confirmed

I've seen similar problems in the past, but waiting for some extended
time (15 minutes) solved it. In that time the Web UI was also not
working, the admin user was reported to not exist.

I also have to kill -9 the remaining eucalyptus-cloud process after the stop. I don't have the same exception in cloud-error.log. In cloud-output.log everything seems normal, then nothing gets added anymore. euca_conf is run, nothing shows in the log. Then after some time you start getting tons of:

NioServerHandler.java:145 org.jboss.netty.handler.timeout.ReadTimeoutException
NioServerHandler.java:148 org.jboss.netty.handler.timeout.WriteTimeoutException
NioServerHandler.java:154 Internal Error.

That seems to point to some lockup in the eucalyptus-cloud process.
Note that euca_conf still returns 0 in that case, which is confusing.

tags: added: iso-testing
Thierry Carrez (ttx) on 2010-01-14
Changed in eucalyptus (Ubuntu):
assignee: nobody → Dustin Kirkland (kirkland)
milestone: none → lucid-alpha-3
Daniel Nurmi (nurmi) wrote :

Greetings,

We're having trouble reproducing this problem on our lucid systems against eucalyptus-cloud-1.6.2~bzr1124-0ubuntu3. We think that the issue is possibly related to some process startup behavior, but it would be great to get some more system information or a process by which the issue can often be triggered - some questions follow, the answers to which will help us try to reproduce:

1.) does this condition appear on a fresh boot (i.e. there are definitely no prior eucalyptus-cloud processes running when a new eucalyptus-cloud process is started)?
2.) does this condition appear when there is certainly the primary network interface up and configured?
3.) does this condition on 'start eucalyptus-cloud', 'restart eucalyptus-cloud' or both?
4.) do you see the condition when the cloud/walrus/sc are all on the same system or is this a stand-alone cloud service (or some other topology)?
5.) does this happen after a cloud has been configured/working or is it always during the initial setup (during registration of other components)?

If we can trap the problem a bit further, we'll surely be able to find the solution!

regards
-dan

Thierry Carrez (ttx) wrote :

I hit it systematically on the following test:
Alpha2 ISO, UEC installer, select CLC+Walrus+CC+SC
Just after reboot (so, fresh boot, clearly), the right ports come up... but...
"sudo euca_conf --get-credentials mycreds.zip" stalls
"sudo euca_conf --list-walruses" answers "You need to be on the CLC..."
Running "sudo stop eucalyptus" leaves a eucalyptus-cloud process running that only SIGKILL will stop.
After killing the process, sudo start eucalyptus... everything runs fine.

I haven't tried so much other conditions. Looking at the upstart job, since it now starts on "runlevel [2345]" I think lo is certainly up, but nothing is ensured about eth0. Could that explain it ?

Daniel Nurmi (nurmi) wrote :

Thank Thierry,

We've finally been able to (we believe) reproduce this type of condition on our Lucid machines, and have figured out the reason why it is being triggered on lucid UEC installations. The Eucalyptus front-end components (cloud, walrus, sc) require that, on startup, the network interface that was used to register components is up and configured (i.e. has an IP address that was used at component registration time). Our init process ensures this by running after the 'network' init script completes, but in the upstart case it looks like the eucalyptus components can start after the network process has started, but before it completes. In sum, if eucalyptus starts, then the interface becomes configured, this condition will be triggered and the front-end will need to be restarted so that it can go through the initialization process again.

Note that this is a startup time requirement only (i.e., once the system is up, and for some reason the interface goes down and comes up again), the service will stay active; it is just during the startup that eucalyptus requires that the interface is alive and configured.

Is there a way to have the eucalyptus upstart scripts wait until the network interface is fully alive before starting the front-end components?

Thanks!

Thierry Carrez (ttx) wrote :

That makes sense. I will test with an upstart script that clearly waits for eth0 up before starting "eucalyptus" and let you know.

Thierry Carrez (ttx) wrote :

It looks like starting eucalyptus on "net-device-up IFACE=eth0" is not sufficient. It definitely looks sensitive to startup order / network up though. I'll let Dustin investigate this one deeper.

Dustin Kirkland  (kirkland) wrote :

Okay, I talked to Scott about similar upstart situations last week, and I think I have his blessing on what I thought would be hacky.

Basically, we need to create an additional, upstart (psuedo) job, that would start on started networking, and basically loop until some conditional is settled (in this case, ip_address_obtained), at which point this psuedo job would emit a signal. The Eucalyptus upstart jobs would simply wait on this signal.

I'll implement this at the distro sprint.

summary: - eucalyptus-cloud doesn't reply to requests
+ eucalyptus-cloud doesn't reply to requests (due to upstart networking
+ issue)
summary: - eucalyptus-cloud doesn't reply to requests (due to upstart networking
- issue)
+ eucalyptus-cloud doesn't reply to requests (eucalyptus doesn't work
+ after reboot or services restart issues due to upstart networking
+ behavior)

This bug was fixed in the package eucalyptus - 1.6.2~bzr1136-0ubuntu3

---------------
eucalyptus (1.6.2~bzr1136-0ubuntu3) lucid; urgency=low

  * eucalyptus-cc.eucalyptus-cc-publication.upstart,
    eucalyptus-common.eucalyptus.upstart, eucalyptus-network.upstart,
    eucalyptus-sc.eucalyptus-sc-publication.upstart,
    eucalyptus-walrus.eucalyptus-walrus-publication.upstart, rules: rework
    our eucalyptus starting condition to depend on a new upstart emitted
    signal, eucalyptus-network-is-ready, which is only fired once the
    network interface providing the default route is in fact up, and
    listening on a real ip address, LP: #503180
  * debian/eucalyptus-common.eucalyptus.upstart:
    - don't respawn eucalyptus-cloud
    - ensure that the iptables module gets loaded soon enough, otherwise
      much bad behavior happens in various nasty ways, most notably, a
      wedged database which renders the CLC non responsive on restart/reboot,
      LP: #503180 and dupes, LP: #444352, #444946, #467521, #478573, #480048
 -- Dustin Kirkland <email address hidden> Tue, 02 Feb 2010 17:13:52 -0800

Changed in eucalyptus (Ubuntu Lucid):
status: Confirmed → Fix Released
Changed in eucalyptus (Ubuntu Karmic):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Dustin Kirkland (kirkland)
milestone: none → karmic-updates
Dustin Kirkland  (kirkland) wrote :

I have nominated this for Karmic, in the hopes of getting an SRU out soon.

We think we have tracked this problem to a very simple fix...ensuring that the iptables module gets loaded in initial eucalyptus upstart job.

I have uploaded eucalyptus_1.6~bzr931-0ubuntu7.5~ppa1 to the ubuntu-virt ppa:
 * https://launchpad.net/~ubuntu-virt/+archive/ppa

I don't have a current Karmic test environment, since I'm totally focused on Lucid development right now. If I could get one person to confirm this fix on their Karmic UEC, I will upload to karmic-proposed and begin the SRU process (where the fix will again need to be confirmed).

Nick/Boris/Torsten- can any of you guys help verify the fix in the PPA?

Thanks,
:-Dustin

Changed in eucalyptus (Ubuntu Karmic):
status: Triaged → In Progress
Robert Collins (lifeless) wrote :

Manually testing the change worked for me. I'll try the built package shortly

Torsten Spindler (tspindler) wrote :

I first reproduced the problem on the training cloud, then installed the PPA packages. The problem went away, the cloud is responsive within the first two minutes of restarting the frontend.

Okay, this is now uploaded to karmic-proposed.

Please test that package (too), to help get this pushed to karmic-updates.

Changed in eucalyptus (Ubuntu Karmic):
status: In Progress → Fix Committed
summary: - eucalyptus-cloud doesn't reply to requests (eucalyptus doesn't work
- after reboot or services restart issues due to upstart networking
+ [SRU] eucalyptus-cloud doesn't reply to requests (eucalyptus doesn't
+ work after reboot or services restart issues due to upstart networking
behavior)
Dustin Kirkland  (kirkland) wrote :

SRU instructions in the bug description above.

description: updated
John Dong (jdong) wrote :

Hi Dustin, did you get a SRU ACK yet? If not, the patch looks like a great candidate for SRU and everything looks good to me. Consider this an ACK :)

Accepted eucalyptus into karmic-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
Dustin Kirkland  (kirkland) wrote :

Robert, Torsten, Nick-

I've helped each of you guys either manually fix this bug, or upgrade to the PPA packages where I fixed it in the packaging.

The fix is currently awaiting verification. It would be great if someone could install the -proposed package and add a comment confirming that their issue is fixed, so that this upload can move into -updates.

I have another fix queued for eucalyptus karmic SRU, but I want to get this package moved to -updates before overwriting what's in -proposed now.

Thanks.

I enabled karmic-proposed, installed eucalyptus-*-1.6~bzr931-0ubuntu7.5
on the training cloud in the datacenter. I rebooted node and frontend
and the cloud was operational from the start. I started an instance and
ssh'ed to it. All looks good and operational.

Martin Pitt (pitti) on 2010-02-24
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6~bzr931-0ubuntu7.5

---------------
eucalyptus (1.6~bzr931-0ubuntu7.5) karmic-proposed; urgency=low

  * debian/eucalyptus-common.eucalyptus.upstart:
    - ensure that the iptables module gets loaded soon enough, otherwise
      much bad behavior happens in various nasty ways, most notably, a
      wedged database which renders the CLC non responsive on restart/reboot,
      LP: #503180 (and many duplicates)
 -- Dustin Kirkland <email address hidden> Wed, 10 Feb 2010 09:17:41 -0600

Changed in eucalyptus (Ubuntu Karmic):
status: Fix Committed → Fix Released
juan sipag (juan-sipag123) wrote :

can i used ubuntu in an <a href="http://gpsintel.com/">asset tracking</a> device program?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers