nova + glance services die if started before db is reachable

Bug #959426 reported by Nick Moffitt
This bug affects 3 people
Affects                   Status        Importance  Assigned to     Milestone
Glance                    Fix Released  High        Adam Gandelman
OpenStack Compute (nova)  Fix Released  High        Adam Gandelman
glance (Ubuntu)           Fix Released  Undecided   Unassigned
nova (Ubuntu)             Fix Released  Critical    Unassigned

Bug Description

I've got a system where the controller-type services (glance, api, scheduler, rabbit, etc.) are all on the same machine as the MySQL DB, and the compute nodes are on the same network.

When I reboot my Precise machine, the logs are full of "could not connect to mysql" errors, and half the services aren't running. The upstart scripts should probably make an allowance for starting everything up after the database, or the services should retry the connection for a while before giving up and crashing.

Tags: canonistack
James Troup (elmo)
tags: added: canonistack
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
David Kranz (david-kranz) wrote :

I just wasted three hours on this one. It is the same as the Ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/nova/+bug/959426

I came to this as a successful Diablo deployer/developer trying to get Essex working, using a single node to make it easier. Or so I thought. Many people will be hurt by this bug in the near future. If it is not fixed, it at least needs to be highlighted in the nova install docs.

Revision history for this message
Alan Pevec (apevec) wrote :

> or the services should re-try the connections for a while before giving up and crashing.

Isn't this fixed in bug 943031, which should be in essex-rc1?

Thierry Carrez (ttx)
Changed in nova:
status: New → Incomplete
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

This is not resolved by the fix for bug 943031 which, AFAICS, fixes the issue of an existing db connection going away. The bug here is at service startup, i.e., starting nova-compute with a mysql server that is not up (yet) ends in:

(nova): TRACE: OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on 'outo.home.base' (111)") None None

and a dead process.

Initial setup of the database connection does not retry, and instead the service dies (quickly). This can't really be worked around in packaging/upstart and shouldn't need to be, as a database is truly an external resource anyway. In contrast, the rabbitmq connection initialization loops at startup if it cannot connect to its server, dumping tracebacks to nova-$foo.log, but at least it keeps the service up and makes a connection when the server appears.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

The fix in bug 943031 was recently re-implemented in Glance @ https://review.openstack.org/#change,5552. While this change currently doesn't resolve the issue of an unreachable database on service startup, it can easily be extended to check the ability to actually connect through the sqlalchemy engine at startup / during configure_db(). It's been mentioned that the same logic should be a part of nova or openstack-common as well. If it lands there, it will also resolve this issue for nova.
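
A minimal sketch of the startup check suggested above, assuming an engine object that, like a SQLAlchemy engine, connects lazily and exposes `connect()`. `StubEngine`, `StubConnection`, and `ensure_connection` are hypothetical names used only for illustration; this is not the Glance code, which lives in the review linked above.

```python
class StubConnection:
    """Stand-in for a database connection (hypothetical, for illustration)."""

    def close(self):
        pass


class StubEngine:
    """Stand-in for a lazily connecting engine (hypothetical, for illustration)."""

    def __init__(self, reachable):
        self.reachable = reachable

    def connect(self):
        # A real engine would open a DB-API connection at this point.
        if not self.reachable:
            raise ConnectionError("Can't connect to database server")
        return StubConnection()


def ensure_connection(engine):
    """Return True if a connection can actually be opened, False otherwise.

    Creating an engine does not prove the server is reachable, so the check
    opens one connection and closes it again straight away.
    """
    try:
        conn = engine.connect()
    except ConnectionError:
        return False
    conn.close()
    return True
```

A caller could run such a check once during database setup and hand a failure to whatever retry logic the service already has, instead of letting the first real query crash the process.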

Marking as critical in Ubuntu, adding a task for Glance + tagging essex-rc-potential as this is a serious issue for anyone expecting these services to start reliably.

Note, Keystone seems to initialize its database lazily and not touch it until a request comes in. I'm not sure about quantum.

Changed in nova (Ubuntu):
importance: Undecided → Critical
summary: - nova services start before mysql on boot
+ nova + glance services die if started before db is reachable
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5760

Changed in glance:
assignee: nobody → Adam Gandelman (gandelman-a)
status: New → In Progress
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Note that this really has nothing to do with whether the services are all on the same host or not. The service being unavailable is not a terminal error, since this host and the other host(s) may be starting up in parallel (think massive power failure or complicated system bring-up). So retrying for a considerable amount of time seems to be the appropriate thing to do.
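
One way to read "retrying for a considerable amount of time" is a time-bounded loop with a growing delay between attempts. The following is only a sketch under that assumption; `retry_until` and all of its parameters are hypothetical names, and neither nova nor glance is claimed to work this way.

```python
import time


def retry_until(connect, timeout, first_delay=1.0, max_delay=30.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or `timeout` seconds have elapsed.

    The delay between attempts doubles each time, capped at max_delay.
    Returns whatever connect() returns; re-raises the last error once the
    deadline has passed.
    """
    deadline = clock() + timeout
    delay = first_delay
    while True:
        try:
            return connect()
        except OSError:
            # ConnectionError and socket errors are OSError subclasses.
            if clock() >= deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `clock` and `sleep` arguments are injectable so the loop can be exercised without real waiting.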

Revision history for this message
David Kranz (david-kranz) wrote :

Yes. It could also happen that even after startup the connection is temporarily unavailable, and the nova processes should not crash immediately. It should be noted that this problem never happened (for me) in oneiric/diablo-stable, and I was/am using a "complicated system bring-up". Right now I cannot reboot the machine running mysql and nova-scheduler because nova-scheduler crashes every time.

Thierry Carrez (ttx)
tags: added: essex-rc-potential
Thierry Carrez (ttx)
Changed in nova:
status: Incomplete → Triaged
Changed in glance:
importance: Undecided → High
Changed in nova:
importance: Undecided → High
Changed in nova (Ubuntu):
status: Confirmed → Triaged
Changed in glance (Ubuntu):
status: New → Triaged
Thierry Carrez (ttx)
Changed in glance:
milestone: none → essex-rc2
tags: removed: essex-rc-potential
Changed in nova:
milestone: none → essex-rc2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance (master)

Reviewed: https://review.openstack.org/5760
Committed: http://github.com/openstack/glance/commit/12757d70fe6cf0d0fe6e46c5ee09a4f4d3efbd49
Submitter: Jenkins
Branch: master

commit 12757d70fe6cf0d0fe6e46c5ee09a4f4d3efbd49
Author: Adam Gandelman <email address hidden>
Date: Fri Mar 23 18:23:54 2012 -0700

    Ensure functional db connection in configure_db()

    During initial database setup, ensure we can physically connect
    to the database and allow a failed connection to make use of the
    new retry mechanism instead of registry startup failing outright.

    Fixes lp bug #959426.

    Change-Id: I1c87b19913c4204465e5d2027f2f184f0f358fd0

Changed in glance:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
assignee: nobody → Adam Gandelman (gandelman-a)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/5911

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance (milestone-proposed)

Reviewed: https://review.openstack.org/5911
Committed: http://github.com/openstack/glance/commit/127101a44c0708ef26572abf98a025983ae35aa4
Submitter: Jenkins
Branch: milestone-proposed

commit 127101a44c0708ef26572abf98a025983ae35aa4
Author: Adam Gandelman <email address hidden>
Date: Fri Mar 23 18:23:54 2012 -0700

    Ensure functional db connection in configure_db()

    During initial database setup, ensure we can physically connect
    to the database and allow a failed connection to make use of the
    new retry mechanism instead of registry startup failing outright.

    Fixes lp bug #959426.

    Change-Id: I1c87b19913c4204465e5d2027f2f184f0f358fd0

Changed in glance:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5939

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Adam Gandelman (gandelman-a) → Vish Ishaya (vishvananda)
Changed in nova:
assignee: Vish Ishaya (vishvananda) → Adam Gandelman (gandelman-a)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/5939
Committed: http://github.com/openstack/nova/commit/a4dd6b6f06d222f49bd0d2582dfe0f2925a1638f
Submitter: Jenkins
Branch: master

commit a4dd6b6f06d222f49bd0d2582dfe0f2925a1638f
Author: Adam Gandelman <email address hidden>
Date: Wed Mar 28 18:52:41 2012 -0700

    Ensure a functional database connection

    Allow retrying database connection in get_engine() at an interval. Resolves
    the issue of nova components erroring at startup if a database connection is
    unavailable, particularly at boot. Borrowed from a similar commit to glance,
    (https://review.openstack.org/#change,5552).

    Fixes Bug #959426 for nova.

    Update: * Properly return an engine (fixes tests)
            * Setting sql_max_retries to -1 will retry infinitely
            * Bumped options count in nova.conf.sample
            * i18n log warning
            * Add note to flag help about -1 == infinite
            * Pep8 fix

    Change-Id: Id34eda9e0bad6b477a74e9a7d3575e513e6291d5
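
The retry semantics described in this commit message (retry at an interval, with sql_max_retries set to -1 meaning retry forever) can be sketched roughly as follows. This is an illustration, not the code from the commit: `connect_with_retries`, `retry_interval`, and the injected `connect` callable are hypothetical names.

```python
import time


def connect_with_retries(connect, max_retries, retry_interval, sleep=time.sleep):
    """Try connect(); on failure, retry up to max_retries more times.

    max_retries == -1 means retry indefinitely, mirroring the note about
    sql_max_retries in the commit message.
    """
    remaining = max_retries
    while True:
        try:
            return connect()
        except OSError:
            if remaining == 0:
                # Retry budget exhausted: let the error propagate.
                raise
            if remaining > 0:
                remaining -= 1
            # remaining == -1 is never decremented, so it never reaches 0.
            sleep(retry_interval)
```

With `max_retries=2` the callable is attempted at most three times (one initial try plus two retries); with `max_retries=-1` the loop only ends when a connection succeeds.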

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/6075

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (milestone-proposed)

Reviewed: https://review.openstack.org/6075
Committed: http://github.com/openstack/nova/commit/ccb93c47370d774dc7b1959f2b2d038a4819855d
Submitter: Jenkins
Branch: milestone-proposed

commit ccb93c47370d774dc7b1959f2b2d038a4819855d
Author: Thierry Carrez <email address hidden>
Date: Mon Apr 2 11:55:35 2012 +0200

    Ensure a functional database connection

    Allow retrying database connection in get_engine() at an interval. Resolves
    the issue of nova components erroring at startup if a database connection is
    unavailable, particularly at boot. Borrowed from a similar commit to glance,
    (https://review.openstack.org/#change,5552).

    This also fixes code duplication due to a half-backport of
    commit 155ef7daab08d7f3fb8f7838df1d715bf1dc2f3f

    Fixes Bug #959426 for nova.

    Change-Id: Ifea94da8347714887c8cae02cc48288f3fa4fa7f

Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Fixed in Ubuntu with rc2 upload (nova-2012.1~rc2-0ubuntu1)

Changed in nova (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Fixed in Ubuntu as of glance-2012.1~rc2-0ubuntu1

Changed in glance (Ubuntu):
status: Triaged → Fix Released
Thierry Carrez (ttx)
Changed in glance:
milestone: essex-rc2 → 2012.1
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-rc2 → 2012.1