nova-cloud-controller db sync sometimes fails in HA mode

Bug #1335139 reported by Alexander List
This bug affects 5 people
Affects                                          Status        Importance  Assigned to
OpenStack Nova Cloud Controller Charm            Fix Released  High        Unassigned
nova-cloud-controller (Juju Charms Collection)   Invalid       High        Unassigned

Bug Description

We are deploying OpenStack infrastructure components (trusty/icehouse) to LXC containers on three physical machines provided by MAAS (full smoosh).

When deploying nova-cloud-controller, I got an error from the shared-db-relation-changed hook, caused by nova-manage db sync trying to create a table that already exists.

My suspicion is that another instance of nova-cloud-controller had already created the "instances" table, so the subsequent CREATE TABLE fails.

This will have to be fixed upstream, either by changing the statement to CREATE TABLE IF NOT EXISTS or, with more HA awareness, by skipping the DB generation entirely if the schema already exists.
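
For illustration only (hypothetical, not charm or nova code), a guard of roughly this shape would tolerate the schema already having been created by another unit:

# Hypothetical sketch, not actual charm code: skip the schema sync when
# a peer unit has already created the nova tables, rather than letting
# a plain CREATE TABLE fail on the duplicate.
import subprocess

def table_exists(cursor, name):
    # 'cursor' is assumed to be a DB-API cursor on the nova database.
    cursor.execute("SHOW TABLES LIKE %s", (name,))
    return cursor.fetchone() is not None

def migrate_database_if_needed(cursor):
    if table_exists(cursor, 'instances'):
        # Another nova-cloud-controller unit already ran the sync.
        return
    subprocess.check_output(['nova-manage', 'db', 'sync'])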

Revision history for this message
Alexander List (alexlist) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I think we are hitting this bug too. Attached is my log.

tags: added: landscape
tags: added: cloud-installer
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I filed https://bugs.launchpad.net/charms/+source/nova-cloud-controller/+bug/1347245 with another backtrace I had, this might be a different bug.

Revision history for this message
James Page (james-page) wrote :

The db-changed hook is gated to ensure that only one unit runs the db-sync task:

@hooks.hook('shared-db-relation-changed')
@restart_on_change(restart_map())
def db_changed():
    if 'shared-db' not in CONFIGS.complete_contexts():
        log('shared-db relation incomplete. Peer not ready?')
        return
    CONFIGS.write_all()

    if eligible_leader(CLUSTER_RES):
        migrate_database()
        log('Triggering remote cloud-compute restarts.')
        [compute_joined(rid=rid, remote_restart=True)
         for rid in relation_ids('cloud-compute')]

The 'eligible_leader' function does a few checks

1) if fully clustered, then the owner of the VIP is the leader
2) if not fully clustered, the oldest service unit in the 'cluster' relation is declared the leader

If, in case 2), the cluster relation is not fully formed with all service units prior to the shared-db relation being made, I could see how two nova-cc units might both think they are the leader and you hit a race; do you see the db sync being run on multiple service units?
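
In rough pseudocode, the checks and the race described above amount to the following; the helper names approximate charmhelpers.contrib.hahelpers.cluster and are not verbatim:

# Simplified paraphrase of the eligible_leader logic described above;
# the helper names follow charmhelpers' cluster helpers loosely.
from charmhelpers.contrib.hahelpers.cluster import (
    is_clustered,
    is_crm_leader,
    oldest_peer,
    peer_units,
)

def eligible_leader(resource):
    if is_clustered():
        # Check 1: fully clustered, so the unit that owns the VIP (the
        # CRM resource) is the leader.
        return is_crm_leader(resource)
    # Check 2: not (yet) clustered, so the oldest unit seen on the
    # 'cluster' peer relation acts as leader. If that relation has not
    # fully formed when shared-db fires, two units can each see
    # themselves as the oldest peer and both run the db sync.
    return oldest_peer(peer_units())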

Revision history for this message
Adam Collard (adam-collard) wrote : Re: [Bug 1335139] Re: nova-cloud-controller db sync fails in HA mode

On 23 July 2014 14:01, James Page <email address hidden> wrote:

> The 'eligible_leader' function does a few checks
>
> 1) if fully clustered, then the owner of the VIP is the leader
> 2) if not fully clustered, the oldest service unit in the 'cluster'
> relation is declared the leader
>
> If, in case 2), the cluster relation is not fully formed with all
> service units prior to the shared-db relation being made, I could see
> how two nova-cc units might both think they are the leader and you hit
> a race; do you see the db sync being run on multiple service units?
>

Yes.

Given it's not possible for a user of the Juju API to know when a relation
is "fully formed", I'm not sure if/how we can work around this.

Revision history for this message
James Page (james-page) wrote : Re: nova-cloud-controller db sync fails in HA mode

We definitely hit this problem during the original HA implementation and testing; two things were done to work around this (pending any sort of leader election function from Juju itself):

1) juju deployer deploys service units and waits for them all to start before adding relations

2) peer relations should always fire before other relations when you introduce a new unit into an existing service

Are you waiting for all service units to fully start prior to adding relations between services?

Revision history for this message
James Page (james-page) wrote :

I raised bug 1228243 a while back to request this feature.

Changed in nova-cloud-controller (Juju Charms Collection):
status: New → Confirmed
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
importance: Undecided → High
JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack canonical-is
tags: added: openstack
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
status: Confirmed → Triaged
summary: - nova-cloud-controller db sync fails in HA mode
+ nova-cloud-controller db sync sometimes fails in HA mode
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Just hit this with current /next charms (rev 154) - http://paste.ubuntu.com/10827860/

Changed in nova-cloud-controller (Juju Charms Collection):
milestone: none → 15.04
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 15.04 → 15.07
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 15.07 → 15.10
Revision history for this message
Chad Smith (chad.smith) wrote :

Just hit this one more time at Cisco. Could attach a ton of logs if necessary, but it's the same failure mode.

2015-08-28 20:22:38 INFO shared-db-relation-changed 2015-08-28 20:22:38.017 56639 CRITICAL nova [-] OperationalError: (OperationalError) (1050, "Table 'instances' already exists") "\nCREATE TABLE instances (\n\tcrea
....

2015-08-28 20:22:38 INFO shared-db-relation-changed subprocess.check_output(cmd)
2015-08-28 20:22:38 INFO shared-db-relation-changed File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
2015-08-28 20:22:38 INFO shared-db-relation-changed raise CalledProcessError(retcode, cmd, output=output)
2015-08-28 20:22:38 INFO shared-db-relation-changed subprocess.CalledProcessError: Command '['nova-manage', 'db', 'sync']' returned non-zero exit status 1

Revision history for this message
Chad Smith (chad.smith) wrote :

        u"cs:trusty/nova-cloud-controller-60",
        config={"openstack-origin": "cloud:trusty-icehouse",

Our juju charm settings from the above failed deployment

James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 15.10 → 16.01
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.01 → 16.04
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.04 → 16.07
Liam Young (gnuoy)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.07 → 16.10
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.10 → 17.01
James Page (james-page)
Changed in charm-nova-cloud-controller:
importance: Undecided → High
status: New → Triaged
Changed in nova-cloud-controller (Juju Charms Collection):
status: Triaged → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'm going to mark this as Fix Released, since leader election support landed in nova-cloud-controller on June 9, 2015, and that should fix the issue reported in this bug. If that's not the case, please feel free to open the bug back up.

commit eaaaec38ddd6b49ce8be530529b6ffaf165ba6e1
Merge: b651ebf d4b768f
Author: James Page <email address hidden>
Date: Tue Jun 9 10:59:06 2015 +0100

    Add support for leader-election
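
With native Juju leader election in place, the db sync gate no longer has to infer a leader from the cluster relation; it can ask Juju directly. A rough sketch of what that looks like (not the charm's exact code), assuming charmhelpers' is_leader() wrapper around the is-leader hook tool:

# Sketch only, not the charm's actual implementation: gate the db sync
# on Juju's native leader election instead of inferring a leader from
# the cluster relation state.
from charmhelpers.core.hookenv import is_leader, log

def db_changed():
    if not is_leader():
        # Juju guarantees at most one unit holds leadership at a time,
        # so non-leader units simply skip the schema sync.
        log('Not the Juju leader; skipping nova-manage db sync.')
        return
    migrate_database()  # runs 'nova-manage db sync' (charm helper shown earlier)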

Changed in charm-nova-cloud-controller:
status: Triaged → Fix Released