New deploys of nova-compute charm sometimes go into a relation-changed loop

Bug #1415763 reported by Paul Gear
This bug affects 2 people

Affects: nova-compute (Juju Charms Collection)
Status: In Progress
Importance: Medium
Assigned to: Edward Hope-Morley

Bug Description

On a new deploy of the nova-compute charm, 2 out of 5 deploys have resulted in nova-compute going into a loop running two different relation-changed hooks. I'll attach a log of the ceilometer subordinate charm on the same host, showing the nova-compute cycling between these two hooks. I've also confirmed that the two hooks run successfully using debug-hooks.

I suspect a race condition between nova-compute and one of the other OpenStack components. I'll also attach our juju status brief output, showing the units with failures.

tags: added: openstack
JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I wonder if this is the same issue we saw with percona-cluster bug 1389670.

If your relations are spinning, could you please try the following:

Get relid:

    juju run --unit nova-compute/0 "relation-ids shared-db"

Then:

    juju run --unit nova-compute/0 "relation-get -r <relid> - mysql/0" > 1
    juju run --unit nova-compute/0 "relation-get -r <relid> - mysql/0" > 2
    juju run --unit nova-compute/0 "relation-get -r <relid> - mysql/0" > 3
    juju run --unit nova-compute/0 "relation-get -r <relid> - mysql/0" > 4

Then diff 1 2, diff 2 3, etc. and paste the output. If it looks like the settings are changing/toggling on each run, then it is likely the same issue we are seeing with Percona and would require the same fix.
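
Spelled out, that comparison step is just:

    diff 1 2
    diff 2 3
    diff 3 4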

Also, you can actually do without the shared-db relation on nova-compute unless you are using nova-network.

Revision history for this message
Paul Gear (paulgear) wrote :

Thanks Edward. Next time it comes up, I'll make sure I gather those.

Revision history for this message
Paul Gear (paulgear) wrote :

I've encountered this issue again, and unfortunately I'm not getting past square one: the first juju run command you mentioned above gives the message "ERROR command timed out" after about 3 minutes. More than 30 minutes after attempting it, "juju-run nova-compute/2 relation-ids shared-db" and "juju-run nova-compute/2 df" (which I ran to find out whether the problem lay with the relation-ids part or with juju run in general) are still polling a domain socket that lsof is unable to identify as anything other than "socket". Is there some further troubleshooting I can do to work out what's going on with juju on this node?
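
One possible starting point (a sketch using standard Linux process tooling; <pid> is a placeholder for the PIDs found by the first command):

    pgrep -af juju-run     # find the stuck invocations and their PIDs
    sudo lsof -p <pid>     # list open descriptors, including the unidentified socket
    sudo strace -p <pid>   # show which syscall the process is blocked in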

Revision history for this message
Paul Gear (paulgear) wrote :

A further note: the relation that seemed to be experiencing the most churn was the nova-cloud-controller <-> nova-compute one; I tried a deploy without the mysql <-> nova-compute relation present and still encountered this issue.

Revision history for this message
Paul Gear (paulgear) wrote :

I downgraded to juju 1.20.14 from trusty-updates and this issue still occurs.

Revision history for this message
Paul Gear (paulgear) wrote :

Correction: 1.20.11

Revision history for this message
Paul Gear (paulgear) wrote :

I've been attempting to debug this all day, and I don't believe it's a bug in nova-compute. On machine zero, the following hooks are running constantly:

    root 20082 29682 47 06:41 ? 00:00:03 /usr/bin/python /var/lib/juju/agents/unit-nova-cloud-controller-0/charm/hooks/shared-db-relation-changed
    root 21307  2397 45 06:41 ? 00:00:02 /usr/bin/python /var/lib/juju/agents/unit-keystone-0/charm/hooks/shared-db-relation-changed
    root 22246 15530 71 06:41 ? 00:00:02 /usr/bin/python /var/lib/juju/agents/unit-neutron-api-0/charm/hooks/shared-db-relation-changed
    root 22451 16225 89 06:41 ? 00:00:03 /usr/bin/python /var/lib/juju/agents/unit-glance-0/charm/hooks/shared-db-relation-changed
    root 24414 14705 99 06:41 ? 00:00:01 /usr/bin/python /var/lib/juju/agents/unit-cinder-0/charm/hooks/shared-db-relation-changed

As an example, the keystone hook takes nearly 40 seconds to run, during which time it produces over 1500 lines of logging data. In about 96 minutes since that unit's log file was created, that hook has run 138 times. Figures for the other units:

unit-nova-cloud-controller-0: 516 times
unit-neutron-api-0: 736 times
unit-glance-0: 581 times
unit-cinder-0: 701 times
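
Counts like these can be gathered with something along these lines (a sketch; the log path and match pattern are assumptions, and it assumes the hook name appears on a fixed number of log lines per run):

    for u in keystone-0 nova-cloud-controller-0 neutron-api-0 glance-0 cinder-0; do
        printf 'unit-%s: ' "$u"
        sudo grep -c 'shared-db-relation-changed' "/var/log/juju/unit-$u.log"
    done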

It seems to me this is an issue either with juju itself, or with the mysql charm.

tags: added: cts
Changed in nova-compute (Juju Charms Collection):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Edward Hope-Morley (hopem)
milestone: none → 15.04
tags: added: backport-potential
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Ok so, firstly, I think the root cause here is the same as bug 1389670, since the mysql and percona-cluster charms share the same logic for determining and distributing allowed_units on the shared-db relation. It is currently unnecessarily noisy, and I believe you could hit this problem with any charm that uses the shared-db relation. So I am going to couple this bug with bug 1389670, which I am fixing first. Ultimately, I will be moving the duplicated code into charm-helpers.contrib so that both share the same common code.
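
For illustration only (hypothetical names, not the actual charm or charm-helpers code): the kind of noise described above arises when a relation value is rebuilt from an unordered collection on every hook run, so the serialized setting can differ even though the membership is unchanged, and each apparent change re-fires relation-changed on every related unit. Emitting the value deterministically keeps it stable:

    # Illustrative Python sketch only; these names are hypothetical.
    units = {'nova-cloud-controller/0', 'glance/0', 'cinder/0'}

    # Problematic: set iteration order is not guaranteed, so the value
    # written to the relation may differ from run to run, re-firing
    # shared-db-relation-changed on every related unit.
    allowed_units = ' '.join(units)

    # Stable: sorting first means identical membership always serializes
    # to an identical value, so no spurious hook runs are triggered.
    allowed_units = ' '.join(sorted(units))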
