Juju 1.25.3 - infinite loop on cluster-relation-changed - cinder/apache services constantly restarted

Bug #1561927 reported by Alvaro Uria
This bug affects 9 people
Affects: cinder (Juju Charms Collection)
Status: Fix Released
Importance: High
Assigned to: Edward Hope-Morley

Bug Description

LSB: Ubuntu 14.04.4 LTS
openstack: cloud:trusty-liberty
cinder packages: 2:7.0.1-0ubuntu1~cloud0
cinder charm: lp:charms/trusty/cinder;revno=106
Juju: 1.25.3.1
num_units: 3
related to hacluster charm (which remains idle)

Symptoms:
All three units constantly run cluster-relation-changed, which restarts all cinder upstart jobs as well as apache2.
If one unit is stopped, the other two stop looping. Restarting the stopped unit and marking it resolved in Juju starts the loop again on all three.

Temporary workaround to end the loop (applied only on cinder/1):
"""
@hooks.hook('cluster-relation-changed',
            'cluster-relation-departed')
@restart_on_change(restart_map(), stopstart=True)
def cluster_changed():
    #check_db_initialised()
    #CONFIGS.write_all()
    pass
"""

This temporary workaround was applied at 13:08 (see attached 20160325-unit-cinder-1.log). Once all three units had settled, I rolled cluster_changed() back to the original code (uncommenting check_db_initialised and CONFIGS.write_all) at 13:09.

Please let me know if you need further details.

Revision history for this message
Alvaro Uria (aluria) wrote :

cinder-0 attachment shows juju status when all three units are in the loop.

juju status-history cinder/0 shows transition between states when I was stopping peer units (showing active, as it stops looping) or cinder/0 unit itself (showing error state).

description: updated
Revision history for this message
Alvaro Uria (aluria) wrote :

Hi,

This is happening on three different ha+liberty+juju 1.25.3 deployments.

I made cluster_changed() "pass" until units settled (less than a minute) and restored cluster_changed() code.

Cheers,
-Alvaro.

Revision history for this message
Jill Rouleau (jillrouleau) wrote :

This behaviour manifests in these clouds every few days. Changing cluster_changed() to "pass", then restoring the original code temporarily resolves things but it always comes back. Are there additional logs, diagnostics, or troubleshooting we can provide?

Revision history for this message
Robert Clark (returntoreptar) wrote :

This behavior is manifesting in my deployment, and as of right now I have applied no fixes to it. I would be happy to provide any logs that would help.

Revision history for this message
Alvaro Uria (aluria) wrote :

Hi,

We're seeing this behaviour every week on different clouds. Would permanently leaving "pass" in the cluster_changed hook be OK?

Thank you,
-Alvaro.

James Page (james-page)
Changed in cinder (Juju Charms Collection):
milestone: none → 16.04
assignee: nobody → James Page (james-page)
James Page (james-page)
Changed in cinder (Juju Charms Collection):
milestone: 16.04 → 16.07
Revision history for this message
Junien F (axino) wrote :

Hi,

I'm impacted by this bug as well, with 3 cinder units. I believe this is due to the following code :
https://paste.ubuntu.com/16202202/

Each unit sets CINDER_DB_INIT_RKEY and CINDER_DB_INIT_ECHO_RKEY in the relation, which in turn triggers the "cluster-relation-changed" hook on the other units, which change these settings in the relation again, and so on.

I'm not sure what the purpose of these settings is, but I guess the best solution is to use Juju's leader mechanism and have only the leader instruct its peers to restart.
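To illustrate the suspected ping-pong, here is a toy simulation. It is only a sketch and not the charm's code: the function run_hooks, the unit names, and the setting values are all made up for illustration. It assumes, as Juju does, that a relation-changed hook fires on the peers only when a unit's relation settings actually change. In the first mode every unit writes a fresh value each time its hook runs, so the hooks never settle. In the second mode only the leader writes, and it writes a stable value, so the hooks stop after one round.

```python
from collections import deque
from itertools import count


def run_hooks(units, leader_only, max_runs=60):
    """Toy model: return how many hook runs happen before things settle.

    The result is capped at max_runs, so hitting the cap means "still looping".
    """
    settings = {unit: None for unit in units}  # one relation setting per unit
    leader = units[0]                          # assumption: first unit is leader
    pending = deque(units)                     # hooks waiting to run
    fresh = count()                            # source of ever-changing values
    runs = 0

    while pending and runs < max_runs:
        unit = pending.popleft()
        runs += 1

        if leader_only:
            if unit != leader:
                continue                       # non-leaders only read
            value = "db-initialised"           # stable value, written once
        else:
            value = "echo-%d" % next(fresh)    # new value on every hook run

        # A peer's hook is queued only when the setting really changed.
        if settings[unit] != value:
            settings[unit] = value
            pending.extend(p for p in units if p != unit)

    return runs


if __name__ == "__main__":
    units = ["cinder/0", "cinder/1", "cinder/2"]
    print("every unit echoes:", run_hooks(units, leader_only=False))
    print("leader only:      ", run_hooks(units, leader_only=True))
```

With three units, the echo mode runs until it hits the max_runs cap, while the leader-only mode settles after a handful of hook runs.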

Thank you !

James Page (james-page)
Changed in cinder (Juju Charms Collection):
milestone: 16.07 → 16.10
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I have just upgraded cinder to 16.10 and am seeing all cinder processes restarted every 10s.

Changed in cinder (Juju Charms Collection):
importance: Undecided → High
milestone: 16.10 → 17.01
tags: added: openstack sts
Changed in cinder (Juju Charms Collection):
assignee: James Page (james-page) → Edward Hope-Morley (hopem)
Revision history for this message
Edward Hope-Morley (hopem) wrote :

OK, I think I've found the problem: it looks like the db init check code does not tolerate a leader switch. I'll have a patch up shortly.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/402954

Changed in cinder (Juju Charms Collection):
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder (master)

Reviewed: https://review.openstack.org/402954
Committed: https://git.openstack.org/cgit/openstack/charm-cinder/commit/?id=1e1000a0892046b158e104025b04d3cf53a2a1b8
Submitter: Jenkins
Branch: master

commit 1e1000a0892046b158e104025b04d3cf53a2a1b8
Author: Edward Hope-Morley <email address hidden>
Date: Fri Nov 25 16:20:05 2016 +0000

    Fix cluster relation unnecessary service restarts

    The logic introduced in commit 619ce065 to formalise database
    initialisation did not support the leader switching and re-runs
    of the shared-db relation. This resulted in extraneous service
    restarts. We avoid this by adding some extra logic around this
    code.

    Change-Id: If988331e552da930eff868abded323014fd50f04
    Closes-Bug: 1561927

Changed in cinder (Juju Charms Collection):
status: In Progress → Fix Committed
tags: added: backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-cinder (stable/16.10)

Fix proposed to branch: stable/16.10
Review: https://review.openstack.org/408050

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder (stable/16.10)

Reviewed: https://review.openstack.org/408050
Committed: https://git.openstack.org/cgit/openstack/charm-cinder/commit/?id=0f7358b258dd21a8a9567175ed298c1ffacc556d
Submitter: Jenkins
Branch: stable/16.10

commit 0f7358b258dd21a8a9567175ed298c1ffacc556d
Author: Edward Hope-Morley <email address hidden>
Date: Fri Nov 25 16:20:05 2016 +0000

    Fix cluster relation unnecessary service restarts

    The logic introduced in commit 619ce065 to formalise database
    initialisation did not support the leader switching and re-runs
    of the shared-db relation. This resulted in extraneous service
    restarts. We avoid this by adding some extra logic around this
    code.

    Closes-Bug: 1561927
    (cherry picked from commit 1e1000a0892046b158e104025b04d3cf53a2a1b8)
    Change-Id: If988331e552da930eff868abded323014fd50f04

Changed in cinder (Juju Charms Collection):
status: Fix Committed → Fix Released