removing and then re-adding the amqp relation results in waiting state for 'amqp'

Bug #1721580 reported by Tytus Kurek
This bug affects 1 person
Affects: OpenStack RabbitMQ Charm Interface
Status: In Progress
Importance: Medium
Assigned to: Unassigned

Bug Description

The designate charm from the "next" repository (cs:~openstack-charmers-next/designate-50) has been deployed in an old OpenStack environment based on legacy charm networking options, which does not use Juju spaces at all.

Just after the deployment it went into the following state:

$ juju status designate
Model Controller Cloud/Region Version
openstack bud0b1 bud0b1 2.1.3

App Version Status Scale Charm Store Rev OS Notes
designate 2.0.0 waiting 3 designate local 1 ubuntu
designate-hacluster active 3 hacluster local 1 ubuntu

Unit Workload Agent Machine Public address Ports Message
designate/3* waiting executing 20/lxd/34 100.107.3.100 9001/tcp 'shared-db' incomplete, 'amqp' incomplete, 'identity-service' incomplete
  designate-hacluster/3* active idle 100.107.3.100 Unit is ready and clustered
designate/4 waiting idle 24/lxd/11 100.107.67.56 9001/tcp 'shared-db' incomplete, 'amqp' incomplete, 'identity-service' incomplete
  designate-hacluster/4 active idle 100.107.67.56 Unit is ready and clustered
designate/5 waiting idle 25/lxd/16 100.107.130.13 9001/tcp 'shared-db' incomplete, 'amqp' incomplete, 'identity-service' incomplete
  designate-hacluster/5 active idle 100.107.130.13 Unit is ready and clustered

Machine State DNS Inst id Series AZ
20 started 100.107.3.5 aekwnf trusty FC1
20/lxd/34 started 100.107.3.100 juju-a2afdd-20-lxd-34 trusty
24 started 100.107.67.3 gf7kcw trusty FC2
24/lxd/11 started 100.107.67.56 juju-a2afdd-24-lxd-11 trusty
25 started 100.107.130.10 brts3a trusty FC3
25/lxd/16 started 100.107.130.13 juju-a2afdd-25-lxd-16 trusty

Relation Provides Consumes Type
cluster designate designate peer
dns-backend designate designate-bind regular
ha designate designate-hacluster subordinate
identity-service designate keystone regular
shared-db designate percona-cluster regular
amqp designate rabbitmq-server-tor regular
hanode designate-hacluster designate-hacluster peer

Further troubleshooting reveals that designate units are missing a lot of relation data:

root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 619 - percona-cluster/0
private-address: 100.107.3.11
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 619 - percona-cluster/1
private-address: 100.107.67.12
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 619 - percona-cluster/3
private-address: 100.107.130.40
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 624 - rabbitmq-server-tor/0
private-address: 100.107.3.54
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 624 - rabbitmq-server-tor/1
private-address: 100.107.67.41
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 624 - rabbitmq-server-tor/2
private-address: 100.107.130.11
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 620 - keystone/0
private-address: 100.107.3.25
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 620 - keystone/1
private-address: 100.107.67.18
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 620 - keystone/3
private-address: 100.107.130.29

On the other side of the relation it looks as follows:

juju debug-hooks percona-cluster/0

root@juju-a2afdd-27-lxd-0:/var/lib/juju/agents/unit-percona-cluster-0/charm# relation-get -r 619 - designate/3
designate_database: designate
designate_hostname: 100.107.3.100
designate_username: designate
dpm_database: dpm
dpm_hostname: 100.107.3.100
dpm_username: dpm
private-address: 100.107.3.100

juju debug-hooks rabbitmq-server-tor/0

root@juju-a2afdd-29-kvm-8:/var/lib/juju/agents/unit-rabbitmq-server-tor-0/charm# relation-get -r 624 - designate/3
private-address: 100.107.3.100
username: designate
vhost: openstack

juju debug-hooks keystone/0

root@juju-a2afdd-18-lxd-2:/var/lib/juju/agents/unit-keystone-0/charm# relation-get -r 620 - designate/3
admin_url: http://100.86.0.11:9001
internal_url: http://100.86.0.11:9001
private-address: 100.107.3.100
public_url: http://100.86.0.11:9001
region: RegionOne
service: designate

Also, further debugging reveals that some of the recently implemented patches don't work in this environment. For example, "database.access_network()" returns "None" even though access-network is configured:

root@juju-a2afdd-29-kvm-8:/var/lib/juju/agents/unit-rabbitmq-server-tor-0/charm# cat reactive/designate_handlers.py

...
@reactive.when('shared-db.connected')
@reactive.when_not('shared-db.setup')
def setup_database(database):
    """Send request designate accounts and dbs"""
    hostname = None
    print (database.access_network())
    if database.access_network():
        hostname = ip.get_address_in_network(database.access_network())
    database.configure('designate', 'designate', prefix='designate',
                       hostname=hostname)
    database.configure('dpm', 'dpm', prefix='dpm',
                       hostname=hostname)
    if database.base_data_complete():
        reactive.set_state('shared-db.setup')
...

root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# ./hooks/shared-db-relation-joined
None

$ juju config percona-cluster access-network
100.86.0.0/20

So this may be a regression. On the other hand, it works fine in another environment which was recently deployed from scratch.
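For context, the handler shown above only uses this value to publish a hostname that sits on percona's access network. A minimal sketch of that flow, assuming the charm's usual charmhelpers "ip" alias and illustrative values (not taken from this environment):

# Illustrative sketch only - what the handler expects when access_network()
# returns the configured CIDR instead of None.
from charmhelpers.contrib.network import ip

access_network = '100.86.0.0/20'   # what percona-cluster should publish
hostname = ip.get_address_in_network(access_network)
# On a unit with an address in that range this returns e.g. '100.86.0.x',
# which is then sent back via database.configure(..., hostname=hostname).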

Attached are juju logs from all designate, percona-cluster, rabbitmq-server and keystone units.

Revision history for this message
Tytus Kurek (tkurek) wrote :

It looks like the actual problem is that the "get_remote()" function does not return the value of the "access-network" parameter. I've modified the following part of the code ("hooks/relations/mysql-shared/requires.py"):

    @hook('{requires:mysql-shared}-relation-changed')
    def changed(self):
        conv = self.conversation()
        print (self.conversation())
        print (conv.get_remote('access-network'))
        print (conv.get_remote('access_network'))

and this is the output I'm getting:

2017-10-06 11:00:41 INFO shared-db-relation-changed <charms.reactive.relations.Conversation object at 0x7f544009d080>
2017-10-06 11:00:41 INFO shared-db-relation-changed None
2017-10-06 11:00:41 INFO shared-db-relation-changed None
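For reference, the accessor the designate handler calls is presumably just a thin wrapper over the same relation data. A minimal sketch of what access_network() is expected to do in hooks/relations/mysql-shared/requires.py (the real implementation may differ):

    def access_network(self):
        # Sketch: return the access-network CIDR published by the remote
        # percona-cluster units, or None when nothing has been published yet -
        # which matches the None values printed above.
        return self.conversation().get_remote('access-network')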

Revision history for this message
Tytus Kurek (tkurek) wrote :

I can confirm that there's no regression in the designate charm, as this exact version deploys fine in another environment. It looks like there's something wrong with this specific environment.

Revision history for this message
James Page (james-page) wrote :

What's the configuration of 'access-network' on the percona-cluster charm?

Revision history for this message
Tytus Kurek (tkurek) wrote :

James,

As stated above, it's 100.86.0.0/20:

$ juju config percona-cluster access-network
100.86.0.0/20

Revision history for this message
Tytus Kurek (tkurek) wrote :

I've just redeployed in another environment with the exact same charm versions of designate, keystone, percona-cluster and rabbitmq-server, and it ended up with "'identity-service' incomplete", so the keystone charm version may at least be a partial root cause.

Revision history for this message
Tytus Kurek (tkurek) wrote :

I've redeployed one more time in the other environment and it deployed successfully. When I was redeploying the first time, keystone was not yet related to percona-cluster at the time designate tried to relate to it. So maybe the issue is that if the relation fails to complete the first time, it never completes?

Revision history for this message
James Page (james-page) wrote :

Relation ordering should not impact the final state of the deployment - hook execution is deferred in keystone until the db is configured, at which point designate will get all the information it needs.

Revision history for this message
Tytus Kurek (tkurek) wrote :

I did some debugging, and on the percona-cluster side it fails at the beginning of the "shared-db-relation-changed" hook:

@hooks.hook('shared-db-relation-changed')
def shared_db_changed(relation_id=None, unit=None):
    #log("DEBUG: shared_db_changed")
    if not seeded():
        print ("TYTUS 1")
        log("Percona cluster not yet bootstrapped - deferring shared-db rel "
            "until bootstrapped", DEBUG)
        return

The following is being logged:

2017-10-09 14:12:11 INFO juju.worker.uniter resolver.go:121 no operations in progress; waiting for changes
2017-10-09 14:12:11 INFO juju.worker.uniter.relation relations.go:431 joining relation "designate:shared-db percona-cluster:shared-db"
2017-10-09 14:12:11 INFO juju.worker.uniter.relation relations.go:467 joined relation "designate:shared-db percona-cluster:shared-db"
2017-10-09 14:12:11 INFO juju.worker.uniter resolver.go:121 no operations in progress; waiting for changes
2017-10-09 14:12:24 INFO juju.worker.uniter resolver.go:121 no operations in progress; waiting for changes
2017-10-09 14:12:24 INFO juju.worker.uniter.operation runhook.go:116 skipped "shared-db-relation-joined" hook (missing)
2017-10-09 14:12:24 INFO juju.worker.uniter resolver.go:121 no operations in progress; waiting for changes
2017-10-09 14:12:29 INFO juju-log shared-db:628: Sufficient number of peer units to form cluster 3
2017-10-09 14:12:31 INFO juju-log shared-db:628: Sufficient number of peer units to form cluster 3
2017-10-09 14:12:32 INFO shared-db-relation-changed * Percona XtraDB Cluster up and running
2017-10-09 14:12:32 INFO juju-log shared-db:628: Unit is ready
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO shared-db-relation-changed TYTUS 1
2017-10-09 14:12:34 INFO juju.worker.uniter.operation runhook.go:113 ran "shared-db-relation-changed" hook

However, I cannot see the "Percona cluster not yet bootstrapped" message in the log. Also, percona-cluster is bootstrapped:

percona-cluster/0 active executing 27/lxd/0 100.107.3.11 3306/tcp (update-status) Unit is ready
  check-mk-agent/548 ...


Revision history for this message
Tytus Kurek (tkurek) wrote :

It looks like "seeded()" function returns "False" for some reason:

def seeded():
    ''' Check whether service unit is already seeded '''
    print ("TYTUS 1")
    print (resolve_data_dir())
    print (os.path.exists(SEEDED_MARKER.format(data_dir=resolve_data_dir())))
    print ("TYTUS 2")
    return os.path.exists(SEEDED_MARKER.format(data_dir=resolve_data_dir()))

log:

2017-10-09 14:29:17 INFO update-status TYTUS 1
2017-10-09 14:29:17 INFO update-status /var/lib/mysql
2017-10-09 14:29:17 INFO update-status False
2017-10-09 14:29:17 INFO update-status TYTUS 2

Revision history for this message
Tytus Kurek (tkurek) wrote :

Found it!

percona-cluster/0:

root@juju-a2afdd-27-lxd-0:~# ls -l /var/lib/mysql/seeded
-rw-r--r-- 1 root root 4 Mar 27 2017 /var/lib/mysql/seeded

percona-cluster/1:

root@juju-a2afdd-31-lxd-1:~# ls -l /var/lib/mysql/seeded
ls: cannot access /var/lib/mysql/seeded: No such file or directory

percona-cluster/3:

root@juju-a2afdd-75-lxd-2:~# ls -l /var/lib/mysql/seeded
ls: cannot access /var/lib/mysql/seeded: No such file or directory

I have no idea why the "seeded" files are absent on percona-cluster units 1 and 3.

Revision history for this message
Tytus Kurek (tkurek) wrote :

So it looks like the issue is environment-specific, but I can still see two things which could be improved:

1) The "Percona cluster not yet bootstrapped - deferring shared-db rel until bootstrapped" message is not being logged.

2) Using marker files is not a reliable way to test whether the cluster has been bootstrapped, as they can be accidentally removed (see the sketch below).
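As a hedged illustration of point 2, the charm could ask the cluster itself whether the node is synced, instead of (or in addition to) checking a marker file. A rough sketch, assuming root credentials are available in /root/.my.cnf (hypothetical path, not taken from the charm):

import subprocess

def cluster_synced():
    # Hypothetical alternative check: ask Percona XtraDB Cluster directly
    # whether this node is a synced member, rather than trusting a marker file.
    try:
        out = subprocess.check_output(
            ['mysql', '--defaults-file=/root/.my.cnf', '-N', '-B',
             '-e', "SHOW STATUS LIKE 'wsrep_ready';"])
        return out.decode().split()[-1] == 'ON'
    except (subprocess.CalledProcessError, OSError):
        return False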

Revision history for this message
Tytus Kurek (tkurek) wrote :

After re-creating those files, both the percona-cluster and keystone relations are no longer in the "incomplete" state. I'll keep debugging the rabbitmq-server issue.

Revision history for this message
Tytus Kurek (tkurek) wrote :

rabbitmq-server fails at the following piece of code:

    if not rabbit.leader_node_is_ready():
        return

Again, adding a debug message here would be a good idea for the future.
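A minimal sketch of what such a message might look like, reusing the log/DEBUG helpers the charm already imports (illustrative only, not an actual patch):

    if not rabbit.leader_node_is_ready():
        # Hypothetical added log line; the current code returns silently.
        log("Leader node is not yet ready - deferring amqp relation handling",
            DEBUG)
        return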

Revision history for this message
Tytus Kurek (tkurek) wrote :

The following part of "rabbit_utils.py" seems to be causing the issues:

                if not relation_get(attribute='clustered',
                                    rid=relation_id,
                                    unit=remote_unit):
                    log("{} is not yet clustered".format(remote_unit),
                        DEBUG)
                    return False

Again, for some reason the log message is not being logged.

Revision history for this message
Tytus Kurek (tkurek) wrote :

It looks like the rabbitmq-server-tor/2 unit is missing the "clustered" relation data from the rabbitmq-server-tor/0 unit:

root@juju-a2afdd-75-kvm-3:/var/lib/juju/agents/unit-rabbitmq-server-tor-2/charm# relation-get -r 384 - rabbitmq-server-tor/0
cookie: VWFZLEGPTBKSBWZLOXCH
hostname: juju-a2afdd-29-kvm-8
private-address: 100.86.0.196

although the status being reported is as follows:

rabbitmq-server-tor/0 active executing 29/kvm/8 100.107.3.54 5672/tcp Unit is ready and clustered
  check-mk-agent/576 active idle 100.107.3.54 ready
  nrpe/125 unknown idle 100.107.3.54
rabbitmq-server-tor/1 active executing 33/kvm/4 100.107.67.41 5672/tcp Unit is ready and clustered
  check-mk-agent/574 active idle 100.107.67.41 ready
  nrpe/157 unknown idle 100.107.67.41
rabbitmq-server-tor/2* active idle 75/kvm/3 100.107.130.11 5672/tcp Unit is ready and clustered
  check-mk-agent/575 active idle 100.107.130.11 ready
  nrpe/158 unknown idle 100.107.130.11

which is misleading.

Revision history for this message
Tytus Kurek (tkurek) wrote :

I managed to fix the issue with the "clustered" relation data on the rabbitmq-server-tor/0 unit by manually setting it with the "relation-set" command, and there is some progress:

root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 656 - rabbitmq-server-tor/0
private-address: 100.107.3.54
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 656 - rabbitmq-server-tor/1
private-address: 100.107.67.41
root@juju-a2afdd-20-lxd-34:/var/lib/juju/agents/unit-designate-3/charm# relation-get -r 656 - rabbitmq-server-tor/2
hostname: 100.86.0.183
private-address: 100.86.0.183

I'll keep debugging it.

Revision history for this message
Tytus Kurek (tkurek) wrote :

It looks like, for some reason, the designate charm is now not exposing the required relation data:

root@juju-a2afdd-75-kvm-3:/var/lib/juju/agents/unit-rabbitmq-server-tor-2/charm# relation-get -r 656 - designate/3
private-address: 100.107.3.100
root@juju-a2afdd-75-kvm-3:/var/lib/juju/agents/unit-rabbitmq-server-tor-2/charm# relation-get -r 656 - designate/4
private-address: 100.107.67.56
root@juju-a2afdd-75-kvm-3:/var/lib/juju/agents/unit-rabbitmq-server-tor-2/charm# relation-get -r 656 - designate/5
private-address: 100.107.130.13

Revision history for this message
James Page (james-page) wrote :

Tytus

Please can you confirm which version of the designate charm you're using?

Revision history for this message
Tytus Kurek (tkurek) wrote :

The issue was finally solved. I'm not sure whether it helped, but I noticed that the following state remained set even though designate was no longer related to rabbitmq-server:

'amqp.requested-access': None

I removed the state with the following commands:

juju run --unit designate/3 '. /var/lib/juju/agents/unit-designate-3/.venv/bin/activate;charms.reactive remove_state amqp.requested-access'
juju run --unit designate/4 '. /var/lib/juju/agents/unit-designate-4/.venv/bin/activate;charms.reactive remove_state amqp.requested-access'
juju run --unit designate/5 '. /var/lib/juju/agents/unit-designate-5/.venv/bin/activate;charms.reactive remove_state amqp.requested-access'

and it got deployed.

Revision history for this message
James Page (james-page) wrote :

Hmm, but why was that state set? This still smells like a bug to me.

Revision history for this message
Tytus Kurek (tkurek) wrote :

From my point of view the bug can be marked as invalid; however, I would suggest taking into account the feedback I provided regarding the log messages not being displayed, the percona-cluster state checking, and the above comment regarding rabbitmq-server.

Revision history for this message
James Page (james-page) wrote :

I'm guessing you removed and re-added the amqp relation maybe?

Revision history for this message
Tytus Kurek (tkurek) wrote :

James,

I'm using the following version:

cs:~openstack-charmers-next/designate-50

Yes, the sequence was as follows:

1) I removed the amqp relation
2) I noticed the state was still set and removed it manually
3) I re-added the relation

Unfortunately, I have no time right now to test whether it's reproducible, but I believe I've seen it before.

Revision history for this message
James Page (james-page) wrote :

There is a state/flags issue if you remove and then re-add the amqp relation with rabbitmq-server.

The state is not removed, so the handler never gets re-executed when the relation is re-created.
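A minimal sketch of the kind of fix this implies for the rabbitmq requires interface - clearing the stale flag when the relation goes away so that re-adding it re-triggers the handler (hook pattern and flag names are assumptions based on this report; the actual review may differ):

    @hook('{requires:rabbitmq}-relation-{broken,departed}')
    def departed(self):
        conv = self.conversation()
        # Hypothetical: drop the per-relation flags, including the one found
        # stuck above, so a re-added relation starts from a clean state.
        conv.remove_state('{relation_name}.requested-access')
        conv.remove_state('{relation_name}.connected')
        conv.remove_state('{relation_name}.available')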

Changed in charm-designate:
status: New → Confirmed
summary: - 'shared-db' incomplete, 'amqp' incomplete, 'identity-service' incomplete
+ removing and then re-adding the amqp relation results in waiting state
+ for 'amqp'
Changed in charm-designate:
importance: Undecided → High
importance: High → Medium
status: Confirmed → Triaged
Tytus Kurek (tkurek)
affects: charm-designate → charm-interface-rabbitmq
Changed in charm-interface-rabbitmq:
status: Triaged → In Progress
assignee: nobody → Tytus Kurek (tkurek)
Revision history for this message
Tytus Kurek (tkurek) wrote :

This is also a bug in the "charms.reactive" library - the state is not removed if its value is "None". I have patches for both and will submit them shortly.
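A hedged reproduction sketch of the behaviour being described (purely illustrative; the exact code path addressed by the patches may differ):

from charms.reactive import set_state, remove_state, is_state

# Setting a flag without an explicit value stores None as its value ...
set_state('amqp.requested-access')
# ... and, per this report, remove_state() then fails to clear it:
remove_state('amqp.requested-access')
print(is_state('amqp.requested-access'))  # reportedly still True before the fix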

Revision history for this message
Tytus Kurek (tkurek) wrote :

I've opened a pull request against the "charms.reactive" library:

https://github.com/juju-solutions/charms.reactive/pull/164

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-interface-rabbitmq (master)

Fix proposed to branch: master
Review: https://review.openstack.org/550770

Tytus Kurek (tkurek)
Changed in charm-interface-rabbitmq:
assignee: Tytus Kurek (tkurek) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-interface-rabbitmq (master)

Change abandoned by Tytus Kurek (<email address hidden>) on branch: master
Review: https://review.openstack.org/550770
