applications missing relation data from keystone

Bug #1796923 reported by Jason Hobbs
Affects                                Status        Importance  Assigned to  Milestone
OpenStack Keystone Charm               Invalid       Undecided   Unassigned
OpenStack Nova Cloud Controller Charm  Fix Released  Critical    Unassigned

Bug Description

In a deploy of OpenStack, the neutron-gateway and nova-cloud-controller applications were left in the 'waiting' state, waiting for relation data from keystone:

For example, neutron-gateway/0 was stuck forever with this repeated message:

2018-10-08 17:06:45 INFO juju-log network-service relation's interface, quantum-network-service, is related awaiting the following data from the relationship: keystone_host, service_port, auth_port, service_tenant, service_username, service_password.
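For reference, the check behind that message is roughly the following. This is an illustrative sketch only, not the actual charm-helpers implementation; the helper names are from charmhelpers.core.hookenv and the key list is taken from the log line above.

    from charmhelpers.core.hookenv import related_units, relation_get, relation_ids

    # Keys the quantum-network-service interface expects, per the log above.
    EXPECTED_KEYS = ['keystone_host', 'service_port', 'auth_port',
                     'service_tenant', 'service_username', 'service_password']

    def missing_network_service_data():
        """Return the expected keys that no related unit has provided yet."""
        received = set()
        for rid in relation_ids('network-service'):
            for unit in related_units(rid):
                for key in EXPECTED_KEYS:
                    if relation_get(key, unit=unit, rid=rid):
                        received.add(key)
        return [k for k in EXPECTED_KEYS if k not in received]

    # While this list is non-empty the unit keeps logging the message above
    # and stays in the 'waiting' state.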

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Tagged cdo-release-blocker since this appears to be a regression. We have only seen this with the 'next' charms.

tags: added: cdo-release-blocker
Revision history for this message
David Ames (thedac) wrote :

It seems this is nova-cloud-controller not completing its relationships:

2018-10-08 13:51:37 INFO juju-log cluster:34: database relation's interface, shared-db, is related awaiting the following data from the relationship: database_password, database_host.
2018-10-08 13:51:52 INFO juju-log ha:64: neutron-api relations's interface, neutron-api, is related but has no units in the relation.
2018-10-08 13:51:53 INFO juju-log ha:64: messaging relation's interface, amqp, is related awaiting the following data from the relationship: rabbitmq_password.
2018-10-08 13:51:53 INFO juju-log ha:64: identity relation's interface, identity-service, is related awaiting the following data from the relationship: admin_user, service_port, admin_tenant_name, auth_port, admin_password, auth_host, service_host.
2018-10-08 13:51:53 INFO juju-log ha:64: database relation's interface, shared-db, is related awaiting the following data from the relationship: database_password, database_host.

      nova-cloud-controller/0:
        workload-status:
          current: waiting
          message: 'Incomplete relations: neutron-api, database, identity, messaging'
          since: 08 Oct 2018 13:51:07Z
        juju-status:
          current: executing
          message: running ha-relation-changed hook

I do not think this is keystone related.
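For context, the 'waiting' workload status above is the standard charm pattern for incomplete relations. A minimal sketch of that pattern (not the nova-cloud-controller charm's actual code), using status_set from charmhelpers.core.hookenv:

    from charmhelpers.core.hookenv import status_set

    def assess_status(incomplete_relations):
        """incomplete_relations: names of relations still missing data."""
        if incomplete_relations:
            status_set('waiting', 'Incomplete relations: {}'.format(
                ', '.join(incomplete_relations)))
        else:
            status_set('active', 'Unit is ready')

    # e.g. assess_status(['neutron-api', 'database', 'identity', 'messaging'])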

Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 18.11
Changed in charm-keystone:
status: New → Invalid
Revision history for this message
Liam Young (gnuoy) wrote :

I'm not sure why but the nova-cloud-controller log entries stop on or before 2018-10-08 13:53:18:

$ tail -n1 nova-cloud-controller_*/var/log/juju/unit-nova-cloud-controller-*.log | grep -oE '^.*DEBUG'
2018-10-08 13:51:53 DEBUG
2018-10-08 13:52:03 DEBUG
2018-10-08 13:53:18 DEBUG

However, mysql does not seem to become clustered until 13:51:05

$ grep "All 3" mysql_2/var/log/juju/unit-mysql-2.log | head -1
2018-10-08 13:51:05 DEBUG juju-log cluster:31: All 3 percona units reporting clustered

and the hook that detects the cluster doesn't seem to complete until about 2018-10-08 13:54:14. This means that the relation data is not propagated to the nova-cloud-controllers until after 13:54:14.

On the surface it looks like the decision to abort the deploy was taken prematurely. Perhaps the logs were truncated before they were uploaded?

Could you please provide juju logs for the nova-cloud-controllers for the period after 13:54? Alternatively, could you try extending the timeout for the deploy? (The nova-cloud-controllers only seem to have had 30 minutes between running their install hooks and the end of the logs.)

Changed in charm-nova-cloud-controller:
status: Triaged → Incomplete
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm seeing this a lot with the next charms, i.e. I deploy OpenStack non-HA but with keystone deployed as HA, and I regularly have at least one service that does not get its relation data set from keystone. It appears that keystone was waiting for hacluster to finish, which took a while (slow env), and it never goes back and updates (all) the relations that need updating, thus leaving them with "Incomplete relations: identity".
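The behaviour being described amounts to re-advertising the data on every existing relation once the blocking step finishes. An illustrative sketch of that pattern (not the keystone charm's actual code, and the settings dict is hypothetical), using helpers from charmhelpers.core.hookenv:

    from charmhelpers.core.hookenv import relation_ids, relation_set

    def readvertise_identity_data(settings):
        """Push identity-service data to all existing relations so units
        that related before clustering finished are not left waiting."""
        for rid in relation_ids('identity-service'):
            relation_set(relation_id=rid, relation_settings=settings)

    # e.g. after hacluster reports the cluster is complete:
    # readvertise_identity_data({'service_host': vip, 'auth_host': vip, ...})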

Revision history for this message
Liam Young (gnuoy) wrote :

Ed, the existing crash dump provided for this bug makes it pretty clear that the data was missing because the db clustering hadn't finished. That is not to say you are not hitting a closely related bug, or that the deploy already referenced in this bug wouldn't eventually have moved on to the issue you're seeing.

Please can you provide a juju crashdump? In the meantime I will attempt to reproduce the issue.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Yep, on its way...

Liam Young (gnuoy)
Changed in charm-nova-cloud-controller:
assignee: nobody → Liam Young (gnuoy)
status: Incomplete → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

For clarity, I've created a new bug, https://bugs.launchpad.net/charm-keystone/+bug/1801754, for the keystone goal-state issue.

Changed in charm-nova-cloud-controller:
status: Confirmed → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Liam,

Those are the complete logs, collected at 17:06:43Z.

There are no more logs on any of the units on the nova-cloud-controller machines, and juju status says those nova-cloud-controller units have been executing hooks since around 13:53.

We didn't truncate any logs; they just weren't written. Is it possible the code is stuck in a loop in the hook?

Changed in charm-nova-cloud-controller:
status: Incomplete → New
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Setting as incomplete rather than new. If, after the keystone goal state fixes, this specific issue is observed again, please attach new logs and reset the bug state to new.

Changed in charm-nova-cloud-controller:
status: New → Incomplete
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 18.11 → 19.04
Revision history for this message
Chris Gregan (cgregan) wrote :

Hit this on Nov 15; logs attached above.

Changed in charm-nova-cloud-controller:
status: Incomplete → New
James Page (james-page)
Changed in charm-nova-cloud-controller:
importance: Critical → Undecided
Revision history for this message
Andrew McLeod (admcleod) wrote :

Looking at the logs from Chris Gregan, the latest 'since' time in the juju_status.yaml is 11:01, with most of the 'since' times occurring between 10:50 and 10:55. If the juju_status.yaml was captured at the same time the crashdump was run (as is usually the case), that would indicate a controller problem: multiple units were still running update-status well beyond 11:01, e.g.

https://pastebin.canonical.com/p/vKf6pMKzNk/

I assume something broke on the controller node at approx. 11:01. I guess we need to see the controller logs if this happens again?
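A rough way to spot this from a crashdump (assuming a juju_status.yaml with the layout shown earlier in this bug; PyYAML required; only principal units are checked):

    import yaml  # PyYAML

    def stuck_units(path='juju_status.yaml'):
        """List units whose agent is still 'executing', with the time their
        workload-status last changed."""
        with open(path) as f:
            status = yaml.safe_load(f)
        stuck = []
        for app in status.get('applications', {}).values():
            for name, unit in (app.get('units') or {}).items():
                if unit.get('juju-status', {}).get('current') == 'executing':
                    stuck.append((name, unit.get('workload-status', {}).get('since')))
        return stuck

    for name, since in stuck_units():
        print('{} still executing; workload-status unchanged since {}'.format(name, since))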

Revision history for this message
Liam Young (gnuoy) wrote :

I think we need to be a bit careful about attributing deployment failures to this bug, as there are many reasons why a charm may not have got the full relation data from a charm it is related to. I think the latest crashdump is yet another bug (the 3rd distinct bug attributed to this bug). Anyway, tl;dr: the latest crash dump looks like a duplicate of Bug #1794850.

In the latest crashdump the incomplete relations, as reported in the juju_status.yaml, show both neutron-gateways stuck with "Incomplete relations: network-service", which is consistent with them lacking information from nova-cloud-controller. Looking at the nova-cloud-controller statuses (and their subordinates) shows them still running ha setup hooks. As admcleod observed, the units of the nova-cloud-controller charm appear to get stuck just before 11:00am. Taking nova-cloud-controller/0 (and its subordinates filebeat/66, hacluster-nova/2, nrpe-container/45 and telegraf/67) as an example illustrates this point: https://pastebin.canonical.com/p/Cd4XnKwdsj/. So, why are they stuck?

Only the leader hacluster unit should be setting up the hacluster configuration; however, two seem to be doing it at exactly the same time:

$ grep -E "Configuring Group.*nova" nova-cloud-controller_*/var/log/juju/unit-hacluster-nova*log
nova-cloud-controller_0/var/log/juju/unit-hacluster-nova-2.log:2018-11-15 10:55:33 DEBUG juju-log ha:64: Configuring Groups: {'grp_nova_vips': 'res_nova_eth0_vip res_nova_eth1_vip'}
nova-cloud-controller_2/var/log/juju/unit-hacluster-nova-1.log:2018-11-15 10:55:33 DEBUG juju-log ha:64: Configuring Groups: {'grp_nova_vips': 'res_nova_eth0_vip res_nova_eth1_vip'}

This should not be happening, as in master this setup is gated by is_leader(). However, the deployment the crashdump is from is using cs:hacluster-49, which is missing commit 6b6a1776. That commit updates the charm from using the old and buggy 'oldest_peer' check to using the Juju is-leader primitive. So I think this should be fixed with the latest charms.
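For reference, the gating in question looks roughly like this. This is a sketch of the pattern, not the hacluster charm's exact code; configure_cluster_resources is a hypothetical stand-in for the corosync/pacemaker setup, and is_leader/log come from charmhelpers.core.hookenv.

    from charmhelpers.core.hookenv import is_leader, log

    def ha_relation_changed():
        # With commit 6b6a1776, only the Juju leader configures the cluster;
        # the old 'oldest_peer' heuristic could elect two "leaders" at once.
        if not is_leader():
            log('Not the leader; deferring resource/group configuration')
            return
        configure_cluster_resources()

    def configure_cluster_resources():
        """Hypothetical placeholder for the actual corosync/pacemaker setup."""
        pass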

Changed in charm-nova-cloud-controller:
assignee: Liam Young (gnuoy) → nobody
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 19.04 → 19.07
Revision history for this message
Pen Gale (pengale) wrote :

Per Liam's comment, this should be fixed by the latest charm release. Please reopen if the problem recurs.

Changed in charm-nova-cloud-controller:
status: New → Fix Released
importance: Undecided → Critical