applications missing relation data from keystone

Bug #1796923 reported by Jason Hobbs
Affects                                Status        Importance  Assigned to  Milestone
OpenStack Keystone Charm               Invalid       Undecided   Unassigned
OpenStack Nova Cloud Controller Charm  Fix Released  Critical    Unassigned

Bug Description

In a deploy of OpenStack, the neutron-gateway and nova-cloud-controller applications were left in the 'waiting' state, waiting for relation data from keystone:

For example, neutron-gateway/0 was stuck forever with this repeated message:

2018-10-08 17:06:45 INFO juju-log network-service relation's interface, quantum-network-service, is related awaiting the following data from the relationship: keystone_host, service_port, auth_port, service_tenant, service_username, service_password.
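For reference, the check behind that message is roughly the following. This is an illustrative sketch only, not the actual charm-helpers implementation; the helper names are from charmhelpers.core.hookenv and the key list is taken from the log line above.

    from charmhelpers.core.hookenv import related_units, relation_get, relation_ids

    # Keys the quantum-network-service interface expects, per the log above.
    EXPECTED_KEYS = ['keystone_host', 'service_port', 'auth_port',
                     'service_tenant', 'service_username', 'service_password']

    def missing_network_service_data():
        """Return the expected keys that no related unit has provided yet."""
        received = set()
        for rid in relation_ids('network-service'):
            for unit in related_units(rid):
                for key in EXPECTED_KEYS:
                    if relation_get(key, unit=unit, rid=rid):
                        received.add(key)
        return [k for k in EXPECTED_KEYS if k not in received]

    # While this list is non-empty the unit keeps logging the message above
    # and stays in the 'waiting' state.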

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Tagged cdo-release-blocker since this appears to be a regression. We have only seen this with the 'next' charms.

tags: added: cdo-release-blocker
Revision history for this message
David Ames (thedac) wrote :

It seems this is nova-cloud-controller not completing its relationships:

2018-10-08 13:51:37 INFO juju-log cluster:34: database relation's interface, shared-db, is related awaiting the following data from the relationship: database_password, database_host.
2018-10-08 13:51:52 INFO juju-log ha:64: neutron-api relations's interface, neutron-api, is related but has no units in the relation.
2018-10-08 13:51:53 INFO juju-log ha:64: messaging relation's interface, amqp, is related awaiting the following data from the relationship: rabbitmq_password.
2018-10-08 13:51:53 INFO juju-log ha:64: identity relation's interface, identity-service, is related awaiting the following data from the relationship: admin_user, service_port, admin_tenant_name, auth_port, admin_password, auth_host, service_host.
2018-10-08 13:51:53 INFO juju-log ha:64: database relation's interface, shared-db, is related awaiting the following data from the relationship: database_password, database_host.

      nova-cloud-controller/0:
        workload-status:
          current: waiting
          message: 'Incomplete relations: neutron-api, database, identity, messaging'
          since: 08 Oct 2018 13:51:07Z
        juju-status:
          current: executing
          message: running ha-relation-changed hook

I do not think this is keystone related.
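For context, the 'waiting' workload status above is the standard charm pattern for incomplete relations. A minimal sketch of that pattern (not the nova-cloud-controller charm's actual code), using status_set from charmhelpers.core.hookenv:

    from charmhelpers.core.hookenv import status_set

    def assess_status(incomplete_relations):
        """incomplete_relations: names of relations still missing data."""
        if incomplete_relations:
            status_set('waiting', 'Incomplete relations: {}'.format(
                ', '.join(incomplete_relations)))
        else:
            status_set('active', 'Unit is ready')

    # e.g. assess_status(['neutron-api', 'database', 'identity', 'messaging'])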

Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 18.11
Changed in charm-keystone:
status: New → Invalid
Revision history for this message
Liam Young (gnuoy) wrote :

I'm not sure why but the nova-cloud-controller log entries stop on or before 2018-10-08 13:53:18:

$ tail -n1 nova-cloud-controller_*/var/log/juju/unit-nova-cloud-controller-*.log | grep -oE '^.*DEBUG'
2018-10-08 13:51:53 DEBUG
2018-10-08 13:52:03 DEBUG
2018-10-08 13:53:18 DEBUG

However, mysql does not seem to become clustered until 13:51:05

$ grep "All 3" mysql_2/var/log/juju/unit-mysql-2.log | head -1
2018-10-08 13:51:05 DEBUG juju-log cluster:31: All 3 percona units reporting clustered

and the hook that detects the cluster doesn't seem to complete until about 2018-10-08 13:54:14. This means that the relation data is not propagated to the nova-cloud-controllers until after 13:54:14.

On the surface it looks like the decision to abort the deploy was taken prematurely. Perhaps the logs were truncated before they were uploaded?

Could you please provide juju logs for the nova-cloud-controllers for the period after 13:54? Alternatively, could you try extending the timeout for the deploy? (The nova-cloud-controllers only seem to have had 30 minutes between running their install hooks and the end of the logs.)

Changed in charm-nova-cloud-controller:
status: Triaged → Incomplete
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm seeing this a lot with the next charms, i.e. I deploy OpenStack non-HA but with keystone deployed as HA, and I regularly have at least one service that does not get its relation data set from keystone. It appears that keystone was waiting for hacluster to finish, which took a while (slow env), and it never goes back and updates (all) the relations that need updating, thus leaving them with "Incomplete relations: identity".
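The behaviour being described amounts to re-advertising the data on every existing relation once the blocking step finishes. An illustrative sketch of that pattern (not the keystone charm's actual code, and the settings dict is hypothetical), using helpers from charmhelpers.core.hookenv:

    from charmhelpers.core.hookenv import relation_ids, relation_set

    def readvertise_identity_data(settings):
        """Push identity-service data to all existing relations so units
        that related before clustering finished are not left waiting."""
        for rid in relation_ids('identity-service'):
            relation_set(relation_id=rid, relation_settings=settings)

    # e.g. after hacluster reports the cluster is complete:
    # readvertise_identity_data({'service_host': vip, 'auth_host': vip, ...})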

Revision history for this message
Liam Young (gnuoy) wrote :

Ed, the existing crash dump provided for this bug makes it pretty clear that the data was missing because the db clustering hadn't finished. That is not to say you are not hitting a closely related bug, or that the deploy already referenced in this bug wouldn't eventually have moved on to the issue you're seeing.

Please can you provide a juju crashdump? In the meantime I will attempt to reproduce the issue.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Yep, on its way...

Liam Young (gnuoy)
Changed in charm-nova-cloud-controller:
assignee: nobody → Liam Young (gnuoy)
status: Incomplete → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

For clarity, I've created a new bug, https://bugs.launchpad.net/charm-keystone/+bug/1801754, for the keystone goal-state issue.

Changed in charm-nova-cloud-controller:
status: Confirmed → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Liam,

Those are the complete logs, collected at 17:06:43Z.

There are no more logs on any of the units on the nova-cloud-controller machines, and juju status says those nova-cloud-controller units have been executing hooks since around 13:53.

We didn't truncate any logs; they just weren't written. Is it possible the code is stuck in a loop in the hook?

Changed in charm-nova-cloud-controller:
status: Incomplete → New
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Setting as incomplete rather than new. If, after the keystone goal state fixes, this specific issue is observed again, please attach new logs and reset the bug state to new.

Changed in charm-nova-cloud-controller:
status: New → Incomplete
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 18.11 → 19.04
Revision history for this message
Chris Gregan (cgregan) wrote :

Hit this on Nov 15; logs attached above.

Changed in charm-nova-cloud-controller:
status: Incomplete → New
James Page (james-page)
Changed in charm-nova-cloud-controller:
importance: Critical → Undecided
Revision history for this message
Andrew McLeod (admcleod) wrote :

Looking at the logs from Chris Gregan, the latest 'since' time in the juju_status.yaml is 11:01, with most of the 'since' times occurring between 10:50 and 10:55. If the juju_status.yaml was captured at the same time the crashdump was run (as is usually the case), that would indicate a controller problem: multiple units were still running update-status well beyond 11:01, e.g.

https://pastebin.canonical.com/p/vKf6pMKzNk/

I assume something broke on the controller node at approx. 11:01. I guess we need to see the controller logs if this happens again?
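A rough way to spot this from a crashdump (assuming a juju_status.yaml with the layout shown earlier in this bug; PyYAML required; only principal units are checked):

    import yaml  # PyYAML

    def stuck_units(path='juju_status.yaml'):
        """List units whose agent is still 'executing', with the time their
        workload-status last changed."""
        with open(path) as f:
            status = yaml.safe_load(f)
        stuck = []
        for app in status.get('applications', {}).values():
            for name, unit in (app.get('units') or {}).items():
                if unit.get('juju-status', {}).get('current') == 'executing':
                    stuck.append((name, unit.get('workload-status', {}).get('since')))
        return stuck

    for name, since in stuck_units():
        print('{} still executing; workload-status unchanged since {}'.format(name, since))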

Revision history for this message
Liam Young (gnuoy) wrote :

I think we need to be a bit careful about attributing deployment failures to this bug, as there are many reasons why a charm may not have got the full relation data from a charm it is related to. I think the latest crashdump is yet another bug (the 3rd distinct bug attributed to this bug). Anyway, tl;dr: the latest crash dump looks like a duplicate of Bug #1794850.

In the latest crashdump the incomplete relations, as reported in the juju_status.yaml, show both neutron-gateways stuck with "Incomplete relations: network-service", which is consistent with them lacking information from nova-cloud-controller. Looking at the nova-cloud-controller statuses (and their subordinates) shows them still running ha setup hooks. As admcleod observed, the units of the nova-cloud-controller charm appear to get stuck just before 11:00am. Taking nova-cloud-controller/0 (and its subordinates filebeat/66, hacluster-nova/2, nrpe-container/45 and telegraf/67) as an example illustrates this point: https://pastebin.canonical.com/p/Cd4XnKwdsj/. So, why are they stuck?

Only the leader hacluster unit should be setting up the hacluster configuration; however, two seem to be doing it at exactly the same time:

$ grep -E "Configuring Group.*nova" nova-cloud-controller_*/var/log/juju/unit-hacluster-nova*log
nova-cloud-controller_0/var/log/juju/unit-hacluster-nova-2.log:2018-11-15 10:55:33 DEBUG juju-log ha:64: Configuring Groups: {'grp_nova_vips': 'res_nova_eth0_vip res_nova_eth1_vip'}
nova-cloud-controller_2/var/log/juju/unit-hacluster-nova-1.log:2018-11-15 10:55:33 DEBUG juju-log ha:64: Configuring Groups: {'grp_nova_vips': 'res_nova_eth0_vip res_nova_eth1_vip'}

This should not be happening, as in master this setup is gated by is_leader(). However, the deployment the crashdump is from is using cs:hacluster-49, which is missing commit 6b6a1776. That commit updates the charm from using the old and buggy 'oldest_peer' check to using the Juju is-leader primitive. So I think this should be fixed with the latest charms.
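For reference, the gating in question looks roughly like this. This is a sketch of the pattern, not the hacluster charm's exact code; configure_cluster_resources is a hypothetical stand-in for the corosync/pacemaker setup, and is_leader/log come from charmhelpers.core.hookenv.

    from charmhelpers.core.hookenv import is_leader, log

    def ha_relation_changed():
        # With commit 6b6a1776, only the Juju leader configures the cluster;
        # the old 'oldest_peer' heuristic could elect two "leaders" at once.
        if not is_leader():
            log('Not the leader; deferring resource/group configuration')
            return
        configure_cluster_resources()

    def configure_cluster_resources():
        """Hypothetical placeholder for the actual corosync/pacemaker setup."""
        pass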

Changed in charm-nova-cloud-controller:
assignee: Liam Young (gnuoy) → nobody
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 19.04 → 19.07
Revision history for this message
Pen Gale (pengale) wrote :

Per Liam's comment, this should be fixed by the latest charm release. Please reopen if the problem recurs.

Changed in charm-nova-cloud-controller:
status: New → Fix Released
importance: Undecided → Critical