nova-cloud-controller services are only running on 1 of 3 nodes

Bug #1584951 reported by Francis Ginther
This bug affects 2 people
Affects                                          Status  Importance  Assigned to       Milestone
Landscape Server                                 New     High        Andreas Hasenack
  16.06                                          New     High        Andreas Hasenack
nova-cloud-controller (Juju Charms Collection)   New     Undecided   Unassigned

Bug Description

This was found with a Landscape Autopilot deployment using Swift object storage, Ceph block storage, and an internal Nagios service. Nagios is reporting:

CRITICAL: nova-cloud-controller-1 nova-cloud-controller-0 nova-cloud-controller-1 nova-cloud-controller-0 nova-cloud-controller-1 nova-cloud-controller-0

The nagios check goes through each server in haproxy.cfg and tries to reach each endpoint. This fails for all but one of the nova-cloud-controller instances. Looking at the three instances shows that the nova services (nova-api-os-compute, nova-scheduler, nova-conductor and nova-cert) are only running on a single instance.

Logs have been attached.

Tags: landscape
description: updated
Revision history for this message
Francis Ginther (fginther) wrote :

Here's what I get from poking at the nagios check manually:

[From /etc/haproxy/haproxy.cfg]
...
backend nova-api-ec2_10.245.201.243
    balance leastconn
    server nova-cloud-controller-2 10.245.201.243:8763 check
    server nova-cloud-controller-1 10.245.201.60:8763 check
    server nova-cloud-controller-0 10.245.201.231:8763 check

...
backend nova-objectstore_10.245.201.243
    balance leastconn
    server nova-cloud-controller-2 10.245.201.243:3323 check
    server nova-cloud-controller-1 10.245.201.60:3323 check
    server nova-cloud-controller-0 10.245.201.231:3323 check

...
backend nova-api-os-compute_10.245.201.243
    balance leastconn
    server nova-cloud-controller-2 10.245.201.243:8764 check
    server nova-cloud-controller-1 10.245.201.60:8764 check
    server nova-cloud-controller-0 10.245.201.231:8764 check
...

Now the nagios check tries to access each of those IPs and ports. Doing this manually I get:

ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.243 8763
nc: connect to 10.245.201.243 port 8763 (tcp) failed: Connection refused
ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.243 8764
Connection to 10.245.201.243 8764 port [tcp/*] succeeded!
ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.243 3323
nc: connect to 10.245.201.243 port 3323 (tcp) failed: Connection refused

ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.60 3323
nc: connect to 10.245.201.60 port 3323 (tcp) failed: Connection refused
ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.60 8763
nc: connect to 10.245.201.60 port 8763 (tcp) failed: Connection refused
ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.60 8764
nc: connect to 10.245.201.60 port 8764 (tcp) failed: Connection refused

ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.231 8764
nc: connect to 10.245.201.231 port 8764 (tcp) failed: Connection refused
ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.231 8763
nc: connect to 10.245.201.231 port 8763 (tcp) failed: Connection refused
ubuntu@juju-machine-1-lxc-2:/etc/nagios/nrpe.d$ nc -vz 10.245.201.231 3323
nc: connect to 10.245.201.231 port 3323 (tcp) failed: Connection refused

Only one of the ports is reachable.
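
For completeness, the same sweep can be scripted instead of run by hand. The following is only a minimal sketch, not the actual nagios/nrpe plugin: it assumes the backend definitions live in /etc/haproxy/haproxy.cfg as shown above, pulls out every "server <name> <ip>:<port>" line, and attempts a plain TCP connect to each one, which is effectively what the manual nc -vz runs above do.

#!/usr/bin/env python
# Sketch only: scan haproxy.cfg backends and report unreachable endpoints.
import re
import socket

HAPROXY_CFG = '/etc/haproxy/haproxy.cfg'  # assumed path, as in the config above

def list_servers(cfg_path=HAPROXY_CFG):
    # Collect (name, ip, port) from every "server" line in the config.
    servers = []
    with open(cfg_path) as f:
        for line in f:
            m = re.match(r'\s*server\s+(\S+)\s+(\d+\.\d+\.\d+\.\d+):(\d+)', line)
            if m:
                servers.append((m.group(1), m.group(2), int(m.group(3))))
    return servers

def tcp_check(host, port, timeout=5):
    # Equivalent of "nc -vz host port": succeed if a TCP connect works.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except (socket.error, socket.timeout):
        return False

if __name__ == '__main__':
    failed = ['%s %s:%d' % s for s in list_servers() if not tcp_check(s[1], s[2])]
    if failed:
        print('CRITICAL: ' + ' '.join(failed))
    else:
        print('OK: all haproxy backends reachable')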

tags: added: landscape
Revision history for this message
Francis Ginther (fginther) wrote :

Logs from all three nova-cloud-controller units.

Revision history for this message
Francis Ginther (fginther) wrote :

List of 'nova' processes on the three units (see http://paste.ubuntu.com/16653734/ for better formatting):
# for i in {0..2}; do juju ssh nova-cloud-controller/${i} "ps -ef|grep nova"; done | tee /tmp/nova-cloud-controller-ps.log
Warning: Permanently added 'node-10.vmwarestack,10.245.202.2' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.245.201.231' (ECDSA) to the list of known hosts.
root 1064 1 0 13:34 ? 00:00:26 /var/lib/juju/tools/unit-nova-cloud-controller-0/jujud unit --data-dir /var/lib/juju --unit-name nova-cloud-controller/0 --debug
root 17797 1 0 13:38 ? 00:00:05 /var/lib/juju/tools/unit-hacluster-nova-cloud-controller-0/jujud unit --data-dir /var/lib/juju --unit-name hacluster-nova-cloud-controller/0 --debug
ubuntu 19343 19342 0 21:02 pts/1 00:00:00 bash -c ps -ef|grep nova
ubuntu 19345 19343 0 21:02 pts/1 00:00:00 grep nova
Connection to 10.245.201.231 closed.
Warning: Permanently added 'node-10.vmwarestack,10.245.202.2' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.245.201.60' (ECDSA) to the list of known hosts.
root 990 1 0 13:34 ? 00:00:24 /var/lib/juju/tools/unit-nova-cloud-controller-1/jujud unit --data-dir /var/lib/juju --unit-name nova-cloud-controller/1 --debug
root 21141 1 0 13:39 ? 00:00:05 /var/lib/juju/tools/unit-hacluster-nova-cloud-controller-1/jujud unit --data-dir /var/lib/juju --unit-name hacluster-nova-cloud-controller/1 --debug
ubuntu 312261 312260 0 21:02 pts/0 00:00:00 bash -c ps -ef|grep nova
ubuntu 312263 312261 0 21:02 pts/0 00:00:00 grep nova
Connection to 10.245.201.60 closed.
Warning: Permanently added 'node-10.vmwarestack,10.245.202.2' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.245.201.243' (ECDSA) to the list of known hosts.
root 1073 1 0 13:34 ? 00:00:27 /var/lib/juju/tools/unit-nova-cloud-controller-2/jujud unit --data-dir /var/lib/juju --unit-name nova-cloud-controller/2 --debug
nova 10556 1 1 14:16 ? 00:05:10 /usr/bin/python /usr/bin/nova-api-os-compute --log-file=/var/log/nova/nova-api-os-compute.log --config-file=/etc/nova/nova.conf
nova 10586 1 0 14:16 ? 00:00:44 /usr/bin/python /usr/bin/nova-cert --log-file=/var/log/nova/nova-cert.log --config-file=/etc/nova/nova.conf
nova 10605 10556 1 14:16 ? 00:06:34 /usr/bin/python /usr/bin/nova-api-os-compute --log-file=/var/log/nova/nova-api-os-compute.log --config-file=/etc/nova/nova.conf
nova 10606 10556 1 14:16 ? 00:06:36 /usr/bin/python /usr/bin/nova-api-os-compute --log-file=/var/log/nova/nova-api-os-compute.log --config-file=/etc/nova/nova.conf
nova 10607 10556 0 14:16 ? 00:03:19 /usr/bin/python /usr/bin/nova-api-os-compute --log-file=/var/log/nova/nova-api-os-compute.log --config-file=/etc/nova/nova.conf
nova 10608 10556 1 14:16 ? 00:06:36 /usr/bin/python /usr/bin/nova-api-os-compute --log-file=/var/log/nova/nova-api-os-compute.log --config-file=/etc/nova/nova.conf
nova 10621 1 1 14:16 ? 00:07:13 /usr/bin/python /usr/bin...


description: updated
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Ryan Beisner (1chb1n)
Changed in nova-cloud-controller (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
Changed in landscape:
importance: Undecided → High
Revision history for this message
Liam Young (gnuoy) wrote :

I *think* I've found the issue. I still need to put together the steps to reproduce the bug, but basically the following sequence will trigger it:

1) All relations except the db pass all required information to the charm and configs are rendered. Services are still disabled, however, as the charm does not enable services until the db sync is complete.
2) shared-db-changed relation fires, configs are updated with db information and schema migration runs.
3) After db migration charm runs:
    enable_services()
    cmd_all_services('start')
4) cmd_all_services('start') fails because cmd_all_services checks whether a service is running before starting it, and there appears to be a bug in charmhelpers around service_running (a sketch of the failure mode follows at the end of this comment)...

service nova-conductor status; python -c "from charmhelpers.core.host import service_running; print service_running('nova-conductor')"
nova-conductor stop/waiting
True

5) Contexts were complete and configs were rendered prior to the db migration, so the config files do not change on subsequent hook executions, meaning restart_on_change does not fire.

This bug is a race of sorts, since the ordering of hook execution matters: if the amq or identity hooks fire after the db migration, the config files would change, services would be restarted, and this bug would be covered up.
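
To make step 4 concrete, here is a hedged illustration of the failure mode; it is not the actual charmhelpers code (Bug #1581171 tracks the real helper and fix). On upstart, "service <name> status" exits 0 for any known job even when it is stopped, so a check that only trusts the exit code (or matches the output too loosely) reports "nova-conductor stop/waiting" as running, and cmd_all_services('start') then skips the start.

# Illustration only -- not the actual charmhelpers implementation.
import subprocess

def naive_service_running(service):
    # Buggy: "nova-conductor stop/waiting" still exits 0, so this returns True.
    try:
        subprocess.check_output(['service', service, 'status'],
                                stderr=subprocess.STDOUT)
        return True
    except subprocess.CalledProcessError:
        return False

def stricter_service_running(service):
    # Closer to the intent: only treat the job as running if upstart
    # actually reports a "start/running" state in its status output.
    try:
        output = subprocess.check_output(['service', service, 'status'],
                                         stderr=subprocess.STDOUT).decode('utf-8')
    except subprocess.CalledProcessError:
        return False
    return 'start/running' in output

Because the services are never started and the already-rendered configs do not change afterwards, nothing else restarts them, which matches step 5 above.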

Revision history for this message
Liam Young (gnuoy) wrote :

Ok, so that does seem to be the issue. There is an existing bug for charmhelpers incorrectly calculating the status of a service, Bug #1581171. The fix for that (which is also the fix for this) has landed in charmhelpers: http://bazaar.launchpad.net/~charm-helpers/charm-helpers/devel/revision/574 . The fix is also in the master and stable/16.04 branches of the nova-cloud-controller charm (git+ssh://github.com/openstack/charm-nova-cloud-controller).

Changed in nova-cloud-controller (Juju Charms Collection):
assignee: Liam Young (gnuoy) → nobody
Changed in landscape:
assignee: nobody → Andreas Hasenack (ahasenack)