All 3 nova-cloud-controller services down

Bug #1585780 reported by Andreas Hasenack
Affects                                         Status     Importance  Assigned to  Milestone
Landscape Server                                New        High        Unassigned
Landscape Server 16.06                          New        High        Unassigned
nova-cloud-controller (Juju Charms Collection)  Confirmed  Undecided   Liam Young

Bug Description

This looks similar to https://bugs.launchpad.net/charms/+source/nova-cloud-controller/+bug/1584951, but I didn't find the same telltale messages in the logs, so I'm filing a new bug.

HA OpenStack deployment with cs:trusty/nova-cloud-controller-284
juju 1.25.5
Our reference: https://ci.lscape.net/job/landscape-system-tests/2543/

All nova services on the 3 nova-cloud-controller (ncc) units are down (see https://pastebin.canonical.com/157279/ for better formatting):

==> nova-cloud-controller-0/var/log/nova/nova-api-os-compute.log <==
2016-05-25 17:09:17.281 11228 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2016-05-25 17:09:17.282 11228 INFO nova.wsgi [-] Stopping WSGI server.
2016-05-25 17:09:17.283 11228 INFO oslo_service.service [-] Waiting on 12 children to exit
2016-05-25 17:09:17.288 11228 INFO oslo_service.service [-] Child 11400 exited with status 0
2016-05-25 17:09:17.289 11228 INFO oslo_service.service [-] Child 11396 exited with status 0
2016-05-25 17:09:17.289 11228 INFO oslo_service.service [-] Child 11407 exited with status 0
2016-05-25 17:09:17.289 11228 INFO oslo_service.service [-] Child 11401 exited with status 0
2016-05-25 17:09:17.290 11228 INFO oslo_service.service [-] Child 11404 exited with status 0
2016-05-25 17:09:17.290 11228 INFO oslo_service.service [-] Child 11403 exited with status 0
2016-05-25 17:09:17.290 11228 INFO oslo_service.service [-] Child 11399 exited with status 0
2016-05-25 17:09:17.291 11228 INFO oslo_service.service [-] Child 11398 exited with status 0
2016-05-25 17:09:17.291 11228 INFO oslo_service.service [-] Child 11402 exited with status 0
2016-05-25 17:09:17.291 11228 INFO oslo_service.service [-] Child 11406 exited with status 0
2016-05-25 17:09:17.292 11228 INFO oslo_service.service [-] Child 11397 exited with status 0
2016-05-25 17:09:17.292 11228 INFO oslo_service.service [-] Child 11405 killed by signal 15

==> nova-cloud-controller-1/var/log/nova/nova-api-os-compute.log <==
2016-05-25 17:12:05.413 11292 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2016-05-25 17:12:05.414 11292 INFO nova.wsgi [-] Stopping WSGI server.
2016-05-25 17:12:05.414 11292 INFO oslo_service.service [-] Waiting on 12 children to exit
2016-05-25 17:12:05.415 11292 INFO oslo_service.service [-] Child 11468 exited with status 0
2016-05-25 17:12:05.415 11292 INFO oslo_service.service [-] Child 11469 exited with status 0
2016-05-25 17:12:05.415 11292 INFO oslo_service.service [-] Child 11470 exited with status 0
2016-05-25 17:12:05.415 11292 INFO oslo_service.service [-] Child 11471 exited with status 0
2016-05-25 17:12:05.416 11292 INFO oslo_service.service [-] Child 11472 exited with status 0
2016-05-25 17:12:05.416 11292 INFO oslo_service.service [-] Child 11473 exited with status 0
2016-05-25 17:12:05.416 11292 INFO oslo_service.service [-] Child 11474 exited with status 0
2016-05-25 17:12:05.416 11292 INFO oslo_service.service [-] Child 11475 exited with status 0
2016-05-25 17:12:05.417 11292 INFO oslo_service.service [-] Child 11476 exited with status 0
2016-05-25 17:12:05.417 11292 INFO oslo_service.service [-] Child 11477 exited with status 0
2016-05-25 17:12:05.417 11292 INFO oslo_service.service [-] Child 11478 exited with status 0
2016-05-25 17:12:05.418 11292 INFO oslo_service.service [-] Child 11479 exited with status 0

==> nova-cloud-controller-2/var/log/nova/nova-api-os-compute.log <==
2016-05-25 17:12:06.904 11297 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2016-05-25 17:12:06.905 11297 INFO nova.wsgi [-] Stopping WSGI server.
2016-05-25 17:12:06.905 11297 INFO oslo_service.service [-] Waiting on 12 children to exit
2016-05-25 17:12:06.914 11297 INFO oslo_service.service [-] Child 11475 exited with status 0
2016-05-25 17:12:06.915 11297 INFO oslo_service.service [-] Child 11467 exited with status 0
2016-05-25 17:12:06.915 11297 INFO oslo_service.service [-] Child 11468 exited with status 0
2016-05-25 17:12:06.916 11297 INFO oslo_service.service [-] Child 11469 exited with status 0
2016-05-25 17:12:06.916 11297 INFO oslo_service.service [-] Child 11470 exited with status 0
2016-05-25 17:12:06.917 11297 INFO oslo_service.service [-] Child 11471 exited with status 0
2016-05-25 17:12:06.917 11297 INFO oslo_service.service [-] Child 11472 exited with status 0
2016-05-25 17:12:06.917 11297 INFO oslo_service.service [-] Child 11473 exited with status 0
2016-05-25 17:12:06.918 11297 INFO oslo_service.service [-] Child 11474 exited with status 0
2016-05-25 17:12:06.918 11297 INFO oslo_service.service [-] Child 11476 exited with status 0
2016-05-25 17:12:06.919 11297 INFO oslo_service.service [-] Child 11477 exited with status 0
2016-05-25 17:12:06.919 11297 INFO oslo_service.service [-] Child 11478 exited with status 0
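
For reference, a quick way to confirm this state across the three units is to ask upstart directly. This is only a minimal sketch, assuming juju 1.25's "juju run --unit" syntax and the trusty upstart jobs; the service names are the ones the charm manages, per the juju log further down:

# "stop/waiting" means the job is down; "start/running, process N" means it is up.
for i in 0 1 2; do
  juju run --unit nova-cloud-controller/$i \
    'for s in nova-api-os-compute nova-cert nova-scheduler nova-conductor; do status $s; done'
done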

I did some troubleshooting, but it's hard to pin down.

Some things I found that seem interesting:
- rabbitmq only seems to be ready at around 17:16 or later, i.e. after the ncc services had stopped
- I also see rabbit-related errors like this one past the time when the nova services stopped (a grep sketch for these checks follows this list):
2016-05-25 17:21:17 INFO juju-log amqp:64: Missing required data: rabbitmq_password
- from the juju logs, it appears that the services were started again at some point later, but there is nothing in the nova logs (see the sketch after this list):
2016-05-25 17:21:23 INFO amqp-relation-changed stop: Unknown instance:
2016-05-25 17:21:23 INFO amqp-relation-changed nova-api-os-compute start/pre-start, process 24926
2016-05-25 17:21:23 INFO amqp-relation-changed stop: Unknown instance:
2016-05-25 17:21:23 INFO amqp-relation-changed nova-cert start/pre-start, process 24975
2016-05-25 17:21:23 INFO amqp-relation-changed stop: Unknown instance:
2016-05-25 17:21:23 INFO amqp-relation-changed nova-scheduler start/pre-start, process 25023
2016-05-25 17:21:23 INFO amqp-relation-changed stop: Unknown instance:
2016-05-25 17:21:23 INFO amqp-relation-changed nova-conductor start/pre-start, process 25104
- there are no juju errors, and nothing is pending or blocked in juju status
- hacluster has no errors that I can identify, and I can't tell for sure when the cluster formed (logs are also attached). If I use the first time this message appeared ("hanode-relation-changed Waiting for 1 replies from the CRMd. OK"), the timestamps are:
hacluster-nova-cloud-controller-0.log:2016-05-25 17:29:41 INFO hanode-relation-changed Waiting for 1 replies from the CRMd. OK
hacluster-nova-cloud-controller-1.log:2016-05-25 17:24:08 INFO hanode-relation-changed Waiting for 1 replies from the CRMd. OK
hacluster-nova-cloud-controller-2.log:2016-05-25 17:15:14 INFO ha-relation-changed Waiting for 1 replies from the CRMd. OK
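
In case it helps reproduce the analysis, the checks above boil down to a few greps over the attached logs. A sketch only: the paths mirror the attachment layout (nova-cloud-controller-N/var/log/...), and the juju unit log filename is my assumption:

# When was the charm still missing rabbitmq credentials?
grep 'Missing required data: rabbitmq_password' \
    nova-cloud-controller-*/var/log/juju/unit-nova-cloud-controller-*.log

# Restart attempts: upstart's "stop: Unknown instance:" means the job was already
# stopped, and "start/pre-start, process N" only means pre-start began, not that
# the daemon survived it, which would be consistent with the empty nova logs.
grep -h 'stop: Unknown instance\|start/pre-start' \
    nova-cloud-controller-*/var/log/juju/unit-nova-cloud-controller-*.log

# First sign of the pacemaker cluster forming on each ncc unit (-m 1: first match per file)
grep -m 1 'Waiting for 1 replies from the CRMd' hacluster-nova-cloud-controller-*.log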

Tags: landscape
Andreas Hasenack (ahasenack) wrote:

Pardon the weird formatting; that was loaded into Python and then dumped back out again.

description: updated
tags: removed: kanban-cross-team
Francis Ginther (fginther) wrote:

Here's another case of this bug from https://ci.lscape.net/job/landscape-system-tests/2639/. Attached are all of the nova-cloud-controller logs and the logs from landscape-server itself.

Changed in landscape:
importance: Undecided → High
Liam Young (gnuoy)
Changed in nova-cloud-controller (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
Changed in nova-cloud-controller (Juju Charms Collection):
status: New → Invalid
status: Invalid → Confirmed