compute node heartbeat out of sync causing scheduler to fail in devstack: VMs fail to spawn

Bug #1221987 reported by Joe Gordon
62
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Ceilometer
Won't Fix
Medium
Unassigned
OpenStack Compute (nova)
Fix Released
Critical
Unassigned

Bug Description

I have now seen this several times in devstack-gate

Occasionally instances will go into error state, and when checking the logs, I see the error is in nova-scheduler where the heartbeat is too old, which causes the scheduler to not schedule on that node.

http://logs.openstack.org/97/45497/2/check/gate-tempest-devstack-vm-postgres-full/084442d/logs/screen-n-sch.txt.gz#_2013-09-07_00_06_14_880

From logfile:

 DB_Driver.is_up last_heartbeat = 2013-09-07 00:04:47.516557 elapsed = 87.363841 is_up /opt/stack/new/nova/nova/servicegroup/drivers/db.py:71

  (devstack-precise-hpcloud-az1-229926, devstack-precise-hpcloud-az1-229926) ram:5170 disk:210944 io_ops:5 instances:18 is disabled or has not been heard from in a while host_passes /opt/stack/new/nova/nova/scheduler/filters/compute_filter.py:44

  Filter ComputeFilter returned 0 host(s) get_filtered_objects /opt/stack/new/nova/nova/filters.py:85
[instance: 415fb048-1d47-4fb8-8974-323529d8b1b5] Setting instance to ERROR state.
20

Tags: testing
Revision history for this message
Joe Gordon (jogo) wrote :

more examples of this bug can be found by searching logstash.openstack.org for '@tags:"logs/screen-n-sch.txt" AND @message:"Setting instance to ERROR state"'

Revision history for this message
Joe Gordon (jogo) wrote :

better search query: @tags:"logs/screen-n-sch.txt" AND @message:(tempest* AND "Setting instance to ERROR state")

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I thought this was not a bug, just lag causing timeouts, hence my not rechecking against it. Thanks for clarifying that it is actually an out of sync problem.

tags: added: testing
Revision history for this message
Joe Gordon (jogo) wrote :

Since nova-compute has no DB access it goes to the conductor as a call, and it appears that even in a failing case (http://logs.openstack.org/86/45486/3/gate/gate-tempest-devstack-vm-postgres-full/b0bf0f2/) that the conductor is getting the update_service call (heartbeat) every 10 seconds. So somehow that information is not getting to the scheduler.

summary: - compute node heartbeat out of sync causing scheduler to fail in devstack
+ compute node heartbeat out of sync causing scheduler to fail in
+ devstack: VMs fail to spawn
Changed in nova:
status: New → Confirmed
Joe Gordon (jogo)
Changed in nova:
importance: Undecided → Critical
Revision history for this message
Joe Gordon (jogo) wrote :

In the example of: http://logs.openstack.org/86/45486/3/gate/gate-tempest-devstack-vm-postgres-full/b0bf0f2/

It looks like nova-conductor stops receiving the 'conductor.service_update' command from nova-compute at some point, after which point nova-compute's heartbeat doesn't write to the DB.

http://logs.openstack.org/86/45486/3/gate/gate-tempest-devstack-vm-postgres-full/b0bf0f2/logs/screen-n-cond.txt.gz#_2013-09-09_18_53_58_112

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/45974

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/45975

Revision history for this message
Joe Gordon (jogo) wrote :
Download full text (7.8 KiB)

It looks like nova-compute stops logging output and calling conductor.service_update at some point. In at least two cases the last line nova-compute logs is:

2013-09-09 18:54:05.238 DEBUG nova.ceilometer.notifier [req-4293c134-ac04-454c-927c-fbd59308267f ServerMetadataTestJSON1896624302-user ServerMetadataTestJSON1708281686-tenant] INFO {u'vm_state': u'active', u'internal_id': None, u'availability_zone': None, u'terminated_at': None, u'ephemeral_gb': 0, u'instance_type_id': 6, u'user_data': None, u'cleaned': 0, u'vm_mode': None, u'deleted_at': None, u'reservation_id': u'r-sh0bat2j', u'id': 40, u'security_groups': [{u'project_id': u'c085401b92e647248738282131a280cf', u'user_id': u'0f776f5034fc46d880d3cd83b7426b6d', u'name': u'default', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.256598', u'updated_at': None, u'rules': [], u'deleted_at': None, u'id': 63, u'description': u'default'}], u'disable_terminate': False, u'root_device_name': u'/dev/vda', u'user_id': u'0f776f5034fc46d880d3cd83b7426b6d', u'uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'default_swap_device': None, u'info_cache': {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 40, u'created_at': u'2013-09-09T18:53:48.280078', u'updated_at': u'2013-09-09T18:53:51.571804', u'network_info': u'[{"ovs_interfaceid": null, "network": {"bridge": "br100", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.1.0.40"}], "version": 4, "meta": {"dhcp_server": "10.1.0.1"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "8.8.4.4"}], "routes": [], "cidr": "10.1.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.1.0.1"}}, {"ips": [], "version": null, "meta": {"dhcp_server": null}, "dns": [], "routes": [], "cidr": null, "gateway": {"meta": {}, "version": null, "type": "gateway", "address": null}}], "meta": {"tenant_id": null, "should_create_bridge": true, "bridge_interface": "eth0"}, "id": "f0bcca3e-f111-4f98-a1b7-c29e97cc1ac9", "label": "private"}, "devname": null, "qbh_params": null, "meta": {}, "address": "fa:16:3e:bc:c8:10", "type": "bridge", "id": "a257f86e-6f1d-4cd2-b451-5d03df25fd39", "qbg_params": null}]', u'deleted_at': u'2013-09-09T18:54:05.145060', u'id': 40}, u'hostname': u'servermetadatatestjson-instance1338688511', u'launched_on': u'devstack-precise-hpcloud-az3-237303', u'display_description': u'ServerMetadataTestJSON-instance1338688511', u'key_data': None, u'deleted': 0, u'scheduled_at': u'2013-09-09T18:53:48.358005', u'power_state': 1, u'default_ephemeral_device': None, u'progress': 0, u'project_id': u'c085401b92e647248738282131a280cf', u'launched_at': u'2013-09-09T18:53:57.923464', u'config_drive': u'', u'node': u'devstack-precise-hpcloud-az3-237303', u'ramdisk_id': u'364e0577-95ca-4ef9-be29-7fa59d2f686e', u'access_ip_v6': None, u'access_ip_v4': None, u'kernel_id': u'97d8ffa0-2c26-4931-a592-e914a3fd7515', u'key_name': None, u'updated_at': u'2013-09-09T18:54:04.965035', u'host': u'devstack-precise-hpcloud-az3-237303', u'display_name': u'ServerMetadataTestJSON-instance1338688511', u'system_metadata': [{u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'delet...

Read more...

Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

We talked about it in the ceilometer team meeting today and agreed that if the plugin is causing issues with the gate, it should be left out of the nova configuration for now.

Joe scheduled a summit session to work out how to do the notification better: http://summit.openstack.org/cfp/details/73

Joe Gordon (jogo)
Changed in nova:
milestone: none → havana-rc1
Revision history for this message
Joe Gordon (jogo) wrote :

celio makes nova-compute timeout in what appears to be two different ways. One is found here:

http://logs.openstack.org/58/46358/3/check/gate-tempest-devstack-vm-full/4f94066/logs/screen-n-cpu.txt.gz#_2013-09-13_19_35_17_001

where ceilo times out for a minute causing the heartbeat to stop for just over a minute

Revision history for this message
Joe Gordon (jogo) wrote :

If this is related to Celiometer, this should be documented in the release.

Revision history for this message
Joe Gordon (jogo) wrote :

This should be closed, https://review.openstack.org/#/c/46336/. Waiting a few days to make sure the fix works though.

Revision history for this message
Joe Gordon (jogo) wrote :

Bug hasn't been seen all day, looks like the fix worked

Changed in nova:
status: Confirmed → Fix Committed
Julien Danjou (jdanjou)
Changed in ceilometer:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Attila Fazekas (afazekas) wrote :

Some of the rechecks is really related to this bug: https://bugs.launchpad.net/nova/+bug/1229590

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-rc1 → 2013.2
Revision history for this message
Sam Morrison (sorrison) wrote :

I'm a little confused with this. Does this mean the ceilometer.compute.nova_notifier driver is broken and shouldn't be used with Havana?

Revision history for this message
Joe Gordon (jogo) wrote :

Sam, yes, that is exactly what it means.

Revision history for this message
gordon chung (chungg) wrote :

marking this as won't fix on ceilometer end. we removed the nova_notifier as the plugin mechanism didn't play well with nova

Changed in ceilometer:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.