Bug #1221987 “compute node heartbeat out of sync causing schedul...” : Bugs : OpenStack Compute (nova)

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-07:

#1

more examples of this bug can be found by searching logstash.openstack.org for '@tags:"logs/screen-n-sch.txt" AND @message:"Setting instance to ERROR state"'

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-07:

#2

better search query: @tags:"logs/screen-n-sch.txt" AND @message:(tempest* AND "Setting instance to ERROR state")

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2013-09-07:

#3

I thought this was not a bug, just lag causing timeouts, hence my not rechecking against it. Thanks for clarifying that it is actually an out of sync problem.

Russell Bryant (russellb) on 2013-09-09

tags:

added: testing

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-10:

#4

Since nova-compute has no DB access it goes to the conductor as a call, and it appears that even in a failing case (http://logs.openstack.org/86/45486/3/gate/gate-tempest-devstack-vm-postgres-full/b0bf0f2/) that the conductor is getting the update_service call (heartbeat) every 10 seconds. So somehow that information is not getting to the scheduler.

summary:

- compute node heartbeat out of sync causing scheduler to fail in devstack
+ compute node heartbeat out of sync causing scheduler to fail in
+ devstack: VMs fail to spawn

Attila Fazekas (afazekas) on 2013-09-10

Changed in nova:
status:	New → Confirmed

Joe Gordon (jogo) on 2013-09-10

Changed in nova:
importance:	Undecided → Critical

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-10:

#6

In the example of: http://logs.openstack.org/86/45486/3/gate/gate-tempest-devstack-vm-postgres-full/b0bf0f2/

It looks like nova-conductor stops receiving the 'conductor.service_update' command from nova-compute at some point, after which point nova-compute's heartbeat doesn't write to the DB.

http://logs.openstack.org/86/45486/3/gate/gate-tempest-devstack-vm-postgres-full/b0bf0f2/logs/screen-n-cond.txt.gz#_2013-09-09_18_53_58_112

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-09-10: Fix proposed to nova (master)

#7

Fix proposed to branch: master
Review: https://review.openstack.org/45974

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-09-10:

#8

Fix proposed to branch: master
Review: https://review.openstack.org/45975

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-11:

#9

Download full text (7.8 KiB)

It looks like nova-compute stops logging output and calling conductor.service_update at some point. In at least two cases the last line nova-compute logs is:

2013-09-09 18:54:05.238 DEBUG nova.ceilometer.notifier [req-4293c134-ac04-454c-927c-fbd59308267f ServerMetadataTestJSON1896624302-user ServerMetadataTestJSON1708281686-tenant] INFO {u'vm_state': u'active', u'internal_id': None, u'availability_zone': None, u'terminated_at': None, u'ephemeral_gb': 0, u'instance_type_id': 6, u'user_data': None, u'cleaned': 0, u'vm_mode': None, u'deleted_at': None, u'reservation_id': u'r-sh0bat2j', u'id': 40, u'security_groups': [{u'project_id': u'c085401b92e647248738282131a280cf', u'user_id': u'0f776f5034fc46d880d3cd83b7426b6d', u'name': u'default', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.256598', u'updated_at': None, u'rules': [], u'deleted_at': None, u'id': 63, u'description': u'default'}], u'disable_terminate': False, u'root_device_name': u'/dev/vda', u'user_id': u'0f776f5034fc46d880d3cd83b7426b6d', u'uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'default_swap_device': None, u'info_cache': {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 40, u'created_at': u'2013-09-09T18:53:48.280078', u'updated_at': u'2013-09-09T18:53:51.571804', u'network_info': u'[{"ovs_interfaceid": null, "network": {"bridge": "br100", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.1.0.40"}], "version": 4, "meta": {"dhcp_server": "10.1.0.1"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "8.8.4.4"}], "routes": [], "cidr": "10.1.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.1.0.1"}}, {"ips": [], "version": null, "meta": {"dhcp_server": null}, "dns": [], "routes": [], "cidr": null, "gateway": {"meta": {}, "version": null, "type": "gateway", "address": null}}], "meta": {"tenant_id": null, "should_create_bridge": true, "bridge_interface": "eth0"}, "id": "f0bcca3e-f111-4f98-a1b7-c29e97cc1ac9", "label": "private"}, "devname": null, "qbh_params": null, "meta": {}, "address": "fa:16:3e:bc:c8:10", "type": "bridge", "id": "a257f86e-6f1d-4cd2-b451-5d03df25fd39", "qbg_params": null}]', u'deleted_at': u'2013-09-09T18:54:05.145060', u'id': 40}, u'hostname': u'servermetadatatestjson-instance1338688511', u'launched_on': u'devstack-precise-hpcloud-az3-237303', u'display_description': u'ServerMetadataTestJSON-instance1338688511', u'key_data': None, u'deleted': 0, u'scheduled_at': u'2013-09-09T18:53:48.358005', u'power_state': 1, u'default_ephemeral_device': None, u'progress': 0, u'project_id': u'c085401b92e647248738282131a280cf', u'launched_at': u'2013-09-09T18:53:57.923464', u'config_drive': u'', u'node': u'devstack-precise-hpcloud-az3-237303', u'ramdisk_id': u'364e0577-95ca-4ef9-be29-7fa59d2f686e', u'access_ip_v6': None, u'access_ip_v4': None, u'kernel_id': u'97d8ffa0-2c26-4931-a592-e914a3fd7515', u'key_name': None, u'updated_at': u'2013-09-09T18:54:04.965035', u'host': u'devstack-precise-hpcloud-az3-237303', u'display_name': u'ServerMetadataTestJSON-instance1338688511', u'system_metadata': [{u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'delet...

It looks like nova-compute stops logging output and calling conductor.service_update at some point.  In at least two cases the last line nova-compute logs is:

2013-09-09 18:54:05.238 DEBUG nova.ceilometer.notifier [req-4293c134-ac04-454c-927c-fbd59308267f ServerMetadataTestJSON1896624302-user ServerMetadataTestJSON1708281686-tenant] INFO {u'vm_state': u'active', u'internal_id': None, u'availability_zone': None, u'terminated_at': None, u'ephemeral_gb': 0, u'instance_type_id': 6, u'user_data': None, u'cleaned': 0, u'vm_mode': None, u'deleted_at': None, u'reservation_id': u'r-sh0bat2j', u'id': 40, u'security_groups': [{u'project_id': u'c085401b92e647248738282131a280cf', u'user_id': u'0f776f5034fc46d880d3cd83b7426b6d', u'name': u'default', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.256598', u'updated_at': None, u'rules': [], u'deleted_at': None, u'id': 63, u'description': u'default'}], u'disable_terminate': False, u'root_device_name': u'/dev/vda', u'user_id': u'0f776f5034fc46d880d3cd83b7426b6d', u'uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'default_swap_device': None, u'info_cache': {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 40, u'created_at': u'2013-09-09T18:53:48.280078', u'updated_at': u'2013-09-09T18:53:51.571804', u'network_info': u'[{"ovs_interfaceid": null, "network": {"bridge": "br100", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.1.0.40"}], "version": 4, "meta": {"dhcp_server": "10.1.0.1"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "8.8.4.4"}], "routes": [], "cidr": "10.1.0.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.1.0.1"}}, {"ips": [], "version": null, "meta": {"dhcp_server": null}, "dns": [], "routes": [], "cidr": null, "gateway": {"meta": {}, "version": null, "type": "gateway", "address": null}}], "meta": {"tenant_id": null, "should_create_bridge": true, "bridge_interface": "eth0"}, "id": "f0bcca3e-f111-4f98-a1b7-c29e97cc1ac9", "label": "private"}, "devname": null, "qbh_params": null, "meta": {}, "address": "fa:16:3e:bc:c8:10", "type": "bridge", "id": "a257f86e-6f1d-4cd2-b451-5d03df25fd39", "qbg_params": null}]', u'deleted_at': u'2013-09-09T18:54:05.145060', u'id': 40}, u'hostname': u'servermetadatatestjson-instance1338688511', u'launched_on': u'devstack-precise-hpcloud-az3-237303', u'display_description': u'ServerMetadataTestJSON-instance1338688511', u'key_data': None, u'deleted': 0, u'scheduled_at': u'2013-09-09T18:53:48.358005', u'power_state': 1, u'default_ephemeral_device': None, u'progress': 0, u'project_id': u'c085401b92e647248738282131a280cf', u'launched_at': u'2013-09-09T18:53:57.923464', u'config_drive': u'', u'node': u'devstack-precise-hpcloud-az3-237303', u'ramdisk_id': u'364e0577-95ca-4ef9-be29-7fa59d2f686e', u'access_ip_v6': None, u'access_ip_v4': None, u'kernel_id': u'97d8ffa0-2c26-4931-a592-e914a3fd7515', u'key_name': None, u'updated_at': u'2013-09-09T18:54:04.965035', u'host': u'devstack-precise-hpcloud-az3-237303', u'display_name': u'ServerMetadataTestJSON-instance1338688511', u'system_metadata': [{u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.269261', u'updated_at': None, u'value': u'97d8ffa0-2c26-4931-a592-e914a3fd7515', u'key': u'image_kernel_id', u'deleted_at': None, u'id': 743}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.270023', u'updated_at': None, u'value': u'64', u'key': u'instance_type_memory_mb', u'deleted_at': None, u'id': 744}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.270942', u'updated_at': None, u'value': u'0', u'key': u'instance_type_swap', u'deleted_at': None, u'id': 745}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.271535', u'updated_at': None, u'value': None, u'key': u'instance_type_vcpu_weight', u'deleted_at': None, u'id': 746}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.272194', u'updated_at': None, u'value': u'0', u'key': u'instance_type_root_gb', u'deleted_at': None, u'id': 747}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.272783', u'updated_at': None, u'value': u'6', u'key': u'instance_type_id', u'deleted_at': None, u'id': 748}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.273373', u'updated_at': None, u'value': u'364e0577-95ca-4ef9-be29-7fa59d2f686e', u'key': u'image_ramdisk_id', u'deleted_at': None, u'id': 749}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.273960', u'updated_at': None, u'value': u'm1.nano', u'key': u'instance_type_name', u'deleted_at': None, u'id': 750}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.274552', u'updated_at': None, u'value': u'0', u'key': u'instance_type_ephemeral_gb', u'deleted_at': None, u'id': 751}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.275143', u'updated_at': None, u'value': u'1.0', u'key': u'instance_type_rxtx_factor', u'deleted_at': None, u'id': 752}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.275768', u'updated_at': None, u'value': u'ami', u'key': u'image_disk_format', u'deleted_at': None, u'id': 753}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.276356', u'updated_at': None, u'value': u'42', u'key': u'instance_type_flavorid', u'deleted_at': None, u'id': 754}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.276946', u'updated_at': None, u'value': u'ami', u'key': u'image_container_format', u'deleted_at': None, u'id': 755}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.277531', u'updated_at': None, u'value': u'1', u'key': u'instance_type_vcpus', u'deleted_at': None, u'id': 756}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.278121', u'updated_at': None, u'value': u'0', u'key': u'image_min_ram', u'deleted_at': None, u'id': 757}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.278710', u'updated_at': None, u'value': u'0', u'key': u'image_min_disk', u'deleted_at': None, u'id': 758}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:53:48.279294', u'updated_at': None, u'value': u'99646245-a84e-491c-a636-666a3bb8aa22', u'key': u'image_base_image_ref', u'deleted_at': None, u'id': 759}], u'task_state': u'deleting', u'shutdown_terminate': False, u'cell_name': None, u'root_gb': 0, u'locked': False, u'name': u'instance-00000028', u'created_at': u'2013-09-09T18:53:48.265571', u'locked_by': None, u'launch_index': 0, u'memory_mb': 64, u'vcpus': 1, u'image_ref': u'99646245-a84e-491c-a636-666a3bb8aa22', u'architecture': None, u'auto_disk_config': False, u'os_type': None, u'metadata': [{u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:54:01.120265', u'updated_at': None, u'value': u'value2', u'key': u'key2', u'deleted_at': None, u'id': 8}, {u'instance_uuid': u'918bfde7-0bc5-4e5b-abe2-c9de647fe8fd', u'deleted': 0, u'created_at': u'2013-09-09T18:54:01.121199', u'updated_at': u'2013-09-09T18:54:04.717057', u'value': u'value1', u'key': u'key1', u'deleted_at': None, u'id': 9}]} __init__ /opt/stack/new/ceilometer/ceilometer/compute/nova_notifier.py:114

Revision history for this message

Doug Hellmann (doug-hellmann) wrote on 2013-09-11:

#11

We talked about it in the ceilometer team meeting today and agreed that if the plugin is causing issues with the gate, it should be left out of the nova configuration for now.

Joe scheduled a summit session to work out how to do the notification better: http://summit.openstack.org/cfp/details/73

Joe Gordon (jogo) on 2013-09-12

Changed in nova:
milestone:	none → havana-rc1

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-13:

#12

celio makes nova-compute timeout in what appears to be two different ways. One is found here:

http://logs.openstack.org/58/46358/3/check/gate-tempest-devstack-vm-full/4f94066/logs/screen-n-cpu.txt.gz#_2013-09-13_19_35_17_001

where ceilo times out for a minute causing the heartbeat to stop for just over a minute

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-14:

#13

If this is related to Celiometer, this should be documented in the release.

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-16:

#14

This should be closed, https://review.openstack.org/#/c/46336/. Waiting a few days to make sure the fix works though.

Revision history for this message

Joe Gordon (jogo) wrote on 2013-09-16:

#15

Bug hasn't been seen all day, looks like the fix worked

Changed in nova:
status:	Confirmed → Fix Committed

Julien Danjou (jdanjou) on 2013-09-17

Changed in ceilometer:
status:	New → Triaged
importance:	Undecided → Medium

Revision history for this message

Attila Fazekas (afazekas) wrote on 2013-09-24:

#16

Some of the rechecks is really related to this bug: https://bugs.launchpad.net/nova/+bug/1229590

Thierry Carrez (ttx) on 2013-10-03

Changed in nova:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2013-10-17

Changed in nova:
milestone:	havana-rc1 → 2013.2

Revision history for this message

Sam Morrison (sorrison) wrote on 2013-11-28:

#18

I'm a little confused with this. Does this mean the ceilometer.compute.nova_notifier driver is broken and shouldn't be used with Havana?

Revision history for this message

Joe Gordon (jogo) wrote on 2013-12-04:

#19

Sam, yes, that is exactly what it means.

Revision history for this message

gordon chung (chungg) wrote on 2015-04-29:

#20

marking this as won't fix on ceilometer end. we removed the nova_notifier as the plugin mechanism didn't play well with nova

Changed in ceilometer:
status:	Triaged → Won't Fix

Affects		Status	Importance	Assigned to	Milestone
	Ceilometer	Won't Fix	Medium	Unassigned
	OpenStack Compute (nova)	Fix Released	Critical	Unassigned	OpenStack Compute (nova) 2013.2 "havana"

OpenStack Compute (nova)

compute node heartbeat out of sync causing scheduler to fail in devstack: VMs fail to spawn

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches