Errors when starting introspection are silently ignored

Bug #1651704 reported by Alfredo Moralejo on 2016-12-21
This bug affects 1 person
Affects                    Status        Importance  Assigned to      Milestone
Ironic Inspector           Invalid       Undecided   Unassigned
OpenStack Compute (nova)   Invalid       Undecided   Unassigned
tripleo                    Fix Released  High        Dougal Matthews  ocata-3
ironic-inspector (Ubuntu)  Invalid       Undecided   Unassigned

Bug Description

Running TripleO using tripleo-quickstart with the minimal profile (step_introspect: true) on the master branch, the overcloud deploy fails with the following error:

    ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

Looking at nova-scheduler.log, the following errors are found:

    https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-806/undercloud/var/log/nova/nova-scheduler.log.gz

    2016-12-21 06:45:56.822 17759 DEBUG nova.scheduler.host_manager [req-f889dbc0-1096-4f92-80fc-3c5bdcb1ad29 4f103e0230074c2488b7359bc079d323 f21dbfb3b2c840059ec2a0bba03b7385 - - -] Update host state from compute node: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='',created_at=2016-12-21T06:38:28Z,current_workload=0,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=0,free_disk_gb=0,free_ram_mb=0,host='undercloud',host_ip=192.168.23.46,hypervisor_hostname='c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec',hypervisor_type='ironic',hypervisor_version=1,id=2,local_gb=0,local_gb_used=0,memory_mb=0,memory_mb_used=0,metrics='[]',numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.0,running_vms=0,service_id=None,stats={boot_option='local',cpu_aes='true',cpu_arch='x86_64',cpu_hugepages='true',cpu_hugepages_1g='true',cpu_vt='true',profile='control'},supported_hv_specs=[HVSpec],updated_at=2016-12-21T06:45:38Z,uuid=ac2742da-39fb-4ca4-9f78-8e04f703c7a6,vcpus=0,vcpus_used=0) _locked_update /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:168

    2016-12-21 06:47:48.893 17759 DEBUG nova.scheduler.filters.ram_filter [req-2aece1c8-6d3e-457b-92d7-a3177680c82e 4f103e0230074c2488b7359bc079d323 f21dbfb3b2c840059ec2a0bba03b7385 - - -] (undercloud, c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec) ram: 0MB disk: 0MB io_ops: 0 instances: 0 does not have 8192 MB usable ram before overcommit, it only has 0 MB. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:45

    2016-12-21 06:47:48.894 17759 INFO nova.filters [req-2aece1c8-6d3e-457b-92d7-a3177680c82e 4f103e0230074c2488b7359bc079d323 f21dbfb3b2c840059ec2a0bba03b7385 - - -] Filter RamFilter returned 0 hosts

My guess is that node introspection is failing to get proper node information.

Full logs can be found in https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-806/undercloud/

We have hit this issue twice in the last few runs.

Tags: ci
Yolanda Robla (yolanda.robla) wrote:

I'm hitting this as well in my local runs. My compute node was fine, but the controller failed with that error.

Thanks for the report. Yeah, I can see that one node got deployed correctly [0], but there's no trace of a second node being deployed in the Ironic logs, which indicates that it failed before that point (while Nova was scheduling), so "No valid host" makes sense.

One possible cause for this: when the nodes are updated in Ironic (after inspection, for example) it takes a while for the values to be propagated to the Nova resource tracker. This happens because Nova runs a periodic task to update it and, as you may know, periodic tasks in OpenStack can be slow (they run one after the other in a greenthread). In the Ironic DevStack plugin we have a workaround for this: https://github.com/openstack/ironic/blob/master/devstack/lib/ironic#L1151-L1167

Perhaps we need something similar in OOO-quickstart?
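That DevStack workaround essentially polls Nova until the Ironic resources show up. A minimal sketch of the same idea (the helper name and `get_stats` callable are hypothetical; in practice the stats would be parsed from something like `openstack hypervisor stats show -f json`):

```python
import time

def wait_for_nova_resources(get_stats, expected_memory_mb, timeout=120, interval=10):
    """Poll Nova's aggregate hypervisor stats until the resource-tracker
    periodic task has propagated the Ironic node properties.

    get_stats is a callable returning a dict with a 'memory_mb' key,
    e.g. parsed from `openstack hypervisor stats show -f json`.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_stats().get("memory_mb", 0) >= expected_memory_mb:
            return True  # resources are now visible to the scheduler
        time.sleep(interval)
    return False  # timed out: scheduling would still fail with "No valid host"
```

Deploying only after this returns True would avoid racing the periodic task.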

[0]

2016-12-21 06:50:39.867 29447 INFO ironic.conductor.task_manager [req-f77a2cef-c3e7-4863-b313-e8fee9d2a3c2 - - - - -] Node f06c26d1-8008-47a9-a687-d5ef05083e41 moved to provision state "active" from state "deploying"; target provision state is "None"
2016-12-21 06:50:39.869 29447 INFO ironic.drivers.modules.agent_base_vendor [req-f77a2cef-c3e7-4863-b313-e8fee9d2a3c2 - - - - -] Deployment to node f06c26d1-8008-47a9-a687-d5ef05083e41 done

I'm also marking ironic-inspector as affected here because in the logs I see (thanks to bfournier for pointing it out):

2016-12-21 06:39:16.254 24006 ERROR ironic_inspector.node_cache [-] [node: f06c26d1-8008-47a9-a687-d5ef05083e41 state None] Database integrity error (pymysql.err.IntegrityError) (1062, u"Duplicate entry 'bmc_address-127.0.0.1' for key 'PRIMARY'") [SQL: u'INSERT INTO attributes (name, value, uuid) VALUES (%(name)s, %(value)s, %(uuid)s)'] [parameters: {'name': 'bmc_address', 'value': '127.0.0.1', 'uuid': u'f06c26d1-8008-47a9-a687-d5ef05083e41'}] during adding attributes
2016-12-21 06:39:16.255 24006 ERROR ironic_inspector.utils [-] [node: f06c26d1-8008-47a9-a687-d5ef05083e41 state None] Some or all of bmc_address's ['127.0.0.1'] are already on introspection

Since we changed to VirtualBMC, all the BMC addresses point to 127.0.0.1; only the ports differ. So maybe inspector is disregarding the 2nd node, thinking it's a duplicate?
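A toy illustration of the suspected collision (this is not inspector's actual schema, just the keying problem): if the lookup attributes are effectively keyed on (name, value) alone, two VirtualBMC nodes sharing 127.0.0.1 collide, while folding the port into the value keeps them distinct:

```python
def register_attributes(db, node_uuid, attrs):
    """db maps (name, value) -> node_uuid, mimicking a PRIMARY KEY on
    (name, value). Raises ValueError on a duplicate entry, like the
    IntegrityError in the inspector logs above."""
    for name, value in attrs:
        key = (name, value)
        if key in db and db[key] != node_uuid:
            raise ValueError("Duplicate entry '%s-%s'" % (name, value))
        db[key] = node_uuid

db = {}
register_attributes(db, "node-1", [("bmc_address", "127.0.0.1")])
try:
    # Second VirtualBMC node, same bmc_address: collides.
    register_attributes(db, "node-2", [("bmc_address", "127.0.0.1")])
    collided = False
except ValueError:
    collided = True
# Folding the port into the stored value keeps the nodes distinct.
register_attributes(db, "node-2", [("bmc_address", "127.0.0.1:6231")])
```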

Alfredo Moralejo (amoralej) wrote:

I had reported the db integrity error in a separate LP bug: https://bugs.launchpad.net/tripleo/+bug/1651719

Alfredo Moralejo (amoralej) wrote:

So, if we focus on the node where introspection ran properly (c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec), we can see the following:

1. Introspection finishes successfully with the right information, according to the inspector logs:

2016-12-21 06:41:24.030 24006 INFO ironic_inspector.plugins.standard [-] [node: c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec state processing MAC 00:fc:e0:2e:a1:ec] Discovered data: CPUs: 2 x86_64, memory 8192 MiB, disk 49 GiB
2016-12-21 06:41:26.408 24006 INFO ironic_inspector.process [-] [node: c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec state None MAC 00:fc:e0:2e:a1:ec] Introspection finished successfully

2. Nova's scheduler discards it via the RamFilter:

2016-12-21 06:45:57.537 17759 DEBUG nova.filters [req-a2319864-8629-4b7f-8b94-d41911ad02d0 4f103e0230074c2488b7359bc079d323 f21dbfb3b2c840059ec2a0bba03b7385 - - -] Filter AvailabilityZoneFilter returned 1 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:104
2016-12-21 06:45:57.538 17759 DEBUG nova.scheduler.filters.ram_filter [req-a2319864-8629-4b7f-8b94-d41911ad02d0 4f103e0230074c2488b7359bc079d323 f21dbfb3b2c840059ec2a0bba03b7385 - - -] (undercloud, c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec) ram: 0MB disk: 0MB io_ops: 0 instances: 0 does not have 8192 MB usable ram before overcommit, it only has 0 MB. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:45

That's about 4 minutes and 30 seconds between the end of introspection and the nova scheduler error. I'd say that should be enough time; what do you think?

Alfredo Moralejo (amoralej) wrote:

Checking nova-compute.log, it seems it's not a matter of waiting; see:

2016-12-21 06:38:28.987 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:39:29.787 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:40:29.820 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:41:31.332 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:42:32.838 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:43:34.820 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:44:36.839 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:45:38.820 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:46:39.972 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:47:40.938 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:48:43.261 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5e8 - - - - -] Final resource view: name=c6f8f4ba-9c7c-4c87-b95a-67a5861b7bec phys_ram=0MB used_ram=0MB phys_disk=0GB used_disk=0GB total_vcpus=0 used_vcpus=0 pci_stats=[]
2016-12-21 06:49:56.760 23446 INFO nova.compute.resource_tracker [req-4ec06e18-d31c-4254-8bc1-9cad7fb8b5...


Alfredo Moralejo (amoralej) wrote:

I've reproduced this locally running RDO from master. I have an Ironic installation (a TripleO undercloud), and I've registered a node and introspected it:

This is how the node looks from Ironic's perspective:

$ openstack baremetal show e14be55d-8dd9-4b08-aa5d-efab2b5a5c01
This command is deprecated. Instead, use 'openstack baremetal node show'.
+----------------------+-----------------------------------------------------------------------------+
| Field                | Value                                                                       |
+----------------------+-----------------------------------------------------------------------------+
| chassis_uuid         | None                                                                        |
| clean_step           | {}                                                                          |
| console_enabled      | False                                                                       |
| created_at           | 2016-12-23T17:08:16+00:00                                                   |
| driver               | pxe_ssh                                                                     |
| driver_info          | {u'ssh_username': u'stack', u'deploy_kernel': u'd77546aa-5936-417a-9dec-    |
|                      | 43f4f98a057d', u'deploy_ramdisk': u'bc508426-aee4-4d7d-97e5-dcbd6761a896',  |
|                      | u'ssh_key_contents': u'******', u'ssh_virt_type': u'virsh',                 |
|                      | u'ssh_address': u'192.168.23.1'}                                            |
| driver_internal_info | {}                                                                          |
| extra                | {u'hardware_swift_object': u'extra_hardware-e14be55d-8dd9-4b08-aa5d-        |
|                      | efab2b5a5c01'} ...

Dmitry Tantsur (divius) wrote:

This seems to be a problem:

 | provision_state | manageable

I wonder if something is wrong with the workflows. Which command did you use for introspection?

Changed in nova:
status: New → Invalid
Changed in ironic-inspector (Ubuntu):
status: New → Invalid
Changed in tripleo:
status: New → Incomplete
Changed in ironic:
status: New → Incomplete
affects: ironic → ironic-inspector
Alfredo Moralejo (amoralej) wrote:

I'm following the tripleo-quickstart workflow:

1. Create VMs on the virthost: one for the undercloud, one for the controller and one for the compute node
2. Deploy the undercloud
3. Run the following commands:

openstack baremetal import --json instackenv.json
openstack baremetal configure boot
openstack baremetal introspection bulk start

At that point I executed the commands in comment #7.

Alfredo Moralejo (amoralej) wrote:

After some additional testing, I'd say this issue may be related to error handling when bulk introspection runs and one of the introspections fails.

In this case one of them is failing because of https://bugs.launchpad.net/tripleo/+bug/1651719, but "openstack baremetal introspection bulk start" returns 0, and sometimes the server that was properly introspected is still in the "manageable" state when the command finishes.
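In scripts, one way to guard against this is to check every node's introspection status explicitly instead of trusting the bulk command's exit code. A sketch (the helper is hypothetical; the status dicts are shaped like the inspector API's per-node result, which exposes "finished" and "error" fields, as shown by `openstack baremetal introspection status <uuid>`):

```python
def introspection_ok(statuses):
    """Return (ok, failures) for a mapping of node UUID -> status dict.

    A node passes only if introspection finished and reported no error;
    any failure means the run as a whole must be treated as failed.
    """
    failures = {}
    for uuid, status in statuses.items():
        if not status.get("finished") or status.get("error"):
            failures[uuid] = status.get("error") or "not finished"
    return (not failures, failures)

# Example mirroring this bug: one node fine, one hit the duplicate
# bmc_address error, so the overall run is a failure.
ok, failures = introspection_ok({
    "c6f8f4ba-...": {"finished": True, "error": None},
    "f06c26d1-...": {"finished": True,
                     "error": "bmc_address already on introspection"},
})
```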

Dmitry Tantsur (divius) on 2017-01-10
Changed in tripleo:
status: Incomplete → Triaged
summary: - Overcloud deployment fails when node introspection is enabled
+ Errors when starting introspection are silently ignored
Changed in tripleo:
importance: Undecided → High
assignee: nobody → Dmitry Tantsur (divius)
milestone: none → ocata-3
Dougal Matthews (d0ugal) on 2017-01-10
Changed in tripleo:
assignee: Dmitry Tantsur (divius) → Dougal Matthews (d0ugal)
Changed in tripleo:
status: Triaged → In Progress

Related fix proposed to branch: master
Review: https://review.openstack.org/418964

Related fix proposed to branch: master
Review: https://review.openstack.org/418966

Related fix proposed to branch: master
Review: https://review.openstack.org/418968

Reviewed: https://review.openstack.org/418423
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c7b01eba55e5d133ccc19451cf4727170a5dbdd0
Submitter: Jenkins
Branch: master

commit c7b01eba55e5d133ccc19451cf4727170a5dbdd0
Author: Dougal Matthews <email address hidden>
Date: Tue Jan 10 14:35:36 2017 +0000

    Fail the baremetal workflows when sending a "FAILED" message

    When Mistral workflows execute a second workflow (a sub-workflow
    execution), the parent workflow can't easily determine if sub-workflow
    failed. This is because the failure is communicated via a Zaqar message
    only and when a workflow ends with a successful Zaqar message it appears
    have been successful. This problem surfaces because parent workflows
    should have an "on-error" attribute but it is never called, as the
    workflow doesn't error.

    This change marks the workflow as failed if the message has the status
    "FAILED". Now when a sub-workflow fails, the task that called it should
    have the on-error triggered. Previously it would always go to
    on-success.

    Closes-Bug: #1651704
    Change-Id: I60444ec692351c44753649b59b7c1d7c4b61fa8e

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/418964
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ab946c4a0e6b0bf5bc0b31e93af455af5b1887dc
Submitter: Jenkins
Branch: master

commit ab946c4a0e6b0bf5bc0b31e93af455af5b1887dc
Author: Dougal Matthews <email address hidden>
Date: Wed Jan 11 14:16:50 2017 +0000

    Fail the plan_management workflows when sending a "FAILED" message

    When Mistral workflows execute a second workflow (a sub-workflow
    execution), the parent workflow can't easily determine if sub-workflow
    failed. This is because the failure is communicated via a Zaqar message
    only and when a workflow ends with a successful Zaqar message it appears
    have been successful. This problem surfaces because parent workflows
    should have an "on-error" attribute but it is never called, as the
    workflow doesn't error.

    This change marks the workflow as failed if the message has the status
    "FAILED". Now when a sub-workflow fails, the task that called it should
    have the on-error triggered. Previously it would always go to
    on-success.

    Change-Id: Ie0f1d4e57505e9346ae3f0b25d755f55d73a255a
    Related-Bug: #1651704

Reviewed: https://review.openstack.org/418968
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=6051a7446bdf415e18ddc5f25697759ca488be17
Submitter: Jenkins
Branch: master

commit 6051a7446bdf415e18ddc5f25697759ca488be17
Author: Dougal Matthews <email address hidden>
Date: Wed Jan 11 14:23:25 2017 +0000

    Fail the validation workflows when sending a "FAILED" message

    When Mistral workflows execute a second workflow (a sub-workflow
    execution), the parent workflow can't easily determine if sub-workflow
    failed. This is because the failure is communicated via a Zaqar message
    only and when a workflow ends with a successful Zaqar message it appears
    have been successful. This problem surfaces because parent workflows
    should have an "on-error" attribute but it is never called, as the
    workflow doesn't error.

    This change marks the workflow as failed if the message has the status
    "FAILED". Now when a sub-workflow fails, the task that called it should
    have the on-error triggered. Previously it would always go to
    on-success.

    Change-Id: I2170afac89e8261c1e2289da8e9220694fadbad0
    Related-Bug: #1651704

Reviewed: https://review.openstack.org/418961
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=12f734562c5925c9260a3f038d570a7910c51cfb
Submitter: Jenkins
Branch: master

commit 12f734562c5925c9260a3f038d570a7910c51cfb
Author: Dougal Matthews <email address hidden>
Date: Wed Jan 11 14:12:44 2017 +0000

    Fail the deployment workflows when sending a "FAILED" message

    When Mistral workflows execute a second workflow (a sub-workflow
    execution), the parent workflow can't easily determine if sub-workflow
    failed. This is because the failure is communicated via a Zaqar message
    only and when a workflow ends with a successful Zaqar message it appears
    have been successful. This problem surfaces because parent workflows
    should have an "on-error" attribute but it is never called, as the
    workflow doesn't error.

    This change marks the workflow as failed if the message has the status
    "FAILED". Now when a sub-workflow fails, the task that called it should
    have the on-error triggered. Previously it would always go to
    on-success.

    Change-Id: Ic1781bcf3922da61d42970cf2b42c951f83f7a74
    Related-Bug: #1651704

Reviewed: https://review.openstack.org/418966
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c2fb9b025d727fdbfae27d1f25f58efae1cb7679
Submitter: Jenkins
Branch: master

commit c2fb9b025d727fdbfae27d1f25f58efae1cb7679
Author: Dougal Matthews <email address hidden>
Date: Wed Jan 11 14:21:06 2017 +0000

    Fail the scale workflows when sending a "FAILED" message

    When Mistral workflows execute a second workflow (a sub-workflow
    execution), the parent workflow can't easily determine if sub-workflow
    failed. This is because the failure is communicated via a Zaqar message
    only and when a workflow ends with a successful Zaqar message it appears
    have been successful. This problem surfaces because parent workflows
    should have an "on-error" attribute but it is never called, as the
    workflow doesn't error.

    This change marks the workflow as failed if the message has the status
    "FAILED". Now when a sub-workflow fails, the task that called it should
    have the on-error triggered. Previously it would always go to
    on-success.

    Change-Id: I73c5965627e45b8151f67aaf8025369bf6c870bb
    Related-Bug: #1651704

Dmitry Tantsur (divius) on 2017-01-23
Changed in ironic-inspector:
status: Incomplete → Invalid

This issue was fixed in the openstack/tripleo-common 5.8.0 release.

Reviewed: https://review.openstack.org/425047
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=5435d9d8235349c986de6fef07d347431ca243ba
Submitter: Jenkins
Branch: master

commit 5435d9d8235349c986de6fef07d347431ca243ba
Author: Dougal Matthews <email address hidden>
Date: Wed Jan 25 08:06:27 2017 +0000

    Use the builtin Mistral engine command 'fail' when failing workflows

    In changes 12f7345, c2fb9b0, 6051a74, c7b01eb and ab946c4 Mistral
    Workflows were marked as failed if they sent a Zaqar message with the
    status "FAILED". This is correct, but it turns out there is an easier
    way to do it.

    Mistral has a engine command designed for this purpose, using it removes
    some of the duplication we have incurred.

    We use on-success rather than on-complete to surface the best error on
    the workflow. If the Zaqar queue post fails, then the workflow will fail
    with this error. If, on-complete we then manually mark the workflow as
    failed it will ovrwrite the Zaqar error with a generic one.

    Closes-Bug: #1664918
    Related-Bug: #1651704
    Change-Id: I9ef9cfef1b8740a535e005769ec0c3ad67ecb103
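For illustration, the engine-command approach this commit describes might look like the following in a workflow definition (a hypothetical sketch with made-up workflow and queue names, not the actual tripleo-common source):

```yaml
# Illustrative Mistral v2 sketch: after posting the Zaqar message, the
# builtin 'fail' engine command marks the workflow as failed when the
# payload status is FAILED, so a parent workflow's on-error branch fires.
version: '2.0'

example.send_message:
  input:
    - status
    - message
  tasks:
    send_message:
      action: zaqar.queue_post
      input:
        queue_name: tripleo
        messages:
          body:
            status: <% $.status %>
            message: <% $.message %>
      on-success:
        # Guarded engine command: fail only when the reported status
        # is FAILED; otherwise the workflow completes successfully.
        - fail: <% $.status = "FAILED" %>
```

Using on-success (rather than on-complete) preserves the more specific error if the Zaqar post itself fails, as the commit message explains.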
