Third Party CI Systems that use baremetal target nodes are failing

Bug #1683902 reported by Michael Turek
Affects: Ironic
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: None

Bug Description

This bug is being opened after a conversation that started here:

http://lists.openstack.org/pipermail/openstack-dev/2017-April/115487.html

Both PowerKVM CI's ironic job and Dell's HW PXE-IPMItool job are failing with a timeout in '/opt/stack/new/ironic/devstack/lib/ironic:wait_for_nova_resources'.
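
For context, 'wait_for_nova_resources' is a polling loop: devstack enrolls the nodes and then waits for their resources (e.g. vcpus) to show up in nova's hypervisor stats. A simplified sketch of that kind of loop (not the actual devstack code; the expected count, timeout, and interval here are placeholders):

# Poll the hypervisor stats until the enrolled node's vcpus appear,
# or give up after ~20 minutes.
expected_vcpus=1
for i in $(seq 1 120); do
    vcpus=$(openstack hypervisor stats show -f value -c vcpus)
    [ "$vcpus" -ge "$expected_vcpus" ] && exit 0
    sleep 10
done
echo "Timed out waiting for nova resources" >&2
exit 1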

The following is output from ironic CLI calls made while 'wait_for_nova_resources' was looping:

$ source devstack/accrc/admin/admin

$ ironic node-show node-0
+------------------------+-----------------------------------------------+
| Property | Value |
+------------------------+-----------------------------------------------+
| boot_interface | |
| chassis_uuid | 77b66e65-e4c0-4bc1-a4ed-77c6373c57b0 |
| clean_step | {} |
| console_enabled | False |
| console_interface | |
| created_at | 2017-04-14T15:18:20+00:00 |
| deploy_interface | |
| driver | agent_ipmitool |
| driver_info | {u'deploy_kernel': |
| | u'cd57c951-f9d9-48bc-a2c1-eb4fd2048bbb', |
| | u'ipmi_address': u'*******', |
| | u'deploy_ramdisk': |
| | u'12a65420-1a2b-45f6-b486-bcbd03f7c764', |
| | u'ipmi_password': u'******', |
| | u'ipmi_username': u'******'} |
| driver_internal_info | {} |
| extra | {} |
| inspect_interface | |
| inspection_finished_at | None |
| inspection_started_at | None |
| instance_info | {} |
| instance_uuid | None |
| last_error | None |
| maintenance | False |
| maintenance_reason | None |
| management_interface | |
| name | node-0 |
| network_interface | |
| power_interface | |
| power_state | None |
| properties | {u'memory_mb': 51000, u'cpu_arch': u'ppc64el',|
| | u'local_gb': 500, u'cpus': 1} |
| provision_state | available |
| provision_updated_at | None |
| raid_config | |
| raid_interface | |
| reservation | None |
| resource_class | |
| target_power_state | None |
| target_provision_state | None |
| target_raid_config | |
| updated_at | None |
| uuid | 7d03ef35-bd9b-40ec-bf8e-fecb5c1200e5 |
| vendor_interface | |
+------------------------+-----------------------------------------------+

$ openstack hypervisor stats show
+----------------------+-------+
| Field | Value |
+----------------------+-------+
| count | 1 |
| current_workload | 0 |
| disk_available_least | 0 |
| free_disk_gb | 0 |
| free_ram_mb | 0 |
| local_gb | 0 |
| local_gb_used | 0 |
| memory_mb | 0 |
| memory_mb_used | 0 |
| running_vms | 0 |
| vcpus | 0 |
| vcpus_used | 0 |
+----------------------+-------+

In short, the properties from the node are not propagating to the hypervisor stats, which means 'wait_for_nova_resources' will loop indefinitely. We have also confirmed that the stats never make it to the database.
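
For reference, the database-side check is along these lines (assuming the standard nova schema, where the hypervisor stats are aggregated from the compute_nodes table):

$ mysql -u root -p -e "SELECT hypervisor_hostname, memory_mb, vcpus, local_gb FROM nova.compute_nodes;"

In our runs those columns stayed at 0, matching the zeroed hypervisor stats above.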

Vlad pointed out that these RabbitMQ errors are suspect:
https://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/ironic/25/454625/10/check-ironic/tempest-dsvm-ironic-agent_ipmitool/0520958/screen-ir-api.txt.gz

Vladyslav Drok (vdrok) wrote:

It seems to hit the gate too.

Changed in ironic:
status: New → Incomplete
status: Incomplete → Confirmed
importance: Undecided → Critical
Vladyslav Drok (vdrok) wrote:

Ah, it seems I looked at the wrong upstream job; this one is different. Also, the issue may not be in oslo.messaging after all, as that debug logging might actually be OK.

Changed in ironic:
importance: Critical → High
Vladyslav Drok (vdrok) wrote:

Another thing I've noticed: the _sync_power_states periodic task runs only once (you can search for _sync_power_states here: https://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/ironic/25/454625/10/check-ironic/tempest-dsvm-ironic-agent_ipmitool/0520958/screen-ir-cond.txt.gz), while the other periodic tasks run properly. I also see that your node has power_state None, which causes its inventory not to be picked up by nova.
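
For what it's worth, driver validation can surface a power-interface problem directly; 'ironic node-validate' prints a Result/Reason row per driver interface, and a broken BMC connection should show up under the 'power' row (exact Reason text varies):

$ ironic node-validate node-0
# look for Result=False and an IPMI error in the Reason column of the 'power' row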

Vladyslav Drok (vdrok) wrote:

I also see the failure to establish an IPMI connection in the logs Rajini provided in the mail thread: https://stash.opencrowbar.org/logs/52/456952/2/check/dell-hw-tempest-dsvm-ironic-pxe_ipmitool/315bd85/logs/screen-ir-cond.txt

Vladyslav Drok (vdrok) wrote:

Could you please try running the IPMI commands manually from the conductor node, to get the power status?
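
For reference, the manual check from the conductor node looks something like this (the address and credentials are placeholders, and '-I lanplus' may need adjusting for the BMC in question):

$ ipmitool -I lanplus -H <ipmi_address> -U <ipmi_username> -P <ipmi_password> power status
Chassis Power is on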

Michael Turek (mjturek) wrote:

@vdrok - You have found the solution! There was a problem communicating with the IPMI interface of our target node. We fixed that and are now getting successful runs again.

I guess this makes me wonder if we should be doing sanity checks on the node we're enrolling, but that seems out of scope for devstack.
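
If such a check were added, it could be as small as probing the BMC before enrollment. A hypothetical sketch (not existing devstack code; the variable names are made up):

# Refuse to enroll a node whose BMC does not answer a power query.
if ! ipmitool -I lanplus -H "$BMC_ADDR" -U "$BMC_USER" -P "$BMC_PASS" power status >/dev/null 2>&1; then
    echo "BMC at $BMC_ADDR is unreachable; skipping enrollment" >&2
    exit 1
fi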

Vladyslav Drok (vdrok) wrote:

OK, so for now I'm setting this as Invalid. If the Dell issue persists, it can be reopened (though Dell's case is different, I think; their logs have a clear message that ironic can't connect to the BMC).

Changed in ironic:
status: Confirmed → Invalid
importance: High → Undecided