Instances associated with multiple ironic nodes

Bug #1326364 reported by Derek Higgins
This bug affects 5 people
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Lucas Alvares Gomes
Milestone: 2014.2

Bug Description

Ironic node-list is showing 3 instances, each associated with 2 Ironic nodes:

| uuid | instance_uuid | power_state | provision_state | maintenance |
| 33ad2330-6bd9-4413-8955-50db3a55e537 | 1cb064ee-3eff-4799-a489-bd4f447f285c | power on | deploy failed | False |
| b2b93c39-59e9-46f9-9e1e-d3764b2b0ec8 | 1cb064ee-3eff-4799-a489-bd4f447f285c | power on | deploy failed | False |
| 01c3663f-31a2-4bfa-9b1b-2d8953fb1c86 | 9df33a91-4be3-429c-a712-35cd60a1e101 | power on | deploy failed | False |
| 3f9df53e-5a76-47b3-a1c4-f596822bb43d | 9df33a91-4be3-429c-a712-35cd60a1e101 | None | deploy failed | True |
| 9b9e8326-2fea-4aa9-820d-2003b262000e | bb59bea5-166a-4023-a122-ee793d21ccd2 | power on | deploy failed | False |
| a5318eae-3d81-4d62-9531-e5700c71e378 | bb59bea5-166a-4023-a122-ee793d21ccd2 | power on | deploy failed | False |

After a nova delete, all but these 3 instances were deleted:

| ID | Name | Status | Task State | Power State | Networks |
| 1cb064ee-3eff-4799-a489-bd4f447f285c | hw-test-1cb064ee-3eff-4799-a489-bd4f447f285c | ERROR | deleting | NOSTATE | ctlplane=10.10.16.168 |
| 9df33a91-4be3-429c-a712-35cd60a1e101 | hw-test-9df33a91-4be3-429c-a712-35cd60a1e101 | ERROR | deleting | NOSTATE | ctlplane=10.10.16.147 |
| bb59bea5-166a-4023-a122-ee793d21ccd2 | hw-test-bb59bea5-166a-4023-a122-ee793d21ccd2 | ERROR | deleting | NOSTATE | ctlplane=10.10.16.142 |

with the following error in the nova-compute log, from the request to
http://10.10.16.134:6385//v1/nodes/?instance_uuid=1cb064ee-3eff-4799-a489-bd4f447f285c

{"error_message": "{\"debuginfo\": null, \"faultcode\": \"Server\", \"faultstring\": \"Multiple rows were found for one()\"}"}
 log_http_response /opt/stack/venvs/nova/local/lib/python2.7/site-packages/ironicclient/common/http.py:105
2014-06-04 11:39:54.015 3833 WARNING ironicclient.common.http [req-46897941-7436-40b9-98e1-c56a039a6064 ] Request returned failure status.
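
For context, the "Multiple rows were found for one()" fault is SQLAlchemy's MultipleResultsFound error: the Ironic API looks up the node by instance_uuid expecting exactly one match, but two node rows carry the same instance_uuid. A minimal, self-contained sketch of that failure mode (the model below is hypothetical, not Ironic's real schema):

# Hypothetical model reproducing the "Multiple rows were found" error above.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.orm.exc import MultipleResultsFound

Base = declarative_base()


class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    instance_uuid = Column(String(36))  # note: no unique constraint


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Two node rows end up claiming the same instance...
session.add_all([Node(instance_uuid='1cb064ee'),
                 Node(instance_uuid='1cb064ee')])
session.commit()

try:
    # ...so a lookup that expects exactly one match raises.
    session.query(Node).filter_by(instance_uuid='1cb064ee').one()
except MultipleResultsFound as exc:
    print(exc)  # "Multiple rows were found ..." (wording varies by version)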

Tags: nova-driver
Revision history for this message
Dmitry Tantsur (divius) wrote :

Hi! How did you get several instances assigned to a node? Is it a valid condition?

Revision history for this message
Dmitry Tantsur (divius) wrote :

Also, could you please post full conductor and API logs here?

Changed in ironic:
status: New → Incomplete
Revision history for this message
Derek Higgins (derekh) wrote :

I believe the instance failed to deploy on the first node and was then tried on a second; is that possible? The logs are 8 GB; I'll see if I can filter out what's relevant.

aeva black (tenbrae)
Changed in ironic:
status: Incomplete → Triaged
importance: Undecided → High
Revision history for this message
Robert Collins (lifeless) wrote :

Seems like Ironic wants a unique constraint on the instance_uuid column?

summary: - Instances assosiated with multiple ironic nodes
+ Instances associated with multiple ironic nodes
Revision history for this message
aeva black (tenbrae) wrote :

Confirmed this on the TripleO hp1 test cluster just now -- it's got 3 ERROR/DELETING instances and 6 stuck nodes:

(undercloud)stack@bm-aw1az2-freecloud0001:/var/log/upstart$ ironic node-list | grep bb59bea5-166a-4023-a122-ee793d21ccd2
| 9b9e8326-2fea-4aa9-820d-2003b262000e | bb59bea5-166a-4023-a122-ee793d21ccd2 | power on | deploy failed | False |
| a5318eae-3d81-4d62-9531-e5700c71e378 | bb59bea5-166a-4023-a122-ee793d21ccd2 | power on | deploy failed | False |

(undercloud)stack@bm-aw1az2-freecloud0001:/var/log/upstart$ ironic node-list | grep 1cb064ee-3eff-4799-a489-bd4f447f285c
| 33ad2330-6bd9-4413-8955-50db3a55e537 | 1cb064ee-3eff-4799-a489-bd4f447f285c | power on | deploy failed | False |
| b2b93c39-59e9-46f9-9e1e-d3764b2b0ec8 | 1cb064ee-3eff-4799-a489-bd4f447f285c | power on | deploy failed | False |

(undercloud)stack@bm-aw1az2-freecloud0001:/var/log/upstart$ ironic node-list | grep 9df33a91-4be3-429c-a712-35cd60a1e101
| 01c3663f-31a2-4bfa-9b1b-2d8953fb1c86 | 9df33a91-4be3-429c-a712-35cd60a1e101 | power on | deploy failed | False |
| 3f9df53e-5a76-47b3-a1c4-f596822bb43d | 9df33a91-4be3-429c-a712-35cd60a1e101 | None | deploy failed | True |

The last node (3f9df53e-5a76-47b3-a1c4-f596822bb43d) seems to have had a hardware fault, but the other two are clearly repeating the same error.

Logs from the first one here:

http://paste.openstack.org/show/wOdHiJurWONIhlYrejCj/

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Hit this as well on a Debian-based TripleO run.

Revision history for this message
Nicholas Randon (nicholas-randon) wrote :

Debug info so far:
ironic node-list | grep 0c282e24-1a4a-4c29-8a91-db5d05fd2063
| b3396d19-2cad-4250-bede-67f478d54926 | 0c282e24-1a4a-4c29-8a91-db5d05fd2063 | power on | wait call-back | False |
| cc695acc-5674-44b6-b141-dc41de9b8778 | 0c282e24-1a4a-4c29-8a91-db5d05fd2063 | power off | deploy failed | False |

nova-compute.log snippets:
2014-06-04 23:07:32.712 6862 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 9.205668 sec
2014-06-04 23:07:32.934 6862 ERROR nova.openstack.common.loopingcall [-] in fixed duration looping call
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall Traceback (most recent call last):
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py", line 76, in _inner
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall self.f(*self.args, **self.kw)
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/ironic/nova/virt/ironic/driver.py", line 289, in _wait_for_active
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall raise exception.InstanceDeployFailure(msg)
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall InstanceDeployFailure: Failed to provision instance 0c282e24-1a4a-4c29-8a91-db5d05fd2063: None
2014-06-04 23:07:32.934 6862 TRACE nova.openstack.common.loopingcall
2014-06-04 23:07:33.159 6862 ERROR nova.compute.manager [req-331412f4-88b4-4ce5-a988-faf489f8c023 None] [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] Instance failed to spawn
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] Traceback (most recent call last):
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 2059, in _build_resources
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] yield resources
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 1962, in _build_and_run_instance
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] block_device_info=block_device_info)
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/ironic/nova/virt/ironic/driver.py", line 455, in spawn
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] timer.start(interval=CONF.ironic.api_retry_interval).wait()
2014-06-04 23:07:33.159 6862 TRACE nova.compute.manager [instance: 0c282e24-1a4a-4c29-8a91-db5d05fd2063] File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2014-06-04 23:07:33.159 6862 TRACE nova.comput...

Revision history for this message
Nicholas Randon (nicholas-randon) wrote :

Seen in another run:
+--------------------------------------+--------------------------------------+-------------+-----------------+-------------+
| uuid | instance_uuid | power_state | provision_state | maintenance |
+--------------------------------------+--------------------------------------+-------------+-----------------+-------------+
| db61bf65-d5b0-41e1-80a9-1655f70b507a | None | power off | None | False |
| a2899cab-f4ea-41b9-9cf3-98b8f47308f9 | 0e8cd220-6ac3-4abc-b8eb-b55e80a45c78 | power on | active | False |
| a8115733-de5b-4c37-8e95-fb140e70e4db | 0f96015c-0525-45a7-a12d-17a3a46a76cc | power on | active | False |
| 6ef41558-e899-4b0a-941a-597974bb5063 | 131ce90e-ef89-43f6-a59b-72e30c056160 | power on | active | False |
| 3b2f1771-b60b-4935-a621-72d3e3f66d76 | 161dd203-f292-4668-be6c-22bd8584489d | power on | active | False |
| afd6acbf-17db-43c9-af5c-28da6e900f69 | 1765d8b6-370b-476b-8052-08a6ee746d4a | power on | active | False |
| 433c3a1e-ac06-4c3e-bcb9-2ba5f9bd044b | 18e78f76-26c7-4bb1-9f7c-3318bf080c20 | power on | active | False |
| 8ad63730-7f46-4ef4-9846-aac0126f6046 | 2dcad886-724b-4955-965a-5691a629dcbc | power on | active | False |
| 6d98ae9e-4d14-416c-9227-c4ac80af1917 | 3858beb8-2a4a-479d-bb6e-a66bd27d81ba | power on | active | False |
| 4b4258ef-381a-4506-bbf4-b3b06745f470 | 3b657bec-f5fb-4b7f-b8f8-02fb70f27f04 | power on | active | False |
| 34efa8ef-4053-40a1-9428-3e1320284709 | 466056ce-f6cd-48c1-b357-3597e8679882 | power on | active | False |
| f883ffde-3c26-44df-9315-9246d6d8cb38 | 47d47691-ed18-4b8d-b309-03224889c71b | power on | active | False |
| ab793030-eeb3-4edc-8092-c39a868d4438 | 4b74e95b-9bb3-41f9-9906-2c557a1ab586 | power on | active | False |
| e6747375-f1de-419f-baa9-be4f4fc11e6c | 4b7687fa-2e50-4e22-8c65-fd2394ba906c | power on | active | False |
| 66c9bb8e-e165-47a6-b788-d59b05f15cce | 4e39da5a-107b-426e-97d9-b73cdcd1649c | power on | active | False |
| c5ceb450-a3c8-41a7-8c69-777991eb6c1d | 4f207768-65ce-493f-8bde-a7ddcd8ac48d | power on | active | False |
| df2fff0e-2ca7-4bf7-8941-781082cc4e78 | 528b1e52-da34-4d9e-83e6-4379a26a73f9 | power on | active | False |
| 9bed4e18-dbeb-4f52-a95f-5a8ae9119e61 | 72d1fd9a-ef88-4fa8-bbc8-fbb37f281160 | power on | active | False |
| d48bf64b-1d6b-4a44-bf83-574a7219e357 | 7ae2e84b-5a5a-4cb0-b9b4-8b6660f4062d | power on | active | False |
| 84bd4c28-2979-43fe-b26f-30aeef4999cd | 9bdd8553-ff0b-48cb-9607-6611ba92a0aa | power on | active | False |
| 0eb214d1-deb2-41ce-b22d-c11e3074c0a2 | aae532b9-1634-404e-8a13-a6236587d5bf | power on | active | False |
| 103d0729-be99-4c25-80c9-412431d21908 | ab625435-77b7-4c25-a81d-50af2ad9988f | power on | wait call-back | False |
| 5b158a3b-0de6-488d-86a6-02a2b9a59729 | ab625435-77b7-4c25-a81d-50af2ad9988f | power on | active | False |
| 825f9e8a-9348-41fa-8c21-3...


Revision history for this message
Robert Collins (lifeless) wrote :

It's spreading :)

| 00130b60-9844-47ef-9df0-b6007823eedd | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| 0caed410-db8e-4c3f-aedc-788704609e00 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| 22610bc3-c81f-4a12-ba5b-c959ca5abc32 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| 4608860a-598f-4508-9a9e-63c57ea3e569 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| 7c8f3c96-aec0-444d-ab48-502442aef7bd | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| 917f49bd-4921-47e0-94cf-960d5b8fa8e9 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| b90dcb51-912f-4b00-8898-64e98177b0f1 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| bd17fa1c-a43f-4635-a26d-9f62a0834d39 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| ce0aff60-8e24-4b3f-9783-e5b846c78653 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| cfa5c267-3a7c-4973-bdcf-80a139a947ea | 194cac82-71f3-46a3-889c-de9def060e47 | power on | deploy failed | False |
| f170fdf0-83b1-40e3-9452-ab65a89432f4 | 194cac82-71f3-46a3-889c-de9def060e47 | power on | wait call-back | False |
| 33ad2330-6bd9-4413-8955-50db3a55e537 | 1cb064ee-3eff-4799-a489-bd4f447f285c | power on | deploy failed | False |
| b2b93c39-59e9-46f9-9e1e-d3764b2b0ec8 | 1cb064ee-3eff-4799-a489-bd4f447f285c | power on | deploy failed | False |
| 0a2d7b01-4988-4440-9f42-0546de37cfe8 | 73e15d61-5566-426e-9dcb-4c97c78c6aeb | power on | wait call-back | False |
| 4741d825-ad77-48e5-a047-55002ebf9262 | 73e15d61-5566-426e-9dcb-4c97c78c6aeb | power on | wait call-back | False |
| 5be61545-60d3-43ce-8be6-aab1f5df86ed | 73e15d61-5566-426e-9dcb-4c97c78c6aeb | power on | deploy failed | False |
| 01c3663f-31a2-4bfa-9b1b-2d8953fb1c86 | 9df33a91-4be3-429c-a712-35cd60a1e101 | power on | deploy failed | False |
| 3f9df53e-5a76-47b3-a1c4-f596822bb43d | 9df33a91-4be3-429c-a712-35cd60a1e101 | None | deploy failed | True |
| 9b9e8326-2fea-4aa9-820d-2003b262000e | bb59bea5-166a-4023-a122-ee793d21ccd2 | power on | deploy failed | False |
| a5318eae-3d81-4d62-9531-e5700c71e378 | bb59bea5-166a-4023-a122-ee793d21ccd2 | power on | deploy failed | False |

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Will take a look at it.

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Dmitry Tantsur (divius)
tags: added: driver
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/98120

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/98121

aeva black (tenbrae)
tags: added: nova-driver
removed: driver
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Jim Rollenhagen (jim-rollenhagen)
Ruby Loo (rloo)
Changed in ironic:
assignee: Jim Rollenhagen (jim-rollenhagen) → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Been hitting this a lot and poking at it in devstack today. It seems to be more than simply nodes becoming disassociated from instances on failure, and smells more like something deeper in the scheduler. I'm curious what change broke this, as it was working okay previously.

It's easy to reproduce in devstack: simply enroll multiple VMs (IRONIC_VM_COUNT) and set deploy_callback_timeout to something low. You'll notice that, after the first failure, the instance gets rescheduled to multiple nodes, and to multiple other nodes after the second failure. I've attached a client-side log showing the transitions.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/98184
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=ffa8c91a951f37b3613a5abdb29c58fa917906cc
Submitter: Jenkins
Branch: master

commit ffa8c91a951f37b3613a5abdb29c58fa917906cc
Author: Jim Rollenhagen <email address hidden>
Date: Thu Jun 5 10:23:16 2014 -0700

    Fix concurrent deletes in virt driver

    The _unprovision function used a class member to track the number
    of calls to ironic to check the provision state. When more than one
    delete is happening concurrently, the deletes will both add to the
    same counter. This commit fixes this issue to use a local counter.

    Partial-Bug: #1326364

    Change-Id: I2552332445fe0e614970ec28eddf7c1ae94e34be
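
For what it's worth, the pattern described in that commit message can be illustrated with a small sketch; the class and method names below are made up for illustration and are not the actual nova Ironic driver code:

# Sketch of the shared-vs-local retry counter issue (illustrative names only).
MAX_TRIES = 3


class BuggyDriver(object):
    """Keeps the retry counter on the driver object, shared by every delete."""

    def __init__(self):
        self.tries = 0

    def _unprovision(self, node, polls_needed):
        # Every poll bumps the *shared* counter, so a second concurrent
        # delete inherits the attempts already made by the first one.
        for _ in range(polls_needed):
            self.tries += 1
            if self.tries >= MAX_TRIES:
                raise RuntimeError('gave up on %s too early' % node)


class FixedDriver(object):
    """Keeps the retry counter local to each call, as the fix does."""

    def _unprovision(self, node, polls_needed):
        tries = 0
        for _ in range(polls_needed):
            tries += 1
            if tries >= MAX_TRIES:
                raise RuntimeError('gave up on %s too early' % node)


buggy = BuggyDriver()
buggy._unprovision('node-1', polls_needed=2)      # fine on its own
try:
    buggy._unprovision('node-2', polls_needed=2)  # fails: counter already at 2
except RuntimeError as exc:
    print(exc)

fixed = FixedDriver()
fixed._unprovision('node-1', polls_needed=2)
fixed._unprovision('node-2', polls_needed=2)      # both succeed independently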

aeva black (tenbrae)
Changed in ironic:
milestone: none → juno-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/98120
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=332f562e5503099324364dbd2e68da77fd35624f
Submitter: Jenkins
Branch: master

commit 332f562e5503099324364dbd2e68da77fd35624f
Author: Lucas Alvares Gomes <email address hidden>
Date: Thu Jun 5 11:50:26 2014 +0100

    Add unique constraint to instance_uuid

    Add a unique constraint to the instance_uuid column of the nodes table
    and a migration script. The old node_instance_uuid index was dropped in
    favor of the new unique index.

    Partial-Bug: #1326364
    Change-Id: I005d3b3f252b18c506bab3297b1a5e3c7e7534d2
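
For reference, a unique constraint of this kind is typically added with an Alembic migration along the lines of the sketch below; the revision ids and the new index name are placeholders, not the exact migration that merged (the old node_instance_uuid index name comes from the commit message above):

# Illustrative Alembic migration sketch; identifiers are placeholders.
from alembic import op

revision = 'xxxxxxxxxxxx'       # placeholder revision id
down_revision = 'yyyyyyyyyyyy'  # placeholder parent revision


def upgrade():
    # Drop the old non-unique index and replace it with a unique one so
    # two nodes can never claim the same instance_uuid.
    op.drop_index('node_instance_uuid', table_name='nodes')
    op.create_index('uniq_nodes_instance_uuid', 'nodes',
                    ['instance_uuid'], unique=True)


def downgrade():
    op.drop_index('uniq_nodes_instance_uuid', table_name='nodes')
    op.create_index('node_instance_uuid', 'nodes', ['instance_uuid'])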

Revision history for this message
aeva black (tenbrae) wrote :

Closed by the following commit, which didn't get automatically linked here:

commit 4cc54684e45372f75096282b0768a4b6d5ca5c0c
Author: Adam Gandelman <email address hidden>
Date: Wed Jun 11 17:31:14 2014 -0700

    Destroy instance to clear node state on failure

    After a node has failed to deploy an instance, ensure the instance is
    destroyed from the virt driver. This clears the node state entirely before
    the instance gets rescheduled.

    Change-Id: I6a5d04f4c26c0910274783c0f686b0542af15f9c
    Closes-bug: #1326364.
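
The shape of that fix is roughly the sketch below: if the deploy fails while waiting for the node to go active, tear the instance back down on that node before re-raising, so the node releases its instance_uuid before the scheduler retries elsewhere. The class, helper methods, and client calls here are illustrative, not the exact driver code:

class IronicDriverSketch(object):
    # Illustrative only; not the real nova Ironic virt driver.

    def __init__(self, ironicclient):
        self.ironicclient = ironicclient

    def _start_deploy(self, node, instance):
        # Hypothetical stand-in for kicking off the deploy.
        self.ironicclient.node.set_provision_state(node.uuid, 'active')

    def _wait_for_active(self, node, instance):
        # Hypothetical stand-in for the polling loop that raises
        # InstanceDeployFailure when the deploy fails or times out.
        raise NotImplementedError

    def spawn(self, context, instance, node):
        self._start_deploy(node, instance)
        try:
            self._wait_for_active(node, instance)
        except Exception:
            # On deploy failure, clear the node completely before the
            # exception propagates and the instance is rescheduled, so the
            # old node no longer holds this instance_uuid.
            self.destroy(context, instance, node)
            raise

    def destroy(self, context, instance, node):
        # Tear down the deploy and disassociate the instance from the node.
        self.ironicclient.node.set_provision_state(node.uuid, 'deleted')
        self.ironicclient.node.update(
            node.uuid, [{'op': 'remove', 'path': '/instance_uuid'}])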

Changed in ironic:
status: In Progress → Fix Committed
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-2 → 2014.2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Lucas Alvares Gomes (<email address hidden>) on branch: master
Review: https://review.openstack.org/98121
