starting multiple VMs causes some to be stuck in BUILDING

Bug #1051066 reported by Joe Gordon
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| OpenStack Compute (nova) | Fix Released | Critical | Vish Ishaya | |

Bug Description

It appears some of the VMs are not getting private IPs.

From Devstack:

vagrant@precise:~/devstack$ euca-run-instances ami-00000001 -t m1.tiny -n 8
RESERVATION r-fv50jn6l ecb58b362a9845fea121c8b5f6f30f6c default
INSTANCE i-00000037 ami-00000001 server-c23963ad-2be5-4415-a3d1-51dbac1257a4 server-c23963ad-2be5-4415-a3d1-51dbac1257a4 pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-00000038 ami-00000001 server-35618d88-47ce-450c-8e27-903cc41e5c43 server-35618d88-47ce-450c-8e27-903cc41e5c43 pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-00000039 ami-00000001 server-41c78a55-2e0f-4807-8698-2612723ed01e server-41c78a55-2e0f-4807-8698-2612723ed01e pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-0000003a ami-00000001 server-3b9cf5af-b1fa-4dba-a497-b3d89dac81ac server-3b9cf5af-b1fa-4dba-a497-b3d89dac81ac pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-0000003b ami-00000001 server-b7be7f0b-2c67-4d20-a790-fd3a578fcb6e server-b7be7f0b-2c67-4d20-a790-fd3a578fcb6e pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-0000003c ami-00000001 server-fccd49bf-2de6-457e-9a19-d2817e9a2cd8 server-fccd49bf-2de6-457e-9a19-d2817e9a2cd8 pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-0000003d ami-00000001 server-2288a020-825d-405a-a7bb-6a6dd1a8a544 server-2288a020-825d-405a-a7bb-6a6dd1a8a544 pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
INSTANCE i-0000003e ami-00000001 server-e4dc8891-db72-4cc5-8149-75af6c1a1746 server-e4dc8891-db72-4cc5-8149-75af6c1a1746 pending None (ecb58b362a9845fea121c8b5f6f30f6c, None) 0 m1.tiny 2012-09-14T20:17:26.000Z unknown zone aki-00000002 ari-00000003 monitoring-disabled instance-store
vagrant@precise:~/devstack$ nova list
+--------------------------------------+---------------------------------------------+--------+------------------+
| ID | Name | Status | Networks |
+--------------------------------------+---------------------------------------------+--------+------------------+
| 2288a020-825d-405a-a7bb-6a6dd1a8a544 | Server 2288a020-825d-405a-a7bb-6a6dd1a8a544 | BUILD | |
| 35618d88-47ce-450c-8e27-903cc41e5c43 | Server 35618d88-47ce-450c-8e27-903cc41e5c43 | ACTIVE | private=10.0.0.3 |
| 3b9cf5af-b1fa-4dba-a497-b3d89dac81ac | Server 3b9cf5af-b1fa-4dba-a497-b3d89dac81ac | ACTIVE | private=10.0.0.5 |
| 41c78a55-2e0f-4807-8698-2612723ed01e | Server 41c78a55-2e0f-4807-8698-2612723ed01e | ACTIVE | private=10.0.0.4 |
| b7be7f0b-2c67-4d20-a790-fd3a578fcb6e | Server b7be7f0b-2c67-4d20-a790-fd3a578fcb6e | ACTIVE | private=10.0.0.6 |
| c23963ad-2be5-4415-a3d1-51dbac1257a4 | Server c23963ad-2be5-4415-a3d1-51dbac1257a4 | ACTIVE | private=10.0.0.2 |
| e4dc8891-db72-4cc5-8149-75af6c1a1746 | Server e4dc8891-db72-4cc5-8149-75af6c1a1746 | BUILD | |
| fccd49bf-2de6-457e-9a19-d2817e9a2cd8 | Server fccd49bf-2de6-457e-9a19-d2817e9a2cd8 | BUILD | |
+--------------------------------------+---------------------------------------------+--------+------------------+
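For anyone trying to reproduce this, a small helper along the following lines can tally the Status column of pasted `nova list` output, so stuck instances are easy to spot across runs. This is only a sketch: `count_statuses` is a hypothetical helper, not part of novaclient, and it assumes the ASCII table layout shown above.

```python
from collections import Counter


def count_statuses(table_text):
    """Count the values in the Status column of a `nova list`-style table.

    Hypothetical helper: assumes pipe-delimited rows and a header row
    containing a "Status" column, as in the output pasted in this report.
    """
    counts = Counter()
    status_idx = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders, prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if status_idx is None:
            # The first pipe-delimited row is the header.
            status_idx = cells.index("Status")
            continue
        counts[cells[status_idx]] += 1
    return counts


if __name__ == "__main__":
    sample = """
    +----+----------+--------+------------------+
    | ID | Name | Status | Networks |
    +----+----------+--------+------------------+
    | a1 | Server a1 | BUILD | |
    | b2 | Server b2 | ACTIVE | private=10.0.0.3 |
    +----+----------+--------+------------------+
    """
    print(dict(count_statuses(sample)))
```

Running it twice, a few minutes apart, shows whether the BUILD count ever drops or whether those instances are stuck for good.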

Vish Ishaya (vishvananda) wrote :

Any traceback?

Changed in nova:
status: New → Triaged
importance: Undecided → High
importance: High → Critical
milestone: none → folsom-rc1
Boris Filippov (bfilippov) wrote :

Tried to reproduce it on devstack

[karadain@localhost devstack]$ euca-run-instances ami-00000001 -t m1.tiny -n 8
RESERVATION r-7bvqwocx d857f20c4265495bb0fdc43d21d67e36 default
INSTANCE i-0000000a ami-00000001 server-fe6ac7c3-6a64-48f6-8f33-d0cd7de1dc5d server-fe6ac7c3-6a64-48f6-8f33-d0cd7de1dc5d pending 0 m1.tiny 2012-09-15T13:30:38.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-0000000b ami-00000001 server-2169a834-a050-4c1f-84ee-8e664b36d9c2 server-2169a834-a050-4c1f-84ee-8e664b36d9c2 pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-0000000c ami-00000001 server-11f9ff3a-593d-482d-88b3-b20ff55e04a9 server-11f9ff3a-593d-482d-88b3-b20ff55e04a9 pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-0000000d ami-00000001 server-9938ecfb-a124-4e79-8790-1b1974c08fc2 server-9938ecfb-a124-4e79-8790-1b1974c08fc2 pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-0000000e ami-00000001 server-0541c3b2-11d1-46b4-9334-45ec47ac70ef server-0541c3b2-11d1-46b4-9334-45ec47ac70ef pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-0000000f ami-00000001 server-e940f970-980e-4804-85b6-1f1ad6eb1153 server-e940f970-980e-4804-85b6-1f1ad6eb1153 pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-00000010 ami-00000001 server-6e399d2d-d471-4672-becc-499a4130ffc8 server-6e399d2d-d471-4672-becc-499a4130ffc8 pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003
INSTANCE i-00000011 ami-00000001 server-c91fee0d-4ccd-4386-8729-214529e25119 server-c91fee0d-4ccd-4386-8729-214529e25119 pending 0 m1.tiny 2012-09-15T13:30:39.000Z unknown zone aki-00000002 ari-00000003

[karadain@localhost devstack]$ nova list
+--------------------------------------+---------------------------------------------+--------+-------------------+
| ID | Name | Status | Networks |
+--------------------------------------+---------------------------------------------+--------+-------------------+
| 0860309f-6b29-4d8c-b0e8-bea23698d002 | Server 0860309f-6b29-4d8c-b0e8-bea23698d002 | ACTIVE | private=10.0.0.8 |
| 2d700ad3-5710-456c-8435-72f41ba4e3df | Server 2d700ad3-5710-456c-8435-72f41ba4e3df | BUILD | |
| 5e63f6f3-aff4-4268-96fc-7477ef362264 | Server 5e63f6f3-aff4-4268-96fc-7477ef362264 | ACTIVE | private=10.0.0.9 |
| 6e0a7cda-142f-4d48-bace-aa2428f8c021 | Server 6e0a7cda-142f-4d48-bace-aa2428f8c021 | BUILD | |
| aa67312a-b9e2-4714-bda5-f582e764d176 | Server aa67312a-b9e2-4714-bda5-f582e764d176 | ACTIVE | private=10.0.0.2 |
| adc012d6-70a4-4fb2-b236-69d5d701f0af | Server adc012d6-70a4-4fb2-b236-69d5d701f0af | ACTIVE | private=10.0.0.3 |
| c04f7d2f-6757-48ef-9782-e320547076ac | Server c04f7d2f-6757-48ef-9782-e320547076ac | ACTIVE | private=10.0.0.10 |
| e688309e-ad59-454e-9baa-d6e1564931da | Server e688309e-ad59-454e-9baa-d6e1564931da | BUILD | |
+--------------------------------------+--------------------...

Joe Gordon (jogo) wrote :

When I launch one VM at a time I can spawn 8, but when I launch all 8 at once some fail:

http://paste.openstack.org/show/20957/

This appears to be related to free_ram_mb now being updated dynamically from the compute node rather than derived from a deterministic calculation. When another VM is about to start, the reported amount of free memory drops, which causes the other VMs to fail to start.

In the past the RAM information was derived from a deterministic calculation; now a non-deterministic formula is used.
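To make the hypothesis above concrete, here is a toy model of it. It is not nova code: `ToyHost`, `launch`, and the numbers (a 4096 MB host, 512 MB per instance, 256 MB of extra memory held while an instance is building) are all assumptions chosen for illustration. The point is only that a free-RAM check which sees transient build-time usage can accept every request when they arrive one by one, yet reject some when they arrive together.

```python
from dataclasses import dataclass


@dataclass
class ToyHost:
    """Hypothetical stand-in for a compute node; not nova's data model."""
    total_ram_mb: int
    building: int = 0  # instances still being built
    active: int = 0    # instances that finished building

    def free_ram_mb(self, flavor_ram_mb, build_overhead_mb):
        # Assumption: a building instance temporarily uses extra memory
        # on top of its flavor's RAM, and the host reports that usage.
        used = (self.building + self.active) * flavor_ram_mb
        used += self.building * build_overhead_mb
        return self.total_ram_mb - used


def launch(host, count, flavor_ram_mb, build_overhead_mb, one_at_a_time):
    """Return how many of `count` requests pass the free-RAM check."""
    accepted = 0
    for _ in range(count):
        free = host.free_ram_mb(flavor_ram_mb, build_overhead_mb)
        if free < flavor_ram_mb:
            continue  # request rejected: not enough reported free RAM
        host.building += 1
        accepted += 1
        if one_at_a_time:
            # The build completes before the next request is checked,
            # so the transient overhead is released.
            host.building -= 1
            host.active += 1
    return accepted


if __name__ == "__main__":
    print(launch(ToyHost(4096), 8, 512, 256, one_at_a_time=True))
    print(launch(ToyHost(4096), 8, 512, 256, one_at_a_time=False))
```

With these made-up numbers the sequential run accepts all 8 requests and the batch run accepts 5. The real values depend on the host and on how free_ram_mb is actually reported.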

Joe Gordon (jogo) wrote :

While the non-deterministic approach may allow for higher usage, it is much harder to debug, since running the same command twice can give different results.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/13159

Changed in nova:
assignee: nobody → Vish Ishaya (vishvananda)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/13172

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/13159
Committed: http://github.com/openstack/nova/commit/91734bad9139555294fe088d2c2d77a9712652ab
Submitter: Jenkins
Branch: master

commit 91734bad9139555294fe088d2c2d77a9712652ab
Author: Vishvananda Ishaya <email address hidden>
Date: Mon Sep 17 16:02:06 2012 -0700

    Fixes error handling during schedule_run_instance

    If there are not enough hosts available during a multi-instance launch,
    every failing instance should be updated to error state, instead of
    just the first instance. Currently only the first instance is set
    to Error and the rest stay in building.

    This patch makes a number of fixes to error handling during scheduling.

     * Moves instance faults into compute utils so they can be created
       from the scheduler.
     * Moves error handling into the driver so that each instance can be
       updated separately.
     * Sets an instance fault for failed scheduling
     * Sets task state back to none if there is a scheduling failure
     * Modifies chance scheduler to stop returning a list of instances
       as it is not used.
     * Modifies tests to check for these states.

    In addition to the included tests, the code was manually verified on
    a devstack install

    Fixes bug 1051066
    Fixes bug 1019017

    Change-Id: I49267ce4a21e2f7cc7a996fb2ed5d625f6794730
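
The behaviour this commit message describes (handle each instance separately, mark every failing one as error, record a fault, and reset the task state) can be sketched roughly as below. This is an illustration only, not the code from the patch: `NoValidHost`, `pick_host`, `schedule_all`, and the dict-based instance records are hypothetical names for this example.

```python
class NoValidHost(Exception):
    """Raised by the toy scheduler when no host can take an instance."""


def pick_host(capacity):
    """Hypothetical host selection: consume one slot or raise NoValidHost."""
    if capacity["free_slots"] <= 0:
        raise NoValidHost("no valid host was found")
    capacity["free_slots"] -= 1
    return "toy-host"


def schedule_all(instances, free_slots):
    """Schedule each instance on its own so one failure does not hide others.

    `instances` is a list of dicts with 'uuid', 'vm_state' and 'task_state'.
    """
    capacity = {"free_slots": free_slots}
    for inst in instances:
        try:
            inst["host"] = pick_host(capacity)
        except NoValidHost as exc:
            # Every failing instance is updated, not just the first one.
            inst["vm_state"] = "error"
            inst["task_state"] = None
            inst["fault"] = str(exc)
    return instances


if __name__ == "__main__":
    pending = [
        {"uuid": "inst-%d" % i, "vm_state": "building",
         "task_state": "scheduling"}
        for i in range(8)
    ]
    for inst in schedule_all(pending, free_slots=5):
        print(inst["uuid"], inst["vm_state"], inst.get("fault"))
```

In this sketch, a request for 8 instances with room for only 5 leaves 3 instances in the error state with a fault attached, rather than stuck in building.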

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/13172
Committed: http://github.com/openstack/nova/commit/0cba85cb267994018c8a0d5e40b2ed0b5a7837df
Submitter: Jenkins
Branch: master

commit 0cba85cb267994018c8a0d5e40b2ed0b5a7837df
Author: Vishvananda Ishaya <email address hidden>
Date: Mon Sep 17 16:09:41 2012 -0700

    Improve error handling of scheduler

    Modifies scheduler errors to report instance faults and to set
    instance_state back to None on failure.

    Related to bug 1051066

    Change-Id: Id9f36a75370849db7baf3fe24ce96c6f4284255d

Changed in nova:
status: Fix Released → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/13294

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (milestone-proposed)

Reviewed: https://review.openstack.org/13294
Committed: http://github.com/openstack/nova/commit/47e606a95a3a6396e30c825cd6ff913613a85b06
Submitter: Jenkins
Branch: milestone-proposed

commit 47e606a95a3a6396e30c825cd6ff913613a85b06
Author: Vishvananda Ishaya <email address hidden>
Date: Mon Sep 17 16:09:41 2012 -0700

    Improve error handling of scheduler

    Modifies scheduler errors to report instance faults and to set
    instance_state back to None on failure.

    Related to bug 1051066
    Fixes bug 1052993

    Change-Id: Id9f36a75370849db7baf3fe24ce96c6f4284255d
    (cherry picked from commit 0cba85cb267994018c8a0d5e40b2ed0b5a7837df)

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc1 → 2012.2