Undercloud vm in state error after update of the undercloud.

Bug #1777475 reported by Sofer Athlan-Guyot
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| OpenStack Compute (nova) | Fix Released | Undecided | Unassigned | |
| tripleo | Fix Released | Critical | Carlos Camacho | |

Bug Description

Hi,

After an update of the undercloud, the undercloud VMs are in ERROR state:

[stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 9f80c38a-9f33-4a18-88e0-b89776e62150 | compute-0 | ERROR | ctlplane=192.168.24.18 | overcloud-full | compute |
| e87efe17-b939-4df2-af0c-8e2effd58c95 | controller-1 | ERROR | ctlplane=192.168.24.9 | overcloud-full | controller |
| 5a3ea20c-75e8-49fe-90b6-edad01fc0a48 | controller-2 | ERROR | ctlplane=192.168.24.13 | overcloud-full | controller |
| ba0f26e7-ec2c-4e61-be8e-05edf00ce78a | controller-0 | ERROR | ctlplane=192.168.24.8 | overcloud-full | controller |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

Originally found starting at this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1590297#c14

It boils down to an ordering issue between openstack-ironic-conductor and openstack-nova-compute. A simple reproducer is:

sudo systemctl stop openstack-ironic-conductor
sudo systemctl restart openstack-nova-compute

on the undercloud.

Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: Triaged → In Progress
Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

From @ccamacho:

Apparently, this is a cosmetic issue, as the nodes are still operational.

We can re-set the node state with:

    (undercloud) [stack@undercloud ~]$ openstack server list
    +--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
    | ID | Name | Status | Networks | Image | Flavor |
    +--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
    | a8c447c1-7f47-42be-807a-c475dc76ea89 | overcloud-controller-0 | ERROR | ctlplane=192.168.24.6 | overcloud-full | oooq_control |
    | 8ef23886-298e-451b-8094-f10c6c6c02fb | overcloud-novacompute-0 | ERROR | ctlplane=192.168.24.11 | overcloud-full | oooq_compute |
    +--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
    (undercloud) [stack@undercloud ~]$ openstack server set --state active a8c447c1-7f47-42be-807a-c475dc76ea89
    (undercloud) [stack@undercloud ~]$ openstack server list
    +--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
    | ID | Name | Status | Networks | Image | Flavor |
    +--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
    | a8c447c1-7f47-42be-807a-c475dc76ea89 | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | oooq_control |
    | 8ef23886-298e-451b-8094-f10c6c6c02fb | overcloud-novacompute-0 | ERROR | ctlplane=192.168.24.11 | overcloud-full | oooq_compute |
    +--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+

In my case the nodes always worked; I just had to manually set the state to ACTIVE again.
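The manual reset above handles one server at a time. A minimal sketch of doing the same for every server in ERROR could look like the following. It assumes the `openstack` CLI is on the PATH with undercloud credentials loaded, and that `openstack server list -f json` emits the same column names as the table output (`ID`, `Status`). The helper names `error_server_ids` and `reset_error_servers` are hypothetical, not part of any OpenStack tool.

```python
import json
import subprocess


def error_server_ids(server_list_json):
    """Return the IDs of servers whose Status is ERROR.

    Expects the text printed by `openstack server list -f json`
    (assumption: keys match the table column names, e.g. "ID", "Status").
    """
    servers = json.loads(server_list_json)
    return [s["ID"] for s in servers if s.get("Status") == "ERROR"]


def reset_error_servers(run=subprocess.run):
    """Set every ERROR server back to active (hypothetical helper).

    `run` is injectable so the logic can be exercised without a cloud.
    """
    listing = run(
        ["openstack", "server", "list", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    ids = error_server_ids(listing)
    for server_id in ids:
        # Same command as the manual workaround, once per server.
        run(["openstack", "server", "set", "--state", "active", server_id],
            check=True)
    return ids
```

A blanket reset like this is only reasonable when the ERROR state is known to be the false positive described in this bug; it would also hide servers that failed for a real reason.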

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

Right, the undercloud nodes are working correctly, but this can look alarming to the user :)

tags: added: ux
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi Bogdan, 1777606 was meant to track the nova side of it, but I wrongly created it in tripleo and closed it as invalid.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to instack-undercloud (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/576909

Revision history for this message
Matthew Booth (mbooth-9) wrote :

The Nova fix should be to not call plug_vifs at all during ironic driver initialization. It probably isn't necessary for 'non-local' hypervisors in general, so I'm guessing this also applies to Power, Hyper-V, and VMware.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

Reading back, comment 6 may sound like a non-sequitur without additional context. The issue as described is caused if nova compute comes up before ironic conductor. Nova compute calls plug_vifs for all instances in init_host, which is passed to the driver. This calls out to ironic which, prior to the merging of https://review.openstack.org/#/c/576608/, returned a 400 to us if conductor hadn't started yet, causing nova to put the instance in an error state.

Note that this issue is resolved by the merging of the above ironic fix. With this fix, ironic will return a 503 to us, which will cause us to retry. If the retries time out we will fail initialization. At this point if service management restarts nova compute the retries will start again. We will never erroneously put ironic instances in an error state.

We landed a hack in nova to implement retries (https://review.openstack.org/#/c/576580/), so the issue is currently mitigated but not really fixed in Nova without the ironic fix.

In comment 6 I point out that the plug_vifs call is not required at all for certain drivers, including ironic. Not making this call at all is obviously the most robust solution from nova's perspective.
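To make the 400 versus 503 distinction concrete, here is a minimal sketch of the retry behaviour described in this comment. It is not nova's or ironicclient's actual code. The exception classes and `call_with_retry` are hypothetical stand-ins, and the 2 second interval is an assumption. The 60 second default only mirrors the "up to a minute" mentioned for the nova-side retry.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from ironic (no conductor available yet)."""


class BadRequest(Exception):
    """Stand-in for an HTTP 400, which is treated as a hard failure."""


def call_with_retry(operation, timeout=60.0, interval=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call `operation`, retrying on ServiceUnavailable until `timeout`.

    Any other exception (for example BadRequest) propagates immediately,
    which is the path that used to leave instances in an error state.
    """
    deadline = clock() + timeout
    while True:
        try:
            return operation()
        except ServiceUnavailable:
            if clock() >= deadline:
                # Retries exhausted: let the caller fail initialization.
                raise
            sleep(interval)
```

With a transient 503 the caller simply waits for the conductor to appear. If the deadline passes, the exception propagates, initialization fails, and a restart by service management starts the retries again.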

Changed in tripleo:
assignee: Sofer Athlan-Guyot (sofer-athlan-guyot) → Carlos Camacho (ccamacho)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to instack-undercloud (master)

Reviewed: https://review.openstack.org/576165
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=aab11800fefc42ebae37109cf30076aa88c4d88b
Submitter: Zuul
Branch: master

commit aab11800fefc42ebae37109cf30076aa88c4d88b
Author: Sofer Athlan-Guyot <email address hidden>
Date: Mon Jun 18 17:07:13 2018 +0200

    Make sure we start nova-compute after ironic-conductor.

    We need to ensure that ironic-conductor starts before nova-compute.
    This is to workaround an issue where nova-compute tries and fails to
    call plug_vifs, this in turn report a vm_state error which, in this
    case is a false positive. See lp#1777608 for more.

    We ensure ordering by forcing puppet to restart nova-compute after
    ironic-conductor in the case of undercloud upgrade/update.

    Change-Id: Ifbada53f088258a397777a6fa18dd7c1b37c09d3
    Closes-Bug: #1777475

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We need a similar fix for t-h-t/puppet in order to fix this for the containerized undercloud, which is going to be the default installation method in Rocky. The instack-only fix is not complete, so I am reopening.

Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This should be fixed in Nova. We cannot reliably enforce start-up ordering for containerized nova and ironic. Even if we did that at install time only, we would still need to make sure the same order is maintained after a node reboot.

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi,

so when this[1] is merged in queens, we won't need the strong dependency anymore.

[1] https://review.openstack.org/#/c/576608/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/576948
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ebdd39deea589654958773b8a75f3b181c51fabe
Submitter: Zuul
Branch: stable/queens

commit ebdd39deea589654958773b8a75f3b181c51fabe
Author: Dan Smith <email address hidden>
Date: Tue Jun 19 09:34:23 2018 -0700

    Be graceful about vif plugging in early ironic driver startup

    During ironic driver startup, we re-plug VIFs (for potentially
    dubious reasons). If we're racing with ironic startup, which is
    likely on single-node (i.e. undercloud) installations, we will
    attempt to do the plug and can fail because no ironic conductors
    are available to do the job. It sounds like ironic should be
    returning 500 in this case, and maybe the long-term solution is
    to do that and have ironicclient manage the retry for us.

    For the moment, ironic shows its hand in the message, indicating
    the conductor reasoning. This patch makes us retry for up to a
    minute waiting for that to clear, before actually failing.

    Conflicts:
        nova/tests/unit/virt/ironic/test_driver.py

    NOTE(lyarwood): Test conflict due to _get_cached_node being
    introduced in Rocky by Id96e7e513f469b87992ddae1431cce714e91ed16.

    Change-Id: I6450ceb54bd85945b76d3ac46882e1fea6b82742
    Related-Bug: #1777475
    (cherry picked from commit 07f027d2b068caa1ebde81e476ce01f55a0fa5b1)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to instack-undercloud (stable/queens)

Reviewed: https://review.openstack.org/576909
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=cd986d3ea26702f228c4b8ab5229b90f22bb27ca
Submitter: Zuul
Branch: stable/queens

commit cd986d3ea26702f228c4b8ab5229b90f22bb27ca
Author: Sofer Athlan-Guyot <email address hidden>
Date: Mon Jun 18 17:07:13 2018 +0200

    Make sure we start nova-compute after ironic-conductor.

    We need to ensure that ironic-conductor starts before nova-compute.
    This is to workaround an issue where nova-compute tries and fails to
    call plug_vifs, this in turn report a vm_state error which, in this
    case is a false positive. See lp#1777608 for more.

    We ensure ordering by forcing puppet to restart nova-compute after
    ironic-conductor in the case of undercloud upgrade/update.

    Change-Id: Ifbada53f088258a397777a6fa18dd7c1b37c09d3
    Closes-Bug: #1777475
    (cherry picked from commit 801fb3ced610cfdec9a19dc7b5296677f18904fe)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/instack-undercloud 8.4.3

This issue was fixed in the openstack/instack-undercloud 8.4.3 release.

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/instack-undercloud 9.2.0

This issue was fixed in the openstack/instack-undercloud 9.2.0 release.

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Artom Lifshitz (notartom) wrote :

Looks like the patch in comment #12 has addressed this from Nova's POV, so I'm setting this as Fix Released.

Changed in nova:
status: New → Fix Released