Grenade job fails due to systemd stopping n-cpu

Bug #1744139 reported by Julia Kreger
This bug affects 1 person
Affects   Status         Importance   Assigned to    Milestone
Ironic    Fix Released   Critical     Julia Kreger

Bug Description

While investigating a grenade failure, we discovered that n-cpu was being stopped before the upgrade completed.

Jan 18 05:48:47.109960 ubuntu-xenial-inap-mtl01-0001976291 nova-compute[1669]: DEBUG ironicclient.common.http [None req-f2cddc54-4c89-4e83-9443-1c59de3fb2d9 None None] Error contacting Ironic server: Unable to establish connection to http://198.72.124.151:6385/v1/nodes/detail: HTTPConnectionPool(host='198.72.124.151', port=6385): Max retries exceeded with url: /v1/nodes/detail (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6c79f1e0d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)). Attempt 24 of 61 {{(pid=1669) wrapper /usr/local/lib/python2.7/dist-packages/ironicclient/common/http.py:199}}
Jan 18 05:48:49.185150 ubuntu-xenial-inap-mtl01-0001976291 systemd[1]: <email address hidden>: Main process exited, code=killed, status=11/SEGV
Jan 18 05:48:49.185907 ubuntu-xenial-inap-mtl01-0001976291 systemd[1]: <email address hidden>: Unit entered failed state.
Jan 18 05:48:49.185967 ubuntu-xenial-inap-mtl01-0001976291 systemd[1]: <email address hidden>: Failed with result 'signal'.

The connection errors in the log occur because the ironic-api service is stopped during the upgrade and the redirect to the other ironic API server has not yet been put in place. In this specific case, the redirect was loaded at 05:50, by which time n-cpu had already been stopped.

In reality, what appears to be occurring is that the nova conductor process spontaneously fails and enters a restart loop at some point between the glance and swift upgrades.

Jan 18 05:49:48.642840 ubuntu-xenial-inap-mtl01-0001976291 nova-conductor[19600]: INFO oslo_service.service [None req-db5ca533-98f7-4358-8448-0201fe9a04d8 None None] Child 20023 killed by signal 11
Jan 18 05:49:48.649246 ubuntu-xenial-inap-mtl01-0001976291 nova-conductor[19600]: DEBUG oslo_service.service [None req-db5ca533-98f7-4358-8448-0201fe9a04d8 None None] Started child 24359 {{(pid=19600) _start_child /usr/local/lib/python2.7/dist-packages/oslo_service/service.py:513}}
Jan 18 05:49:48.657002 ubuntu-xenial-inap-mtl01-0001976291 nova-conductor[19600]: INFO nova.service [-] Starting conductor node (version 16.0.5)
Jan 18 05:49:48.681224 ubuntu-xenial-inap-mtl01-0001976291 nova-conductor[19600]: INFO oslo_service.service [None req-db5ca533-98f7-4358-8448-0201fe9a04d8 None None] Child 24359 killed by signal 11
Jan 18 05:49:48.686965 ubuntu-xenial-inap-mtl01-0001976291 nova-conductor[19600]: DEBUG oslo_service.service [None req-db5ca533-98f7-4358-8448-0201fe9a04d8 None None] Started child 24360 {{(pid=19600) _start_child /usr/local/lib/python2.7/dist-packages/oslo_service/service.py:513}}
Jan 18 05:49:48.695042 ubuntu-xenial-inap-mtl01-0001976291 nova-conductor[19600]: INFO nova.service [-] Starting conductor node (version 16.0.5)

It appears that the service is failing, with a segmentation fault, as its underlying libraries are being upgraded.
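
To correlate these crashes with the upgrade steps, it can help to pull every signal-11 event out of the journal with its timestamp. The following is only a minimal sketch: the function name `find_segfaults` and both regular expressions are illustrative assumptions, written against the two message shapes quoted in this report.

```python
import re

# Matches the two message shapes quoted in this report (assumed to be
# the only ones of interest):
#   systemd:      "Main process exited, code=killed, status=11/SEGV"
#   oslo.service: "Child 20023 killed by signal 11"
SEGV_PATTERN = re.compile(r"status=11/SEGV|killed by signal 11\b")

# Journal prefix: "Jan 18 05:48:49.185150 <host> <unit>[<pid>]:"
PREFIX_PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d+) \S+ "
    r"(?P<unit>[^\[\s]+)\[(?P<pid>\d+)\]:"
)


def find_segfaults(lines):
    """Return (timestamp, unit) pairs for lines reporting a SIGSEGV."""
    events = []
    for line in lines:
        if not SEGV_PATTERN.search(line):
            continue
        prefix = PREFIX_PATTERN.match(line)
        if prefix:
            events.append((prefix.group("stamp"), prefix.group("unit")))
    return events


# Two lines shortened from the logs in this report, plus one unrelated line.
SAMPLE = [
    "Jan 18 05:48:49.185150 ubuntu-xenial-inap-mtl01-0001976291 systemd[1]: "
    "<email address hidden>: Main process exited, code=killed, status=11/SEGV",
    "Jan 18 05:49:48.642840 ubuntu-xenial-inap-mtl01-0001976291 "
    "nova-conductor[19600]: INFO oslo_service.service [None] "
    "Child 20023 killed by signal 11",
    "Jan 18 05:49:48.657002 ubuntu-xenial-inap-mtl01-0001976291 "
    "nova-conductor[19600]: INFO nova.service [-] Starting conductor node",
]

for stamp, unit in find_segfaults(SAMPLE):
    print(stamp, unit)
```

Sorting the resulting timestamps against the grenade stage markers would show which package upgrade immediately precedes the first crash.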

If this is truly a python library compatibility issue, then the only way to proceed is to upgrade nova as part of the upgrade. The downside is that the "upgraded" scenario we execute is one in which ironic-api is not upgraded and the nova service is not upgraded at all.

Without some sort of virtualenv or container level isolation, we are unable to really solve this issue short of identifying the exact library compatibility issue or upgrading. Just restarting the n-cpu service is not enough.

Changed in ironic:
status: Triaged → In Progress
Julia Kreger (juliaashleykreger) wrote :

Seems to be viable and seems to work:
- Upgrading nova - We then hit microversion issues.
- Restarting n-cpu. We should also try to restart n-cond and see if that impacts it.
- Python3 instead of python2 - Swift does not start.

description: updated
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/535594

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/535596

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/535594
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=e4925352a4c019beefceeb414b032621442ed30c
Submitter: Zuul
Branch: master

commit e4925352a4c019beefceeb414b032621442ed30c
Author: Julia Kreger <email address hidden>
Date: Thu Jan 18 20:02:19 2018 -0800

    Mark multinode job as non-voting

    As the multinode job is failing and is failing
    in ways that cannot currently be easily and cleanly
    rectified, we should disable voting for the job until
    we are able to fully identify the cause and proper
    solution in order to allow the project contributors
    to continue to land code and have it reviewed while
    we work to resolve the multinode grenade job issues.

    Change-Id: If204c7b979baa71b3b9bbb7e79d13741f580ba8b
    Related-Bug: #1744139

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.openstack.org/535391
Reason: No longer needed.

Ruby Loo (rloo) wrote :

From the whiteboard, our current plan:

(TheJulia) Given the above information, the following courses of action seem to exist:

- Upgrade ironic-api and nova to the state on the master branch. This is a departure from the "We will only upgrade ironic-conductor" stance and means we are not testing rolling-upgrade.
- Change grenade multinode to non-voting for the time being, while we continue to investigate.
   - (dtantsur) done, need to revert asap though
- Possibly fix negotiation logic in the client (patches in progress), and then fix nova.
   - There is likely no way to even get the client in nova before mid-next week if we were to land changes and request a release tomorrow (Friday), since it would be unlikely to actually be released until Monday.
   - This likely won't make g-r before Tuesday or Wednesday, and would then need to be landed in nova prior to being able to land code to negotiate. tl;dr this cannot be the only option.
   - Additional factor: This would mean tests can no longer expect that if nova is one version, ironic must be another version with perfect compatibility. I don't know if this would be an actual issue; however, tempest plugins should have already been released, and ironic has already released its tempest plugin.
   - nova patch: "Ironic: negotiate microversion to allow downgrade to Pike": https://review.openstack.org/#/c/535786/
   - ironic patch (dependent on nova patch): "Rework upgrade to upgrade nova/ironic": https://review.openstack.org/535596
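
The negotiation idea in the plan above can be sketched as follows. This is not the code from the linked patches: `negotiate` and `parse_version` are hypothetical helpers and the version numbers are made up. It assumes the client learns the server's supported range, which ironic advertises in the `X-OpenStack-Ironic-API-Minimum-Version` and `X-OpenStack-Ironic-API-Maximum-Version` response headers, and falls back to the highest version both sides support.

```python
def parse_version(text):
    """Turn a 'major.minor' string into a tuple so that 1.9 < 1.31."""
    major, minor = text.split(".")
    return int(major), int(minor)


def negotiate(wanted, server_min, server_max):
    """Return the highest version usable by both sides, or None.

    wanted     -- the microversion the client would prefer to use
    server_min -- lowest version the server advertises
    server_max -- highest version the server advertises
    """
    want = parse_version(wanted)
    low = parse_version(server_min)
    high = parse_version(server_max)
    if want < low:
        # The client is too old for this server; nothing to fall back to.
        return None
    # Downgrade to the server's maximum when the client asks for more.
    chosen = min(want, high)
    return "%d.%d" % chosen


# Hypothetical numbers: a newer client talking to an older server.
print(negotiate("1.38", "1.1", "1.31"))
```

Comparing versions as integer tuples rather than as strings or floats matters here, since "1.9" must sort below "1.31".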

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/544750
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=93f376f3450b611842816f7248943350e4fd1a73
Submitter: Zuul
Branch: master

commit 93f376f3450b611842816f7248943350e4fd1a73
Author: Julia Kreger <email address hidden>
Date: Wed Feb 14 16:27:27 2018 -0800

    Disable .pyc files for grenade multinode

    Ironic's grenade multinode suffers from some unique issues.

    * Nova can never be upgraded as there is a static pin that increments
      higher. As such, newer nova can never run with older ironic.
    * As Nova cannot be upgraded, it is left running throughout the test
      sequence.

    The above two conditions result in possible breaking package upgrades
    as the python environments are shared between the old and new
    installations.

    In order to better isolate the running processes, as would be in
    most actual production environments, we need to minimize underlying
    inter-reactions due to python library upgrades.

    Credit goes to Jim Rollenhagen for coming up with this idea, once the
    massive conundrum was fully explained. Thanks Jim!

    Partial-Bug: #1744139
    Change-Id: Ifdd119d9cdde2ead6c3e36862cc77da67d10f7d1
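
The commit message does not spell out the mechanism used. One standard way to stop CPython from writing .pyc files is the `PYTHONDONTWRITEBYTECODE` environment variable (equivalent to `python -B`); whether the change uses exactly this is an assumption. The sketch below only demonstrates the effect of that variable on a child interpreter, and `child_writes_bytecode` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def child_writes_bytecode(extra_env):
    """Start a child interpreter and report whether it would write .pyc files."""
    env = dict(os.environ)
    # Start from a clean slate so only extra_env decides the outcome.
    env.pop("PYTHONDONTWRITEBYTECODE", None)
    env.update(extra_env)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.dont_write_bytecode)"],
        env=env,
    )
    # sys.dont_write_bytecode is False when bytecode caching is enabled.
    return out.decode().strip() == "False"


print(child_writes_bytecode({}))
print(child_writes_bytecode({"PYTHONDONTWRITEBYTECODE": "1"}))
```

With the variable set, a long-running old process and a freshly installed new one no longer exchange compiled bytecode through a shared cache on disk, which is the kind of isolation the commit message describes.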

Vasyl Saienko (vsaienko) wrote :

We also need fixes to devstack that increase the number of retries when nova-compute connects to Ironic, as upgrading ironic may take some time:

https://review.openstack.org/536743
https://review.openstack.org/536744
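
The "Attempt 24 of 61" message in the log at the top of this report comes from this kind of bounded retry loop. A minimal, generic sketch of the pattern follows; it is not the ironicclient implementation, and `call_with_retries` and its parameters are hypothetical. Nova's `[ironic]` configuration section has `api_max_retries` and `api_retry_interval` options for tuning this; that the linked reviews adjust those options is an assumption.

```python
import time


def call_with_retries(func, max_attempts, interval, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is exhausted.

    Only ConnectionError is retried; any other exception propagates
    immediately. The last ConnectionError is re-raised on exhaustion.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Give the API server time to come back before retrying.
                sleep(interval)
    raise last_error


# Usage: a fake API call that is refused three times, then succeeds.
state = {"calls": 0}


def flaky_list_nodes():
    state["calls"] += 1
    if state["calls"] <= 3:
        raise ConnectionError("connection refused")
    return ["node-1"]


print(call_with_retries(flaky_list_nodes, max_attempts=5, interval=0))
```

The total time budget is roughly `max_attempts * interval`, so it needs to exceed the window during which ironic-api is down for the upgrade.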

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/queens)

Reviewed: https://review.openstack.org/545089
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=780ab207a1b2e46a7fd131f4cbb7c026f3a1b390
Submitter: Zuul
Branch: stable/queens

commit 780ab207a1b2e46a7fd131f4cbb7c026f3a1b390
Author: Julia Kreger <email address hidden>
Date: Wed Feb 14 16:27:27 2018 -0800

    Disable .pyc files for grenade multinode

    Partial-Bug: #1744139
    Change-Id: Ifdd119d9cdde2ead6c3e36862cc77da67d10f7d1
    (cherry picked from commit 93f376f3450b611842816f7248943350e4fd1a73)

tags: added: in-stable-queens
Dmitry Tantsur (divius)
Changed in ironic:
status: In Progress → Fix Released