OVB is stuck on providing nodes

Bug #1866204 reported by Sagi (Sergey) Shnaidman
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Steve Baker

Bug Description

While running OVB node registration and providing, the job gets stuck. It seems that node cleaning is enabled by default and does not work reliably.
In addition to disabling node cleaning, we need to ensure the job does not hang but instead fails with an appropriate message.
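
(For reference, a sketch of how cleaning is toggled at the TripleO level; the clean_nodes option in undercloud.conf maps onto ironic's automated cleaning, though the default varies by release:)

    # undercloud.conf (sketch)
    [DEFAULT]
    # Maps to ironic's [conductor]/automated_clean. When false, nodes skip
    # the cleaning step on their way to the "available" state.
    clean_nodes = false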

images script log:
https://logserver.rdoproject.org/36/711436/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/aeae9c5/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz

logs:
https://logserver.rdoproject.org/36/711436/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/aeae9c5/

1 node is in clean_failed:
https://logserver.rdoproject.org/36/711436/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/aeae9c5/logs/undercloud/var/log/extra/baremetal_list.txt.gz

Revision history for this message
wes hayutin (weshayutin) wrote :
tags: added: promotion-blocker
Revision history for this message
Luke Short (ekultails) wrote :

Regarding the CentOS 8 job, it cannot find yum. By default, there *should* be a symlink from yum to dnf. I don't know why it would be missing.

2020-03-05 13:59:52 | + sudo yum -y install python-tripleoclient
2020-03-05 13:59:52 | sudo: yum: command not found

https://logserver.rdoproject.org/32/25332/38/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/e5d131c/logs/undercloud/home/zuul/overcloud_image_build.log.txt.gz

Revision history for this message
Luke Short (ekultails) wrote :

Sagi mentioned that OVB on CentOS 8 is being tested in this review: https://review.rdoproject.org/r/#/c/25666/

Revision history for this message
Luke Short (ekultails) wrote :

The last Ansible task to run is restarting the network service on all of the nodes, and that fails: http://paste.openstack.org/show/790351/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Luke, that is a different bug; this one is about the problem with cleaning and providing OVB nodes.

Revision history for this message
Luke Short (ekultails) wrote :

For clarification, we have been looking into a number of different problems in this bug.

Issues:

1. OVB node cleaning failed
2. yum command not available
3. network service cannot be restarted (NIC configuration issue?)

Status:

1 = I will keep this bug (#1866204) focused on this.
2 = This is a temporary issue with a work-in-progress patch that is not currently a bug: https://review.rdoproject.org/r/#/c/25666/
3 = This is being worked on via: https://bugs.launchpad.net/tripleo/+bug/1866202

Revision history for this message
Luke Short (ekultails) wrote :

Ironic is failing to configure the Neutron ports.

Errors: http://paste.openstack.org/show/790362/ and http://paste.openstack.org/show/790360/
Full log: https://logserver.rdoproject.org/36/711436/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/aeae9c5/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz

I found a few BugZillas that list similar problems:

1. https://bugzilla.redhat.com/show_bug.cgi?id=1633287
2. https://bugzilla.redhat.com/show_bug.cgi?id=1806464

The first one links to another BZ where the workaround was to increase Neutron's configuration for `[AGENT]/report_interval` to a higher number of seconds. The second BZ sounds like the error occurred due to a misconfiguration of custom networks.
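
A sketch of that first workaround, assuming the stock neutron option (the section is written [agent] in neutron.conf, and the usual default is 30 seconds):

    # neutron.conf (sketch)
    [agent]
    # Seconds between agent state reports to the neutron server; the
    # linked BZ raised this value to give the agents more headroom.
    report_interval = 60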

Revision history for this message
wes hayutin (weshayutin) wrote :

Luke, I'm pretty sure cleaning should be turned off... or it used to be.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master)

Fix proposed to branch: master
Review: https://review.opendev.org/711759

Changed in tripleo:
assignee: nobody → Luke Short (ekultails)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/711767

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/711767
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=b943e993539824491d79f88e90497045db2625e8
Submitter: Zuul
Branch: stable/stein

commit b943e993539824491d79f88e90497045db2625e8
Author: Wes Hayutin <email address hidden>
Date: Fri Mar 6 13:30:36 2020 -0700

    remove older py35 tox job, due to bugs in py35 itself

    Related-Bug: #1866204
    Change-Id: I2d4f1e83e6b9445c8d621ff5d353040b92cf0a7d

tags: added: in-stable-stein
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

I can help with this if there is anything I can look into

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Looking at the other logs from comment #9, there is a 409 Conflict raised on port creation in the neutron server.log[1]:

2020-03-05 12:41:26.872 34 INFO neutron.pecan_wsgi.hooks.translation [req-3b19b08d-4839-44bd-af36-80d09c63a88e 215ca4ea2b7f4aec9ad5c105ef427bf9 c339add0e84a4056aeddec66f0e030e1 - default default] POST failed (client error): There was a conflict when trying to complete your request.
2020-03-05 12:41:26.872 34 DEBUG neutron.pecan_wsgi.hooks.notifier [req-3b19b08d-4839-44bd-af36-80d09c63a88e 215ca4ea2b7f4aec9ad5c105ef427bf9 c339add0e84a4056aeddec66f0e030e1 - default default] No notification will be sent due to unsuccessful status code: 409 after /usr/lib/python2.7/site-packages/neutron/pecan_wsgi/hooks/notifier.py:79
2020-03-05 12:41:27.024 34 INFO neutron.wsgi [req-3b19b08d-4839-44bd-af36-80d09c63a88e 215ca4ea2b7f4aec9ad5c105ef427bf9 c339add0e84a4056aeddec66f0e030e1 - default default] 192.168.24.1 "POST /v2.0/ports HTTP/1.1" status: 409 len: 441 time: 1.6941121

Until neutron introduces locks to its IPAM[2], it looks like 409s need to be caught and retried by the callers, since concurrent port creation may sometimes raise a 409.

Once I've done a bit more research I'll raise a bug against ironic.

[1] https://logserver.rdoproject.org/36/711436/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/aeae9c5/logs/undercloud/var/log/containers/neutron/server.log.txt.gz
[2] https://specs.openstack.org/openstack/neutron-specs/specs/train/introduce-distributed-locks-to-ipam.html
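
A minimal illustration of the caller-side retry suggested above (not ironic's actual code; assumes openstacksdk and its ConflictException wrapper for HTTP 409):

    import time

    import openstack
    from openstack import exceptions

    def create_port_with_retry(conn, network_id, retries=3, delay=5):
        # Concurrent port creation can 409 until neutron's IPAM gains
        # distributed locks, so back off and retry a few times.
        for attempt in range(retries):
            try:
                return conn.network.create_port(network_id=network_id)
            except exceptions.ConflictException:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)

    # conn = openstack.connect(cloud='undercloud')  # hypothetical cloud name
    # port = create_port_with_retry(conn, network_id='<provisioning net uuid>')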

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

A retry might not be appropriate here, but I think we need to understand why 1 out of the 4 ports fails with this error:

  Neutron server returns request_ids: ['req-3b19b08d-4839-44bd-af36-80d09c63a88e']: Conflict: Host abc7b2e6-4604-4d02-852d-6c28d97156c7 is not connected to any segments on routed provider network '5919823b-d4d5-4225-8395-8621f943a411'. It should be connected to one.

Revision history for this message
Harald Jensås (harald-jensas) wrote :

Regarding:

Neutron server returns request_ids: ['req-3b19b08d-4839-44bd-af36-80d09c63a88e']: Conflict: Host abc7b2e6-4604-4d02-852d-6c28d97156c7 is not connected to any segments on routed provider network '5919823b-d4d5-4225-8395-8621f943a411'. It should be connected to one.

This may be a timing issue. The ironic-neutron-agent, which reports the host->segment mapping to neutron, uses a FixedIntervalLoopingCall[1]; the default is 30-second intervals[2]. We may be hitting an issue where 3 out of 4 nodes and their ports are registered in ironic when the agent syncs, and then ~30 seconds pass before the 4th node's data is reported to neutron.

[1] https://opendev.org/openstack/networking-baremetal/src/branch/master/networking_baremetal/agent/ironic_neutron_agent.py#L159-L162
[2] https://opendev.org/openstack/neutron/src/branch/master/neutron/conf/agent/common.py#L116-L119
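
For context, the reporting loop in [1] follows this oslo.service pattern (a sketch, not the agent's verbatim code):

    from oslo_service import loopingcall

    def _report_state():
        # Push this host's bridge_mappings/segment data to neutron.
        pass

    # FixedIntervalLoopingCall re-runs the callback every `interval`
    # seconds, so a node registered just after a sync can wait up to a
    # full interval (default 30s) before neutron learns its mappings.
    heartbeat = loopingcall.FixedIntervalLoopingCall(_report_state)
    heartbeat.start(interval=30)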

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/712078

Revision history for this message
Harald Jensås (harald-jensas) wrote :

https://logserver.rdoproject.org/78/712078/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/db6f03b/

That's one job passed with the delay added before providing nodes in https://review.opendev.org/712078.
Revision history for this message
Bob Fournier (bfournie) wrote : Re: [Bug 1866204] Re: OVB is stuck on providing nodes

nice

On Tue, Mar 10, 2020 at 1:05 PM Harald Jensås <email address hidden> wrote:

> https://logserver.rdoproject.org/78/712078/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/db6f03b/
>
> That's one job passed with the delay added before providing nodes in
> https://review.opendev.org/712078.
>

Revision history for this message
wes hayutin (weshayutin) wrote :

testing here:

[12:20:28] <weshay|ruck> hjensas, bfournie I included https://review.opendev.org/#/c/712078 in https://review.rdoproject.org/r/#/c/25332/
[12:20:39] <weshay|ruck> hjensas, 4 hangs

Revision history for this message
Alex Schultz (alex-schultz) wrote :

fyi from the original bug report/logs:

https://logserver.rdoproject.org/36/711436/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/aeae9c5/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz

2020-03-05 12:41:16 | TASK [Notice] ******************************************************************
2020-03-05 12:41:16 | Thursday 05 March 2020 12:41:16 +0000 (0:00:00.065) 0:00:00.250 ********
2020-03-05 12:41:16 | ok: [localhost] =>
2020-03-05 12:41:16 | msg: No nodes are manageable at this time.
2020-03-05 12:41:16 |

This was fixed by https://review.opendev.org/#/c/711543/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/712203

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (master)

Change abandoned by Harald Jensås (<email address hidden>) on branch: master
Review: https://review.opendev.org/712078
Reason: CI jobs failed because introspection was skipped.

Looking a bit more at a proper solution: we most likely need to validate the agents in the workflow. Since workflows are moving to Ansible, it makes sense to wait.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/712203
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=e10d6f4fe66447fcbf960f460cdba0c9304f3e34
Submitter: Zuul
Branch: master

commit e10d6f4fe66447fcbf960f460cdba0c9304f3e34
Author: Steve Baker <email address hidden>
Date: Tue Mar 10 22:31:09 2020 +0000

    Fail when there are no nodes to introspect

    The mistral workflow would fail when no nodes were in a manageable
    state to introspect. This change replicates that behaviour.

    One of the causes of bug #1866204 is that there is a reliance on the
    time it takes to do the introspection for the node segment mappings
    to populate. When introspection finishes instantly, provide ends up
    timing out because the port create fails.

    Change-Id: I8a27a098b3cd006f73c33d95f2d93c65cf629203
    Closes-Bug: #1866911
    Related-Bug: #1866204

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

I've just seen another source of provide timeouts: when nodes go to state 'clean failed', the provide command continues waiting. Possibly 'clean wait' is not considered an error state to poll for? I'll look now.
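
A sketch of the hazard being described (hypothetical helper; the point is that any terminal failure state missing from the error set gets waited on until the overall timeout):

    import time

    ERROR_STATES = {'error', 'clean failed'}  # must list every terminal failure

    def wait_for_available(get_provision_state, timeout=1800, poll=10):
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = get_provision_state()
            if state == 'available':
                return
            if state in ERROR_STATES:
                raise RuntimeError('node entered terminal state: %s' % state)
            time.sleep(poll)
        raise TimeoutError('node never became available')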

Revision history for this message
wes hayutin (weshayutin) wrote :

Thanks Steve!

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

In case of autoclean=true in the ironic conductor settings, providing will include cleaning: https://docs.openstack.org/ironic/latest/admin/cleaning.html#automated-cleaning
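
(For reference, a sketch of the conductor option in question; in ironic.conf it is spelled automated_clean:)

    # ironic.conf (sketch)
    [conductor]
    # When true, nodes are cleaned automatically on their way to
    # "available", so "provide" waits on cleaning to finish.
    automated_clean = true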

Revision history for this message
Luke Short (ekultails) wrote :

Has anything changed recently that might have improved the nodes' ability to be cleaned?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Luke Short (<email address hidden>) on branch: master
Review: https://review.opendev.org/711759
Reason: Temporary workaround no longer required.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/714823

Changed in tripleo:
assignee: Luke Short (ekultails) → Steve Baker (steve-stevebaker)
Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → Kevin Carter (kevin-carter)
Changed in tripleo:
assignee: Kevin Carter (kevin-carter) → Steve Baker (steve-stevebaker)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/714823
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=1bc900c749b23974e4156c1d56875b61a0742c6f
Submitter: Zuul
Branch: master

commit 1bc900c749b23974e4156c1d56875b61a0742c6f
Author: Steve Baker <email address hidden>
Date: Wed Mar 25 03:23:52 2020 +0000

    Wait for ironic-neutron-agent bridge_mappings before provide

    When provide is immediately run after import, it sometimes stalls
    indefinitely due to port creation failure.

    This fix avoids this issue by polling the ironic neutron agent
    associated with the host until:
    1. the agent entry exists for that host
    2. the agent configuration has something populated for bridge_mappings

    Change-Id: I7eb90fb0b532942825e32c43ebd057a28005c8ec
    Closes-Bug: #1866204
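
The shape of that wait, as a rough openstacksdk sketch (the merged change is Ansible in tripleo-ansible; the call names here are assumptions):

    import time

    import openstack

    def wait_for_bridge_mappings(conn, host, timeout=300, poll=10):
        # Wait until a neutron agent entry for `host` exists and reports
        # a non-empty bridge_mappings in its configuration.
        deadline = time.time() + timeout
        while time.time() < deadline:
            for agent in conn.network.agents(host=host):
                if (agent.configuration or {}).get('bridge_mappings'):
                    return True
            time.sleep(poll)
        return False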

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 1.3.0

This issue was fixed in the openstack/tripleo-ansible 1.3.0 release.
