fs020(both queens/master) tempest tests failing while booting an instance

Bug #1757111 reported by yatin
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Alex Schultz

Bug Description

Tempest tests are failing while waiting for the instance to become ACTIVE:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/0f6d715/undercloud/home/jenkins/tempest/tempest.html.gz
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/8ebe831/undercloud/home/jenkins/tempest/tempest.html.gz

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/0f6d715/overcloud-novacompute-bar-0/var/log/containers/nova/nova-compute.log.txt.gz#_2018-03-20_02_01_40_640
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/8ebe831/overcloud-novacompute-bar-0/var/log/containers/nova/nova-compute.log.txt.gz#_2018-03-20_01_57_50_041

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/0f6d715/overcloud-novacompute-bar-0/var/log/containers/neutron/openvswitch-agent.log.txt.gz#_2018-03-20_03_37_22_980

The last pass for queens was on 13 March: https://review.rdoproject.org/jenkins/job/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/77/consoleText

That job ^^ doesn't have overcloud logs because of https://bugs.launchpad.net/tripleo/+bug/1755891 (fixed recently).
This job does have overcloud logs, but it is not the last one that passed: https://review.rdoproject.org/jenkins/job/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/63/consoleText

and for master on 6th March: https://review.rdoproject.org/jenkins/job/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-master/738/consoleText

Last passed jobs which have logs:-

I reproduced it locally and noticed the following:
Instance spawning failed because the openvswitch agent was not running on the compute node:
the neutron-ovs-agent container failed to start because the openvswitch services were not running.

ovsdb-server, ovs-vswitchd, openvswitch, and the neutron-ovs containers on the compute node were not running.
After starting them all in sequence, the tempest run completed successfully.

Also note that featureset020 has network_isolation=false, which differs from the other jobs running in the promotion pipeline.

I also noticed the following, but I am not sure whether it is relevant here:
- Running os-net-config with a config.json (containing type: ovs_bridge) started the ovs services. I think this is why environments running with network isolation haven't hit this issue.
- A few days ago all services, including the neutron ovs agent, started to run in containers by default ([1]). That might also be the cause, but I haven't tested it.
[1] https://review.openstack.org/#/c/548554/2/overcloud-resource-registry-puppet.j2.yaml

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/554528

John Trowbridge (trown)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
milestone: none → rocky-1
tags: added: alert ci promotion-blocker quickstart
Alex Schultz (alex-schultz) wrote :

In looking into the failures, there are a few things to note. We found that openvswitch is not running on the compute node, which may be the reason for some of the tempest failures. In trying to understand why it's not running, I found that the openvswitch service is enabled in the Pike overcloud-full.qcow2 image but is no longer enabled in the Queens and master images.

After pulling the various images from RDO (https://images.rdoproject.org/master/rdo_trunk/tripleo-ci-testing/), I listed /etc/systemd/system in each image and compared the results:

guestfish -a overcloud-full.qcow2 run : mount /dev/sda / : find /etc/systemd/system > pike.txt
guestfish -a overcloud-full.qcow2 run : mount /dev/sda / : find /etc/systemd/system > queens.txt
guestfish -a overcloud-full.qcow2 run : mount /dev/sda / : find /etc/systemd/system > master.txt

2:37 PM ☁ tmp ➜ diff pike.txt queens.txt
4a5
> /ceph.target.wants/ceph-mgr.target
18a20
> /multi-user.target.wants/ceph-mgr.target
38d39
< /multi-user.target.wants/openvswitch.service
2:37 PM ☁ tmp ➜ diff pike.txt master
diff: master/pike.txt: No such file or directory
2:37 PM ☁ tmp ➜ diff pike.txt master.txt
4a5
> /ceph.target.wants/ceph-mgr.target
18a20
> /multi-user.target.wants/ceph-mgr.target
38d39
< /multi-user.target.wants/openvswitch.service
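
The comparison above can also be scripted so that only the entries dropped from the older image are printed. This is a minimal sketch, assuming the listings were saved as pike.txt and queens.txt by the guestfish commands above; the function name removed_entries is hypothetical and not part of any TripleO tooling.

```shell
#!/bin/sh
# Hypothetical helper: print lines that appear in the first listing
# but not in the second (exact, whole-line comparison).
removed_entries() {
    old_listing=$1
    new_listing=$2
    # -F: fixed strings, -x: match whole lines, -v: invert the match,
    # -f: read the patterns (one per line) from the newer listing.
    grep -F -x -v -f "$new_listing" "$old_listing"
}

# Example usage; runs only if both listings exist in the current directory.
if [ -f pike.txt ] && [ -f queens.txt ]; then
    removed_entries pike.txt queens.txt
fi
```

Given the diff output above, this should report /multi-user.target.wants/openvswitch.service as the only entry present in the Pike listing and missing from the Queens one.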

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/555056

Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Alex Schultz (alex-schultz) wrote :

It should be noted that because the service isn't started automatically on boot, it may be running into a race condition with the ovs agent docker container. The ovs service will get started at some point when the ifup-ovs script is run, which checks whether the service is running and starts it if not. But I observed that the service wasn't running, and the error for ovsdb-server was that /var/run/openvswitch/db.sock was a directory. This could be because the neutron-ovs-agent container tries to mount the socket file directly, and docker may create it as a directory if it does not exist. So if the neutron-ovs-agent is launched before openvswitch is started, it may prevent openvswitch from mounting.
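
To confirm the symptom described above on a compute node, one can check what kind of object sits at the socket path. This is a minimal sketch; the function name path_kind is hypothetical, and the default path is the one quoted in the ovsdb-server error above.

```shell
#!/bin/sh
# Hypothetical helper: report what kind of filesystem object is at a path.
path_kind() {
    if [ -S "$1" ]; then
        echo socket        # expected state once ovsdb-server is running
    elif [ -d "$1" ]; then
        echo directory     # the broken state described in the comment above
    elif [ -e "$1" ]; then
        echo other
    else
        echo missing       # openvswitch has not been started yet
    fi
}

# Path taken from the comment above; pass another path as the first
# argument to check something else.
sock=${1:-/var/run/openvswitch/db.sock}
echo "$sock: $(path_kind "$sock")"
```

A result of "directory" would be consistent with the container having been started before openvswitch created its socket.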

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/555077

Arx Cruz (arxcruz) wrote :

Arx Cruz (arxcruz) wrote :

Ignore comment #6, it was supposed to be in another bug. Too many tabs opened...

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by John Trowbridge (<email address hidden>) on branch: master
Review: https://review.openstack.org/554528
Reason: actual fix in tripleo-common for this: https://review.openstack.org/555056

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/555056
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=05009b32ef027f06f604f491e4e95e283d0dfc73
Submitter: Zuul
Branch: master

commit 05009b32ef027f06f604f491e4e95e283d0dfc73
Author: Alex Schultz <email address hidden>
Date: Wed Mar 21 15:07:43 2018 -0600

    Add openvswitch element back in

    The change Ie4f5d771a16ea453b470be8ea103b2bde4aa892a switched
    os-net-config to be installed as a package rather than an element.
    Unfortunately the openvswitch service was being configured via the
    openvswitch dependency from the os-net-config element. The switch to use
    the package lost the automatic starting of openvswitch on the nodes
    which has manifested itself as intermittent issues on compute nodes when
    network isolation is not deployed.

    Change-Id: I0f865d4811919c19577c75615b66d7d8a1e685d3
    Partial-Bug: #1757111

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/555077
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ccfc1e9d1fcfb215cd8de28839404c9cf84ca0a3
Submitter: Zuul
Branch: master

commit ccfc1e9d1fcfb215cd8de28839404c9cf84ca0a3
Author: Alex Schultz <email address hidden>
Date: Wed Mar 21 16:04:58 2018 -0600

    Mount openvswitch dir rather than socket

    If openvswitch is not started (meaning the socket file doesn't exist)
    and the docker container launches first, docker may create a folder for
    the db.sock file which would prevent ovs from starting up later. We
    should mount the directory since ovs could be started after the docker
    containers.

    Change-Id: I0aaed5c73c1c1485ad61202f3fca53348ef5a669
    Closes-Bug: #1757111

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/555791

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/555801

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.openstack.org/555801
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=68b57092554418604c50ca176f38aa8c73a12c07
Submitter: Zuul
Branch: stable/queens

commit 68b57092554418604c50ca176f38aa8c73a12c07
Author: Alex Schultz <email address hidden>
Date: Wed Mar 21 16:04:58 2018 -0600

    Mount openvswitch dir rather than socket

    If openvswitch is not started (meaning the socket file doesn't exist)
    and the docker container launches first, docker may create a folder for
    the db.sock file which would prevent ovs from starting up later. We
    should mount the directory since ovs could be started after the docker
    containers.

    Change-Id: I0aaed5c73c1c1485ad61202f3fca53348ef5a669
    Closes-Bug: #1757111
    (cherry picked from commit ccfc1e9d1fcfb215cd8de28839404c9cf84ca0a3)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/555791
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=4d7c530175dfdeaca3295a8ca6a88dbfa9e2f349
Submitter: Zuul
Branch: stable/queens

commit 4d7c530175dfdeaca3295a8ca6a88dbfa9e2f349
Author: Alex Schultz <email address hidden>
Date: Wed Mar 21 15:07:43 2018 -0600

    Add openvswitch element back in

    The change Ie4f5d771a16ea453b470be8ea103b2bde4aa892a switched
    os-net-config to be installed as a package rather than an element.
    Unfortunately the openvswitch service was being configured via the
    openvswitch dependency from the os-net-config element. The switch to use
    the package lost the automatic starting of openvswitch on the nodes
    which has manifested itself as intermittent issues on compute nodes when
    network isolation is not deployed.

    Change-Id: I0f865d4811919c19577c75615b66d7d8a1e685d3
    Partial-Bug: #1757111
    (cherry picked from commit 05009b32ef027f06f604f491e4e95e283d0dfc73)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0b1 development milestone.

Alan Pevec (apevec) wrote :

> This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0b1 development milestone.

It is actually 8.0.1 for Queens https://review.openstack.org/556972

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.0.0.0b2

This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0b2 development milestone.
