[stein] standalone-upgrade failing tempest

Bug #1896537 reported by Rafael Folco
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

https://19862c796f0176e05bae-03c0af7271380f8d13c3735dacc9c317.ssl.cf2.rackcdn.com/750071/1/check/tripleo-ci-centos-7-standalone-upgrade-stein/23f97b8/logs/undercloud/home/zuul/tempest/tempest.html

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_minimum_basic.py", line 147, in test_minimum_basic_scenario
    server=server)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 501, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 116, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 209, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.116 via SSH timed out.
User: cirros, Password: None

Changed in tripleo:
importance: Undecided → Critical
Bhagyashri Shewale (bhagyashri-shewale) wrote :

I had some discussion with marios for this issue:

<bhagyashri|rover> marios, hi Good morning ! currently some tempest test are failing on standalone upgrade stein
<bhagyashri|rover> marios, and here i saw the some mysql related failure https://19862c796f0176e05bae-03c0af7271380f8d13c3735dacc9c317.ssl.cf2.rackcdn.com/750071/1/check/tripleo-ci-centos-7-standalone-upgrade-stein/23f97b8/logs/undercloud/var/log/extra/errors.txt.txt
<bhagyashri|rover> marios, so have you seen same error on any upgrade job failure ?
<marios> bhagyashri|rover: might be similar to https://bugs.launchpad.net/tripleo/+bug/1895822 (standalone-upgrade ussuri)
<openstack> Launchpad bug 1895822 in tripleo "centos8 standalone-upgrade-ussuri fails tempest ping router IP" [Critical,Triaged] - Assigned to Sergii Golovatiuk (sgolovatiuk)
<marios> bhagyashri|rover: in both cases it passes deployment & upgrade OK (https://19862c796f0176e05bae-03c0af7271380f8d13c3735dacc9c317.ssl.cf2.rackcdn.com/750071/1/check/tripleo-ci-centos-7-standalone-upgrade-stein/23f97b8/logs/undercloud/home/zuul/standalone_upgrade.log ) but it seems there is something happening during the upgrade that kills connectivity
<marios> bhagyashri|rover: hard to be sure without digging more but it might be similar
<bhagyashri|rover> marios, ack thank you :) is any one working on it.
* jtomasek (~jtomasek@27-143.gtt-net.cz) has joined
<marios> bhagyashri|rover: yeah we have a cix on that one so few folks have looked into it but we don't have root cause yet
<bhagyashri|rover> marios, ok thanks

Changed in tripleo:
assignee: nobody → Sergii Golovatiuk (sgolovatiuk)
Sergii Golovatiuk (sgolovatiuk) wrote :

I looked into it and saw that the external network subnet is the same as the subnet used for the control plane. This means the route for the FIP pointed to br-ctlplane instead of br-ex, while br-ex was the bridge used in the bridge mappings for the given provider network.

When I changed the bridge mappings to use br-ctlplane instead, the pings worked. However, this is not the solution, because I think the control plane subnet should differ from the external public subnet. The network configuration for the external network should be changed so that this works properly (separate external and control plane networks).
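
A quick way to confirm this kind of subnet clash is sketched below. It is only an illustration: the CIDR values are assumptions (192.168.24.0/24 is chosen because the floating IP in the traceback is 192.168.24.116, and 10.0.0.0/24 is a hypothetical separate external range), not values read from the job logs.

```python
import ipaddress


def subnets_overlap(cidr_a, cidr_b):
    """Return True when the two CIDR ranges share at least one address."""
    net_a = ipaddress.ip_network(cidr_a, strict=False)
    net_b = ipaddress.ip_network(cidr_b, strict=False)
    return net_a.overlaps(net_b)


# Assumed values for illustration only; substitute the real subnets.
CTLPLANE_CIDR = "192.168.24.0/24"
EXTERNAL_CIDR_SAME = "192.168.24.0/24"      # external shares the control plane range
EXTERNAL_CIDR_SEPARATE = "10.0.0.0/24"      # hypothetical dedicated external range

if __name__ == "__main__":
    for external in (EXTERNAL_CIDR_SAME, EXTERNAL_CIDR_SEPARATE):
        clash = subnets_overlap(CTLPLANE_CIDR, external)
        print(f"ctlplane={CTLPLANE_CIDR} external={external} overlap={clash}")
```

If the two ranges overlap, traffic for the floating IP can follow the control plane route rather than the provider network bridge, which is the situation described above.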

Changed in tripleo:
assignee: Sergii Golovatiuk (sgolovatiuk) → nobody
Alex Schultz (alex-schultz) wrote :

So the stein job is not running into the same issue as bug #1895822. This likely needs to be investigated separately.

Alex Schultz (alex-schultz) wrote :

For the record, this job didn't start failing consistently until 2020-09-05, which was after the ussuri job started failing on 2020-08-18.

Alex Schultz (alex-schultz) wrote :

selinux is enabled and there's an ovs denial

Alex Schultz (alex-schultz) wrote :

So we have a change where it passed in check but not in gate (https://review.opendev.org/#/c/750071/). The difference in the packages consisted of:

pass:
 openstack-selinux-0.8.23-0.20200821105100.f05f4b2.el7.noarch
 openstack-tripleo-heat-templates-10.6.3-0.20200905130014.2026fa4.el7.noarch
 puppet-neutron-14.4.1-0.20200404011627.4aae155.el7.noarch

fail:
 openstack-selinux-0.8.24-0.20200826121419.53b8b2e.el7.noarch
 openstack-tripleo-heat-templates-10.6.3-0.20200908173018.2026fa4.el7.noarch
 puppet-neutron-14.4.1-0.20200908161637.080fd53.el7.noarch

Given that this was a change to THT, it leaves openstack-selinux or puppet-neutron as the culprit.
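
The comparison above can be reproduced with a small sketch like the following. It assumes the release field has the form 0.<timestamp>.<short commit>.<dist>, which matches the six packages listed; the function names are hypothetical helpers, not part of any TripleO tooling.

```python
def parse_rpm(nvra):
    """Split name-version-release.arch into (name, version, short commit)."""
    nvr, _arch = nvra.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    # Assumed release layout: 0.<timestamp>.<short commit>.<dist>
    commit = release.split(".")[2]
    return name, version, commit


def changed_sources(passing, failing):
    """Names of packages whose version or source commit differs between runs."""
    before = {name: (ver, commit) for name, ver, commit in map(parse_rpm, passing)}
    after = {name: (ver, commit) for name, ver, commit in map(parse_rpm, failing)}
    return sorted(n for n in before if n in after and before[n] != after[n])


PASSING = [
    "openstack-selinux-0.8.23-0.20200821105100.f05f4b2.el7.noarch",
    "openstack-tripleo-heat-templates-10.6.3-0.20200905130014.2026fa4.el7.noarch",
    "puppet-neutron-14.4.1-0.20200404011627.4aae155.el7.noarch",
]
FAILING = [
    "openstack-selinux-0.8.24-0.20200826121419.53b8b2e.el7.noarch",
    "openstack-tripleo-heat-templates-10.6.3-0.20200908173018.2026fa4.el7.noarch",
    "puppet-neutron-14.4.1-0.20200908161637.080fd53.el7.noarch",
]

if __name__ == "__main__":
    print(changed_sources(PASSING, FAILING))
```

On the lists above, this flags openstack-selinux and puppet-neutron; the THT package differs only in build timestamp, since both builds carry the same short commit.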

The only change in puppet-neutron is https://review.opendev.org/#/c/749969/ which lowered the number of workers.

Alternatively I did notice that selinux is enabled on the stein job upstream (this should be permissive)

SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: permissive
Mode from config file: permissive
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 31

https://b711d185688da3b864bc-5593d50c131879f6a486eeedbad80e3c.ssl.cf2.rackcdn.com/750071/1/gate/tripleo-ci-centos-7-standalone-upgrade-stein/91e3191/logs/undercloud/var/log/extra/selinux.txt

And we have the following denial:
/var/log/audit/audit.log.1:type=AVC msg=audit(1599582437.978:2102): avc: denied { net_broadcast } for pid=5066 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability permissive=0

https://b711d185688da3b864bc-5593d50c131879f6a486eeedbad80e3c.ssl.cf2.rackcdn.com/750071/1/gate/tripleo-ci-centos-7-standalone-upgrade-stein/91e3191/logs/undercloud/var/log/extra/denials.txt

However we had the same denial on the passing job. So if this is code related, it likely is the number of workers change in puppet-neutron.
/var/log/audit/audit.log.1:type=AVC msg=audit(1599307009.320:2793): avc: denied { net_broadcast } for pid=5013 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability permissive=0
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a9d/750071/1/check/tripleo-ci-centos-7-standalone-upgrade-stein/a9d2f90/logs/undercloud/var/log/extra/denials.txt
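
To compare denials between two runs without eyeballing them, the records can be parsed into fields. The sketch below is a rough illustration using only the standard library; the regular expressions are an assumption fitted to the two denial lines quoted here, not a complete audit log parser.

```python
import re

# Matches the "avc: denied { perms } for key=value ..." part of an audit record.
AVC_RE = re.compile(r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}\s+for\s+(?P<fields>.*)")
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_avc(line):
    """Return a dict of the denial's fields, or None if the line is not an AVC denial."""
    match = AVC_RE.search(line)
    if match is None:
        return None
    record = {"perms": match.group("perms").split()}
    for key, value in FIELD_RE.findall(match.group("fields")):
        record[key] = value.strip('"')
    return record


# Denial line copied from the failing gate job above.
SAMPLE = (
    '/var/log/audit/audit.log.1:type=AVC msg=audit(1599582437.978:2102): '
    'avc: denied { net_broadcast } for pid=5066 comm="ovs-vswitchd" capability=11 '
    'scontext=system_u:system_r:openvswitch_t:s0 '
    'tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability permissive=0'
)

if __name__ == "__main__":
    denial = parse_avc(SAMPLE)
    print(denial["comm"], denial["perms"], denial["tclass"], denial["permissive"])
```

Comparing the comm, perms, and tclass fields from both jobs shows the same denial in each, which is why it does not distinguish the passing run from the failing one.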

SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: permissive
Mode from config file: permissive
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 31

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a9d/750071/1/check/tripleo-ci-centos-7-standalone-upgrade-stein/a9d2f90/logs/undercloud/var/log/extra/selinux.txt

The openstack-selinux changes are likely:
https://gith...

Alex Schultz (alex-schultz) wrote :

erm helps if I read. Current mode: permissive. So the reduction of ovn workers likely is causing upgrade problems

wes hayutin (weshayutin) wrote : Re: [Bug 1896537] Re: [stein] standalone-upgrade failing tempest

I was wondering about that :) Thanks for diving into it anyway! Marios
should be point on the upgrade jobs from our end. Thanks Alex

On Wed, Sep 23, 2020 at 4:00 PM Alex Schultz <email address hidden>
wrote:

> erm helps if I read. Current mode: permissive. So the reduction of ovn
> workers likely is causing upgrade problems

Sergii Golovatiuk (sgolovatiuk) wrote :

After reproducing the issue with the reproducer, I tried to boot a VM manually from the CLI. However, it failed to boot, showing "No bootable device" https://imgur.com/a/rvI0HFp . When I added cirros-0.4.0 to cinder, I was able to boot the VM. I then changed tempest.conf, after which the test passed.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/755401
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=79a1e3b13cac1770548e7bb7954bb5a89191b19a
Submitter: Zuul
Branch: master

commit 79a1e3b13cac1770548e7bb7954bb5a89191b19a
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Oct 1 00:33:07 2020 +0200

    Update cirros image from 3.6 to 4.0

    Sync to 0.4.0 across all repos

    Change-Id: I09a96521a6cc83826fd7168d7544172f075b3e06
    Related-Bug: #1896537

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/stein)

Reviewed: https://review.opendev.org/756138
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=449674660bff526493d2cf3702089e2ade9f4ec8
Submitter: Zuul
Branch: stable/stein

commit 449674660bff526493d2cf3702089e2ade9f4ec8
Author: Sergii Golovatiuk <email address hidden>
Date: Mon Oct 5 19:20:22 2020 +0200

    Fix cinder_volume upgrade tasks for stein

    Prior to Train, the <service>_node_names variables are only populated
    for services that have network definitions in the ServiceNetMap. In
    train, this code was moved from yaql to ansible so all enabled services
    get <service>_node_names defined. Additionally only docker was supported
    upstream for HA, however we did support running podman when using with
    RHEL8 so we need to make sure the container cli uses the expected cli.

    Related-Bug: #1896537
    Change-Id: If442e5b43781c1d68a126c5bcfb93ffacda78f2f

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/stein)

Change abandoned by Alex Schultz (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/754726

Changed in tripleo:
milestone: victoria-3 → wallaby-1
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released