test_update_router_admin_state test failed with Unable to connect to port 22

Bug #1968732 reported by chandan kumar
Affects: tripleo | Status: Fix Released | Importance: Critical | Assigned to: yatin

Bug Description

https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-containers-multinode-wallaby&skip=0
and the corresponding RHEL job in the RHOS-17 RHEL-9 integration pipeline (periodic-tripleo-ci-rhel-9-containers-multinode-rhos-17) are failing with the same error in tempest, see [1] and [2]:
```
0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state [355.391581s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 131, in _get_ssh_connection
        ssh.connect(self.host, port=self.port, username=self.username,
      File "/usr/lib/python3.9/site-packages/paramiko/client.py", line 368, in connect
        raise NoValidConnectionsError(errors)
    paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 192.168.24.165

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
        return f(*func_args, **func_kwargs)
      File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 557, in test_update_router_admin_state
        self._check_public_network_connectivity(
      File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 212, in _check_public_network_connectivity
        self.check_vm_connectivity(
      File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 952, in check_vm_connectivity
        self.get_remote_client(ip_address, username, private_key,
      File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 723, in get_remote_client
        linux_client.validate_authentication()
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
        return function(self, *args, **kwargs)
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 117, in validate_authentication
        self.ssh_client.test_connection_auth()
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 240, in test_connection_auth
        connection = self._get_ssh_connection()
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 150, in _get_ssh_connection
        raise exceptions.SSHTimeout(host=self.host,
    tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.165 via SSH timed out.
    User: cirros, Password: None

```

By comparing a passing job (https://6696badaf160a21814ac-f8803ca7587e43f6b66c9edae98e760b.ssl.cf1.rackcdn.com/837455/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/cd5e873/logs/subnode-1/var/log/extra/errors.txt)
with a failing job (https://057b70f24ff595a5f3b8-448e63f3241620894d174670888496e6.ssl.cf5.rackcdn.com/837512/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/a2518f7/logs/subnode-1/var/log/extra/errors.txt), the failing job shows:

```
2022-04-12 12:28:43.660 ERROR /var/log/containers/nova/nova-api.log: 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [f125b2b9-d1f6-4535-b795-a3e41539b09e] AMQP server on centos-9-stream-rax-iad-0029292021.ctlplane.localdomain:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
```

I am not sure what is causing the above error. Moving this test to the skiplist until we investigate.

Logs:
[1]. https://057b70f24ff595a5f3b8-448e63f3241620894d174670888496e6.ssl.cf5.rackcdn.com/837512/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/a2518f7/logs/undercloud/var/log/tempest/tempest_run.log

[2]. https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f65/837516/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/f65ef99/logs/undercloud/var/log/tempest/tempest_run.log

[3]. https://2a0929bb28433d9a9dd5-585b06f1274d5f0e5ed27601745f1d4b.ssl.cf2.rackcdn.com/837055/1/gate/tripleo-ci-centos-9-containers-multinode/e946bbd/logs/undercloud/var/log/tempest/stestr_results.html

Revision history for this message
Jakob Meng (jm1337) wrote :

The CirrOS instance created by this tempest test cannot get its cloud-init metadata, although the network is up (the VM got a DHCP lease for 10.100.0.9):

```
Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sending select for 10.100.0.9
udhcpc: lease of 10.100.0.9 obtained, lease time 43200
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 15.28. request failed
failed 2/20: up 64.50. request failed
failed 3/20: up 113.69. request failed
failed 4/20: up 162.87. request failed
failed 5/20: up 212.06. request failed
failed 6/20: up 261.25. request failed
```

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_323/836988/3/check/tripleo-ci-centos-9-containers-multinode-wallaby/323752e/logs/undercloud/var/log/tempest/stestr_results.html
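
For reference, a minimal way to re-run the same check by hand from the guest console (assuming console access to the CirrOS instance; busybox wget is used since the guest userland is minimal, and the gateway IP is the one from the log above):

```
# the exact URL the cirros init scripts poll 20 times during boot
wget -qO- http://169.254.169.254/2009-04-04/instance-id

# sanity-check the path the request takes: routing table and default gateway
ip route
ping -c 3 10.100.0.1
```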

Revision history for this message
yatin (yatinkarel) wrote :

Had a quick look and it looks like the issue is random, as there are both passing and failing runs across multiple builds [1][2]. It is OK to add the test to the skiplist temporarily until the issue is root-caused.

Also didn't find any similar failure on master, so it looks Wallaby-specific.

The issue happens because the metadata requests from the VM are failing [3] as below, and as per the metadata agent log [4], no request was received for the server 2a0389f3-c36c-4ec1-ac80-81cfbb380d4f (IP 10.100.0.6).

```
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 12.55. request failed
failed 2/20: up 61.73. request failed
failed 3/20: up 110.92. request failed
failed 4/20: up 160.16. request failed
failed 5/20: up 209.38. request failed
failed 6/20: up 258.58. request failed
```

[1] https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-containers-multinode-wallaby&skip=0
[2] https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-containers-multinode&branch=stable%2Fwallaby&skip=0
[3] https://88e7be3fd9763a3635e4-f525513b10cd88ade5723da63b385b8c.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/tripleo-ci/master/tripleo-ci-centos-9-containers-multinode-wallaby/a881ed1/logs/undercloud/var/log/tempest/stestr_results.html
[4] https://88e7be3fd9763a3635e4-f525513b10cd88ade5723da63b385b8c.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/tripleo-ci/master/tripleo-ci-centos-9-containers-multinode-wallaby/a881ed1/logs/subnode-1/var/log/containers/neutron/ovn-metadata-agent.log
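
A sketch of how one can check on the compute node whether such a request ever reaches the OVN metadata service; the ovnmeta-<network-uuid> namespace naming follows the ML2/OVN convention, <net-uuid> is a placeholder, and the fixed IP is the one from this failure:

```
# list the OVN metadata namespaces created by ovn-metadata-agent
sudo ip netns list | grep ovnmeta

# inside the namespace, haproxy should be listening on 169.254.169.254:80
sudo ip netns exec ovnmeta-<net-uuid> ss -tlnp

# check whether the VM's fixed IP is reachable from the namespace (the
# return path the metadata reply would take)
sudo ip netns exec ovnmeta-<net-uuid> ping -c 3 10.100.0.6
```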

Revision history for this message
Jakob Meng (jm1337) wrote :

Jobs based on the OpenStack master branch(es), e.g. periodic-tripleo-ci-centos-9-containers-multinode-master, are not affected because OpenStack master (unsure which component exactly) behaves differently from OpenStack Wallaby:

On master, a (cloud-init) config_drive is attached to the compute instance, and the instance then reads its instance id and other cloud-init data from it. On Wallaby, no config_drive is attached to the compute instance, so the instance always tries to fetch the cloud-init metadata over the network, which randomly fails.

One can observe the different behavior for master and wallaby by looking at tempest's debug output; it shows that tempest does not ask for a config_drive. For example, in a failing wallaby job such as [1] you will see '"config_drive": ""' in the response to 'GET http://192.168.24.8:8774/v2.1/servers/1e409626-7d97-40da-b480-c61c2cb214cc'. In a successful master job such as [2], look for '"config_drive": "True"' in the response to 'GET http://192.168.24.19:8774/v2.1/servers/12de3ce6-aaf9-4fc8-ad30-32086712c3bb'.
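
Outside of tempest, the same field can be checked with the openstack CLI while the server still exists (the UUID below is just the one from the failing wallaby run, for illustration):

```
# prints an empty value on wallaby (no config drive), "True" on master
openstack server show 1e409626-7d97-40da-b480-c61c2cb214cc -c config_drive -f value
```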

The code for the failing metadata query (see 'failed 1/20: up 12.55. request failed' in previous comments) is in [3] and is called by [4], but only on wallaby. Unfortunately, cirros does not output anything when running these scripts via 'S45-cirros-net-ds' and 'S46-cirros-apply-net' [5].

So it looks as if tempest has not changed its behavior from wallaby to master. Maybe this config_drive enablement comes from a change in TripleO or in Nova?

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-wallaby/be1c327/logs/undercloud/var/log/tempest/testrepository.subunit.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-master/f076d7e/logs/undercloud/var/log/tempest/testrepository.subunit.gz
[3] https://github.com/cirros-dev/cirros/blob/master/src/lib/cirros/ds/ec2
[4] https://github.com/cirros-dev/cirros/blob/master/src/sbin/cirros-ds
[5] https://github.com/cirros-dev/cirros/tree/master/src/etc/rc3.d

Revision history for this message
yatin (yatinkarel) wrote :

<< So it looks as if tempest has not changed its behavior from wallaby to master. Maybe this config_drive enablement comes from a change in TripleO or in Nova?

OK, I found it's TripleO that changed the behavior, and that was by mistake; it was switched with [1].
Also found that in another patch [2] DVR was switched off by mistake. I tried switching back to the Wallaby behavior in [3] and the issue is reproducible [4][5], just like on wallaby. We should fix DVR and force_config_drive on master, but only after the root cause is clear and the issue is fixed.
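
A quick way to confirm on an overcloud compute node how TripleO rendered this into nova's config (the container config path is a TripleO assumption):

```
# force_config_drive=True means nova attaches a config drive to every
# instance, which bypasses the OVN metadata path exercised by this test
sudo grep -n force_config_drive \
  /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
```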

I also tried to debug on an environment where the issue reproduced; there I ran the same test multiple times and it sometimes passes and sometimes fails.

Also, when it reproduces, I noticed the following (see the command sketch after this list):
1) ping to the FIP works fine from the undercloud
2) ping to the private IP fails from the OVN metadata namespace
3) the same happens from inside the VM to the OVN metadata namespace interface IP (10.100.0.2)
4) after running a recompute (ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute) everything works fine again
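
Roughly, checks 1), 2) and 4) as commands; the FIP and private IP are the ones from this reproducer, while ovnmeta-<net-uuid> follows the ML2/OVN namespace convention and <net-uuid> is a placeholder:

```
# 1) FIP reachability from the undercloud - keeps working even while stuck
ping -c 3 192.168.24.152

# 2) private-IP reachability from the OVN metadata namespace - fails when stuck
sudo ip netns exec ovnmeta-<net-uuid> ping -c 3 10.100.0.14

# 4) force a full recompute in ovn-controller; connectivity recovers afterwards
sudo ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute
```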

I tried to capture the ovn-trace output from the VM to 169.254.169.254:80, and the diff below shows the difference between the stuck state and the state after running the recompute; it looks like an issue on the OVN side.

$ diff outstuck outstuckrecompute
9c9
< 5. ls_in_pre_acl (northd.c:5752): ip, priority 100, uuid 6058060e
---
> 5. ls_in_pre_acl (northd.c:5752): ip, priority 100, uuid 675b84da
17c17
< 8. ls_in_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid 8d8b7541
---
> 8. ls_in_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid 767f8f38
33c33
< 3. ls_out_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid e64ef3fe
---
> 3. ls_out_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid f08f19d2
72c72
< /* No MAC binding. */
---
> /* MAC binding to 8a:47:31:cd:d8:4c. */
78,86c78
< 19. lr_in_arp_request (northd.c:11776): eth.dst == 00:00:00:00:00:00 && ip4, priority 100, uuid 74e95254
< arp { eth.dst = ff:ff:ff:ff:ff:ff; arp.spa = reg1; arp.tpa = reg0; arp.op = 1; output; };
<
< arp
< ---
< eth.dst = ff:ff:ff:ff:ff:ff;
< arp.spa = reg1;
< arp.tpa = reg0;
< arp.op = 1;
---
> 19. lr_in_arp_request (northd.c:11795): 1, priority 0, uuid d0b40981
93a86,97
> 1. lr_out_undnat (northd.c:12643): ip && ip4.src == 10.100.0.14 && outport == "lrp-156e62", priority 100, uuid 37f05d87
> eth.src = fa:16:3e:a8:45:a0;
> ct_dnat_in_czone;
>
> ct_dnatin_czone /* assuming no un-dnat entry, so no change */
> -------------------------------------------------------------
> 3. lr_out_snat (northd.c:12772): ip && ip4.src == 10.100.0.14 && outport == "lrp-156e62" && is_chassis_resident("7a087e"), priority 161, uuid 0151da09
> eth.src = fa:16:3e:a8:45:a0;
> ct_snat_in_czone(192.168.24.152);
>
> ct_snatin_czone(ip4.src=192.168.24.152)
> ---------------------------------------
102c106,109
< 6. ls_in_pre_lb (northd.c:5821): eth.mcast, priority 110, uuid ff2b94dd
---
> 6. ls_in_pre_lb (northd.c:5638): ip && inport == "156e62", priority 110, uuid 02bce23e
> next;
> 22. ls_in_l2_lkup (northd.c:7493): 1, priority 0, uuid f...
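
For reference, a sketch of how such a trace can be captured on the compute node. Only the source/destination IPs and the output file name come from this bug; the logical switch name (neutron-<net-uuid>), the port UUID, the MAC addresses and the southbound DB socket path are placeholders/assumptions:

```
# trace a TCP SYN from the VM port towards the metadata IP through the OVN
# logical pipeline; run once while stuck and once after the recompute
ovn-trace --db=unix:/var/run/ovn/ovnsb_db.sock neutron-<net-uuid> '
  inport == "<vm-port-uuid>" &&
  eth.src == <vm-mac> && eth.dst == <gw-mac> &&
  ip4.src == 10.100.0.14 && ip4.dst == 169.254.169.254 &&
  ip.ttl == 64 && tcp && tcp.dst == 80' > outstuck
```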


Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
yatin (yatinkarel) wrote :

OK, looking further, based on the failures it seems https://review.rdoproject.org/r/c/nfvinfo/+/40817 triggered the issue. Will check further and report a bug against OVN, as it looks related to incremental processing of flows.

Revision history for this message
yatin (yatinkarel) wrote :

<< Will check further and report a bug against OVN considering it's related to incremental processing of flows.
Reported https://bugzilla.redhat.com/show_bug.cgi?id=2076604

Also seen a single failure in other tempest tests (like test_update_instance_port_admin_state, test_mtu_sized_frames and test_network_basic_ops) [1][2] besides test_update_router_admin_state, but most of the failures were in test_update_router_admin_state.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-wallaby/be1c327/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-wallaby/bcc3e3d/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message
Douglas Viroel (dviroel) wrote :
Revision history for this message
yatin (yatinkarel) wrote :

The issue https://bugzilla.redhat.com/show_bug.cgi?id=2076604 is fixed in OVN.
The fix is available in ovn-2021 >= 21.12.0-68 and ovn22.06 >= 22.06.0-4, and the fixed versions are available in the CentOS 9 NFV SIG repos [1][2].

[1] https://review.rdoproject.org/r/c/nfvinfo/+/43667
[2] https://review.rdoproject.org/r/c/nfvinfo/+/44068
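
To check whether a node already carries the fixed build (package names as published in the CentOS 9 NFV SIG repos):

```
# expect ovn-2021 >= 21.12.0-68 or ovn22.06 >= 22.06.0-4
rpm -q ovn-2021 ovn22.06
```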

Proposed revert of the skipped tests https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/851342

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/851345

yatin (yatinkarel)
Changed in tripleo:
assignee: nobody → yatin (yatinkarel)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/851345
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/6c0410c74cee6e8809239f7a7877279723efec56
Submitter: "Zuul (22348)"
Branch: master

commit 6c0410c74cee6e8809239f7a7877279723efec56
Author: yatinkarel <email address hidden>
Date: Thu Jul 28 11:43:36 2022 +0530

    Set force_config_drive only when OVNMetadata is disabled

    It was already done in [1] but accidentally reverted in [2];
    this patch fixes it. It was detected while investigating [3].

    Since the known metadata issue is now fixed in OVN, we can
    keep the force config drive disabled when OVN metadata is
    enabled.

    [1] https://review.opendev.org/660689
    [2] https://review.opendev.org/791415
    [3] https://bugs.launchpad.net/tripleo/+bug/1968732

    Related-Bug: #1830179
    Related-Bug: #1968732
    Depends-On: https://review.opendev.org/851342
    Change-Id: I7781505ed3080a4485baa83f8170eb6c361382b4

Rabi Mishra (rabi)
Changed in tripleo:
status: Triaged → Fix Released