test_update_router_admin_state test failed with Unable to connect to port 22

Bug #1968732 reported by chandan kumar
Affects: tripleo | Status: Fix Released | Importance: Critical | Assigned to: yatin

Bug Description

https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-containers-multinode-wallaby&skip=0
and the corresponding RHEL job in the RHOS-17 RHEL-9 integration pipeline (periodic-tripleo-ci-rhel-9-containers-multinode-rhos-17) are failing with the same error in tempest, see [1] and [2]:
```
0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state [355.391581s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 131, in _get_ssh_connection
        ssh.connect(self.host, port=self.port, username=self.username,
      File "/usr/lib/python3.9/site-packages/paramiko/client.py", line 368, in connect
        raise NoValidConnectionsError(errors)
    paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 192.168.24.165

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
        return f(*func_args, **func_kwargs)
      File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 557, in test_update_router_admin_state
        self._check_public_network_connectivity(
      File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 212, in _check_public_network_connectivity
        self.check_vm_connectivity(
      File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 952, in check_vm_connectivity
        self.get_remote_client(ip_address, username, private_key,
      File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 723, in get_remote_client
        linux_client.validate_authentication()
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
        return function(self, *args, **kwargs)
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 117, in validate_authentication
        self.ssh_client.test_connection_auth()
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 240, in test_connection_auth
        connection = self._get_ssh_connection()
      File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 150, in _get_ssh_connection
        raise exceptions.SSHTimeout(host=self.host,
    tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.165 via SSH timed out.
    User: cirros, Password: None

```

By comparing a passing job (https://6696badaf160a21814ac-f8803ca7587e43f6b66c9edae98e760b.ssl.cf1.rackcdn.com/837455/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/cd5e873/logs/subnode-1/var/log/extra/errors.txt)
with a failing job (https://057b70f24ff595a5f3b8-448e63f3241620894d174670888496e6.ssl.cf5.rackcdn.com/837512/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/a2518f7/logs/subnode-1/var/log/extra/errors.txt), the failing job shows:

```
2022-04-12 12:28:43.660 ERROR /var/log/containers/nova/nova-api.log: 17 ERROR oslo.messaging._drivers.impl_rabbit [-] [f125b2b9-d1f6-4535-b795-a3e41539b09e] AMQP server on centos-9-stream-rax-iad-0029292021.ctlplane.localdomain:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
```

I am not sure what is causing the above error. Moving this test to the skiplist until we investigate.

Logs:
[1]. https://057b70f24ff595a5f3b8-448e63f3241620894d174670888496e6.ssl.cf5.rackcdn.com/837512/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/a2518f7/logs/undercloud/var/log/tempest/tempest_run.log

[2]. https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f65/837516/1/check/tripleo-ci-centos-9-containers-multinode-wallaby/f65ef99/logs/undercloud/var/log/tempest/tempest_run.log

[3]. https://2a0929bb28433d9a9dd5-585b06f1274d5f0e5ed27601745f1d4b.ssl.cf2.rackcdn.com/837055/1/gate/tripleo-ci-centos-9-containers-multinode/e946bbd/logs/undercloud/var/log/tempest/stestr_results.html

Revision history for this message
Jakob Meng (jm1337) wrote :

The CirrOS instance created by this tempest test cannot get its cloud-init metadata, although the network is up (the VM got a DHCP lease for 10.100.0.9):

```
Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sending select for 10.100.0.9
udhcpc: lease of 10.100.0.9 obtained, lease time 43200
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 15.28. request failed
failed 2/20: up 64.50. request failed
failed 3/20: up 113.69. request failed
failed 4/20: up 162.87. request failed
failed 5/20: up 212.06. request failed
failed 6/20: up 261.25. request failed
```

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_323/836988/3/check/tripleo-ci-centos-9-containers-multinode-wallaby/323752e/logs/undercloud/var/log/tempest/stestr_results.html
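
For reference, a minimal way to re-run the same check by hand from the guest console (assuming console access to the CirrOS instance; busybox wget is used since the guest userland is minimal, and the gateway IP is the one from the log above):

```
# the exact URL the cirros init scripts poll 20 times during boot
wget -qO- http://169.254.169.254/2009-04-04/instance-id

# sanity-check the path the request takes: routing table and default gateway
ip route
ping -c 3 10.100.0.1
```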

Revision history for this message
yatin (yatinkarel) wrote :

Had a quick look and it looks like the issue is random, as there are both passing and failing runs across multiple builds [1][2]. It is OK to add the test to the skiplist temporarily until the issue is root-caused.

Also didn't find any similar failure on master, so it looks Wallaby-specific.

The issue happens because the metadata requests from the VM are failing [3] as below, and as per the metadata agent log [4], no request was received for the server 2a0389f3-c36c-4ec1-ac80-81cfbb380d4f (IP 10.100.0.6).

```
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 12.55. request failed
failed 2/20: up 61.73. request failed
failed 3/20: up 110.92. request failed
failed 4/20: up 160.16. request failed
failed 5/20: up 209.38. request failed
failed 6/20: up 258.58. request failed
```

[1] https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-containers-multinode-wallaby&skip=0
[2] https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-containers-multinode&branch=stable%2Fwallaby&skip=0
[3] https://88e7be3fd9763a3635e4-f525513b10cd88ade5723da63b385b8c.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/tripleo-ci/master/tripleo-ci-centos-9-containers-multinode-wallaby/a881ed1/logs/undercloud/var/log/tempest/stestr_results.html
[4] https://88e7be3fd9763a3635e4-f525513b10cd88ade5723da63b385b8c.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/tripleo-ci/master/tripleo-ci-centos-9-containers-multinode-wallaby/a881ed1/logs/subnode-1/var/log/containers/neutron/ovn-metadata-agent.log
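
A sketch of how one can check on the compute node whether such a request ever reaches the OVN metadata service; the ovnmeta-<network-uuid> namespace naming follows the ML2/OVN convention, <net-uuid> is a placeholder, and the fixed IP is the one from this failure:

```
# list the OVN metadata namespaces created by ovn-metadata-agent
sudo ip netns list | grep ovnmeta

# inside the namespace, haproxy should be listening on 169.254.169.254:80
sudo ip netns exec ovnmeta-<net-uuid> ss -tlnp

# check whether the VM's fixed IP is reachable from the namespace (the
# return path the metadata reply would take)
sudo ip netns exec ovnmeta-<net-uuid> ping -c 3 10.100.0.6
```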

Revision history for this message
Jakob Meng (jm1337) wrote :

Jobs based on the OpenStack master branch(es), e.g. periodic-tripleo-ci-centos-9-containers-multinode-master, are not affected because OpenStack master (unsure which component exactly) behaves differently from OpenStack Wallaby:

On master, a (cloud-init) config_drive is attached to the compute instance, and the instance then reads its instance id and other cloud-init data from it. On Wallaby, no config_drive is attached to the compute instance, so the instance always tries to fetch the cloud-init metadata over the network, which randomly fails.

One can observe the different behavior for master and wallaby by looking at tempest's debug output; it shows that tempest does not ask for a config_drive. For example, in a failing wallaby job such as [1] you will see '"config_drive": ""' in the response to 'GET http://192.168.24.8:8774/v2.1/servers/1e409626-7d97-40da-b480-c61c2cb214cc'. In a successful master job such as [2], look for '"config_drive": "True"' in the response to 'GET http://192.168.24.19:8774/v2.1/servers/12de3ce6-aaf9-4fc8-ad30-32086712c3bb'.
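
Outside of tempest, the same field can be checked with the openstack CLI while the server still exists (the UUID below is just the one from the failing wallaby run, for illustration):

```
# prints an empty value on wallaby (no config drive), "True" on master
openstack server show 1e409626-7d97-40da-b480-c61c2cb214cc -c config_drive -f value
```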

The code for the failing metadata query (see 'failed 1/20: up 12.55. request failed' in previous comments) is in [3] and is called by [4], but only on wallaby. Unfortunately, cirros does not output anything when running these scripts via 'S45-cirros-net-ds' and 'S46-cirros-apply-net' [5].

So it looks as if tempest has not changed its behavior from wallaby to master. Maybe this config_drive enablement comes from a change in TripleO or in Nova?

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-wallaby/be1c327/logs/undercloud/var/log/tempest/testrepository.subunit.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-master/f076d7e/logs/undercloud/var/log/tempest/testrepository.subunit.gz
[3] https://github.com/cirros-dev/cirros/blob/master/src/lib/cirros/ds/ec2
[4] https://github.com/cirros-dev/cirros/blob/master/src/sbin/cirros-ds
[5] https://github.com/cirros-dev/cirros/tree/master/src/etc/rc3.d

Revision history for this message
yatin (yatinkarel) wrote :

<< So it looks as if tempest has not changed its behavior from wallaby to master. Maybe this config_drive enablement comes from a change in TripleO or in Nova?

OK, I found it's TripleO that changed the behavior, and that was by mistake; it was switched with [1].
Also found that in another patch [2] DVR was switched off by mistake. I tried switching back to the Wallaby behavior in [3] and the issue is reproducible [4][5], just like on wallaby. We should fix DVR and force_config_drive on master, but only after the root cause is clear and the issue is fixed.
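
A quick way to confirm on an overcloud compute node how TripleO rendered this into nova's config (the container config path is a TripleO assumption):

```
# force_config_drive=True means nova attaches a config drive to every
# instance, which bypasses the OVN metadata path exercised by this test
sudo grep -n force_config_drive \
  /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
```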

I also tried to debug on an environment where the issue reproduced; there I ran the same test multiple times and it sometimes passes and sometimes fails.

Also, when it reproduces, I noticed the following (see the command sketch after this list):
1) ping to the FIP works fine from the undercloud
2) ping to the private IP fails from the OVN metadata namespace
3) the same happens from inside the VM to the OVN metadata namespace interface IP (10.100.0.2)
4) after running a recompute (ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute) everything works fine again
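
Roughly, checks 1), 2) and 4) as commands; the FIP and private IP are the ones from this reproducer, while ovnmeta-<net-uuid> follows the ML2/OVN namespace convention and <net-uuid> is a placeholder:

```
# 1) FIP reachability from the undercloud - keeps working even while stuck
ping -c 3 192.168.24.152

# 2) private-IP reachability from the OVN metadata namespace - fails when stuck
sudo ip netns exec ovnmeta-<net-uuid> ping -c 3 10.100.0.14

# 4) force a full recompute in ovn-controller; connectivity recovers afterwards
sudo ovs-appctl -t /var/run/ovn/ovn-controller.2.ctl recompute
```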

I tried to capture the ovn-trace output from the VM to 169.254.169.254:80, and the diff below shows the difference between the stuck state and the state after running the recompute; it looks like an issue on the OVN side.

$ diff outstuck outstuckrecompute
9c9
< 5. ls_in_pre_acl (northd.c:5752): ip, priority 100, uuid 6058060e
---
> 5. ls_in_pre_acl (northd.c:5752): ip, priority 100, uuid 675b84da
17c17
< 8. ls_in_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid 8d8b7541
---
> 8. ls_in_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid 767f8f38
33c33
< 3. ls_out_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid e64ef3fe
---
> 3. ls_out_acl_hint (northd.c:6015): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0, priority 4, uuid f08f19d2
72c72
< /* No MAC binding. */
---
> /* MAC binding to 8a:47:31:cd:d8:4c. */
78,86c78
< 19. lr_in_arp_request (northd.c:11776): eth.dst == 00:00:00:00:00:00 && ip4, priority 100, uuid 74e95254
< arp { eth.dst = ff:ff:ff:ff:ff:ff; arp.spa = reg1; arp.tpa = reg0; arp.op = 1; output; };
<
< arp
< ---
< eth.dst = ff:ff:ff:ff:ff:ff;
< arp.spa = reg1;
< arp.tpa = reg0;
< arp.op = 1;
---
> 19. lr_in_arp_request (northd.c:11795): 1, priority 0, uuid d0b40981
93a86,97
> 1. lr_out_undnat (northd.c:12643): ip && ip4.src == 10.100.0.14 && outport == "lrp-156e62", priority 100, uuid 37f05d87
> eth.src = fa:16:3e:a8:45:a0;
> ct_dnat_in_czone;
>
> ct_dnatin_czone /* assuming no un-dnat entry, so no change */
> -------------------------------------------------------------
> 3. lr_out_snat (northd.c:12772): ip && ip4.src == 10.100.0.14 && outport == "lrp-156e62" && is_chassis_resident("7a087e"), priority 161, uuid 0151da09
> eth.src = fa:16:3e:a8:45:a0;
> ct_snat_in_czone(192.168.24.152);
>
> ct_snatin_czone(ip4.src=192.168.24.152)
> ---------------------------------------
102c106,109
< 6. ls_in_pre_lb (northd.c:5821): eth.mcast, priority 110, uuid ff2b94dd
---
> 6. ls_in_pre_lb (northd.c:5638): ip && inport == "156e62", priority 110, uuid 02bce23e
> next;
> 22. ls_in_l2_lkup (northd.c:7493): 1, priority 0, uuid f...
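
For reference, a sketch of how such a trace can be captured on the compute node. Only the source/destination IPs and the output file name come from this bug; the logical switch name (neutron-<net-uuid>), the port UUID, the MAC addresses and the southbound DB socket path are placeholders/assumptions:

```
# trace a TCP SYN from the VM port towards the metadata IP through the OVN
# logical pipeline; run once while stuck and once after the recompute
ovn-trace --db=unix:/var/run/ovn/ovnsb_db.sock neutron-<net-uuid> '
  inport == "<vm-port-uuid>" &&
  eth.src == <vm-mac> && eth.dst == <gw-mac> &&
  ip4.src == 10.100.0.14 && ip4.dst == 169.254.169.254 &&
  ip.ttl == 64 && tcp && tcp.dst == 80' > outstuck
```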


Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
yatin (yatinkarel) wrote :

OK, looking further, based on the failures it seems https://review.rdoproject.org/r/c/nfvinfo/+/40817 triggered the issue. Will check further and report a bug against OVN, as it looks related to incremental processing of flows.

Revision history for this message
yatin (yatinkarel) wrote :

<< Will check further and report a bug against OVN considering it's related to incremental processing of flows.
Reported https://bugzilla.redhat.com/show_bug.cgi?id=2076604

Also seen a single failure in other tempest tests (like test_update_instance_port_admin_state, test_mtu_sized_frames and test_network_basic_ops) [1][2] besides test_update_router_admin_state, but most of the failures were in test_update_router_admin_state.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-wallaby/be1c327/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-containers-multinode-wallaby/bcc3e3d/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message
Douglas Viroel (dviroel) wrote :
Revision history for this message
yatin (yatinkarel) wrote :

The issue https://bugzilla.redhat.com/show_bug.cgi?id=2076604 is fixed in OVN.
The fix is available in ovn-2021 >= 21.12.0-68 and ovn22.06 >= 22.06.0-4, and the fixed versions are available in the CentOS 9 NFV SIG repos [1][2].

[1] https://review.rdoproject.org/r/c/nfvinfo/+/43667
[2] https://review.rdoproject.org/r/c/nfvinfo/+/44068
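
To check whether a node already carries the fixed build (package names as published in the CentOS 9 NFV SIG repos):

```
# expect ovn-2021 >= 21.12.0-68 or ovn22.06 >= 22.06.0-4
rpm -q ovn-2021 ovn22.06
```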

Proposed revert of the skipped tests https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/851342

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/851345

yatin (yatinkarel)
Changed in tripleo:
assignee: nobody → yatin (yatinkarel)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/851345
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/6c0410c74cee6e8809239f7a7877279723efec56
Submitter: "Zuul (22348)"
Branch: master

commit 6c0410c74cee6e8809239f7a7877279723efec56
Author: yatinkarel <email address hidden>
Date: Thu Jul 28 11:43:36 2022 +0530

    Set force_config_drive only when OVNMetadata is disabled

    It was already done in [1] but accidentally reverted in [2];
    this patch fixes it. It was detected while investigating [3].

    Since the known metadata issue is now fixed in OVN, we can
    keep the force config drive disabled when OVN metadata is
    enabled.

    [1] https://review.opendev.org/660689
    [2] https://review.opendev.org/791415
    [3] https://bugs.launchpad.net/tripleo/+bug/1968732

    Related-Bug: #1830179
    Related-Bug: #1968732
    Depends-On: https://review.opendev.org/851342
    Change-Id: I7781505ed3080a4485baa83f8170eb6c361382b4

Rabi Mishra (rabi)
Changed in tripleo:
status: Triaged → Fix Released