Octavia scenario jobs fail with timeouts in tempest

Bug #2023559 reported by Gregory Thiemonge
Affects   Status        Importance  Assigned to  Milestone
devstack  Invalid       Undecided   Unassigned   -
octavia   Fix Released  Undecided   Unassigned   -

Bug Description

Many jobs are failing (ubuntu, centos-9-stream, fips)

In all cases, tempest fails because it cannot reach the Cirros VMs that are created for the tests:

{0} setUpClass (octavia_tempest_plugin.tests.barbican_scenario.v2.test_tls_barbican.TLSWithBarbicanTest) [0.000000s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 136, in _get_ssh_connection
    ssh.connect(self.host, port=self.port, username=self.username,

      File "/opt/stack/tempest/.tox/tempest/lib/python3.9/site-packages/paramiko/client.py", line 365, in connect
    sock.connect(addr)

    socket.timeout: timed out

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):

      File "/opt/stack/tempest/tempest/test.py", line 206, in setUpClass
    raise value.with_traceback(trace)

      File "/opt/stack/tempest/tempest/test.py", line 199, in setUpClass
    cls.resource_setup()

      File "/opt/stack/tempest/.tox/tempest/lib/python3.9/site-packages/octavia_tempest_plugin/tests/barbican_scenario/v2/test_tls_barbican.py", line 122, in resource_setup
    super(TLSWithBarbicanTest, cls).resource_setup()

      File "/opt/stack/tempest/.tox/tempest/lib/python3.9/site-packages/octavia_tempest_plugin/tests/test_base.py", line 830, in resource_setup
    cls._enable_ipv6_nic_webserver(

      File "/opt/stack/tempest/.tox/tempest/lib/python3.9/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1131, in _enable_ipv6_nic_webserver
    raise e

      File "/opt/stack/tempest/.tox/tempest/lib/python3.9/site-packages/octavia_tempest_plugin/tests/test_base.py", line 1124, in _enable_ipv6_nic_webserver
    linux_client.validate_authentication()

      File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
    return function(self, *args, **kwargs)

      File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 123, in validate_authentication
    self.ssh_client.test_connection_auth()

      File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 245, in test_connection_auth
    connection = self._get_ssh_connection()

      File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 155, in _get_ssh_connection
    raise exceptions.SSHTimeout(host=self.host,

    tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.84 via SSH timed out.
User: cirros, Password: None
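
For reference, the failing call boils down to a plain paramiko connection attempt from the tempest node to the VM's floating IP. A minimal standalone sketch of that probe (host and username are taken from the traceback above; the key handling and the 10-second timeout are assumptions for illustration, not tempest's configured values):

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect('172.24.5.84', port=22, username='cirros',
                       timeout=10, allow_agent=False, look_for_keys=False)
        print('SSH reachable')
    except Exception as exc:
        # A crashed or unbootable guest never answers, so this ends in a
        # socket.timeout, which tempest wraps in SSHTimeout.
        print('SSH connection failed: %s' % exc)
    finally:
        client.close()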

A testing commit adds a hack to retrieve the guest console logs when the SSH connection fails (https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/877667)
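
A hedged sketch of what such a hack can look like: catch the SSH timeout and print nova's console output for the web-server VM. This is illustrative only (the helper name and its arguments are made up here), not necessarily how change 877667 implements it:

    from tempest.lib import exceptions

    def dump_console_on_ssh_timeout(servers_client, server_id, linux_client):
        """Validate SSH auth; on timeout, print the guest console log."""
        try:
            linux_client.validate_authentication()
        except exceptions.SSHTimeout:
            # os-getConsoleOutput returns the guest boot/kernel log
            print(servers_client.get_console_output(server_id)['output'])
            raise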

In https://zuul.opendev.org/t/openstack/build/b094695bd5654eb5b32185dc325c9bbf
the hack captured the following console log:

[ 8.303851] unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffff94496104 (native_write_msr+0x4/0x30)
[ 8.304255] Call Trace:
[ 8.304255] <TASK>
[ 8.304255] ? write_spec_ctrl_current+0x45/0x50
[ 8.304255] speculation_ctrl_update+0x8f/0x200
[ 8.304255] speculation_ctrl_update_current+0x1f/0x30
[ 8.304255] ssb_prctl_set+0x9a/0xf0
[ 8.304255] arch_seccomp_spec_mitigate+0x66/0x70
[ 8.304255] seccomp_set_mode_filter+0x4e2/0x530
[ 8.304255] do_seccomp+0x37/0x200
[ 8.304255] prctl_set_seccomp+0x30/0x50
[ 8.304255] __do_sys_prctl+0x2f7/0x6f0
[ 8.304255] ? exit_to_user_mode_prepare+0x37/0xb0
[ 8.304255] __x64_sys_prctl+0x21/0x30
[ 8.304255] do_syscall_64+0x5c/0xc0
[ 8.304255] ? handle_mm_fault+0xd8/0x2c0
[ 8.304255] ? do_user_addr_fault+0x1e7/0x670
[ 8.304255] ? exit_to_user_mode_prepare+0x37/0xb0
[ 8.304255] ? irqentry_exit_to_user_mode+0x9/0x20
[ 8.304255] ? irqentry_exit+0x1d/0x30
[ 8.304255] ? exc_page_fault+0x89/0x170
[ 8.304255] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 8.304255] RIP: 0033:0x7f52c4e33d39
[ 8.304255] Code: 00 f7 d8 64 89 02 48 83 c8 ff 44 89 cf 89 04 24 e8 ac d2 02 00 8b 04 24 48 83 c4 38 c3 49 89 ca 48 63 ff b8 9d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 76 10 48 8b 15 38 11 06 00 f7 d8 64 89 02 48 83
[ 8.304255] RSP: 002b:00007ffc8420f128 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
[ 8.304255] RAX: ffffffffffffffda RBX: 00007ffc8420f1a8 RCX: 00007f52c4e33d39
[ 8.304255] RDX: 00000000004444c0 RSI: 0000000000000002 RDI: 0000000000000016
[ 8.304255] RBP: 044e043f801a5409 R08: 0000000000000000 R09: 0000000000000000
[ 8.304255] R10: 00007f52c4e33d39 R11: 0000000000000246 R12: 0000000000000000
[ 8.304255] R13: 00007ffc8420f698 R14: 0000000000000001 R15: 00007f52c4eaf690
[ 8.304255] </TASK>
[ 8.562590] Call Trace:
[ 8.566085] <TASK>
[ 8.566085] ? write_spec_ctrl_current+0x45/0x50
[ 8.566085] __switch_to_xtra+0x110/0x4e0
[ 8.566085] __switch_to+0x260/0x450
[ 8.566085] __schedule+0x23d/0x5a0
[ 8.566085] schedule+0x4e/0xc0
[ 8.566085] worker_thread+0xd4/0x420
[ 8.566085] ? process_one_work+0x3d0/0x3d0
[ 8.566085] kthread+0x12a/0x150
[ 8.566085] ? set_kthread_struct+0x50/0x50
[ 8.566085] ret_from_fork+0x22/0x30
[ 8.566085] </TASK>
[ 8.678551] Call Trace:
[ 8.680039] <TASK>
[ 8.680039] ? write_spec_ctrl_current+0x45/0x50
[ 8.680039] speculation_ctrl_update+0x8f/0x200
[ 8.680039] speculation_ctrl_update_current+0x1f/0x30
[ 8.680039] ssb_prctl_set+0x9a/0xf0
[ 8.680039] arch_seccomp_spec_mitigate+0x66/0x70
[ 8.680039] seccomp_set_mode_filter+0x4e2/0x530
[ 8.680039] do_seccomp+0x37/0x200
[ 8.680039] prctl_set_seccomp+0x30/0x50
[ 8.680039] __do_sys_prctl+0x2f7/0x6f0
[ 8.680039] ? security_task_setrlimit+0x37/0x60
[ 8.680039] __x64_sys_prctl+0x21/0x30
[ 8.680039] do_syscall_64+0x5c/0xc0
[ 8.680039] ? security_task_prctl+0x4e/0x90
[ 8.680039] ? __do_sys_prctl+0x40/0x6f0
[ 8.680039] ? do_syscall_64+0x69/0xc0
[ 8.680039] ? commit_creds+0x243/0x2d0
[ 8.680039] ? exit_to_user_mode_prepare+0x37/0xb0
[ 8.680039] ? syscall_exit_to_user_mode+0x27/0x50
[ 8.680039] ? __x64_sys_prctl+0x21/0x30
[ 8.680039] ? do_syscall_64+0x69/0xc0
[ 8.680039] ? __do_sys_getpid+0x1e/0x30
[ 8.680039] ? do_syscall_64+0x69/0xc0
[ 8.680039] ? do_syscall_64+0x69/0xc0
[ 8.680039] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 8.680039] RIP: 0033:0x7f52c4e33d39
[ 8.680039] Code: 00 f7 d8 64 89 02 48 83 c8 ff 44 89 cf 89 04 24 e8 ac d2 02 00 8b 04 24 48 83 c4 38 c3 49 89 ca 48 63 ff b8 9d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 76 10 48 8b 15 38 11 06 00 f7 d8 64 89 02 48 83
[ 8.680039] RSP: 002b:00007ffc8420f118 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
[ 8.680039] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Also reproduced on Ubuntu Jammy:

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3c8/877667/7/check/octavia-v2-dsvm-scenario-traffic-ops/3c88434/job-output.txt

2023-06-12 14:59:48.398329 | # Job Information
2023-06-12 14:59:48.398653 | Ansible Version: 2.13.10
2023-06-12 14:59:48.398770 | Job: octavia-v2-dsvm-scenario-traffic-ops
2023-06-12 14:59:48.398874 | Pipeline: check
2023-06-12 14:59:48.398939 | Executor: ze07.opendev.org
2023-06-12 14:59:48.399036 | Triggered by: https://review.opendev.org/877667
2023-06-12 14:59:48.399123 | Event ID: 38f47feacfbc4c628049db7157dffe5a

2023-06-12 14:59:48.570623 | localhost | ok:
2023-06-12 14:59:48.570955 | localhost | # Node Information
2023-06-12 14:59:48.571034 | localhost | Inventory Hostname: controller
2023-06-12 14:59:48.571093 | localhost | Hostname: np0034285630
2023-06-12 14:59:48.571180 | localhost | Username: zuul
2023-06-12 14:59:48.571250 | localhost | Distro: Ubuntu 22.04
2023-06-12 14:59:48.571344 | localhost | Provider: ovh-bhs1
2023-06-12 14:59:48.571404 | localhost | Label: nested-virt-ubuntu-jammy
2023-06-12 14:59:48.571458 | localhost | Interface IP: 158.69.69.225

summary: - centos jobs fail with timeouts in tempest
+ scenario jobs fail with timeouts in tempest
description: updated
Revision history for this message
Gregory Thiemonge (gthiemonge) wrote : Re: scenario jobs fail with timeouts in tempest

It seems that the issue started when devstack moved to cirros 0.6.1 with commit 881437: Revert "Revert "Bump cirros version to 0.6.1"" | https://review.opendev.org/c/openstack/devstack/+/881437

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Added the devstack project. Quick summary: since cirros was bumped to 0.6.1 in devstack, we have been seeing VM crashes on OVH nodes in the CI (with nested virt).

I'm currently testing cirros 0.6.2

summary: - scenario jobs fail with timeouts in tempest
+ Octavia scenario jobs fail with timeouts in tempest
Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

We have had issues with cirros 0.6.1 on the following OVH nodes:
099ba2e09fdc07b1847717a9b605112a16b4c7d4bfd924c9204ac49d
4d0350f888b309b8d2e628a59a85c1905e6dccd534e51049e6a6e0f1
560fe06e257c2ed81871df3104745dc5fc659e07c014dd31ab3a8280
59c448d249b292e509024e00e4e5315a4444dbdf490a12ebfa08c39f
66d00b79b256c1947d6bf74aa6d9bd31fc890c6738ee0ed488eef181
a6224c46f83f86c84653df2192d459c42eecb5dcdcc4965ac30d712b
ad125c505c91c68a706349c615ed32c6fcafcba94ce88dcf4c5daf7b
d20529c2ad1a7b430915b3c684af9be02801b6d085a08a0c2c116f65
df3d8152cdd0cb80521ac739fdbf99c52076bd41d33eecd71d76b87a
eabba74e8469f0162dd549af83cd8cfcdb1b770b9babf6c2e8041a13
f4d8e912dd3fd42e525c9d1f783bb8a07b585c890bf54d12a806e868

When using cirros 0.6.2, the tests pass on the following nodes (no failures so far):
099ba2e09fdc07b1847717a9b605112a16b4c7d4bfd924c9204ac49d
3628215d011fcba31a52b321cda4df0b2dc1cc351281446abce79714
420efd617526638d396ed0a71c391db260c7ce2f54808e8f8e506ca7
560fe06e257c2ed81871df3104745dc5fc659e07c014dd31ab3a8280
73966f2818137418add0485e4e59a0b5ad42734cd696393114dd9d77
af9aade2265164308bd1d2c9f09fe87cb5a2c757cd79509c260a8625
d20529c2ad1a7b430915b3c684af9be02801b6d085a08a0c2c116f65
eabba74e8469f0162dd549af83cd8cfcdb1b770b9babf6c2e8041a13

Four of the nodes that were failing with 0.6.1 are now passing with 0.6.2.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia-tempest-plugin (master)
Changed in octavia:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/devstack/+/886162

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

I don't think this is a devstack issue; you are free to override the images you use according to your needs.
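
For anyone who wants to pin a different guest image in their own deployment, a minimal local.conf override along those lines (a sketch assuming the CIRROS_VERSION/CIRROS_ARCH knobs from devstack's stackrc; adjust to your environment):

    [[local|localrc]]
    # Pin the cirros guest image instead of relying on devstack's default
    CIRROS_VERSION=0.6.2
    CIRROS_ARCH=x86_64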

Changed in devstack:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.opendev.org/c/openstack/devstack/+/886162
Committed: https://opendev.org/openstack/devstack/commit/39228451b6542ff63f288affbda13897089eb16d
Submitter: "Zuul (22348)"
Branch: master

commit 39228451b6542ff63f288affbda13897089eb16d
Author: Dr. Jens Harbott <email address hidden>
Date: Thu Jun 15 10:46:51 2023 +0200

    Bump default cirros version to 0.6.2

    Cirros has made a new release, including a newer kernel that should fix
    some issues when using nested virtualization.

    Related-Bug: 2023559
    Change-Id: I63469371b13801094a3ee1baae6e343999fbefa5

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Fixed in devstack

Changed in octavia:
status: In Progress → Fix Released