Unchecked MSR access error - overcloud deploy "timed out waiting for ping module test

Bug #1929745 reported by Harald Jensås
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Unassigned

Bug Description

https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

FATAL | Wait for connection to become available | 192.168.24.30 | error={"changed": false, "elapsed": 2402, "msg": "timed out waiting for ping module test success: Data could not be sent to remote host \"192.168.24.30\". Make sure this host can be reached over ssh: Warning: Permanently added '192.168.24.30' (ECDSA) to the list of known hosts.\r\nheat-admin@192.168.24.30: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n"}

One of the OVB baremetal nodes has a "unchecked MSR access error" when booting, and no network interfaces are discovered.

Possibly related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1808996

-- Console log for failing VM's from two separate runs --

  https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/baremetal_37_43957_2-console.log
  https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/bd84310/logs/baremetal_37_16163_0-console.log

Other VMs are OK, such as:

  https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/baremetal_37_43957_3-console.log

There is a call trace on the OVB node console:

  [[0;32m OK [[ 6.271140] unchecked MSR access error: RDMSR from 0xda0 at rIP: 0xffffffff8ac69e23 (native_read_msr+0x3/0x30)
  0m] Started Moni[ 6.272285] Call Trace:
  toring of LVM2 m[ 6.272733] kvm_arch_hardware_setup+0x5d/0x470 [kvm]
  irrors,���sing d[ 6.273323] ? __kmalloc_node+0x10b/0x290
  meventd or progr[ 6.273887] ? alloc_cpumask_var_node+0x1b/0x30
  ess polling.
  [ 6.274499] kvm_init+0x98/0x2b0 [kvm]
  [ 6.274988] ? svm_hardware_setup+0x546/0x546 [kvm_amd]
  [ 6.275554] do_one_initcall+0x46/0x1c3
  [ 6.275970] ? do_init_module+0x22/0x220
  [ 6.276395] ? kmem_cache_alloc_trace+0x131/0x270
  [ 6.276906] do_init_module+0x5a/0x220
  [ 6.277308] load_module+0x14c5/0x17f0
  [ 6.277738] ? __do_sys_init_module+0x13b/0x180
  [ 6.278221] __do_sys_init_module+0x13b/0x180
  [ 6.278691] do_syscall_64+0x5b/0x1a0
  [ 6.279099] entry_SYSCALL_64_after_hwframe+0x65/0xca
  [ 6.279625] RIP: 0033:0x7fbb1026080e
  [ 6.280043] Code: 48 8b 0d 7d 16 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 16 2c 00 f7 d8 64 89 01 48
  [ 6.282049] RSP: 002b:00007fff43748758 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
  [ 6.282863] RAX: ffffffffffffffda RBX: 00005584333bb890 RCX: 00007fbb1026080e
  [ 6.283619] RDX: 00007fbb10dce86d RSI: 00000000000395e8 RDI: 00005584334005c0
  [ 6.284407] RBP: 00007fbb10dce86d R08: 000055843331e01a R09: 0000000000000003
  [ 6.285172] R10: 000055843331e010 R11: 0000000000000246 R12: 00005584334005c0
  [ 6.285987] R13: 000055843337fbc0 R14: 0000000000020000 R15: 0000000000000000

description: updated
Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

We noticed this error in the integration promotion line of various branches, Marking as promotion-blocker as its blocking promotions.

Master:-

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/e04d671/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

~~~
2021-05-27 03:31:26 | 2021-05-27 03:31:26.426217 | fa163e35-a09e-960f-fa12-00000000003d | FATAL | Wait for connection to become available | 192.168.24.26 | error={"changed": false, "elapsed": 2401, "msg": "timed out waiting for ping module test success: Data could not be sent to remote host \"192.168.24.26\". Make sure this host can be reached over ssh: Warning: Permanently added '192.168.24.26' (ECDSA) to the list of known hosts.\r\nheat-admin@192.168.24.26: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n"}
~~~

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/e04d671/logs/baremetal_1-console.log
~~~
unchecked MSR access error: WRMSR to 0xda0 (tried to write 0x0000000000000000) at rIP: 0xffffffffac069f84 (native_write_msr+0x4/0x20)
~~~

Wallaby:-
https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/00e43b7/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Ussuri:-
https://logserver.rdoproject.org/openstack-periodic-integration-stable3/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-ussuri/4f53691/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Changed in tripleo:
importance: High → Critical
tags: added: promotion-blocker
Revision history for this message
Arx Cruz (arxcruz) wrote :

I notice the failing logs have this:

[ 0.000000] DMI: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014

While the passing logs have this:

[ 0.000000] DMI: OpenStack Foundation OpenStack Nova, BIOS 1.11.0-2.el7 04/01/2014

Revision history for this message
Arx Cruz (arxcruz) wrote :

More findings:

Controller 0:
https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/overcloud-controller-0/var/log/extra/cpuinfo.txt.gz

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 virt_ssbd arat npt nrip_save umip

It has the virt_ssbd flag, while otherr like Controller 1 doesn't:

https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/overcloud-controller-1/var/log/extra/cpuinfo.txt.gz

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save umip

Revision history for this message
sean mooney (sean-k-mooney) wrote :

there is a previous msr error in those log that seems to have been ignored

[ 0.000000] Linux version 4.18.0-305.el8.x86_64 (<email address hidden>) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Tue May 18 14:52:12 UTC 2021
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=UUID=04ec3b16-1886-4fab-9c53-6831b66fc57c ro console=ttyS0,115200n8 no_timer_check crashkernel=auto
[ 0.000000] unchecked MSR access error: WRMSR to 0xda0 (tried to write 0x0000000000000000) at rIP: 0xffffffff8ac69f84 (native_write_msr+0x4/0x20)
[ 0.000000] Call Trace:
[ 0.000000] ? fpu__init_cpu_xstate+0xb6/0xf0
[ 0.000000] ? fpu__init_system_xstate+0x140/0x597
[ 0.000000] ? 0xffffffff8ac00000
[ 0.000000] ? fpu__init_system+0x208/0x234
[ 0.000000] ? 0xffffffff8ac00000
[ 0.000000] ? early_cpu_init+0x332/0x34f
[ 0.000000] ? setup_arch+0xe0/0xde0
[ 0.000000] ? printk+0x58/0x6f
[ 0.000000] ? start_kernel+0x63/0x538
[ 0.000000] ? x86_family+0x5/0x20
[ 0.000000] ? load_ucode_bsp+0x76/0x12e

related to the fpu init

Revision history for this message
Alan Pevec (apevec) wrote :

FTR rdo.vexxhost.ca openstack cluster did not have updates for months and computes are centos7 (linked rhbz is for RHEL8 qemu-kvm).
Diff in BIOS is b/c 2 new computes added few months ago have nova-libvirt container image based on Ubuntu but hypervisor is still EL7 + qemu-kvm-ev from VirtSIG, so we should have seen those failures at that point.

@Arx can you confirm 100% of fails have 1.13.0-1ubuntu1.1 and when did they start happening?

Revision history for this message
Arx Cruz (arxcruz) wrote :

@Alan although the failures are 100% on the 1.13.0-1ubuntu1.1, I also saw right now a passing one on the same ubuntu bios, so we can discard it as the root cause. However the MSR error only appears on the ubuntu bios.
I'm doing more investigation.

Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/794208

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Hello,

The node on which overcloud deployment failed because it's not reachable. I was able to access that after reboot. I notice below traceback in cloud-int log..
~~~
2021-06-02 10:31:45,168 - util.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 850, in _run_modules
    freq=freq)
  File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py", line 107, in handle
    (key_fn, key_entries) = ssh_util.extract_authorized_keys(user_name)
  File "/usr/lib/python3.6/site-packages/cloudinit/ssh_util.py", line 253, in extract_authorized_keys
    (ssh_dir, pw_ent) = users_ssh_info(username)
  File "/usr/lib/python3.6/site-packages/cloudinit/ssh_util.py", line 226, in users_ssh_info
    pw_ent = pwd.getpwnam(username)
KeyError: "getpwnam(): name not found: 'centos'"
~~~

This traceback is not seen in other overcloud nodes which worked earlier during deployment.

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Hello,

Issue reproduced in test patch even which using older cloud-int:-

 https://logserver.rdoproject.org/61/33961/2/check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-1/6334684/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
 ~~~
 2021-06-03 07:30:42 | 2021-06-03 07:30:42.201383 | fa163e84-a8f3-8bc3-3d73-00000000003d | FATAL | Wait for connection to become available | 192.168.24.21 | error={"changed": false, "elapsed": 2418, "msg": "timed out waiting for ping module test success: Data could not be sent to remote host \"192.168.24.21\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.21 port 22: No route to host\r\n"}
~~~

https://logserver.rdoproject.org/61/33961/2/check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-1/6334684/logs/overcloud-controller-1/var/log/extra/package-list-installed.txt.gz
~~~
cloud-init.noarch 20.3-10.el8 @quickstart-centos-appstreams
~~~

temporary to test old cloud-int we used below patches as depends-On:

https://review.opendev.org/c/openstack/diskimage-builder/+/794395 and https://review.opendev.org/c/openstack/tripleo-quickstart/+/794225

Revision history for this message
Juan Badia Payno (jbadiapa) wrote :

As a workaround after executing the provisioning of the nodes, the nodes can be checked that there are pingable and loggable (ssh).

If they are not they can be restarted by:
 - openstack baremetal node power off NODE_NAME
 - openstack baremetal node power on NODE_NAME

Revision history for this message
Juan Badia Payno (jbadiapa) wrote :
Download full text (3.3 KiB)

I had this problem with two different outcomes:

The one described on comment#9 and the following outcome:
2021-06-04 21:40:29,701 - util.py[DEBUG]: Failed to set the hostname to localhost (localhost)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 84, in handle
    cloud.distro.set_hostname(hostname, fqdn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/__init__.py", line 118, in set_hostname
    self._write_hostname(writeable_hostname, self.hostname_conf_fn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/rhel.py", line 87, in _write_hostname
    subp.subp(['hostnamectl', 'set-hostname', str(hostname)])
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 293, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory
2021-06-04 21:40:29,707 - handlers.py[DEBUG]: finish: init-network/config-set_hostname: FAIL: running config-set_hostname with frequency once-per-instance
2021-06-04 21:40:29,708 - util.py[WARNING]: Running module set_hostname (<module 'cloudinit.config.cc_set_hostname' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py'>) failed
2021-06-04 21:40:29,708 - util.py[DEBUG]: Running module set_hostname (<module 'cloudinit.config.cc_set_hostname' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 84, in handle
    cloud.distro.set_hostname(hostname, fqdn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/__init__.py", line 118, in set_hostname
    self._write_hostname(writeable_hostname, self.hostname_conf_fn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/rhel.py", line 87, in _write_hostname
    subp.subp(['hostnamectl', 'set-hostname', str(hostname)])
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 293, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 850, in _run_modules
    freq=freq)
  File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 88, in handle
    raise SetHostnameError("%s: %s" % (msg, e)) from e
cloudinit.config.cc_set_hostname.SetHostnameError: Failed to set the hostname to localhost (localhost): Unexpected error while running command.
Command: ['hostnamectl...

Read more...

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

@yatin and me were debugging this, We got a local reproducer, from testing looks like new hostname-3.20-7.el8.x86_64 is our culprit.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1959720

From train: Difference between rpm working(24th) and affected(27th)

http://pastebin.test.redhat.com/969401

hostname were one of the updated packages
~~~
hostname-3.20-6.el8.x86_64 | hostname-3.20-7.el8.x86_64
~~~

Fix is available: https://git.centos.org/rpms/hostname/c/e097d2aac3e76eebbaac3ee4c2b95f575f3798fa?branch=c8s

from local testing looks like .. hostname downgrade solve the issue...

Trying with https://review.opendev.org/c/openstack/tripleo-quickstart/+/794636 to confirm in CI. Awaiting results

Upstream C8 stream mirror have already moved back from hostname-3.20-7.el8.x86_64 to hostname-3.20-6.el8.x86_64, waiting on mirror sync.

Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by "Sandeep Yadav <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/794636
Reason: This is not needed anymore.. infra rolled out previous version of hostname package and removed the new affected version.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master)

Change abandoned by "Arx Cruz <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/794208

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.