Bug #1929745 “Unchecked MSR access error - overcloud deploy “tim...” : Bugs : tripleo

Harald Jensås (harald-jensas) on 2021-05-26

description:

updated

Revision history for this message

Sandeep Yadav (sandeepyadav93) wrote on 2021-05-27:

#1

We noticed this error in the integration promotion line of various branches, Marking as promotion-blocker as its blocking promotions.

Master:-

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/e04d671/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

~~~
2021-05-27 03:31:26 | 2021-05-27 03:31:26.426217 | fa163e35-a09e-960f-fa12-00000000003d | FATAL | Wait for connection to become available | 192.168.24.26 | error={"changed": false, "elapsed": 2401, "msg": "timed out waiting for ping module test success: Data could not be sent to remote host \"192.168.24.26\". Make sure this host can be reached over ssh: Warning: Permanently added '192.168.24.26' (ECDSA) to the list of known hosts.\r\nheat-admin@192.168.24.26: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n"}
~~~

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/e04d671/logs/baremetal_1-console.log
~~~
unchecked MSR access error: WRMSR to 0xda0 (tried to write 0x0000000000000000) at rIP: 0xffffffffac069f84 (native_write_msr+0x4/0x20)
~~~

Wallaby:-
https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/00e43b7/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Ussuri:-
https://logserver.rdoproject.org/openstack-periodic-integration-stable3/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-ussuri/4f53691/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Changed in tripleo:
importance:	High → Critical
tags:	added: promotion-blocker

Revision history for this message

Arx Cruz (arxcruz) wrote on 2021-05-27:

#2

I notice the failing logs have this:

[ 0.000000] DMI: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014

While the passing logs have this:

[ 0.000000] DMI: OpenStack Foundation OpenStack Nova, BIOS 1.11.0-2.el7 04/01/2014

Revision history for this message

Arx Cruz (arxcruz) wrote on 2021-05-27:

#3

More findings:

Controller 0:
https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/overcloud-controller-0/var/log/extra/cpuinfo.txt.gz

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 virt_ssbd arat npt nrip_save umip

It has the virt_ssbd flag, while otherr like Controller 1 doesn't:

https://logserver.rdoproject.org/46/28446/37/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6adb4a3/logs/overcloud-controller-1/var/log/extra/cpuinfo.txt.gz

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save umip

Revision history for this message

sean mooney (sean-k-mooney) wrote on 2021-05-27:

#4

there is a previous msr error in those log that seems to have been ignored

[ 0.000000] Linux version 4.18.0-305.el8.x86_64 (<email address hidden>) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Tue May 18 14:52:12 UTC 2021
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-305.el8.x86_64 root=UUID=04ec3b16-1886-4fab-9c53-6831b66fc57c ro console=ttyS0,115200n8 no_timer_check crashkernel=auto
[ 0.000000] unchecked MSR access error: WRMSR to 0xda0 (tried to write 0x0000000000000000) at rIP: 0xffffffff8ac69f84 (native_write_msr+0x4/0x20)
[ 0.000000] Call Trace:
[ 0.000000] ? fpu__init_cpu_xstate+0xb6/0xf0
[ 0.000000] ? fpu__init_system_xstate+0x140/0x597
[ 0.000000] ? 0xffffffff8ac00000
[ 0.000000] ? fpu__init_system+0x208/0x234
[ 0.000000] ? 0xffffffff8ac00000
[ 0.000000] ? early_cpu_init+0x332/0x34f
[ 0.000000] ? setup_arch+0xe0/0xde0
[ 0.000000] ? printk+0x58/0x6f
[ 0.000000] ? start_kernel+0x63/0x538
[ 0.000000] ? x86_family+0x5/0x20
[ 0.000000] ? load_ucode_bsp+0x76/0x12e

related to the fpu init

Revision history for this message

Alan Pevec (apevec) wrote on 2021-05-27:

#5

FTR rdo.vexxhost.ca openstack cluster did not have updates for months and computes are centos7 (linked rhbz is for RHEL8 qemu-kvm).
Diff in BIOS is b/c 2 new computes added few months ago have nova-libvirt container image based on Ubuntu but hypervisor is still EL7 + qemu-kvm-ev from VirtSIG, so we should have seen those failures at that point.

@Arx can you confirm 100% of fails have 1.13.0-1ubuntu1.1 and when did they start happening?

Revision history for this message

Arx Cruz (arxcruz) wrote on 2021-05-31:

#6

@Alan although the failures are 100% on the 1.13.0-1ubuntu1.1, I also saw right now a passing one on the same ubuntu bios, so we can discard it as the root cause. However the MSR error only appears on the ubuntu bios.
I'm doing more investigation.

Revision history for this message

wes hayutin (weshayutin) wrote on 2021-06-01:

#7

Just in case:
./diff_builds.py -c https://logserver.rdoproject.org/89/793089/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/79f8811/
logs/ -t https://logserver.rdoproject.org/74/792474/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/071f174/logs/

results:
https://termbin.com/9r2r1

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-02: Related fix proposed to tripleo-ci (master)

#8

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/794208

Revision history for this message

Sandeep Yadav (sandeepyadav93) wrote on 2021-06-03:

#9

Hello,

The node on which overcloud deployment failed because it's not reachable. I was able to access that after reboot. I notice below traceback in cloud-int log..
~~~
2021-06-02 10:31:45,168 - util.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 850, in _run_modules
    freq=freq)
  File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py", line 107, in handle
    (key_fn, key_entries) = ssh_util.extract_authorized_keys(user_name)
  File "/usr/lib/python3.6/site-packages/cloudinit/ssh_util.py", line 253, in extract_authorized_keys
    (ssh_dir, pw_ent) = users_ssh_info(username)
  File "/usr/lib/python3.6/site-packages/cloudinit/ssh_util.py", line 226, in users_ssh_info
    pw_ent = pwd.getpwnam(username)
KeyError: "getpwnam(): name not found: 'centos'"
~~~

This traceback is not seen in other overcloud nodes which worked earlier during deployment.

Revision history for this message

Sandeep Yadav (sandeepyadav93) wrote on 2021-06-03:

#10

Hello,

Issue reproduced in test patch even which using older cloud-int:-

https://logserver.rdoproject.org/61/33961/2/check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-1/6334684/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
~~~
2021-06-03 07:30:42 | 2021-06-03 07:30:42.201383 | fa163e84-a8f3-8bc3-3d73-00000000003d | FATAL | Wait for connection to become available | 192.168.24.21 | error={"changed": false, "elapsed": 2418, "msg": "timed out waiting for ping module test success: Data could not be sent to remote host \"192.168.24.21\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.21 port 22: No route to host\r\n"}
~~~

https://logserver.rdoproject.org/61/33961/2/check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-1/6334684/logs/overcloud-controller-1/var/log/extra/package-list-installed.txt.gz
~~~
cloud-init.noarch 20.3-10.el8 @quickstart-centos-appstreams
~~~

temporary to test old cloud-int we used below patches as depends-On:

https://review.opendev.org/c/openstack/diskimage-builder/+/794395 and https://review.opendev.org/c/openstack/tripleo-quickstart/+/794225

Revision history for this message

Juan Badia Payno (jbadiapa) wrote on 2021-06-07:

#11

As a workaround after executing the provisioning of the nodes, the nodes can be checked that there are pingable and loggable (ssh).

If they are not they can be restarted by:
- openstack baremetal node power off NODE_NAME
- openstack baremetal node power on NODE_NAME

Revision history for this message

Juan Badia Payno (jbadiapa) wrote on 2021-06-07:

#12

Download full text (3.3 KiB)

I had this problem with two different outcomes:

The one described on comment#9 and the following outcome:
2021-06-04 21:40:29,701 - util.py[DEBUG]: Failed to set the hostname to localhost (localhost)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 84, in handle
    cloud.distro.set_hostname(hostname, fqdn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/__init__.py", line 118, in set_hostname
    self._write_hostname(writeable_hostname, self.hostname_conf_fn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/rhel.py", line 87, in _write_hostname
    subp.subp(['hostnamectl', 'set-hostname', str(hostname)])
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 293, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory
2021-06-04 21:40:29,707 - handlers.py[DEBUG]: finish: init-network/config-set_hostname: FAIL: running config-set_hostname with frequency once-per-instance
2021-06-04 21:40:29,708 - util.py[WARNING]: Running module set_hostname (<module 'cloudinit.config.cc_set_hostname' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py'>) failed
2021-06-04 21:40:29,708 - util.py[DEBUG]: Running module set_hostname (<module 'cloudinit.config.cc_set_hostname' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 84, in handle
    cloud.distro.set_hostname(hostname, fqdn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/__init__.py", line 118, in set_hostname
    self._write_hostname(writeable_hostname, self.hostname_conf_fn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/rhel.py", line 87, in _write_hostname
    subp.subp(['hostnamectl', 'set-hostname', str(hostname)])
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 293, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 850, in _run_modules
    freq=freq)
  File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 88, in handle
    raise SetHostnameError("%s: %s" % (msg, e)) from e
cloudinit.config.cc_set_hostname.SetHostnameError: Failed to set the hostname to localhost (localhost): Unexpected error while running command.
Command: ['hostnamectl...

I had this problem with two different outcomes:

The one described on comment#9 and the following outcome:
2021-06-04 21:40:29,701 - util.py[DEBUG]: Failed to set the hostname to localhost (localhost)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 84, in handle
    cloud.distro.set_hostname(hostname, fqdn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/__init__.py", line 118, in set_hostname
    self._write_hostname(writeable_hostname, self.hostname_conf_fn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/rhel.py", line 87, in _write_hostname
    subp.subp(['hostnamectl', 'set-hostname', str(hostname)])
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 293, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory
2021-06-04 21:40:29,707 - handlers.py[DEBUG]: finish: init-network/config-set_hostname: FAIL: running config-set_hostname with frequency once-per-instance
2021-06-04 21:40:29,708 - util.py[WARNING]: Running module set_hostname (<module 'cloudinit.config.cc_set_hostname' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py'>) failed
2021-06-04 21:40:29,708 - util.py[DEBUG]: Running module set_hostname (<module 'cloudinit.config.cc_set_hostname' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 84, in handle
    cloud.distro.set_hostname(hostname, fqdn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/__init__.py", line 118, in set_hostname
    self._write_hostname(writeable_hostname, self.hostname_conf_fn)
  File "/usr/lib/python3.6/site-packages/cloudinit/distros/rhel.py", line 87, in _write_hostname
    subp.subp(['hostnamectl', 'set-hostname', str(hostname)])
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 293, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 850, in _run_modules
    freq=freq)
  File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_set_hostname.py", line 88, in handle
    raise SetHostnameError("%s: %s" % (msg, e)) from e
cloudinit.config.cc_set_hostname.SetHostnameError: Failed to set the hostname to localhost (localhost): Unexpected error while running command.
Command: ['hostnamectl', 'set-hostname', 'localhost']
Exit code: 1
Reason: -
Stdout:
Stderr: Failed to create bus connection: No such file or directory

Revision history for this message

Sandeep Yadav (sandeepyadav93) wrote on 2021-06-07:

#13

@yatin and me were debugging this, We got a local reproducer, from testing looks like new hostname-3.20-7.el8.x86_64 is our culprit.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1959720

From train: Difference between rpm working(24th) and affected(27th)

http://pastebin.test.redhat.com/969401

hostname were one of the updated packages
~~~
hostname-3.20-6.el8.x86_64 | hostname-3.20-7.el8.x86_64
~~~

Fix is available: https://git.centos.org/rpms/hostname/c/e097d2aac3e76eebbaac3ee4c2b95f575f3798fa?branch=c8s

from local testing looks like .. hostname downgrade solve the issue...

Trying with https://review.opendev.org/c/openstack/tripleo-quickstart/+/794636 to confirm in CI. Awaiting results

Upstream C8 stream mirror have already moved back from hostname-3.20-7.el8.x86_64 to hostname-3.20-6.el8.x86_64, waiting on mirror sync.

Revision history for this message

wes hayutin (weshayutin) wrote on 2021-06-07:

#14

Woot.. thanks Harold, Sandeep and everyone else!

http://mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org/centos/8-stream/BaseOS/x86_64/os/Packages/
hostname-3.20-6.el8...> 2019-07-02 13:22 32K

https://review.rdoproject.org/zuul/build/cc41082bfcd345349e09de7a5bcfd8d1

worked!!!

Changed in tripleo:
status:	Triaged → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-08: Change abandoned on tripleo-quickstart (master)

#15

Change abandoned by "Sandeep Yadav <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/794636
Reason: This is not needed anymore.. infra rolled out previous version of hostname package and removed the new affected version.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-15: Change abandoned on tripleo-ci (master)

#16

Change abandoned by "Arx Cruz <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/794208

tripleo

Unchecked MSR access error - overcloud deploy "timed out waiting for ping module test

Bug Description

Other bug subscribers

Remote bug watches