Layered Build: Main controller fails after unlock - drbd error on manifest

Bug #1864221 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Scott Little

Bug Description

Brief Description
-----------------
Simplex configuration fails after the initial unlock. It's not possible to load the OpenStack admin credentials.

Severity
--------
Critical

Steps to Reproduce
------------------
Follow the instructions on docs.starlingx.io for Simplex on bare metal. The error appeared after the initial unlock.

Expected Behavior
------------------
A few minutes after the unlock, the system should be available.

Actual Behavior
----------------
After the unlock, services are not working:

controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.
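
For context on that message: /etc/platform/openrc only exports the admin credentials once the node has actually gone active, so the error is a downstream symptom of the failed unlock (the manifests never brought controller-0 to active) rather than the fault itself. The guard is roughly of this shape (an editor's paraphrase for illustration, not the literal file contents):

# Paraphrase of the kind of check openrc performs; not the real script.
if [ "$(hostname)" != "${ACTIVE_CONTROLLER}" ]; then
    echo "Openstack Admin credentials can only be loaded from the active controller."
    return 1
fi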

Reproducibility
---------------
Seen once. Will try again when a system is available.

System Configuration
--------------------
Simplex

Branch/Pull Time/Commit
-----------------------
Layered build: 20200220T172409Z

Last Pass
---------
Initial sanity for layered build.

Timestamp/Logs
--------------
Some outputs: http://paste.openstack.org/show/789874/
Full collect: https://files.starlingx.kube.cengn.ca/download_file/44


Test Activity
-------------
Sanity. Layered build.

description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - sanity issue w/ layered build

summary: - Main controller fails after unlock - drbd error on manifest
+ Layered Build: Main controller fails after unlock - drbd error on
+ manifest
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0 stx.build stx.sanity
Changed in starlingx:
assignee: nobody → Scott Little (slittle1)
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This issue still appears on a new build, 20200225T194204Z; it was reproduced on both Simplex and Duplex configurations: http://paste.openstack.org/show/790036/

Full collect from duplex baremetal: https://files.starlingx.kube.cengn.ca/download_file/50

Revision history for this message
Mihail-Laurentiu Trica (mtrica) wrote :

The issue is still reproducible with build 20200226T090221Z, on both Simplex and Duplex configurations.

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

The issue is still reproducible with build 20200227T122131Z, on both Simplex and Duplex configurations.

Revision history for this message
Frank Miller (sensfan22) wrote :

We are unable to reproduce this issue when testing the ISO generated by build layering. Using the /mirror/starlingx/master/centos/flock/20200227T122131Z/outputs/iso/ load, sanity passed for both AIO-SX and AIO-DX in our hardware labs.

Next step is to determine why you are seeing issues in your lab. Please set up a Zoom session so we can debug jointly; ideally we would watch you reproduce the issue so it can be debugged in real time.

Revision history for this message
Frank Miller (sensfan22) wrote :

From the puppet logs this is an issue with being out of memory:

2020-02-20T23:13:49.329 Notice: 2020-02-20 23:13:49 +0000 /Stage[main]/Platform::Drbd::Platform/Platform::Drbd::Filesystem[drbd-platform]/Drbd::Resource[drbd-platform]/Drbd::Resource::Enable[drbd-platform]/Drbd::Resource::Up[drbd-platform]/Exec[reuse existing DRBD resource drbd-platform]/returns: drbd-platform: Failure: (122) kmalloc() failed. Out of memory?
2020-02-20T23:13:49.332 Notice: 2020-02-20 23:13:49 +0000 /Stage[main]/Platform::Drbd::Platform/Platform::Drbd::Filesystem[drbd-platform]/Drbd::Resource[drbd-platform]/Drbd::Resource::Enable[drbd-platform]/Drbd::Resource::Up[drbd-platform]/Exec[reuse existing DRBD resource drbd-platform]/returns: Command 'drbdsetup new-resource drbd-platform --cpu-mask=300000003' terminated with exit code 10

It is also reported in https://bugs.launchpad.net/starlingx/+bug/1863362

The bug is not related to the layered build but is an intermittent issue that is not yet understood. Has this same lab been used to run sanity with a regular (non-layered) ISO built on the same day?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Frank,

We used the latest layered build: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20200305T111629Z/

However, we still face the exact same issue on Simplex and Duplex.

LOGS
=======
Simplex:
Some outputs: http://paste.openstack.org/show/790365/
Collect: https://files.starlingx.kube.cengn.ca/download_file/54

Duplex:
Some outputs: http://paste.openstack.org/show/790368/
Collect: https://files.starlingx.kube.cengn.ca/download_file/55

HARDWARE DETAILS
=======
Here are some hardware details for our simplex system: http://paste.openstack.org/show/790370/

Both duplex systems have the same hardware specs.

COMMANDS EXECUTED
=======
These are the commands that we ran against the simplex system before doing the unlock:
http://paste.openstack.org/show/790373/
After that, the system boots, but that's when we receive the error "Openstack Admin credentials can only be loaded from the active controller."

NOTE: The exact same steps are followed for both the monolithic and layered builds. Is there a different command that needs to be executed?

QUESTIONS
=======
So regarding the questions:
- We are using the exact same server for Simplex on the monolithic and layered builds.
- Another pair of servers is used for Duplex on the monolithic and layered builds.
- The latest monolithic build doesn't hit this error on Simplex or Duplex; the sanity report was sent as green.
- We can arrange the meeting; however, could you please let us know at which point you want the servers to be left? Just with the image installed? Or would it be better to execute every command up to just before "system host-unlock controller-0"? That way, we can check the status before the unlock and monitor how it progresses.

Revision history for this message
Chris Friesen (cbf123) wrote :

1) Can we get a "collect" from the monolithic system as well?

2) Also, I was wondering if you could try the following commands before the initial unlock while installing each of the monolithic and layered builds:

First, run "drbdsetup new-resource blah --cpu-mask=300000003|dmesg|tail" and check whether you see an error in the kernel logs that looks like "Overflow in bitmap_parse(300000003), truncating to 256 bits"

Next, run "drbdsetup new-resource blah2 --cpu-mask=3,00000003|dmesg|tail" and see if any new drbd-related errors have been added to the kernel logs.

3) Lastly, could you run "grep DRBD /boot/config-3.10.0*" and "find /lib/modules/3.10.0* -name drbd.ko" on both the monolithic and layered builds?

In the one monolithic lab I checked, the drbd kernel module is provided by the kmod-drbd-rt-8.4.11-1.tis.0.x86_64 package, which will emit the above "overflow" message to the kernel logs instead of returning an error code back up to userspace.

The logs from https://files.starlingx.kube.cengn.ca/download_file/50, however, have the following:
2020-02-25T23:03:01.537 controller-0 kernel: warning [ 370.541536] d-con drbd-platform: bitmap_parse() failed with -75
2020-02-25T23:03:01.634 controller-0 kernel: warning [ 370.638097] d-con drbd-rabbit: bitmap_parse() failed with -75
2020-02-25T23:03:01.689 controller-0 kernel: warning [ 370.693070] d-con drbd-extension: bitmap_parse() failed with -75
2020-02-25T23:03:02.137 controller-0 kernel: warning [ 371.139836] d-con drbd-etcd: bitmap_parse() failed with -75
2020-02-25T23:03:02.212 controller-0 kernel: warning [ 371.214433] d-con drbd-dockerdistribution: bitmap_parse() failed with -75
2020-02-25T23:06:02.802 controller-0 kernel: warning [ 551.381727] d-con drbd-pgsql: bitmap_parse() failed with -75

These logs make it seem like the layered build is somehow using the in-kernel implementation rather than the out-of-tree implementation. Alternatively, something is messed up and it's not recognizing that EOVERFLOW corresponds to 75.
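
For anyone decoding the masks: the kernel bitmap format takes comma-separated groups of at most eight hex digits (32 bits each), most significant group first, so a nine-digit group like 300000003 is rejected with -EOVERFLOW (-75), while 3,00000003 parses to the same bits. A small bash sketch (editor's illustration, not from the thread):

# Expand a kernel-format cpu-mask into the CPU numbers it selects.
decode_cpu_mask() {
    local value=0 group
    for group in ${1//,/ }; do                  # each group is a 32-bit hex word
        value=$(( (value << 32) | 16#$group ))
    done
    local cpu list=""
    for (( cpu = 0; value >> cpu; cpu++ )); do
        (( (value >> cpu) & 1 )) && list+="$cpu "
    done
    echo "$list"
}
decode_cpu_mask 3,00000003   # -> 0 1 32 33
decode_cpu_mask 300000003    # same value numerically, but the single
                             # 9-hex-digit group is what trips bitmap_parse()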

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Chris,

From the Layered Build, I got the following:

drbdsetup new-resource blah --cpu-mask=300000003|dmesg|tail
returns an error:
blah: Failure: (122) kmalloc() failed. Out of memory?
and
[ 4079.525275] d-con blah: bitmap_parse() failed with -75

drbdsetup new-resource blah2 --cpu-mask=3,00000003|dmesg|tail
doesn't return an error on kernel logs.

After 15-20 minutes, the last entry in dmesg is still:
[ 4079.525275] d-con blah: bitmap_parse() failed with -75

Full outputs here:

http://paste.openstack.org/show/790506/

Tomorrow, I'll install this same hardware with the monolithic build and provide the same output and collect.

Thanks a lot for your help!

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

Hi Chris,
We checked the monolithic build on bare metal and found errors similar to those we were expecting.

Please take a look at:

http://paste.openstack.org/show/790517/

Revision history for this message
Chris Friesen (cbf123) wrote :

On the layered build, can you check whether there's a "drbd.ko" file under the "extra" kernel modules directory? It should look something like this:

controller-1:/home/sysadmin# ls -l /lib/modules/3.10.0-1062*/extra/drbd/drbd.ko
-rw-r--r--. 1 root root 684157 Feb 13 03:51 /lib/modules/3.10.0-1062.1.2.el7.2.tis.x86_64/extra/drbd/drbd.ko

Revision history for this message
Chris Friesen (cbf123) wrote :

Looking at the drbd driver version in the kernel log, it's reporting version 8.4.3:

-bash-4.2$ grep drbd kern.log |grep Version
2020-02-25T22:33:53.116 localhost kernel: info [ 555.710962] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
2020-02-25T23:00:52.074 controller-0 kernel: info [ 241.382367] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
2020-02-25T23:02:09.308 controller-0 kernel: info [ 318.434910] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)

The out-of-tree driver (which is what we should be using) is version 8.4.11.

Is the kmod-drbd-8.4.11-1.tis.0.x86_64.rpm being installed as part of the layered loadbuild?
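
(For reference, a generic way to confirm which driver a node actually uses; these are stock commands, not output from the lab:)

cat /proc/drbd              # once loaded, the first line reports the running version
modinfo -F version drbd     # version of the drbd.ko modprobe would pick
modinfo -F filename drbd    # .../extra/drbd/drbd.ko (out-of-tree) vs .../kernel/drivers/block/drbd/ (in-tree)
rpm -q kmod-drbd            # is the out-of-tree package installed at all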

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Chris,

Here are the outputs from the layered build before the unlock:

controller-0:~$ cat /etc/build.info |egrep "BUILD_ID|JOB"
BUILD_ID="20200307T090226Z"
JOB="STX_build_layer_flock_master_master"
FLOCK_JOB="STX_build_layer_flock_master_master"
DISTRO_JOB="STX_build_layer_distro_master_master"
COMPILER_JOB="STX_build_layer_compiler_master_master"

controller-0:~$ ls -l /lib/modules/3.10.0-1062*/extra/drbd/drbd.ko
-rw-r--r--. 1 root root 684157 ene 7 23:56 /lib/modules/3.10.0-1062.1.2.el7.1.tis.x86_64/extra/drbd/drbd.ko

controller-0:~$ rpm -qa |grep kmod-drbd
kmod-drbd-8.4.11-1.tis.0.x86_64

Kernel parameters:
controller-0:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1062.1.2.el7.2.tis.x86_64 root=UUID=6f98e090-a0de-40b3-907c-b43ac5e63227 ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=ttyS0,115200 iommu=pt usbcore.autosuspend=-1 hugepagesz=2M hugepages=0 default_hugepagesz=2M isolcpus=2,34,3,35 rcu_nocbs=2-31,34-63 kthread_cpus=0,32,1,33 irqaffinity=0,32,1,33 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 intel_iommu=on user_namespace.enable=1

Should I try the unlock and check the outputs? The failure happens during the unlock activity; I assume something might be different after the unlock.
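
One detail worth flagging in the output above: the drbd.ko sits under a 3.10.0-1062.1.2.el7.1.tis module directory, while the boot image is 3.10.0-1062.1.2.el7.2.tis, so the running kernel cannot find the out-of-tree module at all. A generic consistency check for that class of mismatch (editor's sketch):

uname -r                                  # the kernel actually running
ls /lib/modules/$(uname -r)/extra/drbd/   # is the out-of-tree drbd built for THIS kernel?
modprobe --show-depends drbd              # prints which drbd.ko would be loaded, without loading it
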

Revision history for this message
Frank Miller (sensfan22) wrote :

After some further debug by Chris, the drbd command is failing because it is using an in-tree kmod instead of an out-of-tree kmod.

Don Penney looked into the output for the build layering build and sees that the distro layer is building the kmod-drbd RPM correctly, but the flock layer is not picking up the right RPM. This is likely because the version string is unchanged, so whatever repoquery command is being used is finding a very old build of the RPM.

It looks like the latest flock ISO build has a kmod-drbd built in January rather than the recently built one from the distro layer.

Next step will be for Scott to update how the Flock build is pulling in the rpms for kmods so it aligns with the monolithic build.
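
To illustrate the failure mode (repo ids and paths below are invented): when two repos carry kmod-drbd with an identical name-version-release, version comparison alone cannot prefer the freshly built copy, so a plain query may resolve to the stale one.

repoquery --repofrompath=stale,/mirror/flock/old-outputs \
          --repofrompath=fresh,/mirror/distro/latest-outputs \
          --repoid=stale --repoid=fresh \
          --qf '%{name}-%{version}-%{release}.%{arch} %{repoid}' \
          kmod-drbd
# -> kmod-drbd-8.4.11-1.tis.0.x86_64 either way; the NVR cannot tell the builds apart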

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tools (master)

Fix proposed to branch: master
Review: https://review.opendev.org/713968

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tools (master)

Reviewed: https://review.opendev.org/717755
Committed: https://git.openstack.org/cgit/starlingx/tools/commit/?id=bbd79fcd380992192027601b9444edd668803ed4
Submitter: Zuul
Branch: master

commit bbd79fcd380992192027601b9444edd668803ed4
Author: Scott Little <email address hidden>
Date: Sat Apr 4 19:16:28 2020 -0400

    Fast download of lower layer rpms

    The default yumdownloader approach downloads files
    one at a time and is very slow: too slow for the layered
    build, which is built around the assumption that files can
    be downloaded faster than they can be built.

    This update will switch from yumdownloader to reposync
    for the download of lower layer rpms. It exploits the
    fact that each layer and build type publishes its own lst file
    of rpms to download. The lst file is transformed into an
    includepkgs directive in a custom yum.conf, which is passed
    to reposync so we only sync the desired rpms.

    Reposync won't redownload rpms that it already has, even if the
    upstream repodata indicates that the file's checksum has changed.
    Forcing the redownload of these rpms requires that we manually
    download the upstream repodata and use verifytree to
    identify and delete the newly obsolete rpms.

    Also included are two small bug fixes found while investigating an
    alternative solution to launchpad 1864221.
    - incorrect userid in a chown
    - bug in get_url: --url being passed in the wrong place

    Story: 2006166
    Task: 39307
    Closes-Bug: 1864221
    Change-Id: If12b98ff4f5f24d9318250356920f397419f0f80
    Signed-off-by: Scott Little <email address hidden>
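
For readers unfamiliar with the mechanism the commit describes, the approach is roughly shaped like this (an editor's sketch with invented names and paths; the real build scripts differ, and it assumes includepkgs accepts the lst file's full name-version strings, as yum package specs generally do):

# Build a custom yum.conf whose includepkgs is generated from the lower
# layer's published lst file, then let reposync fetch only those rpms.
LST=layer_rpms.lst                # lst file published by the lower layer
REPOID=lower-layer

PKGS=$(sed 's/\.rpm$//' "$LST" | tr '\n' ' ')

cat > layer-yum.conf <<EOF
[main]
reposdir=/dev/null

[$REPOID]
name=$REPOID
baseurl=http://mirror.example.com/lower-layer/outputs/rpms
enabled=1
gpgcheck=0
includepkgs=$PKGS
EOF

# reposync skips rpms it already has; forcing a refresh means pulling the
# upstream repodata and pruning obsolete local copies, per the commit text.
reposync -c layer-yum.conf -r $REPOID -p ./downloads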

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tools (master)

Change abandoned by Scott Little (<email address hidden>) on branch: master
Review: https://review.opendev.org/713968

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tools (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729840

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tools (f/centos8)

Reviewed: https://review.opendev.org/729840
Committed: https://git.openstack.org/cgit/starlingx/tools/commit/?id=05ba65cb054e71ef7c2669668831f50ee334e768
Submitter: Zuul
Branch: f/centos8

commit d923525441f034c31f0231d15e0f55e3ac5c37b6
Author: Don Penney <email address hidden>
Date: Wed May 20 14:20:35 2020 -0400

    Update 7.7.1908 repo to point to vault

    The Centos 7.7.1908 repo has been deprecated and moved to the vault.
    This commit updates the repo config file to point to the new baseurl
    locations.

    Change-Id: I0d9b98eea925ad05544c7a8bd626ee5dcc48d103
    Signed-off-by: Don Penney <email address hidden>

commit daff8261d18e7512c2ebc1d42fa4a75ddbaddda8
Author: Ran An <email address hidden>
Date: Fri May 15 01:34:46 2020 +0000

    Add python3 rpms required by python3-daemon runtime

    This reverts commit 6fa137af2312abbdfb0f0f015869d65c29e36b82.

    Add python3 rpms required by python3-daemon runtime
    adding python3 pkgs to support python3-daemon build

    Story: 2007106
    Task: 39353

    Signed-off-by: SidneyAn <email address hidden>
    Change-Id: I12e1ed08381cbc45480c6c02e4d93c8e04fac2dc

commit e12c55e47ed72fc4e4c38c11525769d89d1f9eed
Author: Ran An <email address hidden>
Date: Fri May 15 01:34:42 2020 +0000

    python3: add python3 and python3-devel

    This reverts commit 79abb72562b53c8682ea0d854e1071540b38a398.

    following pkgs were added:
    python3
    python3-lib
    python3-devel
    python3-rpm-generators (python3-devel dependency)
    python3-rpm-macros (python3-devel dependency)

    Story: 2007106
    Task: 39147

    Signed-off-by: SidneyAn <email address hidden>
    Change-Id: I127453a728a0ce1bbc517a5505c0b187fc945207

commit 008685698ae3e520dbc5178e76fa5956697192ed
Author: Shuicheng Lin <email address hidden>
Date: Wed Apr 29 04:25:59 2020 +0800

    Upgrade kata containers to 1.11.0

    Includes several fixes for IPv6 support in Kata.

    Test:
    Could create kata container in both IPv4 and IPv6 environment.

    Story: 2006145
    Task: 39077

    Change-Id: I0fcd2b516f9003a851e601493ae7c323ceab8df5
    Signed-off-by: Shuicheng Lin <email address hidden>

commit 79abb72562b53c8682ea0d854e1071540b38a398
Author: Ran An <email address hidden>
Date: Thu May 14 11:42:07 2020 +0000

    Revert "python3: add python3-devel dep pkgs"

    This reverts commit 50bff514030d51dece7d23c2e1b64284ea5cda27.

    Change-Id: I50d5926f3ead6e773fa8e86ed2ef72f86ff9f343

commit 6fa137af2312abbdfb0f0f015869d65c29e36b82
Author: Ran An <email address hidden>
Date: Thu May 14 11:42:01 2020 +0000

    Revert "Add python3 rpms required by python3-daemon runtime"

    This reverts commit b2b481a7b0d348cf332e8002ec522457fbb80814.

    Change-Id: Ieb7d2dae22efa31730d3f5560c17162b668eca58

commit 07e819c768456022cf2668ebd7f2fb1b6596b9dc
Author: Poornima <email address hidden>
Date: Tue Apr 21 06:32:32 2020 +0530

    Added mirror folder creation

    Added mirror folder creation in the docker automation script

    Depends-on: https://review.opendev.org/#/c/721517/

    Story: 2007580
    Task: 39501
    Change-Id: Ib...

tags: added: in-f-centos8