System controllers with 128 core threads fail to install due to drbd issue

Bug #2012639 reported by Takamasa Takenaka
This bug affects 1 person
Affects    Status        Importance  Assigned to        Milestone
StarlingX  Fix Released  Medium      Takamasa Takenaka

Bug Description

Brief Description
-----------------
The installation on controllers does not finish. It fails during
the drbd configuration, which conflicts with processors that have
128 cores (more than 64 cores).

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Try to install DC with the latest ISO

Expected Behavior
------------------
Installation finishes successfully

Actual Behavior
----------------
Installation fails because controller-0 fails to install (drbd error)

Reproducibility
---------------
Reproducible 100%

System Configuration
--------------------
DC System Controller IPv6/IPv4

Branch/Pull Time/Commit
-----------------------
The latest

Last Pass
---------
Never

Timestamp/Logs
--------------
root@controller-0:~# cat /var/log/puppet/latest/puppet.log | grep Error
2023-03-03T22:54:15.972 Debug: 2023-03-03 22:54:15 +0000 Facter: Error reading file: No such file or directory
2023-03-03T22:55:56.226 Error: 2023-03-03 22:55:56 +0000 'drbdadm up drbd-pgsql' returned 1 instead of one of [0]
2023-03-03T22:55:56.314 Error: 2023-03-03 22:55:56 +0000 /Stage[main]/Platform::Drbd::Pgsql/Platform::Drbd::Filesystem[drbd-pgsql]/Drbd::Resource[drbd-pgsql]/Drbd::Resource::Enable[drbd-pgsql]/Drbd::Resource::Up[drbd-pgsql]/Exec[enable DRBD resource drbd-pgsql]/returns: change from 'notrun' to ['0'] failed: 'drbdadm up drbd-pgsql' returned 1 instead of one of [0]
2023-03-03T22:55:56.416 Error: 2023-03-03 22:55:56 +0000 'drbdadm up drbd-rabbit' returned 1 instead of one of [0]
2023-03-03T22:55:56.505 Error: 2023-03-03 22:55:56 +0000 /Stage[main]/Platform::Drbd::Rabbit/Platform::Drbd::Filesystem[drbd-rabbit]/Drbd::Resource[drbd-rabbit]/Drbd::Resource::Enable[drbd-rabbit]/Drbd::Resource::Up[drbd-rabbit]/Exec[enable DRBD resource drbd-rabbit]/returns: change from 'notrun' to ['0'] failed: 'drbdadm up drbd-rabbit' returned 1 instead of one of [0]
2023-03-03T22:55:56.607 Error: 2023-03-03 22:55:56 +0000 'drbdadm up drbd-platform' returned 1 instead of one of [0]
2023-03-03T22:55:56.694 Error: 2023-03-03 22:55:56 +0000 /Stage[main]/Platform::Drbd::Platform/Platform::Drbd::Filesystem[drbd-platform]/Drbd::Resource[drbd-platform]/Drbd::Resource::Enable[drbd-platform]/Drbd::Resource::Up[drbd-platform]/Exec[enable DRBD resource drbd-platform]/returns: change from 'notrun' to ['0'] failed: 'drbdadm up drbd-platform' returned 1 instead of one of [0]
2023-03-03T22:55:56.798 Error: 2023-03-03 22:55:56 +0000 'drbdadm up drbd-extension' returned 1 instead of one of [0]
2023-03-03T22:55:56.885 Error: 2023-03-03 22:55:56 +0000 /Stage[main]/Platform::Drbd::Extension/Platform::Drbd::Filesystem[drbd-extension]/Drbd::Resource[drbd-extension]/Drbd::Resource::Enable[drbd-extension]/Drbd::Resource::Up[drbd-extension]/Exec[enable DRBD resource drbd-extension]/returns: change from 'notrun' to ['0'] failed: 'drbdadm up drbd-extension' returned 1 instead of one of [0]
2023-03-03T22:55:57.031 Error: 2023-03-03 22:55:56 +0000 'drbdadm up drbd-etcd' returned 1 instead of one of [0]
2023-03-03T22:55:57.117 Error: 2023-03-03 22:55:56 +0000 /Stage[main]/Platform::Drbd::Etcd/Platform::Drbd::Filesystem[drbd-etcd]/Drbd::Resource[drbd-etcd]/Drbd::Resource::Enable[drbd-etcd]/Drbd::Resource::Up[drbd-etcd]/Exec[enable DRBD resource drbd-etcd]/returns: change from 'notrun' to ['0'] failed: 'drbdadm up drbd-etcd' returned 1 instead of one of [0]
2023-03-03T22:55:57.392 Error: 2023-03-03 22:55:56 +0000 'drbdadm up drbd-dockerdistribution' returned 1 instead of one of [0]
2023-03-03T22:55:57.477 Error: 2023-03-03 22:55:56 +0000 /Stage[main]/Platform::Drbd::Dockerdistribution/Platform::Drbd::Filesystem[drbd-dockerdistribution]/Drbd::Resource[drbd-dockerdistribution]/Drbd::Resource::Enable[drbd-dockerdistribution]/Drbd::Resource::Up[drbd-dockerdistribution]/Exec[enable DRBD resource drbd-dockerdistribution]/returns: change from 'notrun' to ['0'] failed: 'drbdadm up drbd-dockerdistribution' returned 1 instead of one of [0]

Test Activity
-------------
Install test

Workaround
----------
N/A

Changed in starlingx:
assignee: nobody → Takamasa Takenaka (ttakenak)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/877992
Committed: https://opendev.org/starlingx/config/commit/bb8ae1ae4d79538e0cd6fd6d6ee7268143e2dc20
Submitter: "Zuul (22348)"
Branch: master

commit bb8ae1ae4d79538e0cd6fd6d6ee7268143e2dc20
Author: Takamasa Takenaka <email address hidden>
Date: Mon Mar 20 14:33:35 2023 -0300

    Truncate drbd cpu mask string to 31 bytes

    The drbd configuration has a cpu mask that pins the
    drbd process to specific cpu cores. The drbd process
    needs to be assigned to the cpus of the platform function.
    Internally this assignment is done by the parameter
    "cpu-mask" of the command "drbdsetup".
    In the code, the size of "cpu-mask" is defined as
    32 bytes. One byte is used for the "end-of-string"
    character, and a comma is required after every
    8 characters of text, which leaves us 28 bytes
    to represent the bitmask. Since each character of
    text represents 4 bits of bitmask but requires
    one byte to store, this allows us to represent
    a bitmask of at most 112 CPUs (or 56 cores
    with HT enabled).
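
The arithmetic in the paragraph above can be checked with a short sketch. The function name and its parameters are hypothetical and are not taken from the StarlingX or drbd sources. It only assumes what the commit message states: a 32-byte buffer, one byte for the end-of-string character, and a comma after every 8 hex characters.

```python
def max_cpus_for_mask_buffer(buffer_bytes=32, group=8):
    """Return how many CPUs a comma-grouped hex mask can describe.

    Hypothetical helper: buffer_bytes is the size of the cpu-mask
    buffer, group is the number of hex digits between commas.
    """
    usable = buffer_bytes - 1  # one byte is reserved for the terminator
    # Every full group costs `group` hex digits plus one comma.
    full_groups, rest = divmod(usable, group + 1)
    hex_digits = full_groups * group + rest
    return hex_digits * 4  # each hex digit encodes 4 CPUs


print(max_cpus_for_mask_buffer())    # 32-byte buffer
print(max_cpus_for_mask_buffer(36))  # buffer needed for 128 CPUs
```

With the default 32-byte buffer the result is 112, matching the commit message. A buffer of 36 bytes would be needed to describe 128 CPUs.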

    It is possible to assign the platform function by
    logical cpu id, and the drbd cpu_mask can become 0
    after truncation, for example if the function is
    assigned to cpus 126 and 127. To avoid this, we
    reject the configuration if the platform function is
    assigned to a logical cpu larger than 112, so that
    the user notices the need to configure it
    differently (without truncation).
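
A minimal sketch of the kind of bound check described above follows. It is not the code from the fix. The names are hypothetical, cpu ids are assumed to be zero-based (so ids 0 to 111 fit in the mask), and the special case where every cpu is a platform cpu is ignored here.

```python
# Assumed limit: 28 hex digits * 4 bits, as derived in the commit message.
MAX_MASK_CPUS = 112


def find_unrepresentable_cpus(platform_cpus):
    """Return the zero-based cpu ids that do not fit in the mask."""
    return sorted(cpu for cpu in set(platform_cpus) if cpu >= MAX_MASK_CPUS)


def validate_platform_cpus(platform_cpus):
    """Raise ValueError if any platform cpu is beyond the mask limit."""
    bad = find_unrepresentable_cpus(platform_cpus)
    if bad:
        raise ValueError(
            "platform cpus %s do not fit in the drbd cpu-mask" % bad)


validate_platform_cpus([0, 1, 64, 65])  # accepted
try:
    validate_platform_cpus([0, 1, 126, 127])
except ValueError as err:
    print(err)
```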

    Launchpad:1900174 made this assignment. But
    it overflows for a 128 core cpu (64 cpu with HT)
    because the cpu-mask string is
    "ffffffff,ffffffff,ffffffff,ffffffff"
    (35 characters, 36 bytes with the end-of-string byte).
    That launchpad also indicated that a comma is necessary
    every 8 bytes to parse the cpu-mask string properly.
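
To make the overflow concrete, the sketch below builds a comma-grouped hex mask from a list of cpu ids and measures its length. The function name is hypothetical. The grouping from the right, so that only the leftmost group can be short, is an assumption based on the usual Linux cpumask text format and is not confirmed by the commit.

```python
def format_cpu_mask(cpus, group=8):
    """Format zero-based cpu ids as a comma-grouped hex bitmask string."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    digits = format(mask, "x")
    # Split from the right so that only the leftmost group can be short.
    groups = []
    while digits:
        groups.insert(0, digits[-group:])
        digits = digits[:-group]
    return ",".join(groups)


full_128 = format_cpu_mask(range(128))
print(full_128, len(full_128))
full_112 = format_cpu_mask(range(112))
print(full_112, len(full_112))
```

For 128 cpus the string is 35 characters long, which does not fit in 31 usable bytes. For 112 cpus it is exactly 31 characters.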

    This fix:
    1. Sets cpu-mask=0 if all cpus are assigned to the
       platform function ("cpu-mask=0" means use all cpus)
       (the case of a controller in DC/STD)
    2. Truncates the cpu-mask string to 31 bytes if
       it is longer than 31 bytes
       (not 32 bytes, as the last byte is the string terminator)
    3. Calculates the cpu-mask if the cpu-mask string is
       shorter than 31 bytes (current implementation)
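
The three cases above can be sketched as a small decision function. This is an illustration only and not the merged code. The function name, arguments, and returned labels are hypothetical, and a mask of exactly 31 characters is assumed to fit. How the truncation itself is performed is not described in the commit message, so it is not modelled here.

```python
MAX_MASK_CHARS = 31  # 32-byte buffer minus the end-of-string byte


def choose_mask_action(platform_cpus, all_cpus, mask_string):
    """Return which of the three cases applies to a host.

    platform_cpus: cpu ids assigned to the platform function
    all_cpus:      every logical cpu id on the host
    mask_string:   the calculated comma-grouped hex mask
    """
    if set(platform_cpus) == set(all_cpus):
        return "use-zero-mask"        # case 1: cpu-mask=0, use all cpus
    if len(mask_string) > MAX_MASK_CHARS:
        return "truncate"             # case 2: string is too long
    return "use-calculated-mask"      # case 3: string fits as calculated


host_cpus = range(128)
print(choose_mask_action(range(128), host_cpus,
                         "ffffffff,ffffffff,ffffffff,ffffffff"))
print(choose_mask_action([0, 1], host_cpus, "3"))
```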

    Closes-bug: 2012639

    Test Plan:
    PASS: Install all platform function cpu with 128 cpu
          - Finished successfully
          - Set cpu-mask = 0 in configuration
          - "Set drbd cpu mask=0" is logged
    PASS: Install mixed function cpu with 128 cpu
          - Finished successfully
          - Calculate cpu-mask properly in configuration
    PASS: Modify cpu function with "system host-cpu-modify -c"
          (Modify with more than 31 bytes mask)
          - Command shows error message
    PASS: Modify cpu function with "system host-cpu-modify -c"
          (Modify with less than 31 bytes mask)
          - Command finished successfully
          - Function is assigned properly
          - Calculate cpu-mask properly in configuration
    PASS: Modify cpu function with "system host-cpu-modify -p"
          (Modify with more than 31 bytes mask)
          - Command shows error message
    PASS: Modify cpu function with "system host-cpu-modify -p"
          (Modify with less than 31 bytes mask)
          - Command finished su...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.config