stx-openstack: etcd 1MB size limit will prevent scaling up openstack workers

Bug #1887677 reported by Bob Church
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Angie Wang

Bug Description

Brief Description
-----------------
Etcd is designed to store small key/value pairs. Helm/Tiller stores it's release information in
configmaps which will have that underlying limit. As worker nodes scale out in an Openstack
installation the configmap for the Nova and Neutron releases will grow in size proportional to the number of Openstack workers as each contain per/host override information. When this limit is reached, the stx-openstack application will apply successfully but the release configmap will not
be saved. Once this occurs, Helm/Tiller will have lost track of the applied k8s resources. Armada re-applies of the stx-openstack application will fail. Manual cleanup the k8s resources will be required.

This LP needs to investigate:
a) The current limits (i.e. how many workers are currently supported)
b) Is there an Etcd option to increase this size (maybe --max-request-bytes) and what is the impact
c) Use an alternative Helm storage backend (https://helm.sh/docs/topics/advanced/#sql-storage-backend)

Severity
--------
Major: System will hit a worker scalability limit in < 20 nodes.

Steps to Reproduce
------------------
Keep adding worker nodes and observe the size of the release configmap for neutron increase

Expected Behavior
------------------
Helm release information should be stored/maintained regardless of the number of workers

Actual Behavior
----------------
See description.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Multi-node 2 controller + N workers

Branch/Pull Time/Commit
-----------------------
Latest master branch

Last Pass
---------
This is the first time attempting scaling workers up to 20

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer

Workaround
----------
N/A

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - this is a scaling issues for systems w/ more than 20 workers. Will not hold up stx.4.0 on this at this late state.

summary: - stx-openstack: Etcd 1MB size limit will prevent scaling up openstack
+ stx-openstack: etcd 1MB size limit will prevent scaling up openstack
workers
tags: added: stx.5.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Daewon Kim (prudentcircle) wrote :

I am currently having the same issue with Neutron.
But I only have 9 worker nodes.

I installed stx 4.0 and the issue occurs with only 9 worker nodes + 2 controller nodes.
I think this is a critical limitation.

Revision history for this message
Frank Miller (sensfan22) wrote :

Re-assigning to Angie to start the investigation.

Changed in starlingx:
assignee: Paul-Ionut Vaduva (pvaduva) → Angie Wang (angiewang)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/761642

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/761645

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/761647

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/761649

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Angle:
  based W-1, The decision is just made to not port this helm storage change to stx4.0 so upgrade support is needed. will this on stx.5.0 or need new solution ?

Revision history for this message
Angie Wang (angiewang) wrote :

Hi Austin,

This solution will be on stx5.0. Yes, the upgrade support from stx4.0 is needed. I am currently working on it so I leave it -1W.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Angie,
    W-1 was removed, does it mean this fix is ready for review and merge ?
Thanks.
BR
Austin Sun.

Revision history for this message
Angie Wang (angiewang) wrote :

Hi Austin,

I believe -1W is still being set. Backup&restore testing is still in progress.

Thanks,
Angie

Revision history for this message
Austin Sun (sunausti) wrote :

ok. Thanks your update.

Revision history for this message
Angie Wang (angiewang) wrote :
Changed in starlingx:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)
Download full text (52.5 KiB)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough, it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0 which introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available in stx4.0 side so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting, otherwise playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibilty that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794285

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)
Download full text (48.0 KiB)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)
Download full text (37.0 KiB)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
        physicalVolumes:
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)
Download full text (147.3 KiB)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictonary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.