stx-openstack apply is stuck at 31%

Bug #1949360 reported by Alexandru Dimofte
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Felipe Sanches Zanoni

Bug Description

Brief Description
-----------------
StarlingX installation failed during provisioning (exiting with a timeout) because the stx-openstack apply is stuck at 31%. All configurations are affected.

Severity
--------
Critical: Stx can't be installed

Steps to Reproduce
------------------
Try to install StarlingX build 20211030T181901Z.

Expected Behavior
------------------
Stx should install successfully

Actual Behavior
----------------
StarlingX installation failed because the stx-openstack apply is stuck at 31%:
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+-------------------------------+-----------------------------------+----------------------------------------+----------+-------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+-------------------------------+-----------------------------------+----------------------------------------+----------+-------------------------------------------+
| cert-manager | 1.0-22 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.1-16 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-59 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-42 | platform-integration-manifest | manifest.yaml | applied | completed |
| rook-ceph-apps | 1.0-11 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
| stx-openstack | 1.0-119-centos-stable- | armada-manifest | stx-openstack.yaml | applying | processing chart: osh-openstack-glance, |
| | versioned | | | | overall completion: 31.0% |
| | | | | | |
+--------------------------+-------------------------------+-----------------------------------+----------------------------------------+----------+-------------------------------------------+

HINT:
sysinv 2021-11-01 12:49:46.980 99694 ERROR oslo.messaging._drivers.impl_rabbit [-] [da4b97c1-82ed-4042-93f3-78f2d5a72a32] AMQP server on <sysinv.objects.address.address object at 0x7fdef134d290>:5672 is unreachable: failed to resolve broker hostname. Trying again in 32 seconds. Client port: None: error: failed to resolve broker hostname
sysinv 2021-11-01 12:50:18.993 99694 ERROR oslo.messaging._drivers.impl_rabbit [-] [784066fb-11d9-4223-8148-ae034ec8fed8] AMQP server on <sysinv.objects.address.address object at 0x7fdef1401790>:5672 is unreachable: failed to resolve broker hostname. Trying again in 32 seconds. Client port: None: error: failed to resolve broker hostname
sysinv 2021-11-01 12:50:19.011 99694 ERROR oslo.messaging._drivers.impl_rabbit [-] [da4b97c1-82ed-4042-93f3-78f2d5a72a32] AMQP server on <sysinv.objects.address.address object at 0x7fdef134d290>:5672 is unreachable: failed to resolve broker hostname. Trying again in 32 seconds. Client port: None: error: failed to resolve broker hostname

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
All configurations, bare metal and virtual.

Branch/Pull Time/Commit
-----------------------
20211030T181901Z

Last Pass
---------
20211027T024800Z

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Chant,
    As discussed, could you please triage this issue?

Thanks.
BR
Austin Sun.

Changed in starlingx:
assignee: nobody → YU CHENGDE (chant)
Revision history for this message
YU CHENGDE (chant) wrote :

hello

1.
The stx-openstack apply was stuck because Armada failed to apply openstack-glance.

I searched the logs and found the issue in

====
ALL_NODES_20211101.125218/controller-0_20211101.125218/var/log/pods/openstack_glance-metadefs-load-jnkjv_542da925-6a87-4d52-88a8-5fa7ce16689c/glance-metadefs-load/0.log
====

2021-11-01T12:33:16.256289485Z stdout F 2021-11-01 12:33:16.255 6 ERROR glance.db.sqlalchemy.metadata [-] Failed to parse json file /var/lib/openstack/etc/glance/metadefs/compute-hypervisor.json while populating metadata due to: 'ascii' codec can't decode byte 0xc2 in position 898: ordinal not in range(128): UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 898: ordinal not in range(128)

2021-11-01T12:33:16.328460709Z stdout F 2021-11-01 12:33:16.328 6 ERROR glance.db.sqlalchemy.metadata [-] Failed to parse json file /var/lib/openstack/etc/glance/metadefs/software-runtimes.json while populating metadata due to: 'ascii' codec can't decode byte 0xe2 in position 1367: ordinal not in range(128): UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1367: ordinal not in range(128)
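
For context, this UnicodeDecodeError is what Python raises when UTF-8 encoded data is decoded with the ascii codec; a minimal reproduction, not taken from the glance code, looks like this:

# 0xc2 and 0xe2 are typical UTF-8 lead bytes (e.g. for "µ" or an en dash);
# decoding such bytes with the ascii codec reproduces the metadefs error above.
data = "µ".encode("utf-8")   # b'\xc2\xb5'
data.decode("ascii")         # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 ...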

====
ALL_NODES_20211101.125218/controller-0_20211101.125218/var/log/pods/openstack_glance-ks-service-lhgkn_5de50a62-c1a6-4f9a-8895-de8a62581c17/image-ks-service-registration/0.log
====

2021-11-01T12:32:54.675000242Z stderr F + OS_SERVICE_DESC='RegionOne: glance (image) service'
2021-11-01T12:32:54.675043656Z stderr F + unset OS_SERVICE_ID
2021-11-01T12:32:54.675545656Z stderr F ++ seq 3
2021-11-01T12:32:54.678134714Z stderr F + for i in '$(seq 3)'
2021-11-01T12:32:54.679088664Z stderr F ++ grep ',glance,image$'
2021-11-01T12:32:54.67932324Z stderr F ++ openstack service list -f csv --quote none
2021-11-01T12:32:54.679546801Z stderr F ++ sed -e s/,glance,image//g
2021-11-01T12:32:59.228688184Z stderr F + OS_SERVICE_ID=1b912ffbbf2d4c89b4318e58bd5456b3
2021-11-01T12:32:59.228720252Z stderr F + [[ -n 1b912ffbbf2d4c89b4318e58bd5456b3 ]]
2021-11-01T12:32:59.228723849Z stderr F + exit 0

++++
next step
++++
I will trace the process of Armada applying openstack-glance and find out the root cause.

Revision history for this message
YU CHENGDE (chant) wrote :

Hello

After tracking the relevant logs, I found some error messages in ceph-manager.log.
I suspect that glance-api cannot find its object storage backend, which is why the openstack-glance deployment failed.

/var/log/ceph-manager.log
4854:2021-11-01 12:54:22.945 97854 ERROR ceph_manager.monitor [-] Getting the quota for cinder-volumes pool failed:unrecognized pool 'cinder-volumes'): CephMgrDumpError: Failed to get ceph manager info. Command 'ceph mgr dump --connect-timeout 5' returned non-zero exit status 1
4857:2021-11-01 12:54:22.973 97854 ERROR ceph_manager.monitor [-] Getting the quota for images pool failed:unrecognized pool 'images'): CephMgrDumpError: Failed to get ceph manager info. Command 'ceph mgr dump --connect-timeout 5' returned non-zero exit status 1
4860:2021-11-01 12:54:23.002 97854 ERROR ceph_manager.monitor [-] Getting the quota for ephemeral pool failed:unrecognized pool 'ephemeral'): CephMgrDumpError: Failed to get ceph manager info. Command 'ceph mgr dump --connect-timeout 5' returned non-zero exit status 1
4865:2021-11-01 12:54:23.053 97854 ERROR ceph_manager.monitor [-] Rados gateway object data pool does not exist.: CephMgrDumpError: Failed to get ceph manager info. Command 'ceph mgr dump --connect-timeout 5' returned non-zero exit status 1

Revision history for this message
YU CHENGDE (chant) wrote :

hello

https://opendev.org/starlingx/config/commit/a33342728ae86c1ce529656c5c944a275a5e6f64

This modification in the patch may produce different output:

    def _format_ceph_mon_address(self, ip_address):
-       return utils.format_ceph_mon_address(
-           ip_address, self.SERVICE_PORT_MON)
+       return utils.format_url_address(ip_address)

=========
def format_ceph_mon_address(ip_address, service_port_mon):
    return '%s:%d' % (format_url_address(ip_address), service_port_mon)

expected output
127.0.0.1:6789
=========

=========
def format_url_address(address):
    """Format the URL address according to RFC 2732"""
    try:
        addr = netaddr.IPAddress(address)
        if addr.version == constants.IPV6_FAMILY:
            return "[%s]" % address
        else:
            return str(address)
    except netaddr.AddrFormatError:
        return address

expected output
127.0.0.1
=========

Could this make the Ceph monitor unreachable?
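
To make the suspected difference concrete, here is a minimal sketch of the two helpers side by side (simplified stand-ins using the stdlib ipaddress module instead of netaddr; not the actual sysinv code):

import ipaddress

def format_url_address(address):
    # bracket IPv6 literals per RFC 2732, pass anything else through
    try:
        if ipaddress.ip_address(address).version == 6:
            return "[%s]" % address
    except ValueError:
        pass
    return str(address)

def format_ceph_mon_address(address, service_port_mon=6789):
    return "%s:%d" % (format_url_address(address), service_port_mon)

print(format_ceph_mon_address("192.168.204.2"))  # 192.168.204.2:6789
print(format_url_address("192.168.204.2"))       # 192.168.204.2 (monitor port dropped)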

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the storage prime as this appears related to the recent ceph upversion

tags: added: stx.6.0 stx.distro.openstack
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
assignee: YU CHENGDE (chant) → Delfino Gomes Curado Filho (dcuradof)
Revision history for this message
zhipeng liu (zhipengs) wrote :

The change that needs to be made is shown below.

https://opendev.org/starlingx/config/src/branch/master/sysinv/sysinv/sysinv/sysinv/helm/base.py#L317
utils.format_ceph_mon_address(mon, port) ==> utils.format_url_address(ip_address)

Changed in starlingx:
assignee: Delfino Gomes Curado Filho (dcuradof) → Felipe Sanches Zanoni (fsanches)
Revision history for this message
Felipe Sanches Zanoni (fsanches) wrote :

The Ceph pools are not being created properly because Ceph has been upgraded to Nautilus and the pool creation command has changed slightly.

There is a workaround until the fix is posted:

Before running 'system application-apply stx-openstack', run these commands to create the Ceph pools manually:

ceph osd pool create images 256 256 storage_tier_ruleset 256
ceph osd pool create cinder-volumes 512 512 storage_tier_ruleset 512
ceph osd pool create cinder.backups 256 256 storage_tier_ruleset 256
ceph osd pool create ephemeral 256 256 storage_tier_ruleset 256

I have tested this on an AIO-SX.
Now checking on a Storage setup as well.
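
A quick way to confirm the pools exist before re-applying (assuming the ceph CLI is reachable on the controller, as in the ceph-manager logs above) is:

ceph osd lspools
ceph osd pool ls detail

The output should include images, cinder-volumes, cinder.backups and ephemeral.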

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/817776

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/817776
Committed: https://opendev.org/starlingx/config/commit/c861bc6f34ee722462ba98bbde8808d23e602733
Submitter: "Zuul (22348)"
Branch: master

commit c861bc6f34ee722462ba98bbde8808d23e602733
Author: Delfino Curado <email address hidden>
Date: Fri Nov 12 10:21:43 2021 -0500

    Fix ceph pool creation during stx-openstack apply

    Previously the helm overrides for nova, cinder and glance respectively
    defined 512, 256 and 256 as the PG_NUM for their Ceph pools. Until
    Mimic, ceph would just issue a warning message if this number was
    bigger than the number OSD * 100, now Nautilus returns an error message
    and asks for expected_num_objects parameter or --yes-i-really-mean-it.
    Neither of these options is supported by openstack-helm.

    The pool creation is adjusted to take into account the number of OSDs
    available to choose the PG_NUM.
    The steps are:
    Number of OSD times 100 minus 1;
    Get the nearest power of two numbers below this result;
    Limit PG_NUM to the previous defaults, as they are already high
    numbers.
    This logic roughly results in the same values described here:
    https://docs.ceph.com/en/nautilus/rados/operations/placement-groups/#a-preselection-of-pg-num

    This was done to solve an error message which demands a
    expected_num_objects when PG_NUM is considered too high by ceph.

    Test plan:
    StarlingX builds successfully
    stx-openstack is built successfully
    stx-openstack is applied successfully and ceph pools are created
    accordingly
    It is possible to override PG_NUM on ceph pools by changing chunk_size
    value through helm overrides

    Closes-Bug: #1949360
    Signed-off-by: Delfino Curado <email address hidden>
    Change-Id: I222bee29bcaa09a95a3706c72dd21b8ed3efbe60

Changed in starlingx:
status: In Progress → Fix Released
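
The PG_NUM selection described in the commit message can be read as a simple rule of thumb; the sketch below is only an illustration of that description (the function name and exact rounding are assumptions, not the actual sysinv code):

def suggested_pg_num(num_osds, default_pg_num):
    # OSD count * 100 - 1, rounded down to the nearest power of two,
    # capped at the previous default (512 for nova, 256 for cinder/glance)
    target = num_osds * 100 - 1
    nearest_power_of_two = 1 << (target.bit_length() - 1)
    return min(nearest_power_of_two, default_pg_num)

# e.g. 1 OSD (AIO-SX): 99 -> 64; 6 OSDs: 599 -> 512, capped at the 512 default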
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/817492
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/c28605b14f38927dedcef0a543261afade1441a4
Submitter: "Zuul (22348)"
Branch: master

commit c28605b14f38927dedcef0a543261afade1441a4
Author: Delfino Curado <email address hidden>
Date: Wed Nov 10 17:51:27 2021 -0500

    Fix ceph pool creation during stx-openstack apply

    Previously the helm overrides for nova, cinder and glance respectively
    defined 512, 256 and 256 as the PG_NUM for their Ceph pools. Until
    Mimic, ceph would just issue a warning message if this number was
    bigger than the number OSD * 100, now Nautilus returns an error message
    and asks for expected_num_objects parameter or --yes-i-really-mean-it.
    Neither of these options is supported by openstack-helm.

    The pool creation is adjusted to take into account the number of OSDs
    available to choose the PG_NUM.
    The steps are:
    Number of OSD times 100 minus 1;
    Get the nearest power of two numbers below this result;
    Limit PG_NUM to the previous defaults, as they are already high
    numbers.
    This logic roughly results in the same values described here:
    https://docs.ceph.com/en/nautilus/rados/operations/placement-groups/#a-preselection-of-pg-num

    This was done to solve an error message which demands a
    expected_num_objects when PG_NUM is considered too high by ceph.

    Test plan:
    StarlingX builds successfully
    stx-openstack is built successfully
    stx-openstack is applied successfully and ceph pools are created
    accordingly
    It is possible to override PG_NUM on ceph pools by changing chunk_size
    value through helm overrides

    Closes-Bug: #1949360
    Depends-On: I222bee29bcaa09a95a3706c72dd21b8ed3efbe60
    Signed-off-by: Delfino Curado <email address hidden>
    Change-Id: Ia1416e64afcdf91b86afdf750bf5b3a1727db985

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

From a bare metal duplex configuration, I extracted this today (remove-failed):
/var/log/armada/stx-openstack-delete_2021-11-16-10-44-42.log

...
2021-11-16 10:44:59.764 77 INFO armada.handlers.tiller [-] Delete osh-openstack-glance release with disable_hooks=False, purge=True, timeout=300 flags
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller [-] Error while deleting release osh-openstack-glance: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "error beginning transaction: read tcp 172.16.166.130:37260->10.10.53.10:5432: read: connection reset by peer"
        debug_error_string = "{"created":"@1637059508.648998031","description":"Error received from peer ipv4:127.0.0.1:24134","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"error beginning transaction: read tcp 172.16.166.130:37260->10.10.53.10:5432: read: connection reset by peer","grpc_status":2}"
>
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller Traceback (most recent call last):
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 613, in uninstall_release
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller release_request, timeout, metadata=self.metadata)
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 923, in __call__
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller raise _InactiveRpcError(state)
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller details = "error beginning transaction: read tcp 172.16.166.130:37260->10.10.53.10:5432: read: connection reset by peer"
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1637059508.648998031","description":"Error received from peer ipv4:127.0.0.1:24134","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"error beginning transaction: read tcp 172.16.166.130:37260->10.10.53.10:5432: read: connection reset by peer","grpc_status":2}"
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller >
2021-11-16 10:45:08.649 77 ERROR armada.handlers.tiller
2021-11-16 10:45:08.650 77 DEBUG armada.handlers.tiller [-] Helm getting release status for release=osh-openstack-glance, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:531
2021-11-16 10:45:08.651 77 ERROR armada.handlers.tiller [-] Cannot get tiller release status.: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNO...

Revision history for this message
Felipe Sanches Zanoni (fsanches) wrote :

This seems to be a different problem from the one initially reported.

I would suggest opening a new bug to report it.
