Alarm 750.002 Application Apply Failure for cert-manager

Bug #1952400 reported by Alexandru Dimofte
Affects: StarlingX
Importance: Critical
Assigned to: Frank Miller

Bug Description

Brief Description
-----------------
Alarm 750.002 Application Apply Failure for cert-manager. Cert-manager apply-failed.
In Jenkins I observed this error: type object 'ilvg' has no attribute 'isdigit'

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Try to install StarlingX 20211126T031915Z

Expected Behavior
------------------
cert-manager should be applied successfully

Actual Behavior
----------------
cert-manager apply-failed

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
all configurations

Branch/Pull Time/Commit
-----------------------
20211126T031915Z

Last Pass
---------
20211124T041933Z

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Alexandru Dimofte (adimofte) wrote :

Ghada Khalil (gkhalil) wrote :

screening: stx.6.0 / critical - issue seen in the last 2 sanity runs and is resulting in a red sanity

Changed in starlingx:
assignee: nobody → Sabeel Ansari (sansariwr)
importance: Undecided → Critical
status: New → Triaged
tags: added: stx.6.0 stx.apps stx.containers

Ghada Khalil (gkhalil) wrote :

Assigning to Sabeel to start the investigation. The error "type object 'ilvg' has no attribute 'isdigit'" seems to be related to logical volume groups, so perhaps this has something to do with the storage setup.

Frank Miller (sensfan22) wrote :

Alexandru - I cannot find any reference in the collect logs for the isdigit error. Please attach the jenkins output you are referencing.

Alexandru Dimofte (adimofte) wrote :

Frank Miller (sensfan22) wrote :

The robot framework log is not from the same run as the collect log so we cannot line up events between the two. Please attach the robot framework log that aligns with the collect logs (or attach new collect logs that align with the robot logs).

Alexandru Dimofte (adimofte) wrote :

Here are the collected logs from the same build. The debug.log from robot framework was attached a few comments above.

Alexandru Dimofte (adimofte) wrote :

Hi Frank,
I believe the ilvg error comes from this commit:
./cgcs-root/stx/config 64a1944a66a04e09e4515acba0f072accb51af06 2021-11-25 19:19:03 +0000 Gerrit Code Review <email address hidden> Merge "Cleanup pylint error: redefined-outer-name"

Could somebody with more experience check it? Thanks!

Thiago Paiva Brito (outbrito) wrote :

It seems the parameter was renamed in the function signature but not where it is used in the body. Easy fix, but I wonder why the unit tests didn't catch it...

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820180

Changed in starlingx:
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/820180
Committed: https://opendev.org/starlingx/config/commit/f5d836d161f11aa451a66e2c5f01821411932e09
Submitter: "Zuul (22348)"
Branch: master

commit f5d836d161f11aa451a66e2c5f01821411932e09
Author: Thiago Brito <email address hidden>
Date: Thu Dec 2 11:50:45 2021 -0300

    Fixing variable renaming

    On commit [1] the 3rd parameter for _find_ilvg() was renamed to fix a
    lint issue, but the usage of that parameter wasn't correctly renamed and
    a problem arose since there is a class in the same module with that
    name (that was being shadowed before). This commit fixes that and adds
    some unit tests for this method so we don't merge this kind of error in
    the future.

    [1] https://opendev.org/starlingx/config/commit/5923349485da0ea32042db1c60b0303f7d12e36d

    Partial-Bug: #1952400
    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I328c3f5976f1903e621ff170a9da1a929983af18
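
A minimal sketch of the failure mode this commit message describes, assuming a module that defines a class and a helper whose parameter used to share the class's name. The function names, the `ilvg_id` parameter, and the `records` mapping are hypothetical stand-ins, not the actual sysinv code; only the class name `ilvg` and the error text come from this report.

```python
class ilvg:
    """Hypothetical stand-in for a class defined in the same module."""


def find_before_rename(records, ilvg):
    # The parameter `ilvg` shadows the class above (pylint: redefined-outer-name),
    # so `ilvg.isdigit()` is an ordinary str method call and works.
    if ilvg.isdigit():
        return records.get(int(ilvg))
    return None


def find_after_incomplete_rename(records, ilvg_id):
    # The parameter was renamed, but the body still says `ilvg`.
    # That name now resolves to the class, which has no `isdigit` attribute.
    if ilvg.isdigit():
        return records.get(int(ilvg_id))
    return None


def find_fixed(records, ilvg_id):
    # Every use of the old name inside the body is renamed as well.
    if ilvg_id.isdigit():
        return records.get(int(ilvg_id))
    return None


def error_message(func, *args):
    """Return the AttributeError text raised by func(*args), or None."""
    try:
        func(*args)
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `error_message(find_after_incomplete_rename, {1: "vg"}, "1")` yields the same text seen in Jenkins, because attribute lookup on a class object reports it as a "type object". The lint warning was harmless in itself; the regression came from renaming the signature without renaming the uses.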

Al Bailey (albailey1974) wrote :

ipv.py needs a similar fix.
I am asking Jia to take a look.

Al Bailey (albailey1974) wrote :

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820214

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820216

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Jia Hu <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820214

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/820216
Committed: https://opendev.org/starlingx/config/commit/f9a7febd46ebd8a28e26413f8d40338bffc99a8f
Submitter: "Zuul (22348)"
Branch: master

commit f9a7febd46ebd8a28e26413f8d40338bffc99a8f
Author: Thiago Brito <email address hidden>
Date: Thu Dec 2 14:56:03 2021 -0300

    Fixing variable renaming on _find_ipv

    On commit [1] the 3rd parameter for _find_ipv() was renamed to fix a
    lint issue, but the usage of that parameter wasn't correctly renamed and
    a problem arose since there is a class in the same module with that
    name (that was being shadowed before). This commit fixes that and adds
    some unit tests for this method so we don't merge this kind of error in
    the future.

    [1]
    https://opendev.org/starlingx/config/commit/5923349485da0ea32042db1c60b0303f7d12e36d

    Partial-Bug: #1952400

    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I0a8660ec530a859a6d108e5d85b808907c259614
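
A sketch of the kind of regression test the commit message mentions, assuming a simplified lookup helper. `find_ipv`, `ipv_id`, the record layout, and `run_suite` are all hypothetical and do not reproduce the tests actually merged. The point is only that any test that calls the helper with a numeric identifier executes the `isdigit()` line and would have failed on the incomplete rename.

```python
import unittest


class ipv:
    """Hypothetical stand-in for a class that shares the old parameter name."""


def find_ipv(records, ipv_id):
    # Fixed form: the body uses the renamed parameter, not the class `ipv`.
    if ipv_id.isdigit():
        return records.get(int(ipv_id))
    # In this sketch, non-numeric identifiers are matched against a "uuid" field.
    for record in records.values():
        if record["uuid"] == ipv_id:
            return record
    return None


class FindIpvTest(unittest.TestCase):
    RECORDS = {7: {"uuid": "abc-123", "name": "pv0"}}

    def test_numeric_id(self):
        # Exercises the isdigit() branch that broke after the rename.
        self.assertEqual(find_ipv(self.RECORDS, "7")["name"], "pv0")

    def test_uuid(self):
        self.assertEqual(find_ipv(self.RECORDS, "abc-123")["name"], "pv0")

    def test_unknown(self):
        self.assertIsNone(find_ipv(self.RECORDS, "missing"))


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FindIpvTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

`run_suite().wasSuccessful()` reports whether all three cases pass. With the buggy body (`ipv.isdigit()`), every case would error out with an AttributeError, since the first line of the helper runs on each call.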

Frank Miller (sensfan22) wrote :

Analysis from Bob Church:

Long story short:
• This virtual system is suffering from I/O overload probably due to the new 5.10 kernel and I/O scheduler issues (hopefully Gerry’s update will help this: https://review.opendev.org/c/starlingx/config-files/+/820263 )
• When controller-1 is unlocked, the DRBD devices start syncing, etcd immediately starts to see slow response times, and SM starts to see audit misses and kills processes (etcd and sysinv) in the middle of the application-apply
• SM thrashes between the controllers for approx. 5 minutes before stabilizing back on controller-0
• Etcd and sysinv are restarted
• Once sysinv is restarted, it resets the state of the app from applying to apply-failed.
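
The last bullet is what turns a process restart into the 750.002 alarm: an apply that was interrupted is reported as failed. A minimal sketch of that restart-time reset, assuming a plain mapping of application name to state. The function and constant names are hypothetical and are not taken from sysinv; only the two state strings come from this report.

```python
APPLYING = "applying"
APPLY_FAILED = "apply-failed"


def reset_interrupted_apps(app_states):
    """Return a copy of app_states where every app left in 'applying'
    is marked 'apply-failed'; all other states are kept unchanged."""
    return {
        name: APPLY_FAILED if state == APPLYING else state
        for name, state in app_states.items()
    }
```

For example, `reset_interrupted_apps({"cert-manager": "applying"})` returns `{"cert-manager": "apply-failed"}`, which matches the state observed in the sanity run.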

Timeline added into next comment.

Frank Miller (sensfan22) wrote :

Timeline/log analysis from Bob Church:

# Controller-1 unlocked
2021-11-26T06:47:25.662 | c0 255 | node-scn | controller-1 | locked-disabled | unlocked-disabled | customer action

# Controller-1 boots and DRBD is initialized
2021-11-26T06:47:59.349 controller-1 kernel: info [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.10.74-200.1803.tis.el7.x86_64 root=UUID=486cbd44-0343-4d37-9818-0104944aef9f ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=512M biosdevname=0 console=ttyS0,115200 intel_iommu=off usbcore.autosuspend=-1 loop.max_part=15 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 softdog.soft_panic=1 user_namespace.enable=1 nopti nospectre_v2 nospectre_v1
2021-11-26T06:50:48.921 controller-1 kernel: info [ 174.698270] drbd: initialized. Version: 8.4.11 (api:1/proto:86-101)

# Controller-1 services start
2021-11-26T06:52:23.195 | c0 256 | neighbor-scn | controller-1 | down | exchange-start | hello received for controller
2021-11-26T06:52:27.235 | c0 257 | neighbor-scn | controller-1 | exchange-start | exchange | exchange-start received for controller
2021-11-26T06:52:27.248 | c0 258 | neighbor-scn | controller-1 | exchange | full | exchange complete for controller
2021-11-26T06:52:30.028 | c1 26 | service-group-scn | controller-services | initial | go-standby |
2021-11-26T06:52:30.953 | c1 131 | service-group-scn | controller-services | go-standby | standby |

# Etcd starts syncing
2021-11-26T06:52:31.344 controller-1 kernel: info [ 277.121013] drbd drbd-etcd: conn( StandAlone -> Unconnected )
2021-11-26T06:52:31.344 controller-1 kernel: info [ 277.121031] drbd drbd-etcd: Starting receiver thread (from drbd_w_drbd-etc [93582])
2021-11-26T06:52:31.344 controller-1 kernel: info [ 277.121076] drbd drbd-etcd: receiver (re)started
2021-11-26T06:52:31.344 controller-1 kernel: info [ 277.121085] drbd drbd-etcd: conn( Unconnected -> WFConnection )
2021-11-26T06:52:31.850 controller-1 kernel: info [ 277.626578] drbd drbd-etcd: Handshake successful: Agreed network protocol version 101
2021-11-26T06:52:31.850 controller-1 kernel: info [ 277.626584] drbd drbd-etcd: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
2021-11-26T06:52:31.850 controller-1 kernel: info [ 277.627750] drbd drbd-etcd: Peer authenticated using 20 bytes HMAC
2021-11-26T06:52:31.850 controller-1 kernel: info [ 277.627898] drbd drbd-etcd: conn( WFConnection -> WFReportParams )
2021-11-26T06:52:31.850 controller-1 kernel: info [ 277.627904] drbd drbd-etcd: Starting ack_recv thread (from drbd_r_drbd-etc [93702])
2021-11-26T06:52:31.851 controller-0 kernel: info [ 3539.561294] drbd drbd-etcd: Handshake successful: Agreed ne...
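
To make the spacing of these events easier to read, the timestamps quoted above can be converted to offsets from the unlock. This is a small sketch; the event labels are paraphrases added here, and only the timestamps are copied from the log excerpt.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"

# (label, timestamp) pairs; timestamps copied from the excerpt above.
EVENTS = [
    ("controller-1 unlocked", "2021-11-26T06:47:25.662"),
    ("controller-1 kernel boot logged", "2021-11-26T06:47:59.349"),
    ("drbd initialized", "2021-11-26T06:50:48.921"),
    ("controller-services standby on c1", "2021-11-26T06:52:30.953"),
    ("drbd-etcd starts connecting", "2021-11-26T06:52:31.344"),
]


def offsets_from_first(events):
    """Return (label, seconds since the first event) for each event."""
    start = datetime.strptime(events[0][1], FMT)
    return [
        (label, round((datetime.strptime(stamp, FMT) - start).total_seconds(), 3))
        for label, stamp in events
    ]
```

With these inputs, DRBD initializes about 203 seconds after the unlock, and the drbd-etcd connection starts about 306 seconds after it, less than half a second after controller-services reaches standby on controller-1.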

Ghada Khalil (gkhalil) wrote :

@Alex,
please let us know if you are still seeing this issue with the http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20211203T042048Z build. It includes the change that Frank Miller referenced above: https://review.opendev.org/c/starlingx/config-files/+/820263

Changed in starlingx:
assignee: Sabeel Ansari (sansariwr) → Frank Miller (sensfan22)

Alexandru Dimofte (adimofte) wrote :

Alexandru Dimofte (adimofte) wrote :

This bug can be closed. Thanks!

Frank Miller (sensfan22)
Changed in starlingx:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to root (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/root/+/820863

OpenStack Infra (hudson-openstack) wrote : Fix proposed to clients (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/clients/+/820864

OpenStack Infra (hudson-openstack) wrote : Fix merged to root (master)

Reviewed: https://review.opendev.org/c/starlingx/root/+/820863
Committed: https://opendev.org/starlingx/root/commit/e5c91e245fda66e5b213113acd50002e0317693f
Submitter: "Zuul (22348)"
Branch: master

commit e5c91e245fda66e5b213113acd50002e0317693f
Author: Thiago Miranda <email address hidden>
Date: Tue Dec 7 08:38:08 2021 -0500

    Update stx-platformclients tag to stx.6.0-v1.5.3

    This commit updates the image with the updated clients.

    Partial-Bug: #1952400

    Signed-off-by: Thiago Miranda <email address hidden>
    Change-Id: If2ff9c163adf8bb0092736cd74695f12506d5fae

OpenStack Infra (hudson-openstack) wrote : Fix merged to clients (master)

Reviewed: https://review.opendev.org/c/starlingx/clients/+/820864
Committed: https://opendev.org/starlingx/clients/commit/34e46bcdd9182a4e8f313d578f8321aaa14b73b1
Submitter: "Zuul (22348)"
Branch: master

commit 34e46bcdd9182a4e8f313d578f8321aaa14b73b1
Author: Thiago Miranda <email address hidden>
Date: Tue Dec 7 08:47:01 2021 -0500

    Update stx-platformclients image to version 1.5.3

    Updated image with the new fixes since the last build

    Partial-Bug: #1952400
    Depends-On: https://review.opendev.org/c/starlingx/root/+/820863

    Signed-off-by: Thiago Miranda <email address hidden>
    Change-Id: I7283123fc3f6b1c3c5d0e4b590fc96bb667492c9
