Debian: using VMs, bootstrap fails sysinv-api reload

Bug #1979717 reported by Dan Voiculeasa
Affects: StarlingX
Status: In Progress
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

Brief Description
-----------------
During nightly sanity runs, sysinv-api is sometimes slow to respond during bootstrap. The sanity run deploys AIO-SX.
The same behavior was observed during AIO-DX integration, but that work was paused for a while.

Investigation:
---------------
This is the check that fails: https://opendev.org/starlingx/config/src/commit/cc3cdbd6474e574a20bc71b35cb6875e359875aa/sysinv/sysinv/sysinv/scripts/sysinv-api#L236

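Logging in to an affected system after the failure and running the same request by hand succeeds: curl http://controller:6385/v1
This indicates that the service does come up, only slowly, so the failure is a slow system/optimization issue rather than a crash. Increasing the timeout should let bootstrap pass.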
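
To illustrate why a longer retry budget helps, here is a minimal Python sketch of a bounded readiness poll. It is not the code from the sysinv-api OCF script (which is a shell script); the names `probe_http` and `wait_for_api`, the one-second delay, and the HTTP timeout are assumptions made for the example. Only the URL and the retry counts (15 before, 60 after) come from this report.

```python
import time
import urllib.error
import urllib.request


def probe_http(url, timeout=1.0):
    """Return True if the URL answers with a 2xx/3xx response.

    Hypothetical helper: any error or timeout counts as "not ready".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def wait_for_api(probe, retries=15, delay=1.0, sleep=time.sleep):
    """Call probe() up to `retries` times, pausing `delay` seconds between tries.

    Returns the 1-based attempt number that succeeded, or None if the
    retry budget ran out. `sleep` is injectable so the loop can be
    exercised without actually waiting.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(delay)
    return None
```

On a live system this could be called as `wait_for_api(lambda: probe_http("http://controller:6385/v1"), retries=60)`. A service that needs about 20 seconds to answer exhausts a 15-attempt budget but is caught by a 60-attempt one, which matches the manual curl succeeding after the automated check had already given up.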

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
run bootstrap

Expected Behavior
------------------
bootstrap passes

Actual Behavior
----------------
bootstrap fails

Reproducibility
---------------
Intermittent, roughly 50% of runs.

System Configuration
--------------------
AIO-SX IPv4 on sanity
AIO-DX IPv4 on developer testing
Possibly other configurations as well, but those were not tested.

Branch/Pull Time/Commit
-----------------------
22, 23, and 24 June 2022

Last Pass
---------
?

Timestamp/Logs
--------------
2022-06-23 04:11:54,968 p=2357 u=sysadmin n=ansible | TASK [bootstrap/persist-config : Restart sysinv-agent and sysinv-api to pick up sysinv.conf update] ***
2022-06-23 04:11:54,969 p=2357 u=sysadmin n=ansible | Thursday 23 June 2022 04:11:54 +0000 (0:00:00.049) 0:28:07.672 *********
2022-06-23 04:11:56,299 p=2357 u=sysadmin n=ansible | changed: [localhost] => (item=/etc/init.d/sysinv-agent restart)
2022-06-23 04:12:17,023 p=2357 u=sysadmin n=ansible | failed: [localhost] (item=/usr/lib/ocf/resource.d/platform/sysinv-api reload) => changed=true
  ansible_loop_var: item
  cmd:
  - /usr/lib/ocf/resource.d/platform/sysinv-api
  - reload
  delta: '0:00:19.959658'
  end: '2022-06-23 04:12:16.883278'
  item: /usr/lib/ocf/resource.d/platform/sysinv-api reload
  msg: non-zero return code
  rc: 1
  start: '2022-06-23 04:11:56.923620'
  stderr: |-
    INFO: sysinv-api:status Sysinv API (sysinv-api) still hasn't stopped yet. Waiting ...
    WARNING: /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: line 435: kill: (44446) - No such process
    INFO: sysinv-api:Old PID file found, but Sysinv API (sysinv-api) is not running
    INFO: sysinv-api:Sysinv API (sysinv-api) is not running
    INFO: sysinv-api:status Sysinv API (sysinv-api) stopped.
    INFO: sysinv-api:start running with pid 73018
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #1
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #2
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #3
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #4
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #5
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #6
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #7
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #8
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #9
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #10
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #11
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #12
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #13
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #14
    ERROR: Unable to run system show, trying direct request on sysinv-api URL (sysinv-api)
    ERROR: Unable to communicate with the System Inventory Service (sysinv-api)
    INFO: Retrying to connect to the System Inventory Service (sysinv-api), attempt #15
    ERROR: Inventory Service (sysinv-api) failed to start (rc=1)
    ERROR: System Inventory (sysinv-api) process failed to restart (rc=1)
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2022-06-23 04:12:17,034 p=2357 u=sysadmin n=ansible | PLAY RECAP *********************************************************************
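
The `start`, `end`, and `delta` fields in the failed task output above bound how long the whole reload took, including stopping the old process, starting the new one, and all 15 retries. The following Python sketch checks that arithmetic using only the timestamps from the log; the function name `reload_window` is made up for the example.

```python
from datetime import datetime

# Timestamp layout used by the Ansible `start` and `end` fields.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def reload_window(start, end, attempts):
    """Return (elapsed_seconds, upper_bound_seconds_per_attempt).

    The per-attempt figure is only an upper bound, because the elapsed
    time also covers stopping and restarting the service.
    """
    elapsed = (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()
    return elapsed, elapsed / attempts


elapsed, per_attempt = reload_window(
    "2022-06-23 04:11:56.923620",  # `start` from the log
    "2022-06-23 04:12:16.883278",  # `end` from the log
    15,                            # retry attempts seen in stderr
)
print(f"elapsed: {elapsed:.6f} s, at most {per_attempt:.2f} s per attempt")
```

The elapsed value agrees with the logged `delta` of 0:00:19.959658, so sysinv-api had less than 20 seconds in total to start answering before the reload was declared failed.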

Test Activity
-------------
Sanity and Developer Testing

Workaround
----------
N/A

Tags: stx.debian
Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/847526

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: Dan Voiculeasa (dvoicule) → nobody
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/847526
Committed: https://opendev.org/starlingx/config/commit/648e5bcbba982a5e7720703cef2521d0d7df7a90
Submitter: "Zuul (22348)"
Branch: master

commit 648e5bcbba982a5e7720703cef2521d0d7df7a90
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jun 23 20:24:08 2022 +0300

    sysinv-api: Temporarily increase timeout to let bootstrap pass

    On Debian we observe that bootstrap fails intermittently.
    From the data so far, this is observed only on VMs.

    Couldn't reproduce the error locally, but from the investigation
    deduce it is just a performance degradation.
    Investigation notes to prove it is a performance degradation and not a
    service crash are uploaded to the LP.
    Temporarily increase the retries for waiting for sysinv-api to come up.
    Jump from 15 seconds to 60 seconds to be defensive.
    This will allow sanities to pass and integration effort to continue.

    Tests on AIO-SX on Debian:
    PASS: bootstrap

    Partial-Bug: 1979717
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10b00eab467303771d25cc5c79760005c3966446
