Unable to execute sysinv commands after ansible configuration

Bug #1830085 reported by Yosief Gebremariam
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Tee Ngo
Milestone: (none listed)

Bug Description

Brief Description
-----------------
Attempted to install a storage system (2+2+4) with the latest build. controller-0 was configured using Ansible, and the configuration appeared to succeed. After sourcing /etc/platform/openrc, sysinv commands failed with the error "[Errno 111] Connection refused":

'source /etc/platform/openrc; system --os-endpoint-type internalURL --os-region-name RegionOne host-list' failed to execute. Output: system --os-endpoint-type internalURL --os-region-name RegionOne host-list
E [Errno 111] Connection refused

The same issue was observed in simplex, duplex, and standard system configurations.

Severity
--------
Critical - Unable to execute sysinv commands. Impossible to determine the state of controller-0 to proceed with installation.

Steps to Reproduce
------------------
1) Attempt to install a storage system using Ansible deployment
2) Boot controller-0
3) Run Ansible to configure the controller
4) source /etc/platform/openrc
5) Attempt to execute sysinv commands such as "system host-list"
6) Observe the failure: [Errno 111] Connection refused

Expected Behavior
------------------
The sysinv commands should succeed.

Actual Behavior
----------------
controller-0:~$ source /etc/platform/openrc
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
[Errno 111] Connection refused
[wrsroot@controller-0 ~(keystone_admin)]$

Reproducibility
---------------
Intermittent (no issue was observed in some standard labs).

System Configuration
--------------------
Multi-node system, dedicated storage

Branch/Pull Time/Commit
-----------------------
Build id and date: 2019-05-21_14-14-17

Last Pass
---------
Yes, it passed with the older build 2019-05-18_06-36-50.

Timestamp/Logs
--------------
In a single-node system:
2019-05-22 06:50:23.941 74534 INFO sysinv.agent.lldp.manager [-] Configured sysinv LLDP agent drivers: []
2019-05-22 06:50:23.972 74534 INFO sysinv.agent.lldp.manager [-] Loaded sysinv LLDP agent drivers: []
2019-05-22 06:50:23.973 74534 INFO sysinv.agent.lldp.manager [-] Registered sysinv LLDP agent drivers: []
2019-05-22 06:50:24.026 74534 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2019-05-22 06:50:25.028 74534 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 06:50:25.034 74534 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2019-05-22 06:50:28.038 74534 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 06:50:28.044 74534 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2019-05-22 06:50:33.047 74534 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 06:50:33.054 74534 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 7 seconds.

In a two-node system:
2019-05-22 06:54:23.446 80212 INFO sysinv.agent.lldp.manager [-] Registered sysinv LLDP agent drivers: []
2019-05-22 06:54:23.505 80212 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2019-05-22 06:54:24.506 80212 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 06:54:24.513 80212 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2019-05-22 06:54:27.514 80212 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 06:54:27.521 80212 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2019-05-22 06:54:32.523 80212 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 06:54:32.530 80212 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 7 seconds.

In a standard lab:
2019-05-22 15:00:24.238 80545 INFO sysinv.agent.lldp.manager [-] Configured sysinv LLDP agent drivers: []
2019-05-22 15:00:24.265 80545 INFO sysinv.agent.lldp.manager [-] Loaded sysinv LLDP agent drivers: []
2019-05-22 15:00:24.265 80545 INFO sysinv.agent.lldp.manager [-] Registered sysinv LLDP agent drivers: []
2019-05-22 15:00:24.310 80545 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2019-05-22 15:00:25.311 80545 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 15:00:25.316 80545 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2019-05-22 15:00:28.319 80545 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
2019-05-22 15:00:28.324 80545 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
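
The log excerpts above show connections to localhost:5672 being refused. A minimal sketch for checking by hand whether anything is listening on that port, assuming Python is available on the controller. The helper name `port_is_open` is hypothetical and is not part of sysinv:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111) as well as timeouts.
        return False


if __name__ == "__main__":
    # 5672 is the AMQP port named in the log lines above.
    print("AMQP reachable:", port_is_open("localhost", 5672))
```

A False result only says the port is closed at that moment; it does not explain why the service behind it is not running.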

Revision history for this message
Nimalini Rasa (nrasa) wrote :

It was seen in more than one lab.

description: updated
Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Did the Ansible config complete? Were there any errors?

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; critical priority as this appears to be impacting multiple systems.

description: updated
tags: added: stx.sanity
Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
tags: added: stx.2.0 stx.config
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

From Numan, this is happening in a number of labs: SM-2, WCP-113-121, WP-1-2, WCP-35-60

Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

In all systems, Ansible finished OK:
[2019-05-22 07:00:01,466] 4441 INFO MainThread install_helper.analyze_ansible_output:: Ansible result line = localhost : ok=193 changed=123 unreachable=0 failed=0
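
For reference, the counters in a recap line like the one above can be pulled out with a short script. This is a sketch only; `parse_recap` and `recap_is_clean` are hypothetical helper names, not functions from install_helper or Ansible. As this bug shows, a clean recap does not by itself prove that the sysinv API is reachable.

```python
import re


def parse_recap(line):
    """Extract the key=count pairs (ok, changed, ...) from a recap line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def recap_is_clean(counts):
    """True when the recap reports no failed and no unreachable tasks."""
    return counts.get("failed", 0) == 0 and counts.get("unreachable", 0) == 0


if __name__ == "__main__":
    sample = "localhost : ok=193 changed=123 unreachable=0 failed=0"
    counts = parse_recap(sample)
    print(counts, recap_is_clean(counts))
```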

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661392

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

The issue cannot be reproduced in virtual environments. We used three different ISOs:

BUILD_ID="20190523T013000Z"
BUILD_ID="20190524T013000Z"
BUILD_ID="20190525T013000Z"

We can execute system commands.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/661392
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=aa3ddafdddf993aaa328a336692b0d706b32e247
Submitter: Zuul
Branch: master

commit aa3ddafdddf993aaa328a336692b0d706b32e247
Author: Tee Ngo <email address hidden>
Date: Fri May 24 16:05:37 2019 -0400

    Serialize resources intensive operations

    Occasionally in some labs, sysinv api is inaccessible following
    Ansible bootstrap. This is due to incomplete sysinv endpoints update
    in keystone database even though puppet returned no errors and
    .config_applied flag is generated.

    This commit:
      a) serializes the two resources intensive operations
           - service endpoints reconfiguration
           - Kubernetes+Helm bringup, images download
         to avoid the possibility of some database transactions being timed
         out. Furthermore, checking for endpoint reconfig specific flag is
         a more reliable method than relying on the general .config_applied
         flag.
      b) verifies that the controller is online before exiting the playbook

    Test:
      Verify that installation succeeds in labs that exhibit the issue.

    Closes-Bug: 1830085
    Change-Id: Ifb6e77f08a64a80a854faf15c58200e2ba49824a
    Signed-off-by: Tee Ngo <email address hidden>
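
The ordering described in the commit message (finish endpoint reconfiguration, then start the second heavy stage, then confirm the controller is online before exiting) can be sketched as follows. This is an illustration in Python only, not the playbook change itself, which is in the review linked above. `wait_until`, `bootstrap_tail`, the flag path, and the two callables are all hypothetical names.

```python
import os
import time


def wait_until(check, timeout, interval=1.0):
    """Poll check() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def bootstrap_tail(flag_path, start_second_stage, controller_online, timeout=600):
    """Run the two heavy stages one after the other, then verify the result.

    flag_path: hypothetical file written when endpoint reconfiguration is done.
    start_second_stage: hypothetical callable that starts the Kubernetes bringup.
    controller_online: hypothetical callable returning True once the host is online.
    """
    # Stage 1 must be complete (its specific flag exists) before stage 2 starts.
    if not wait_until(lambda: os.path.exists(flag_path), timeout):
        raise RuntimeError("endpoint reconfiguration flag never appeared")
    start_second_stage()
    # Do not report success until the controller is actually online.
    if not wait_until(controller_online, timeout):
        raise RuntimeError("controller did not come online")
```

Waiting on a stage-specific flag rather than a general one mirrors the reasoning given in the commit message.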

Changed in starlingx:
status: In Progress → Fix Released
Numan Waheed (nwaheed)
tags: removed: stx.retestneeded