Distributed Cloud: System Controller services not enabled after first unlock

Bug #1863362 reported by Yosief Gebremariam
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Yosief Gebremariam

Bug Description

Brief Description
-----------------
1) Booted the DC System Controller
2) Ansible bootstrap and controller configuration completed successfully
3) After the controller-0 unlock, the services did not start and basic CLI commands could not be run:
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-0:~$ kubectl get pods --all-namespaces
The connection to the server [aefd::1]:6443 was refused - did you specify the right host or port?
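
As a quick triage on the unlocked controller, a few generic checks can show whether the platform services and kube-apiserver ever came up (a sketch only; the availability of sm-dump on this node is an assumption):

controller-0:~$ systemctl --failed               # list any systemd units that failed to start
controller-0:~$ sudo sm-dump                     # service management state, if sm-dump is installed on this node
controller-0:~$ ss -ltn | grep 6443              # check whether the kube-apiserver port is listening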

Severity
--------
Critical - cannot install the system controller.

Steps to Reproduce
------------------
1) Boot the DC System Controller
2) Run Ansible bootstrap and configure controller-0
3) Unlock controller-0
4) controller-0 unlocks, but the services fail to start

Expected Behavior
------------------
The System Controller is installed and unlocked with all services up and enabled.

Actual Behavior
----------------
The System Controller's controller-0 is unlocked, but the services are not enabled. As a result, the system status cannot be queried.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
IPv6 distributed cloud - System Controller
lab-name: WCP-90_91

Branch/Pull Time/Commit
-----------------------
2020-02-14_04-10-00

Last Pass
---------
2020-02-06_00-10-00

Timestamp/Logs
--------------
sysinv 2020-02-14 15:23:50.989 73744 INFO sysinv.agent.manager [-] interface ens801f0 enabled to receive LLDP PDUs
sysinv 2020-02-14 15:23:54.531 83323 INFO oslo_service.service [-] Caught SIGTERM, stopping children
sysinv 2020-02-14 15:23:54.535 83323 INFO oslo.service.wsgi [-] Stopping WSGI server.
sysinv 2020-02-14 15:23:54.536 83458 INFO oslo.service.wsgi [-] Stopping WSGI server.
sysinv 2020-02-14 15:23:54.536 83459 INFO oslo.service.wsgi [-] Stopping WSGI server.
sysinv 2020-02-14 15:23:54.535 83323 INFO oslo_service.service [-] Waiting on 2 children to exit
sysinv 2020-02-14 15:23:54.622 83323 INFO oslo_service.service [-] Child 83458 exited with status 0
sysinv 2020-02-14 15:23:54.626 83323 INFO oslo_service.service [-] Child 83459 exited with status 0
sysinv 2020-02-14 15:23:55.837 80741 INFO sysinv.openstack.common.service [-] Caught SIGTERM, exiting
sysinv 2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed: IOError: Socket closed
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common Traceback (most recent call last):
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 564, in ensure
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common return method(*args, **kwargs)
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 644, in _consume
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common return connection.drain_events(**kwargs)
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common return self.blocking_read(timeout)
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 468, in blocking_read
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common frame = self.transport.read_frame()
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 269, in read_frame
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common frame_header = read(7, True)
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 417, in _read
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common raise IOError('Socket closed')
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common IOError: Socket closed
2020-02-14 15:23:58.030 73744 ERROR sysinv.openstack.common.rpc.common
sysinv 2020-02-14 15:23:58.034 73744 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
sysinv 2020-02-14 15:23:58.040 73744 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: error: [Errno 111] ECONNREFUSED
sysinv 2020-02-14 15:23:59.041 73744 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
sysinv 2020-02-14 15:23:59.046 73744 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.: error: [Errno 111] ECONNREFUSED
sysinv 2020-02-14 15:24:01.859 73744 INFO sysinv.openstack.common.service [-] Caught SIGTERM, exiting
sysinv 2020-02-14 15:29:49.366 10095 INFO sysinv.agent.lldp.manager [-] Configured sysinv LLDP agent drivers: []
sysinv 2020-02-14 15:29:49.367 10095 INFO sysinv.agent.lldp.manager [-] Loaded sysinv LLDP agent drivers: []
sysinv 2020-02-14 15:29:49.367 10095 INFO sysinv.agent.lldp.manager [-] Registered sysinv LLDP agent drivers: []
sysinv 2020-02-14 15:29:49.415 10095 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: error: [Errno 111] ECONNREFUSED
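
The repeated '[Errno 111] ECONNREFUSED' on localhost:5672 indicates nothing is listening on the local AMQP port, i.e. rabbitmq never came up on the controller. A quick check (a sketch; whether rabbitmq is managed by sm on the system controller is an assumption):

controller-0:~$ ss -ltn | grep 5672              # no output means no AMQP listener on this host
controller-0:~$ sudo sm-dump | grep -i rabbit    # assumes rabbitmq is an sm-managed service on this node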

Test Activity
-------------
Sanity

Workaround
----------
None

Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

A similar issue was also observed with the 2020-02-13_00-10-00 build.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / critical - issue prevents systems from coming up

Changed in starlingx:
importance: Undecided → Critical
Ghada Khalil (gkhalil)
tags: added: stx.4.0 stx.config stx.sanity
Revision history for this message
Zhang Kunpeng (zhangkunpeng) wrote :

Can you share the 'LANG' environment variable of your console with the following command?

$ echo $LANG

If the output is not 'en_US.UTF-8', there are problems when running the Ansible bootstrap,
and you will not get any error logs because some tasks are skipped.

You can try again after running 'export LANG=en_US.UTF-8' before the bootstrap.
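
A minimal sketch of the locale check and override described above (standard shell only, nothing StarlingX-specific is assumed):

controller-0:~$ echo $LANG                       # should print en_US.UTF-8
controller-0:~$ locale                           # show the full locale settings for this session
controller-0:~$ export LANG=en_US.UTF-8          # override for this shell before running the bootstrap playbook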

hutianhao27 (hutianhao)
Changed in starlingx:
assignee: nobody → hutianhao27 (hutianhao)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/708603

Changed in starlingx:
status: New → In Progress
Revision history for this message
Yang Liu (yliu12) wrote :

Just a note that this issue seems to be intermittent after all.
We tried to reinstall the system twice with similar loads and did not see this issue again.

We no longer have the environment to check which LANG was used when the failure was encountered.
Currently, on a successfully installed and configured system, we are seeing the following.

controller-0:~$ echo $LANG
en_US.UTF-8

Revision history for this message
Ghada Khalil (gkhalil) wrote :

I understand that the review above is dealing with a valid issue, but I doubt it has the same root cause as the original issue reported here. The WR labs are all set up in a similar way, and it's highly unlikely that the language settings would differ on the same lab between multiple executions.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Zhang Kunpeng, Is there any information in this launchpad that leads you to believe this issue is due to a different language setting? I see that a full collect log is not even attached to this LP.

Revision history for this message
Zhang Kunpeng (zhangkunpeng) wrote :

@Ghada Khalil Recently we deployed the distributed cloud central cloud with StarlingX 3.0; all deployments failed with the same behavior described in this launchpad, and the language setting was the cause of the failure. So I suggest verifying the language setting.

In addition, drbd-pgsql was not configured successfully because of the language setting; execute the command 'lsblk' to check whether the directory /var/lib/postgresql was mounted (see the sketch below).
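
A quick way to run that check (a sketch; the mount point follows the comment above):

controller-0:~$ lsblk                            # DRBD-backed devices and their mount points appear in the tree
controller-0:~$ findmnt /var/lib/postgresql      # prints a mount entry only if the directory is mounted
controller-0:~$ mount | grep postgresql          # empty output means the filesystem was never mounted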

Or please attach the full logs for further analysis.

hutianhao27 (hutianhao)
Changed in starlingx:
assignee: hutianhao27 (hutianhao) → nobody
hutianhao27 (hutianhao)
Changed in starlingx:
status: In Progress → New
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Kunpeng, I agree that the language problem you are mentioning is a real issue. I'm just not sure that it is this issue. Unless the reporter attaches the full logs, we can't be sure.

I see that a specific launchpad was opened for the language issue: https://bugs.launchpad.net/starlingx/+bug/1859951

I think this is better anyway since we don't have the logs to confirm the occurrences reported here.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Incomplete until a recurrence and/or logs are provided.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Yosief Gebremariam (ygebrema)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

No information on reproduction since the original occurrence. Closing as Not Reproducible.

Changed in starlingx:
status: Incomplete → Invalid