Containers: openstack pods failed after force rebooting active controller

Bug #1816842 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Chris Friesen

Bug Description

Brief Description
-----------------
Openstack pods are left in bad states after force rebooting the active controller

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure system, apply stx-openstack application
- 'sudo reboot -f' from active controller
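For reference, the full reproduction plus the check used in the Actual Behavior section boils down to a few commands (a sketch; it assumes stx-openstack has already been uploaded and reuses the same pod filter shown below):

# Apply the stx-openstack application (assumes it was already uploaded).
system application-apply stx-openstack

# Force-reboot the active controller; this takes down etcd and every pod hosted on it.
sudo reboot -f

# Once the standby controller takes over, list any pods that are not Running/Completed.
kubectl get pods --all-namespaces | grep -v -e Completed -e Running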

Expected Behavior
------------------
- System swacts to the other controller and pods end up in good states, i.e., Running or Completed.

Actual Behavior
----------------
- After force rebooting the active controller, a number of openstack pods are stuck in Pending/Init states.

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces | grep -v -e Completed -e Running
NAMESPACE NAME READY STATUS RESTARTS AGE
openstack aodh-api-65848ddb9f-d26zf 0/1 Init:0/1 0 17h
openstack aodh-evaluator-6f4597fb4-bsrh9 0/1 Init:0/1 0 17h
openstack aodh-listener-687d674c5b-ccscl 0/1 Init:0/1 0 17h
openstack aodh-notifier-6c69c55f49-xzsk6 0/1 Init:0/1 0 17h
openstack ceilometer-notification-5c89bcc85c-ckdvd 0/1 Init:0/1 0 17h
openstack cinder-api-588778fd88-47gbj 0/1 Init:0/2 0 17h
openstack cinder-backup-7f48f88bc8-tc88r 0/1 Init:0/3 0 17h
openstack cinder-scheduler-d46f6df84-xxpqm 0/1 Init:0/2 0 17h
openstack cinder-volume-857d996cc8-nqcg4 0/1 Init:0/3 0 17h
openstack cinder-volume-usage-audit-1550627100-jjllv 0/1 Init:0/1 0 17h
openstack glance-api-7bf78b9fd9-wcdpp 0/1 Init:0/3 0 17h
openstack gnocchi-api-7d69779896-468fs 0/1 Init:0/2 0 17h
openstack gnocchi-api-7d69779896-m2bh8 0/1 Init:0/2 0 17h
openstack gnocchi-metricd-7j6xw 0/1 Init:0/2 1 18h
openstack heat-api-694ddb96b5-clcr2 0/1 Init:0/1 0 17h
openstack heat-cfn-5c74bc6f69-nt2wj 0/1 Init:0/1 0 17h
openstack heat-engine-6b9c8dcf54-4s774 0/1 Init:0/1 0 17h
openstack heat-engine-cleaner-1550627100-xfw2l 0/1 Init:0/1 0 17h
openstack horizon-cb98c5576-dppqq 0/1 Init:0/1 0 17h
openstack keystone-api-778f89467b-8qlt5 0/1 Init:0/1 0 17h
openstack keystone-api-778f89467b-w96zh 0/1 Init:0/1 0 17h
openstack neutron-server-df798b75f-hwml7 0/1 Init:0/1 0 17h
openstack nova-api-metadata-6f7db984fc-4b64k 0/1 Pending 0 16h
openstack nova-api-metadata-6f7db984fc-v79qn 0/1 Init:0/2 0 16h
openstack nova-api-osapi-66d6b6778c-gpdhq 0/1 Pending 0 16h
openstack nova-api-osapi-66d6b6778c-l52ck 0/1 Init:0/1 0 16h
openstack nova-cell-setup-6j5hr 0/1 Init:0/2 0 16h
openstack nova-compute-compute-0-75ea0372-j7z48 0/2 Init:0/5 0 16h
openstack nova-compute-compute-1-3dfb81d6-zwtsd 0/2 Init:0/5 0 16h
openstack nova-conductor-6b5cf94dfd-flsbr 0/1 Init:0/1 0 16h
openstack nova-conductor-6b5cf94dfd-vcmmt 0/1 Pending 0 16h
openstack nova-consoleauth-86d46c65cd-5k8kf 0/1 Init:0/1 0 16h
openstack nova-consoleauth-86d46c65cd-n8xg7 0/1 Pending 0 16h
openstack nova-db-init-m7jvd 0/3 Init:0/1 0 16h
openstack nova-db-sync-g75rz 0/1 Init:0/1 0 16h
openstack nova-ks-service-vjmgm 0/1 Init:0/1 0 16h
openstack nova-ks-user-6rsd5 0/1 Init:0/1 0 16h
openstack nova-novncproxy-56c4b5dfd7-mvg94 0/1 Init:0/3 0 16h
openstack nova-placement-api-57447fd476-kvp42 0/1 Init:0/1 0 16h
openstack nova-placement-api-57447fd476-xd4kd 0/1 Pending 0 16h
openstack nova-scheduler-7f56659bc4-4kpj6 0/1 Init:0/1 0 16h
openstack nova-scheduler-7f56659bc4-gpfsr 0/1 Pending 0 16h
openstack panko-api-84b564cbfc-nr777 0/1 Init:0/1 0 17h
openstack panko-events-cleaner-1550628600-q55l8 0/1 Init:0/1 0 16h
openstack placement-ks-endpoints-fczxd 0/3 Init:0/1 0 16h
openstack placement-ks-service-6jl87 0/1 Init:0/1 0 16h
openstack placement-ks-user-9kng7 0/1 Init:0/1 0 16h
openstack prometheus-mysql-exporter-5fb97d88c6-lq6qj 0/1 Init:0/1 0 17h

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
master as of 2019-02-15

Timestamp/Logs
--------------
[2019-02-20 01:40:28,952] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-02-20 01:40:28,952] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; the issue impacts the container environment. Medium priority as the issue is intermittent.

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Chris Friesen (cbf123)
status: New → Triaged
tags: added: stx.2019.05 stx.containers
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Bart Wensley (bartwensley) wrote :

The force reboot of controller-0 happened here:
[2019-02-20 01:40:28,952] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-02-20 01:40:28,952] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

It appears that all the openstack pods were restarted. I guess this is because etcd goes away when controller-0 is killed. It looks to me like the problem is that the mariadb pods did not come up properly:
mariadb-ingress-9d475c8c7-46kgs 0/1 Running 0 16h 172.16.1.77 controller-1 <none>
mariadb-ingress-9d475c8c7-7td6w 0/1 Running 0 16h 172.16.1.76 controller-1 <none>
mariadb-ingress-error-pages-6b55f4468c-nhkvv 1/1 Running 0 16h 172.16.1.78 controller-1 <none>
mariadb-server-0 0/1 Running 0 16h 172.16.0.201 controller-0 <none>
mariadb-server-1 0/1 Running 0 16h 172.16.1.89 controller-1 <none>

The garbd seems to be OK:
osh-openstack-garbd-garbd-5744f5f85-cjhrb 1/1 Running 0 18h 172.16.2.2 compute-0 <none>

The mariadb-server-0 pod seems to be stuck in a loop - the following logs are repeating forever:
2019-02-20 17:59:24,021 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12
2019-02-20 17:59:24,022 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh
2019-02-20 17:59:24,027 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-1
2019-02-20 17:59:24,027 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is ok for node mariadb-server-0
2019-02-20 17:59:27,372 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap

The mariadb-server-1 pod stops generating logs shortly after it comes up:
2019-02-20 01:50:51,516 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12
2019-02-20 01:50:51,516 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh
2019-02-20 01:50:51,521 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is ok for node mariadb-server-1
2019-02-20 01:50:51,521 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-0
2019-02-20 01:50:51,545 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap
2019-02-20 01:51:01,531 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12
2019-02-20 01:51:01,531 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh
2019-02-20 01:51:01,568 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap
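For pods that are still running, these startup logs can be pulled directly with kubectl (a sketch; pod names are taken from the listing above):

# Startup logs of the two mariadb server pods.
kubectl -n openstack logs mariadb-server-0 | tail -n 50
kubectl -n openstack logs mariadb-server-1 | tail -n 50
# garbd runs in its own pod on compute-0.
kubectl -n openstack logs osh-openstack-garbd-garbd-5744f5f85-cjhrb | tail -n 50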

The garbd pod can't seem to connect to either of the mariadb-servers:
2019-02-20 18:03:27.728 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:30.228 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.21:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:32.729 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S
2019-02-20 18...

Revision history for this message
Chris Friesen (cbf123) wrote :

The issue with mariadb-server-0 is a known issue. The problem is related to the design of the openstack-helm chart: it prioritizes data integrity over coming back into service, so when we're starting up a new mariadb cluster (i.e. there is no running server to join), as long as one of the mariadb-server pods is not updating the state configmap the other mariadb-server pods will not come up. We plan on adding a monitor to deal with this case after a certain amount of time.
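The gating can be observed by watching the state configmap that each mariadb-server pod is supposed to keep refreshing (a sketch; the configmap name is taken from the retry messages in the mariadb-server-0 logs above, and the exact layout of its entries is an assumption):

# Dump the cluster-state configmap that the start script keeps updating.
kubectl -n openstack get configmap osh-openstack-mariadb-mariadb-state -o yaml

# Watch whether both server pods keep refreshing their entries; if one stops,
# the other loops on "too old to make a decision" indefinitely until the
# planned timeout/monitor is added.
watch -n 10 'kubectl -n openstack get configmap osh-openstack-mariadb-mariadb-state -o yaml | grep -i mariadb-server'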

The issue with mariadb-server-1 is unknown at this point... we need to figure out why it stopped updating the state configmap. Are the logs above the complete logs from mariadb-server-1?

Revision history for this message
Chris Friesen (cbf123) wrote :

I'm not surprised that garbd can't connect, as neither of the mysqld daemons is actually running... we're still in the startup code in the mariadb-server pods.
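This is easy to confirm from either controller (a sketch; it assumes the mariadb image ships pgrep, otherwise ps or /proc can be checked instead):

# Returns nothing (non-zero exit) while the pods are still in their startup code.
kubectl -n openstack exec mariadb-server-0 -- pgrep -l mysqld
kubectl -n openstack exec mariadb-server-1 -- pgrep -l mysqld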

Revision history for this message
Chris Friesen (cbf123) wrote :

Looking at the logs, we see the mariadb-server-1 logs ending here:

{"log":"2019-02-20 01:50:51,545 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap\n","stream":"stderr","time":"2019-02-20T01:50:51.54616385Z"}
{"log":"2019-02-20 01:51:01,531 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12\n","stream":"stderr","time":"2019-02-20T01:51:01.531861929Z"}
{"log":"2019-02-20 01:51:01,531 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh\n","stream":"stderr","time":"2019-02-20T01:51:01.531891753Z"}
{"log":"2019-02-20 01:51:01,568 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap\n","stream":"stderr","time":"2019-02-20T01:51:01.568883913Z"}

The mariadb-server-0 logs show another story:

{"log":"2019-02-20 01:50:58,967 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is ok for node mariadb-server-0\n","stream":"stderr","time":"2019-02-20T01:50:58.967962473Z"}
{"log":"2019-02-20 01:51:08,978 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 1 times out of the required 12\n","stream":"stderr","time":"2019-02-20T01:51:08.978298167Z"}
{"log":"2019-02-20 01:51:08,978 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh\n","stream":"stderr","time":"2019-02-20T01:51:08.97833839Z"}
{"log":"2019-02-20 01:51:08,979 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap\n","stream":"stderr","time":"2019-02-20T01:51:08.979132811Z"}
{"log":"2019-02-20 01:51:21,870 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', error(104, 'Connection reset by peer'))': /api/v1/namespaces/openst
ack/configmaps/osh-openstack-mariadb-mariadb-state\n","stream":"stderr","time":"2019-02-20T01:51:21.870759801Z"}
{"log":"2019-02-20 01:51:21,870 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', error(104, 'Connection reset by peer'))': /api/v1/namespaces/openst
ack/configmaps/osh-openstack-mariadb-mariadb-state\n","stream":"stderr","time":"2019-02-20T01:51:21.871024039Z"}
{"log":"2019-02-20 01:51:21,882 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-1\n","stream":"stderr","time":"2019-02-20T01:51:21.883047989Z"}
{"log":"2019-02-20 01:51:21,883 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-0\n","stream":"stderr","time":"2019-02-20T01:51:21.88352455Z"}
{"log":"2019-02-20 01:51:31,893 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12\n","stream":"stderr","time":"2019-02-20T01:51:31.893865638Z"}
{"log":"2019-02-20 01:51:31,893 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh\n","stream":"stderr","time":"2019-02-20T01:51:31.893908681Z"}
{"log":"2019-02-20 01:51:31,898 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-1\n","stream":"stderr","time...


Revision history for this message
Chris Friesen (cbf123) wrote :

Looking at the garbd logs on compute-0, we see this message, which is odd given that controller-1 was rebooted:

{"log":"2019-02-20 01:23:20.981 WARN: Protocol violation. JOIN message sender 1.0 (mariadb-server-0.mariadb-discovery.openstack.svc.cluster.local) is not in state transfer (SYNCED). Message ignored.\n","stream":"stderr","time":"2019-02-20T01:23:20.98203438Z"}

Because of this, garbd does not think it's part of the primary cluster anymore, and so it doesn't allow mariadb-server-1 to rejoin the cluster:

{"log":"2019-02-20 01:23:22.590 WARN: Rejecting JOIN message from 0.0 (mariadb-server-1.mariadb-discovery.openstack.svc.cluster.local): new State Transfer required.\n","stream":"stderr","time":"2019-02-20T01:23:22.590158149Z"}
{"log":"2019-02-20 01:23:22.590 WARN: SYNC message from non-JOINED 0.0 (mariadb-server-1.mariadb-discovery.openstack.svc.cluster.local, PRIMARY). Ignored.\n","stream":"stderr","time":"2019-02-20T01:23:22.59054396Z"}

We're then left with garbd running but not considering either of the mariadb-server pods legitimate:

{"log":"2019-02-20 01:40:37.392 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S\n","stream":"stderr","time":"2019-02-20T01:40:37.392957049Z"}
{"log":"2019-02-20 01:40:37.392 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.21:4567 timed out, no messages seen in PT3S\n","stream":"stderr","time":"2019-02-20T01:40:37.393031422Z"}
{"log":"2019-02-20 01:40:42.393 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.21:4567 timed out, no messages seen in PT3S\n","stream":"stderr","time":"2019-02-20T01:40:42.393191002Z"}
{"log":"2019-02-20 01:40:42.393 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S\n","stream":"stderr","time":"2019-02-20T01:40:42.393237831Z"}

These messages continue until 2019-02-20 02:56:31 without any indication that garbd sees the new mariadb-server pods start up around 01:49:57 or 01:50:54. I suspect that something caused those pods to be deleted or restarted with different IP addresses while garbd was still watching for the old ones--this needs to be tested; I have no evidence yet.
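One way to check that suspicion is to compare the peer addresses garbd keeps timing out on with the addresses the mariadb-server pods currently have (a sketch; the garbd pod name is taken from the listing above):

# Current pod IPs for the mariadb servers.
kubectl -n openstack get pods -o wide | grep mariadb-server

# Peer addresses garbd is still trying to reach.
kubectl -n openstack logs osh-openstack-garbd-garbd-5744f5f85-cjhrb | grep 'connection to peer' | tail -n 5

In the listings above, the mariadb-server pods sit at 172.16.0.201 and 172.16.1.89 while garbd is still probing 172.16.0.175 and 172.16.0.21, which would be consistent with the stale-address theory.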

tags: added: stx.regression
Revision history for this message
Chris Friesen (cbf123) wrote :

While some unresolved questions about this issue remain, after making changes in this area to deal with other issues the original problem no longer seems to be hitting us.

I'm going to close the issue for now; if it shows up again we'll work it as a new problem.

Changed in starlingx:
status: Triaged → In Progress
status: In Progress → Fix Committed
Frank Miller (sensfan22)
Changed in starlingx:
status: Fix Committed → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

We have not seen this issue in recent sanity runs. Closing.

tags: removed: stx.retestneeded
liuqing (ml169807)
description: updated