Comment 2 for bug 1816842

Bart Wensley (bartwensley) wrote:

The force reboot of controller-0 happened here:
[2019-02-20 01:40:28,952] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-02-20 01:40:28,952] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

It appears that all of the OpenStack pods were restarted; I suspect this is because etcd goes away when controller-0 is killed. The problem looks to be that the mariadb pods did not come back up properly:
mariadb-ingress-9d475c8c7-46kgs 0/1 Running 0 16h 172.16.1.77 controller-1 <none>
mariadb-ingress-9d475c8c7-7td6w 0/1 Running 0 16h 172.16.1.76 controller-1 <none>
mariadb-ingress-error-pages-6b55f4468c-nhkvv 1/1 Running 0 16h 172.16.1.78 controller-1 <none>
mariadb-server-0 0/1 Running 0 16h 172.16.0.201 controller-0 <none>
mariadb-server-1 0/1 Running 0 16h 172.16.1.89 controller-1 <none>
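
The READY column showing 0/1 for a Running pod means the container is up but not passing its readiness check. For reference, a rough way to pull the same view programmatically (a sketch only; it assumes the python kubernetes client, a kubeconfig with access to the cluster, and the openstack-helm convention of an "openstack" namespace with an application=mariadb label):

# Sketch: list the mariadb pods and whether their containers are ready,
# i.e. the same information as the READY column above.
# Assumptions: python "kubernetes" client installed, kubeconfig available,
# pods in the "openstack" namespace labelled application=mariadb.
from kubernetes import client, config

config.load_kube_config()   # use load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("openstack", label_selector="application=mariadb")
for pod in pods.items:
    statuses = pod.status.container_statuses or []
    ready = all(cs.ready for cs in statuses)
    print(f"{pod.metadata.name}: phase={pod.status.phase} ready={'yes' if ready else 'NO'}")

"kubectl describe pod mariadb-server-0" shows the readiness probe failures in the pod events as well.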

The garbd seems to be OK:
osh-openstack-garbd-garbd-5744f5f85-cjhrb 1/1 Running 0 18h 172.16.2.2 compute-0 <none>

The mariadb-server-0 pod seems to be stuck in a loop - the following logs are repeating forever:
2019-02-20 17:59:24,021 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12
2019-02-20 17:59:24,022 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh
2019-02-20 17:59:24,027 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-1
2019-02-20 17:59:24,027 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is ok for node mariadb-server-0
2019-02-20 17:59:27,372 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap

The mariadb-server-1 pod stops generating logs shortly after it comes up:
2019-02-20 01:50:51,516 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12
2019-02-20 01:50:51,516 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh
2019-02-20 01:50:51,521 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is ok for node mariadb-server-1
2019-02-20 01:50:51,521 - OpenStack-Helm Mariadb - INFO - The data we have from the cluster is too old to make a decision for node mariadb-server-0
2019-02-20 01:50:51,545 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap
2019-02-20 01:51:01,531 - OpenStack-Helm Mariadb - INFO - Cluster info has been uptodate 0 times out of the required 12
2019-02-20 01:51:01,531 - OpenStack-Helm Mariadb - INFO - Checking to see if cluster data is fresh
2019-02-20 01:51:01,568 - OpenStack-Helm Mariadb - INFO - Updating grastate configmap
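
Putting the two excerpts together, my read of the recovery logic (reconstructed purely from these log messages; the real code is the start script in the openstack-helm mariadb chart, and the names below are illustrative, not the actual implementation) is that each pod periodically writes its own state into the shared grastate configmap and will not proceed until every node's entry has looked fresh for 12 consecutive checks. Because mariadb-server-1 stopped refreshing its entry, mariadb-server-0 always sees it as too old, the counter resets to 0, and the loop never terminates. Roughly:

# Rough sketch of the check loop as I read it from the logs above.
# Illustrative only; names and the freshness window are assumptions.
import time

REQUIRED_FRESH_CHECKS = 12      # "out of the required 12"
STALE_AFTER_SECONDS = 120       # assumed freshness window

def cluster_data_is_fresh(grastate, now, nodes):
    # True only if every node's grastate entry is recent enough to use.
    fresh = True
    for node in nodes:
        last_update = grastate.get(node, 0)
        if now - last_update > STALE_AFTER_SECONDS:
            print(f"The data we have from the cluster is too old to make a decision for node {node}")
            fresh = False
        else:
            print(f"The data we have from the cluster is ok for node {node}")
    return fresh

def wait_for_fresh_cluster_info(read_configmap, update_own_entry, nodes):
    # Loop until the data has been fresh 12 times in a row; with
    # mariadb-server-1 no longer updating its entry, the counter on
    # mariadb-server-0 never gets past 0.
    uptodate_count = 0
    while uptodate_count < REQUIRED_FRESH_CHECKS:
        print(f"Cluster info has been uptodate {uptodate_count} times out of the required {REQUIRED_FRESH_CHECKS}")
        print("Checking to see if cluster data is fresh")
        if cluster_data_is_fresh(read_configmap(), time.time(), nodes):
            uptodate_count += 1
        else:
            uptodate_count = 0
        print("Updating grastate configmap")
        update_own_entry()
        time.sleep(10)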

The garbd pod can't seem to connect to either of the mariadb-servers:
2019-02-20 18:03:27.728 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:30.228 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.21:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:32.729 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:35.229 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.21:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:37.729 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.175:4567 timed out, no messages seen in PT3S
2019-02-20 18:03:40.229 INFO: (f14c4149, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.16.0.21:4567 timed out, no messages seen in PT3S
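
For what it's worth, the peer addresses garbd keeps retrying (172.16.0.175 and 172.16.0.21) don't appear anywhere in the pod listing above, so they may be stale addresses from before the reboot. A quick reachability check of the Galera port from the garbd pod, against both those addresses and the current mariadb-server pod IPs, would confirm whether anything is listening at all. Minimal sketch, plain sockets, with the address lists taken from this comment:

# Probe the Galera replication port (4567) on the addresses garbd is
# timing out on and on the current mariadb-server pod IPs listed above.
import socket

ADDRS = {
    "garbd retry target": ["172.16.0.175", "172.16.0.21"],
    "current mariadb-server pod": ["172.16.0.201", "172.16.1.89"],
}

for label, hosts in ADDRS.items():
    for host in hosts:
        try:
            with socket.create_connection((host, 4567), timeout=3):
                print(f"{label} {host}:4567 reachable")
        except OSError as exc:
            print(f"{label} {host}:4567 NOT reachable ({exc})")

If neither mariadb-server pod answers on 4567, that would fit with both of them still sitting in the recovery loop above rather than actually running mysqld.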