Reported and investigated by Marius Cornea at https://bugzilla.redhat.com/show_bug.cgi?id=1652406
Description of problem:
Director-deployed OCP 3.11: openshift-monitoring pods end up in CrashLoopBackOff after a scale-out:
[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Complete
NAMESPACE              NAME                                   READY   STATUS             RESTARTS   AGE
openshift-monitoring   prometheus-operator-5677fb6f87-xzdw5   0/1     CrashLoopBackOff   17         1h
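To find which node the failing pod was scheduled on, the wide output can be used (shown here with the pod name from this report; this step is illustrative, not part of the original investigation transcript):
[root@openshift-master-0 heat-admin]# oc get pod -n openshift-monitoring prometheus-operator-5677fb6f87-xzdw5 -o wide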
Checking the infra node where the pod was running, we can see:
[root@openshift-infra-0 heat-admin]# docker logs -f k8s_prometheus-operator_prometheus-operator-5677fb6f87-xzdw5_openshift-monitoring_cfed5b0c-ede6-11e8-8571-525400112488_19
ts=2018-11-22T01:34:30.683149725Z caller=main.go:130 msg="Starting Prometheus Operator version '0.23.1'."
ts=2018-11-22T01:34:30.687595956Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: network is unreachable"
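172.30.0.1 is the kubernetes service VIP on the default 172.30.0.0/16 services network, so the operator cannot reach the API server at all. A quick way to reproduce the failure from the node itself, mirroring the request the operator makes (illustrative commands, not from the original report; ip route get should report the same "network is unreachable" if the route via the SDN is gone):
[root@openshift-infra-0 heat-admin]# ip route get 172.30.0.1
[root@openshift-infra-0 heat-admin]# curl -k --max-time 5 https://172.30.0.1:443/version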
Checking openvswitch logs:
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.935Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.946Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.10.0
2018-11-21T22:57:34.961Z|00003|memory|INFO|4248 kB peak resident set size after 10.0 seconds
2018-11-21T22:57:34.961Z|00004|memory|INFO|cells:38 json-caches:1 monitors:2 sessions:1
2018-11-21T23:43:13.575Z|00005|jsonrpc|WARN|unix#78: receive error: Connection reset by peer
2018-11-21T23:43:13.575Z|00006|reconnect|WARN|unix#78: connection dropped (Connection reset by peer)
2018-11-21T23:43:39.723Z|00007|jsonrpc|WARN|unix#87: receive error: Connection reset by peer
2018-11-21T23:43:39.724Z|00008|reconnect|WARN|unix#87: connection dropped (Connection reset by peer)
2018-11-21T23:44:05.943Z|00009|jsonrpc|WARN|unix#94: receive error: Connection reset by peer
2018-11-21T23:44:05.943Z|00010|reconnect|WARN|unix#94: connection dropped (Connection reset by peer)
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovs-vswitchd.log
2018-11-22T00:21:52.727Z|00181|connmgr|INFO|br0<->unix#362: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:22:46.366Z|00182|connmgr|INFO|br0<->unix#368: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:40:39.588Z|00183|connmgr|INFO|br0<->unix#449: 3 flow_mods in the last 0 s (3 adds)
2018-11-22T00:40:39.595Z|00184|connmgr|INFO|br0<->unix#451: 1 flow_mods in the last 0 s (1 adds)
2018-11-22T01:01:12.115Z|00185|bridge|INFO|bridge br0: added interface vethe6d048e0 on port 14
2018-11-22T01:01:12.127Z|00186|connmgr|INFO|br0<->unix#547: 4 flow_mods in the last 0 s (4 adds)
2018-11-22T01:01:12.150Z|00187|connmgr|INFO|br0<->unix#549: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.027Z|00188|connmgr|INFO|br0<->unix#551: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.051Z|00189|connmgr|INFO|br0<->unix#553: 4 flow_mods in the last 0 s (4 deletes)
2018-11-22T01:01:33.086Z|00190|bridge|INFO|bridge br0: deleted interface vethe6d048e0 on port 14
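One way to confirm whether the SDN flows on br0 were actually lost (OpenShift SDN programs br0 via OpenFlow 1.3) is to dump the flow table and bridge state directly; these are standard OVS tools, not commands taken from the report:
[root@openshift-infra-0 heat-admin]# ovs-ofctl -O OpenFlow13 dump-flows br0 | head
[root@openshift-infra-0 heat-admin]# ovs-vsctl show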
After running 'systemctl restart openvswitch' on the infra node, the pod was able to start successfully.
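After the restart, the recovery can be double-checked before considering the node healthy (illustrative follow-up, not part of the original report):
[root@openshift-infra-0 heat-admin]# systemctl status openvswitch
[root@openshift-master-0 heat-admin]# oc get pods -n openshift-monitoring -o wide | grep prometheus-operator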
How reproducible:
Not always.
Steps to Reproduce:
1. Deploy OCP with 3 master + 2 infra + 2 worker nodes
2. Add one master node (a sketch of the scale-out invocation follows this list)
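On a director-deployed cluster, the scale-out in step 2 is normally a re-run of the original overcloud deploy with the role count raised; a minimal sketch, assuming TripleO's auto-generated <Role>Count parameter convention (the file name and OpenShiftMasterCount parameter are illustrative, not taken from this report):
[stack@undercloud ~]$ cat node-counts.yaml
parameter_defaults:
  OpenShiftMasterCount: 4
[stack@undercloud ~]$ openstack overcloud deploy --templates -e node-counts.yaml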
Actual results:
The scale-out operation completes successfully, but pods on the infra nodes are left in CrashLoopBackOff state.
Expected results:
All pods remain in Running state.
Fix proposed to branch: master
Review: https://review.openstack.org/619713