Docker service gets restarted during openshift scale out causing an outage

Bug #1804790 reported by Martin André
This bug affects 2 people
Affects   Status         Importance   Assigned to     Milestone
tripleo   Fix Released   High         Martin André

Bug Description

Reported and investigated by Marius Cornea at https://bugzilla.redhat.com/show_bug.cgi?id=1652406

Description of problem:
On a director-deployed OCP 3.11 cluster, openshift-monitoring pods end up in CrashLoopBackOff after a scale-out:

[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Complete
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-monitoring prometheus-operator-5677fb6f87-xzdw5 0/1 CrashLoopBackOff 17 1h
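
The grep filter above matches substrings anywhere on the line, so a pod whose name happens to contain "Running" would be hidden as well. A minimal sketch of a stricter filter follows, assuming the six-column layout shown above with STATUS in the fourth field. The function name unhealthy_pods is hypothetical and not part of the original report:

```shell
# Read 'oc get pods --all-namespaces' output on stdin and keep the header row
# plus every pod whose STATUS column is neither Running nor Completed.
# Assumes the column order NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}
```

It would be used as `oc get pods --all-namespaces | unhealthy_pods`. Comparing the STATUS field exactly avoids false matches on pod names, at the cost of depending on the column order.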

Checking the infra node where the pod was running, we can see:

[root@openshift-infra-0 heat-admin]# docker logs -f k8s_prometheus-operator_prometheus-operator-5677fb6f87-xzdw5_openshift-monitoring_cfed5b0c-ede6-11e8-8571-525400112488_19
ts=2018-11-22T01:34:30.683149725Z caller=main.go:130 msg="Starting Prometheus Operator version '0.23.1'."
ts=2018-11-22T01:34:30.687595956Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: network is unreachable"

Checking openvswitch logs:

[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.935Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.946Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.10.0
2018-11-21T22:57:34.961Z|00003|memory|INFO|4248 kB peak resident set size after 10.0 seconds
2018-11-21T22:57:34.961Z|00004|memory|INFO|cells:38 json-caches:1 monitors:2 sessions:1
2018-11-21T23:43:13.575Z|00005|jsonrpc|WARN|unix#78: receive error: Connection reset by peer
2018-11-21T23:43:13.575Z|00006|reconnect|WARN|unix#78: connection dropped (Connection reset by peer)
2018-11-21T23:43:39.723Z|00007|jsonrpc|WARN|unix#87: receive error: Connection reset by peer
2018-11-21T23:43:39.724Z|00008|reconnect|WARN|unix#87: connection dropped (Connection reset by peer)
2018-11-21T23:44:05.943Z|00009|jsonrpc|WARN|unix#94: receive error: Connection reset by peer
2018-11-21T23:44:05.943Z|00010|reconnect|WARN|unix#94: connection dropped (Connection reset by peer)
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovs-vswitchd.log
2018-11-22T00:21:52.727Z|00181|connmgr|INFO|br0<->unix#362: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:22:46.366Z|00182|connmgr|INFO|br0<->unix#368: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:40:39.588Z|00183|connmgr|INFO|br0<->unix#449: 3 flow_mods in the last 0 s (3 adds)
2018-11-22T00:40:39.595Z|00184|connmgr|INFO|br0<->unix#451: 1 flow_mods in the last 0 s (1 adds)
2018-11-22T01:01:12.115Z|00185|bridge|INFO|bridge br0: added interface vethe6d048e0 on port 14
2018-11-22T01:01:12.127Z|00186|connmgr|INFO|br0<->unix#547: 4 flow_mods in the last 0 s (4 adds)
2018-11-22T01:01:12.150Z|00187|connmgr|INFO|br0<->unix#549: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.027Z|00188|connmgr|INFO|br0<->unix#551: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.051Z|00189|connmgr|INFO|br0<->unix#553: 4 flow_mods in the last 0 s (4 deletes)
2018-11-22T01:01:33.086Z|00190|bridge|INFO|bridge br0: deleted interface vethe6d048e0 on port 14

After running 'systemctl restart openvswitch' on the infra node, the pod was able to start successfully.

How reproducible:
Not always.

Steps to Reproduce:
1. Deploy OCP with 3 master + 2 infra + 2 worker nodes
2. Add one master node

Actual results:
The scale-out operation completes successfully, but some infra pods are left in CrashLoopBackOff state.

Expected results:
All pods remain in Running state.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/619713

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
assignee: Martin André (mandre) → Mike Fedosin (mfedosin)
Changed in tripleo:
assignee: Mike Fedosin (mfedosin) → Martin André (mandre)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-role-container-registry (master)

Reviewed: https://review.openstack.org/621241
Committed: https://git.openstack.org/cgit/openstack/ansible-role-container-registry/commit/?id=88c26d2cdaeeaa74c01f4a417f9eb7d83f9f5263
Submitter: Zuul
Branch: master

commit 88c26d2cdaeeaa74c01f4a417f9eb7d83f9f5263
Author: Mike Fedosin <email address hidden>
Date: Fri Nov 30 18:02:07 2018 +0100

    Allow to skip docker reconfiguration

    This commit adds an option `container_registry_skip_reconfiguration`,
    that, when enabled, disables the reconfiguration if docker has already
    been configured once.

    Change-Id: I0bcaeea9cd24ab35a81d8c3d6fc3a384c1e4c3c2
    Related-Bug: #1804790

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/620621
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0101b463873d84e32f758113b43666eb98790534
Submitter: Zuul
Branch: master

commit 0101b463873d84e32f758113b43666eb98790534
Author: Mike Fedosin <email address hidden>
Date: Wed Nov 28 15:59:53 2018 +0100

    Allow to skip docker reconfiguration during stack update

    When installing OpenShift by means of TripleO, after
    the initial docker configuration, openshift-ansible
    also adds several parameters there.

    Then, if we want to remove a single node, then a stack
    update is performed, which returns the configuration
    to its original state. In other words, it removes all
    parameters added by openshift-ansible, which breaks OpenShift.

    This commit adds the ability to disable reconfiguration of
    docker at the time of stack update for all roles associated
    with OpenShift.

    Closes-Bug: #1804790

    Depends-On: I0bcaeea9cd24ab35a81d8c3d6fc3a384c1e4c3c2
    Change-Id: If202be5d27d81672e39cbe21867459d277220e23

Changed in tripleo:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/624953

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/624953
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bcad833984c0cd3eccf00cf0a121c8d2f4d4b828
Submitter: Zuul
Branch: stable/rocky

commit bcad833984c0cd3eccf00cf0a121c8d2f4d4b828
Author: Mike Fedosin <email address hidden>
Date: Wed Nov 28 15:59:53 2018 +0100

    Allow to skip docker reconfiguration during stack update

    When installing OpenShift by means of TripleO, after
    the initial docker configuration, openshift-ansible
    also adds several parameters there.

    Then, if we want to remove a single node, then a stack
    update is performed, which returns the configuration
    to its original state. In other words, it removes all
    parameters added by openshift-ansible, which breaks OpenShift.

    This commit adds the ability to disable reconfiguration of
    docker at the time of stack update for all roles associated
    with OpenShift.

    Closes-Bug: #1804790

    Conflicts:
          puppet/services/docker.yaml

    Depends-On: I0bcaeea9cd24ab35a81d8c3d6fc3a384c1e4c3c2
    Change-Id: If202be5d27d81672e39cbe21867459d277220e23
    (cherry picked from commit 0101b463873d84e32f758113b43666eb98790534)

tags: added: in-stable-rocky

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.2.0

This issue was fixed in the openstack/tripleo-heat-templates 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/619713
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bb1a1209ace0d638bbd9835b1c002dbc8c0db6d9
Submitter: Zuul
Branch: master

commit bb1a1209ace0d638bbd9835b1c002dbc8c0db6d9
Author: Martin André <email address hidden>
Date: Fri Nov 23 11:08:36 2018 +0100

    Rework the generated openshift-ansible playbook

    The `prerequisites.yml` playbook should only be explicitly run on
    initial deployment to prepare the nodes. It is already included in the
    scaleup playbooks for the new nodes so there is no need to include it
    again. Re-running the `prerequisites.yml` playbook reconfigures the
    container runtime and may cause outage, it is supposed to be run only
    once.

    Make update and upgrade playbooks exclusive. There is no need to run
    both of them.

    Add comments to clarify the intent for each playbooks.

    Change-Id: I30278360fcc1ffa9bd7ce7cb77d023629fb6fa47
    Closes-Bug: #1804790

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/629862

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.3.0

This issue was fixed in the openstack/tripleo-heat-templates 10.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/629862
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1c96500c585acddba9741c9f1d43b6fa02369b68
Submitter: Zuul
Branch: stable/rocky

commit 1c96500c585acddba9741c9f1d43b6fa02369b68
Author: Martin André <email address hidden>
Date: Fri Nov 23 11:08:36 2018 +0100

    Rework the generated openshift-ansible playbook

    The `prerequisites.yml` playbook should only be explicitly run on
    initial deployment to prepare the nodes. It is already included in the
    scaleup playbooks for the new nodes so there is no need to include it
    again. Re-running the `prerequisites.yml` playbook reconfigures the
    container runtime and may cause outage, it is supposed to be run only
    once.

    Make update and upgrade playbooks exclusive. There is no need to run
    both of them.

    Add comments to clarify the intent for each playbooks.

    Change-Id: I30278360fcc1ffa9bd7ce7cb77d023629fb6fa47
    Closes-Bug: #1804790
    (cherry picked from commit bb1a1209ace0d638bbd9835b1c002dbc8c0db6d9)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.3.0

This issue was fixed in the openstack/tripleo-heat-templates 9.3.0 release.
