Reported and investigated by Marius Cornea at https://bugzilla.redhat.com/show_bug.cgi?id=1652406
Description of problem:
Director-deployed OCP 3.11: openshift-monitoring pods end up in CrashLoopBackOff after a scale-out:
[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Complete
NAMESPACE              NAME                                   READY   STATUS             RESTARTS   AGE
openshift-monitoring   prometheus-operator-5677fb6f87-xzdw5   0/1     CrashLoopBackOff   17         1h
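To find which node the failing pod was scheduled on, the wide output can be used (shown here with the pod name from this report; this step is illustrative, not part of the original investigation transcript):
[root@openshift-master-0 heat-admin]# oc get pod -n openshift-monitoring prometheus-operator-5677fb6f87-xzdw5 -o wide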
Checking the infra node where the pod was running, we can see:
[root@openshift-infra-0 heat-admin]# docker logs -f k8s_prometheus-operator_prometheus-operator-5677fb6f87-xzdw5_openshift-monitoring_cfed5b0c-ede6-11e8-8571-525400112488_19
ts=2018-11-22T01:34:30.683149725Z caller=main.go:130 msg="Starting Prometheus Operator version '0.23.1'."
ts=2018-11-22T01:34:30.687595956Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: network is unreachable"
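172.30.0.1 is the kubernetes service VIP on the default 172.30.0.0/16 services network, so the operator cannot reach the API server at all. A quick way to reproduce the failure from the node itself, mirroring the request the operator makes (illustrative commands, not from the original report; ip route get should report the same "network is unreachable" if the route via the SDN is gone):
[root@openshift-infra-0 heat-admin]# ip route get 172.30.0.1
[root@openshift-infra-0 heat-admin]# curl -k --max-time 5 https://172.30.0.1:443/version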
Checking openvswitch logs:
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.935Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.946Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.10.0
2018-11-21T22:57:34.961Z|00003|memory|INFO|4248 kB peak resident set size after 10.0 seconds
2018-11-21T22:57:34.961Z|00004|memory|INFO|cells:38 json-caches:1 monitors:2 sessions:1
2018-11-21T23:43:13.575Z|00005|jsonrpc|WARN|unix#78: receive error: Connection reset by peer
2018-11-21T23:43:13.575Z|00006|reconnect|WARN|unix#78: connection dropped (Connection reset by peer)
2018-11-21T23:43:39.723Z|00007|jsonrpc|WARN|unix#87: receive error: Connection reset by peer
2018-11-21T23:43:39.724Z|00008|reconnect|WARN|unix#87: connection dropped (Connection reset by peer)
2018-11-21T23:44:05.943Z|00009|jsonrpc|WARN|unix#94: receive error: Connection reset by peer
2018-11-21T23:44:05.943Z|00010|reconnect|WARN|unix#94: connection dropped (Connection reset by peer)
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovs-vswitchd.log
2018-11-22T00:21:52.727Z|00181|connmgr|INFO|br0<->unix#362: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:22:46.366Z|00182|connmgr|INFO|br0<->unix#368: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:40:39.588Z|00183|connmgr|INFO|br0<->unix#449: 3 flow_mods in the last 0 s (3 adds)
2018-11-22T00:40:39.595Z|00184|connmgr|INFO|br0<->unix#451: 1 flow_mods in the last 0 s (1 adds)
2018-11-22T01:01:12.115Z|00185|bridge|INFO|bridge br0: added interface vethe6d048e0 on port 14
2018-11-22T01:01:12.127Z|00186|connmgr|INFO|br0<->unix#547: 4 flow_mods in the last 0 s (4 adds)
2018-11-22T01:01:12.150Z|00187|connmgr|INFO|br0<->unix#549: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.027Z|00188|connmgr|INFO|br0<->unix#551: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.051Z|00189|connmgr|INFO|br0<->unix#553: 4 flow_mods in the last 0 s (4 deletes)
2018-11-22T01:01:33.086Z|00190|bridge|INFO|bridge br0: deleted interface vethe6d048e0 on port 14
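One way to confirm whether the SDN flows on br0 were actually lost (OpenShift SDN programs br0 via OpenFlow 1.3) is to dump the flow table and bridge state directly; these are standard OVS tools, not commands taken from the report:
[root@openshift-infra-0 heat-admin]# ovs-ofctl -O OpenFlow13 dump-flows br0 | head
[root@openshift-infra-0 heat-admin]# ovs-vsctl show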
After running 'systemctl restart openvswitch' on the infra node, the pod was able to start successfully.
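After the restart, the recovery can be double-checked before considering the node healthy (illustrative follow-up, not part of the original report):
[root@openshift-infra-0 heat-admin]# systemctl status openvswitch
[root@openshift-master-0 heat-admin]# oc get pods -n openshift-monitoring -o wide | grep prometheus-operator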
How reproducible:
Not always.
Steps to Reproduce:
1. Deploy OCP with 3 master + 2 infra + 2 worker nodes
2. Add one master node (a sketch of the scale-out invocation follows this list)
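On a director-deployed cluster, the scale-out in step 2 is normally a re-run of the original overcloud deploy with the role count raised; a minimal sketch, assuming TripleO's auto-generated <Role>Count parameter convention (the file name and OpenShiftMasterCount parameter are illustrative, not taken from this report):
[stack@undercloud ~]$ cat node-counts.yaml
parameter_defaults:
  OpenShiftMasterCount: 4
[stack@undercloud ~]$ openstack overcloud deploy --templates -e node-counts.yaml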
Actual results:
The scale-out operation completes successfully, but pods on the infra nodes are left in CrashLoopBackOff state.
Expected results:
All pods remain in Running state.
Fix proposed to branch: master
Review: https://review.openstack.org/619713