Nova/Neutron daemonset pods restarted on all workers when new worker is added

Bug #1820902 reported by Ghada Khalil
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
When a new worker is added with the openstack labels and unlocked, the expected behavior is that there is no impact on the existing worker nodes.
What we see is that the nova and neutron daemonset pods are restarted on all worker nodes.

The issue seems to be related to secrets being regenerated because of changes to the configmap between helm versions. The memcache_secret_key is the item changing between versions.
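
For reference, the openstack-helm chart templates fill in memcache_secret_key with a randomly generated value when no override is supplied, roughly along the lines of the sketch below (a paraphrased illustration, not the exact upstream template). Because the value is generated at render time, each re-render can embed a different key, which changes the rendered config and the hash derived from it:

   {{/* Sketch only (assumed shape of the openstack-helm configmap-etc template):
        if no memcache_secret_key override is supplied, a random one is generated
        at render time, so every re-render can produce a different nova.conf and
        therefore a different configmap hash. */}}
   {{- if empty .Values.conf.nova.keystone_authtoken.memcache_secret_key -}}
   {{- $_ := set .Values.conf.nova.keystone_authtoken "memcache_secret_key" (randAlphaNum 64) -}}
   {{- end -}}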

Severity
--------
Major

Steps to Reproduce
------------------
Add a new worker node to a running system w/ containers

Expected Behavior
------------------
The new worker node is unlocked successfully without any impact on the existing worker nodes

Actual Behavior
----------------
The nova and neutron daemon set pods are restarted on all worker nodes

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system w/ containers

Branch/Pull Time/Commit
-----------------------
Any load w/ containers

Timestamp/Logs
--------------
N/A - Issue is easily reproducible

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
importance: Undecided → Medium
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
tags: added: stx.2019.05 stx.containers
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Bob Church (rchurch) → Ovidiu Poncea (ovidiu.poncea)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Initial conclusions point to a tricky helm-toolkit bug. I got some small guidance on the openstack-helm stash channel, but nothing to point me to a solution.

Short story: On a multi-host deployment with nova-compute running on several nodes, we add or delete a host and reapply the manifests. To our surprise, the nova-compute services get restarted on ALL nodes. We expect it to just start on the newly added nova-compute node (or to do nothing in the host removal case). The interesting part here is that the config maps passed don't change at all (i.e. there is no change in the openstack config files!), so there shouldn't be any reason to restart/recreate the pods at all.

Long story:
When comparing the chart output from before and after removing a host, there are two changes that seem to trigger the pod recreation: the names of the daemonsets and one hash (configmap-etc-hash). Looking further, both the daemonset names and the hash differ for the same reason: the hostnames of all the configured nodes (i.e. a map with the hostnames) are included in the hash computation that is used to generate the pod names & configmap-etc-hash.
There seems to be no reason to use the hostname list for this, as there is no actual config file change... the problem is that I don't yet know how to fix it. It's in helm-toolkit, so it will impact ALL the services in the system.
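
To illustrate the mechanism (hypothetical hostnames and values only, not the real template variables): hashing the same rendered configuration together with the node list yields a different digest whenever a node is added or removed, even though the configuration itself is unchanged, and a changed digest in the pod annotations is enough to trigger pod recreation.

   {{/* Illustration only, with made-up hostnames: the digest changes when a node
        is added, even though $conf is identical in both cases. */}}
   {{- $conf := "rendered nova.conf contents" }}
   {{- $hashBefore := print (join "," (list "compute-0" "compute-1")) $conf | sha256sum }}
   {{- $hashAfter := print (join "," (list "compute-0" "compute-1" "compute-2")) $conf | sha256sum }}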

I was able to reconcile the difference in the configmap hash by not taking the hostnames into account in helm-toolkit/templates/utils/_daemonset_overrides.tpl, replacing:

   {{- if not $context.Values.__daemonset_yaml.spec.template.metadata.annotations }}{{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata "annotations" dict }}{{- end }}
   {{- $cmap := list $current_dict.dns_1123_name $current_dict.nodeData | include $configmap_include }}
   {{- $values_hash := $cmap | quote | sha256sum }}
   {{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata.annotations "configmap-etc-hash" $values_hash }}

with:
   {{- if not $context.Values.__daemonset_yaml.spec.template.metadata.annotations }}{{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata "annotations" dict }}{{- end }}
   {{- $cmap := list $current_dict.dns_1123_name $current_dict.nodeData | include $configmap_include }}
   {{- $hashcmap := list "default" $current_dict.nodeData | include $configmap_include }}
   {{- $values_hash := $hashcmap | quote | sha256sum }}
   {{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata.annotations "configmap-etc-hash" $values_hash }}

The problem is that I still get the pod name change... and this is much harder to fix.

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

This is indeed a hard-to-catch bug in helm-toolkit. I opened a StoryBoard story on the helm-toolkit project:
https://storyboard.openstack.org/#!/story/2006278

The fix works on a manual apply; I'm going to do more tests and raise a review on the openstack-helm-infra project.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to upstream (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to upstream (master)

Reviewed: https://review.opendev.org/673300
Committed: https://git.openstack.org/cgit/starlingx/upstream/commit/?id=b120684a28a5830f5f2579a7a713450aadc134f0
Submitter: Zuul
Branch: master

commit b120684a28a5830f5f2579a7a713450aadc134f0
Author: Ovidiu Poncea <email address hidden>
Date: Mon Jul 29 09:14:18 2019 -0400

    Fix nova&neutron pod restarted on all workers when worker added

    This is based on upstream(openstack-helm/openstack-helm-infra) review:
    https://review.opendev.org/#/c/672966/

    The only difference is that we do not carry over the Ceph changes
    as we do not use Helm's Ceph in StarlingX.

    Change-Id: Iabc3689bca198a861f2ade03a620895320897568
    Closes-Bug: 1820902
    Signed-off-by: Ovidiu Poncea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released