Nova/Neutron daemonset pods restarted on all workers when new worker is added
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Ovidiu Poncea |
Bug Description
Brief Description
-----------------
When a new worker is added with the openstack labels and unlocked, the expected behavior is that there is no impact to the existing worker nodes.
What we see instead is that the nova and neutron daemonset pods are restarted on all worker nodes.
The issue appears to be related to secrets being regenerated because the configmap changes between helm versions; the memcache_secret_key is the item that changes between versions.
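A possible mitigation (a sketch only; the path below follows openstack-helm's usual endpoints layout and the value is a placeholder, so treat both as assumptions) is to pin the key in a values override so it is not regenerated on each apply:

# Hypothetical values override (sketch; placeholder value, not a recommended key):
endpoints:
  oslo_cache:
    auth:
      memcache_secret_key: some-fixed-key

With the key pinned, the rendered configmap content should be stable across re-applies, taking this particular trigger out of the picture.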
Severity
--------
Major
Steps to Reproduce
------------------
Add a new worker node to a running system w/ containers
Expected Behavior
------------------
The new worker node is unlocked successfully without any impact on the existing worker nodes
Actual Behavior
----------------
The nova and neutron daemon set pods are restarted on all worker nodes
Reproducibility
---------------
Reproducible
System Configuration
-------
Multi-node system w/ containers
Branch/Pull Time/Commit
-------
Any load w/ containers
Timestamp/Logs
--------------
N/A - Issue is easily reproducible
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
importance: Undecided → Medium

Changed in starlingx:
status: New → Triaged
tags: added: stx.2019.05 stx.containers
tags: added: stx.2.0 removed: stx.2019.05

Changed in starlingx:
assignee: Bob Church (rchurch) → Ovidiu Poncea (ovidiu.poncea)

Changed in starlingx:
status: Triaged → In Progress
Initial conclusions point to a tricky helm-toolkit bug. I got some small guidance on the openstack-helm Slack channel, but nothing that points me to a solution.
Short story: On a multi-host deployment, we add or delete a nova-compute host and reapply the manifests. To our surprise, nova-compute services get restarted on ALL nodes. We expect them to just start on the newly added nova-compute node (or to do nothing at all in the host-removal case). The interesting part is that the config maps passed don't change at all (i.e. there is no change in the openstack config files!), so there shouldn't be any reason to restart/recreate the pods.
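For context, the recreation is driven by an annotation stamped on the pod template: since configmap-etc-hash lives under spec.template.metadata, any change to it makes Kubernetes roll the daemonset pods even when the mounted files are byte-identical. A sketch of the rendered output (hash value illustrative):

# Sketch of the rendered DaemonSet (values illustrative):
spec:
  template:
    metadata:
      annotations:
        configmap-etc-hash: "3fa1..."  # any change here recreates the pods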
Long story: When comparing the chart output from before and after removing a host, there are two changes that seem to trigger the pod recreation: the name of the daemonsets and one hash (configmap-etc-hash). Looking further, both the name of the daemonsets and the hash differ for the same reason: the hostnames of all the configured nodes (i.e. a map with the hostnames) are included in the hash computation, which is used for the generation of the pod names & configmap-etc-hash.
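A minimal sketch of the pattern (variable names from helm-toolkit, pipeline simplified): because dns_1123_name carries the host name, the sha256 input differs whenever a host is added or removed, even though nodeData (the actual config) is unchanged:

{{/* Simplified sketch of the hash input; the real code also pipes the list
     through a configmap include before hashing. */}}
{{- $values_hash := list $current_dict.dns_1123_name $current_dict.nodeData | quote | sha256sum }}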
It seems that there is no reason to use the hostname list for this, as there is no actual config file change... The problem is that I don't yet know how to fix it: it's in helm-toolkit, so a change there will impact ALL the services in the system.
I was able to reconcile the difference in the configmap hash by not taking the hostnames into account in helm-toolkit/templates/utils/_daemonset_overrides.tpl, replacing:
{{- if not $context.Values.__daemonset_yaml.spec.template.metadata.annotations }}{{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata "annotations" dict }}{{- end }}
{{- $cmap := list $current_dict.dns_1123_name $current_dict.nodeData | include $configmap_include }}
{{- $values_hash := $cmap | quote | sha256sum }}
{{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata.annotations "configmap-etc-hash" $values_hash }}

with:

{{- if not $context.Values.__daemonset_yaml.spec.template.metadata.annotations }}{{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata "annotations" dict }}{{- end }}
{{- $cmap := list $current_dict.dns_1123_name $current_dict.nodeData | include $configmap_include }}
{{- $hashcmap := list "default" $current_dict.nodeData | include $configmap_include }}
{{- $values_hash := $hashcmap | quote | sha256sum }}
{{- $_ := set $context.Values.__daemonset_yaml.spec.template.metadata.annotations "configmap-etc-hash" $values_hash }}
Problem is that I still get the POD name change... and this is much harder to fix.
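For illustration only (this is not the actual helm-toolkit source), the naming problem has the same shape: if the daemonset name embeds a digest over per-host data, the name itself shifts with the host set, and Kubernetes then deletes the old daemonset and creates a new one instead of updating in place:

{{/* Hypothetical sketch, not the real helm-toolkit code: a digest-suffixed
     name changes whenever the host set changes, forcing delete + create. */}}
{{- $ds_name := printf "%s-%s" "nova-compute" (list $current_dict.nodeData | quote | sha256sum | trunc 8) }}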