Undercloud - changing os-net-config conf kills undercloud_[admin, public]_host IPs

Bug #1791238 reported by Harald Jensås on 2018-09-07
Affects: tripleo
Importance: High
Assigned to: Harald Jensås

Bug Description

In the containerized undercloud, re-running the 'openstack undercloud install' command removes the undercloud_admin_host and undercloud_public_host IP addresses if the config for os-net-config has changed (for example, if the undercloud_nameservers option is changed in undercloud.conf).

The br-ctlplane interface is restarted by os-net-config, and this removes the undercloud_admin_host and undercloud_public_host IP addresses set up by keepalived. The install/update operation fails later on because services cannot connect to an IP address that is no longer there.

Reproduce:

1. Deploy undercloud with the following configuration

[DEFAULT]

enable_routed_networks = false
enable_tempest = false
enable_ui = false
inspection_interface = br-ctlplane
ipxe_enabled = true
local_interface = eth1
local_ip = 172.20.0.200/26
local_mtu = 1500
local_subnet = ctlplane-subnet
overcloud_domain_name = localdomain
scheduler_max_attempts = 3
subnets = ctlplane-subnet
undercloud_admin_host = 172.20.0.201
undercloud_debug = true
undercloud_hostname = container-undercloud.lab.example.com
undercloud_nameservers = 172.20.0.254
undercloud_ntp_servers = 0.se.pool.ntp.org
undercloud_public_host = 172.20.0.203

[ctlplane-subnet]
cidr = 172.20.0.192/26
dhcp_start = 172.20.0.210
dhcp_end = 172.20.0.219
inspection_iprange = 172.20.0.220,172.20.0.229
gateway = 172.20.0.254
masquerade = true

2. Change the undercloud_nameservers option in undercloud.conf

sed -i 's/undercloud_nameservers = 172.20.0.254/undercloud_nameservers = 192.168.122.1/g' /home/stack/undercloud.conf

3. Re-run undercloud install

openstack undercloud install

RESULTS:

1. The os-net-config config.json is updated with the new DNS server.

Every 5.0s: diff -aur /etc/os-net-config/config.json /tmp/os-net-config.json.orig Fri Sep 7 08:51:26 2018

--- /etc/os-net-config/config.json 2018-09-07 08:45:39.054174371 +0200
+++ /tmp/os-net-config.json.orig 2018-09-07 08:17:38.597808977 +0200
@@ -1 +1 @@
-{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}], "dns_servers": ["192.168.122.1"], "members": [{"mtu": 1500, "name": "eth1", "primary": true, "type": "interface"}], "name": "br-ctlplane",
"ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}
+{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}], "dns_servers": ["172.20.0.254"], "members": [{"mtu": 1500, "name": "eth1", "primary": true, "type": "interface"}], "name": "br-ctlplane", "
ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}

2. After os-net-config applies the config, the keepalived VIPs are gone:

Every 2.0s: ip addr show br-ctlplane Fri Sep 7 08:51:08 2018

47: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:7a:f6:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.200/26 brd 172.20.0.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe7a:f6c5/64 scope link
       valid_lft forever preferred_lft forever

3. The upgrade is stuck on starting the containers:

TASK [Start containers for step 3] **********************************************

4. Logs show that services are failing to connect to the database via the keepalived VIPs:
/var/log/containers/nova/nova-compute.log:2018-09-07 08:52:47.462 6 ERROR oslo_service.periodic_task RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.20.0.201' ([Errno 113] EHOSTUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
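
The check done by eye in result 2 above could be scripted roughly as follows. This is a minimal sketch, not part of the reported fix: missing_vips is a hypothetical helper name, and it assumes the addresses appear as "inet <ip>/<prefix>" lines, as in the 'ip addr show' output shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: print every expected VIP that does not appear as an
# "inet <ip>/<prefix>" entry in `ip addr show` output read from stdin.
missing_vips() {
    addr_output=$(cat)
    for vip in "$@"; do
        # -F matches a fixed string, so dots in the address are not
        # treated as regex wildcards.
        if ! printf '%s\n' "$addr_output" | grep -qF "inet ${vip}/"; then
            printf '%s\n' "$vip"
        fi
    done
}

# Usage on the undercloud (addresses taken from the undercloud.conf above):
#   ip addr show br-ctlplane | missing_vips 172.20.0.201 172.20.0.203
```

An empty result means both VIPs are present; any address printed is one that keepalived needs to re-add.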

description: updated
Harald Jensås (harald-jensas) wrote :

Restarting the keepalived container makes the VIPs reappear:

[root@container-undercloud mysql]# docker container list | grep keepalived
bfe1c1cd76e3 docker.io/tripleomaster/centos-binary-keepalived:9ad93affedba8870315dd72c714770875ce24759_b72f0c42 "/usr/local/bin/ko..." 12 hours ago Up 7 minutes keepalived

[root@container-undercloud ~]# docker container restart bfe1c1cd76e3
bfe1c1cd76e3

[root@container-undercloud ~]# ip addr show br-ctlplane
47: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:7a:f6:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.200/26 brd 172.20.0.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 172.20.0.201/32 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 172.20.0.203/32 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe7a:f6c5/64 scope link
       valid_lft forever preferred_lft forever

Once the VIPs are back up, TASK [Start containers for step 3] completes and the undercloud install completes successfully.

NOTE: If we had a recent enough keepalived we could probably have enabled the ``dynamic_interfaces`` option. But this is not implemented in the version used.

  """# Allow configuration to include interfaces that don't exist at startup.
           # This allows keepalived to work with interfaces that may be deleted and restored
           # and also allows virtual and static routes and rules on VMAC interfaces.
           dynamic_interfaces
  """

Harald Jensås (harald-jensas) wrote :

Recent versions of keepalived have support for 'dynamic_interfaces'; it looks like that would solve this problem. We would have to package keepalived 2.0 in RDO?

 # Allow configuration to include interfaces that don't exist at startup.
 # This allows keepalived to work with interfaces that may be deleted and restored
 # and also allows virtual and static routes and rules on VMAC interfaces.
   dynamic_interfaces

I built keepalived-2.0.6-1.el7.x86_64.rpm on CentOS 7 using the SRPM[1] from Fedora Rawhide. (The RPM builds with only a small tweak.)

Enabling dynamic_interfaces and using 2.0.6 version of keepalived in the keepalived container fixes this issue.
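
For illustration, the option goes in the global_defs block of keepalived.conf. The fragment below is a sketch only: the vrrp_instance name, virtual_router_id and priority are assumptions made up for the example and are not taken from the configuration the undercloud generates. Only the interface name and the two VIPs come from this report.

```
global_defs {
    # Tolerate interfaces that are deleted and re-created at runtime
    # (needs a keepalived release that implements this option).
    dynamic_interfaces
}

# Illustrative instance only: the name, router id and priority are
# assumptions, not values taken from the undercloud.
vrrp_instance undercloud_vip_example {
    state MASTER
    interface br-ctlplane
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        172.20.0.201/32
        172.20.0.203/32
    }
}
```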

Suggest we package keepalived 2.0.x and place this in the OSP repositories.

[1] https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/development/rawhide/Everything/source/tree/Packages/k/keepalived-2.0.6-1.fc29.src.rpm

Fix proposed to branch: master
Review: https://review.openstack.org/603587

Changed in tripleo:
assignee: nobody → Harald Jensås (harald-jensas)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/603587
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b766e253f4df4bb61247640850c3490b988c36d0
Submitter: Zuul
Branch: master

commit b766e253f4df4bb61247640850c3490b988c36d0
Author: Harald Jensås <email address hidden>
Date: Wed Sep 19 09:28:16 2018 +0200

    Undercloud - Restart keepalived on update

    instack-undercloud had a workaround (30-reload-keepalived)
    in place to always restart keepalived on install/upgrade.
    This is required to ensure VIP's are present in case the
    network config was changed and os-net-config restarts
    the network interface. When containerizing the undercloud
    this workaround was missed.

    This change adds a similar workaround. A pre_deploy
    NodeExtraconfig script will restart the keepalived
    container when the undercloud installer is (re-)run.

    NOTE: We can remove this workaround once keepalived
          v2.0.6 or later is available.

    Closes-Bug: #1791238
    Change-Id: I8cada7be57cd50c54ca5f2f38ec010062512ae06

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/605604
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=7c52f9489b00d36b8907827221ac71259da3bad4
Submitter: Zuul
Branch: stable/rocky

commit 7c52f9489b00d36b8907827221ac71259da3bad4
Author: Harald Jensås <email address hidden>
Date: Wed Sep 19 09:28:16 2018 +0200

    Undercloud - Restart keepalived on update

    instack-undercloud had a workaround (30-reload-keepalived)
    in place to always restart keepalived on install/upgrade.
    This is required to ensure VIP's are present in case the
    network config was changed and os-net-config restarts
    the network interface. When containerizing the undercloud
    this workaround was missed.

    This change adds a similar workaround. A pre_deploy
    NodeExtraconfig script will restart the keepalived
    container when the undercloud installer is (re-)run.

    NOTE: We can remove this workaround once keepalived
          v2.0.6 or later is available.

    Closes-Bug: #1791238
    Change-Id: I8cada7be57cd50c54ca5f2f38ec010062512ae06
    (cherry picked from commit b766e253f4df4bb61247640850c3490b988c36d0)

tags: added: in-stable-rocky

This issue was fixed in the openstack/tripleo-heat-templates 10.0.0 release.

Reviewed: https://review.openstack.org/623093
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=be61d8a2b5537e4ea3374f5245afaa299972a03e
Submitter: Zuul
Branch: master

commit be61d8a2b5537e4ea3374f5245afaa299972a03e
Author: Emilien Macchi <email address hidden>
Date: Wed Dec 5 17:45:52 2018 -0500

    Re-implement keepalived restart without pre_deploy

    ... and use host_prep_tasks from config-download.
    We are trying to HostPrepConfig resource that use OS::Heat::SoftwareConfig
    and the old fashion to run Ansible, for more native config-downlaod.
    undercloud_pre is the only service that needs HostPrepConfig now, so
    let's switch to config-download.

    It restarts keepalived container at each undercloud install & upgrade.
    Also it adds support for podman as it uses container_cli variable.

    Note: the workaround can still be removed once we have Keepalived 2.0.6
    but it won't happen before CentOS8 probably.

    Change-Id: I7454013c2e37058b5010a2a6cacfae0d0f873744
    Related-Bug: #1791238
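
The restart described in the commit above amounts to something like the following. This is a rough shell sketch, not the actual host_prep_tasks code: restart_keepalived is a hypothetical function, and the CONTAINER_CLI environment variable is an assumption standing in for the container_cli variable named in the commit. The container name 'keepalived' is the one shown in the 'docker container list' output earlier in this report.

```shell
#!/bin/sh
# Hypothetical sketch: restart the keepalived container with whichever
# container CLI is in use. CONTAINER_CLI is an illustrative environment
# variable, not something the installer is known to export.
restart_keepalived() {
    cli="${CONTAINER_CLI:-docker}"
    # Both docker and podman provide a `restart` subcommand.
    "$cli" restart keepalived
}
```

Running 'CONTAINER_CLI=podman; restart_keepalived' would then use podman, and leaving the variable unset falls back to docker.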

This issue was fixed in the openstack/tripleo-heat-templates 9.1.0 release.
