Unable to bootstrap pacemaker cluster when undercloud is containerized with ipv6 overcloud

Bug #1774898 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi

Bug Description

Environment: featureset035 (3 controllers, 1 compute, ipv6 control plane) with containerized undercloud

The deployment fails at step 1:
https://logs.rdoproject.org/16/566916/11/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zb3297f9fb4a44a10b34f7afa1b9e860d/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2018-06-03_04_20_37

When bootstrapping the pacemaker cluster:
https://logs.rdoproject.org/16/566916/11/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zb3297f9fb4a44a10b34f7afa1b9e860d/overcloud-controller-0/var/log/journal.txt.gz#_Jun_03_04_26_15

Error: Unable to communicate with overcloud-controller-0
Error: Unable to communicate with overcloud-controller-1
Error: Unable to communicate with overcloud-controller-2

It only fails when the containerized undercloud is enabled.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → rocky-2
tags: added: containers
Michele Baldessari (michele) wrote :

Jun 03 03:22:03 overcloud-controller-0 systemd[1]: pcsd.service start operation timed out. Terminating.
Jun 03 03:22:24 overcloud-controller-0 ntpd_intres[30308]: host name not found: pool.ntp.org
Jun 03 03:22:26 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.
Jun 03 03:22:26 overcloud-controller-0 systemd[1]: Unit pcsd.service entered failed state.
Jun 03 03:22:26 overcloud-controller-0 systemd[1]: pcsd.service failed.
Jun 03 03:22:26 overcloud-controller-0 puppet-user[29328]: Systemd start for pcsd failed!

So this is the usual 'pcsd needs working DNS, otherwise its requests time out' problem. We are likely timing out DNS queries...

https://logs.rdoproject.org/16/566916/11/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zb3297f9fb4a44a10b34f7afa1b9e860d/overcloud-controller-0/etc/resolv.conf.txt.gz has:
; generated by /usr/sbin/dhclient-script
search localdomain
nameserver 38.145.33.91
nameserver 38.145.32.66
nameserver 38.145.32.79

So https://logs.rdoproject.org/16/566916/11/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zb3297f9fb4a44a10b34f7afa1b9e860d/overcloud-controller-0/var/log/host_info.txt.gz has:
+ ip route
default via 192.168.24.1 dev eth0
169.254.169.254 via 192.168.24.1 dev eth0
172.16.0.0/24 dev br-tenant proto kernel scope link src 172.16.0.10
172.31.0.0/24 dev docker0 proto kernel scope link src 172.31.0.1
192.168.24.0/24 dev eth0 proto kernel scope link src 192.168.24.17

So I presume the undercloud, when deployed via containers, is blocking something around udp/tcp 53 (DNS)?
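
One way to check the DNS-timeout theory from a controller is to list the configured nameservers and time a lookup. The following is only a minimal sketch, assuming Python 3 is available on the node; the helper names `parse_nameservers` and `time_lookup` are made up for illustration and are not part of TripleO or pcsd.

```python
import socket
import time


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only "nameserver <address>" lines matter; comments and
        # "search" lines are skipped.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once and return (ok, seconds_elapsed).

    `resolver` is injectable (hypothetical parameter) so the timing
    logic can be exercised without network access.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return ok, time.monotonic() - start
```

Calling `time_lookup("pool.ntp.org")` on the controller and seeing a failure after many seconds, rather than an immediate answer, would be consistent with DNS packets being dropped on the way out through the undercloud.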

Bogdan Dobrelya (bogdando) wrote :

@Michele, but it works for IPv4, in ovb fs031. This should be something IPv6 specific.

Michele Baldessari (michele) wrote :

Aye, could it be that since we are deploying an ipv6 overcloud, the containerized undercloud is missing some ipv4 rules? From https://logs.rdoproject.org/16/566916/11/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Zb3297f9fb4a44a10b34f7afa1b9e860d/undercloud/var/log/host_info.txt.gz I see not a single rule allowing port 53 (DNS) packets to go through.

Bogdan Dobrelya (bogdando) wrote :

sudo os-collect-config --print shows that the following entry is missing:
  "masquerade_networks": [
    "192.168.24.0/24"
  ],

Compare it to https://logs.rdoproject.org/30/570230/1/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/Z99566eb66bf540f2b1be6fbb3f15f7df/undercloud/var/log/host_info.txt.gz

So that's probably the root cause.
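
A quick way to confirm whether the key is present is to search the printed configuration for it. This is a rough sketch, assuming the output of os-collect-config --print has been saved as JSON text; the nesting of the real output is not shown in this report, so the search is recursive and makes no assumption about where "masquerade_networks" lives. The function names are hypothetical.

```python
import json


def find_key(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending: the key may also appear deeper down.
            found.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


def find_key_in_text(json_text, key):
    """Parse JSON text and return all values found under `key`."""
    return find_key(json.loads(json_text), key)
```

An empty list from `find_key_in_text(output, "masquerade_networks")` would match what is described above for the failing undercloud.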

Bogdan Dobrelya (bogdando) wrote :

The passing instack job contains some FORWARD rules that are missing from the containerized UC, like:
-A FORWARD -d 192.168.24.0/24 -m state --state NEW -j ACCEPT
-A FORWARD -s 192.168.24.0/24 -m state --state NEW -j ACCEPT

Bogdan Dobrelya (bogdando) wrote :

More of the missing forward rules:
+-A DOCKER-ISOLATION -j RETURN
+-A FORWARD -i docker0 -o docker0 -j ACCEPT
+-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
+-A FORWARD -j DOCKER-ISOLATION
+-A FORWARD -o docker0 -j DOCKER
+-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
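
The comparison between the passing and failing jobs can be automated. Below is a minimal sketch, assuming both rule sets are available as iptables-save style text (one "-A CHAIN ..." rule per line); `rule_lines` and `missing_rules` are illustrative names, not an existing tool.

```python
def rule_lines(dump_text):
    """Return the set of '-A ...' rule lines from iptables-save style text."""
    found = set()
    for line in dump_text.splitlines():
        line = line.strip()
        # Skip table headers (*filter), chain policies (:INPUT ...),
        # comments and COMMIT lines; keep only appended rules.
        if line.startswith("-A "):
            found.add(line)
    return found


def missing_rules(passing_dump, failing_dump):
    """Rules present in the passing dump but absent from the failing one."""
    return sorted(rule_lines(passing_dump) - rule_lines(failing_dump))
```

The comparison is purely textual, so two rules that are equivalent but written with options in a different order would still be reported as different.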

Bogdan Dobrelya (bogdando) wrote :

I cannot reproduce the missing iptables rules on my undercloud deployed with the repro script I took from the failed job above.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/572151

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
Bogdan Dobrelya (bogdando) wrote :

With the proposed patch, https://review.openstack.org/#/c/566916/13 still fails with (/Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]/returns) Error: Unable to communicate with overcloud-controller-*

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/572151
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=32ea5028fd2d4969ef1f2b089fb4ca6ef0dee8b1
Submitter: Zuul
Branch: master

commit 32ea5028fd2d4969ef1f2b089fb4ca6ef0dee8b1
Author: Emilien Macchi <email address hidden>
Date: Mon Jun 4 08:46:29 2018 -0700

    undercloud: enable KernelIpNonLocalBind

    We need KernelIpNonLocalBind on the undercloud to bind non local ips
    among other ip forward options. This sysctl parameter was managed by
    instack-undercloud but never ported to the containerized undercloud.
    We need the same sysctl parameters for parity with non containerized
    undercloud.

    Change-Id: Idd3d432b8f7eb573d94cd56be8e05614510ebddf
    Related-Bug: #1774898
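
To see whether the setting took effect on a deployed undercloud, the kernel values can be read back directly. This is only a sketch under an assumption not stated in the commit message: that KernelIpNonLocalBind ends up controlling the net.ipv4.ip_nonlocal_bind and net.ipv6.ip_nonlocal_bind sysctls. The function name and the `proc_sys` parameter are hypothetical.

```python
import os


def read_nonlocal_bind(proc_sys="/proc/sys"):
    """Return {'ipv4': value, 'ipv6': value} for ip_nonlocal_bind.

    A value is None when the corresponding file cannot be read
    (for example, when the kernel does not expose it).
    """
    result = {}
    for family in ("ipv4", "ipv6"):
        path = os.path.join(proc_sys, "net", family, "ip_nonlocal_bind")
        try:
            with open(path) as handle:
                result[family] = handle.read().strip()
        except OSError:
            result[family] = None
    return result
```

Under the assumption above, a value of "1" for both families would indicate that non-local binds are allowed.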

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
status: In Progress → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/586531

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.openstack.org/586531
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8c19bd04b49c57a3ff753cc2ec84fd969af82182
Submitter: Zuul
Branch: stable/queens

commit 8c19bd04b49c57a3ff753cc2ec84fd969af82182
Author: Emilien Macchi <email address hidden>
Date: Mon Jun 4 08:46:29 2018 -0700

    undercloud: enable KernelIpNonLocalBind

    We need KernelIpNonLocalBind on the undercloud to bind non local ips
    among other ip forward options. This sysctl parameter was managed by
    instack-undercloud but never ported to the containerized undercloud.
    We need the same sysctl parameters for parity with non containerized
    undercloud.

    Change-Id: Idd3d432b8f7eb573d94cd56be8e05614510ebddf
    Related-Bug: #1774898
    (cherry picked from 32ea5028fd2d4969ef1f2b089fb4ca6ef0dee8b1)
    Conflicts:
        environments/undercloud.yaml

tags: added: in-stable-queens