Adding a new compute starts new service containers on controller

Bug #1805410 reported by Cédric Jeanneret on 2018-11-27
Affects: tripleo | Status: Fix Released | Importance: High | Assigned to: Bogdan Dobrelya | Milestone: stein-2

Bug Description

Hello,

Infra:
- 1 undercloud
- 1 controller
- 1 compute , then 2
- tripleo master
- podman as container engine

The initial deploy works as expected, the overcloud is deployed on the two selected machines.
But when I add the second compute, I have the following issue:

If I do not set --skip-deploy-identifier, a new mysql container is started on the controller, and this breaks the whole thing due to locking on the DB (namely, keystone crashed).

Apparently, something doesn't properly detect that the controller is already deployed, or something like that, and we end up with new containers while the original ones are still here, running and happy to live.

Changed in tripleo:
status: Triaged → Incomplete
Cédric Jeanneret (cjeanner) wrote :

Some logs and info (took time to re-run the whole thing)

In the deploy logs:
        "Running container: keystone_bootstrap",
        "$ podman ps -a --filter label=container_name=keystone --filter label=config_id=tripleo_step3 --format {{.Names}}",
        "keystone-vmn4c9po",
        "keystone",
        "$ podman exec --user=root keystone-vmn4c9po /usr/bin/bootstrap_host_exec keystone keystone-manage bootstrap --bootstrap-password vFViRs1DFjwJrmDz4I1ltcpgZ",
        "cannot exec into container that is not running",
        "Error running ['podman', 'exec', '--user=root', u'keystone-vmn4c9po', '/usr/bin/bootstrap_host_exec', 'keystone', 'keystone-manage', 'bootstrap', '--bootstrap-password', 'vFViRs1DFjwJrmDz4I1ltcpgZ']. [125]",
        "stderr: cannot exec into container that is not running",

On controller-0, we can see those NEW containers:
d64303343f36 docker.io/tripleomaster/centos-binary-keystone:current-tripleo /bin/bash -c /usr... 5 minutes ago Up 5 minutes ago keystone_cron-8p9o9k0h
b674af359a06 docker.io/tripleomaster/centos-binary-mariadb:current-tripleo kolla_start 8 minutes ago Up About a minute ago mysql-kw4r3w5x
069b6ca20a83 docker.io/tripleomaster/centos-binary-haproxy:current-tripleo kolla_start 10 minutes ago Up 10 minutes ago haproxy-nse1lud1
c2f11cba5a32 docker.io/tripleomaster/centos-binary-keepalived:current-tripleo /usr/local/bin/ko... 10 minutes ago Up 10 minutes ago keepalived-qapg6ssa

While the older ones are still running, and apparently in good shape:
e59cdeacdce9 docker.io/tripleomaster/centos-binary-keystone:current-tripleo /bin/bash -c /usr... About an hour ago Up About an hour ago keystone_cron
25d9bb139ed4 docker.io/tripleomaster/centos-binary-mariadb:current-tripleo kolla_start About an hour ago Up About an hour ago mysql
96919bdb2ac2 docker.io/tripleomaster/centos-binary-haproxy:current-tripleo kolla_start About an hour ago Up About an hour ago haproxy
8cca1f0d47cc docker.io/tripleomaster/centos-binary-keepalived:current-tripleo /usr/local/bin/ko... About an hour ago Up About an hour ago keepalived

I'm wondering if I4386b155a4bdba430dc350914db7a6b6fdf92ac0[1] could do that kind of thing?

Having multiple mysqld processes hitting the very same DB creates this issue in service log:
2018-11-27 15:41:37 140499793242304 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11
2018-11-27 15:41:37 140499793242304 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files.

Also, regarding the keystone container, if we `podman logs keystone-vmn4c9po` we can see this:
[Tue Nov 27 15:32:23.781274 2018] [alias:warn] [pid 9] AH00671: The Alias directive in /etc/httpd/conf.d/autoindex.conf at line 21 will probably never match because it overlaps an earlier Alias.
(98)Address already in use: AH00072: make_sock: could not bind to address 192.168.24.8:35357
no listening...


summary: - Adding a new compute starts a new "mysql" container on controller
+ Adding a new compute starts a new service containers on controller
summary: - Adding a new compute starts a new service containers on controller
+ Adding a new compute starts new service containers on controller
Alex Schultz (alex-schultz) wrote :

This happens on redeploys if the existing container was run (but is dead). This seems to be a bug in paunch where the old container is not relaunched or cleaned up, so paunch is creating a new container instance that follows the <containername>-<randomchars> pattern. The latter name comes from the paunch code https://github.com/openstack/paunch/blob/master/paunch/runner.py#L98-L104
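[Editor's note] The `<containername>-<randomchars>` pattern referenced above can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual paunch runner code; the function name and suffix length are hypothetical:

```python
import random
import string


def unique_container_name(base_name, suffix_len=8):
    """Append a short random suffix to a container name, so a new
    container can be started even though one with the plain name
    already exists (e.g. "keystone" -> "keystone-vmn4c9po").
    Illustrative sketch only; details are not copied from paunch."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = ''.join(random.choice(alphabet) for _ in range(suffix_len))
    return '{}-{}'.format(base_name, suffix)
```

The side effect described in this bug follows directly: if the tool falls back to a fresh unique name instead of renaming or removing the existing container, both containers end up running side by side.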

Cédric Jeanneret (cjeanner) wrote :

@Alex: nope, in my case, the "old" container was running as expected. Proof: the keystone port was already occupied, preventing the keystone-<blah> container from starting and using that very same port.

Cédric Jeanneret (cjeanner) wrote :

Another thing: I just deployed an undercloud, then ran an `openstack undercloud upgrade`, and I end up with the following situation:
sudo podman ps -a | grep keystone
58d0db49a66b docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c /bin/bash -c /usr... 13 minutes ago Up 13 minutes ago keystone_cron-8tprne4k
49cccd9d16b2 docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c kolla_start 13 minutes ago Exited (1) 12 minutes ago keystone-s99dch8g
17382f319bb2 docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c /usr/bin/bootstra... 13 minutes ago Exited (0) 13 minutes ago keystone_db_sync-srgb7fip
6ee00f10c833 docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c /bin/bash -c chow... 16 minutes ago Exited (0) 16 minutes ago keystone_init_log-9wayk053
8b075f747698 docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c /bin/bash -c /usr... About an hour ago Up About an hour ago keystone_cron
c19a540e2a3e docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c kolla_start About an hour ago Up About an hour ago keystone
925cc11c5d9c docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c /usr/bin/bootstra... About an hour ago Exited (0) About an hour ago keystone_db_sync
50ef09040c85 docker.io/tripleomaster/centos-binary-keystone:618d3ab83cd319e03fac86c1d6de510ef4a5134b_be9e0d5c /bin/bash -c chow... About an hour ago Exited (0) About an hour ago keystone_init_log

In short: I now have 2 keystone_cron containers:
13 minutes ago Up 13 minutes ago keystone_cron-8tprne4k
About an hour ago Up About an hour ago keystone_cron

Also, I have multiple, duplicated containers:
keystone_db_sync vs keystone_db_sync-srgb7fip
keystone_init_log vs keystone_init_log-9wayk053
keystone vs keystone-s99dch8g (exited 1 btw)

So yeah. We have a big, big issue, and idempotency is broken for some reason. I actually see two locations where we get the <random> suffix at the end of the container name:
- the one pointed out by Alex
- the other one in t-h-t "docker-puppet.py".
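[Editor's note] The idempotency failure described above comes down to the apply step not recognizing already-deployed containers, whether they run under their plain config name or under an ephemeral random-suffixed variant. A hedged sketch of the kind of check involved (the function is hypothetical, not the paunch code):

```python
def is_already_applied(existing_names, config_name):
    """Return True if a container for `config_name` already exists,
    either under the plain name ("keystone") or under an ephemeral
    "<name>-<random>" variant ("keystone-vmn4c9po"). An idempotent
    apply step would skip creation in both cases instead of starting
    a duplicate. Illustrative only."""
    for name in existing_names:
        if name == config_name or name.startswith(config_name + '-'):
            return True
    return False
```

Note the underscore vs hyphen distinction matters here: "keystone_cron" is a different service than "keystone", so only the "<name>-<random>" pattern should match.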

Changed in tripleo:
status: Incomplete → Triaged
Changed in tripleo:
importance: Medium → High
Changed in tripleo:
milestone: stein-3 → stein-2
Dan Prince (dan-prince) wrote :

Have there been any recent changes to the deployment identifier code? If so I'd start there.

Emilien Macchi (emilienm) wrote :

probably not https://review.openstack.org/#/c/619759/ - I just reproduced on the undercloud, where the stack isn't updated but recreated every time. I wonder if it's because we need https://review.openstack.org/#/c/614290/. Trying the patch now.

Emilien Macchi (emilienm) wrote :

so with https://review.openstack.org/#/c/614290/ I managed to redeploy without error.

Emilien Macchi (emilienm) wrote :

so when testing with current (and not current-tripleo), it doesn't work, even with https://review.openstack.org/#/c/614290/ - so something really broke lately.

Cédric Jeanneret (cjeanner) wrote :

nope, I doubt 602969 has any side effect... I also tested with the label workaround, but it didn't work.

Moreover, I didn't see that issue while using the docker engine - it's limited to podman only.

Bogdan Dobrelya (bogdando) wrote :

It is likely related to the missing rename_container implementation for podman. I can't think of other docker vs podman differences we have in paunch.

Fix proposed to branch: master
Review: https://review.openstack.org/621607

Changed in tripleo:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Emilien Macchi (emilienm) wrote :

I couldn't reproduce the bug with https://review.openstack.org/#/c/614290/. Closing it.

Changed in tripleo:
importance: High → Medium
status: In Progress → Fix Released
Bogdan Dobrelya (bogdando) wrote :

@Emilien, the bug is a race condition and may only happen on consecutive executions of 'paunch apply', which attempts to rename containers first. It needs to be executed hundreds of times to really confirm there is no race any more.

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → nobody
importance: Medium → High

Reviewed: https://review.openstack.org/621607
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=510f0913539e92d2e874ae97efe0606fd277ad4b
Submitter: Zuul
Branch: master

commit 510f0913539e92d2e874ae97efe0606fd277ad4b
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 3 16:23:32 2018 +0100

    Implement podman rename via re-apply of containers

    To work around the missing container rename feature of podman, implement
    renaming by removing the original container and re-applying it from
    the same configs but using the new name.

    This fixes idempotency issues when service containers are executed
    under ephemeral names created via paunch's unique container name
    generator, while they are expected to be executed under their wanted
    config names.

    Change-Id: If851604d25b6c7982d950bb9e13dceada3bfc161
    Closes-Bug: #1805410
    Signed-off-by: Bogdan Dobrelya <email address hidden>
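[Editor's note] The rename-via-re-apply workaround in the commit above can be sketched as a podman command sequence. This helper and its argument handling are illustrative assumptions, not the actual paunch implementation:

```python
def podman_rename_commands(old_name, new_name, image, run_args=None):
    """Emulate a container rename for podman versions lacking one:
    stop and remove the container running under the ephemeral name,
    then re-create it from the same config under the wanted name.
    Returns the command lists rather than executing them, so the
    sequence can be inspected. Illustrative sketch only."""
    run_args = run_args or []
    return [
        ['podman', 'stop', old_name],
        ['podman', 'rm', old_name],
        ['podman', 'run', '--detach', '--name', new_name] + run_args + [image],
    ]
```

Because the container is removed and re-created rather than renamed in place, any state not kept in volumes is lost, which is why the fix re-applies the container from the same configs.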

Changed in tripleo:
assignee: nobody → Bogdan Dobrelya (bogdando)

This issue was fixed in the openstack/paunch 4.3.0 release.
