tripleo ansible sometimes does not run certain containers

Bug #1878405 reported by Emilien Macchi
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi
Milestone: (none)

Bug Description

Environment: scenario002 (standalone) on master:

2020-05-13T02:57:55.763617389+00:00 stderr F ++ cat /run_command
2020-05-13T02:57:55.765984144+00:00 stderr F + CMD=/usr/sbin/pacemaker_remoted
2020-05-13T02:57:55.766026031+00:00 stderr F + ARGS=
2020-05-13T02:57:55.766057692+00:00 stderr F + sudo kolla_copy_cacerts
2020-05-13T02:57:55.786655396+00:00 stderr F + [[ ! -n '' ]]
2020-05-13T02:57:55.786655396+00:00 stderr F + . kolla_extend_start
2020-05-13T02:57:55.786865676+00:00 stderr F ++ : /var/log/kolla/rabbitmq
2020-05-13T02:57:55.786939972+00:00 stderr F ++ [[ -n '' ]]
2020-05-13T02:57:55.786963710+00:00 stderr F ++ [[ ! -d /var/log/kolla/rabbitmq ]]
2020-05-13T02:57:55.787008809+00:00 stderr F ++ mkdir -p /var/log/kolla/rabbitmq
2020-05-13T02:57:55.790248997+00:00 stderr F +++ stat -c %a /var/log/kolla/rabbitmq
2020-05-13T02:57:55.792751083+00:00 stderr F ++ [[ 755 != \7\5\5 ]]
2020-05-13T02:57:55.792832331+00:00 stderr F + echo 'Running command: '\''/usr/sbin/pacemaker_remoted'\'''
2020-05-13T02:57:55.792857663+00:00 stdout F Running command: '/usr/sbin/pacemaker_remoted'
2020-05-13T02:57:55.792890328+00:00 stderr F + exec /usr/sbin/pacemaker_remoted
2020-05-13T02:57:55.803774116+00:00 stderr F (crm_add_logfile) error: Directory '/var/log/pacemaker' does not exist: logging to '/var/log/pacemaker/pacemaker.log' is disabled
2020-05-13T02:57:55.804483726+00:00 stderr F (crm_log_init) info: Changed active directory to /var/lib/pacemaker/cores
2020-05-13T02:57:55.804483726+00:00 stderr F (main) notice: Starting Pacemaker remote executor
2020-05-13T02:57:55.804483726+00:00 stderr F (qb_ipcs_us_publish) info: server name: lrmd
2020-05-13T02:57:55.804964007+00:00 stderr F (pcmk__init_tls_dh) info: Generating Diffie-Hellman parameters with 2048-bit prime for TLS
2020-05-13T02:57:57.799755084+00:00 stderr F (qb_ipcs_us_publish) info: server name: cib_ro
2020-05-13T02:57:57.800241641+00:00 stderr F (qb_ipcs_us_publish) info: server name: cib_rw
2020-05-13T02:57:57.800348900+00:00 stderr F (qb_ipcs_us_publish) info: server name: cib_shm
2020-05-13T02:57:57.800458108+00:00 stderr F (qb_ipcs_us_publish) info: server name: attrd
2020-05-13T02:57:57.800565196+00:00 stderr F (qb_ipcs_us_publish) info: server name: stonith-ng
2020-05-13T02:57:57.800695201+00:00 stderr F (qb_ipcs_us_publish) info: server name: crmd
2020-05-13T02:57:57.801369089+00:00 stderr F (main) notice: Pacemaker remote executor successfully started and accepting connections
2020-05-13T02:57:58.697776262+00:00 stderr F (crm_remote_accept) info: New remote connection from ::ffff:192.168.24.1
2020-05-13T02:57:58.698171725+00:00 stderr F (lrmd_remote_listen) info: Remote client pending authentication | 0x56161b600c40 id: 0a4f5a9b-c211-4b26-93b1-54f8aa59b4b0
2020-05-13T02:57:59.241997117+00:00 stderr F (remoted__read_handshake_data) notice: Remote client connection accepted
2020-05-13T02:57:59.354186683+00:00 stderr F (process_lrmd_get_rsc_info) info: Resource 'rabbitmq' not found (0 active resources)
2020-05-13T02:57:59.435843660+00:00 stderr F (process_lrmd_get_rsc_info) info: Resource 'rabbitmq:0' not found (0 active resources)
2020-05-13T02:57:59.517826701+00:00 stderr F (process_lrmd_rsc_register) info: Added 'rabbitmq' to the rsc list (1 active resources)
2020-05-13T02:58:00.959147005+00:00 stderr F (log_execute) info: executing - rsc:rabbitmq action:start call_id:13
2020-05-13T02:58:14.710017979+00:00 stderr F (operation_finished) notice: rabbitmq_start_0:208:stderr [ Call cib_query failed (-6): No such device or address ]
2020-05-13T02:58:14.710017979+00:00 stderr F (operation_finished) notice: rabbitmq_start_0:208:stderr [ Call cib_query failed (-6): No such device or address ]
2020-05-13T02:58:14.710017979+00:00 stderr F (log_finished) info: finished - rsc:rabbitmq action:start call_id:13 pid:208 exit-code:0 exec-time:13751ms queue-time:0ms
2020-05-13T02:58:14.754021075+00:00 stderr F (log_execute) info: executing - rsc:rabbitmq action:notify call_id:46
2020-05-13T02:58:14.770998552+00:00 stderr F (log_finished) info: finished - rsc:rabbitmq action:notify call_id:46 pid:1337 exit-code:0 exec-time:17ms queue-time:0ms

https://82823ad7229314245818-64a8cd117cc6b0c774af78e64d41d650.ssl.cf2.rackcdn.com/726927/3/check/tripleo-ci-centos-8-scenario002-standalone/86cc8df/logs/undercloud/var/log/containers/stdouts/rabbitmq-bundle.log

Observed on https://review.opendev.org/#/c/726927/, where we force the containers to return exit code 0. Since that check is new, we might have had this problem for a while without it blocking the deployment; now, with that patch, it will. So we need to figure out this bundle issue.
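
For context, a rough sketch of the kind of exit-code enforcement that patch introduces; the task structure and names below are illustrative, not the actual tripleo_container_manage code:

# Illustrative sketch only: fail the deployment when a transient container
# exited with a non-zero code.
- name: Inspect the rabbitmq_init_bundle container
  command: podman inspect rabbitmq_init_bundle
  register: inspect_result
  changed_when: false

- name: Fail if the container did not exit with code 0
  fail:
    msg: rabbitmq_init_bundle failed to start, check logs in /var/log/containers/stdouts/
  when: (inspect_result.stdout | from_json | first).State.ExitCode | int != 0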

tags: added: ussuri-backport-potential
Revision history for this message
Emilien Macchi (emilienm) wrote :

The rabbitmq_init_bundle container isn't supposed to stop, so there is no exit code. This bug is invalid.

Changed in tripleo:
status: Triaged → Invalid
status: Invalid → Triaged
Revision history for this message
Emilien Macchi (emilienm) wrote :

Correction, this is a valid bug. We're looking into it.

Revision history for this message
Michele Baldessari (michele) wrote :

OK, so the issue is that the rabbitmq_bootstrap container was never run at all. Not sure why, though.
We see that this has been created:
https://82823ad7229314245818-64a8cd117cc6b0c774af78e64d41d650.ssl.cf2.rackcdn.com/726927/3/check/tripleo-ci-centos-8-scenario002-standalone/86cc8df/logs/undercloud/var/lib/tripleo-config/container-startup-config/step_1/rabbitmq_bootstrap.json

But if you look into https://82823ad7229314245818-64a8cd117cc6b0c774af78e64d41d650.ssl.cf2.rackcdn.com/726927/3/check/tripleo-ci-centos-8-scenario002-standalone/86cc8df/logs/undercloud/var/log/extra/podman/containers/ we see mysql_bootstrap and a bunch of other transient containers, but *not* rabbitmq_bootstrap. So it was never run, and we need to figure out why.

Revision history for this message
Emilien Macchi (emilienm) wrote :

FYI, it didn't reproduce on https://review.opendev.org/#/c/726927/ - maybe some transient failures... I'll keep this open just in case.

tags: added: idempotency
Revision history for this message
Michele Baldessari (michele) wrote :

So now I have reproduced it with https://a8cec92243857c4a7cd6-bf9e01299685171685ed3d46ea9b2b77.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-scenario002-standalone/cff6486/logs/undercloud/ via https://review.opendev.org/#/c/725782/

Same symptoms:

TASK [tripleo_container_manage : Print the containers that failed to start] ****
Monday 18 May 2020 11:13:01 +0000 (0:00:00.191) 0:36:12.360 ************
fatal: [standalone]: FAILED! => changed=false
  msg: '[''rabbitmq_init_bundle''] failed to start, check logs in /var/log/containers/stdouts/'

And we can see that:
A) rabbitmq_bootstrap was never run: https://a8cec92243857c4a7cd6-bf9e01299685171685ed3d46ea9b2b77.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-scenario002-standalone/cff6486/logs/undercloud/var/log/containers/stdouts/
B) rabbitmqctl fails due to a wrong cookie (a consequence of A): https://a8cec92243857c4a7cd6-bf9e01299685171685ed3d46ea9b2b77.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-scenario002-standalone/cff6486/logs/undercloud/var/log/containers/rabbitmq/rabbit%40standalone.log:
...
2020-05-18 11:14:17.080 [error] <0.1699.0> ** Connection attempt from disallowed node 'rabbitmqcli-5690-rabbit@standalone' **
....

summary: - rabbitmq_init_bundle fails to start "Call cib_query failed"
+ tripleo ansible sometimes does not run certain containers
Revision history for this message
Michele Baldessari (michele) wrote :

OK, so now I have proof that certain containers are sometimes simply skipped (no idea why just yet, but I suspect the concurrency somewhere in tripleo_container_manage); i.e., it's not that we merely fail to collect logs from some containers or the like:
1) This is Emilien's working review, https://review.opendev.org/#/c/726927/, and in scenario002 we can see in the journal that the rabbitmq_bootstrap container was run:
$ curl -s https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_956/726927/6/check/tripleo-ci-centos-8-scenario002-standalone/956b199/logs/undercloud/var/log/extra/journal.txt |zcat|grep -i -e 'podman.*rabbitmq_bootstrap'
May 17 05:22:08 standalone.localdomain podman[140353]: 2020-05-17 05:22:08.644623185 +0000 UTC m=+0.207716609 container create 45ff272ff0104cf214919b2a9fcbb793683100f7b9b66109561b08e59092e9e2 (image=192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:d1a79ae6de3c7f23f42a0832e580b695-updated-20200517043506, name=rabbitmq_bootstrap)
May 17 05:22:09 standalone.localdomain podman[140353]: 2020-05-17 05:22:09.269117157 +0000 UTC m=+0.832210644 container init 45ff272ff0104cf214919b2a9fcbb793683100f7b9b66109561b08e59092e9e2 (image=192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:d1a79ae6de3c7f23f42a0832e580b695-updated-20200517043506, name=rabbitmq_bootstrap)
May 17 05:22:09 standalone.localdomain podman[140353]: 2020-05-17 05:22:09.290500129 +0000 UTC m=+0.853594281 container start 45ff272ff0104cf214919b2a9fcbb793683100f7b9b66109561b08e59092e9e2 (image=192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:d1a79ae6de3c7f23f42a0832e580b695-updated-20200517043506, name=rabbitmq_bootstrap)
May 17 05:22:09 standalone.localdomain podman[140573]: 2020-05-17 05:22:09.653798258 +0000 UTC m=+0.123963757 container died 45ff272ff0104cf214919b2a9fcbb793683100f7b9b66109561b08e59092e9e2 (image=192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:d1a79ae6de3c7f23f42a0832e580b695-updated-20200517043506, name=rabbitmq_bootstrap)
May 17 05:22:09 standalone.localdomain podman[140573]: 2020-05-17 05:22:09.766776175 +0000 UTC m=+0.236941989 container cleanup 45ff272ff0104cf214919b2a9fcbb793683100f7b9b66109561b08e59092e9e2 (image=192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:d1a79ae6de3c7f23f42a0832e580b695-updated-20200517043506, name=rabbitmq_bootstrap)

2) This is my review where rabbitmq bootstrap was never run:
$ curl -s https://a8cec92243857c4a7cd6-bf9e01299685171685ed3d46ea9b2b77.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-scenario002-standalone/cff6486/logs/undercloud/var/log/extra/journal.txt |grep -i -e 'podman.*rabbitmq_bootstrap'
$

So it was never even attempted, even though it is clearly in the configured startup file:
https://a8cec92243857c4a7cd6-bf9e01299685171685ed3d46ea9b2b77.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-scenario002-standalone/cff6486/logs/undercloud/var/lib/tripleo-config/container-startup-config/step_1/rabbitmq_bootstrap.json

So the question is: why don't we run this container (and maybe others; Bogdan was mentioning oddities around swift containers not being run)? Is it because at step 1 there are four containers to be run (thr...

Revision history for this message
Michele Baldessari (michele) wrote :

Note that it is not always rabbitmq_bootstrap; I have also hit it with memcached never being run:
https://1ebea9ab2b131fd2f663-3c8db7d9a3092adc2fa7c0c039f7a5c8.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-undercloud-containers/5089370/

The deployment failed with:
May 18 10:50:04 undercloud.localdomain podman[131448]: Error: unable to find container memcached: no container with name or ID memcached found: no such container
May 18 10:50:04 undercloud.localdomain systemd[1]: tripleo_memcached.service: Control process exited, code=exited status=125

But even though memcached is specified in logs/undercloud/var/lib/tripleo-config/container-startup-config/step_1/, it was never run at all:
curl -s https://1ebea9ab2b131fd2f663-3c8db7d9a3092adc2fa7c0c039f7a5c8.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-undercloud-containers/5089370/logs/undercloud/var/log/extra/journal.txt |grep -i podman |grep -i -e 'name=memcached'

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Rabi Mishra (rabi) wrote :

Not sure if it's the same issue or a different one, but I see the following in one of the jobs.

2020-05-19 06:40:50 | TASK [tripleo_container_manage : Create facts for containers which changed or failed or which require rc check] ***
2020-05-19 06:40:50 | task path: /usr/share/ansible/roles/tripleo_container_manage/tasks/podman/create.yml:100
2020-05-19 06:40:50 | Tuesday 19 May 2020 06:40:50 +0000 (0:05:20.703) 0:31:01.814 ***********
2020-05-19 06:40:50 | ok: [centos-8-airship-kna1-0016650726] => changed=false
2020-05-19 06:40:50 | ansible_facts:
2020-05-19 06:40:50 | containers_changed:
2020-05-19 06:40:50 | - nova_compute_init_log
2020-05-19 06:40:50 | - nova_api_init_logs
2020-05-19 06:40:50 | - horizon_fix_perms
2020-05-19 06:40:50 | containers_failed:
2020-05-19 06:40:50 | - rabbitmq_init_bundle
2020-05-19 06:40:50 | containers_to_check:
2020-05-19 06:40:50 | - nova_compute_init_log
2020-05-19 06:40:50 | - nova_api_init_logs
2020-05-19 06:40:50 | - horizon_fix_perms
2020-05-19 06:40:50 | - rabbitmq_init_bundle

I can see rabbitmq_init_bundle both in containers_failed and containers_to_check.

A little earlier in the logs, puppet config generation does not show a config hash change for 'rabbitmq_init_bundle', which is kind of confusing.

https://f68e92826c849d66705e-baf4f807d40559415da582760ebf9456.ssl.cf5.rackcdn.com/729105/1/check/tripleo-ci-centos-8-containers-multinode/c35c277/logs/undercloud/home/zuul/overcloud_deploy.log

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
Revision history for this message
Emilien Macchi (emilienm) wrote :

Rabi, please take a look at https://review.opendev.org/729238
I made it so that failed containers are no longer included in the containers to check. I also documented the facts a bit better.
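
Roughly, the change amounts to subtracting the failed list when building the list to check; a minimal sketch with illustrative variable names (not the exact role code):

# Illustrative sketch: containers that already failed to be created or updated
# are excluded from the list whose return codes will be checked later.
- name: Create facts for containers which changed or failed or which require rc check
  set_fact:
    containers_to_check: "{{ (containers_changed | default([])) | difference(containers_failed | default([])) }}"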

Revision history for this message
Emilien Macchi (emilienm) wrote :

Michele: you'll notice that the systemd files were created: https://1ebea9ab2b131fd2f663-3c8db7d9a3092adc2fa7c0c039f7a5c8.ssl.cf5.rackcdn.com/725782/12/check/tripleo-ci-centos-8-undercloud-containers/5089370/logs/undercloud/etc/systemd/system/tripleo_memcached.service

(handled in the same role).

I'm looking at the playbooks now, to investigate if there is a race somewhere.

Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :

With more debug:

TASK [tripleo_container_manage : Async container create/run] *******************
Wednesday 20 May 2020 02:11:38 +0000 (0:00:00.189) 0:23:26.556 *********
changed: [standalone] => (item={'rabbitmq_bootstrap': {'environment': {'KOLLA_BOOTSTRAP': True, 'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS', 'RABBITMQ_CLUSTER_COOKIE': 'WbjqGHt5hIY9XNSPUBET', 'TRIPLEO_DEPLOY_IDENTIFIER': '1589938845'}, 'image': '192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349', 'net': 'host', 'privileged': False, 'start_order': 0, 'volumes': ['/var/lib/kolla/config_files/rabbitmq.json:/var/lib/kolla/config_files/config.json:ro', '/var/lib/config-data/puppet-generated/rabbitmq:/var/lib/kolla/config_files/src:ro', '/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/var/lib/rabbitmq:/var/lib/rabbitmq:z']}})
changed: [standalone] => (item={'memcached': {'environment': {'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS'}, 'healthcheck': {'test': '/openstack/healthcheck'}, 'image': '192.168.24.1:8787/tripleomaster/centos-binary-memcached:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349', 'net': 'host', 'privileged': False, 'restart': 'always', 'start_order': 0, 'volumes': ['/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '/dev/log:/dev/log', '/etc/puppet:/etc/puppet:ro', '/var/lib/kolla/config_files/memcached.json:/var/lib/kolla/config_files/config.json:ro', '/var/lib/config-data/puppet-generated/memcached:/var/lib/kolla/config_files/src:rw,z', '/var/log/containers/memcached:/var/log/memcached:rw']}})
changed: [standalone] => (item={'mysql_data_ownership': {'command': ['chown', '-R', 'mysql:', '/var/lib/mysql'], 'detach': False, 'image': '192.168.24.1:8787/tripleomaster/centos-binary-mariadb:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349', 'net': 'host', 'start_order': 0, 'user': 'root', 'volumes': ['/var/lib/mysql:/var/lib/mysql:z']}})

TASK [tripleo_container_manage : Check podman create status] *******************
Wednesday 20 May 2020 02:11:40 +0000 (0:00:01.630) 0:23:28.187 *********
ok: [standalone] => (item={'started': 1, 'finished': 0, 'ansible_job_id': '646909186997.129760', 'results_file': '/tmp/.ansible_async/646909186997.129760', 'changed': True, 'failed': False, 'container_data': {'rabbitmq_bootstrap': {'environment': {'KOLLA_BOOTSTRAP': True, 'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS', 'RABBITMQ_CLUSTER_COOKIE': 'WbjqGHt5hIY9XNSPUBET', 'TRIPLEO_DEPLOY_IDENTIFIER': '1589938845'}, 'image': '192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349', 'net': 'host', 'privileged': False, 'start_order': 0, 'volumes': ['/var/lib/kolla/config_files/rabbitmq.json:/var/lib/kolla/config_files/config.json:ro', '/var/lib/config-data/puppet-generated/rabbitmq:/var/lib/kolla/co...

Revision history for this message
Emilien Macchi (emilienm) wrote :

We can see the podman_container module trying to create the container:

May 20 02:11:39 standalone.localdomain ansible-podman_container[129772]: Invoked with conmon_pidfile=/var/run/rabbitmq_bootstrap.pid debug=True detach=True env={'KOLLA_BOOTSTRAP': True, 'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS', 'RABBITMQ_CLUSTER_COOKIE': 'WbjqGHt5hIY9XNSPUBET', 'TRIPLEO_DEPLOY_IDENTIFIER': '1589938845'} etc_hosts={} image=192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349 interactive=False label={'config_id': 'tripleo_step1', 'container_name': 'rabbitmq_bootstrap', 'managed_by': 'tripleo_ansible', 'config_data': {'environment': {'KOLLA_BOOTSTRAP': True, 'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS', 'RABBITMQ_CLUSTER_COOKIE': 'WbjqGHt5hIY9XNSPUBET', 'TRIPLEO_DEPLOY_IDENTIFIER': '1589938845'}, 'image': '192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349', 'net': 'host', 'privileged': False, 'start_order': 0, 'volumes': ['/var/lib/kolla/config_files/rabbitmq.json:/var/lib/kolla/config_files/config.json:ro', '/var/lib/config-data/puppet-generated/rabbitmq:/var/lib/kolla/config_files/src:ro', '/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/var/lib/rabbitmq:/var/lib/rabbitmq:z']}} log_driver=k8s-file log_opt=path=/var/log/containers/stdouts/rabbitmq_bootstrap.log name=rabbitmq_bootstrap net=host privileged=False rm=False state=present stop_timeout=0 tty=False volume=['/var/lib/kolla/config_files/rabbitmq.json:/var/lib/kolla/config_files/config.json:ro', '/var/lib/config-data/puppet-generated/rabbitmq:/var/lib/kolla/config_files/src:ro', '/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/var/lib/rabbitmq:/var/lib/rabbitmq:z'] volumes_from=[] network=['host'] executable=podman force_restart=False image_strict=False recreate=False annotation=None authfile=None blkio_weight=None blkio_weight_device=None cap_add=None cap_drop=None cgroup_parent=None cgroupns=None cgroups=None cidfile=None cmd_args=None command=None cpu_period=None cpu_rt_period=None cpu_rt_runtime=None cpu_shares=None cpus=None cpuset_cpus=None cpuset_mems=None detach_keys=None device=None device_read_bps=None device_read_iops=None device_write_bps=None device_write_iops=None dns=None dns_option=None dns_search=None entrypoint=None env_file=None env_host=None expose=None gidmap=None group_add=None healthcheck=None healthcheck_interval=None healthcheck_retries=None healthcheck_start_period=None healthcheck_timeout=None hostname=None http_proxy=None image_volume=None init=None init_path=None ip=None ipc=None kernel_memory=None label_file=None memory=None memory_reservation=None memory_swap=None memory_swappiness=None mount=None no_hosts=None oom_kill_disable=None oom_score_adj=None pid=None pids_limit=None pod=None publish=None publish_all=None read_only=None read_only_tmpfs=None restart_policy=None rootfs=None security_opt=None shm_size=None sig_proxy=None stop_signal=None subgidname=None subuidname=None sysctl=None systemd=None tmpfs=None uidmap=None ulimit=None user=None userns=None uts=None workdir=None
May 20 02:11:39 standalone.localdomain ansible-async_wra...

Revision history for this message
Emilien Macchi (emilienm) wrote :

The issue seems to be in the async task:

ok: [standalone] => (item={'started': 1, 'finished': 0, 'ansible_job_id': '646909186997.129760', 'results_file': '/tmp/.ansible_async/646909186997.129760', 'changed': True, 'failed': False, 'container_data': {'rabbitmq_bootstrap': {'environment': {'KOLLA_BOOTSTRAP': True, 'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS', 'RABBITMQ_CLUSTER_COOKIE': 'WbjqGHt5hIY9XNSPUBET', 'TRIPLEO_DEPLOY_IDENTIFIER': '1589938845'}, 'image': '192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:0b5ec707ab4c049281536e2bfbb1984c-updated-20200520012349', 'net': 'host', 'privileged': False, 'start_order': 0, 'volumes': ['/var/lib/kolla/config_files/rabbitmq.json:/var/lib/kolla/config_files/config.json:ro', '/var/lib/config-data/puppet-generated/rabbitmq:/var/lib/kolla/config_files/src:ro', '/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/var/lib/rabbitmq:/var/lib/rabbitmq:z']}}, 'ansible_loop_var': 'container_data'})

It reports "ok" even though the task is "changed".
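
For reference, the mechanism in play is Ansible's fire-and-forget async: the create task is launched with async and poll: 0, and a later task reads the job status once. A minimal sketch (container and variable names are illustrative, not the real role code) of why a single status read can look "ok" while the work is still in flight:

# Illustrative sketch only.
- name: Async container create/run
  podman_container:
    name: rabbitmq_bootstrap
    image: "{{ rabbitmq_bootstrap_image }}"   # hypothetical variable
    state: present
  async: 120   # let the module run in the background for up to 120s
  poll: 0      # do not wait here; the result is collected by a later task
  register: create_job

- name: Check podman create status
  async_status:
    jid: "{{ create_job.ansible_job_id }}"
  register: create_status
  # Read once, 'started: 1, finished: 0' only says the background job was
  # launched; a later failure of the actual podman create goes unnoticed.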

Revision history for this message
Emilien Macchi (emilienm) wrote :

There are 2 issues:

- podman wrongly returns 0 when hitting this error: https://github.com/containers/libpod/blob/dc7d6f4818f4b986cfd15208d53f6765d8fad986/libpod/options.go#L1219 (May 20 12:02:25 standalone.localdomain ansible-podman_container[152375]: PODMAN-CONTAINER-DEBUG STDERR: Error: error checking path "/var/lib/config-data/puppet-generated/rabbitmq": stat /var/lib/config-data/puppet-generated/rabbitmq: no such file or directory)

- container-puppet-rabbitmq had not finished when the rabbitmq_bootstrap container was created, therefore /var/lib/config-data/puppet-generated/rabbitmq didn't exist yet. We need to figure out two things: why Puppet took 37 minutes to start doing something, and why tripleo_container_manage didn't catch that this container hadn't finished.

Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

Note that it took 37 seconds, not minutes, so it is more or less reasonable:
2020-05-20T12:01:39.732288961+00:00 stderr F + mkdir -p /etc/puppet
...
2020-05-20T12:02:26.590997720+00:00 stderr F + rm -rf /tmp/tmp.tD1rY9SPMb /tmp/tmp.rv4aUYMRLt

As for issue no. 1 (wrong podman error code), I just tried this on my downstream env:

[root@controller-0 ~]# podman run --user=root --net=host --rm undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:16.1_20200508.1 sh -c 'echo foo' ; echo $?
foo
0

[root@controller-0 ~]# podman run -v /nonexistant:/tmp/ --user=root --net=host --rm undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:16.1_20200508.1 sh -c 'echo foo' ; echo $?
Error: error checking path "/nonexistant": stat /nonexistant: no such file or directory
125

[root@controller-0 ~]# podman run -d -v /nonexistant:/tmp/ --user=root --net=host --rm undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:16.1_20200508.1 sh -c 'echo foo' ; echo $?
Error: error checking path "/nonexistant": stat /nonexistant: no such file or directory
125

[root@controller-0 ~]# rpm -q podman
podman-1.6.4-11.module+el8.2.0+6369+1f4293b4.x86_64

CI instead has podman-1.6.4-4.module_el8.1.0+298+41f9343a.x86_64, so I tried an older podman version and it seems correct to me as well?
 [root@undercloud-0 ~]# rpm -q podman
podman-1.6.4-2.module+el8.1.1+5363+bf8ff1af.x86_64
 [root@undercloud-0 ~]# podman run -it --net=host --user=root undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:20200226.1 sh -c 'echo foo' ; echo $?
foo
0
 [root@undercloud-0 ~]# podman run -it --net=host --user=root --rm -v /nonexistent:/tmp/ undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:20200226.1 sh -c 'echo foo' ; echo $?
Error: error checking path "/nonexistent": stat /nonexistent: no such file or directory
125
 [root@undercloud-0 ~]# podman run -d -it --net=host --user=root --rm -v /nonexistent:/tmp/ undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-api:20200226.1 sh -c 'echo foo' ; echo $?
Error: error checking path "/nonexistent": stat /nonexistent: no such file or directory
125

Unless, of course, my simple tests are not sufficiently close to the failure we've seen. Hohum, odd.

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
Emilien Macchi (emilienm) wrote :

https://review.opendev.org/729238 will probably fix the issue.
It will wait for the container-puppet-* containers to actually finish.
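
In other words, the status check needs to keep polling the async job until it actually reports finished. The generic Ansible pattern for that looks roughly like this (a sketch of the until/retries idiom, not the actual change; names are illustrative):

# Illustrative sketch: keep polling async_status until the background job has
# finished, so a failed or not-yet-created container cannot slip through.
- name: Wait for the container create job to finish
  async_status:
    jid: "{{ create_job.ansible_job_id }}"
  register: create_status
  until: create_status.finished
  retries: 30
  delay: 5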

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/729751

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/729238
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=d160d8c4b0e54dcb0089c859335415f6fa99d6cd
Submitter: Zuul
Branch: master

commit d160d8c4b0e54dcb0089c859335415f6fa99d6cd
Author: Emilien Macchi <email address hidden>
Date: Tue May 19 08:44:20 2020 -0400

    podman/create: make containers_to_check more realistic

    containers_to_check is a list of containers that run something and then
    exit with a return code. Later we will check what their return code is
    and make sure the containers didn't fail before we continue with the
    deployment.

    Also, if a container failed to be created or updated; we don't want to include
    it in the list that contains the containers in which we want to check if
    the returned RC is valid; since we already know the container didn't
    terminate correctly.

    Closes-Bug: #1878405
    Change-Id: Ia3a985d923c5f44babaae601f3ba62b6d48659da

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/729751
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=ad1f8a2dec20484b2446db85e20f40fa24e03ff8
Submitter: Zuul
Branch: stable/train

commit ad1f8a2dec20484b2446db85e20f40fa24e03ff8
Author: Emilien Macchi <email address hidden>
Date: Tue May 19 08:44:20 2020 -0400

    podman/create: make containers_to_check more realistic

    containers_to_check is a list of containers that run something and then
    exit with a return code. Later we will check what their return code is
    and make sure the containers didn't fail before we continue with the
    deployment.

    Also, if a container failed to be created or updated; we don't want to include
    it in the list that contains the containers in which we want to check if
    the returned RC is valid; since we already know the container didn't
    terminate correctly.

    Closes-Bug: #1878405
    Change-Id: Ia3a985d923c5f44babaae601f3ba62b6d48659da
    (cherry picked from commit d160d8c4b0e54dcb0089c859335415f6fa99d6cd)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 0.6.0

This issue was fixed in the openstack/tripleo-ansible 0.6.0 release.
