Can't run container rabbitmq_wait_bundle. Could not prefetch rabbitmq_user provider 'rabbitmqctl': Command is still failing after 180 seconds expired!

Bug #1949327 reported by Ananya Banerjee
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

The periodic-tripleo-ci-centos-9-scenario007-standalone-master job fails with:

2021-10-29 09:31:56,999 p=49516 u=root n=ansible | 2021-10-29 09:31:56.999071 | fa163e29-1189-1eb5-6d10-000000001642 | TASK | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_2
2021-10-29 10:15:56,077 p=49516 u=root n=ansible | 2021-10-29 10:15:56.075058 | | WARNING | ERROR: Can't run container rabbitmq_wait_bundle
stderr: + STEP=2
+ TAGS=file,file_line,concat,augeas,rabbitmq_policy,rabbitmq_user,rabbitmq_ready
+ CONFIG='include tripleo::profile::pacemaker::rabbitmq_bundle'
+ EXTRA_ARGS=
+ '[' -d /tmp/puppet-etc ']'
+ cp -a /tmp/puppet-etc/devices /tmp/puppet-etc/hieradata /tmp/puppet-etc/hiera.yaml /tmp/puppet-etc/modules /tmp/puppet-etc/puppet.conf /tmp/puppet-etc/ssl /etc/puppet
+ echo '{"step": 2}'
+ export FACTER_deployment_type=containers
+ FACTER_deployment_type=containers
+ set +e
+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,rabbitmq_policy,rabbitmq_user,rabbitmq_ready -e 'noop_resource('\''package'\''); include tripleo::profile::pacemaker::rabbitmq_bundle'
Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5
   (file: /etc/puppet/hiera.yaml)
Warning: Undefined variable '::deploy_config_name';
   (file & line not available)
Warning: The function 'hiera' is deprecated in favor of using 'lookup'. See https://puppet.com/docs/puppet/7.10/deprecated_language.html
   (file & line not available)
Warning: Unknown variable: 'used_promoted_max'. (file: /etc/puppet/modules/pacemaker/manifests/resource/bundle.pp, line: 185, column: 27)
Error: 'rabbitmqctl eval "lists:keymember(rabbit, 1, application:which_applications())." | grep -q true' returned 1 instead of one of [0]
Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: change from 'notrun' to ['0'] failed: 'rabbitmqctl eval "lists:keymember(rabbit, 1, application:which_applications())." | grep -q true' returned 1 instead of one of [0]
Error: Could not prefetch rabbitmq_user provider 'rabbitmqctl': Command is still failing after 180 seconds expired!
Warning: /Stage[main]/Tripleo::Profile::Base::Rabbitmq/Rabbitmq_user[guest]: Skipping because of failed dependencies
Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Skipping because of failed dependencies

https://logserver.rdoproject.org/08/36508/2/check/periodic-tripleo-ci-centos-9-scenario007-standalone-master/7f7f905/logs/undercloud/home/zuul-worker/ansible.log.txt.gz
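
For anyone reproducing this by hand, the Puppet errors above amount to a readiness command being retried until a timeout expires. Below is a minimal Python sketch of that pattern, not the Puppet provider's actual code. The command string is copied from the error message and the 180 seconds comes from the "still failing after 180 seconds" line. The helper names and the 5-second polling interval are assumptions made for illustration.

```python
import subprocess
import time

# Command copied from the Puppet error above. Everything else in this
# sketch (helper names, 5-second interval) is an illustrative assumption.
READY_CMD = (
    'rabbitmqctl eval '
    '"lists:keymember(rabbit, 1, application:which_applications())." '
    '| grep -q true'
)


def wait_until_ready(check, timeout=180.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def rabbit_is_running():
    """Run the readiness command once; exit status 0 means 'rabbit' is started."""
    result = subprocess.run(READY_CMD, shell=True,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling `wait_until_ready(rabbit_is_running)` inside the rabbitmq container returns `False` if the `rabbit` application never starts, which is consistent with the boot failure reported in the comments.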

Changed in tripleo:
importance: Undecided → Critical
status: New → Triaged
milestone: none → xena-3
tags: added: promotion-blocker
Ananya Banerjee (frenzyfriday) wrote :

https://logserver.rdoproject.org/08/36508/2/check/periodic-tripleo-ci-centos-9-scenario007-standalone-master/7f7f905/logs/undercloud/var/log/containers/rabbitmq/startup_err.gz

Configuring logger redirection

BOOT FAILED
===========
Exception during startup:

    supervisor:children_map/4 line 1171
    supervisor:'-start_children/2-fun-0-'/3 line 355
    supervisor:do_start_child/2 line 371
    supervisor:do_start_child_i/3 line 385
    rabbit_prelaunch:run_prelaunch_first_phase/0 line 27
    rabbit_prelaunch:do_run/0 line 111
    rabbit_prelaunch_dist:setup/1 line 15
    rabbit_prelaunch_dist:duplicate_node_check/1 line 51
error:{badmatch,
          {error,
              {{shutdown,
                   {failed_to_start_child,net_kernel,{'EXIT',nodistribution}}},
               {child,undefined,net_sup_dynamic,
                   {erl_distribution,start_link,
                       [[rabbit_prelaunch_925@localhost,shortnames],
                        false,net_sup_dynamic]},
                   permanent,1000,supervisor,
                   [erl_distribution]}}}}

Kernel pid terminated (application_controller) ({application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{badmatch,{error,{{shutdown,{failed_to_start_child,net_kernel,

Crash dump is being written to: erl_crash.dump...done

John Eckersberg (jeckersb) wrote :

Looks like it can't contact epmd for some reason. Usually name resolution is the culprit when that happens. I noticed here:

https://logserver.rdoproject.org/08/36508/2/check/periodic-tripleo-ci-centos-9-scenario007-standalone-master/7f7f905/logs/undercloud/etc/hosts.txt.gz

There are duplicate entries for 192.168.24.1, and I suspect that may have something to do with it. I will need to get my hands on a representative reproducer environment to poke at it and be sure.
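
A quick way to confirm duplicate addresses in a hosts file is to group its lines by IP. This is a minimal sketch; the function name is hypothetical and the hostnames in the usage example are made up, not taken from the linked log.

```python
from collections import defaultdict


def duplicate_addresses(hosts_text):
    """Return {ip: [line_numbers]} for addresses that appear on more than one line."""
    seen = defaultdict(list)
    for lineno, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop trailing comments, then split into "address name [alias...]".
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2:
            continue  # blank line, comment, or address without names
        seen[entry[0]].append(lineno)
    return {ip: lines for ip, lines in seen.items() if len(lines) > 1}


# Hypothetical sample; the hostnames are placeholders.
sample = """\
127.0.0.1 localhost
192.168.24.1 host-a.example host-a
192.168.24.1 host-a.ctlplane.example
"""
print(duplicate_addresses(sample))
```

Whether duplicates are actually the cause here is still unverified. The sketch only makes the observation easy to repeat on a reproducer.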

Bogdan Dobrelya (bogdando) wrote :

The journal log is also full of:

epmd: failed to bind on ipaddr 0.0.0.0

John Eckersberg (jeckersb) wrote :

Good catch that it's trying to bind to 0.0.0.0.

It should only do that if ERL_EPMD_ADDRESS is unset. But we set that in rabbitmq-env.conf here:

https://logserver.rdoproject.org/08/36508/2/check/periodic-tripleo-ci-centos-9-scenario007-standalone-master/7f7f905/logs/undercloud/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf.txt.gz

In that case it should bind to (1) the IPv4 loopback, (2) the IPv6 loopback, and (3+) the comma-separated addresses from ERL_EPMD_ADDRESS. In no case should it try to bind to 0.0.0.0 (unless, I guess, you explicitly set ERL_EPMD_ADDRESS=0.0.0.0).
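
The expected behaviour described above can be written down as a small model, which helps when comparing against what the journal shows. This is a sketch of the rule as stated in this comment, not epmd's implementation. The function name is hypothetical, and treating an empty value the same as an unset one is an assumption.

```python
import ipaddress


def expected_epmd_binds(env):
    """Return the addresses epmd is expected to bind, per the rule described above."""
    raw = env.get("ERL_EPMD_ADDRESS", "").strip()
    if not raw:
        # Unset (or empty, by assumption): wildcard bind, as seen in the journal.
        return ["0.0.0.0"]
    # Loopbacks are always included, followed by the configured addresses.
    binds = ["127.0.0.1", "::1"]
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        addr = str(ipaddress.ip_address(item))
        if addr not in binds:
            binds.append(addr)
    return binds


print(expected_epmd_binds({"ERL_EPMD_ADDRESS": "192.168.24.1"}))
```

Under this model, a bind attempt on 0.0.0.0 means either the variable did not reach epmd's environment or it was set to 0.0.0.0 explicitly.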

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: xena-3 → yoga-1
Ananya Banerjee (frenzyfriday) wrote :

Newer rechecks are failing with:
PLAY [Server network validation] ***********************************************
2021-11-12 02:53:44.205692 | fa163eb4-01af-65c1-1453-000000000086 | TASK | Basic Network Validation
2021-11-12 02:53:44.220267 | fa163eb4-01af-65c1-1453-000000000086 | TIMING | Basic Network Validation | standalone | 0:00:44.543561 | 0.01s
2021-11-12 02:53:44.259668 | fa163eb4-01af-65c1-1453-0000000004e8 | TASK | Collect default network fact
2021-11-12 02:53:44.893103 | fa163eb4-01af-65c1-1453-0000000004e8 | OK | Collect default network fact | standalone
2021-11-12 02:53:44.894214 | fa163eb4-01af-65c1-1453-0000000004e8 | TIMING | tripleo_nodes_validation : Collect default network fact | standalone | 0:00:45.217513 | 0.63s
2021-11-12 02:53:44.913062 | fa163eb4-01af-65c1-1453-0000000004e9 | TASK | Check Default IPv4 Gateway availability
2021-11-12 02:53:45.166891 | fa163eb4-01af-65c1-1453-0000000004e9 | OK | Check Default IPv4 Gateway availability | standalone
2021-11-12 02:53:45.168147 | fa163eb4-01af-65c1-1453-0000000004e9 | TIMING | tripleo_nodes_validation : Check Default IPv4 Gateway availability | standalone | 0:00:45.491441 | 0.25s
2021-11-12 02:53:45.187632 | fa163eb4-01af-65c1-1453-0000000004ea | TASK | Check all networks Gateway availability
2021-11-12 02:53:45.422475 | fa163eb4-01af-65c1-1453-0000000004ea | FATAL | Check all networks Gateway availability | standalone | error={"ansible_loop_var": "gateway_ip", "changed": false, "cmd": ["ping", "-w", "10", "-c", "1", "[]"], "delta": "0:00:00.003536", "end": "2021-11-12 02:53:45.405727", "gateway_ip": [], "msg": "non-zero return code", "rc": 2, "start": "2021-11-12 02:53:45.402191", "stderr": "ping: []: Name or service not known", "stderr_lines": ["ping: []: Name or service not known"], "stdout": "", "stdout_lines": []}

https://logserver.rdoproject.org/08/36508/4/check/periodic-tripleo-ci-centos-9-scenario007-standalone-master/a27f36c/logs/undercloud/home/zuul-worker/standalone_deploy.log.txt.gz
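
In the failure above, `gateway_ip` is an empty list and `ping` receives the literal string "[]" as its target. A minimal sketch of a guard that drops such values before pinging is shown below. It is not the tripleo_nodes_validation role's code, and the function name is hypothetical.

```python
import ipaddress


def pingable_gateways(gateways):
    """Keep only string items that parse as IP addresses; skip empty or nested items."""
    valid = []
    for gw in gateways:
        if not isinstance(gw, str):
            continue  # e.g. a nested empty list, as in the log above
        try:
            valid.append(str(ipaddress.ip_address(gw)))
        except ValueError:
            continue  # empty string, "[]", hostnames, and other non-addresses
    return valid


print(pingable_gateways(["192.168.24.1", [], "[]"]))
```

This only filters the input. Why the gateway list contained an empty item in this job is a separate question.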

Marios Andreou (marios-b) wrote :

@Ananya o/ I am trying to catch up with this for the CIX (I am ruck this week)

This job isn't merged yet; afaics it is only running in your test at https://review.rdoproject.org/r/c/testproject/+/36508 (https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-scenario007-standalone-master&project=testproject)

I think we need to get some newer runs here; the last one was a couple of weeks ago.

Ananya Banerjee (frenzyfriday) wrote :
Changed in tripleo:
status: Triaged → Fix Released