[fs01][master] Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass

Bug #1835615 reported by Arx Cruz on 2019-07-06
This bug affects 1 person

Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari
Milestone: train-3

Bug Description

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/5b7be63/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-07-05 20:55:34 | TASK [Debug output for task: Run puppet host configuration for step 1] *********
2019-07-05 20:55:34 | Friday 05 July 2019 20:53:45 +0000 (0:01:49.118) 0:17:39.881 ***********
2019-07-05 20:55:34 | ok: [overcloud-controller-2] => {
2019-07-05 20:55:34 | "failed_when_result": false,
2019-07-05 20:55:34 | "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
2019-07-05 20:55:34 | "puppet-user: Warning: Support for ruby version 2.0.0 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported ruby versions.",
2019-07-05 20:55:34 | "puppet-user: (location: /usr/share/ruby/vendor_ruby/puppet.rb:130:in `<module:Puppet>')",
2019-07-05 20:55:34 | "puppet-user: Warning: The function 'hiera' is deprecated in favor of using 'lookup'. See https://puppet.com/docs/puppet/5.5/deprecated_language.html\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5",
2019-07-05 20:55:34 | "puppet-user: (file: /etc/puppet/hiera.yaml)",
2019-07-05 20:55:34 | "puppet-user: Warning: Undefined variable '::deploy_config_name'; \\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: ModuleLoader: module 'tripleo' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: Unknown variable: '::deployment_type'. (file: /etc/puppet/modules/tripleo/manifests/profile/base/database/mysql/client.pp, line: 89, column: 31)",
2019-07-05 20:55:34 | "puppet-user: Warning: ModuleLoader: module 'pacemaker' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: error: Could not connect to cluster (is it running?)",

We noticed this error on some check jobs during the week as well; it only happens on overcloud-controller-2.

There are some puppet errors in the logs:

2019-07-07 03:04:35 | "stderr: Error: unable to find resource 'galera-bundle'",

2019-07-07 03:04:35 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",
2019-07-07 03:04:35 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass",

The last jobs passed.

tags: removed: promotion-blocker
Changed in tripleo:
milestone: train-2 → train-3
Arx Cruz (arxcruz) wrote :

Marking as invalid

Changed in tripleo:
status: Triaged → Invalid
chandan kumar (chkumar246) wrote :

We are seeing this issue again on the fs020 master promotion job:
https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/e72fe9b/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-12_21_46_58

"+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready -e 'noop_resource('\\''package'\\''); include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle'",
2019-08-12 21:46:58 | "Warning: Support for ruby version 2.0.0 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported ruby versions.",
2019-08-12 21:46:58 | " (location: /usr/share/ruby/vendor_ruby/puppet.rb:130:in `<module:Puppet>')",
2019-08-12 21:46:58 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",

Changed in tripleo:
status: Invalid → Confirmed
summary: - Promotion jobs are failing with could not connect to cluster
+ [fs020][master] Error: Facter: error while resolving custom fact
+ \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
summary: - [fs020][master] Error: Facter: error while resolving custom fact
+ [fs01][master] Error: Facter: error while resolving custom fact
\"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
tags: added: promotion-blocker
removed: alert ci

Facter errors (as annoying as they are) are non-fatal; that error has nothing to do with your failure. The error you are hitting is this:
2019-08-12 21:46:58 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass",

I will try to look into it a bit. How easily reproducible is this?

summary: - [fs01][master] Error: Facter: error while resolving custom fact
- \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
+ [fs01][master] Error:
+ /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy
+ [ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass
Michele Baldessari (michele) wrote :

So the failure is on controller-2, which is the bootstrap node (I am not sure since when that has been the case). My current theory is that the rabbitmq cluster gets bootstrapped on controller-0, then spawns up on ctrl-1 and ctrl-2, and then we call the following snippet:
rabbitmq_policy { 'ha-all@/':
  applyto    => 'queues',
  pattern    => '^(?!amq\.).*',
  definition => hash($ha_policy.map |$k, $v| { [$k, "${v}"] }),
}

The rabbitmq provider barfs on the output, likely because at some point the command produces some unexpected output.

We run that command *only* after 'rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true' succeeds, so it may be that this condition is not enough. I for one was able to observe the output of list_policies changing during a restart:
 while true; do rabbitmqctl -q list_policies -p /; done
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority

It is likely that when this happens the puppet-rabbitmq provider barfs, either because of some other unexpected output and/or because this output changes under its feet ("all" vs "queues"). To be exactly sure we would really need the --debug option on these containers (this used to work but somehow got broken in the past).
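
For illustration, a minimal shell sketch of that observation; it assumes the list_policies column layout shown in the transcript above (vhost, name, pattern, apply-to, definition, priority) and simply watches the apply-to field flip during a restart:

# Print the apply-to column of the ha-all policy once a second; seeing it
# flip between "all" and "queues" is the race described above.
while true; do
    rabbitmqctl -q list_policies -p / | awk '$2 == "ha-all" { print $4 }'
    sleep 1
done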

Bogdan Dobrelya (bogdando) wrote :

@Michele, for the alternative upstream OCF RA, the policy file gets applied only after the rabbitmq app has started on the promoted master resource [0]. That used to work well AFAICT.

[0] https://github.com/rabbitmq/rabbitmq-server-release/blob/master/scripts/rabbitmq-server-ha.ocf#L2314

Michele Baldessari (michele) wrote :

Thanks Bogdan. My hunch would be that 'rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true' is more stringent than checking whether the rabbitmq app (+ plugins) is up, but I have zero data to back that up (except that I presume that without rabbitmq up and running there can't be a cluster ;). Ideally there would be a check for the queues being synced.

I'd love to get --debug from the rabbitmq_init_bundle container, because in theory puppet-rabbitmq has some code to be a bit resilient against errors when invoking rabbitmqctl commands. It seems that in oooq we've stopped including environments/config-debug.yaml (or I cannot see it).
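
For reference, a hedged sketch of how that environment could be re-included in a deployment; the template path assumes the standard packaged location of tripleo-heat-templates:

# Hypothetical deploy invocation pulling config-debug.yaml back in so the
# container puppet runs emit debug output.
openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/config-debug.yaml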

Bogdan Dobrelya (bogdando) wrote :

@Michele, here is some quick testing:

()[rabbitmq@standalone /]$ rabbitmqctl eval "rabbit_mnesia:is_clustered()."
false
()[rabbitmq@standalone /]$ rabbitmqctl stop_app
Stopping rabbit application on node rabbit@standalone
()[rabbitmq@standalone /]$ rabbitmqctl eval "rabbit_mnesia:is_clustered()."
false

As you can see, we could improve the policy application by waiting for the app to be started.
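
A minimal sketch of such a wait, using the app-running check that the eventual fix below settles on (the timeout and polling interval are illustrative only):

# Poll until the rabbit app reports running on this node; give up after
# roughly five minutes.
for _ in $(seq 1 60); do
    rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).' | grep -q true && break
    sleep 5
done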

Fix proposed to branch: master
Review: https://review.opendev.org/676431

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Confirmed → In Progress
Changed in tripleo:
assignee: Michele Baldessari (michele) → Alex Schultz (alex-schultz)
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Michele Baldessari (michele)

Reviewed: https://review.opendev.org/676431
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=3a8c2b0dc713458ccea3d1ee2657a4ef448f7b2f
Submitter: Zuul
Branch: master

commit 3a8c2b0dc713458ccea3d1ee2657a4ef448f7b2f
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224
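
Put together, the readiness gate after this change amounts to requiring both conditions to hold; a sketch of the combined check (not the literal exec command from puppet-tripleo):

# The node must be part of a formed cluster AND the rabbit app must be
# running locally before policies are touched.
rabbitmqctl eval 'rabbit_mnesia:is_clustered().' | grep -q true &&
rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).' | grep -q true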

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/677404
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=914d66ad81dac2be83ff94ee54291fdbf5f7e3c7
Submitter: Zuul
Branch: stable/stein

commit 914d66ad81dac2be83ff94ee54291fdbf5f7e3c7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    [Commit message body identical to the master commit above.]

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/677405
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=cb5102003b811d1835b8f03eb2b83f612094a788
Submitter: Zuul
Branch: stable/rocky

commit cb5102003b811d1835b8f03eb2b83f612094a788
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    [Commit message body identical to the master commit above.]

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/677406
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=051e05a91dc89021de45644a8a34ffb93c2ddce7
Submitter: Zuul
Branch: stable/queens

commit 051e05a91dc89021de45644a8a34ffb93c2ddce7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    [Commit message body identical to the master commit above.]

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-queens

This issue was fixed in the openstack/puppet-tripleo 10.5.1 release.

This issue was fixed in the openstack/puppet-tripleo 9.5.1 release.

This issue was fixed in the openstack/puppet-tripleo 8.5.1 release.

This issue was fixed in the openstack/puppet-tripleo 11.2.0 release.
