[fs01][master] Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass

Bug #1835615 reported by Arx Cruz
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari

Bug Description

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/5b7be63/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-07-05 20:55:34 | TASK [Debug output for task: Run puppet host configuration for step 1] *********
2019-07-05 20:55:34 | Friday 05 July 2019 20:53:45 +0000 (0:01:49.118) 0:17:39.881 ***********
2019-07-05 20:55:34 | ok: [overcloud-controller-2] => {
2019-07-05 20:55:34 | "failed_when_result": false,
2019-07-05 20:55:34 | "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
2019-07-05 20:55:34 | "puppet-user: Warning: Support for ruby version 2.0.0 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported ruby versions.",
2019-07-05 20:55:34 | "puppet-user: (location: /usr/share/ruby/vendor_ruby/puppet.rb:130:in `<module:Puppet>')",
2019-07-05 20:55:34 | "puppet-user: Warning: The function 'hiera' is deprecated in favor of using 'lookup'. See https://puppet.com/docs/puppet/5.5/deprecated_language.html\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5",
2019-07-05 20:55:34 | "puppet-user: (file: /etc/puppet/hiera.yaml)",
2019-07-05 20:55:34 | "puppet-user: Warning: Undefined variable '::deploy_config_name'; \\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: ModuleLoader: module 'tripleo' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: Unknown variable: '::deployment_type'. (file: /etc/puppet/modules/tripleo/manifests/profile/base/database/mysql/client.pp, line: 89, column: 31)",
2019-07-05 20:55:34 | "puppet-user: Warning: ModuleLoader: module 'pacemaker' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: error: Could not connect to cluster (is it running?)",

We also noticed this error on some check jobs during the week; it only happens on overcloud-controller-2.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

There are some puppet errors in logs:

2019-07-07 03:04:35 | "stderr: Error: unable to find resource 'galera-bundle'",

2019-07-07 03:04:35 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",
2019-07-07 03:04:35 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass",

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Last jobs passed

tags: removed: promotion-blocker
Changed in tripleo:
milestone: train-2 → train-3
Revision history for this message
Arx Cruz (arxcruz) wrote :

Marking as invalid

Changed in tripleo:
status: Triaged → Invalid
Revision history for this message
chandan kumar (chkumar246) wrote :

We are seeing this issue again on the fs020 master promotion job:
https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/e72fe9b/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-12_21_46_58

"+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready -e 'noop_resource('\\''package'\\''); include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle'",
2019-08-12 21:46:58 | "Warning: Support for ruby version 2.0.0 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported ruby versions.",
2019-08-12 21:46:58 | " (location: /usr/share/ruby/vendor_ruby/puppet.rb:130:in `<module:Puppet>')",
2019-08-12 21:46:58 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",

Changed in tripleo:
status: Invalid → Confirmed
summary: - Promotion jobs are failing with could not connect to cluster
+ [fs020][master] Error: Facter: error while resolving custom fact
+ \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
summary: - [fs020][master] Error: Facter: error while resolving custom fact
+ [fs01][master] Error: Facter: error while resolving custom fact
\"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
tags: added: promotion-blocker
removed: alert ci
Revision history for this message
Michele Baldessari (michele) wrote : Re: [fs01][master] Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass

Facter errors, as annoying as they are, are non-fatal; that error has nothing to do with your failure. The error you are hitting is this:
2019-08-12 21:46:58 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass",

I will try to look into it a bit. How easily reproducible is this?

summary: - [fs01][master] Error: Facter: error while resolving custom fact
- \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
+ [fs01][master] Error:
+ /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-
+ all@/]: Could not evaluate: undefined method `scan' for nil:NilClass
Revision history for this message
Michele Baldessari (michele) wrote :

So the failure is on controller-2, which is the bootstrap node (since when that has been the case, I am not sure). My current theory is that the rabbitmq cluster gets bootstrapped on controller-0 and then comes up on ctrl-1 and ctrl-2, and that things go wrong when we call the following snippet:
rabbitmq_policy { 'ha-all@/':
  applyto    => 'queues',
  pattern    => '^(?!amq\.).*',
  definition => hash($ha_policy.map |$k, $v| {[$k, "${v}"]}),
}

The rabbitmq provider chokes on the output, likely because at some point the command produces something unexpected.
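For reference, the Puppet resource above roughly corresponds to a rabbitmqctl set_policy invocation. The sketch below only builds and prints that command line (the definition JSON is the one seen in the list_policies output later in this comment); it does not talk to a broker:

```shell
#!/bin/sh
# Sketch: the rabbitmqctl call that Rabbitmq_policy['ha-all@/'] roughly
# amounts to. The command is printed, not executed against a broker.
definition='{"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}'
cmd="rabbitmqctl set_policy -p / --apply-to queues ha-all '^(?!amq\.).*' '$definition'"
echo "$cmd"
```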

We run that command *only* after 'rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true', so it might be that this condition is not strict enough. I, for one, was able to observe the output of list_policies changing during a restart:
 while true; do rabbitmqctl -q list_policies -p /; done
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority

It is likely that puppet-rabbitmq chokes when this happens, due either to some other unexpected output or to this output changing under its feet ("all" vs "queues"). To know for sure we would really need the --debug option on these containers (this used to work but somehow got broken in the past).
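To illustrate how fragile parsing that tabular output is, here is a hypothetical awk-based sketch (not the actual puppet-rabbitmq provider code): extracting the apply-to column works on a full snapshot, but a snapshot caught mid-restart that contains only the header row yields nothing — exactly the kind of nil value on which a Ruby provider would then call `scan`:

```shell
#!/bin/sh
# Hypothetical parser sketch (NOT the real puppet-rabbitmq provider):
# pull the apply-to column out of 'rabbitmqctl -q list_policies' output.
parse_applyto() {
  awk 'NR > 1 { print $4 }'   # skip the header row, print the 4th column
}

full='vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly"} 0'

header_only='vhost name pattern apply-to definition priority'

echo "$full" | parse_applyto          # prints: queues
echo "$header_only" | parse_applyto   # prints nothing: an unguarded caller sees nil
```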

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Michele: with the alternative upstream OCF RA, the policy file gets applied only after the rabbitmq app has started on the promoted master resource [0]. That used to work well, AFAICT.

[0] https://github.com/rabbitmq/rabbitmq-server-release/blob/master/scripts/rabbitmq-server-ha.ocf#L2314

Revision history for this message
Michele Baldessari (michele) wrote :

Thanks Bogdan. My hunch is that 'rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true' is more stringent than checking whether the rabbitmq app (plus plugins) is up, but I have zero data to back that up (except that I presume that without rabbitmq up and running there can't be a cluster ;). Ideally there would also be a check that the queues are synced.

I'd love to get --debug output from the rabbitmq_init_bundle container, because in theory puppet-rabbitmq has some code to be a bit resilient to errors when invoking rabbitmqctl commands. It seems that in oooq we've stopped including environments/config-debug.yaml (or I cannot see it).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Michele, here is some quick testing:

()[rabbitmq@standalone /]$ rabbitmqctl eval "rabbit_mnesia:is_clustered()."
false
()[rabbitmq@standalone /]$ rabbitmqctl stop_app
Stopping rabbit application on node rabbit@standalone
()[rabbitmq@standalone /]$ rabbitmqctl eval "rabbit_mnesia:is_clustered()."
false

As you can see, is_clustered() says nothing about whether the app is actually running; we could improve the policy application by waiting for the app to have started.
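The suggested improvement can be sketched as a readiness gate that requires both conditions. The stub below emulates the two rabbitmqctl eval answers reported in this thread (a clustered node with the app stopped can still answer is_clustered() => true, while rabbit_nodes:is_running(node(), rabbit) answers false), so the real rabbitmqctl is not needed to follow the logic:

```shell
#!/bin/sh
# Stubbed sketch of a more stringent readiness check (assumption: the stub
# mirrors the eval outputs quoted elsewhere in this bug; it is not rabbitmqctl).
rabbitmqctl_stub() {  # $1 = scenario, $2 = erlang expression
  case "$1:$2" in
    app_stopped:rabbit_mnesia*) echo true  ;;  # misleadingly looks "ready"
    app_stopped:rabbit_nodes*)  echo false ;;  # the app is in fact down
    app_running:*)              echo true  ;;
  esac
}

rabbit_ready() {  # ready only when clustered AND the app is running
  rabbitmqctl_stub "$1" 'rabbit_mnesia:is_clustered().'            | grep -q true &&
  rabbitmqctl_stub "$1" 'rabbit_nodes:is_running(node(), rabbit).' | grep -q true
}

rabbit_ready app_stopped || echo 'app_stopped: not ready (the old check would pass)'
rabbit_ready app_running && echo 'app_running: ready'
```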

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.opendev.org/676431

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Confirmed → In Progress
Changed in tripleo:
assignee: Michele Baldessari (michele) → Alex Schultz (alex-schultz)
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Michele Baldessari (michele)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/677404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/677405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/677406

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/676431
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=3a8c2b0dc713458ccea3d1ee2657a4ef448f7b2f
Submitter: Zuul
Branch: master

commit 3a8c2b0dc713458ccea3d1ee2657a4ef448f7b2f
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/stein)

Reviewed: https://review.opendev.org/677404
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=914d66ad81dac2be83ff94ee54291fdbf5f7e3c7
Submitter: Zuul
Branch: stable/stein

commit 914d66ad81dac2be83ff94ee54291fdbf5f7e3c7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.opendev.org/677405
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=cb5102003b811d1835b8f03eb2b83f612094a788
Submitter: Zuul
Branch: stable/rocky

commit cb5102003b811d1835b8f03eb2b83f612094a788
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.opendev.org/677406
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=051e05a91dc89021de45644a8a34ffb93c2ddce7
Submitter: Zuul
Branch: stable/queens

commit 051e05a91dc89021de45644a8a34ffb93c2ddce7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.5.1

This issue was fixed in the openstack/puppet-tripleo 10.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.5.1

This issue was fixed in the openstack/puppet-tripleo 9.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.5.1

This issue was fixed in the openstack/puppet-tripleo 8.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.2.0

This issue was fixed in the openstack/puppet-tripleo 11.2.0 release.
