[fs01][master] Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass

Bug #1835615 reported by Arx Cruz
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari

Bug Description

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/5b7be63/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-07-05 20:55:34 | TASK [Debug output for task: Run puppet host configuration for step 1] *********
2019-07-05 20:55:34 | Friday 05 July 2019 20:53:45 +0000 (0:01:49.118) 0:17:39.881 ***********
2019-07-05 20:55:34 | ok: [overcloud-controller-2] => {
2019-07-05 20:55:34 | "failed_when_result": false,
2019-07-05 20:55:34 | "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
2019-07-05 20:55:34 | "puppet-user: Warning: Support for ruby version 2.0.0 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported ruby versions.",
2019-07-05 20:55:34 | "puppet-user: (location: /usr/share/ruby/vendor_ruby/puppet.rb:130:in `<module:Puppet>')",
2019-07-05 20:55:34 | "puppet-user: Warning: The function 'hiera' is deprecated in favor of using 'lookup'. See https://puppet.com/docs/puppet/5.5/deprecated_language.html\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5",
2019-07-05 20:55:34 | "puppet-user: (file: /etc/puppet/hiera.yaml)",
2019-07-05 20:55:34 | "puppet-user: Warning: Undefined variable '::deploy_config_name'; \\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: ModuleLoader: module 'tripleo' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: Warning: Unknown variable: '::deployment_type'. (file: /etc/puppet/modules/tripleo/manifests/profile/base/database/mysql/client.pp, line: 89, column: 31)",
2019-07-05 20:55:34 | "puppet-user: Warning: ModuleLoader: module 'pacemaker' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\\n (file & line not available)",
2019-07-05 20:55:34 | "puppet-user: error: Could not connect to cluster (is it running?)",

We also noticed this error on some check jobs during the week; it only happens on overcloud-controller-2.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

There are some puppet errors in logs:

2019-07-07 03:04:35 | "stderr: Error: unable to find resource 'galera-bundle'",

2019-07-07 03:04:35 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",
2019-07-07 03:04:35 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass",

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Last jobs passed

tags: removed: promotion-blocker
Changed in tripleo:
milestone: train-2 → train-3
Revision history for this message
Arx Cruz (arxcruz) wrote :

Marking as invalid

Changed in tripleo:
status: Triaged → Invalid
Revision history for this message
chandan kumar (chkumar246) wrote :

We are seeing this issue again on the fs020 master promotion job:
https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/e72fe9b/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-12_21_46_58

"+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::resource::ocf,pacemaker::constraint::order,pacemaker::constraint::colocation,rabbitmq_policy,rabbitmq_user,rabbitmq_ready -e 'noop_resource('\\''package'\\''); include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::rabbitmq_bundle'",
2019-08-12 21:46:58 | "Warning: Support for ruby version 2.0.0 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported ruby versions.",
2019-08-12 21:46:58 | " (location: /usr/share/ruby/vendor_ruby/puppet.rb:130:in `<module:Puppet>')",
2019-08-12 21:46:58 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",

Changed in tripleo:
status: Invalid → Confirmed
summary: - Promotion jobs are failing with could not connect to cluster
+ [fs020][master] Error: Facter: error while resolving custom fact
+ \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
summary: - [fs020][master] Error: Facter: error while resolving custom fact
+ [fs01][master] Error: Facter: error while resolving custom fact
\"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
tags: added: promotion-blocker
removed: alert ci
Revision history for this message
Michele Baldessari (michele) wrote : Re: [fs01][master] Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass

Facter errors, as annoying as they are, are non-fatal; that error has nothing to do with your failure. The error you are hitting is this:
2019-08-12 21:46:58 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Could not evaluate: undefined method `scan' for nil:NilClass",

I will try to look into it a bit. How easily reproducible is this?

summary: - [fs01][master] Error: Facter: error while resolving custom fact
- \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass
+ [fs01][master] Error:
+ /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-
+ all@/]: Could not evaluate: undefined method `scan' for nil:NilClass
Revision history for this message
Michele Baldessari (michele) wrote :

So the failure is on controller-2, which is the bootstrap node (since when that has been the case, I am not sure). My current theory is that the rabbitmq cluster gets bootstrapped on controller-0 and then comes up on ctrl-1 and ctrl-2, and that things go wrong when we call the following snippet:
rabbitmq_policy { 'ha-all@/':
  applyto    => 'queues',
  pattern    => '^(?!amq\.).*',
  definition => hash($ha_policy.map |$k, $v| {[$k, "${v}"]}),
}

The rabbitmq provider chokes on the output, likely because at some point the command produces something unexpected.
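For reference, the Puppet resource above roughly corresponds to a rabbitmqctl set_policy invocation. The sketch below only builds and prints that command line (the definition JSON is the one seen in the list_policies output later in this comment); it does not talk to a broker:

```shell
#!/bin/sh
# Sketch: the rabbitmqctl call that Rabbitmq_policy['ha-all@/'] roughly
# amounts to. The command is printed, not executed against a broker.
definition='{"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}'
cmd="rabbitmqctl set_policy -p / --apply-to queues ha-all '^(?!amq\.).*' '$definition'"
echo "$cmd"
```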

We run that command *only* after 'rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true', so it might be that this condition is not strict enough. I, for one, was able to observe the output of list_policies changing during a restart:
 while true; do rabbitmqctl -q list_policies -p /; done
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* all {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
vhost name pattern apply-to definition priority

It is likely that puppet-rabbitmq chokes when this happens, due either to some other unexpected output or to this output changing under its feet ("all" vs "queues"). To know for sure we would really need the --debug option on these containers (this used to work but somehow got broken in the past).
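To illustrate how fragile parsing that tabular output is, here is a hypothetical awk-based sketch (not the actual puppet-rabbitmq provider code): extracting the apply-to column works on a full snapshot, but a snapshot caught mid-restart that contains only the header row yields nothing — exactly the kind of nil value on which a Ruby provider would then call `scan`:

```shell
#!/bin/sh
# Hypothetical parser sketch (NOT the real puppet-rabbitmq provider):
# pull the apply-to column out of 'rabbitmqctl -q list_policies' output.
parse_applyto() {
  awk 'NR > 1 { print $4 }'   # skip the header row, print the 4th column
}

full='vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly"} 0'

header_only='vhost name pattern apply-to definition priority'

echo "$full" | parse_applyto          # prints: queues
echo "$header_only" | parse_applyto   # prints nothing: an unguarded caller sees nil
```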

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Michele: with the alternative upstream OCF RA, the policy file gets applied only after the rabbitmq app has started on the promoted master resource [0]. That used to work well, AFAICT.

[0] https://github.com/rabbitmq/rabbitmq-server-release/blob/master/scripts/rabbitmq-server-ha.ocf#L2314

Revision history for this message
Michele Baldessari (michele) wrote :

Thanks Bogdan. My hunch is that 'rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true' is more stringent than checking whether the rabbitmq app (plus plugins) is up, but I have zero data to back that up (except that I presume that without rabbitmq up and running there can't be a cluster ;). Ideally there would also be a check that the queues are synced.

I'd love to get --debug output from the rabbitmq_init_bundle container, because in theory puppet-rabbitmq has some code to be a bit resilient to errors when invoking rabbitmqctl commands. It seems that in oooq we've stopped including environments/config-debug.yaml (or I cannot see it).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Michele, here is some quick testing:

()[rabbitmq@standalone /]$ rabbitmqctl eval "rabbit_mnesia:is_clustered()."
false
()[rabbitmq@standalone /]$ rabbitmqctl stop_app
Stopping rabbit application on node rabbit@standalone
()[rabbitmq@standalone /]$ rabbitmqctl eval "rabbit_mnesia:is_clustered()."
false

As you can see, is_clustered() says nothing about whether the app is actually running; we could improve the policy application by waiting for the app to have started.
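The suggested improvement can be sketched as a readiness gate that requires both conditions. The stub below emulates the two rabbitmqctl eval answers reported in this thread (a clustered node with the app stopped can still answer is_clustered() => true, while rabbit_nodes:is_running(node(), rabbit) answers false), so the real rabbitmqctl is not needed to follow the logic:

```shell
#!/bin/sh
# Stubbed sketch of a more stringent readiness check (assumption: the stub
# mirrors the eval outputs quoted elsewhere in this bug; it is not rabbitmqctl).
rabbitmqctl_stub() {  # $1 = scenario, $2 = erlang expression
  case "$1:$2" in
    app_stopped:rabbit_mnesia*) echo true  ;;  # misleadingly looks "ready"
    app_stopped:rabbit_nodes*)  echo false ;;  # the app is in fact down
    app_running:*)              echo true  ;;
  esac
}

rabbit_ready() {  # ready only when clustered AND the app is running
  rabbitmqctl_stub "$1" 'rabbit_mnesia:is_clustered().'            | grep -q true &&
  rabbitmqctl_stub "$1" 'rabbit_nodes:is_running(node(), rabbit).' | grep -q true
}

rabbit_ready app_stopped || echo 'app_stopped: not ready (the old check would pass)'
rabbit_ready app_running && echo 'app_running: ready'
```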

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.opendev.org/676431

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Confirmed → In Progress
Changed in tripleo:
assignee: Michele Baldessari (michele) → Alex Schultz (alex-schultz)
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Michele Baldessari (michele)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/677404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/677405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/677406

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/676431
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=3a8c2b0dc713458ccea3d1ee2657a4ef448f7b2f
Submitter: Zuul
Branch: master

commit 3a8c2b0dc713458ccea3d1ee2657a4ef448f7b2f
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/stein)

Reviewed: https://review.opendev.org/677404
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=914d66ad81dac2be83ff94ee54291fdbf5f7e3c7
Submitter: Zuul
Branch: stable/stein

commit 914d66ad81dac2be83ff94ee54291fdbf5f7e3c7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.opendev.org/677405
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=cb5102003b811d1835b8f03eb2b83f612094a788
Submitter: Zuul
Branch: stable/rocky

commit cb5102003b811d1835b8f03eb2b83f612094a788
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.opendev.org/677406
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=051e05a91dc89021de45644a8a34ffb93c2ddce7
Submitter: Zuul
Branch: stable/queens

commit 051e05a91dc89021de45644a8a34ffb93c2ddce7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 14 16:40:46 2019 +0200

    Make the rabbitmq-ready exec more stringent

    Currently we use the following command to determine if rabbit is
    up and running *and* ready to service requests:
    rabbitmqctl eval "rabbit_mnesia:is_clustered()." | grep -q true

    Now we have occasionally observed that rabbitmqctl policies commands
    which are executed after said exec['rabbitmq-ready'] will fail.

    One potential reason is that is_clustered() can return true *before*
    the rabbit app is actually running. In fact we can see it does
    return true even though the app is stopped:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_mnesia:is_clustered().'
    true

    Let's switch to a combination of commands that check for the cluster to
    be up *and* the rabbitmq app to be running:
    ()[root@controller-1 /]$ rabbitmqctl stop_app
    Stopping rabbit application on node rabbit@controller-1 ...
    ()[root@controller-1 /]$ rabbitmqctl eval 'rabbit_nodes:is_running(node(), rabbit).'
    false

    Suggested-By: Bogdan Dobrelya <email address hidden>
    Closes-Bug: #1835615

    Change-Id: I29f779145a39cd16374a91626f7fae1581a18224

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.5.1

This issue was fixed in the openstack/puppet-tripleo 10.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.5.1

This issue was fixed in the openstack/puppet-tripleo 9.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.5.1

This issue was fixed in the openstack/puppet-tripleo 8.5.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.2.0

This issue was fixed in the openstack/puppet-tripleo 11.2.0 release.
