Adding a controller to the cluster fails with "Primitive 'p_dns' was not found in CIB!" because deploy_changes should be used instead of single-node deploy commands

Bug #1528488 reported by Ksenia Svechnikova
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Dmitry Bilunov

Bug Description

Reproduced on ISO 307 and 328

Steps:

            1. Revert snapshot NeutronVlanHa with 5 slaves
            2. Add 1 controller to the ready cluster:

                    fuel node --node 6 --env 1 --provision
                    fuel node --node 6 --env 1 --deploy

Expected result: node is ready
Actual result: node is in Error (2015-12-22 09:12:22 ERR (/Stage[main]/Cluster::Dns_ocf/Service[p_dns]) Could not evaluate: Primitive 'p_dns' was not found in CIB!)

Puppet: http://paste.openstack.org/show/482486/

Pacemaker:
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_element_value: Couldn't find admin_epoch in NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_abort: crm_element_value: Triggered assert at xml.c:6047 : data != NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_element_value: Couldn't find epoch in NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_abort: crm_element_value: Triggered assert at xml.c:6047 : data != NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_element_value: Couldn't find num_updates in NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_abort: crm_element_value: Triggered assert at xml.c:6047 : data != NULL

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :
tags: removed: qa-
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please elaborate on the time frame of the issue. There are no log entries since 2015-12-22T09:00. Was it perhaps the failure on node-4 at 2015-12-21T16:34:04?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And there are no logs for node-6 at all

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

There is some issue with snapshot generation on that env. The attached snapshot is from another lab (3+1 nodes); node-4 is the node in error.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

The root cause is that the corosync configuration on the old controllers of the env was not updated before the cluster.pp task was applied on node-6, so node-6 was not able to see the already defined cluster resources. The DNS resource is one of those that are created only on the primary controller role, so if the newly added controller ends up in a separate corosync cluster of its own, this resource is not found in its own CIB database and the deployment fails.
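
For reference, a quick way to confirm this state on the newly added node is to check which nodes its pacemaker partition sees and whether the primitive exists in its local CIB (standard pacemaker tooling; the p_dns resource name is the one from this report):

# crm_mon -1 | grep -A1 Online
# cibadmin --query --scope resources | grep -c p_dns

If the node lists only itself as online and the grep count is 0, it has formed its own corosync/pacemaker cluster without the already defined resources.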

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAICT from the affected env I had access to, the issue is that node-6 was deployed as a separate corosync cluster, while the hiera data looks OK, at least in its after-state:

# corosync-cmapctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.members.6.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.6.ip (str) = r(0) ip(10.109.2.9)
runtime.totem.pg.mrp.srp.members.6.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.6.status (str) = joined

# cat test.pp
$corosync_nodes = corosync_nodes(
    get_nodes_hash_by_roles(
        hiera_hash('network_metadata'),
        hiera('corosync_roles')
    ),
    'mgmt/corosync'
)
echo($corosync_nodes)

# puppet apply test.pp
2015/12/22 11:09:03.937: (Hash)
 {"node-5.test.domain.local"=>{"id"=>"5", "ip"=>"10.109.2.6"},
 "node-4.test.domain.local"=>{"id"=>"4", "ip"=>"10.109.2.4"},
 "node-6.test.domain.local"=>{"id"=>"6", "ip"=>"10.109.2.9"},
 "node-2.test.domain.local"=>{"id"=>"2", "ip"=>"10.109.2.8"}}

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If I reapply puppet manually, nothing changes:

# puppet apply /etc/puppet/modules/osnailyfacter/modular/cluster/cluster.pp
Notice: /Stage[main]/Main/Pcmk_nodes[pacemaker]/pacemaker_nodes: pacemaker_nodes changed '{"node-6.test.domain.local"=>"6"}' to '{"node-5.test.domain.local"=>"5", "node-4.test.domain.local"=>"4", "node-6.test.domain.local"=>"6", "node-2.test.domain.local"=>"2"}'
Notice: Finished catalog run in 13.28 seconds

# corosync-cmapctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.members.6.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.6.ip (str) = r(0) ip(10.109.2.9)
runtime.totem.pg.mrp.srp.members.6.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.6.status (str) = joined

tags: added: granular
Revision history for this message
slava valyavskiy (slava-val-al) wrote :
tags: added: life-cycle-management
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
slava valyavskiy (slava-val-al) wrote :

The corosync configuration for node-6 is OK. The problem is that this node was not added to the corosync configuration files on the old cluster nodes.
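
A quick way to double-check this on one of the pre-existing controllers (node IDs and the cmap key are the ones from this report) would be:

# grep -B2 'nodeid: 6' /etc/corosync/corosync.conf
# corosync-cmapctl runtime.totem.pg.mrp.srp.members

If the first command returns nothing and the membership map has no .6. entries, the old controllers never learned about node-6.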

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Interestingly, restarting corosync together with pacemaker changes nothing either, even though the nodelist config in corosync.conf looks correct:

nodelist {
  node {
    # node-5.test.domain.local
    ring0_addr: 10.109.2.6
    nodeid: 5
  }
  node {
    # node-4.test.domain.local
    ring0_addr: 10.109.2.4
    nodeid: 4
  }
  node {
    # node-6.test.domain.local
    ring0_addr: 10.109.2.9
    nodeid: 6
  }
  node {
    # node-2.test.domain.local
    ring0_addr: 10.109.2.8
    nodeid: 2
  }
}

I'm not sure whether this is a pure LCM case, as Slava described in comment #5, or a dynamic cluster scale issue as well.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Dynamic scaling seems to work, so this should be a pure LCM issue; see comment #5.

PS: I tested the manual task run with "fuel node --node 2,4,5,6 --tasks cluster --env 1" and, even though there were "Call cib_apply_diff failed (-205): Update was older than existing configuration" errors on the nodes, the cluster membership was updated OK.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the log snippet from the reported logs: http://pastebin.com/Crihqbaf
It shows that the data in hiera was correct. I believe the issue is that when scaling corosync, the cluster task shall be executed on all controllers before proceeding with the other dependent tasks that manipulate pacemaker resources.
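
For illustration, reusing only the commands already mentioned in this thread, the manual sequence would be to re-run the cluster task across all controllers first and only then retry the failed node (node and env IDs are specific to this lab):

# fuel node --node 2,4,5,6 --tasks cluster --env 1
# fuel node --node 6 --env 1 --deploy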

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

Ksenia, I would like to propose using the 'deploy_changes' approach instead of calling direct API methods like deploy and provision for a separate node. Does that make sense?
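
For clarity, that means triggering a whole-environment deployment from the CLI instead of per-node provision/deploy calls, roughly along these lines (exact syntax may differ between Fuel releases):

# fuel --env 1 deploy-changes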

summary: Adding controller to the cluster fails with "Primitive 'p_dns' was not
- found in CIB!"
+ found in CIB!" because cluster task shall be executed on all corosync
+ nodes ahead of any other tasks manipulating with pacemaker resources
Revision history for this message
Ksenia Svechnikova (kdemina) wrote : Re: Adding controller to the cluster fails with "Primitive 'p_dns' was not found in CIB!" because cluster task shall be executed on all corosync nodes ahead of any other tasks manipulating with pacemaker resources

@slava-val-al

Yes, this workaround will help; that's why the issue doesn't have Critical priority. But we should still have the option of deploying only some nodes (https://docs.mirantis.com/openstack/fuel/fuel-7.0/user-guide.html#using-fuel-cli).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Okay, marking as Invalid, as the Fuel CLI is supposed to apply the deploy only to the specified nodes by design.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

@bogdando

Why was the issue marked as Invalid?

We have two options for adding a node to the cluster:

1) assign node to the env, assign role, deploy-changes
2) assign node to the env, assign role, run
        fuel node --node 6 --env 1 --provision
        fuel node --node 6 --env 1 --deploy

So this bug is about the second way of adding a node, and according to the docs we have such an option by design.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

Bug status has been changed to 'Confirmed' due to @kdemina's description.

Changed in fuel:
status: Invalid → Confirmed
tags: added: team-bugfix
description: updated
tags: added: area-python tricky
removed: area-library
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Ksenia, AFAIK, way #2 never worked by design; only deploy changes should be used. I believe this should be addressed in the guide instead.

Revision history for this message
Dmitry Tyzhnenko (dtyzhnenko) wrote :

Reproduced on 8.0-361

 Scenario:
1. Create new environment
2. Choose Neutron, VLAN
3. Choose cinder for volumes and Ceph for ephemeral
4. Add 3 controllers
5. Add 2 computes
6. Add 1 Cinder node
7. Add 3 Ceph nodes
8. Verify networks
9. Deploy the environment
10. Verify networks
11. Run OSTF tests
12. Reset cluster

Failed on step 9

Snapshot with logs attached

Revision history for this message
Egor Kotko (ykotko) wrote :

Got the same on ISO #361
On a bare-metal lab

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Egor, please add logs. @Dmitry, in the log snapshot you provided in comment #20, node-2 was deployed as a separate corosync cluster: Online: [ node-2.domain.local ]. This looks like an env-specific connectivity issue.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs, there were multiple issues with the corosync cluster:
2015-12-28T14:58:09.844304+00:00 node-1 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (2533-4530-15): Broken pipe (32)
2015-12-28T15:39:01.692282+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-14952-14): Broken pipe (32)
2015-12-28T15:39:30.812831+00:00 node-2 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (12296-14470-15): Broken pipe (32)
2015-12-28T15:40:48.299180+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-18797-14): Broken pipe (32)
2015-12-28T15:41:24.231353+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-21555-14): Broken pipe (32)
2015-12-28T15:44:13.955670+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-8720-14): Broken pipe (32)
2015-12-28T15:45:08.344040+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-12822-14): Broken pipe (32)
2015-12-28T15:47:19.501078+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-31522-14): Broken pipe (32)
2015-12-28T15:57:37.452896+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-2755-14): Broken pipe (32)
2015-12-28T15:58:02.834803+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-5758-14): Broken pipe (32)
2015-12-28T15:58:43.488773+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-11541-14): Broken pipe (32)
2015-12-28T15:59:06.798997+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-16187-14): Broken pipe (32)
2015-12-28T16:00:32.513386+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-3744-14): Broken pipe (32)
Looks invalid due to env-specific issues. Moving back to Invalid.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The current status is still as reported in comment #19: https://bugs.launchpad.net/fuel/+bug/1528488/comments/19

summary: Adding controller to the cluster fails with "Primitive 'p_dns' was not
- found in CIB!" because cluster task shall be executed on all corosync
- nodes ahead of any other tasks manipulating with pacemaker resources
+ found in CIB!" because deploy_changes should be used instead of a single
+ node deploy commands