Adding a controller to the cluster fails with "Primitive 'p_dns' was not found in CIB!" because deploy_changes should be used instead of single-node deploy commands

Bug #1528488 reported by Ksenia Svechnikova
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Dmitry Bilunov

Bug Description

Reproduced on ISO 307 and 328

Steps:

            1. Revert snapshot NeutronVlanHa with 5 slaves
            2. Add 1 controller to the ready cluster:

                    fuel node --node 6 --env 1 --provision
                    fuel node --node 6 --env 1 --deploy

Expected result: node is ready
Actual result: node is in Error (2015-12-22 09:12:22 ERR (/Stage[main]/Cluster::Dns_ocf/Service[p_dns]) Could not evaluate: Primitive 'p_dns' was not found in CIB!)

Puppet: http://paste.openstack.org/show/482486/

Pacemaker:
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_element_value: Couldn't find admin_epoch in NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_abort: crm_element_value: Triggered assert at xml.c:6047 : data != NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_element_value: Couldn't find epoch in NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_abort: crm_element_value: Triggered assert at xml.c:6047 : data != NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_element_value: Couldn't find num_updates in NULL
Dec 22 09:09:50 [5883] node-6.test.domain.local cib: error: crm_abort: crm_element_value: Triggered assert at xml.c:6047 : data != NULL

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :
tags: removed: qa-
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please elaborate on the time frame of the issue. There are no log entries since 2015-12-22T09:00. Was it perhaps the failure on node-4 at 2015-12-21T16:34:04?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And there are no logs for node-6 at all

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

There is some issue with snapshot generation on that env. The attached snapshot is from another lab (3+1 nodes); node-4 is the node in error.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

The root cause is that the corosync configuration on the old controllers of the env was not updated before the cluster.pp task was applied on node-6, so node-6 was not able to see the already defined cluster resources. The DNS resource is one of those that are created only on the primary controller role, so if the newly added controller ends up in a separate corosync cluster of its own, this resource is not found in its own CIB database and the deployment fails.
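
For reference, a quick way to confirm this state on the newly added node is to check which nodes its pacemaker partition sees and whether the primitive exists in its local CIB (standard pacemaker tooling; the p_dns resource name is the one from this report):

# crm_mon -1 | grep -A1 Online
# cibadmin --query --scope resources | grep -c p_dns

If the node lists only itself as online and the grep count is 0, it has formed its own corosync/pacemaker cluster without the already defined resources.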

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAICT from the affected env I had access to, the issue is that node-6 was deployed as a separate corosync cluster, while the hiera data looks OK, at least in its after-state:

# corosync-cmapctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.members.6.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.6.ip (str) = r(0) ip(10.109.2.9)
runtime.totem.pg.mrp.srp.members.6.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.6.status (str) = joined

# cat test.pp
$corosync_nodes = corosync_nodes(
    get_nodes_hash_by_roles(
        hiera_hash('network_metadata'),
        hiera('corosync_roles')
    ),
    'mgmt/corosync'
)
echo($corosync_nodes)

# puppet apply test.pp
2015/12/22 11:09:03.937: (Hash)
 {"node-5.test.domain.local"=>{"id"=>"5", "ip"=>"10.109.2.6"},
 "node-4.test.domain.local"=>{"id"=>"4", "ip"=>"10.109.2.4"},
 "node-6.test.domain.local"=>{"id"=>"6", "ip"=>"10.109.2.9"},
 "node-2.test.domain.local"=>{"id"=>"2", "ip"=>"10.109.2.8"}}

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

If I reapply puppet manually, nothing changes:

# puppet apply /etc/puppet/modules/osnailyfacter/modular/cluster/cluster.pp
Notice: /Stage[main]/Main/Pcmk_nodes[pacemaker]/pacemaker_nodes: pacemaker_nodes changed '{"node-6.test.domain.local"=>"6"}' to '{"node-5.test.domain.local"=>"5", "node-4.test.domain.local"=>"4", "node-6.test.domain.local"=>"6", "node-2.test.domain.local"=>"2"}'
Notice: Finished catalog run in 13.28 seconds

# corosync-cmapctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.members.6.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.6.ip (str) = r(0) ip(10.109.2.9)
runtime.totem.pg.mrp.srp.members.6.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.6.status (str) = joined

tags: added: granular
Revision history for this message
slava valyavskiy (slava-val-al) wrote :
tags: added: life-cycle-management
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
slava valyavskiy (slava-val-al) wrote :

The corosync configuration for node-6 is OK. The problem is that this node was not added to the corosync configuration files on the old cluster nodes.
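
A quick way to double-check this on one of the pre-existing controllers (node IDs and the cmap key are the ones from this report) would be:

# grep -B2 'nodeid: 6' /etc/corosync/corosync.conf
# corosync-cmapctl runtime.totem.pg.mrp.srp.members

If the first command returns nothing and the membership map has no .6. entries, the old controllers never learned about node-6.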

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Interestingly, restarting corosync together with pacemaker changes nothing either, even though the nodelist config in corosync.conf looks correct:

nodelist {
  node {
    # node-5.test.domain.local
    ring0_addr: 10.109.2.6
    nodeid: 5
  }
  node {
    # node-4.test.domain.local
    ring0_addr: 10.109.2.4
    nodeid: 4
  }
  node {
    # node-6.test.domain.local
    ring0_addr: 10.109.2.9
    nodeid: 6
  }
  node {
    # node-2.test.domain.local
    ring0_addr: 10.109.2.8
    nodeid: 2
  }
}

I'm not sure whether this is a pure LCM case, as Slava described in comment #5, or a dynamic cluster scale issue as well.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Dynamic scaling seems to work, so this should be a pure LCM issue; see comment #5.

PS: I tested the manual task run with "fuel node --node 2,4,5,6 --tasks cluster --env 1" and, even though there were "Call cib_apply_diff failed (-205): Update was older than existing configuration" errors on the nodes, the cluster membership was updated OK.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the log snippet from the reported logs: http://pastebin.com/Crihqbaf
It shows that the data in hiera was correct. I believe the issue is that when scaling corosync, the cluster task shall be executed on all controllers before proceeding with the other dependent tasks that manipulate pacemaker resources.
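
For illustration, reusing only the commands already mentioned in this thread, the manual sequence would be to re-run the cluster task across all controllers first and only then retry the failed node (node and env IDs are specific to this lab):

# fuel node --node 2,4,5,6 --tasks cluster --env 1
# fuel node --node 6 --env 1 --deploy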

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

Ksenia, I would like to propose using the 'deploy_changes' approach instead of calling direct API methods like deploy and provision for a separate node. Does that make sense?
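
For clarity, that means triggering a whole-environment deployment from the CLI instead of per-node provision/deploy calls, roughly along these lines (exact syntax may differ between Fuel releases):

# fuel --env 1 deploy-changes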

summary: Adding controller to the cluster fails with "Primitive 'p_dns' was not
- found in CIB!"
+ found in CIB!" because cluster task shall be executed on all corosync
+ nodes ahead of any other tasks manipulating with pacemaker resources
Revision history for this message
Ksenia Svechnikova (kdemina) wrote : Re: Adding controller to the cluster fails with "Primitive 'p_dns' was not found in CIB!" because cluster task shall be executed on all corosync nodes ahead of any other tasks manipulating with pacemaker resources

@slava-val-al

Yes, this workaround will help; that's why the issue doesn't have Critical priority. But we should still have the option of deploying only some nodes (https://docs.mirantis.com/openstack/fuel/fuel-7.0/user-guide.html#using-fuel-cli).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Okay, marking as Invalid, as the Fuel CLI is supposed to apply the deploy only to the specified nodes by design.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

@bogdando

Why was the issue marked as Invalid?

We have two options for adding a node to the cluster:

1) assign node to the env, assign role, deploy-changes
2) assign node to the env, assign role, run
        fuel node --node 6 --env 1 --provision
        fuel node --node 6 --env 1 --deploy

So this bug is about the second way of adding a node, and according to the docs we have such an option by design.

Revision history for this message
slava valyavskiy (slava-val-al) wrote :

Bug status has been changed to 'Confirmed' due to @kdemina's description.

Changed in fuel:
status: Invalid → Confirmed
tags: added: team-bugfix
description: updated
tags: added: area-python tricky
removed: area-library
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Ksenia, AFAIK, way #2 never worked by design; only deploy changes should be used. I believe this should be addressed in the guide instead.

Revision history for this message
Dmitry Tyzhnenko (dtyzhnenko) wrote :

Reproduced on 8.0-361

 Scenario:
1. Create new environment
2. Choose Neutron, VLAN
3. Choose cinder for volumes and Ceph for ephemeral
4. Add 3 controllers
5. Add 2 computes
6. Add 1 Cinder node
7. Add 3 Ceph nodes
8. Verify networks
9. Deploy the environment
10. Verify networks
11. Run OSTF tests
12. Reset cluster

Failed on step 9

Snapshot with logs attached

Revision history for this message
Egor Kotko (ykotko) wrote :

Got the same on ISO #361
On a bare-metal lab

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Egor, please add logs. @Dmitry, in the log snapshot you provided in comment #20, node-2 was deployed as a separate corosync cluster: Online: [ node-2.domain.local ]. This looks like an env-specific connectivity issue.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs, there were multiple issues with the corosync cluster:
2015-12-28T14:58:09.844304+00:00 node-1 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (2533-4530-15): Broken pipe (32)
2015-12-28T15:39:01.692282+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-14952-14): Broken pipe (32)
2015-12-28T15:39:30.812831+00:00 node-2 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (12296-14470-15): Broken pipe (32)
2015-12-28T15:40:48.299180+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-18797-14): Broken pipe (32)
2015-12-28T15:41:24.231353+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-21555-14): Broken pipe (32)
2015-12-28T15:44:13.955670+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-8720-14): Broken pipe (32)
2015-12-28T15:45:08.344040+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-12822-14): Broken pipe (32)
2015-12-28T15:47:19.501078+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-31522-14): Broken pipe (32)
2015-12-28T15:57:37.452896+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-2755-14): Broken pipe (32)
2015-12-28T15:58:02.834803+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-5758-14): Broken pipe (32)
2015-12-28T15:58:43.488773+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-11541-14): Broken pipe (32)
2015-12-28T15:59:06.798997+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-16187-14): Broken pipe (32)
2015-12-28T16:00:32.513386+00:00 node-3 crmd warning: warning: qb_ipcs_event_sendv: new_event_notification (13066-3744-14): Broken pipe (32)
Looks invalid due to env-specific issues. Moving back to Invalid.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The current status is still as reported in comment #19: https://bugs.launchpad.net/fuel/+bug/1528488/comments/19

summary: Adding controller to the cluster fails with "Primitive 'p_dns' was not
- found in CIB!" because cluster task shall be executed on all corosync
- nodes ahead of any other tasks manipulating with pacemaker resources
+ found in CIB!" because deploy_changes should be used instead of a single
+ node deploy commands