Pacemaker and Application of an update diff failed (-206)

Bug #1283062 reported by Bogdan Dobrelya
This bug affects 2 people
Affects              Status          Importance   Assigned to                 Milestone
Fuel for OpenStack   Confirmed       High         Sergey Vasilenko
5.0.x                Fix Committed   High         Fuel Library (Deprecated)
6.0.x                Confirmed       High         Sergey Vasilenko

Bug Description

{"build_id": "2014-02-17_15-22-13", "mirantis": "no", "build_number": "133", "nailgun_sha": "4a37495bfeb70528653287224323b7997ca5d93a", "ostf_sha": "f86abe5544b5ffcf621e0c450bca15737c92361f", "fuelmain_sha": "91c07ac3c25361f36836904851fb066909f5fb3c", "astute_sha": "7eed50fc30cec675fff7787c37fcf6da6dd518ee", "release": "4.1", "fuellib_sha": "8e0b1ae5b1c4c137c1dd2a0be06d0d68e99d75bf"}

CentOS HA:
pacemaker-1.1.10-3.el6.2
corosync-1.4.6-26.2
libqb-0.14.2-4

Issue: after an unpredictable amount of uptime, if I run 'puppet apply' multiple times on a controller node in an HA deployment, stonith-ng starts failing to apply CIB update diffs (see node-3 in the snapshot):
==> /var/log/remote/node-3.test.domain.local/stonith-ng.log <==
2014-02-20T17:26:30.844111+00:00 warning: warning: cib_process_diff: Diff 0.121.1 -> 0.122.1 from local not applied to 0.121.1: Failed application of an update diff
2014-02-20T17:26:30.844317+00:00 notice: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)

crm_shadow --commit fails as well:
2014-02-21T10:52:51.025264+00:00 err: (/Stage[corosync_setup]/Osnailyfacter::Cluster_ha::Virtual_ips/Cluster::Virtual_ips[public_old]/Cluster::Virtual_ip[public_old]/Cs_commit[vip__public_old]/cib) change from absent to vip__public_old failed: Execution of '/usr/sbin/crm_shadow --force --commit vip__public_old' returned 50: Could not commit shadow instance 'vip__public_old' to the CIB: Application of an update diff failed
2014-02-21T13:29:27.967798+00:00 err: (/Stage[corosync_setup]/Osnailyfacter::Cluster_ha::Virtual_ips/Cluster::Virtual_ips[management_old]/Cluster::Virtual_ip[management_old]/Cs_commit[vip__management_old]/cib) change from absent to vip__management_old failed: Execution of '/usr/sbin/crm_shadow --force --commit vip__management_old' returned 50: Could not commit shadow instance 'vip__management_old' to the CIB: Application of an update diff failed

Note: the logs are from a lab with 4 days of uptime. Puppet apply was run about 10 times, and the last runs were issued with --logdest syslog, so you can find them in the snapshot.
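
For triage, the failed shadow commits can be pulled out of log lines like those above with a short script. This is only a sketch: the regular expression is derived from the sample lines in this report, the function name is hypothetical, and the timestamp handling assumes the ISO-style syslog prefix shown here (other Puppet log formats would need a different split).

```python
import re

# Pattern derived from the Puppet error lines quoted in this report (an assumption,
# not an official log format): captures the shadow name and the exit code.
COMMIT_FAILURE = re.compile(
    r"Execution of '/usr/sbin/crm_shadow --force --commit (?P<shadow>[^']+)' "
    r"returned (?P<code>\d+)"
)


def failed_shadow_commits(lines):
    """Return (timestamp, shadow_name, exit_code) for each failed commit line."""
    failures = []
    for line in lines:
        match = COMMIT_FAILURE.search(line)
        if match:
            # Assumes the line starts with a single-token timestamp.
            timestamp = line.split(" ", 1)[0]
            failures.append(
                (timestamp, match.group("shadow"), int(match.group("code")))
            )
    return failures


# Abbreviated sample input: one failure line and one unrelated line.
sample = [
    "2014-02-21T10:52:51.025264+00:00 err: Execution of "
    "'/usr/sbin/crm_shadow --force --commit vip__public_old' returned 50: "
    "Could not commit shadow instance 'vip__public_old' to the CIB: "
    "Application of an update diff failed",
    "2014-02-21T10:52:52.000000+00:00 notice: unrelated line",
]

for entry in failed_shadow_commits(sample):
    print(entry)
```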

Bogdan Dobrelya (bogdando) wrote :
Sergey Vasilenko (xenolog) wrote :

> libqb-0.14.2-4

We already ship Pacemaker with libqb-0.16 on both operating systems.
I can't reproduce this bug with that version.

Please try to reproduce this bug after updating, or close it as Incomplete.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Changed in fuel:
status: New → Confirmed
Matthew Mosesohn (raytrac3r) wrote :

Hasn't been confirmed by a second source - back to incomplete

Changed in fuel:
status: Confirmed → New
status: New → Incomplete
Sergey Vasilenko (xenolog) wrote :

I confirm it.

Changed in fuel:
status: Incomplete → Confirmed
Anastasia Palkina (apalkina) wrote :

Reproduced it on ISO #101
"build_id": "2014-04-14_01-00-26",
"mirantis": "yes",
"build_number": "101",
"nailgun_sha": "61410bcf3201cd737e68ece8ab15313acc746476",
"production": "dev",
"ostf_sha": "118c955085ea7829f3a34decd38d63554b74451c",
"fuelmain_sha": "ddc94c52c267f0276cbd8485d6e704aea05b23a5",
"astute_sha": "401bc474b1d8cebb8ba70b3b6154107e08fd725d",
"release": "5.0",
"fuellib_sha": "101f3645ead182bc47024ff7568b04554de06bba"

Anastasia Palkina (apalkina) wrote :
Bogdan Dobrelya (bogdando) wrote :

I removed the duplicate status because of the new details discovered; see https://lists.launchpad.net/fuel-dev/msg01100.html.
The root cause of this issue appears to be: "we use crm_attribute to store GTID, and in manifest we use cs_shadow/cs_commit for every pacemaker resource.
This leads to a cs_commit problem when the configuration in the shadow copy differs from the running configuration (the running config is changed by the RA):
'Could not commit shadow instance [..] to the CIB: Application of an update diff failed'"

Hence, making cs_commit/cs_shadow truly ensurable (and able to resolve "merge conflicts" on the fly) could help as well, but I believe these two issues should be tracked separately from now on.
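
To illustrate the suspected race, here is a toy model of a shadow copy going stale while the running configuration changes underneath it. It is not Pacemaker's actual diff algorithm: the Config and Shadow classes, the integer version counter, and the resource values are all hypothetical and exist only to show why a commit prepared against an older state can be rejected.

```python
from dataclasses import dataclass, field


class StaleShadowError(Exception):
    """Raised when the live config changed after the shadow was created."""


@dataclass
class Config:
    # Toy stand-in for the running CIB: a version counter plus key/value data.
    version: int = 0
    data: dict = field(default_factory=dict)

    def update(self, key, value):
        """Change the live config directly (like an RA calling crm_attribute)."""
        self.data[key] = value
        self.version += 1


@dataclass
class Shadow:
    # Snapshot of the live config plus the version it was taken from.
    base_version: int
    data: dict


def create_shadow(live):
    return Shadow(base_version=live.version, data=dict(live.data))


def commit(live, shadow):
    """Replace the live data only if nothing changed since the snapshot."""
    if live.version != shadow.base_version:
        raise StaleShadowError(
            f"shadow based on version {shadow.base_version}, "
            f"live config is at version {live.version}"
        )
    live.data = dict(shadow.data)
    live.version += 1


live = Config()
shadow = create_shadow(live)
shadow.data["vip__public_old"] = "hypothetical resource definition"
live.update("gtid", "42")  # the running config changes in the meantime

try:
    commit(live, shadow)
except StaleShadowError as exc:
    outcome = f"commit rejected: {exc}"
else:
    outcome = "commit applied"
print(outcome)
```

In this model, taking a fresh shadow and committing again would succeed, which is what a "merge conflict" aware cs_commit would have to do.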

Changed in fuel:
milestone: 5.0 → 5.1
status: Confirmed → Triaged
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
tags: added: ha
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Vladimir Kuklin (vkuklin) wrote :
Anastasia Palkina (apalkina) wrote :

A similar situation was reproduced on ISO #255
"build_id": "2014-06-16_00-31-15",
"mirantis": "yes",
"build_number": "255",
"ostf_sha": "67b61ed3788297fa5d985afec32498d8c0f812db",
"nailgun_sha": "984aa7a86487f1488c2f83c052904abd9f589b7f",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "6f355160366475d52050d7898a1080a95ecb9cbf",
"astute_sha": "17b1afa5f0dc8f4fca5ed4eb03ec566fbfb5ed19",
"release": "5.1",
"fuellib_sha": "99d74172887ab81d38132655d6e5d180e8726437"

1. Create new environment (Ubuntu, HA mode)
2. Choose VLAN segmentation
3. Choose both Ceph
4. Choose Murano installation
5. Add 3 controllers, compute, 3 ceph nodes
6. Start deployment. It was successful
7. There is an error in puppet.log on the second controller (node-5):

Mon Jun 16 11:03:04 +0000 2014 Puppet (err): Execution of '/usr/sbin/crm_shadow --force --commit dhcp' returned 50: Could not commit shadow instance 'dhcp' to the CIB: Application of an update diff failed
Mon Jun 16 11:03:04 +0000 2014 /Stage[main]/Neutron::Agents::Dhcp/Cs_commit[dhcp]/cib (err): change from absent to dhcp failed: Execution of '/usr/sbin/crm_shadow --force --commit dhcp' returned 50: Could not commit shadow instance 'dhcp' to the CIB: Application of an update diff failed

Anastasia Palkina (apalkina) wrote :
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #258. Deployment has failed.

"build_id": "2014-06-17_18-06-05",
"mirantis": "yes",
"build_number": "258",
"ostf_sha": "1740b5ce42ea1893f7d3e2c6cc59720bdb77c007",
"nailgun_sha": "057bb88abab1048322ed0ff48d632f8caf146e5a",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "ba9e19a3822d9c1dcda2f4046f2f5e3e6ac505dd",
"astute_sha": "17b1afa5f0dc8f4fca5ed4eb03ec566fbfb5ed19",
"release": "5.1",
"fuellib_sha": "ff050d23d8a845cd097f7aa617285da0ab1894f6"

1. Create new environment (CentOS, HA mode)
2. Choose nova-network, VLAN manager
3. Add 4 controllers, 2 computes, 3 cinder
4. Start deployment. It has failed
5. There are errors on the second controller:

Jun 18 17:04:56 err: (/Stage[main]/Cluster::Haproxy_ocf/Cs_commit[p_haproxy]/cib) change from absent to p_haproxy failed: Execution of '/usr/sbin/crm_shadow --force --commit p_haproxy' returned 50: Could not commit shadow instance 'p_haproxy' to the CIB: Application of an update diff failed

Anastasia Palkina (apalkina) wrote :
Changed in fuel:
importance: Medium → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/101242

Changed in fuel:
status: Triaged → In Progress
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Sergey Vasilenko (xenolog)
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/102773

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/101242
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=91e9eddcc44d1f5744d31fd97bb8d6136ab0eb2e
Submitter: Jenkins
Branch: master

commit 91e9eddcc44d1f5744d31fd97bb8d6136ab0eb2e
Author: Sergey Vasilenko <email address hidden>
Date: Thu Jun 19 16:20:54 2014 +0400

    remove unnided CIB definition for some OCF resources (part #1)

    Pacemaker resources will created only at deploy primary controller.
    At this stage we do not need pacemaker shadow mechanick, because
    we have only one running cluster node.

    For bunch of pacemaker resources using shadows is a unnecessarily.

    Blueprint: ha-pacemaker-improvements
    Closes-Bug: #1283062

    Change-Id: Iead6d37b6050905dba6042c4d58aafeb6d614664

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/102773
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a3485f98c297e29593e64ffbdfbdcd418ad947dd
Submitter: Jenkins
Branch: master

commit a3485f98c297e29593e64ffbdfbdcd418ad947dd
Author: Sergey Vasilenko <email address hidden>
Date: Thu Jun 26 12:50:16 2014 +0400

    remove unnided CIB definition for some OCF resources (part #2)

    Heat, Celiometer

    Pacemaker resources will created only at deploy primary controller.
    At this stage we do not need pacemaker shadow mechanick, because
    we have only one running cluster node.

    For bunch of pacemaker resources using shadows is a unnecessarily.

    Blueprint: ha-pacemaker-improvements
    Closes-Bug: #1283062

    Change-Id: I25507e335643af2acfb76f0ae29e2b69fab9cb68

Anastasia Palkina (apalkina) wrote :

Verified on ISO #274
"build_id": "2014-06-27_00-31-14",
"mirantis": "yes",
"build_number": "274",
"ostf_sha": "a4978638de3951dbc229276608a839a19ece2b70",
"nailgun_sha": "5f2944a8d5077a1c96acb076ba9194f670b818e8",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "bf8660309601cee2f8f3e1bb881d272e638dcffa",
"astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
"release": "5.1",
"fuellib_sha": "acc99fcd0ba9eeef0a504dc26507eb91ce757220"

Changed in fuel:
status: Fix Committed → Fix Released
Artem Panchenko (apanchenko-8) wrote :

Reproduced the same issue on 5.0.1 on bare metal:

api: '1.0'
astute_sha: 9a74b788be9a7c5682f1c52a892df36e4766ce3f
build_id: 2014-07-17_00-31-14
build_number: '134'
fuellib_sha: 2d1e1369c13bc9771e9473086cb064d257a21fc2
fuelmain_sha: 069686abb90f458f67cfcb4018cacc19971e4b4d
mirantis: 'yes'
nailgun_sha: 1d08d6f80b6514085dd8c0af4d437ef5d37e2802
ostf_sha: 09b6bccf7d476771ac859bb3c76c9ebec9da9e1f
production: docker
release: 5.0.1

Steps to reproduce:

1. Upgrade master node from 5.0 to 5.0.1
2. Deploy new cluster (Ubuntu + NeutronVlan + Ceph) with 4 controllers and 1 compute
3. Add one more controller and one compute node. Deploy changes
4. Deployment fails with: Error occurred while running method 'deploy'

Here is the part of puppet logs (node-16, added controller):

http://paste.openstack.org/show/87713/

All logs from the failed node are attached (I can't upload the full diagnostic snapshot because its size is 1.6GB)

Artem Panchenko (apanchenko-8) wrote :
Vladimir Kuklin (vkuklin) wrote :

Addressed by blueprint https://blueprints.launchpad.net/fuel/+spec/ha-pacemaker-improvements . This is an intermittent Pacemaker bug; it cannot be fixed in 5.0.1.

Changed in fuel:
status: Fix Released → Fix Committed
tags: added: release-notes
Vladimir Kuklin (vkuklin) wrote :

It looks like this problem still persists in the 5.0.x branch. It sometimes happens when the applied diff is empty, i.e. when the Pacemaker config being applied does not differ from the current one. On the second run of the same command the diff applies cleanly and everything is OK. So it looks like we need to catch this exception and simply retry.
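
The catch-and-retry idea can be sketched as below. This is not the fuel-library change itself (that lives in the Puppet provider); the function name, the attempt count, the delay, and the injectable `run` parameter are assumptions made for illustration. Only the command line is taken from the logs in this report.

```python
import subprocess
import time


def commit_shadow(name, attempts=2, delay=1.0, run=subprocess.run):
    """Run 'crm_shadow --force --commit NAME', retrying on a non-zero exit.

    Returns the number of the attempt that succeeded. The 'run' parameter is a
    hypothetical hook so the command runner can be replaced in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    command = ["/usr/sbin/crm_shadow", "--force", "--commit", name]
    last = None
    for attempt in range(1, attempts + 1):
        last = run(command, capture_output=True, text=True)
        if last.returncode == 0:
            return attempt
        if attempt < attempts:
            # Brief pause before committing the same shadow again.
            time.sleep(delay)
    raise RuntimeError(
        f"commit of shadow '{name}' failed after {attempts} attempts "
        f"(exit code {last.returncode}): {last.stderr.strip()}"
    )
```

Calling `commit_shadow("dhcp")` on a controller would run the real command. The sketch retries on any non-zero exit code; restricting the retry to exit code 50, the value seen in the logs here, would be a stricter variant.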

Vladimir Kuklin (vkuklin) wrote :

We also need to backport the fix to the 5.0 manifests to allow rollback to succeed.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/119263

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.0)

Reviewed: https://review.openstack.org/119263
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c891490242d7d88f3c9582cf023e8f2db0b9d337
Submitter: Jenkins
Branch: stable/5.0

commit c891490242d7d88f3c9582cf023e8f2db0b9d337
Author: Vladimir Kuklin <email address hidden>
Date: Fri Sep 5 04:14:06 2014 +0400

    Add crm shadow commit failure workaround

    Try to commit the same shadow again if it fails.
    Sometimes pacemaker fails to apply shadow
    which does not differ from current configuration.
    Manual reapplication succeeds and does not break
    deployment.

    Change-Id: Ie6f0fa8703fffeee27afc38a544f616e5cd97a29
    Closes-bug: #1283062

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/119380
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=5968b43f26c925341b47b71845610de9318eab99
Submitter: Jenkins
Branch: master

commit 5968b43f26c925341b47b71845610de9318eab99
Author: Dmitry Ilyin <email address hidden>
Date: Fri Sep 5 18:01:56 2014 +0400

    Backport fixes for swift and crm

    * Fix for broken swift deploy order
    * Fix for 'apply diff failed' situation
    * Fix missing stop for native heat service in HA mode

    Change-Id: I059d7bc48d073f37eca449b10bb9e66b24123c6b
    Closes-Bug: 1365951
    Related-Bug: 1364026
    Related-Bug: 1283062

Dennis Dmitriev (ddmitriev) wrote :

Reproduced on a CI job: http://jenkins-product.srt.mirantis.net:8080/job/6.0.ubuntu.promo_bvt/38/

Deployment failed on node-4.

============= http://paste.openstack.org/show/133186/ :
2014-11-14T14:34:01.667917+00:00 err: Execution of '/usr/sbin/pcs resource meta p_mysql target-role=Started' returned 1: Error: Unable to update cib
2014-11-14T14:34:01.668350+00:00 err: Call cib_replace failed (-206): Application of an update diff failed

The MySQL service is running, but the access credentials were not set:

root@node-4:~# mysql
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)

no longer affects: fuel/5.1.x
Dennis Dmitriev (ddmitriev) wrote :