Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down/split brain

Bug #1472230 reported by Tatyanka
This bug affects 6 people
Affects              Status         Importance   Assigned to        Milestone
Fuel for OpenStack   Fix Released   High         Bogdan Dobrelya
  7.0.x              Won't Fix      High         Denis Puchkin
  8.0.x              Fix Released   High         Bogdan Dobrelya
  Mitaka             Fix Released   High         Bogdan Dobrelya

Bug Description

http://jenkins-product.srt.mirantis.net:8080/job/7.0.system_test.ubuntu.cic_maintenance_mode/18/testReport/junit/(root)/auto_cic_maintenance_mode/auto_cic_maintenance_mode/

๐’๐ญ๐ž๐ฉ๐ฌ:
1. Create a cluster
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster
5. Run OSTF
6. Check that the umm feature is enabled (umm status)
7. Trigger an unexpected reboot: reboot --force >/dev/null &
8. Check that the node is rebooted, comes back online and enters auto mode (umm status)
9. Disable umm mode: umm off and wait until umm stops
10. Wait until the node is back online in pcs
11. Run OSTF HA tests
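
For reference, steps 6-10 boil down to roughly the following commands on the target controller (a condensed sketch, not the exact test code):

umm status                    # step 6: check that the umm feature is enabled
reboot --force >/dev/null &   # step 7: simulate an unexpected reboot
# ... after the node is back online and in auto maintenance mode ...
umm off                       # step 9: disable umm mode
pcs status                    # step 10: wait until the node is back online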

๐€๐œ๐ญ๐ฎ๐š๐ฅ ๐ซ๐ž๐ฌ๐ฎ๐ฅ๐ญ:
OSTF rabbit tests failed:

Failed 2 OSTF tests; should fail 0 tests. Names of failed tests: [{u'RabbitMQ availability (failure)': u'Number of controllers is not equal to number of cluster nodes.'}, {u'RabbitMQ replication (failure)': u'Failed to connect to 5673 port on host 10.109.20.4 Please refer to OpenStack logs for more details.'}]

The failed rabbit node is node-2:

๐‚๐ฅ๐ฎ๐ฌ๐ญ๐ž๐ซ ๐ฌ๐ญ๐š๐ญ๐ฎ๐ฌ ๐จ๐Ÿ ๐ง๐จ๐๐ž '๐ซ๐š๐›๐›๐ข๐ญ@๐ง๐จ๐๐ž-๐Ÿ' ...
Error: unable to connect to node 'rabbit@node-2': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@node-2']

rabbit@node-2:
  * connected to epmd (port 4369) on node-2
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on node-2
  * suggestion: start the node

current node details:
- node name: 'rabbitmqctl11900@node-2'
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==

And it seems rabbit is not running on it:
root@node-2:~# ps uuax| grep erla
rabbitmq 8081 0.0 0.0 8132 1088 ? S 09:01 0:01 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
root 14648 0.0 0.0 10460 936 pts/0 S+ 10:23 0:00 grep --color=auto erla
root@node-2:~# ps uuax| grep beam
root 14729 0.0 0.0 10460 936 pts/0 S+ 10:24 0:00 grep --color=auto beam
root@node-2:~# ps uuax| grep rabb
rabbitmq 3438 0.0 0.4 90432 11844 ? Ss 09:00 0:00 /usr/bin/python /usr/bin/rabbit-fence.py
rabbitmq 8081 0.0 0.0 8132 1088 ? S 09:01 0:01 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon

And the OCF status check agrees (exit code 7 is OCF_NOT_RUNNING):
root@node-2:~# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server status ; echo $?
7

๐๐ฎ๐ญ ๐ฉ๐š๐œ๐ž๐ฆ๐š๐ค๐ž๐ซ ๐ฌ๐ก๐จ๐ฐ ๐ข๐ญ ๐š๐ฌ ๐ก๐ž๐š๐ฅ๐ญ๐ก๐ฒ ๐š๐ง๐ ๐จ๐ง๐ฅ๐ข๐ง๐ž ๐š๐ง๐ ๐ž๐ฏ๐ž๐ง ๐๐จ ๐ง๐จ ๐ญ๐ซ๐ฒ ๐ญ๐จ ๐ซ๐ž-๐ฎ๐ฉ:

Online: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public_vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__management_vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 p_ceilometer-agent-central (ocf::fuel:ceilometer-agent-central): Started node-1.test.domain.local
 p_ceilometer-alarm-evaluator (ocf::fuel:ceilometer-alarm-evaluator): Started node-2.test.domain.local

 ๐‘ด๐’‚๐’”๐’•๐’†๐’“/๐‘บ๐’๐’‚๐’—๐’† ๐‘บ๐’†๐’•: ๐’Ž๐’‚๐’”๐’•๐’†๐’“_๐’‘_๐’“๐’‚๐’ƒ๐’ƒ๐’Š๐’•๐’Ž๐’’-๐’”๐’†๐’“๐’—๐’†๐’“ [๐’‘_๐’“๐’‚๐’ƒ๐’ƒ๐’Š๐’•๐’Ž๐’’-๐’”๐’†๐’“๐’—๐’†๐’“]
    ๐‘ด๐’‚๐’”๐’•๐’†๐’“๐’”: [ ๐’๐’๐’…๐’†-1.๐’•๐’†๐’”๐’•.๐’…๐’๐’Ž๐’‚๐’Š๐’.๐’๐’๐’„๐’‚๐’ ]
   ๐‘บ๐’๐’‚๐’—๐’†๐’”: [ ๐’๐’๐’…๐’†-2.๐’•๐’†๐’”๐’•.๐’…๐’๐’Ž๐’‚๐’Š๐’.๐’๐’๐’„๐’‚๐’ ๐’๐’๐’…๐’†-5.๐’•๐’†๐’”๐’•.๐’…๐’๐’Ž๐’‚๐’Š๐’.๐’๐’๐’„๐’‚๐’ ]

Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

PCSD Status:
  10.109.22.4: Offline
  10.109.22.5: Offline
  10.109.22.8: Offline

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2014.2.2-7.0"
  api: "1.0"
  build_number: "26"
  build_id: "2015-07-06_18-08-24"
  nailgun_sha: "d040c5cebc9cdd24ef20cb7ecf0a337039baddec"
  python-fuelclient_sha: "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45"
  astute_sha: "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8"
  fuel-library_sha: "251c54e8de2f41aacd260751e7a891e9fbffc45d"
  fuel-ostf_sha: "a752c857deafd2629baf646b1b3188f02ff38084"
  fuelmain_sha: "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
summary: - Pecemaker shows healty status fro rabbitmq meanwhile onde node of
+ Pecemaker shows healthy status fro rabbitmq meanwhile onу node of
rabbitmq failed
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Pecemaker shows healthy status fro rabbitmq meanwhile onу node of rabbitmq failed

According to the logs, the monitor kept returning "not running" and pacemaker did not trigger any stop/start events, because this situation is considered OK (the resource may legitimately be not running after a graceful stop, for example). The solution is to return a generic error instead of "not running" when the script logic expects the resource to be restarted by pacemaker.
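
A minimal sketch of that return-code change (illustrative only, not the actual fuel-library OCF code; the beam check stands in for the agent's real get_status() logic):

#!/bin/sh
# Why the monitor action must report OCF_ERR_GENERIC rather than
# OCF_NOT_RUNNING when a restart by pacemaker is expected.
: ${OCF_ROOT:=/usr/lib/ocf}
. "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"

monitor() {
    if ! pgrep -f beam >/dev/null 2>&1; then
        # OCF_NOT_RUNNING (7) means "cleanly stopped", so pacemaker schedules
        # no recovery; OCF_ERR_GENERIC (1) marks the op as failed, making
        # pacemaker run stop + start on the resource.
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}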

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
status: New → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

https://review.openstack.org/199059
The patch should also be backported to the supported releases.

description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The patch https://review.openstack.org/190137 will address this issue as well: rabbitmqctl exit code 2 should be reported to pacemaker as a generic error.

summary: - Pecemaker shows healthy status fro rabbitmq meanwhile onу node of
- rabbitmq failed
+ Pacemaker shows healthy status for rabbitmq node meanwhile the node is
+ actually down
tags: added: rabbitmq
Revision history for this message
Mike Scherbakov (mihgen) wrote : Re: Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down

Bogdan - is it a regression?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

No, this is not a regression

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The patch https://review.openstack.org/190137 has merged. This issue is expected to be resolved as well.

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/199059
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5097d94f5d56fd6126ca9b7c1227961536c94399
Submitter: Jenkins
Branch: master

commit 5097d94f5d56fd6126ca9b7c1227961536c94399
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Jul 7 13:32:25 2015 +0200

    Fix error return codes for rabbit OCF

    W/o this fix the situation is possible when
    rabbit OCF returns OCF_NOT_RUNNING in the hope of
    future restart of the resource by pacemaker.

    But in fact, pacemaker will not trigger restart action
    if monitor returns "not running". This is an issue
    as we want resource restarted.

    The solution is to return OCF_ERR_GENERIC instead of
    OCF_NOT_RUNNING when we expect the resource to be restarted
    (which is action stop plus action start).

    Closes-bug: #1472230

    Change-Id: I10c6e43d92cb23596636d86932674b36864d1595
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
Tatyanka (tatyana-leontovich) wrote : Re: Pacemaker shows healthy status for rabbitmq node meanwhile the node is actually down

Verified on ISO 140.

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This bugfix introduced a regression, please look into bug #1484280.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The issue was reproduced again. Node-6's crm_mon -fotAW -1 output for the rabbit resource indicates that the last monitor check of the resource happened a day earlier:
   p_rabbitmq-server: migration-threshold=1000000
    + (164) probe: last-rc-change='Wed Aug 26 21:06:49 2015' last-run='Wed Aug 26 21:06:49 2015' exec-time=4229ms queue-time=0ms rc=0 (ok)
    + (169) monitor: interval=103000ms last-rc-change='Wed Aug 26 21:07:23 2015' exec-time=6375ms queue-time=0ms rc=0 (ok)
    + (170) monitor: interval=30000ms last-rc-change='Wed Aug 26 21:07:30 2015' exec-time=5573ms queue-time=6361ms rc=0 (ok)

Meanwhile, the current date is Thu Aug 27 12:35:43 UTC 2015, and /var/log/remote/node-6.domain.tld/lrmd.log is full of generic errors returned by the monitor action:
2015-08-27T12:39:29.046644+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbit node is running out of the cluster
2015-08-27T12:39:29.051209+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): get_status() returns generic error 1

A manually issued monitor check returns a generic error as well.

But something is definitely wrong with pacemaker, as it reports the status as OK and doesn't update the monitor statistics...
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.domain.tld ]
     Slaves: [ node-6.domain.tld node-7.domain.tld ]
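
When pacemaker is stuck with stale monitor results like this, the per-resource operation history can be inspected and cleared so that the resource is re-probed; these are standard pacemaker commands, used here purely as a diagnostic aid:

# operation history (rc codes, last-run times) for the rabbit resource
crm_mon -fotAW -1 | grep -A 3 p_rabbitmq-server
# drop the cached operation results and force a fresh probe
crm_resource --cleanup --resource p_rabbitmq-server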

Changed in fuel:
status: Fix Released → Confirmed
summary: Pacemaker shows healthy status for rabbitmq node meanwhile the node is
- actually down
+ actually down/split brain
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The last records in the pacemaker log where the monitor action returned OK are:
 Aug 26 21:07:30 [14594] node-6.domain.tld crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: ok (node=node-6.domain.tld, call=169, rc=0, cib-update=136, confirmed=false)
 Aug 26 21:07:35 [14594] node-6.domain.tld crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-6.domain.tld, call=170, rc=0, cib-update=137, confirmed=false)

After that, there were no more attempts to monitor the resource, and pacemaker thinks it is running OK. Looks like a pacemaker bug.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, these two operations were not confirmed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This snippet shows all the logged rabbit monitor events and the pacemaker failure list reported by pcs status: http://pastebin.com/GSD3RamW

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the logs and the snippet above, something went really wrong after 2015-08-27T11:16:00. Here is a snippet of the suspicious Stonith/Shutdown events, along with the surrounding events: http://pastebin.com/GkMHkbeG

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe pacemaker reached the "broken" state because there was no STONITH configured to fence the bad node-6. Hence, I am returning the status of this bug back to Fix Released. The reproduced case seems unrelated to the original bug.

Changed in fuel:
status: Confirmed → Fix Released
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Yes, I can now see there were a few events after which something became very broken in the pacemaker cluster, although it looks, and is reported as, healthy:

/var/log/remote/node-6.domain.tld/crmd.log:2015-08-27T11:15:52.760185+00:00 notice: notice: peer_update_callback: Our peer on the DC (node-1.domain.tld) is dead
/var/log/remote/node-6.domain.tld/crmd.log:2015-08-27T11:16:22.346080+00:00 warning: warning: reap_dead_nodes: Our DC node (node-7.domain.tld) left the cluster

Without STONITH enabled, this situation can probably lead to this type of bug. We should probably address this in the ops guide.
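
To check whether fencing is enabled on such an environment at all, the cluster property and the list of fencing resources can be queried (generic pacemaker/pcs commands; the exact pcs syntax may differ between versions):

# cluster-wide fencing switch; "false" means a misbehaving node is never fenced
crm_attribute --type crm_config --query --name stonith-enabled
# any configured fencing resources (none are expected on this setup)
pcs stonith show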

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Related ops guide update on this topic https://review.openstack.org/218150

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-docs (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/218150
Reason: this is wrong info

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

With the patch https://review.openstack.org/#/c/223548, this bug may be valid again. Returning to Fix Committed; additional verification is required.

Changed in fuel:
status: Fix Released → Fix Committed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As discussed with @Vladimir Kuklin, the test case itself may not be adequate in terms of the nodes' RAM configuration.

tags: added: on-verification
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

Verified on ISO #297 by the system test /fuelweb_test/tests/tests_strength/test_cic_maintenance_mode, test auto_cic_maintenance_mode

2015-09-17 19:17:25,132 - INFO decorators.py:46 -- Saving logs to "/home/agrechanichenko/fuel-qa/logs/pass_auto_cic_maintenance_mode-fuel-snapshot-2015-09-17_19-07-06.tar.xz" file
ok

----------------------------------------------------------------------
Ran 5 tests in 16307.860s

OK

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@Artem,

this issue is floating and I've just hit it on a bare-metal lab after a primary controller shutdown:

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-29.mirantis.com ]
     Slaves: [ node-28.mirantis.com node-35.mirantis.com ]
     Stopped: [ node-30.mirantis.com ]

Pacemaker says that RabbitMQ is running on node-35, but it's actually down:

root@node-35:~# ps auxfw | grep [r]abbit
rabbitmq 7332 0.0 0.0 90832 12956 ? Ss 08:58 0:03 /usr/bin/python /usr/bin/rabbit-fence.py
root@node-35:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-35' ...
Error: unable to connect to node 'rabbit@node-35': nodedown

rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-29' ...
[{nodes,[{disc,['rabbit@node-28','rabbit@node-29','rabbit@node-35']}]},
 {running_nodes,['rabbit@node-29']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]

There are no issues with server resources (most of the controllers have 16+ GB RAM, 8 CPUs and SSD drives): http://paste.openstack.org/show/472715/

Also, the fix https://review.openstack.org/#/c/223548 was merged to master (8.0) only; the patch for 7.0, https://review.openstack.org/#/c/223552/, is still in review.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Increased to Critical: we have a patch for backport, and the issue has reproduced twice, on Tema's and Matt's environments.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

The user impact is as follows: after a failover of some controller node, the RabbitMQ cluster can be rebuilt and run without some of the live controllers, which means the high availability of AMQP can be broken. For example:

1) The cloud has 5 controller nodes
2) One controller node goes down
3) The RabbitMQ cluster re-assembles, but the service is running on only one controller
4) The controller node with the live RabbitMQ goes down

Result: AMQP messages are lost and some cloud operations fail

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkanF2Z3cxYVljVVU/view?usp=sharing

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Nastya,

"we have patch for backport" << Which patch are you referring to?

Thanks,
Dims

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Nastya,

This one was already merged - https://review.openstack.org/#/c/223552/ - hence the question.

-- Dims

tags: removed: on-verification
tags: added: on-verification
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Fix released; the issue is not reproduced.

Checked on an environment with 3 controllers, 2 computes.

For each controller, in the order controller №3, controller №2, controller №1, and then controller №3 again:

1. Enable umm mode: `umm on` (the node will be rebooted automatically)
2. Wait until the node is rebooted
3. Disable umm mode: `umm off`
4. Wait until all required resources are started by pacemaker on the node: `pcs status`
5. Run OSTF HA tests
6. Repeat from step №1 for the next controller.

Result: OSTF HA tests pass successfully for each controller.
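
A condensed sketch of that verification loop (node names and the ssh wrapper are placeholders; OSTF itself is launched from the Fuel master):

# controllers are cycled in the order 3, 2, 1, and then 3 again
for node in node-3 node-2 node-1 node-3; do
    ssh "$node" umm on        # the node reboots into maintenance mode
    # ... wait for the node to come back ...
    ssh "$node" umm off
    ssh "$node" pcs status    # wait until all resources are Started
    # run the OSTF HA suite from the Fuel master before the next iteration
done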

[root@nailgun ~]# fuel --fuel-version
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '301'
build_number: '301'
feature_groups:
- mirantis
fuel-agent_sha: 50e90af6e3d560e9085ff71d2950cfbcca91af67
fuel-library_sha: 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c
fuelmain_sha: a65d453215edb0284a2e4761be7a156bb5627677
nailgun_sha: 4162b0c15adb425b37608c787944d1983f543aa8
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
release: '7.0'
release_versions:
  2015.1.0-7.0:
    VERSION:
      api: '1.0'
      astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
      build_id: '301'
      build_number: '301'
      feature_groups:
      - mirantis
      fuel-agent_sha: 50e90af6e3d560e9085ff71d2950cfbcca91af67
      fuel-library_sha: 5d50055aeca1dd0dc53b43825dc4c8f7780be9dd
      fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
      fuel-ostf_sha: 2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c
      fuelmain_sha: a65d453215edb0284a2e4761be7a156bb5627677
      nailgun_sha: 4162b0c15adb425b37608c787944d1983f543aa8
      openstack_version: 2015.1.0-7.0
      production: docker
      python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
      release: '7.0'

tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
tags: added: rca-done
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Bad news, folks. As we know, the fix for this issue, https://review.openstack.org/#/c/199059, was undone by https://review.openstack.org/#/c/223548 (see comment https://bugs.launchpad.net/fuel/+bug/1472230/comments/21).

And now we have 2 or 3 bugs with the same issue being reproduced again. Raising to Critical and attaching them here as duplicates.

Changed in fuel:
status: Fix Released → Confirmed
importance: High → Critical
no longer affects: fuel/8.0.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The bug is tricky and I have to analyze all of the new duplicates carefully to find all of the root causes, as there are likely many of them.

Changed in fuel:
status: Confirmed → In Progress
tags: added: ha tricky
removed: rca-done
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Two new duplicates point to the same flow of events, e.g. https://bugs.launchpad.net/fuel/+bug/1529875/comments/6 and https://bugs.launchpad.net/fuel/+bug/1530228/comments/5: at some point, after the rabbit OCF monitor reported an error followed by several "not running" reports, pacemaker starts thinking everything is fine with the resource and shows it as running in the status. That is very strange; I have no idea why it happens.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like a major bug either in pacemaker or in the OCF script. I'm still investigating, now using the dummy OCF; see details here: http://clusterlabs.org/pipermail/users/2016-January/002045.html

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The bug is likely in ocf-shellfuncs; see https://github.com/ClusterLabs/resource-agents/issues/734 for details and a workaround.

Changed in fuel:
status: In Progress → Triaged
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can confirm that the atop logs from both https://bugs.launchpad.net/fuel/+bug/1530228 and https://bugs.launchpad.net/fuel/+bug/1529875 contain the same pattern of misbehaving ocf-shellfuncs spawning 4-5 nested monitors. This should be a root cause of the issue, though I'm not sure it is the only one.
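
A quick way to spot that pattern on a live node is to look for monitor invocations whose parent is another monitor invocation rather than lrmd (an illustrative check matching the atop records above):

# normally there is a single monitor process parented by lrmd, not a chain
# of monitors parenting each other
ps -eo pid,ppid,args | grep '[r]abbitmq-server monitor'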

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Added the MOS packaging team to rebuild resource-agents for the MOS mirrors with the patch suggested by kskmori in https://github.com/ClusterLabs/resource-agents/issues/734

Changed in mos:
assignee: nobody → MOS Packaging Team (mos-packaging)
status: New → Triaged
Changed in fuel:
status: Triaged → In Progress
Changed in mos:
milestone: none → 8.0
importance: Undecided → Critical
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to fuel-infra/jeepyb-config (master)

Related fix proposed to branch: master
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/15971

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/jeepyb-config (master)

Reviewed: https://review.fuel-infra.org/15971
Submitter: Mateusz Matuszkowiak <email address hidden>
Branch: master

Commit: 96b80af93620e7dc6672db594d4930cd813e5639
Author: Ivan Udovichenko <email address hidden>
Date: Tue Jan 5 15:32:47 2016

Add resource-agents project [MOS 8.0]

- resource-agents

Current version in Ubuntu Trusty repository
doesn't satisfy required needs:
http://packages.ubuntu.com/trusty/resource-agents
1:3.9.3+git20121009-3ubuntu2

We need version 3.9.5 with applied patch on-top of it.

Change-Id: I4feccdc6d5bbd44e1b66b7e73c4e371338416efb
Related-Bug: #1472230

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Cannot reproduce this issue with the resource-agents fix for the shell fork bomb

Changed in fuel:
status: In Progress → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Moving to Invalid: there is nothing to fix on the Fuel side; the package will be patched instead.

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

Link to a resource-agents source package with patch: https://review.fuel-infra.org/#/c/15974/
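
Once the rebuilt package lands on the mirrors, a node can be checked roughly like this (the ocf-shellfuncs path is the standard resource-agents location; compare the dpkg version with the one from the review above):

# installed resource-agents version
dpkg -l resource-agents
# the helper library that the fix patches
ls -l /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs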

Changed in mos:
status: Triaged → In Progress
Changed in mos:
assignee: MOS Packaging Team (mos-packaging) → Ivan Udovichenko (iudovichenko)
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/resource-agents (8.0)

Reviewed: https://review.fuel-infra.org/15974
Submitter: Pkgs Jenkins <email address hidden>
Branch: 8.0

Commit: 35a347684dfde00ef5746aaa3291d1a76dae7c7d
Author: Ivan Udovichenko <email address hidden>
Date: Wed Jan 6 15:40:40 2016

Update resource-agents package [MOS 8.0]

Version: 1:3.9.5+git+a626847-1 experimental (rc-buggy) [1]
Add MIRA0001-Check-Bash-shell-presence.patch patch [2]

[1] https://packages.debian.org/experimental/resource-agents
[2] https://github.com/ClusterLabs/resource-agents/issues/734

Related-Bug: #1472230

Change-Id: I6c1d547d4341a6f22491d94f24811fb48a9f204c

Changed in mos:
status: In Progress → Fix Committed
Changed in mos:
status: Fix Committed → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A bunch of fixes for related bugs were accepted, so I'd better set the Fuel status to Fix Committed. Invalid doesn't really look right :/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/resource-agents (9.0)

Related fix proposed to branch: 9.0
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/16099

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/resource-agents (7.0)

Related fix proposed to branch: 7.0
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/16123

Changed in mos:
status: Confirmed → Fix Committed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Reproduced on ISO 429 (the fix was not included in it), but the scenario differs, so adding the case here to be verified on an ISO with the fix:
1. Deploy 1 controller
2. Add 2 controllers - redeploy - run OSTF
3. Add 2 controllers + 1 compute + cinder - redeploy - run OSTF, leave the env for 24 h - run OSTF
The OSTF test failed; crm status shows that all resources look like:
Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-7.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-7.test.domain.local ]
     Slaves: [ node-10.test.domain.local node-11.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-7.test.domain.local ]
     Slaves: [ node-11.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
     Stopped: [ node-10.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domain.local node-8.test.domain.local node-9.test.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-10.test.domain.local node-11.test.domain.local node-7.test.domai...


tags: added: on-verification
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "464"

./utils/jenkins/system_tests.sh -t test -w $(pwd) -j fuelweb_test -i $ISO_PATH -o --group=auto_cic_maintenance_mode -V ${VENV_PATH} -K
...
----------------------------------------------------------------------
Ran 5 tests in 19547.357s

OK

crm status
Warning: Permanently added 'node-1' (ECDSA) to the list of known hosts.
Last updated: Thu Jan 21 12:56:23 2016
Last change: Thu Jan 21 10:35:05 2016
Stack: corosync
Current DC: node-4.test.domain.local (4) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
48 Resources configured

Online: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]

 sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Started node-1.test.domain.local
 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-4.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Started node-4.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-4.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-5.test.domain.local ]
 p_ceilometer-agent-central (ocf::fuel:ceilometer-agent-central): Started node-4.test.domain.local
 p_ceilometer-alarm-evaluator (ocf::fuel:ceilometer-alarm-evaluator): Started node-4.test.domain.local
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-4.test.domain.local node-5.test.domain.local ]
 sysinfo_node-5.test.domain.local (ocf::pacemaker:SysInfo): Started node-5.test.domain.local
 sysinfo_node-4.test.domain.local (ocf::pacemaker...


no longer affects: mos
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

One more case to verify - controller deletion.
Reproduced on ISO 429; scenario:
https://mirantis.testrail.com/index.php?/tests/view/2465653&group_by=tests:status_id&group_order=asc&group_id=8

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/resource-agents (master)

Related fix proposed to branch: master
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/16570

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/resource-agents (master)

Reviewed: https://review.fuel-infra.org/16570
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: c699ba7e0c4b413fb7bbfa87382081d9a5422c5d
Author: Ivan Udovichenko <email address hidden>
Date: Fri Jan 29 16:22:02 2016

Update resource-agents package [MOS 8.0]

Version: 1:3.9.5+git+a626847-1 experimental (rc-buggy) [1]
Add MIRA0001-Check-Bash-shell-presence.patch patch [2]

[1] https://packages.debian.org/experimental/resource-agents
[2] https://github.com/ClusterLabs/resource-agents/issues/734

Related-Bug: #1472230

Change-Id: I6c1d547d4341a6f22491d94f24811fb48a9f204c
(cherry picked from commit 35a347684dfde00ef5746aaa3291d1a76dae7c7d)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

QA team, please note that the patch https://review.fuel-infra.org/#/c/16099/ for master (9.0) is only going to be merged today, so you can check the fix after that.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I'm not sure about the full path for the patch to get onto the 9.0 ISO, but you can check whether it is there via this link:
https://product-ci.infra.mirantis.net/view/9.0-liberty/job/9.0-liberty.all/lastSuccessfulBuild/artifact/listing.txt

For the 8.0 branch, for example, it is already there: https://product-ci.infra.mirantis.net/job/8.0.all/lastSuccessfulBuild/artifact/listing.txt

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/trusty/resource-agents (9.0)

Reviewed: https://review.fuel-infra.org/16099
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0

Commit: 2ffa57f7362a5b03153c8aa2c3bc2798532560e8
Author: Ivan Udovichenko <email address hidden>
Date: Thu Jan 14 10:08:48 2016

Update resource-agents package [MOS 8.0]

Version: 1:3.9.5+git+a626847-1 experimental (rc-buggy) [1]
Add MIRA0001-Check-Bash-shell-presence.patch patch [2]

[1] https://packages.debian.org/experimental/resource-agents
[2] https://github.com/ClusterLabs/resource-agents/issues/734

Related-Bug: #1472230

Change-Id: I6c1d547d4341a6f22491d94f24811fb48a9f204c
(cherry picked from commit 35a347684dfde00ef5746aaa3291d1a76dae7c7d)

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The nested monitors can still be reproduced; see the bug https://bugs.launchpad.net/fuel/+bug/1541029.

The node-2 atop.log contains up to 6 nested monitors, which might be a root cause of the segfaults and failures of the node:
PRG node-2 1454369647 2016/02/02 00:34:07 20 21797 (rabbitmq-server) S 0 0 21797 1 -2147483648 1454369645 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 12140 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369647 2016/02/02 00:34:07 20 22765 (rabbitmq-server) S 0 0 22765 1 -2147483648 1454369647 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 21797 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369647 2016/02/02 00:34:07 20 22768 (rabbitmq-server) S 0 0 22768 1 -2147483648 1454369647 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 22765 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369647 2016/02/02 00:34:07 20 22770 (rabbitmq-server) S 0 0 22770 1 -2147483648 1454369647 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 22768 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369727 2016/02/02 00:35:27 20 27082 (rabbitmq-server) S 0 0 27082 1 -2147483648 1454369727 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 12140 0 1 0 0 0 0 0 0 0 0
PRG node-2 1454369727 2016/02/02 00:35:27 20 27179 (rabbitmq-server) S 0 0 27179 1 -2147483648 1454369727 (/bin/bash /usr/lib/ocf/resource.d/fuel/rabbitmq-server monitor) 27082 0 1 0 0 0 0 0 0 0 0

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug seems indestructible

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

This bug doesn't meet 'critical' status. Moving it to high.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The nested monitors do not seem to be the root cause; we can consider this bug closed.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Bogdan, could the issue be a duplicate of #1559949?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Mikhail Samoylov (msamoylov) wrote :
Alexey Galkin (agalkin)
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Denis Puchkin (dpuchkin) wrote :

Won't Fix for 7.0-updates because this is too large a change to be accepted into the stable branch.
