HA does not work in case if cluster node is suffering from out of memory

Bug #1422186 reported by Leontii Istomin
Affects             Status     Importance  Assigned to                Milestone
Fuel for OpenStack  Won't Fix  High        Fuel Library (Deprecated)
7.0.x               Won't Fix  High        Fuel Library (Deprecated)

Bug Description

[root@fuel ~]# fuel --fuel-version
api: '1.0'
astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
auth_required: true
build_id: 2015-02-07_20-50-01
build_number: '76'
feature_groups:
- mirantis
fuellib_sha: 64f3ebe9fcbd18bf6c80a948e06061783a090347
fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
nailgun_sha: 2ef819732a3ee7acf7b610e7d1c1a6da0434c1a0
ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
production: docker
release: 6.0.1
release_versions:
  2014.2-6.0.1:
    VERSION:
      api: '1.0'
      astute_sha: f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
      build_id: 2015-02-07_20-50-01
      build_number: '76'
      feature_groups:
      - mirantis
      fuellib_sha: 64f3ebe9fcbd18bf6c80a948e06061783a090347
      fuelmain_sha: c799e3a6d88289e58db764a6be7910aab7da3149
      nailgun_sha: 2ef819732a3ee7acf7b610e7d1c1a6da0434c1a0
      ostf_sha: 3b57985d4d2155510894a1f6d03b478b201f7780
      production: docker
      release: 6.0.1
Baremetal, Ubuntu, HA, Neutron-gre, Ceilometer, Ceph-all, Debug, 6.0.1_76
Controllers: 3, Computes: 97
Deployment was successful.
Rally light was successful: http://mos-scale.vm.mirantis.net/test_results/build_6.0.1-76/jenkins-10_env_run_rally_light-15-ubuntu-MSK-2015-02-13-20:26:27-2015-02-13-22:34:22/rally_report.html
But after the full rally test I found the following behaviour on the primary controller:
root@node-32:~# ls -la
-bash: fork: Cannot allocate memory
root@node-32:~# top
-bash: fork: Cannot allocate memory
root@node-32:~# less /var/log/messages
-bash: fork: Cannot allocate memory
root@node-32:~# dmesg
-bash: fork: Cannot allocate memory

I could execute the following command, but after that I lost the SSH connection:
root@node-32:~# exec ls
ceph.bootstrap-mds.keyring ceph.bootstrap-osd.keyring ceph.client.admin.keyring ceph.conf ceph.log ceph.mon.keyring openrc
Connection to node-32 closed.

I couldn't open a new SSH connection. I couldn't reach a shell on the node even via IPMI (screenshot is attached).

I rebooted the node at Feb 15 17:39:33 UTC. From messages (I can see the log files after rebooting):
<6>Feb 15 17:39:33 node-32 kernel: imklog 5.8.6, log source = /proc/kmsg started.
Latest message before rebooting:
<134>Feb 14 13:55:54 node-32 haproxy[11311]: 192.168.0.34:52990 [14/Feb/2015:13:14:34.260] mysqld mysqld/node-32 1/0/2480311 1015 -- 1141/1130/1130/1130/0 0/0
Now I can reach it via SSH and execute bash commands.

From the messages log:
Feb 15 18:39:49 node-32 kernel: [ 3650.131654] perf samples too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

The latest lines from keystone:
<15>Feb 14 13:55:54 node-32 keystone-all Auth token not in the request header. Will not build auth context.
<15>Feb 14 13:55:54 node-32 keystone-all arg_dict: {}

I have no idea how to find out why the node hung.
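
One possible first step is to measure how long the node was silent, using the syslog timestamps quoted in this report. The following is a minimal sketch, not a diagnostic tool: the function names are hypothetical, the year 2015 is an assumption (classic syslog timestamps carry no year), and month names are assumed to be English abbreviations.

```python
import re
from datetime import datetime

# Two syslog lines copied from this report. The "<NNN>" priority prefix is optional.
LAST_BEFORE_HANG = "<15>Feb 14 13:55:54 node-32 keystone-all arg_dict: {}"
FIRST_AFTER_REBOOT = (
    "<6>Feb 15 17:39:33 node-32 kernel: imklog 5.8.6, "
    "log source = /proc/kmsg started."
)

# Matches an optional priority tag followed by "Mon DD HH:MM:SS".
STAMP = re.compile(r"^(?:<\d+>)?([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")


def syslog_time(line, year=2015):
    """Return the timestamp of a classic syslog line as a datetime.

    The year is not part of the log line, so it is supplied by the caller
    (assumed to be 2015 here).
    """
    match = STAMP.match(line)
    if match is None:
        raise ValueError("no syslog timestamp found in: %r" % line)
    return datetime.strptime("%d %s" % (year, match.group(1)), "%Y %b %d %H:%M:%S")


def silent_seconds(before, after, year=2015):
    """Return the number of seconds between two syslog lines."""
    delta = syslog_time(after, year) - syslog_time(before, year)
    return int(delta.total_seconds())


gap = silent_seconds(LAST_BEFORE_HANG, FIRST_AFTER_REBOOT)
print("node was silent for %d seconds" % gap)
```

For the two lines above the gap is 99819 seconds, about 27 hours 44 minutes, which bounds the period during which the node logged nothing before the manual reboot.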

Diagnostic snapshot is here: https://drive.google.com/a/mirantis.com/file/d/0Bx4ptZV1Jt7helc3OFh3T19WcFk/view?usp=sharing

Leontii Istomin (listomin) wrote :
Changed in mos:
assignee: nobody → MOS Linux (mos-linux)
Dina Belova (dbelova)
Changed in mos:
importance: Undecided → Critical
Pavel Boldin (pboldin) wrote :

We will need access to the environment. Could you please coordinate with us on Skype?

Changed in mos:
assignee: MOS Linux (mos-linux) → Pavel Boldin (pboldin)
Pavel Boldin (pboldin) wrote :

2015-02-14 11:04:32.486 23509 DEBUG keystoneclient.session [-] Request returned failure status: 504 request /opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/session.py:383
2015-02-14 11:04:32.489 23509 WARNING rally.common.broker [-] Failed to consume a task from the queue: Authorization Failed: Gateway Timeout (HTTP 504)
2015-02-14 11:04:32.489 23509 ERROR rally.common.broker [-] Authorization Failed: Gateway Timeout (HTTP 504)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker Traceback (most recent call last):
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/common/broker.py", line 41, in _consumer
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker consume(cache, queue.popleft())
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/context/users.py", line 145, in consume
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker cache["client"] = keystone.wrap(clients.keystone())
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/osclients.py", line 63, in wrapper
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker self.cache[key] = func(self, *args, **kwargs)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/osclients.py", line 101, in keystone
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker client = create_keystone_client(kw)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/osclients.py", line 74, in create_keystone_client
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker return keystone_v2.Client(**args)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/v2_0/client.py", line 152, in __init__
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker self.authenticate()
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/utils.py", line 318, in inner
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker return func(*args, **kwargs)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/httpclient.py", line 503, in authenticate
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker resp = self.get_raw_token_from_identity_service(**kwargs)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/v2_0/client.py", line 196, in get_raw_token_from_identity_service
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker _("Authorization Failed: %s") % e)
2015-02-14 11:04:32.489 23509 TRACE rally.common.broker AuthorizationFailure: Authorization Failed: Gateway Timeout (HTTP 504)

Pavel Boldin (pboldin) wrote :

2015-02-14 06:31:35.420 3687 DEBUG keystoneclient.session [-] Request returned failure status: 500 request /opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/session.py:383
2015-02-14 06:31:35.424 3687 ERROR rally.common.broker [-] Maximum lock attempts on _lockrevocation-list occurred. (Disable debug mode to suppress these details.) (HTTP 500)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker Traceback (most recent call last):
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/common/broker.py", line 41, in _consumer
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker consume(cache, queue.popleft())
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/context/users.py", line 227, in consume
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker cache["client"].delete_user(user_id)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/wrappers/keystone.py", line 133, in delete_user
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker self.client.users.delete(user_id)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/v2_0/users.py", line 106, in delete
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker return self._delete("/users/%s" % base.getid(user))
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/base.py", line 210, in _delete
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker return self.client.delete(url, **kwargs)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/adapter.py", line 179, in delete
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker return self.request(url, 'DELETE', **kwargs)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/adapter.py", line 200, in request
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker resp = super(LegacyJsonAdapter, self).request(*args, **kwargs)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/adapter.py", line 89, in request
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker return self.session.request(url, method, **kwargs)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/utils.py", line 318, in inner
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker return func(*args, **kwargs)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker File "/opt/stack/.venv/lib/python2.7/site-packages/keystoneclient/session.py", line 384, in request
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker raise exceptions.from_response(resp, method, url)
2015-02-14 06:31:35.424 3687 TRACE rally.common.broker InternalServerError: Maximum lock attempts on _lockrevocation-list occurred. (Disable debug ...

Changed in mos:
status: New → Confirmed
Pavel Boldin (pboldin) wrote :

The node-32 has a failed hard disk drive.

Changed in mos:
status: Confirmed → Incomplete
Leontii Istomin (listomin) wrote :

There was a hardware issue with the disk:
http://paste.openstack.org/show/175278/

So the issue now is that corosync didn't migrate the VIP addresses and master roles (RabbitMQ, for example) to another controller.

no longer affects: mos
summary: - Cannot allocate memory on the primary controller node during rally test
+ HA does not work in case if cluster node is suffering from out of memory
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
importance: Undecided → High
importance: High → Medium
Bogdan Dobrelya (bogdando) wrote :

When there is no free memory available and the swap file is exhausted, this is expected behavior and cannot be fixed. Please provide the diagnostic logs snapshot, or at least the atop binary record, so we can figure out how much free RAM and swap space there was at the moment of the "fork: Cannot allocate memory" errors.
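
As a rough illustration of the two numbers requested here (free RAM and free swap), the sketch below parses text in the /proc/meminfo format. It is not a replacement for the atop record: the function names and the sample values are hypothetical, and on a node that can no longer fork, such a reading would have to come from a collector that was already running before memory ran out.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB. Lines without a numeric value are skipped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(info):
    """Summarise free RAM and swap from a parsed meminfo mapping."""
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "free_ram_kb": info.get("MemFree", 0),
        "free_swap_kb": swap_free,
        # Swap only counts as exhausted if some swap is configured at all.
        "swap_exhausted": swap_total > 0 and swap_free == 0,
    }


# Illustrative values only; these numbers are NOT taken from node-32.
SAMPLE = """\
MemTotal:       32945152 kB
MemFree:          131072 kB
SwapTotal:       8388604 kB
SwapFree:              0 kB
"""

summary = memory_summary(parse_meminfo(SAMPLE))
print(summary)

# On a live Linux node the same functions can read the real file:
#   with open("/proc/meminfo") as handle:
#       print(memory_summary(parse_meminfo(handle.read())))
```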

Changed in fuel:
importance: Medium → Wishlist
milestone: none → 6.1
status: Confirmed → Incomplete
Bogdan Dobrelya (bogdando) wrote :

Sorry, there *is* a diagnostic snapshot link provided, but could you please provide just the atop_current files from it separately?

Leontii Istomin (listomin) wrote :

I've reproduced the issue. The atop binary is here: https://drive.google.com/a/mirantis.com/file/d/0Bx4ptZV1Jt7hMWRvbXJ1QnZQaFU/view?usp=sharing
In this case the node hung at Feb 21 10:43:32.

Changed in fuel:
status: Incomplete → Confirmed
Bogdan Dobrelya (bogdando) wrote :

When there is no free memory available and the swap file is exhausted, this is expected behavior and cannot be fixed.

Changed in fuel:
status: Confirmed → Won't Fix
Changed in fuel:
importance: Wishlist → High
Bogdan Dobrelya (bogdando) wrote :

This issue was re-confirmed because, in the given case, the running resources at least must migrate away from the affected node. So the issue is valid from the cluster management perspective.

But a fix is not possible without built-in Pacemaker STONITH (fencing). Note that such an issue might not affect the health of the corosync service, so Pacemaker might not be able to detect the node failure and trigger STONITH. Some external monitoring may therefore be needed in addition, to "help" the stonith-ng daemon with the fencing decision.
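
The kind of external monitoring mentioned above could, for example, count consecutive failed health probes and only then recommend fencing. The sketch below is purely illustrative and is not how Fuel or Pacemaker implement this: the class name, the threshold, and the idea of a boolean "probe" result (for instance, whether the node could still start a new process) are all assumptions. It deliberately stops at a recommendation and leaves the actual fencing action to the cluster tooling.

```python
from collections import deque


class FencingAdvisor:
    """Recommend fencing after N consecutive failed health probes.

    Hypothetical helper: it only tracks probe results and never fences
    anything itself.
    """

    def __init__(self, node, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.node = node
        self.threshold = threshold
        # Keep only the most recent `threshold` probe results.
        self._recent = deque(maxlen=threshold)

    def record(self, probe_ok):
        """Store one probe result and return the current recommendation."""
        self._recent.append(bool(probe_ok))
        return self.should_fence()

    def should_fence(self):
        """True only when the last `threshold` probes all failed."""
        return len(self._recent) == self.threshold and not any(self._recent)


advisor = FencingAdvisor("node-32", threshold=3)
# Assumed probe history: the node answers once, then fails three times in a row.
for result in (True, False, False, False):
    recommended = advisor.record(result)
print("fence %s: %s" % (advisor.node, recommended))
```

Requiring several consecutive failures is a design choice to avoid fencing a controller because of a single slow probe; the right threshold and probe would depend on the environment.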

Bogdan Dobrelya (bogdando) wrote :

As this bug can only be fixed as part of the HA fencing feature, the status is Won't Fix and the bug is superseded by https://blueprints.launchpad.net/fuel/+spec/ha-fencing

Nastya Urlapova (aurlapova) wrote :

Bogdan, could you please give the fuel-docs team all the information needed to describe this behaviour in our production docs?

tags: added: qa-agree-7.0 release-notes
tags: removed: qa-agree-7.0
tags: added: qa-agree-7.0
tags: added: release-notes-done
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (stable/6.1)

Related fix proposed to branch: stable/6.1
Review: https://review.openstack.org/194961

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (stable/6.1)

Reviewed: https://review.openstack.org/194961
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=0e26e7d7cc153d179ec34985645dd23cdd239ddb
Submitter: Jenkins
Branch: stable/6.1

commit 5cc5f0c643aebecaf3bf4580535a3ea7c3334a6c
Author: Mike Scherbakov <email address hidden>
Date: Tue Jun 23 13:43:35 2015 -0700

    Removed streamlined patching backend pieces

    Change-Id: I955e76ccdbd12a9145f4e9b689f80bdf9fcaf929

commit 563c4b5c78ebfcb1f4f91047c2919f6270f9a1d4
Author: Mike Scherbakov <email address hidden>
Date: Tue Jun 23 13:30:30 2015 -0700

    Removed outdated patching guide

    Change-Id: I76180c277789ade9c5ebedd19fe2092847c0b7d9

commit 8d120c14bec1ab41d448683ad146a3053a57c4ee
Author: Irina Povolotskaya <email address hidden>
Date: Tue Jun 23 19:59:11 2015 +0300

    Add dual hypervisor ref arch into 6.1 docs

    Change-Id: I900c24c9de878eafadbfc995aa879b7f55737fac

commit feebd1592d3305b64bbdfd0bc5fe108190aef120
Author: OlgaGusarenko <email address hidden>
Date: Tue Jun 23 18:38:17 2015 +0300

    [OPs guide] Running Ceilometer section edits

    1. conf file extract is updated
    2. note is updated

    Closes-bug: 1467817
    Change-Id: I0217e164108e0ba6c1397045a5e57d13ff429223

commit 44a93f9dead7511a3461ec35248dbb689c81eafd
Author: OlgaGusarenko <email address hidden>
Date: Tue Jun 23 18:04:40 2015 +0300

    [RN6_1] Final changes

    1. capitalization
    2. 2014.2 to 2014.2.2
    3. general improvements

    Change-Id: I45057e90c90550559f66bc67ccdf97a559fd9000

commit bb41389cae58084285688853281516b659686422
Author: evkonstantinov <email address hidden>
Date: Tue Jun 23 16:45:35 2015 +0300

    Update patching decription

    Update patching description with
    the standard Linux commands.

    Change-Id: Ia1a8346639c468fdfce15a11d2430bf3a4731244

commit bf3018fae3f2e564413d33aba6cdebf8868f0b4e
Author: OlgaGusarenko <email address hidden>
Date: Tue Jun 23 15:55:49 2015 +0300

    [RN6_1] Clean up

    1. Rearranges sections
    2. Improves RST
    3. Changes titles order

    Change-Id: I6110bf515667d3d6ba08ad35ff5d593dbc96641e

commit 1c7e4457808e8f2d6c56fdf31252170972e444b9
Author: Maria Zlatkova <email address hidden>
Date: Tue Jun 23 15:26:28 2015 +0300

    Replaces VBOX screenshots

    This patch:
    - replaces VBOX screenshots
    - changes the link for Download Mirantis VirtualBox scripts
     to https://docs.mirantis.com/openstack/fuel/fuel-master/#downloads

    Change-Id: I58dede960c5c3355d39b07ff44b757403f6af02c
    Closes-Bug: #1467872

commit 0a568bf53fc0e25d1d692d5d74b4a7b4d983bbcc
Author: evkonstantinov <email address hidden>
Date: Tue Jun 23 14:01:55 2015 +0300

    6.1 --separate repos

    change wording and add links to the
    separate repos feature.

    Change-Id: Ib5d0778a0d8f1534f79ed2f553574cb69a3150b0

commit 95a188b21cbdd064d92696b7920e6a0105fe0c56
Author: Maria Zlatkova <email address hidden>
Date: Tue Jun 23 12:07:28 2015 +0300

    Corrects the output 'pcs status'

    Changes the example outputs to appropriate ones.

    Change-Id: Ib6d83...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/223135

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/223135
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=f974957f0fc14fd68be490501a3a1bc799adbe5a
Submitter: Jenkins
Branch: master

commit f974957f0fc14fd68be490501a3a1bc799adbe5a
Author: evkonstantinov <email address hidden>
Date: Mon Sep 14 17:43:00 2015 +0300

    Add HA Pacemaker memory issue to relnotes

    Change-Id: I7d134fd81337ab64cd04e220280017ad3f0e5965
    Related-Bug:#1422186

tags: added: ha