Bug #1399272 “Openstack operations finish with 502 error after f...” : Bugs : Mirantis OpenStack

Revision history for this message

Tatyanka (tatyana-leontovich) wrote on 2014-12-04:

#1

fuel-snapshot-2014-12-04_16-17-44.tgz (62.2 MiB, application/x-tar)

Nastya Urlapova (aurlapova) on 2014-12-05

Changed in fuel:
milestone:	5.1.2 → 5.1.1
milestone:	5.1.1 → 5.1.2

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

#2

The logs indicate an issue with the rabbitmq cluster. As I mentioned here https://bugs.launchpad.net/fuel/+bug/1399181/comments/4 the rabbitmq failover procedure could take quite a while, and OSTF HA health checks should be used *prior* to any other checks. Please elaborate: what did the OSTF HA check report before step 6? Did you receive success for "RabbitMQ availability"?

Changed in fuel:
status:	New → Incomplete

Tatyanka (tatyana-leontovich) wrote on 2014-12-05:

#3

Bogdan, thank you for this comment. Yes, I did receive success for "RabbitMQ availability". Moreover, nova compute and nova scheduler work fine, and there is no problem with rabbit for these services; only cinder has a problem.

Changed in fuel:
status:	Incomplete → New

Tatyanka (tatyana-leontovich) on 2014-12-05

description:

updated

Stanislaw Bogatkin (sbogatkin) on 2014-12-05

no longer affects:

fuel/6.1.x

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

#4

Thanks for the update, I will investigate the logs.

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

#6

This bug looks related to a similar issue in the Nova project, https://bugs.launchpad.net/nova/+bug/1343613, therefore it could be an issue in oslo.db.

I believe these log records are related:
2014-12-04 15:30:53.162 20902 TRACE cinder.service OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None
2014-12-04T15:30:55.285613+00:00 err: 2014-12-04 15:30:55.277 20902 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed

Tatyanka (tatyana-leontovich) wrote on 2014-12-05:

#7

Yep, exactly the same as what I saw. Thank you, Bogdan.

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

#8

There are also signs of time desync:
2014-12-04T15:31:26.098633 node-7 ./node-7.test.domain.local/ntpdate.log:2014-12-04T15:31:26.098633+00:00 notice: step time server 10.108.0.2 offset -0.864229 sec
2014-12-04T15:52:49.706081 node-7 ./node-7.test.domain.local/ntpdate.log:2014-12-04T15:52:49.706081+00:00 notice: step time server 10.108.0.2 offset -103.894753 sec

that could have caused galera issues as well
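
The two ntpdate records above can be checked mechanically. The following is a minimal sketch, assuming the log lines keep the "step time server ... offset ... sec" wording shown above; the function name and the 1-second threshold are illustrative assumptions, not values taken from Fuel or Galera.

```python
import re

# Matches the server and offset reported in ntpdate "step time server" lines.
OFFSET_RE = re.compile(r"step time server (\S+) offset (-?\d+\.\d+) sec")


def large_time_steps(lines, threshold_sec=1.0):
    """Return (server, offset) pairs whose absolute offset exceeds the threshold.

    threshold_sec is an assumed cut-off chosen only for illustration.
    """
    hits = []
    for line in lines:
        match = OFFSET_RE.search(line)
        if match and abs(float(match.group(2))) > threshold_sec:
            hits.append((match.group(1), float(match.group(2))))
    return hits


# The two records quoted in the comment above.
NTPDATE_LINES = [
    "2014-12-04T15:31:26.098633+00:00 notice: step time server 10.108.0.2 offset -0.864229 sec",
    "2014-12-04T15:52:49.706081+00:00 notice: step time server 10.108.0.2 offset -103.894753 sec",
]

print(large_time_steps(NTPDATE_LINES))
```

With the assumed threshold only the second record is reported; lowering the threshold to 0.5 seconds would report both.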

Bogdan Dobrelya (bogdando) wrote on 2014-12-05:

#9

Actually, the offset -0.864229 sec event at 2014-12-04T15:31:26 shows that both aforementioned errors might have been caused by time desync in the galera cluster, so this bug is invalid.

Tatyanka (tatyana-leontovich) wrote on 2014-12-05:

#10

Stop, stop, stop. So there is another issue?

Tatyanka (tatyana-leontovich) wrote on 2014-12-05:

#11

Reproduced again:
Ubuntu HA with neutron, 3 controllers + 3 cinders + 1 compute + 1 cinder
cinder for volumes, ceph for images
Steps:
Destroy the primary controller
Wait until the system recovers (I waited for 20 minutes), then run OSTF HA: it passes
Run OSTF sanity: it passes
Run OSTF smoke: ALL tests except cinder pass
Create volume and boot instance from it
Volume was not created Please refer to OpenStack logs for more details.
Cinder tests failed on step 2 with the same error
2. Wait for volume status to become "available".

Tatyanka (tatyana-leontovich) wrote on 2014-12-05:

#12

The following errors appear in cinder-volume.log:
root@node-15:~# grep ERROR /var/log/cinder/cinder-volume.log
2014-12-05 12:54:02.020 22157 ERROR cinder.service [-] model server went away
2014-12-05 12:54:41.799 22157 ERROR cinder.service [-] Recovered model server connection!
2014-12-05 13:07:55.583 22157 ERROR cinder.service [-] model server went away
2014-12-05 13:07:55.588 22157 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: [Errno 110] Connection timed out
2014-12-05 13:07:55.598 22157 ERROR oslo.messaging._drivers.impl_rabbit [-] [Errno 110] Connection timed out
2014-12-05 13:07:55.621 22157 ERROR cinder.service [-] Recovered model server connection!
2014-12-05 13:07:58.607 22157 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.108.3.3:5673 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds.

After the services were restarted, the volume was created without problems.

Tatyanka (tatyana-leontovich) wrote on 2014-12-05:

#13

Moved to Invalid after discussion with Bogdan. QA needs to verify this case on HW and reopen the issue if the same behavior appears.

Tatyanka (tatyana-leontovich) on 2014-12-05

summary:

- Cinder operations finish with 502 error after failover
+ OS operations finish with 502 error after failover, with errors on
+ oslo.messaging

Tatyanka (tatyana-leontovich) on 2014-12-05

description:

updated

Tatyanka (tatyana-leontovich) wrote on 2014-12-05: Re: OS operations finish with 502 error after failover, with errors on oslo.messaging

#14

https://drive.google.com/file/d/0B_tSitrwrgvoeDh1Wmh2aEs1VkU/view?usp=sharing

Bogdan Dobrelya (bogdando) on 2014-12-08

summary:

- OS operations finish with 502 error after failover, with errors on
- oslo.messaging
+ Openstack operations finish with 502 error after failover, with errors
+ on oslo.messaging

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#15

Thanks, I will investigate the logs provided

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#16

According to the logs at #14, I cannot see socket closed errors in the cinder logs once the failover procedure completed (16:28 - 16:32), and I can see that the OSTF boot from volume passed w/o issues at 16:32 as well.

However, it looks like some services failed to restore their messaging after failover: http://paste.openstack.org/show/147047/

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#17

As far as I can see from the environment (the logs from which were provided in comment #14), there are issues with unconsumed AMQP messages:

For example, the logs clearly show that heat-engine at node-17,18 recovered its AMQP connection at 2014-12-05T16:41:34.500079+00:00 info: 2014-12-05 16:41:34.495
But the 'stack' OSTF checks are failing, and the queues notifications.error and notifications.info grow with unconsumed messages on each OSTF run
(see the attached report.txt and report2.txt files):
<email address hidden> notifications.error false false [{"x-ha-policy","all"}] ha-all 43 0 43
<email address hidden> notifications.error false false [{"x-ha-policy","all"}] ha-all 49 0 49
<email address hidden> notifications.info false false [{"x-ha-policy","all"}] ha-all 0 19 19

ceilometer-agent-notification at node-17,18 AMQP issues - known issue https://bugs.launchpad.net/mos/+bug/1380800
and there are many unconsumed messages at ceilometer.collector.metering queue, and it grows on each OSTF run:
<email address hidden> ceilometer.collector.metering false false [{"x-ha-policy","all"}] ha-all 2903 722 3625
<email address hidden> ceilometer.collector.metering false false [{"x-ha-policy","all"}] ha-all 3053 722 3775
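
The queue growth described above can be computed from the report rows. This is a rough sketch, assuming the three trailing numbers of each row are ready, unacknowledged and total message counts (the totals in the quoted rows are consistent with that reading) and assuming the first row of each pair comes from report.txt and the second from report2.txt. The helper names are hypothetical.

```python
import re

# One queue row: obfuscated pid, queue name, flags and policy, then three
# counters assumed to be ready / unacknowledged / total messages.
ROW_RE = re.compile(
    r"^<[^>]*>\s+(?P<queue>\S+)\s.*\s"
    r"(?P<ready>\d+)\s+(?P<unacked>\d+)\s+(?P<total>\d+)\s*$"
)


def queue_totals(report_lines):
    """Map queue name to its total message count for every parsable row."""
    totals = {}
    for line in report_lines:
        match = ROW_RE.match(line.strip())
        if match:
            totals[match.group("queue")] = int(match.group("total"))
    return totals


def queue_growth(before, after):
    """Return the queues whose total grew between two reports, with the delta."""
    return {
        queue: after[queue] - before[queue]
        for queue in after
        if queue in before and after[queue] > before[queue]
    }


# Rows quoted in the comment above, split by the assumed report of origin.
REPORT_1 = [
    '<email address hidden> notifications.error false false [{"x-ha-policy","all"}] ha-all 43 0 43',
    '<email address hidden> ceilometer.collector.metering false false [{"x-ha-policy","all"}] ha-all 2903 722 3625',
]
REPORT_2 = [
    '<email address hidden> notifications.error false false [{"x-ha-policy","all"}] ha-all 49 0 49',
    '<email address hidden> ceilometer.collector.metering false false [{"x-ha-policy","all"}] ha-all 3053 722 3775',
]

print(queue_growth(queue_totals(REPORT_1), queue_totals(REPORT_2)))
```

Any queue that keeps appearing in this result across consecutive OSTF runs is a candidate for a consumer that never reconnected.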

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#18

rabbitmqctl report (104.0 KiB, text/plain)

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#19

rabbitmqctl report after several OSTF runs (103.7 KiB, text/plain)

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#20

logs.tgz (21.5 MiB, application/x-tar)

These are the logs from #14 + several more OSTF runs added

Roman Prykhodchenko (romcheg) wrote on 2014-12-10:

#21

Moving from 6.0 to 6.1 due to HCF.

Vladimir Kuklin (vkuklin) wrote on 2014-12-11:

#22

It seems to be related primarily to the process of rabbitmq cluster reassembling. We have some issues with it and are going to improve the rabbitmq OCF script in the 6.0.1 and 6.1 releases. Nevertheless, after a decent amount of time the rabbitmq cluster reassembles, or it can be reassembled manually using the steps here: https://bugs.launchpad.net/fuel/+bug/1394635.

summary:

Openstack operations finish with 502 error after failover, with errors
- on oslo.messaging
+ on oslo.messaging due to long rabbitmq cluster reassembling

Ilya Pekelny (i159) wrote on 2014-12-11: Re: Openstack operations finish with 502 error after failover, with errors on oslo.messaging due to long rabbitmq cluster reassembling

#23

We took all the steps outlined in the bug description (Other case: Ubuntu HA with nova network, 3 controllers + 3 mongo + cinder + compute with ceilometer enabled), and we failed to reproduce the behavior in our environment (a link to the bug is here: https://bugs.launchpad.net/fuel/6.0.x/+bug/1399272). The thing is that the root cause of this behavior is as follows:

1) The primary controller is shut down.
2) The HA proxy has no time to connect to a different HA node (it's sleeping)
3) The RabbitMQ engine attempts to connect to the node that is switched off (primary controller) and crashes.

However, this bug doesn't have anything to do with Oslo.messaging for the following primary reasons:

1) While the primary controller is switched off, all the other services are not working either. Thus, the other services are non-functional at this point in time (they cannot connect to the controller). Brief evidence is given below:

http://paste.openstack.org/show/149436/ for MySQL fail logged in a controller node nova-cert.log
http://paste.openstack.org/show/149382/ oslo.messaging fail logged in the controller node nova-cert.log
http://paste.openstack.org/show/149381/ oslo.messaging fail logged in a cinder.log at cinder node

2) All the OSTF tests crash at this very moment, when the controller is switched off and HA proxy (or other routing/load balancing issue) has not yet found an alternative route, which can be determined by looking at the logs.

3) After several minutes had passed, the bug could not be reproduced either. The described log entries do not repeat.

4) Even when it comes to the Sanity Check tests, they should actually fail (even though they pass for some reason, which should not be the case, since the primary controller is down).

Conclusions

The conjectured cause of the described bug is:
RabbitMQ instances need some time to rebuild the cluster when a primary controller is down. While this process is running, every messaging task can (and definitely will) fail. This is expected behaviour. The problem is that this period is long enough for real tasks, such as launching an instance, to be attempted.

We are forced to mark the bug as invalid, since we cannot reproduce it consistently.
We will reopen the bug if we find failure reasons directly related to the oslo.messaging lib.

Nastya Urlapova (aurlapova) wrote on 2014-12-11:

#24

Ilya, this issue is not Invalid. All fixes that we make for 6.1 have to be backported to 6.0.1 and 5.1.2.

Bogdan Dobrelya (bogdando) wrote on 2014-12-12:

#25

Please note that this issue *could* also be fixed on the Oslo.messaging side, see x-cancel-on-ha-failover https://bugs.launchpad.net/nova/+bug/856764/comments/70

Bogdan Dobrelya (bogdando) wrote on 2014-12-12:

#26

According to https://bugs.launchpad.net/fuel/+bug/1399272/comments/17
this issue has nothing to do with the process of rabbitmq cluster reassembling. The rabbit cluster was OK in the environment where the issue was reproduced.

@Ilya, if you fail to reproduce the behavior, please contact Tatyana Leontovich; she was able to reproduce it 3/3 times.
Also, comment #17 was not addressed appropriately. Please get the reproduced env and analyse why the queues are growing.

Bogdan Dobrelya (bogdando) wrote on 2014-12-12:

#27

The conclusion is also not valid: there were no issues with the rabbitmq cluster, and the OSTF tests for Heat stack were still failing half an hour after the failover had completed w/o any issues.

summary:

Openstack operations finish with 502 error after failover, with errors
- on oslo.messaging due to long rabbitmq cluster reassembling
+ on oslo.messaging

Bogdan Dobrelya (bogdando) wrote on 2014-12-12:

#28

I updated the bug because it still has to be reproduced and investigated. I removed the change about long Rabbit cluster reassembling because, according to the logs in comment https://bugs.launchpad.net/fuel/+bug/1399272/comments/14, the cluster assembled as usual and the OSTF tests kept failing the whole time after that.

Nastya Urlapova (aurlapova) wrote on 2014-12-12:

#29

Bogdan, we still have the environment with the issue, could you investigate it?

Denis M. (dmakogon) wrote on 2015-01-30:

#30

Hello to All.

I was trying to reproduce this bug using the 2nd variant (3 controllers + 3 mongos + compute with ceilo + cinder), and I succeeded.
It does seem that right after shutting down one of the nodes it takes time for rabbit to rebalance its cluster. But if I run the set of smoke tests again, they complete with all "thumbs up" because the rabbit cluster (of two nodes) is available, so they don't fail with the same problem as in the first run.

I will try to figure out the most appropriate workarounds for the cases when rabbitmq is trying to rebalance.

Here is what cinder-volume shows: https://gist.github.com/denismakogon/4d7832f95f6fd8c70236 (a pretty similar thing was shown by nova-compute).

Denis M. (dmakogon) wrote on 2015-02-03:

#31

Hello to All, I've been looking at https://bugs.launchpad.net/fuel/+bug/1399272 and I think I was able to figure out why this problem happens. It takes a certain time to rebalance the rabbitmq cluster (this is the rabbitmq workflow), so while the cluster is in the "shutdown" state (the actual rebalancing), there is no way to consume or publish messages. My first suggestion is to increase the reconnection time (see rabbit_retry_interval) from 5 seconds to at least 20 seconds. But mostly this timeout depends on how long the rabbitmq cluster takes to rebalance.

Denis M. (dmakogon) wrote on 2015-02-04:

#32

Also, I've noticed the following. The configuration files of both the nova-compute and cinder-volume services are missing options that are very significant for the oslo.messaging workflow:

rabbit_max_retries - Maximum number of RabbitMQ connection retries. Default is 0 (infinite retry count). Should be set to 10.

rabbit_retry_interval - How frequently to retry connecting with RabbitMQ. Should be set to 10.

Also, it matters to increase kombu_reconnect_delay from 5 seconds to 20, because even if the rabbit cluster is up right after rebalancing, oslo.messaging still holds connections that were marked as 'shutdown' once cluster rebalancing started. So it is strongly recommended to use the combination of rabbit_max_retries, rabbit_retry_interval and kombu_reconnect_delay to prevent cases like this one.
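
For illustration, the three options can be rendered as a config fragment. This is only a sketch using the values suggested in this comment; placing them in the [DEFAULT] section is an assumption that should be checked against the oslo.messaging version in use, and the helper name is hypothetical.

```python
import configparser
import io

# Values suggested in the comment above; not verified defaults.
RECOMMENDED = {
    "rabbit_max_retries": "10",
    "rabbit_retry_interval": "10",
    "kombu_reconnect_delay": "20",
}


def render_overrides(section="DEFAULT"):
    """Render the suggested options as an INI fragment for the given section.

    The section name is an assumption; some releases may expect a different one.
    """
    parser = configparser.ConfigParser()
    parser[section] = RECOMMENDED
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_overrides())
```

The printed fragment could be merged by hand into nova.conf or cinder.conf on a test node before re-running the failover scenario.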

Denis M. (dmakogon) wrote on 2015-02-04:

#33

We were able to avoid the given bug by using the options described above; see two diffs from two different Cinder nodes (from two different envs):

https://gist.github.com/denismakogon/0fe4848324d48692047b

Fuel env-1 info:
[root@nailgun ~]# fuel --fuel-version
api: '1.0'
astute_sha: ed5270bf9c6c1234797e00bd7d4dd3213253a413
auth_required: true
build_id: 2015-02-01_22-05-01
build_number: '21'
feature_groups:
- mirantis
fuellib_sha: c5e4a0410ba66f9e9911f62b3b71c0b9c29aed6e
fuelmain_sha: ''
nailgun_sha: c0932eb5c2aa7fd1e13a999cb1b4bf5aff101c3b
ostf_sha: c9100263140008abfcc2704732e98fbdfd644068
production: docker
python-fuelclient_sha: 2ea7b3e91c1d2ff85110bf5abb161a6f4e537358
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: ed5270bf9c6c1234797e00bd7d4dd3213253a413
      build_id: 2015-02-01_22-05-01
      build_number: '21'
      feature_groups:
      - mirantis
      fuellib_sha: c5e4a0410ba66f9e9911f62b3b71c0b9c29aed6e
      fuelmain_sha: ''
      nailgun_sha: c0932eb5c2aa7fd1e13a999cb1b4bf5aff101c3b
      ostf_sha: c9100263140008abfcc2704732e98fbdfd644068
      production: docker
      python-fuelclient_sha: 2ea7b3e91c1d2ff85110bf5abb161a6f4e537358
      release: '6.1'

Fuel env-2 info:

[root@nailgun ~]# fuel --fuel-version
api: '1.0'
astute_sha: ed5270bf9c6c1234797e00bd7d4dd3213253a413
auth_required: true
build_id: 2015-02-01_22-05-01
build_number: '21'
feature_groups:
- mirantis
fuellib_sha: c5e4a0410ba66f9e9911f62b3b71c0b9c29aed6e
fuelmain_sha: ''
nailgun_sha: c0932eb5c2aa7fd1e13a999cb1b4bf5aff101c3b
ostf_sha: c9100263140008abfcc2704732e98fbdfd644068
production: docker
python-fuelclient_sha: 2ea7b3e91c1d2ff85110bf5abb161a6f4e537358
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: ed5270bf9c6c1234797e00bd7d4dd3213253a413
      build_id: 2015-02-01_22-05-01
      build_number: '21'
      feature_groups:
      - mirantis
      fuellib_sha: c5e4a0410ba66f9e9911f62b3b71c0b9c29aed6e
      fuelmain_sha: ''
      nailgun_sha: c0932eb5c2aa7fd1e13a999cb1b4bf5aff101c3b
      ostf_sha: c9100263140008abfcc2704732e98fbdfd644068
      production: docker
      python-fuelclient_sha: 2ea7b3e91c1d2ff85110bf5abb161a6f4e537358
      release: '6.1'

Dmitry Mescheryakov (dmitrymex) wrote on 2015-02-12:

#34

That is a bug in OpenStack, hence moving it to MOS project

no longer affects:	fuel/6.1.x
affects:	fuel → mos
Changed in mos:
milestone:	6.0.1 → none
no longer affects:	fuel/5.1.x
no longer affects:	fuel/6.0.x

Dmitry Savenkov (dsavenkov) wrote on 2015-02-13:

#35

It is very likely that this won't be fixed in any of the 6.0.x releases, due to a number of systemic issues that we hope to resolve once we have some stress-testing results.

Alexey Khivin (akhivin) wrote on 2015-03-03:

#36

After a little investigation, I think that two different issues were described in the discussion:
The first issue is a RabbitMQ cluster failure
The second issue is a Galera cluster failure

In our little investigation we saw both of them as consequences of the actions described above.

Rebooting a primary controller and killing haproxy can be quite enough to break up the galera cluster, but I think we can't kill rabbitmq by killing haproxy. There are some suspicions that we can break murano rabbitmq, but we need to investigate this in more detail, and it is not the subject of this discussion.

At the same time when we see
root@node-18:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-18' ...
[{nodes,[{disc,['rabbit@node-17','rabbit@node-18']}]},
{running_nodes,['rabbit@node-17','rabbit@node-18']},
{cluster_name,<<"rabbit@node-17">>},
{partitions,[]}]
...done.

It means that one of the rabbit instances has gone away from the cluster and will never come back automatically
(I will write about this a little later)
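
The check described above can be sketched as a small parser. It assumes the `nodes` and `running_nodes` entries each fit on one line, as in the output quoted above (real output may wrap), and it takes the expected cluster size of three controllers from the deployment description in this thread. The function name is hypothetical.

```python
import re


def parse_cluster_status(text):
    """Extract the member and running node sets from cluster_status output."""
    result = {}
    for line in text.splitlines():
        match = re.match(r"\[?\{(nodes|running_nodes),", line.strip())
        if match:
            # Node names are the single-quoted atoms on the matched line.
            result[match.group(1)] = set(re.findall(r"'([^']+)'", line))
    return result


# Output quoted in the comment above.
STATUS = """Cluster status of node 'rabbit@node-18' ...
[{nodes,[{disc,['rabbit@node-17','rabbit@node-18']}]},
{running_nodes,['rabbit@node-17','rabbit@node-18']},
{cluster_name,<<"rabbit@node-17">>},
{partitions,[]}]
...done."""

EXPECTED_CONTROLLERS = 3  # assumption: the 3-controller HA layout from this thread

status = parse_cluster_status(STATUS)
stopped_members = status["nodes"] - status["running_nodes"]
missing_members = EXPECTED_CONTROLLERS - len(status["nodes"])
print(stopped_members, missing_members)
```

An empty `stopped_members` set together with a positive `missing_members` count matches the situation described here: the remaining nodes look healthy, but one instance is no longer a cluster member at all.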

If one of the rabbit instances has left the cluster and halted, the system will work. But if one rabbit instance has left the cluster and returned as a standalone instance (we observed such an issue), then the whole cloud might be broken.

On the other hand, to say it again, killing HA-proxy while rebooting an instance can break the Galera cluster. So when you post issues that involve killing HA-proxy, please check the Galera cluster status. There are no mysql and galera logs in the snapshot and I can.

So my proposal is to discuss only the rabbitmq cluster failure in this particular ticket. If someone wants to post a ticket on a Galera cluster failure or an oslo.db error, it might be better to open a new ticket with galera and mysql logs in it and a description of the galera cluster status.

I will now try to reproduce the rabbitmq cluster failure

Alexey Khivin (akhivin) wrote on 2015-03-03:

#37

By the way

I just wonder

-- citation --
crm status says the same

Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Masters: [ node-17 ]
Slaves: [ node-18 ]

----

why crm shows just to nodes
Any HA deployment should contain three controllers at least

Alexey Khivin (akhivin) wrote on 2015-03-03:

#38

*why crm shows just tWo nodes

Alexey Khivin (akhivin) wrote on 2015-03-18:

#39

The reason for this http://paste.openstack.org/show/145627/ behaviour is a corosync script.

Each time the primary controller shuts down, corosync tries to rebuild the RabbitMQ cluster, and it first kills RabbitMQ on the other nodes (or sometimes stops the rabbit application but not the beam process). So the whole RabbitMQ cluster becomes unavailable for several minutes. After several experiments I saw that sometimes the RabbitMQ application was stopped by corosync on all nodes, and after that the whole RabbitMQ cluster became unavailable permanently (or for too long a period of time).

Alexey Khivin (akhivin) on 2015-03-19

tags:

added: ha

Alexander Bozhenko (alexbozhenko) wrote on 2015-03-26:

#40

@Denis M. (dmakogon) #33:

According to the docs:
http://docs.openstack.org/juno/config-reference/content/configuring-rpc.html
Default values are:
rabbit_max_retries = 0 (IntOpt) Maximum number of RabbitMQ connection retries. Default is 0 (infinite retry count).
rabbit_retry_interval = 1 (IntOpt) How frequently to retry connecting with RabbitMQ.

So your changes increased rabbit_retry_interval to 12 and added a limit on retries.
Do you have an idea of how that solved the problem?
Or are the default parameters not the same as the documentation says?
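
One way to frame the question is to compare how long a client keeps retrying under each setting. The sketch below assumes a fixed interval between attempts and treats 0 as "retry forever", as the quoted documentation states; any backoff the real driver may apply is not modelled, and the function name is hypothetical.

```python
import math


def retry_window_seconds(max_retries, retry_interval):
    """Approximate total time spent retrying, assuming a fixed interval.

    max_retries == 0 means an unlimited number of retries (per the quoted docs).
    """
    if max_retries == 0:
        return math.inf
    return max_retries * retry_interval


# Documented defaults quoted above versus the values proposed in comment #32.
documented_defaults = retry_window_seconds(max_retries=0, retry_interval=1)
proposed_values = retry_window_seconds(max_retries=10, retry_interval=10)
print(documented_defaults, proposed_values)
```

Under these assumptions the proposed values shorten the retry window from unbounded to a bounded one, so they would not by themselves explain the fix by "waiting longer".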

Alexey Khivin (akhivin) wrote on 2015-03-26:

#41

One part of this issue should be solved in

https://bugs.launchpad.net/fuel/+bug/1436812

Alexey Khivin (akhivin) wrote on 2015-04-14:

#42

What I see for 5.1.1:

I tried to reproduce this by shutting down and destroying the primary controller, but it seems to work fine

after some period of time
http://paste.openstack.org/show/203911/
http://paste.openstack.org/show/203913/

all tests passed except
"Some nova services have not been started.. Please refer to OpenStack logs for more details"

but nova-manage service list
works fine

Images and Volumes creation works fine

Alexey Khivin (akhivin) wrote on 2015-04-17:

#43

Checked for 6.1, and I am not able to reproduce this issue

http://paste.openstack.org/show/204502/

Bogdan Dobrelya (bogdando) wrote on 2015-05-17:

#44

As the related bug example https://bugs.launchpad.net/mos/+bug/1454174/comments/10 shows, there *are* some weird places in the Oslo.messaging logic. I believe we should re-investigate this one as well.

Alexey Khivin (akhivin) wrote on 2015-07-14:

#45

Can not reproduce

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Invalid	Critical	Alexey Khivin	Mirantis OpenStack 6.1
5.1.x	Invalid	Critical	Alexey Khivin	Mirantis OpenStack 5.1.1-updates
6.0.x	Invalid	Critical	Alexey Khivin	Mirantis OpenStack 6.0-updates
6.1.x	Invalid	Critical	Alexey Khivin	Mirantis OpenStack 6.1
