Full reassemble of Galera cluster fails in case of epoch divergence
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | High | Bogdan Dobrelya |
5.1.x | Invalid | High | Bogdan Dobrelya |
6.0.x | Won't Fix | High | Bogdan Dobrelya |
6.1.x | Won't Fix | High | Bogdan Dobrelya |
7.0.x | Won't Fix | High | Bogdan Dobrelya |
8.0.x | Won't Fix | High | Bogdan Dobrelya |
Mitaka | Won't Fix | High | Sergii Golovatiuk |
Newton | Won't Fix | High | Bogdan Dobrelya |
Bug Description
Regularly observed on the system test 'ceph_ha_restart', this time on: http://
Steps to reproduce:
1. Create cluster (Ubuntu, nova-network flat-dhcp, Ceph for images and volumes)
2. Add 3 nodes with controller and ceph OSD roles
3. Add 1 node with ceph OSD roles
4. Add 2 nodes with compute and ceph OSD roles
5. Deploy the cluster
6. Reset all nodes.
7. Check cluster status with 'crm status' and pacemaker logs on all controllers.
If mysql fails to start after the nodes are reset, pacemaker hangs waiting for the mysql status for 475 sec. This significantly delays the re-assembly of the cluster for other resources such as rabbitmq.
Related bug about rabbitmq: https:/
More detailed information collected while the CI test was running:
`crm status` right after the nodes were rebooted (Nov 3 09:43:15 2014) : http://
Pacemaker logs taken from ssh session:
- from controller-1: http://
- from controller-2: http://
- from controller-3: http://
`crm status` before the timeout of the test (Nov 3 09:49:12 2014) : http://
The pacemaker logs from controller-2 and controller-3 contain the following warning:
"<28>Nov 3 09:48:19 node-6 lrmd[2021]: warning: operation_finished: p_mysql_
Pacemaker did not perform any operations for ~8 minutes on any of the controllers until this timeout appeared, so the rabbitmq resource wasn't processed either.
Changed in fuel: | |
importance: | Undecided → High |
tags: | added: ha pacemaker |
Changed in fuel: | |
assignee: | Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando) |
Changed in fuel: | |
status: | Confirmed → In Progress |
Changed in fuel: | |
status: | Fix Committed → Confirmed |
tags: | added: area-library |
Changed in fuel: | |
status: | Confirmed → In Progress |
tags: | added: tech-debt |
There are no "timed out after 475000ms" log records in the snapshot; please make sure you created the snapshot properly.
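One way to check a set of log lines for such records could be sketched as follows. This is only an illustration: the helper name `find_timeouts` and the regular expression are assumptions built from the phrase quoted above, and nothing is assumed about the rest of the log line format.

```python
import re
from typing import Iterable, List

# Assumption: only the phrase quoted in this report is matched;
# the rest of the log line format is deliberately not assumed.
TIMEOUT_RE = re.compile(r"timed out after (\d+)ms")


def find_timeouts(lines: Iterable[str], min_ms: int = 0) -> List[str]:
    """Return the lines reporting a timeout of at least min_ms milliseconds."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match and int(match.group(1)) >= min_ms:
            # Strip the trailing newline so hits can be printed directly.
            hits.append(line.rstrip("\n"))
    return hits
```

An open log file can be passed directly, for example `find_timeouts(open(path, errors="replace"), min_ms=475000)`, where `path` points to a pacemaker log extracted from the snapshot. An empty result would match the observation in the comment above.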