upgrade jobs failing with "cluster remained unstable for more than 1800 seconds"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Sofer Athlan-Guyot |
Bug Description
Hi,
Since today, the ha-upgrade jobs have been failing. They fail during the restart of the resources, while the upgrade script waits for the services to come back, which gives:
20980:Jun 15 07:59:46 localhost os-collect-config: ERROR: cluster remained unstable for more than 1800 seconds, exiting.
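That message comes from a settle loop in the upgrade scripts that polls the cluster until it looks healthy. A minimal sketch of that kind of loop, assuming a pcs-based stability check (this is an illustration, not the actual tripleo-heat-templates code):

    #!/bin/bash
    # Hypothetical settle loop; the stability check below is an assumption,
    # not the actual upgrade script.
    TIMEOUT=1800
    START=$(date +%s)
    while true; do
        # treat the cluster as stable once pcs reports no failed actions
        if ! pcs status | grep -q 'Failed Actions'; then
            break
        fi
        if (( $(date +%s) - START > TIMEOUT )); then
            echo "ERROR: cluster remained unstable for more than ${TIMEOUT} seconds, exiting."
            exit 1
        fi
        sleep 4
    done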
From the corosync log we see that some services do not restart:
Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-
Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-
Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-
Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-
Jun 15 07:49:33 localhost pengine[11452]: notice: Calculated Transition 72: /var/lib/
Jun 15 07:49:33 localhost crmd[11453]: warning: Transition 72 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=...): Terminated
Jun 15 07:49:33 localhost crmd[11453]: warning: Transition failed: terminated
The "transition failed" and comparison with other successful log seems
to confirm this.
The error was reproduced locally.
Carlos Camacho (ccamacho) wrote : | #1 |
Carlos Camacho (ccamacho) wrote : | #2 |
From controller (1000 times): repeated errors in the aodh/evaluator log. The excerpt is truncated in this report; the surviving fragments ("...slate_failures", "..._with_cause", "...r.',)") suggest a single Python traceback repeating over and over.
Carlos Camacho (ccamacho) wrote : | #3 |
ceilometer/ ... (a block of repeated ceilometer log lines, truncated in this report)
Carlos Camacho (ccamacho) wrote : | #4 |
cluster/ ... (a block of repeated cluster log lines, truncated in this report)
Carlos Camacho (ccamacho) wrote : | #5 |
keystone/ ... (a block of repeated keystone log lines, truncated in this report)
Jiří Stránský (jistr) wrote : | #6 |
In the controller host info there is `sudo pcs status` output; it seems pacemaker couldn't start the VIPs:
Failed Actions:
* ip-fd00. ... last-
* ip-2001. ... last-
* ip-fd00. ... last-
* ip-fd00. ... last-
(the four VIP resource names and their last-failure details are truncated in this report)
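For reference, failed actions like these can be inspected and retried with standard pcs commands (the VIP resource name below is a placeholder, since the real names are truncated above):

    pcs status --full                    # show resources and failed operations in detail
    pcs resource cleanup <vip-resource>  # clear the failure history so pacemaker retries the start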
Carlos Camacho (ccamacho) wrote : | #7 |
This seems interesting, a truncated neutron warning about increasing the rpc_response_timeout option:
neutron/ ... creasing the rpc_response_ ...
Carlos Camacho (ccamacho) wrote : | #8 |
mysqld. ... istent or empty. (repeated several times; line truncated in this report)
neutron/ ... on server(s) may be overloaded and unable to respond quickly enough.
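For completeness, the option that warning refers to is a standard neutron/oslo.messaging setting; a hypothetical bump (the value is arbitrary, and this was not part of the eventual fix) would look like:

    # /etc/neutron/neutron.conf
    [DEFAULT]
    # default is 60 seconds
    rpc_response_timeout = 120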
Emilien Macchi (emilienm) wrote : | #9 |
Something is failing during ControllerPostC
It tries to stop all resources and check that they are stopped, but it times out.
Grep for "pacemaker is active"
and then scroll down until the timeout happens:
Jun 15 20:04:03 localhost os-collect-config: [2016-06-15 20:04:02,998] (heat-config) [INFO] {"deploy_stdout": "httpd has stopped\nERROR: cluster remained unstable for more than 1800 seconds, exiting.\n", "deploy_stderr": "+ echo 'pacemaker_
We need to figure out why it times out, and whether all the warnings in the logs are "normal". Examples:
Jun 15 19:52:56 localhost pengine[11455]: warning: Processing failed op start for ip-fd00.
or:
Jun 15 19:52:56 localhost crmd[11456]: warning: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=
or:
Jun 15 19:45:22 localhost crmd[11456]: notice: High CPU load detected: 1.040000
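A quick way to see which pacemaker warnings recur, and how often, is to aggregate them from the cluster log (path as on a default RHEL/CentOS install; adjust as needed):

    grep ' warning: ' /var/log/cluster/corosync.log \
      | sed 's/^.*warning: //' | sort | uniq -c | sort -rn | head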
summary: |
- Ha upgrade jobs failing with "cluster remained unstable for more than 1800 seconds"
+ upgrade jobs failing with "cluster remained unstable for more than 1800 seconds"
Changed in tripleo:
status: New → Confirmed
importance: Undecided → Critical
tags: added: alert
Michele Baldessari (michele) wrote : | #10 |
So just for the record, these two warnings are expected:
neutron/ ... creasing the rpc_response_ ... (the RPC timeout warning from comment #7)
This one is also harmless (just due to the fact that galera is not using SSL):
mysqld. ... istent or empty. (repeated)
neutron/ (truncated)
Our focus needs to be this:
IPaddr2(
Jun 15 06:43:00 [11450] overcloud-
Michele Baldessari (michele) wrote : | #11 |
Ran this review https:/
My current theory is that we removed some constraints in a piecemeal way, and it might be that we are triggering this bug:
https:/
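To verify which constraints are actually present after the upgrade step, pcs can dump them (standard commands, shown here for illustration):

    pcs constraint show --full       # every constraint, with ids
    pcs constraint order show        # ordering constraints only
    pcs constraint colocation show   # colocation constraints only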
Erno Kuvaja (jokke) wrote : | #12 |
I'm wondering whether the high load and pacemaker trying to fence controller oc-controller have something to do with this?
From http://
Jun 15 13:15:55 localhost crmd[11469]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-fd00.
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-2001.
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-fd00.
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-fd00.
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-fd00.
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-2001.
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-fd00.
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-fd00.
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-
Jun 15 13:15:55 localhost pengine[11468]: notice: Calculated Transition 87: /var/lib/
Jun 15 13:15:55 localhost crmd[11469]: warning: Transition 87 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=
Jun 15 13:15:55 localhost crmd[11469]: warning: Transition failed: terminated
Jun 15 13:15:55 localhost crmd[11469]: notice: Graph 87 with 12 actions: batch-limit=12 jobs, network-delay=0ms
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 104]: Pending rsc op openstack-
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 103]: Pending rsc op openstack-
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 106]: Pending pseudo op openstack-
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 105]: Pending pseu...
Erno Kuvaja (jokke) wrote : | #13 |
Also, over roughly an hour and twenty minutes, systemd starts more than 500 sessions for the rabbitmq user, which looks like quite a lot to me:
-bash-4.2$ egrep -e "systemd.
536
-bash-4.2$ egrep -e "systemd.
Jun 15 12:07:38 localhost systemd: Starting Session c1 of user rabbitmq.
-bash-4.2$ egrep -e "systemd.
Jun 15 13:28:45 localhost systemd: Starting Session c541 of user rabbitmq.
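The truncated commands above were presumably counting those sessions and bracketing the time window; an equivalent check (log path and exact pattern are assumptions) is:

    PATTERN='systemd: Starting Session .* of user rabbitmq'
    egrep -c "$PATTERN" /var/log/messages         # total count
    egrep "$PATTERN" /var/log/messages | head -1  # first session
    egrep "$PATTERN" /var/log/messages | tail -1  # last session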
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote : | #14 |
Just to comment on the IPaddr2 IPv6 issue: it is not the real problem here, as this error is present even in successful HA upgrades. For instance we got this:
unpack_
In http://
Compared with a successful run, the corosync log shows this:
Jun 15 07:49:33 [11453] overcloud- ... (a block of fourteen corosync log lines from the same second, all truncated in this report)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master) | #15 |
Fix proposed to branch: master
Review: https:/
Changed in tripleo:
assignee: nobody → Jiří Stránský (jistr)
status: Confirmed → In Progress
Changed in tripleo:
assignee: Jiří Stránský (jistr) → Sofer Athlan-Guyot (sofer-athlan-guyot)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master) | #16 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 7b22f2d8305931b
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Jun 16 17:58:55 2016 +0200
Colocation: make a group for pcmk nova resources.
This ensures that the nova-* services form a pacemaker group and
that pacemaker doesn't try to restart services elsewhere.
Closes-bug: 1592776
Change-Id: I629db624f41796
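For illustration, the grouping/colocation the commit describes corresponds to constraints along these lines (the real change is applied through puppet, and the clone resource names here are assumptions based on typical TripleO naming):

    # keep the nova services together and start them in a fixed order
    pcs constraint order start openstack-nova-consoleauth-clone then openstack-nova-scheduler-clone
    pcs constraint colocation add openstack-nova-scheduler-clone with openstack-nova-consoleauth-clone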
Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master) | #17 |
Change abandoned by Jiri Stransky (<email address hidden>) on branch: master
Review: https:/
Reason: Worked, but we found a better solution :)
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0b2 | #18 |
This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0b2 development milestone.
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master) | #19 |
Change abandoned by Athlan-Guyot sofer (<email address hidden>) on branch: master
Review: https:/
Reason: Not relevant anymore.
Reproduced locally. Error: http://paste.openstack.org/show/516225/
Steps:
Deploy the overcloud as:
openstack overcloud deploy \
  --libvirt-type qemu \
  --ntp-server pool.ntp.org \
  --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
Then run:
./tripleo-ci/scripts/tripleo.sh --overcloud-update