upgrade jobs failing with "cluster remained unstable for more than 1800 seconds"

Bug #1592776 reported by Sofer Athlan-Guyot
This bug affects 3 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sofer Athlan-Guyot
Milestone: (none)

Bug Description

Hi,

Since today, the ha-upgrade jobs have been failing:

 - example log: http://logs.openstack.org/18/329718/1/check-tripleo/gate-tripleo-ci-centos-7-upgrades/cb4b9c1/logs/postci.txt.gz#_2016-06-15_08_00_11_000

This fails during the restart of the resources, while the upgrade script waits for the services to come back up, which gives:

    20980:Jun 15 07:59:46 localhost os-collect-config: ERROR: cluster remained unstable for more than 1800 seconds, exiting.

From the corosync log we see that some services do not restart:

    Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-nova-scheduler:0#011(overcloud-controller-0)
    Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-cinder-volume#011(overcloud-controller-0 - blocked)
    Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-nova-api:0#011(overcloud-controller-0)
    Jun 15 07:49:33 localhost pengine[11452]: notice: Start openstack-nova-conductor:0#011(overcloud-controller-0)
    Jun 15 07:49:33 localhost pengine[11452]: notice: Calculated Transition 72: /var/lib/pacemaker/pengine/pe-input-72.bz2
    Jun 15 07:49:33 localhost crmd[11453]: warning: Transition 72 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-72.bz2): Terminated
    Jun 15 07:49:33 localhost crmd[11453]: warning: Transition failed: terminated

The "transition failed" and comparison with other successful log seems
to confirm this.
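
For anyone triaging a similar run, a quick way to spot these aborted transitions in the controller logs (a sketch; the path is the standard CentOS pacemaker log location):

    # Find aborted transitions and the pe-input files they reference.
    grep -E 'Transition failed|run_graph: Transition .*Terminated' \
        /var/log/cluster/corosync.log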

The error was reproduced locally.

Revision history for this message
Carlos Camacho (ccamacho) wrote :

Reproduced locally: Error: http://paste.openstack.org/show/516225/

Steps:

Deploy the overcloud as:

  openstack overcloud deploy \
  --libvirt-type qemu \
  --ntp-server pool.ntp.org \
  --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml

Then run:

  ./tripleo-ci/scripts/tripleo.sh --overcloud-update
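
While the update runs, it helps to watch the cluster from the controller so the moment resources stop coming back is visible. A sketch, assuming a standard pcs install on the controller:

    watch -n 10 'pcs status | head -n 60'                            # overall resource state
    tail -f /var/log/cluster/corosync.log | grep -E 'pengine|crmd'   # transitions as they happen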

Revision history for this message
Carlos Camacho (ccamacho) wrote :

From the controller (repeated roughly 1000 times):

aodh/evaluator.log:2016-06-15 11:52:11.794 31855 ERROR aodh.coordination
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination [-] Error connecting to coordination backend.
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination Traceback (most recent call last):
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/aodh/coordination.py", line 134, in start
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination self._coordinator.start()
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 373, in start
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination self._start()
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 440, in _start
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination self._server_info = self._client.info()
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination self.gen.throw(type, value, traceback)
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination cause=e)
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 753, in raise_with_cause
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 144, in raise_with_cause
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/six.py", line 692, in raise_from
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination raise value
aodh/evaluator.log:2016-06-15 11:52:12.793 31855 ERROR aodh.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

Revision history for this message
Carlos Camacho (ccamacho) wrote :

ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination [-] Error connecting to coordination backend.
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination Traceback (most recent call last):
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 102, in start
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination self._coordinator.start()
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 373, in start
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination self._start()
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 440, in _start
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination self._server_info = self._client.info()
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination self.gen.throw(type, value, traceback)
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination cause=e)
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 753, in raise_with_cause
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 144, in raise_with_cause
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination File "/usr/lib/python2.7/site-packages/six.py", line 692, in raise_from
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination raise value
ceilometer/central.log:2016-06-15 11:45:40.691 4702 ERROR ceilometer.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
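
Both the aodh and ceilometer tracebacks point at the tooz redis coordination backend being unreachable. A quick sanity check from the controller, as a sketch (the `redis_vip` hiera key and the pacemaker resource name are assumptions based on a standard TripleO HA deployment):

    pcs status | grep -A 3 -i redis          # is the pacemaker-managed redis resource up?
    redis-cli -h "$(hiera redis_vip)" ping   # expect PONG if the backend is reachable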

Revision history for this message
Carlos Camacho (ccamacho) wrote :

cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
cluster/corosync.log:Jun 15 11:34:24 [10168] overcloud-controller-0 pengine: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
cluster/corosync.log:Jun 15 11:35:18 [10169] overcloud-controller-0 crmd: warning: status_from_rc: Action 4 (ip-192.0.2.17_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
cluster/corosync.log:Jun 15 11:35:18 [10169] overcloud-controller-0 crmd: warning: status_from_rc: Action 4 (ip-192.0.2.17_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
cluster/corosync.log:galera(galera)[16850]: 2016/06/15_11:35:21 ERROR: MySQL is not running
cluster/corosync.log:galera(galera)[17388]: 2016/06/15_11:35:26 ERROR: MySQL is not running
cluster/corosync.log:Jun 15 11:44:38 [10169] overcloud-controller-0 crmd: warning: status_from_rc: Action 34 (neutron-server:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
cluster/corosync.log:Jun 15 11:44:38 [10169] overcloud-controller-0 crmd: warning: status_from_rc: Action 34 (neutron-server:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
cluster/corosync.log:Jun 15 11:44:48 [10169] overcloud-controller-0 crmd: warning: status_from_rc: Action 39 (httpd:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
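
The STONITH errors above normally just mean the `stonith-enabled` property was not yet set when the policy engine ran; in TripleO CI STONITH is disabled. A way to double-check, as a sketch:

    pcs property show stonith-enabled   # should normally report stonith-enabled: false
    crm_verify -L -V                    # the check the log itself suggests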

Revision history for this message
Carlos Camacho (ccamacho) wrote :

keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth Traceback (most recent call last):
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 540, in _do_fetch_token
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth return data, access.create(body=data, auth_token=token)
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 94, in inner
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth return func(*args, **kwargs)
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth File "/usr/lib/python2.7/site-packages/keystoneauth1/access/access.py", line 49, in create
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth raise ValueError('Unrecognized auth response')
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth ValueError: Unrecognized auth response
keystone/keystone.log:2016-06-15 11:45:04.176 4829 ERROR keystone.middleware.auth

Revision history for this message
Jiří Stránský (jistr) wrote :

http://logs.openstack.org/18/329718/1/check-tripleo/gate-tripleo-ci-centos-7-upgrades/cb4b9c1/logs/overcloud-controller-0/var/log/host_info.txt.gz

In the controller host info there's `sudo pcs status` output; it seems like pacemaker couldn't start up the VIPs:

Failed Actions:
* ip-fd00.fd00.fd00.4000..10_start_0 on overcloud-controller-0 'unknown error' (1): call=6, status=complete, exitreason=' failed.',
    last-rc-change='Wed Jun 15 06:43:00 2016', queued=0ms, exec=589ms
* ip-2001.db8.fd00.1000..10_start_0 on overcloud-controller-0 'unknown error' (1): call=78, status=complete, exitreason=' failed.',
    last-rc-change='Wed Jun 15 06:43:48 2016', queued=0ms, exec=66ms
* ip-fd00.fd00.fd00.3000..10_start_0 on overcloud-controller-0 'unknown error' (1): call=81, status=complete, exitreason=' failed.',
    last-rc-change='Wed Jun 15 06:43:49 2016', queued=0ms, exec=57ms
* ip-fd00.fd00.fd00.2000..11_start_0 on overcloud-controller-0 'unknown error' (1): call=89, status=complete, exitreason=' failed.',
    last-rc-change='Wed Jun 15 06:43:53 2016', queued=0ms, exec=71ms
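
A few commands that help dig into failed IPaddr2 starts like these (the resource name is taken from the pcs output above; `debug-start` is intrusive, so only on a throwaway overcloud):

    pcs resource failcount show ip-fd00.fd00.fd00.4000..10        # how often it failed
    pcs resource debug-start ip-fd00.fd00.fd00.4000..10 --full    # run the agent by hand, verbose
    pcs resource cleanup ip-fd00.fd00.fd00.4000..10               # clear the failure once fixed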

Revision history for this message
Carlos Camacho (ccamacho) wrote :

This seems interesting:
neutron/dhcp-agent.log:2016-06-15 12:19:20.373 3441 ERROR neutron.common.rpc [req-7309316c-170e-44a6-8106-11011169cd50 - - - - -] Timeout in RPC method report_state. Waiting for 2 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
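
If the RPC timeouts turn out to matter, the option the message refers to lives in neutron.conf. A hypothetical tuning example only (the value is illustrative, and crudini availability on the controller is assumed):

    crudini --set /etc/neutron/neutron.conf DEFAULT rpc_response_timeout 120
    # In an HA overcloud the agent is pacemaker-managed, so restart it via pcs,
    # e.g. `pcs resource restart neutron-dhcp-agent-clone` (resource name assumed).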

Revision history for this message
Carlos Camacho (ccamacho) wrote :

mysqld.log:2016-06-15 11:35:30 139935821121664 [ERROR] mysqld: Server GSSAPI error (major 851968, minor 2529639093) : gss_acquire_cred failed -Unspecified GSS failure. Minor code may provide more information. Keytab FILE:/etc/krb5.keytab is nonexistent or empty.
mysqld.log:2016-06-15 11:35:30 139935821121664 [ERROR] Plugin 'gssapi' init function returned error.
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: error:02001002:system library:fopen:No such file or directory
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: error:2006D080:BIO routines:BIO_new_file:no such file
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
neutron/l3-agent.log:2016-06-15 12:12:51.247 31961 ERROR neutron.common.rpc [-] Timeout in RPC method report_state. Waiting for 30 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.

Revision history for this message
Emilien Macchi (emilienm) wrote :

Something is failing during ControllerPostConfig when running the pacemaker_resource_restart.sh script.
It tries to stop all resources and check that they have stopped, but it times out:

See:
http://logs.openstack.org/69/330069/2/check-tripleo/gate-tripleo-ci-centos-7-upgrades/aecb52e/logs/overcloud-controller-0/var/log/messages

Grep: "pacemaker is active"
and then scroll down, until timeout happens:
Jun 15 20:04:03 localhost os-collect-config: [2016-06-15 20:04:02,998] (heat-config) [INFO] {"deploy_stdout": "httpd has stopped\nERROR: cluster remained unstable for more than 1800 seconds, exiting.\n", "deploy_stderr": "+ echo 'pacemaker_resource_restart script begins'\n++ systemctl is-active pacemaker\n+ pacemaker_status=active\n+ echo 'pacemaker is active'\n++ hiera bootstrap_nodeid\n++ facter hostname\n++ hiera stack_action\n+ '[' active = active -a overcloud-controller-0 = overcloud-controller-0 -a UPDATE = UPDATE ']'\n+ echo 'pacemaker is active, node is master and we'\\''re running an update'\n+ grep 'start neutron-server-clone then start neutron-ovs-cleanup-clone'\n+ pcs constraint order show\n+ pcs resource disable httpd\n+ check_resource httpd stopped 300\n+ '[' 3 -ne 3 ']'\n+ service=httpd\n+ state=stopped\n+ timeout=300\n+ '[' stopped = stopped ']'\n+ match_for_incomplete=Started\n++ sed 's/ /\\|/g'\n++ sed 's/.*\\[ \\(.*\\) \\]/\\1/g'\n++ grep '^Online'\n++ pcs status\n+ nodes_local=overcloud-controller-0\n+ timeout -k 10 300 crm_resource --wait\n++ egrep overcloud-controller-0\n++ grep -v Clone\n++ grep httpd\n++ pcs status --full\n++ true\n+ node_states=\n+ grep -q Started\n+ echo ''\n+ echo 'httpd has stopped'\n+ pcs resource disable openstack-core\n+ check_resource openstack-core stopped 1800\n+ '[' 3 -ne 3 ']'\n+ service=openstack-core\n+ state=stopped\n+ timeout=1800\n+ '[' stopped = stopped ']'\n+ match_for_incomplete=Started\n++ grep '^Online'\n++ pcs status\n++ sed 's/.*\\[ \\(.*\\) \\]/\\1/g'\n++ sed 's/ /\\|/g'\n+ nodes_local=overcloud-controller-0\n+ timeout -k 10 1800 crm_resource --wait\n+ echo_error 'ERROR: cluster remained unstable for more than 1800 seconds, exiting.'\n+ echo 'ERROR: cluster remained unstable for more than 1800 seconds, exiting.'\n+ tee /dev/fd2\n+ exit 1\n", "deploy_status_code": 1}
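
For readability, roughly what the traced check_resource step does (a paraphrase of the deploy_stderr above, not the verbatim script):

    service=openstack-core
    timeout=1800
    pcs resource disable "$service"
    # Wait for pending transitions to settle; hitting this timeout is what
    # produces the "cluster remained unstable" error in this bug.
    if ! timeout -k 10 "$timeout" crm_resource --wait; then
        echo "ERROR: cluster remained unstable for more than $timeout seconds, exiting."
        exit 1
    fi
    # Then confirm no copy of the resource is still reported as Started.
    pcs status --full | grep "$service" | grep -v Clone | grep -q Started \
        && { echo "ERROR: $service failed to stop"; exit 1; }
    echo "$service has stopped"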

We need to figure out why it times out and whether all the warnings in the logs are "normal". Example:
Jun 15 19:52:56 localhost pengine[11455]: warning: Processing failed op start for ip-fd00.fd00.fd00.3000..18 on overcloud-controller-0: unknown error (1)

or:
Jun 15 19:52:56 localhost crmd[11456]: warning: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Terminated

or:
Jun 15 19:45:22 localhost crmd[11456]: notice: High CPU load detected: 1.040000

summary:
- Ha upgrade jobs failing with "cluster remained unstable for more than 1800 seconds"
+ upgrade jobs failing with "cluster remained unstable for more than 1800 seconds"
Changed in tripleo:
status: New → Confirmed
importance: Undecided → Critical
tags: added: alert
Revision history for this message
Michele Baldessari (michele) wrote :

So just for the record, these two warnings are expected:
neutron/dhcp-agent.log:2016-06-15 12:19:20.373 3441 ERROR neutron.common.rpc [req-7309316c-170e-44a6-8106-11011169cd50 - - - - -] Timeout in RPC method report_state. Waiting for 2 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.

This one is also harmless (just due to the fact that galera is not using ssl stuff)
mysqld.log:2016-06-15 11:35:30 139935821121664 [ERROR] mysqld: Server GSSAPI error (major 851968, minor 2529639093) : gss_acquire_cred failed -Unspecified GSS failure. Minor code may provide more information. Keytab FILE:/etc/krb5.keytab is nonexistent or empty.
mysqld.log:2016-06-15 11:35:30 139935821121664 [ERROR] Plugin 'gssapi' init function returned error.
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: error:02001002:system library:fopen:No such file or directory
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: error:2006D080:BIO routines:BIO_new_file:no such file
mysqld.log:2016-06-15 11:35:30 139935821121664 [Warning] SSL error: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
neutron/l3-agent.log:2016-06-15 12:12:51.247 31961 ERROR neutron.common.rpc [-] Timeout in RPC method report_state. Waiting for 30 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.

Our focus needs to be this:
IPaddr2(ip-fd00.fd00.fd00.4000..10)[18628]: 2016/06/15_06:43:00 ERROR: failed.
Jun 15 06:43:00 [11450] overcloud-controller-0 lrmd: notice: operation_finished: ip-fd00.fd00.fd00.4000..10_start_0:18628:stderr [ ocf-exit-reason: failed. ]
Jun 15 06:43:00 [11450] overcloud-controller-0 lrmd: info: log_finished: finished - rsc:ip-fd00.fd00.fd00.4000..10 action:start call_id:6 pid:18628 exit-code:1 exec-time:589ms queue-time:0ms
Jun 15 06:43:00 [11453] overcloud-controller-0 crmd: info: action_synced_wait: Managed IPaddr2_meta-data_0 process 18716 exited with rc=0
Jun 15 06:43:00 [11453] overcloud-controller-0 crmd: notice: process_lrm_event: Operation ip-fd00.fd00.fd00.4000..10_start_0: unknown error (node=overcloud-controller-0, call=6, rc=1, cib-update=40, confirmed=true)
Jun 15 06:43:00 [11453] overcloud-controller-0 crmd: notice: process_lrm_event: overcloud-controller-0-ip-fd00.fd00.fd00.4000..10_start_0:6 [ ocf-exit-reason: failed.\n ]
Jun 15 06:43:00 [11448]...


Revision history for this message
Michele Baldessari (michele) wrote :

Ran this review https://review.openstack.org/#/c/330069/ to verify a potential bug.

My current theory is that we removed some constraints in a piecemeal way and might be triggering this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1327469
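
To check that theory, dumping the constraints on a failing and a passing run and diffing them should show whether an ordering or colocation edge went missing. A sketch:

    pcs constraint order show --full        > /tmp/constraints-order.txt
    pcs constraint colocation show --full   > /tmp/constraints-colocation.txt
    # Diff these files against the same dump from a run where the upgrade succeeded.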

Revision history for this message
Erno Kuvaja (jokke) wrote :

I'm wondering whether the high load and pacemaker trying to fence the overcloud controller have something to do with this?
From http://logs.openstack.org/18/311218/6/check-tripleo/gate-tripleo-ci-centos-7-upgrades/a985e87/logs/overcloud-controller-0/var/log/messages

Jun 15 13:15:55 localhost crmd[11469]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-fd00.fd00.fd00.4000..10 on overcloud-controller-0: unknown error (1)
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-2001.db8.fd00.1000..17 on overcloud-controller-0: unknown error (1)
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-fd00.fd00.fd00.2000..17 on overcloud-controller-0: unknown error (1)
Jun 15 13:15:55 localhost pengine[11468]: warning: Processing failed op start for ip-fd00.fd00.fd00.3000..10 on overcloud-controller-0: unknown error (1)
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-fd00.fd00.fd00.4000..10 away from overcloud-controller-0 after 1000000 failures (max=1000000)
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-2001.db8.fd00.1000..17 away from overcloud-controller-0 after 1000000 failures (max=1000000)
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-fd00.fd00.fd00.2000..17 away from overcloud-controller-0 after 1000000 failures (max=1000000)
Jun 15 13:15:55 localhost pengine[11468]: warning: Forcing ip-fd00.fd00.fd00.3000..10 away from overcloud-controller-0 after 1000000 failures (max=1000000)
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-nova-scheduler:0#011(overcloud-controller-0)
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-cinder-volume#011(overcloud-controller-0 - blocked)
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-nova-api:0#011(overcloud-controller-0)
Jun 15 13:15:55 localhost pengine[11468]: notice: Start openstack-nova-conductor:0#011(overcloud-controller-0)
Jun 15 13:15:55 localhost pengine[11468]: notice: Calculated Transition 87: /var/lib/pacemaker/pengine/pe-input-87.bz2
Jun 15 13:15:55 localhost crmd[11469]: warning: Transition 87 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-87.bz2): Terminated
Jun 15 13:15:55 localhost crmd[11469]: warning: Transition failed: terminated
Jun 15 13:15:55 localhost crmd[11469]: notice: Graph 87 with 12 actions: batch-limit=12 jobs, network-delay=0ms
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 104]: Pending rsc op openstack-nova-scheduler_monitor_60000 on overcloud-controller-0 (priority: 0, waiting: 103)
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 103]: Pending rsc op openstack-nova-scheduler_start_0 on overcloud-controller-0 (priority: 0, waiting: 105 195)
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 106]: Pending pseudo op openstack-nova-scheduler-clone_running_0 on N/A (priority: 1000000, waiting: 103 105)
Jun 15 13:15:55 localhost crmd[11469]: notice: [Action 105]: Pending pseu...
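
One way to dig further into an aborted transition like number 87 above is to replay the pe-input file the log names (a sketch; crm_simulate on a normal pacemaker install can read the compressed pe-input files):

    # -S simulates the transition, -s prints allocation scores, -x selects the input.
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-87.bz2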


Revision history for this message
Erno Kuvaja (jokke) wrote :

Also, over roughly an hour and 20 minutes, systemd starts over 500 sessions for the rabbitmq user, which looks like quite a lot to me:
-bash-4.2$ egrep -e "systemd.*Starting.*rabbitmq" messages | wc -l
536
-bash-4.2$ egrep -e "systemd.*Starting.*rabbitmq" messages | head -n 1
Jun 15 12:07:38 localhost systemd: Starting Session c1 of user rabbitmq.
-bash-4.2$ egrep -e "systemd.*Starting.*rabbitmq" messages | tail -n 1
Jun 15 13:28:45 localhost systemd: Starting Session c541 of user rabbitmq.
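
Bucketing those session starts by minute shows whether they arrive as a steady trickle or a burst (the syslog timestamp is the first 12 characters of each line):

    egrep -e "systemd.*Starting.*rabbitmq" messages | cut -c1-12 | uniq -c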

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Just to comment on the IPaddr2 IPv6 issue: it is not the real problem here, as this error is present even in successful ha upgrades. For instance we got this:

    unpack_rsc_op_failure: Processing failed op start for ip-fd00.fd00.fd00.4000..10 on overcloud-controller-0: unknown error

In http://logs.openstack.org/29/315129/16/check-tripleo/gate-tripleo-ci-centos-7-upgrades/d7765e4/logs/overcloud-controller-0/var/log/cluster/corosync.txt.gz which is associated with this review https://review.openstack.org/#/c/315129/ where the ha upgrade was successful.

Compared to a successful run, in the corosync log we have this:

    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: warning: run_graph: Transition 72 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-72.bz2): Terminated
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: warning: te_graph_trigger: Transition failed: terminated
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_graph: Graph 72 with 12 actions: batch-limit=12 jobs, network-delay=0ms
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 104]: Pending rsc op openstack-nova-scheduler_monitor_60000 on overcloud-controller-0 (priority: 0, waiting: 103)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 103]: Pending rsc op openstack-nova-scheduler_start_0 on overcloud-controller-0 (priority: 0, waiting: 105 195)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 106]: Pending pseudo op openstack-nova-scheduler-clone_running_0 on N/A (priority: 1000000, waiting: 103 105)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 105]: Pending pseudo op openstack-nova-scheduler-clone_start_0 on N/A (priority: 0, waiting: 198)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 196]: Pending rsc op openstack-nova-api_monitor_60000 on overcloud-controller-0 (priority: 0, waiting: 195)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 195]: Pending rsc op openstack-nova-api_start_0 on overcloud-controller-0 (priority: 0, waiting: 197)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: * [Input 243]: Unresolved dependency rsc op openstack-nova-novncproxy_start_0 on overcloud-controller-0
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 198]: Pending pseudo op openstack-nova-api-clone_running_0 on N/A (priority: 1000000, waiting: 195 197)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: [Action 197]: Pending pseudo op openstack-nova-api-clone_start_0 on N/A (priority: 0, waiting: none)
    Jun 15 07:49:33 [11453] overcloud-controller-0 crmd: notice: print_synapse: * [Input 246]: Unresolved dependency pseudo op openstack-nova-novncproxy-clone_running_0
    Jun 15 07:49:3...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/330682

Changed in tripleo:
assignee: nobody → Jiří Stránský (jistr)
status: Confirmed → In Progress
Changed in tripleo:
assignee: Jiří Stránský (jistr) → Sofer Athlan-Guyot (sofer-athlan-guyot)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/330661
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=7b22f2d8305931b8f556f8a6a2d049a9b9ac90a9
Submitter: Jenkins
Branch: master

commit 7b22f2d8305931b8f556f8a6a2d049a9b9ac90a9
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Jun 16 17:58:55 2016 +0200

    Colocation make a group for pcmk nova resources.

    This ensures that the nova-* services form a pacemaker group, so that
    pacemaker doesn't try to restart services elsewhere.

    Closes-bug: 1592776

    Change-Id: I629db624f4179663696ffe8c80a567bf6d6b2e16
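
For illustration only (the real change is in the review linked above and is applied through the templates rather than by hand): expressed directly with pcs, keeping the nova services co-located as one set would look roughly like this, using the clone names seen in the logs:

    pcs constraint colocation set \
        openstack-nova-novncproxy-clone \
        openstack-nova-api-clone \
        openstack-nova-scheduler-clone \
        openstack-nova-conductor-clone
    # (plus any other nova services in the start-order chain)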

Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Jiri Stransky (<email address hidden>) on branch: master
Review: https://review.openstack.org/330682
Reason: Worked, but we found a better solution :)

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0b2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Athlan-Guyot sofer (<email address hidden>) on branch: master
Review: https://review.openstack.org/332069
Reason: Not relevant anymore.
