After restarting keepalived, most of the newly launched instances get stuck in build

Bug #1467782 reported by Vinod Nair
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Juniper Openstack | Status tracked in Trunk | | |
R2.20 | Fix Committed | Critical | Jeya ganesh babu J |
Trunk | Fix Committed | Critical | Jeya ganesh babu J |

Bug Description

On an HA cluster, after restarting keepalived, most of the instances get stuck in Build.

Launching instances becomes very slow. It gets slightly better after 20-30 minutes.

Logs: http://cmbu-sv02.englab.juniper.net/pxe/Standard/vin/ha/b1/

Version: 2.20 build 59 (Juno) with patches

Sometimes the trace below is also seen in nova-conductor:

root@cs-scale-3:/var/log/nova# tailf /var/log/nova/nova-conductor.log

2015-06-22 21:41:34.933 769 ERROR nova.scheduler.utils [req-a6ed435f-b08c-4cdf-9c6f-b39d88e0ac6a None] [instance: 89a31e4e-e52b-4cf8-8a07-ea99b74c61d3] Error from last host: cs-scale-6 (node cs-scale-6): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2051, in _do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2182, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance 89a31e4e-e52b-4cf8-8a07-ea99b74c61d3 was re-scheduled: Timed out waiting for a reply to message ID afe0230e020e4c2e9d5b6e62d0531407\n']
2015-06-22 21:41:34.936 769 INFO oslo.messaging._drivers.impl_rabbit [req-a6ed435f-b08c-4cdf-9c6f-b39d88e0ac6a ] Connecting to AMQP server on 13.1.0.10:5673

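To get a rough idea of how many builds were re-scheduled because of RPC timeouts, and on which compute hosts, the nova-conductor log can be scanned for lines shaped like the one above. The following is only a minimal sketch: the regular expression is derived from the single log line quoted in this report, and the function names `find_rpc_timeouts` and `timeouts_per_host` are hypothetical helpers, not part of nova.

```python
import re
from collections import Counter

# Pattern derived from the nova.scheduler.utils ERROR line quoted in this
# report; other nova versions may format the line differently (assumption).
RESCHEDULE_RE = re.compile(
    r"\[instance: (?P<instance>[0-9a-f-]{36})\] "
    r"Error from last host: (?P<host>\S+) .*"
    r"Timed out waiting for a reply to message ID (?P<msg_id>[0-9a-f]+)"
)


def find_rpc_timeouts(lines):
    """Yield (instance, host, msg_id) for each re-schedule caused by an RPC timeout."""
    for line in lines:
        match = RESCHEDULE_RE.search(line)
        if match:
            yield match.group("instance"), match.group("host"), match.group("msg_id")


def timeouts_per_host(lines):
    """Count RPC-timeout re-schedules per compute host."""
    return Counter(host for _, host, _ in find_rpc_timeouts(lines))
```

Passing an open file object for `/var/log/nova/nova-conductor.log` to `timeouts_per_host` would return a per-host count, which may help tell whether the timeouts are spread across all computes or concentrated on a few.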

root@cs-scale-2:/var/log/nova# nova show eeca56e7-4a13-4787-9820-75f3ea07f24c
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | cs-scale-5 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | cs-scale-5 |
| OS-EXT-SRV-ATTR:instance_name | instance-00001e1f |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | block_device_mapping |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2015-06-23T06:53:19Z |
| fault | {"message": "Build of instance eeca56e7-4a13-4787-9820-75f3ea07f24c aborted: Failure prepping block device.", "code": 500, "details": " File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2051, in _do_build_and_run_instance |
| | filter_properties) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2150, in _build_and_run_instance |
| | 'create.error', fault=e) |
| | File \"/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py\", line 82, in __exit__ |
| | six.reraise(self.type_, self.value, self.tb) |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2123, in _build_and_run_instance |
| | block_device_mapping) as resources: |
| | File \"/usr/lib/python2.7/contextlib.py\", line 17, in __enter__ |
| | return self.gen.next() |
| | File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2261, in _build_resources |
| | reason=msg) |
| | ", "created": "2015-06-23T07:12:01Z"} |
| flavor | V1 (10) |
| hostId | b0f49b19c9d1a90b3207e2e3a88b14c36062d4a61e6e92801a3c1262 |
| id | eeca56e7-4a13-4787-9820-75f3ea07f24c |
| image | A1-SNAP1 (3da3258d-c630-4396-aca9-017f7f62d7fb) |
| key_name | - |
| metadata | {} |
| name | VIN2--eeca56e7-4a13-4787-9820-75f3ea07f24c |
| os-extended-volumes:volumes_attached | [] |
| status | ERROR |
| tenant_id | 1657ff89c9c54aa1b42d100a4a663a1f |
| updated | 2015-06-23T07:12:01Z |
| user_id | a9b90ce5977d44ec8b65ada3d31c4747 |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root@cs-scale-2:/var/log/nova#

Tags: ha storage
Jeya ganesh babu J (jjeya) wrote :

cinder-api shows AMQP timeouts; the messages are logged in nova-api, cinder-api and cinder-volume:

2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 82, in __init__
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 216, in revive
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit self.declare()
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 102, in declare
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit self.exchange.declare()
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 166, in declare
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit nowait=nowait, passive=passive,
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 613, in exchange_declare
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit self._send_method((40, 10), args)
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 56, in _send_method
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit self.channel_id, method_sig, args, content,
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/method_framing.py", line 221, in write_method
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit write_frame(1, channel, payload)
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/transport.py", line 177, in write_frame
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit frame_type, channel, size, payload, 0xce,
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/eventlet/greenio.py", line 307, in sendall
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit tail = self.send(data, flags)
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/eventlet/greenio.py", line 293, in send
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit total_sent += fd.send(data[total_sent:], flags)
2015-06-22 21:30:13.432 38170 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 104] Connection reset by peer

2015-06-22 21:31:24.102 26908 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/cinder/api/contrib/volume_actions.py", line 197, in _initialize_connection
2015-06-22 21:31:24.102 26908 TRA...


information type: Proprietary → Public
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/12001
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12001
Committed: http://github.org/Juniper/contrail-provisioning/commit/03b02d7e26656aa59de1ff5e720b11b3d538ca79
Submitter: Zuul
Branch: R2.20

commit 03b02d7e26656aa59de1ff5e720b11b3d538ca79
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 23 23:17:05 2015 -0700

Storage HA provision fix

Partial-Bug: #1467782
The sql_connection is mapped to local ip instead it should be
mapped to the vip. This is causing boot from volume to fail
in case of failover and when keepalived is restarted.

Change-Id: I1f1d1be00058663e0c840148c7565b1bfd726ca7

Jeya ganesh babu J (jjeya) wrote :

Partial fix committed. The cinder issue is tracked as part of bug #1468798
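
The condition described in the commit message (sql_connection pointing at a local IP rather than the VIP) can be checked on a node with something like the sketch below. It is an illustration under assumptions, not the provisioning fix itself: the helper names `sql_connection_host` and `points_at_vip` are hypothetical, the addresses in the usage note are documentation-range placeholders, and the config is assumed to be INI-style with the option under a section header such as `[DEFAULT]`.

```python
import configparser
from urllib.parse import urlparse


def sql_connection_host(conf_text, option="sql_connection"):
    """Return the host part of the first `option` value found, or None."""
    # strict=False tolerates repeated options; interpolation=None keeps
    # '%' characters in passwords from being interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    for section in ["DEFAULT", *parser.sections()]:
        value = parser[section].get(option)
        if value:
            return urlparse(value).hostname
    return None


def points_at_vip(conf_text, vip):
    """True when the configured database host equals the given VIP."""
    return sql_connection_host(conf_text) == vip
```

For example, `points_at_vip(open(path).read(), "192.0.2.10")` would return False on a node whose config still carries a local address such as `mysql://cinder:pw@192.0.2.21/cinder`; both `path` and the addresses here are placeholders to be replaced with the real config file and the cluster's VIP.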

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12082
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12082
Committed: http://github.org/Juniper/contrail-provisioning/commit/425262f92588916f8bac9b78bfac58d086866f98
Submitter: Zuul
Branch: master

commit 425262f92588916f8bac9b78bfac58d086866f98
Author: Jeya ganesh babu J <email address hidden>
Date: Mon Jun 29 11:37:46 2015 -0700

Storage HA provision fix

Partial-Bug: #1467782
The sql_connection is mapped to local ip instead it should be
mapped to the vip. This is causing boot from volume to fail
in case of failover and when keepalived is restarted.

Change-Id: I3a15335c5541b836b5e804894aef458b367fdef4
