Activity log for bug #1463433

Date Who What changed Old value New value Message
2015-06-09 14:28:06 Leontii Istomin bug added bug
2015-06-09 14:34:09 Bogdan Dobrelya description
    Old value:
        During Shaker test (http://pyshaker.readthedocs.org/en/latest/examples.html) we have found that ksoftirqd keep a lot of CPU: http://paste.openstack.org/show/277774/
        atop SHIFT+P ksoftirqd: http://paste.openstack.org/show/277891/
        At the time rabbitmq on this controller node (node-49) was down:
        from node-1 =INFO REPORT==== 8-Jun-2015::23:48:07 === rabbit on node 'rabbit@node-49' down
        from node-44 =INFO REPORT==== 8-Jun-2015::23:46:55 === rabbit on node 'rabbit@node-49' down
        Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-09_09-55-58.tar.xz
    New value: same as the old value, with one line appended:
        ISO: #521 + net_ticktime patch https://review.openstack.org/189292
2015-06-09 14:35:31 Bogdan Dobrelya fuel: status New → Confirmed
2015-06-09 14:35:45 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers
    New value: rabbitmq was down on one of controllers during shaker test
2015-06-09 14:35:48 Bogdan Dobrelya fuel: milestone 7.0
2015-06-09 14:35:50 Bogdan Dobrelya fuel: importance Undecided → High
2015-06-09 14:35:56 Bogdan Dobrelya fuel: assignee MOS Oslo (mos-oslo)
2015-06-09 14:47:29 Leontii Istomin description
    Old value: the description as of 2015-06-09 14:34 (ending with the line "ISO: #521 + net_ticktime patch https://review.openstack.org/189292").
    New value: same text, with the ISO line dropped and the following block inserted before the Diagnostic Snapshot link:
        Configuration: Baremetal,Centos,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas, 6.1-521
        Controllers:3 Computes:47
        api: '1.0'
        astute_sha: 7766818f079881e2dbeedb34e1f67e517ed7d479
        auth_required: true
        build_id: 2015-06-08_06-13-27
        build_number: '521'
        feature_groups:
        - mirantis
        fuel-library_sha: f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d
        fuel-ostf_sha: 7c938648a246e0311d05e2372ff43ef1eb2e2761
        fuelmain_sha: bcc909ffc5dd5156ba54cae348b6a07c1b607b24
        nailgun_sha: 4340d55c19029394cd5610b0e0f56d6cb8cb661b
        openstack_version: 2014.2.2-6.1
        production: docker
        python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
        release: '6.1'
2015-06-09 14:52:34 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers during shaker test
    New value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging
2015-06-09 14:53:08 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging
    New value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning
2015-06-09 15:07:50 Leontii Istomin description
    Old value: the description as of 2015-06-09 14:47.
    New value: same text, with fedora_kernel added to the Configuration line:
        Configuration: Baremetal,Centos,fedora_kernel,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas, 6.1-521
2015-06-09 15:36:18 Leontii Istomin description
    Old value: the description as of 2015-06-09 15:07.
    New value: same text, with one line inserted after "Controllers:3 Computes:47":
        net_ticktime parameter has been added: http://paste.openstack.org/show/278020/
2015-06-09 18:03:05 Bogdan Dobrelya fuel: importance High → Critical
2015-06-09 18:03:05 Bogdan Dobrelya fuel: milestone 7.0 → 6.1
2015-06-10 07:39:41 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked
2015-06-10 07:40:17 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers
2015-06-10 10:06:39 Bogdan Dobrelya fuel: assignee MOS Oslo (mos-oslo) → Bogdan Dobrelya (bogdando)
2015-06-10 10:06:44 Bogdan Dobrelya fuel: status Confirmed → Triaged
2015-06-10 10:12:31 Bogdan Dobrelya nominated for series fuel/7.0.x
2015-06-10 10:12:31 Bogdan Dobrelya bug task added fuel/7.0.x
2015-06-10 10:12:43 Bogdan Dobrelya nominated for series fuel/6.1.x
2015-06-10 10:12:43 Bogdan Dobrelya bug task added fuel/6.1.x
2015-06-10 10:12:53 Bogdan Dobrelya fuel/7.0.x: status New → In Progress
2015-06-10 10:12:56 Bogdan Dobrelya fuel/7.0.x: importance Undecided → Critical
2015-06-10 10:13:00 Bogdan Dobrelya fuel/7.0.x: assignee Bogdan Dobrelya (bogdando)
2015-06-10 10:13:03 Bogdan Dobrelya fuel/7.0.x: milestone 7.0
2015-06-10 10:18:34 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node
2015-06-10 11:51:49 OpenStack Infra fuel: status Triaged → In Progress
2015-06-11 11:27:08 Bogdan Dobrelya bug task added mos
2015-06-11 11:27:17 Bogdan Dobrelya mos: milestone 6.1
2015-06-11 11:27:23 Bogdan Dobrelya mos: milestone 6.1 → 7.0
2015-06-11 11:27:31 Bogdan Dobrelya mos: assignee MOS Oslo (mos-oslo)
2015-06-11 11:27:37 Bogdan Dobrelya mos: importance Undecided → High
2015-06-11 12:30:38 Eugene Bogdanov tags scale → non-release scale
2015-06-11 12:30:47 Eugene Bogdanov tags non-release scale → scale
2015-06-11 12:32:51 Eugene Bogdanov tags scale → non-release scale
2015-06-11 12:32:59 Eugene Bogdanov tags non-release scale → scale
2015-06-11 12:34:48 Eugene Bogdanov tags scale → 6.1-rc2 scale
2015-06-11 12:35:32 Eugene Bogdanov tags 6.1-rc2 scale → 6.1rc2 scale
2015-06-12 10:28:21 Bogdan Dobrelya mos: importance High → Critical
2015-06-12 10:28:27 Bogdan Dobrelya mos: milestone 7.0 → 6.1
2015-06-15 13:41:50 Dmitry Mescheryakov mos: status New → Confirmed
2015-06-15 15:29:26 Leontii Istomin description
    Old value: the description as of 2015-06-09 15:36.
    New value: same text, with the Shaker link changed from http://pyshaker.readthedocs.org/en/latest/examples.html to http://pyshaker.readthedocs.org/en/latest/index.html.
2015-06-15 17:02:59 Alexander Nevenchannyy bug added subscriber Alexander Nevenchannyy
2015-06-15 17:44:07 Eugene Bogdanov tags 6.1rc2 scale → 6.1 scale
2015-06-15 17:44:14 Eugene Bogdanov tags 6.1 scale → scale
2015-06-16 12:16:01 Bogdan Dobrelya fuel/6.1.x: importance Critical → High
2015-06-16 12:16:04 Bogdan Dobrelya fuel/7.0.x: importance Critical → High
2015-06-16 12:16:18 Bogdan Dobrelya fuel/6.1.x: milestone 6.1 → 6.1-updates
2015-06-16 12:17:14 Bogdan Dobrelya mos: importance Critical → High
2015-06-16 12:17:17 Bogdan Dobrelya mos: milestone 6.1 → 7.0
2015-06-16 12:25:17 Bogdan Dobrelya attachment added logs.tgz https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4415650/+files/logs.tgz
2015-06-16 18:45:33 Dmitry Mescheryakov description
    Old value: the description as of 2015-06-15 15:29.
    New value:
        Steps to reproduce:
        1. Run Shaker tests. (For explanation what it is, see User Impact section below)
        First the Rabbit MQ experienced network partition. After it recovered from partition, on one of the nodes it started to consume RAM. And after it consumed tens of GB of RAM, it stopped accepting more messages from OpenStack services. As a result, OpenStack failed all the incoming requests. The issue did not end after tests were finished.
        Conditions for reproduction: The issue is reproduced only once so far. Another run finished successfully, without Rabbit MQ consuming a lot of memory or being stuck for a long time (see comment #22 below).
        User impact: Shaker creates a number of VMs and then tests network throughput between them. The traffic created is rather close to the limit of network throughput of the lab (a little less then 10G). When traffic is cross-network (i.e. it flows between VMs located in different Neutron networks), it always goes through controllers. Sometimes that hits Rabbit MQ and it switches to inoperable state. From user's point of view, the cloud is not working until the issue is healed.
        Workaround: Restart all Rabbit MQ nodes. After Rabbit is operable (in several minutes after restart), the cloud should start working properly.
        Current plan: Reproduce the issue once more and more throughly investigate it.
        Original description by Leontiy Istomin
        =======================================
        (the previous description follows unchanged)
2015-06-16 20:53:48 Dan Hata description
    Old value: the description as of 2015-06-16 18:45.
    New value: same text, with two changes in the User impact section: "Shaker creates a number of VMs" changed to "Shaker creates 8 VMs", and the sentence "The rate of the messages hitting Rabbit MQ are in the 100's per second." added after "...it switches to inoperable state."
2015-06-16 20:56:52 Dan Hata description
    Old value: the description as of 2015-06-16 20:53.
    New value: same text, with "Shaker creates 8 VMs" changed to "Shaker creates VMs in batches of 8 at a time".
2015-06-17 09:32:23 Bogdan Dobrelya attachment added app-side.tgz https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4416217/+files/app-side.tgz
2015-06-17 09:34:37 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging after AMQP reconnect
2015-06-17 10:02:04 Bogdan Dobrelya bug task deleted mos
2015-06-17 10:07:43 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging after AMQP reconnect
    New value: [shaker] test failing when rabbitmq node rasies memory alert
2015-06-18 09:40:36 Bogdan Dobrelya nominated for series fuel/5.1.x
2015-06-18 09:40:36 Bogdan Dobrelya bug task added fuel/5.1.x
2015-06-18 09:40:36 Bogdan Dobrelya nominated for series fuel/6.0.x
2015-06-18 09:40:36 Bogdan Dobrelya bug task added fuel/6.0.x
2015-06-18 09:40:44 Bogdan Dobrelya fuel/6.0.x: importance Undecided → High
2015-06-18 09:40:47 Bogdan Dobrelya fuel/6.0.x: status New → Triaged
2015-06-18 09:40:50 Bogdan Dobrelya fuel/5.1.x: status New → Triaged
2015-06-18 09:40:54 Bogdan Dobrelya fuel/6.0.x: milestone 6.0.2
2015-06-18 09:40:59 Bogdan Dobrelya fuel/5.1.x: milestone 5.1.2
2015-06-18 09:41:08 Bogdan Dobrelya fuel/5.1.x: importance Undecided → High
2015-06-19 09:25:17 deactivateduser fuel/5.1.x: assignee MOS Sustaining (mos-sustaining)
2015-06-19 09:25:25 deactivateduser fuel/6.0.x: assignee MOS Sustaining (mos-sustaining)
2015-06-25 12:34:30 Vitaly Sedelnik tags scale → 6.1-mu-1 scale
2015-06-29 17:13:51 Eugene Bogdanov tags 6.1-mu-1 scale → 6.1 scale
2015-06-29 20:53:00 Leontii Istomin attachment added rabbit_stat.tar.gz https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4422092/+files/rabbit_stat.tar.gz
2015-07-01 11:35:31 Leontii Istomin attachment added rabbit_stat_16%3A08.log https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4422932/+files/rabbit_stat_16%253A08.log
2015-07-09 07:25:20 OpenStack Infra fuel: status In Progress → Fix Committed
2015-07-09 07:25:32 Bogdan Dobrelya fuel/6.1.x: status In Progress → Triaged
2015-07-09 07:25:38 Bogdan Dobrelya fuel/6.1.x: assignee Bogdan Dobrelya (bogdando) → MOS Sustaining (mos-sustaining)
2015-09-11 13:02:54 Michal Rostecki fuel/6.1.x: assignee MOS Maintenance (mos-maintenance) → Michal Rostecki (mrostecki)
2015-09-11 13:46:49 Michal Rostecki fuel/6.1.x: status Triaged → In Progress
2015-09-13 16:17:37 Vitaly Sedelnik fuel/6.1.x: milestone 6.1-updates → 6.1-mu-3
2015-09-26 11:07:05 Vitaly Sedelnik fuel/6.0.x: milestone 6.0.2 → 6.0.1
2015-09-28 18:34:05 Leontii Istomin fuel/7.0.x: status Fix Committed → Fix Released
2015-10-06 08:21:10 Vitaly Sedelnik fuel/6.1.x: status In Progress → Fix Committed
2015-10-26 12:54:03 Vitaly Sedelnik fuel/5.1.x: assignee MOS Maintenance (mos-maintenance) → Denis Meltsaykin (dmeltsaykin)
2015-10-26 12:54:10 Vitaly Sedelnik fuel/6.0.x: assignee MOS Maintenance (mos-maintenance) → Denis Meltsaykin (dmeltsaykin)
2015-10-26 12:54:15 Vitaly Sedelnik fuel/5.1.x: milestone 5.1.1-updates → 5.1.1-mu-2
2015-10-26 12:54:22 Vitaly Sedelnik fuel/6.0.x: milestone 6.0-updates → 6.0-mu-7
2015-10-26 13:45:22 Denis Meltsaykin fuel/5.1.x: status Triaged → Won't Fix
2015-10-26 13:45:25 Denis Meltsaykin fuel/6.0.x: status Triaged → Won't Fix
2015-10-26 15:06:00 Vitaly Sedelnik fuel/5.1.x: milestone 5.1.1-mu-2 → 5.1.1-updates
2015-10-26 15:06:03 Vitaly Sedelnik fuel/6.0.x: milestone 6.0-mu-7 → 6.0-updates
2015-10-28 15:18:19 Vitaly Sedelnik fuel/6.1.x: status Fix Committed → Fix Released