Activity log for bug #1463433

Date Who What changed Old value New value Message
2015-06-09 14:28:06 Leontii Istomin bug added bug
2015-06-09 14:34:09 Bogdan Dobrelya description
    Old value:
        During Shaker test (http://pyshaker.readthedocs.org/en/latest/examples.html) we have found that ksoftirqd keep a lot of CPU: http://paste.openstack.org/show/277774/
        atop SHIFT+P ksoftirqd: http://paste.openstack.org/show/277891/
        At the time rabbitmq on this controller node (node-49) was down:
        from node-1 =INFO REPORT==== 8-Jun-2015::23:48:07 === rabbit on node 'rabbit@node-49' down
        from node-44 =INFO REPORT==== 8-Jun-2015::23:46:55 === rabbit on node 'rabbit@node-49' down
        Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-09_09-55-58.tar.xz
    New value: same as the old value, with one line appended:
        ISO: #521 + net_ticktime patch https://review.openstack.org/189292
2015-06-09 14:35:31 Bogdan Dobrelya fuel: status New → Confirmed
2015-06-09 14:35:45 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers
    New value: rabbitmq was down on one of controllers during shaker test
2015-06-09 14:35:48 Bogdan Dobrelya fuel: milestone 7.0
2015-06-09 14:35:50 Bogdan Dobrelya fuel: importance Undecided → High
2015-06-09 14:35:56 Bogdan Dobrelya fuel: assignee MOS Oslo (mos-oslo)
2015-06-09 14:47:29 Leontii Istomin description
    Old value: the description as of 2015-06-09 14:34 (ending with the line "ISO: #521 + net_ticktime patch https://review.openstack.org/189292").
    New value: same text, with the ISO line dropped and the following block inserted before the Diagnostic Snapshot link:
        Configuration: Baremetal,Centos,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas, 6.1-521
        Controllers:3 Computes:47
        api: '1.0'
        astute_sha: 7766818f079881e2dbeedb34e1f67e517ed7d479
        auth_required: true
        build_id: 2015-06-08_06-13-27
        build_number: '521'
        feature_groups:
        - mirantis
        fuel-library_sha: f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d
        fuel-ostf_sha: 7c938648a246e0311d05e2372ff43ef1eb2e2761
        fuelmain_sha: bcc909ffc5dd5156ba54cae348b6a07c1b607b24
        nailgun_sha: 4340d55c19029394cd5610b0e0f56d6cb8cb661b
        openstack_version: 2014.2.2-6.1
        production: docker
        python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
        release: '6.1'
2015-06-09 14:52:34 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers during shaker test
    New value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging
2015-06-09 14:53:08 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging
    New value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning
2015-06-09 15:07:50 Leontii Istomin description
    Old value: the description as of 2015-06-09 14:47.
    New value: same text, with fedora_kernel added to the Configuration line:
        Configuration: Baremetal,Centos,fedora_kernel,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas, 6.1-521
2015-06-09 15:36:18 Leontii Istomin description
    Old value: the description as of 2015-06-09 15:07.
    New value: same text, with one line inserted after "Controllers:3 Computes:47":
        net_ticktime parameter has been added: http://paste.openstack.org/show/278020/
2015-06-09 18:03:05 Bogdan Dobrelya fuel: importance High → Critical
2015-06-09 18:03:05 Bogdan Dobrelya fuel: milestone 7.0 → 6.1
2015-06-10 07:39:41 Bogdan Dobrelya summary
    Old value: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked
2015-06-10 07:40:17 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers
2015-06-10 10:06:39 Bogdan Dobrelya fuel: assignee MOS Oslo (mos-oslo) → Bogdan Dobrelya (bogdando)
2015-06-10 10:06:44 Bogdan Dobrelya fuel: status Confirmed → Triaged
2015-06-10 10:12:31 Bogdan Dobrelya nominated for series fuel/7.0.x
2015-06-10 10:12:31 Bogdan Dobrelya bug task added fuel/7.0.x
2015-06-10 10:12:43 Bogdan Dobrelya nominated for series fuel/6.1.x
2015-06-10 10:12:43 Bogdan Dobrelya bug task added fuel/6.1.x
2015-06-10 10:12:53 Bogdan Dobrelya fuel/7.0.x: status New → In Progress
2015-06-10 10:12:56 Bogdan Dobrelya fuel/7.0.x: importance Undecided → Critical
2015-06-10 10:13:00 Bogdan Dobrelya fuel/7.0.x: assignee Bogdan Dobrelya (bogdando)
2015-06-10 10:13:03 Bogdan Dobrelya fuel/7.0.x: milestone 7.0
2015-06-10 10:18:34 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node
2015-06-10 11:51:49 OpenStack Infra fuel: status Triaged → In Progress
2015-06-11 11:27:08 Bogdan Dobrelya bug task added mos
2015-06-11 11:27:17 Bogdan Dobrelya mos: milestone 6.1
2015-06-11 11:27:23 Bogdan Dobrelya mos: milestone 6.1 → 7.0
2015-06-11 11:27:31 Bogdan Dobrelya mos: assignee MOS Oslo (mos-oslo)
2015-06-11 11:27:37 Bogdan Dobrelya mos: importance Undecided → High
2015-06-11 12:30:38 Eugene Bogdanov tags scale → non-release scale
2015-06-11 12:30:47 Eugene Bogdanov tags non-release scale → scale
2015-06-11 12:32:51 Eugene Bogdanov tags scale → non-release scale
2015-06-11 12:32:59 Eugene Bogdanov tags non-release scale → scale
2015-06-11 12:34:48 Eugene Bogdanov tags scale → 6.1-rc2 scale
2015-06-11 12:35:32 Eugene Bogdanov tags 6.1-rc2 scale → 6.1rc2 scale
2015-06-12 10:28:21 Bogdan Dobrelya mos: importance High → Critical
2015-06-12 10:28:27 Bogdan Dobrelya mos: milestone 7.0 → 6.1
2015-06-15 13:41:50 Dmitry Mescheryakov mos: status New → Confirmed
2015-06-15 15:29:26 Leontii Istomin description
    Old value: the description as of 2015-06-09 15:36.
    New value: same text, with the Shaker link changed from http://pyshaker.readthedocs.org/en/latest/examples.html to http://pyshaker.readthedocs.org/en/latest/index.html.
2015-06-15 17:02:59 Alexander Nevenchannyy bug added subscriber Alexander Nevenchannyy
2015-06-15 17:44:07 Eugene Bogdanov tags 6.1rc2 scale → 6.1 scale
2015-06-15 17:44:14 Eugene Bogdanov tags 6.1 scale → scale
2015-06-16 12:16:01 Bogdan Dobrelya fuel/6.1.x: importance Critical → High
2015-06-16 12:16:04 Bogdan Dobrelya fuel/7.0.x: importance Critical → High
2015-06-16 12:16:18 Bogdan Dobrelya fuel/6.1.x: milestone 6.1 → 6.1-updates
2015-06-16 12:17:14 Bogdan Dobrelya mos: importance Critical → High
2015-06-16 12:17:17 Bogdan Dobrelya mos: milestone 6.1 → 7.0
2015-06-16 12:25:17 Bogdan Dobrelya attachment added logs.tgz https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4415650/+files/logs.tgz
2015-06-16 18:45:33 Dmitry Mescheryakov description
    Old value: the description as of 2015-06-15 15:29.
    New value:
        Steps to reproduce:
        1. Run Shaker tests. (For explanation what it is, see User Impact section below)
        First the Rabbit MQ experienced network partition. After it recovered from partition, on one of the nodes it started to consume RAM. And after it consumed tens of GB of RAM, it stopped accepting more messages from OpenStack services. As a result, OpenStack failed all the incoming requests. The issue did not end after tests were finished.
        Conditions for reproduction: The issue is reproduced only once so far. Another run finished successfully, without Rabbit MQ consuming a lot of memory or being stuck for a long time (see comment #22 below).
        User impact: Shaker creates a number of VMs and then tests network throughput between them. The traffic created is rather close to the limit of network throughput of the lab (a little less then 10G). When traffic is cross-network (i.e. it flows between VMs located in different Neutron networks), it always goes through controllers. Sometimes that hits Rabbit MQ and it switches to inoperable state. From user's point of view, the cloud is not working until the issue is healed.
        Workaround: Restart all Rabbit MQ nodes. After Rabbit is operable (in several minutes after restart), the cloud should start working properly.
        Current plan: Reproduce the issue once more and more throughly investigate it.
        Original description by Leontiy Istomin
        =======================================
        (the previous description follows unchanged)
2015-06-16 20:53:48 Dan Hata description
    Old value: the description as of 2015-06-16 18:45.
    New value: same text, with two changes in the User impact section: "Shaker creates a number of VMs" changed to "Shaker creates 8 VMs", and the sentence "The rate of the messages hitting Rabbit MQ are in the 100's per second." added after "...it switches to inoperable state."
2015-06-16 20:56:52 Dan Hata description
    Old value: the description as of 2015-06-16 20:53.
    New value: same text, with "Shaker creates 8 VMs" changed to "Shaker creates VMs in batches of 8 at a time".
2015-06-17 09:32:23 Bogdan Dobrelya attachment added app-side.tgz https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4416217/+files/app-side.tgz
2015-06-17 09:34:37 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node
    New value: [shaker] test failing due to multiple "Timed out waiting for reply to ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging after AMQP reconnect
2015-06-17 10:02:04 Bogdan Dobrelya bug task deleted mos
2015-06-17 10:07:43 Bogdan Dobrelya summary
    Old value: [shaker] test failing due to multiple "Timed out waiting for reply to ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging after AMQP reconnect
    New value: [shaker] test failing when rabbitmq node rasies memory alert
2015-06-18 09:40:36 Bogdan Dobrelya nominated for series fuel/5.1.x
2015-06-18 09:40:36 Bogdan Dobrelya bug task added fuel/5.1.x
2015-06-18 09:40:36 Bogdan Dobrelya nominated for series fuel/6.0.x
2015-06-18 09:40:36 Bogdan Dobrelya bug task added fuel/6.0.x
2015-06-18 09:40:44 Bogdan Dobrelya fuel/6.0.x: importance Undecided → High
2015-06-18 09:40:47 Bogdan Dobrelya fuel/6.0.x: status New → Triaged
2015-06-18 09:40:50 Bogdan Dobrelya fuel/5.1.x: status New → Triaged
2015-06-18 09:40:54 Bogdan Dobrelya fuel/6.0.x: milestone 6.0.2
2015-06-18 09:40:59 Bogdan Dobrelya fuel/5.1.x: milestone 5.1.2
2015-06-18 09:41:08 Bogdan Dobrelya fuel/5.1.x: importance Undecided → High
2015-06-19 09:25:17 deactivateduser fuel/5.1.x: assignee MOS Sustaining (mos-sustaining)
2015-06-19 09:25:25 deactivateduser fuel/6.0.x: assignee MOS Sustaining (mos-sustaining)
2015-06-25 12:34:30 Vitaly Sedelnik tags scale → 6.1-mu-1 scale
2015-06-29 17:13:51 Eugene Bogdanov tags 6.1-mu-1 scale → 6.1 scale
2015-06-29 20:53:00 Leontii Istomin attachment added rabbit_stat.tar.gz https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4422092/+files/rabbit_stat.tar.gz
2015-07-01 11:35:31 Leontii Istomin attachment added rabbit_stat_16%3A08.log https://bugs.launchpad.net/fuel/+bug/1463433/+attachment/4422932/+files/rabbit_stat_16%253A08.log
2015-07-09 07:25:20 OpenStack Infra fuel: status In Progress → Fix Committed
2015-07-09 07:25:32 Bogdan Dobrelya fuel/6.1.x: status In Progress → Triaged
2015-07-09 07:25:38 Bogdan Dobrelya fuel/6.1.x: assignee Bogdan Dobrelya (bogdando) → MOS Sustaining (mos-sustaining)
2015-09-11 13:02:54 Michal Rostecki fuel/6.1.x: assignee MOS Maintenance (mos-maintenance) → Michal Rostecki (mrostecki)
2015-09-11 13:46:49 Michal Rostecki fuel/6.1.x: status Triaged → In Progress
2015-09-13 16:17:37 Vitaly Sedelnik fuel/6.1.x: milestone 6.1-updates → 6.1-mu-3
2015-09-26 11:07:05 Vitaly Sedelnik fuel/6.0.x: milestone 6.0.2 → 6.0.1
2015-09-28 18:34:05 Leontii Istomin fuel/7.0.x: status Fix Committed → Fix Released
2015-10-06 08:21:10 Vitaly Sedelnik fuel/6.1.x: status In Progress → Fix Committed
2015-10-26 12:54:03 Vitaly Sedelnik fuel/5.1.x: assignee MOS Maintenance (mos-maintenance) → Denis Meltsaykin (dmeltsaykin)
2015-10-26 12:54:10 Vitaly Sedelnik fuel/6.0.x: assignee MOS Maintenance (mos-maintenance) → Denis Meltsaykin (dmeltsaykin)
2015-10-26 12:54:15 Vitaly Sedelnik fuel/5.1.x: milestone 5.1.1-updates → 5.1.1-mu-2
2015-10-26 12:54:22 Vitaly Sedelnik fuel/6.0.x: milestone 6.0-updates → 6.0-mu-7
2015-10-26 13:45:22 Denis Meltsaykin fuel/5.1.x: status Triaged → Won't Fix
2015-10-26 13:45:25 Denis Meltsaykin fuel/6.0.x: status Triaged → Won't Fix
2015-10-26 15:06:00 Vitaly Sedelnik fuel/5.1.x: milestone 5.1.1-mu-2 → 5.1.1-updates
2015-10-26 15:06:03 Vitaly Sedelnik fuel/6.0.x: milestone 6.0-mu-7 → 6.0-updates
2015-10-28 15:18:19 Vitaly Sedelnik fuel/6.1.x: status Fix Committed → Fix Released