Security group creation randomly fails with "Timed out waiting for a reply to message ID"

Bug #1394576 reported by Tatyanka
This bug affects 3 people
Affects              Status     Importance  Assigned to  Milestone
Mirantis OpenStack   Invalid    High        Unassigned
  5.1.x              Won't Fix  High        MOS Oslo
  6.0.x              Won't Fix  High        MOS Oslo
  6.1.x              Invalid    High        Unassigned

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.ha_neutron_destructive/34/testReport/junit/%28root%29/ha_destroy_controllers/ha_destroy_controllers/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "24"
  build_id: "2014-11-19_21-01-00"
  astute_sha: "fce051a6d013b1c30aa07320d225f9af734545de"
  fuellib_sha: "5611c516362bea0fd47fcb5376a9f22dcfbb8307"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "7580f6341a726c2019f880ae23ff3f1c581fd850"
  fuelmain_sha: "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3"

Issue reproduced on Ubuntu and CentOS OSes

1. Deploy environment
Env configuration:
mode: HA
3 nodes with controller role
2 nodes with compute role
1 node with cinder role

network_provider: neutron gre
storage backend: cinder
interfaces:
INTERFACES = {
    'admin': 'eth0',
    'public': 'eth1',
    'management': 'eth2',
    'private': 'eth3',
    'storage': 'eth4',
}

2. Make sure that the "date" command on the slave nodes shows the same result
3. Run the OSTF HA tests until they pass
4. Run the OSTF sanity tests until they pass
5. Run the OSTF smoke tests - they fail on security group creation
6. SSH to the controller node and manually create security groups (see the sketch after this list)
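For step 6, a minimal sketch of creating security groups in a loop, assuming python-novaclient is available on the controller. The credentials, endpoint and names below are placeholders - take the real values from the environment's openrc:

    from novaclient import client as nova_client

    # Placeholder credentials/endpoint - substitute the values from openrc.
    nova = nova_client.Client('2', 'admin', 'admin', 'admin',
                              'http://192.168.0.2:5000/v2.0')

    # Per the report, repeating the creation ~20 times hits the intermittent
    # "Timed out waiting for a reply to message ID" failure a few times.
    for i in range(20):
        sg = nova.security_groups.create(name='test-sg-%d' % i,
                                         description='manual secgroup check')
        nova.security_groups.delete(sg)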

Since the issue was found on CI, you can use the snapshot of the environment with the problem reproduced. In this case you need to:

1. Revert the current deployment (you can use the command: dos.py revert <env_name> --snapshot-name <snapshot_name> && dos.py resume <env_name> && virsh net-dumpxml <env_name>_admin | grep -P "(\d+\.){3}" -o | awk '{print "Admin node IP: "$0"2"}')
2. Make sure that the "date" command on the slave nodes shows the same result. If not, synchronize time on the slaves manually
3. Run the OSTF HA tests until they pass
4. Run the OSTF sanity tests until they pass
5. Run the OSTF smoke tests - they fail on security group creation
6. SSH to the controller node and manually create security groups (see the sketch above)

You can also execute the ha_destroy_controllers test on localhost and, following the same steps 1-6 described above, see the same issue.

Actual result:
When I try to create a security group manually:
http://paste.openstack.org/show/135222/
In compute.log on the compute node:
http://paste.openstack.org/show/135221/

At the same time, one of the retries of the manual creation (the same group with the same name) finishes successfully:
http://paste.openstack.org/show/135223/

So there are some random failures, and it would be great if object creation were more stable.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
importance: Undecided → Medium
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"Timed out waiting for a reply to message ID" should be related to heartbeats and Oslo.messaging.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)
milestone: 5.1.1 → 5.1.2
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Medium bugs cannot be assigned to 5.1.1 due to SCF, moved to 5.1.2

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The same thing happens during runs of our staging jobs on attempts to delete networks:
http://paste.openstack.org/show/135894/
http://jenkins-product.srt.mirantis.net:8080/job/6.0.staging.centos.bvt_1/82/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The same situation occurs with different operations (snapshot creation, launching a VM, creating/deleting networks, etc.).

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
description: updated
description: updated
description: updated
Revision history for this message
Andrew Woodward (xarses) wrote :

Are we ensuring that crm has not been operating on rabbitmq-server after resuming the environment? The time skew after restoring a snapshot may be enough to trigger it into promoting a false failover.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Andrew, yes, we are. Also, as I mentioned, we check whether there is any time skew on the nodes, and if so we sync it manually. There is an additional check for RabbitMQ here (it creates queues, publishes to them and then consumes), and it passes. But when we manually created instances, security groups or other objects, the operation unpredictably failed with oslo traces. It can be 2-4 times out of 20 attempts.

Revision history for this message
Denis M. (dmakogon) wrote :

It appears that most of these faults are caused by Neutron, because Neutron doesn't use RPC call timeouts anywhere when waiting for replies from remote Neutron services.

Take a look at the Neutron L3 agent RPC API: https://github.com/openstack/neutron/blob/stable/juno/neutron/agent/l3_agent.py
You will see that Neutron doesn't use the timeout kwarg anywhere.

Investigating the oslo.messaging workflow, it appears that this problem can be addressed, but only partially, because of Neutron's missing RPC API timeouts.

Recommendation:

This should be pasted into neutron.conf on every controller: https://gist.github.com/denismakogon/839105ca2487df9b837d

In short, add to each neutron.conf:
        rabbit_retry_interval=12
        rabbit_max_retries=5
        kombu_reconnect_delay=20

Neutron team, please consider fixing the RPC API to allow deployers to configure timeouts for RPC API calls.

The next comment will contain a diagnostic snapshot and a RabbitMQ cluster status report taken during smoke test execution.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It looks like this bug should be High, as it affects RPC failover for Neutron's requests.

Revision history for this message
Denis M. (dmakogon) wrote :

Bogdan, but as you can see, this bug is caused by Neutron itself, and oslo.messaging just ends up being a single point of failure. This bug should be considered for moving from the MOS Oslo team to the Neutron team.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Neutron doesn't use RPC CALL timeouts for certain procedures, and this is why the Broken Pipe error appears.

Explanation: an RPC call opens a socket from the RPC client to the RPC server with an appropriate timeout. So when the client executes a call without a timeout, the client does not wait to receive the server's response and the transport socket is closed immediately (SIGPIPE), and when the server tries to write its response to that socket it fails with ERROR 32 - Broken Pipe.

So, as you can see, all OpenStack services should use explicitly defined timeouts when using the RPC CALL method. That is also why this bug is not caused by oslo.messaging: the stable/juno Neutron L3 agent RPC API doesn't use RPC CALL timeouts at all, see [2].

[2] https://github.com/openstack/neutron/blob/stable/juno/neutron/agent/l3_agent.py
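
For illustration, a minimal sketch of what an explicit per-call timeout looks like with the oslo.messaging RPC client. The topic, method and arguments below are made up, and the modern oslo_messaging package name is assumed (in the Juno timeframe it was imported as "from oslo import messaging"):

    from oslo_config import cfg
    import oslo_messaging as messaging  # assumption: post-Juno package name

    # Build a transport and an RPC client the usual way.
    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='l3_agent', version='1.0')  # hypothetical topic/version
    client = messaging.RPCClient(transport, target)

    # prepare() returns a copy of the client with per-call options overridden;
    # this caller waits at most 120 seconds for a reply instead of relying on
    # the library default.
    cctxt = client.prepare(timeout=120)
    result = cctxt.call({}, 'sync_routers', host='node-1')  # hypothetical method/arguments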

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Neutron doesn't specify the RPC timeout explicitly, which means the default value (60 seconds) is used. See https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/rpc/client.py#L137-L144 and https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/rpc/client.py#L34-L38. Timeout exceptions are raised when a service doesn't respond within that time.

"Socket closed" and "Broken pipe" errors come from the Rabbit driver located deep inside oslo.messaging. They mean that the TCP connection between the service and the RabbitMQ server was lost. Neutron, as a consumer of the high-level API, shouldn't have to bother with handling these connections or setting any parameters for them.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The bug is in OpenStack, hence moving it to the MOS project

no longer affects: fuel/5.1.x
no longer affects: fuel/6.1.x
no longer affects: fuel/6.0.x
no longer affects: fuel
Revision history for this message
Dmitry Savenkov (dsavenkov) wrote :

It's very likely that it won't be fixed in any of the 6.0.x releases, due to a number of systemic issues that we hope to resolve once some stress-testing results are obtained.

Revision history for this message
Oleksii Zamiatin (ozamiatin) wrote :

Cannot reproduce on MOS 6.1 (ISO 338)

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Not customer-found, no fix available - Won't Fix for 5.1.1-updates and 6.0-updates

