Security group creation randomly fails with "Timed out waiting for a reply to message ID"

Bug #1394576 reported by Tatyanka
This bug affects 3 people
Affects              Status     Importance  Assigned to  Milestone
Mirantis OpenStack   Invalid    High        Unassigned
  5.1.x              Won't Fix  High        MOS Oslo
  6.0.x              Won't Fix  High        MOS Oslo
  6.1.x              Invalid    High        Unassigned

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.ha_neutron_destructive/34/testReport/junit/%28root%29/ha_destroy_controllers/ha_destroy_controllers/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "24"
  build_id: "2014-11-19_21-01-00"
  astute_sha: "fce051a6d013b1c30aa07320d225f9af734545de"
  fuellib_sha: "5611c516362bea0fd47fcb5376a9f22dcfbb8307"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "7580f6341a726c2019f880ae23ff3f1c581fd850"
  fuelmain_sha: "eac9e2704424d1cb3f183c9f74567fd42a1fa6f3"

Issue reproduced on Ubuntu and CentOS OSes

1. Deploy environment
Env configuration:
mode: HA
3 nodes with controller role
2 nodes with compute role
1 node with cinder role

network_provider: neutron gre
storage backend: cinder
interfaces:
INTERFACES = {
    'admin': 'eth0',
    'public': 'eth1',
    'management': 'eth2',
    'private': 'eth3',
    'storage': 'eth4',
}

2. Make sure that the "date" command on the slave nodes shows the same result
3. Run the OSTF HA tests until they pass
4. Run the OSTF sanity tests until they pass
5. Run the OSTF smoke tests - they fail on security group creation
6. SSH to the controller node and manually create security groups (see the sketch after this list)
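For step 6, a minimal sketch of creating security groups in a loop, assuming python-novaclient is available on the controller. The credentials, endpoint and names below are placeholders - take the real values from the environment's openrc:

    from novaclient import client as nova_client

    # Placeholder credentials/endpoint - substitute the values from openrc.
    nova = nova_client.Client('2', 'admin', 'admin', 'admin',
                              'http://192.168.0.2:5000/v2.0')

    # Per the report, repeating the creation ~20 times hits the intermittent
    # "Timed out waiting for a reply to message ID" failure a few times.
    for i in range(20):
        sg = nova.security_groups.create(name='test-sg-%d' % i,
                                         description='manual secgroup check')
        nova.security_groups.delete(sg)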

Since the issue was found on CI, you can use the snapshot of the environment with the problem reproduced. In this case you need to:

1. Revert the current deployment (you can use the command: dos.py revert <env_name> --snapshot-name <snapshot_name> && dos.py resume <env_name> && virsh net-dumpxml <env_name>_admin | grep -P "(\d+\.){3}" -o | awk '{print "Admin node IP: "$0"2"}')
2. Make sure that the "date" command on the slave nodes shows the same result. If not, synchronize time on the slaves manually
3. Run the OSTF HA tests until they pass
4. Run the OSTF sanity tests until they pass
5. Run the OSTF smoke tests - they fail on security group creation
6. SSH to the controller node and manually create security groups (see the sketch above)

You can also execute the ha_destroy_controllers test on localhost and, following the same steps 1-6 described above, see the same issue.

Actual result:
When I try to create a security group manually:
http://paste.openstack.org/show/135222/
In compute.log on the compute node:
http://paste.openstack.org/show/135221/

At the same time, one of the retries of the manual creation (the same group with the same name) finishes successfully:
http://paste.openstack.org/show/135223/

So there are some random failures, and it would be great if object creation were more stable.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
importance: Undecided → Medium
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"Timed out waiting for a reply to message ID" should be related to heartbeats and Oslo.messaging.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)
milestone: 5.1.1 → 5.1.2
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Medium bugs cannot be assigned to 5.1.1 due to SCF, moved to 5.1.2

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The same thing happens during runs of our staging jobs on attempts to delete networks:
http://paste.openstack.org/show/135894/
http://jenkins-product.srt.mirantis.net:8080/job/6.0.staging.centos.bvt_1/82/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The same situation occurs with different operations (snapshot creation, launching a VM, creating/deleting networks, etc.).

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
description: updated
description: updated
description: updated
Revision history for this message
Andrew Woodward (xarses) wrote :

Are we ensuring that crm has not been operating on rabbitmq-server after resuming the environment? The time skew after restoring a snapshot may be enough to trigger it into promoting a false failover.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Andrew, yes, we are. Also, as I mentioned, we check whether there is any time skew on the nodes, and if so we sync it manually. There is an additional check for RabbitMQ here (it creates queues, publishes to them and then consumes), and it passes. But when we manually created instances, security groups or other objects, the operation unpredictably failed with oslo traces. It can be 2-4 times out of 20 attempts.

Revision history for this message
Denis M. (dmakogon) wrote :

It appears that most of these faults are caused by Neutron, because Neutron doesn't use RPC call timeouts anywhere when waiting for replies from remote Neutron services.

Take a look at the Neutron L3 agent RPC API: https://github.com/openstack/neutron/blob/stable/juno/neutron/agent/l3_agent.py
You will see that Neutron doesn't use the timeout kwarg anywhere.

Investigating the oslo.messaging workflow, it appears that this problem can be addressed, but only partially, because of Neutron's missing RPC API timeouts.

Recommendation:

This should be pasted into neutron.conf on every controller: https://gist.github.com/denismakogon/839105ca2487df9b837d

In short, add to each neutron.conf:
        rabbit_retry_interval=12
        rabbit_max_retries=5
        kombu_reconnect_delay=20

Neutron team, please consider fixing the RPC API to allow deployers to configure timeouts for RPC API calls.

The next comment will contain a diagnostic snapshot and a RabbitMQ cluster status report taken during smoke test execution.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It looks like this bug should be High, as it affects RPC failover for Neutron's requests.

Revision history for this message
Denis M. (dmakogon) wrote :

Bogdan, but as you can see, this bug is caused by Neutron itself, and oslo.messaging just ends up being a single point of failure. This bug should be considered for moving from the MOS Oslo team to the Neutron team.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Neutron doesn't use RPC CALL timeouts for certain procedures, and this is why the Broken Pipe error appears.

Explanation: an RPC call opens a socket from the RPC client to the RPC server with an appropriate timeout. So when the client executes a call without a timeout, the client does not wait to receive the server's response and the transport socket is closed immediately (SIGPIPE), and when the server tries to write its response to that socket it fails with ERROR 32 - Broken Pipe.

So, as you can see, all OpenStack services should use explicitly defined timeouts when using the RPC CALL method. That is also why this bug is not caused by oslo.messaging: the stable/juno Neutron L3 agent RPC API doesn't use RPC CALL timeouts at all, see [2].

[2] https://github.com/openstack/neutron/blob/stable/juno/neutron/agent/l3_agent.py
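
For illustration, a minimal sketch of what an explicit per-call timeout looks like with the oslo.messaging RPC client. The topic, method and arguments below are made up, and the modern oslo_messaging package name is assumed (in the Juno timeframe it was imported as "from oslo import messaging"):

    from oslo_config import cfg
    import oslo_messaging as messaging  # assumption: post-Juno package name

    # Build a transport and an RPC client the usual way.
    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='l3_agent', version='1.0')  # hypothetical topic/version
    client = messaging.RPCClient(transport, target)

    # prepare() returns a copy of the client with per-call options overridden;
    # this caller waits at most 120 seconds for a reply instead of relying on
    # the library default.
    cctxt = client.prepare(timeout=120)
    result = cctxt.call({}, 'sync_routers', host='node-1')  # hypothetical method/arguments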

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Neutron doesn't specify the RPC timeout explicitly, which means the default value (60 seconds) is used. See https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/rpc/client.py#L137-L144 and https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/rpc/client.py#L34-L38. Timeout exceptions are raised when a service doesn't respond within that time.

"Socket closed" and "Broken pipe" errors come from the Rabbit driver located deep inside oslo.messaging. They mean that the TCP connection between the service and the RabbitMQ server was lost. Neutron, as a consumer of the high-level API, shouldn't have to bother with handling these connections or setting any parameters for them.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The bug is in OpenStack, hence moving it to the MOS project

no longer affects: fuel/5.1.x
no longer affects: fuel/6.1.x
no longer affects: fuel/6.0.x
no longer affects: fuel
Revision history for this message
Dmitry Savenkov (dsavenkov) wrote :

It's very likely that it won't be fixed in any of the 6.0.x releases, due to a number of systemic issues that we hope to resolve once some stress-testing results are obtained.

Revision history for this message
Oleksii Zamiatin (ozamiatin) wrote :

Cannot reproduce on MOS 6.1 (ISO 338)

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Not customer-found, no fix available - Won't Fix for 5.1.1-updates and 6.0-updates

