Slow TCP connectivity between VM and controller node

Bug #1431396 reported by Ilya Shakhat
Affects              Status     Importance  Assigned to       Milestone
Fuel for OpenStack   Won't Fix  High        Sergey Vasilenko
6.0.x                Invalid    Undecided   Sergey Vasilenko

Bug Description

Symptoms:
1. VM cannot retrieve metadata:
    * Cirros log contains
          failed to read iid from metadata. tried 20
          no results found for mode=net. up 241.41. searched: nocloud configdrive ec2
          failed to get instance-id of datasource
    * Ubuntu log contains
          2015-03-12 10:05:27,093 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
          Cloud-init v. 0.7.5 finished at Thu, 12 Mar 2015 10:09:41 +0000. Datasource DataSourceNone. Up 389.45 seconds
          2015-03-12 10:09:41,238 - cc_final_message.py[WARNING]: Used fallback datasource
2. Downloads to the VM are extremely slow:
    * time curl -I http://ya.ru
       real 0m 14.00s
3. It takes a couple of minutes to connect to the instance via SSH

However, ping (ICMP) traffic flows normally in both directions (here, from the controller L3 agent's namespace to the instance):
[root@node-14 neutron]# ip netns exec qrouter-b2a4bff5-5038-49e6-957f-50a278b773e6 ping -s 1400 10.0.0.5
PING 10.0.0.5 (10.0.0.5) 1400(1428) bytes of data.
1408 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.760 ms

Ilya Shakhat (shakhat)
summary: - Slow connectivity between VM and controller node
+ Slow TCP connectivity between VM and controller node
Revision history for this message
Ilya Shakhat (shakhat) wrote :

{u'build_id': u'2015-03-11_02-09-40', u'ostf_sha': u'8df5f2fcdae3bc9ea7d700ffd64db820baf51914', u'build_number': u'182', u'auth_required': True, u'nailgun_sha': u'058d1adef486c116ab8c79379ea6f925db039177', u'production': u'docker', u'api': u'1.0', u'python-fuelclient_sha': u'8a292dbdfc3afc1994fd8a81a28903f9a5cca351', u'astute_sha': u'93de472789d9fc351d915e401892c9f792c14ca2', u'fuelmain_sha': u'0f588ec9125cc1f4dd24a07d3bc6903c97b84d27', u'feature_groups': [u'mirantis'], u'release': u'6.1', u'release_versions': {u'2014.2-6.1': {u'VERSION': {u'build_id': u'2015-03-11_02-09-40', u'ostf_sha': u'8df5f2fcdae3bc9ea7d700ffd64db820baf51914', u'build_number': u'182', u'api': u'1.0', u'nailgun_sha': u'058d1adef486c116ab8c79379ea6f925db039177', u'production': u'docker', u'python-fuelclient_sha': u'8a292dbdfc3afc1994fd8a81a28903f9a5cca351', u'astute_sha': u'93de472789d9fc351d915e401892c9f792c14ca2', u'feature_groups': [u'mirantis'], u'release': u'6.1', u'fuelmain_sha': u'0f588ec9125cc1f4dd24a07d3bc6903c97b84d27', u'fuellib_sha': u'acd7dfb5f93ee0719464d07faf5883ee804a7205'}}}, u'fuellib_sha': u'acd7dfb5f93ee0719464d07faf5883ee804a7205'}

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Connectivity between controller and compute nodes: 9.35Gbits/s

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Traffic captured at the instance's interface (compute -> tapfa24978c-8d) contains a large number of TCP retransmission packets (see attached)
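
For anyone triaging a similar capture, the retransmissions can be counted from tshark's summary output. This is a hedged sketch: `cap.pcap` stands in for the attached capture's file name, and the helper itself only filters text, so it works on any tshark summary piped into it.

```shell
# Count segments that tshark's TCP analysis marks as retransmissions.
# The function only greps summary text, so tshark itself is not required
# to be installed to define it.
count_retrans() {
    grep -c 'TCP Retransmission' || true   # print 0 instead of failing when nothing matches
}

# usage (assumes tshark is installed and cap.pcap is the capture):
#   tshark -r cap.pcap | count_retrans
```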

tags: added: neutron
Changed in mos:
assignee: nobody → MOS Neutron (mos-neutron)
importance: Undecided → Critical
milestone: none → 6.1
status: New → Confirmed
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Connectivity between instances in the same L2 network on the same compute host is normal; between different compute hosts it is extremely poor.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Deployment: HA + Neutron VLAN

Changed in fuel:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → 6.1
assignee: nobody → Fuel for Openstack (fuel)
Ilya Shakhat (shakhat)
Changed in fuel:
assignee: Fuel for Openstack (fuel) → Sergey Vasilenko (xenolog)
no longer affects: mos
Changed in fuel:
status: Confirmed → New
tags: added: l23network
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

It looks like the temporarily disabled "disable offloading" feature affects some scale labs.

This feature will be re-implemented. It also depends on:
https://review.openstack.org/#/c/163090/
https://review.openstack.org/#/c/164429/

*** These changes should *not* be backported to 6.0.x and earlier.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/164450

tags: added: scale
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/164450
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=8b103a7394ef3e1f090c893c805b9d99ecc4ce36
Submitter: Jenkins
Branch: master

commit 8b103a7394ef3e1f090c893c805b9d99ecc4ce36
Author: Sergey Vasilenko <email address hidden>
Date: Sat Mar 14 19:38:42 2015 +0300

    FIX: "disable offloading" flag functionality

    The old network configuration
    (6.0 and earlier) had a global flag for disabling offloading.

    The new implementation has a 'disable offloading' flag per interface.
    This commit adds a default set of ethtool properties that disable
    offloading when no ethtool properties are given for the corresponding interface.

    Change-Id: Ic1aab4a6cfd32a3211d9a5c449ba887b7dc05953
    Closes-bug: #1431735
    Partial-bug: #1431396
    Blueprint: refactor-l23-linux-bridges
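
In ethtool terms, the per-interface behaviour the commit describes amounts to turning off the interface's offload features. The sketch below prints, rather than runs, the corresponding commands so they can be reviewed first; the feature list (rx/tx/sg/tso/gso/gro) is an assumption based on common offload knobs, not fuel-library's exact default set, and `eth0` is a placeholder.

```shell
# Print the ethtool commands that would disable the common offload
# features on one interface. Printing instead of executing keeps the
# sketch harmless; pipe to sh to actually apply.
offload_off_cmds() {
    local iface="$1"
    for feat in rx tx sg tso gso gro; do
        printf 'ethtool -K %s %s off\n' "$iface" "$feat"
    done
}

offload_off_cmds eth0
```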

Revision history for this message
Andrey Maximov (maximov) wrote :

According to Vladimir K., this problem can be reproduced only on CentOS 6.5 (kernel version 2.6.xx); after updating the kernel to version 3.10 the issue disappears.
We need to retest this configuration against Fuel 6.0 to make sure that the problem is purely a hardware-compatibility issue and NOT related to the changes we made for the L23 networking refactoring.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

According to the latest status, this is a hardware-specific issue that can be worked around by disabling TX checksums on the nodes, so I am lowering the priority from Critical to High.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Shaker performance results after applying the fix, measured instance-to-instance bandwidth in the same L2 domain (instances hosted on different compute nodes):
 * 1 thread - 3.33 Gb/s
 * 2 threads - 2 x 2.9 Gb/s = 5.4 Gb/s
 * 4 threads - 4 x 2.1 Gb/s = 8.4 Gb/s
 * 6 threads - 6 x 1.35 Gb/s = 8.1 Gb/s
(full report attached)

tags: added: release-notes
tags: added: docs
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

If anybody increases the MTU in instances (and on the private interface in particular), it leads to https://bugs.launchpad.net/fuel/+bug/1453425

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This issue is a driver bug. A workaround is available: disable TX checksumming on the 2.6.32 kernel.
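
On CentOS the runtime change (`ethtool -K <iface> tx off`) is lost on reboot; persistence is usually done via ETHTOOL_OPTS in the interface's ifcfg file. A minimal sketch, with two stated assumptions: `eth1` is a placeholder interface name, and support for `-K` inside ETHTOOL_OPTS depends on the initscripts version, so verify it on your release first.

```shell
# Emit the ifcfg line that re-applies 'tx off' at ifup time. The operator
# would append this line to /etc/sysconfig/network-scripts/ifcfg-<iface>;
# this sketch only prints it, it does not touch system files.
persist_tx_off_line() {
    printf 'ETHTOOL_OPTS="-K %s tx off"\n' "$1"
}

persist_tx_off_line eth1
```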

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Please re-test using an ISO built after May 21.
This issue should be fixed by the set of MTU-adjustment fixes merged this week.

If not, I agree that it's a driver bug.

Changed in fuel:
status: Triaged → Won't Fix
no longer affects: fuel/6.1.x
Revision history for this message
Leontii Istomin (listomin) wrote :

reproduced with 6.1-480:
api: '1.0'
astute_sha: 5d570ae5e03909182db8e284fbe6e4468c0a4e3e
auth_required: true
build_id: 2015-05-29_17-57-31
build_number: '480'
feature_groups:
- mirantis
fuel-library_sha: 6461ed55d75d267d6ef9eca835011313c4d70a30
fuel-ostf_sha: 7413186490e8d651b8837b9eee75efa53f5e230b
fuelmain_sha: 6b5712a7197672d588801a1816f56f321cbceebd
nailgun_sha: 3830bdcb28ec050eed399fe782cc3dd5fbf31bde
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'

Configuration:
Baremetal, CentOS, IBP, Neutron-VLAN, Ceph-all, Nova-debug, nova-quotas, 6.1_480
Controllers:3 Computes:3

We have faced an issue where we can't reach a VM via SSH (CentOS + Neutron VLAN). Is this a known issue?
tcpdump on interface of instance on compute node (tcpdump -i qvo07deb7a0-56) when ssh to VM from router namespace:
http://paste.openstack.org/show/252948/
ping works well.
tcpdump on interface of instance on compute node (tcpdump -i qvo07deb7a0-56) when ping VM from router namespace:
http://paste.openstack.org/show/252947/

Offloading was disabled on each interface (screenshot is attached)

Revision history for this message
Leontii Istomin (listomin) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/187265

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergey Vasilenko (xenolog)
status: Won't Fix → In Progress
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

I confirm that disabling 'tx-checksumming' solves this issue.

Taking into account that this bug reproduces only on CentOS environments with the old (2.6.32) kernel, I see two ways to fix it:
* accept and merge the patch above
* merge nothing, but describe the issue in the release notes and recommend not using the old kernel on CentOS.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I suggest that we do not fix this issue but provide users with clear instructions on how to work around it

Changed in fuel:
status: In Progress → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Sergey Vasilenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/187265

Revision history for this message
Chris Arnott (chris-arnott) wrote :

[quote]
I suggest that we do not fix this issue but provide users with clear instructions how to work it around
[/quote]

What are the workaround instructions? I'm experiencing this problem (with a bonded pair of NICs) and have:
 - disabled tx checksumming on each NIC
 - NOT been able to disable tx checksumming on the bonded interface:
[root@node-1 ~]# ethtool -K bond0 tx off
Cannot change tx-checksumming
Could not change any device features

So the issue persists.
Should I be upgrading my kernel? Or running without network bonding?

Revision history for this message
Chris Arnott (chris-arnott) wrote :

An update to my comment #21.

Disabling checksumming on each eth interface of the bond has in fact fixed the problem for me.
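
The steps from the two comments above can be sketched as a loop over the bond's slaves. `bond0`, `eth0`, and `eth1` are placeholder names; in a real run the slave list would come from /sys/class/net/<bond>/bonding/slaves.

```shell
# tx checksumming cannot be toggled on the bond master itself (comment #21),
# so print an 'ethtool -K ... tx off' command for every slave instead.
# Printing keeps the sketch reviewable; pipe to sh to apply.
bond_slave_tx_off_cmds() {
    local bond="$1"; shift   # bond name kept for context; commands target the slaves
    for slave in "$@"; do
        printf 'ethtool -K %s tx off\n' "$slave"
    done
}

bond_slave_tx_off_cmds bond0 eth0 eth1
```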
