Slow TCP connectivity between VM and controller node

Bug #1431396 reported by Ilya Shakhat
Affects              Status     Importance  Assigned to       Milestone
Fuel for OpenStack   Won't Fix  High        Sergey Vasilenko
6.0.x                Invalid    Undecided   Sergey Vasilenko

Bug Description

Symptoms:
1. VM cannot retrieve metadata:
    * Cirros log contains
          failed to read iid from metadata. tried 20
          no results found for mode=net. up 241.41. searched: nocloud configdrive ec2
          failed to get instance-id of datasource
    * Ubuntu log contains
          2015-03-12 10:05:27,093 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
          Cloud-init v. 0.7.5 finished at Thu, 12 Mar 2015 10:09:41 +0000. Datasource DataSourceNone. Up 389.45 seconds
          2015-03-12 10:09:41,238 - cc_final_message.py[WARNING]: Used fallback datasource
2. Downloads to the VM are extremely slow:
    * time curl -I http://ya.ru
       real 0m 14.00s
3. It takes a couple of minutes to connect to the instance via SSH

However, ping (ICMP) traffic flows normally in both directions (here, from the controller L3 agent's namespace to the instance):
[root@node-14 neutron]# ip netns exec qrouter-b2a4bff5-5038-49e6-957f-50a278b773e6 ping -s 1400 10.0.0.5
PING 10.0.0.5 (10.0.0.5) 1400(1428) bytes of data.
1408 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.760 ms

Ilya Shakhat (shakhat)
summary: - Slow connectivity between VM and controller node
+ Slow TCP connectivity between VM and controller node
Revision history for this message
Ilya Shakhat (shakhat) wrote :

{u'build_id': u'2015-03-11_02-09-40', u'ostf_sha': u'8df5f2fcdae3bc9ea7d700ffd64db820baf51914', u'build_number': u'182', u'auth_required': True, u'nailgun_sha': u'058d1adef486c116ab8c79379ea6f925db039177', u'production': u'docker', u'api': u'1.0', u'python-fuelclient_sha': u'8a292dbdfc3afc1994fd8a81a28903f9a5cca351', u'astute_sha': u'93de472789d9fc351d915e401892c9f792c14ca2', u'fuelmain_sha': u'0f588ec9125cc1f4dd24a07d3bc6903c97b84d27', u'feature_groups': [u'mirantis'], u'release': u'6.1', u'release_versions': {u'2014.2-6.1': {u'VERSION': {u'build_id': u'2015-03-11_02-09-40', u'ostf_sha': u'8df5f2fcdae3bc9ea7d700ffd64db820baf51914', u'build_number': u'182', u'api': u'1.0', u'nailgun_sha': u'058d1adef486c116ab8c79379ea6f925db039177', u'production': u'docker', u'python-fuelclient_sha': u'8a292dbdfc3afc1994fd8a81a28903f9a5cca351', u'astute_sha': u'93de472789d9fc351d915e401892c9f792c14ca2', u'feature_groups': [u'mirantis'], u'release': u'6.1', u'fuelmain_sha': u'0f588ec9125cc1f4dd24a07d3bc6903c97b84d27', u'fuellib_sha': u'acd7dfb5f93ee0719464d07faf5883ee804a7205'}}}, u'fuellib_sha': u'acd7dfb5f93ee0719464d07faf5883ee804a7205'}

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Connectivity between controller and compute nodes: 9.35Gbits/s

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Traffic captured at the instance's interface (compute -> tapfa24978c-8d) contains a large number of TCP retransmission packets (see attached)
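
For anyone triaging a similar capture, the retransmissions can be counted from tshark's summary output. This is a hedged sketch: `cap.pcap` stands in for the attached capture's file name, and the helper itself only filters text, so it works on any tshark summary piped into it.

```shell
# Count segments that tshark's TCP analysis marks as retransmissions.
# The function only greps summary text, so tshark itself is not required
# to be installed to define it.
count_retrans() {
    grep -c 'TCP Retransmission' || true   # print 0 instead of failing when nothing matches
}

# usage (assumes tshark is installed and cap.pcap is the capture):
#   tshark -r cap.pcap | count_retrans
```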

tags: added: neutron
Changed in mos:
assignee: nobody → MOS Neutron (mos-neutron)
importance: Undecided → Critical
milestone: none → 6.1
status: New → Confirmed
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Connectivity between instances in the same L2 network on the same compute host is normal; between different compute hosts it is extremely poor.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Deployment: HA + Neutron VLAN

Changed in fuel:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → 6.1
assignee: nobody → Fuel for Openstack (fuel)
Ilya Shakhat (shakhat)
Changed in fuel:
assignee: Fuel for Openstack (fuel) → Sergey Vasilenko (xenolog)
no longer affects: mos
Changed in fuel:
status: Confirmed → New
tags: added: l23network
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

It looks like the temporarily disabled "disable offloading" feature affects some scale labs.

This feature will be re-implemented. It also depends on:
https://review.openstack.org/#/c/163090/
https://review.openstack.org/#/c/164429/

*** These changes should *not* be backported to 6.0.x and earlier.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/164450

tags: added: scale
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/164450
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=8b103a7394ef3e1f090c893c805b9d99ecc4ce36
Submitter: Jenkins
Branch: master

commit 8b103a7394ef3e1f090c893c805b9d99ecc4ce36
Author: Sergey Vasilenko <email address hidden>
Date: Sat Mar 14 19:38:42 2015 +0300

    FIX: "disable offloading" flag functionality

    The old network configuration
    (6.0 and earlier) had a global flag for disabling offloading.

    The new implementation has a 'disable offloading' flag per interface.
    This commit adds a default set of ethtool properties that disable
    offloading when no ethtool properties are given for the corresponding interface.

    Change-Id: Ic1aab4a6cfd32a3211d9a5c449ba887b7dc05953
    Closes-bug: #1431735
    Partial-bug: #1431396
    Blueprint: refactor-l23-linux-bridges
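
In ethtool terms, the per-interface behaviour the commit describes amounts to turning off the interface's offload features. The sketch below prints, rather than runs, the corresponding commands so they can be reviewed first; the feature list (rx/tx/sg/tso/gso/gro) is an assumption based on common offload knobs, not fuel-library's exact default set, and `eth0` is a placeholder.

```shell
# Print the ethtool commands that would disable the common offload
# features on one interface. Printing instead of executing keeps the
# sketch harmless; pipe to sh to actually apply.
offload_off_cmds() {
    local iface="$1"
    for feat in rx tx sg tso gso gro; do
        printf 'ethtool -K %s %s off\n' "$iface" "$feat"
    done
}

offload_off_cmds eth0
```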

Revision history for this message
Andrey Maximov (maximov) wrote :

According to Vladimir K., this problem can be reproduced only on CentOS 6.5 (kernel version 2.6.xx); after updating the kernel to version 3.10 the issue disappears.
We need to retest this configuration against Fuel 6.0 to make sure that the problem is purely a hardware-compatibility issue and NOT related to the changes we made for the L23 networking refactoring.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

According to the latest status, this is a hardware-specific issue that can be worked around by disabling TX checksums on the nodes, so I am lowering the priority from Critical to High.

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Shaker performance results after applying the fix, measured instance-to-instance bandwidth in the same L2 domain (instances hosted on different compute nodes):
 * 1 thread - 3.33 Gb/s
 * 2 threads - 2 x 2.9 Gb/s = 5.4 Gb/s
 * 4 threads - 4 x 2.1 Gb/s = 8.4 Gb/s
 * 6 threads - 6 x 1.35 Gb/s = 8.1 Gb/s
(full report attached)

tags: added: release-notes
tags: added: docs
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

If anybody increases the MTU in instances (and on the private interface in particular), it leads to https://bugs.launchpad.net/fuel/+bug/1453425

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This issue is a driver bug. A workaround is available: disable TX checksumming on the 2.6.32 kernel.
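
On CentOS the runtime change (`ethtool -K <iface> tx off`) is lost on reboot; persistence is usually done via ETHTOOL_OPTS in the interface's ifcfg file. A minimal sketch, with two stated assumptions: `eth1` is a placeholder interface name, and support for `-K` inside ETHTOOL_OPTS depends on the initscripts version, so verify it on your release first.

```shell
# Emit the ifcfg line that re-applies 'tx off' at ifup time. The operator
# would append this line to /etc/sysconfig/network-scripts/ifcfg-<iface>;
# this sketch only prints it, it does not touch system files.
persist_tx_off_line() {
    printf 'ETHTOOL_OPTS="-K %s tx off"\n' "$1"
}

persist_tx_off_line eth1
```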

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Please re-test using an ISO built after May 21.
This issue should be fixed by the set of MTU-adjustment fixes merged this week.

If not, I agree that it's a driver bug.

Changed in fuel:
status: Triaged → Won't Fix
no longer affects: fuel/6.1.x
Revision history for this message
Leontii Istomin (listomin) wrote :

reproduced with 6.1-480:
api: '1.0'
astute_sha: 5d570ae5e03909182db8e284fbe6e4468c0a4e3e
auth_required: true
build_id: 2015-05-29_17-57-31
build_number: '480'
feature_groups:
- mirantis
fuel-library_sha: 6461ed55d75d267d6ef9eca835011313c4d70a30
fuel-ostf_sha: 7413186490e8d651b8837b9eee75efa53f5e230b
fuelmain_sha: 6b5712a7197672d588801a1816f56f321cbceebd
nailgun_sha: 3830bdcb28ec050eed399fe782cc3dd5fbf31bde
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'

Configuration:
Baremetal, CentOS, IBP, Neutron-VLAN, Ceph-all, Nova-debug, nova-quotas, 6.1_480
Controllers:3 Computes:3

We have faced an issue where we can't reach a VM via SSH (CentOS + Neutron VLAN). Is this a known issue?
tcpdump on interface of instance on compute node (tcpdump -i qvo07deb7a0-56) when ssh to VM from router namespace:
http://paste.openstack.org/show/252948/
ping works well.
tcpdump on interface of instance on compute node (tcpdump -i qvo07deb7a0-56) when ping VM from router namespace:
http://paste.openstack.org/show/252947/

Offloading was disabled on each interface (screenshot is attached)

Revision history for this message
Leontii Istomin (listomin) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/187265

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergey Vasilenko (xenolog)
status: Won't Fix → In Progress
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

I confirm that disabling 'tx-checksumming' solves this issue.

Taking into account that this bug reproduces only on CentOS environments with the old (2.6.32) kernel, I see two ways to fix it:
* accept and merge the patch above
* merge nothing, but describe the issue in the release notes and recommend not using the old kernel on CentOS.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I suggest that we do not fix this issue but provide users with clear instructions on how to work around it

Changed in fuel:
status: In Progress → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Sergey Vasilenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/187265

Revision history for this message
Chris Arnott (chris-arnott) wrote :

[quote]
I suggest that we do not fix this issue but provide users with clear instructions how to work it around
[/quote]

What are the workaround instructions? I'm experiencing this problem (with a bonded pair of NICs) and have:
 - disabled tx checksumming on each NIC
 - NOT been able to disable tx checksumming on the bonded interface:
[root@node-1 ~]# ethtool -K bond0 tx off
Cannot change tx-checksumming
Could not change any device features

So the issue persists.
Should I be upgrading my kernel? Or running without network bonding?

Revision history for this message
Chris Arnott (chris-arnott) wrote :

An update to my comment #21.

Disabling checksumming on each eth interface of the bond has in fact fixed the problem for me.
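
The steps from the two comments above can be sketched as a loop over the bond's slaves. `bond0`, `eth0`, and `eth1` are placeholder names; in a real run the slave list would come from /sys/class/net/<bond>/bonding/slaves.

```shell
# tx checksumming cannot be toggled on the bond master itself (comment #21),
# so print an 'ethtool -K ... tx off' command for every slave instead.
# Printing keeps the sketch reviewable; pipe to sh to apply.
bond_slave_tx_off_cmds() {
    local bond="$1"; shift   # bond name kept for context; commands target the slaves
    for slave in "$@"; do
        printf 'ethtool -K %s tx off\n' "$slave"
    done
}

bond_slave_tx_off_cmds bond0 eth0 eth1
```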
