[neutron] Network verification failed on Neutron Vlan with Vlan tagging

Bug #1306705 reported by Andrey Sledzinskiy
This bug affects 2 people
Affects              Status         Importance  Assigned to   Milestone
Fuel for OpenStack   Fix Committed  High        Dima Shulyak
5.0.x                Fix Committed  High        Dima Shulyak

Bug Description

Bug is reproduced on {"build_id": "2014-04-10_16-27-34", "mirantis": "yes", "build_number": "94", "nailgun_sha": "d1794c66fbcff02d11e4e42c3d388cfec37d1eb0", "production": "prod", "ostf_sha": "17f2fe6e56452f8e2f01a385be4c4b87bf3698a8", "fuelmain_sha": "fb993e1a8cbfab92101fdffcc2f6be1c04b71d99", "astute_sha": "6ddf5d5a60ca67fec4b25bcf5fcda822e46d87e6", "release": "5.0", "fuellib_sha": "0e4c251a1ab4aed317aeb79d6859dfcfe1314463"}

Steps:
1. Create the following environment: CentOS, HA, KVM, Neutron VLAN with network tagging, Cinder LVM
2. Add 1 controller, 1 compute, 1 cinder, 1 ceph
3. Deploy cluster
4. Run Network verification

Expected: successful verification
Actual: error message listing the expected VLANs that were not received:
Untitled (ef:9d), MAC 64:21:43:74:ef:9d, eth0, VLANs 1009,1010,1011,1012,1013,1014
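
For illustration, the check behind this message can be thought of as a per-interface set difference between the VLAN IDs a node was expected to hear and the ones it actually captured. This is only a sketch, not the net_probe code: the `missing_vlans` helper and the sample VLAN range 1000-1014 are hypothetical, chosen so that the result matches the list reported above.

```python
from typing import Dict, Iterable, List


def missing_vlans(expected: Dict[str, Iterable[int]],
                  received: Dict[str, Iterable[int]]) -> Dict[str, List[int]]:
    """Return, per interface, the expected VLAN IDs that were never received."""
    report = {}
    for iface, vlans in expected.items():
        lost = sorted(set(vlans) - set(received.get(iface, ())))
        if lost:
            report[iface] = lost
    return report


# Hypothetical data: the node heard VLANs 1000-1008 but not 1009-1014.
expected = {"eth0": range(1000, 1015)}
received = {"eth0": range(1000, 1009)}
print(missing_vlans(expected, received))
```

An empty result would correspond to a successful verification; any non-empty entry is what the UI reports as "not received".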

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I see that only part of the VLAN range wasn't received (1009:1014), and only on one node.

To eliminate such issues, we need to allow listeners to wait longer for a response, up to 10 sec.

Changed in fuel:
status: New → Confirmed
assignee: nobody → Dima Shulyak (dshulyak)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/87768

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/88284

Revision history for this message
Dima Shulyak (dshulyak) wrote : Re: Network verification failed on Neutron Vlan with Vlan tagging

It seems my fix is not valid. It turns out that this happens only in HA mode, and only when corosync is turned on.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

If it's not valid, can you abandon the change?

Revision history for this message
Denis Egorenko (degorenko) wrote :

I have the same problem when I try to launch verification on Neutron GRE with VLAN tagging.

Revision history for this message
Denis Egorenko (degorenko) wrote :

But after deployment, network verification is OK.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Denis, please provide more details. In my understanding, if it happens before deployment, it is a completely different issue.

Revision history for this message
Denis Egorenko (degorenko) wrote :

[root@bootstrap ~]# ethtool -i eth1
driver: igb
version: 5.0.3-k
firmware-version: 1.80, 0x00000000
bus-info: 0000:02:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@bootstrap ~]# ethtool -i eth0
driver: igb
version: 5.0.3-k
firmware-version: 1.80, 0x00000000
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@bootstrap ~]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:25:90:94:9F:E4
          inet addr:10.20.0.173 Bcast:10.20.0.255 Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:fe94:9fe4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1306 errors:0 dropped:0 overruns:0 frame:0
          TX packets:39863 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:141653 (138.3 KiB) TX bytes:10026888 (9.5 MiB)
          Memory:df920000-df940000

eth1 Link encap:Ethernet HWaddr 00:25:90:94:9F:E5
          inet6 addr: fe80::225:90ff:fe94:9fe5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b) TX bytes:468 (468.0 b)
          Memory:df900000-df920000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Revision history for this message
Denis Egorenko (degorenko) wrote :

info from controller:

[root@bootstrap ~]# ethtool -i eth0
driver: e1000e
version: 2.3.2-k
firmware-version: 2.1-2
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@bootstrap ~]# ethtool -i eth1
driver: e1000e
version: 2.3.2-k
firmware-version: 2.1-2
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Denis, thanks for the info. That was another bug; it was fixed by patching the igb driver.

Changed in fuel:
importance: High → Medium
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Unable to reproduce it locally or on bare metal. AFAIK it is only reproducible in the Czech labs; I'm going to investigate it later.
Just lowering its priority to Medium.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 5.0 → 5.1
Mike Scherbakov (mihgen)
Changed in fuel:
status: In Progress → Incomplete
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I think the correct fix for this is raising the UDP buffer size, so if this error is reproduced, please try:

sysctl -w net.core.rmem_max=26214400

which is linux default x 2

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Setting rmem_max and wmem_max to the max value, 1410065408, seems to help.
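
To see what the buffer tuning in the comments above changes, the following sketch reads the current `net.core.rmem_max` ceiling and asks the kernel for a receive buffer on a UDP socket. It is an illustration under assumptions, not part of any proposed fix: `read_rmem_max` and `granted_rcvbuf` are hypothetical helper names, and the `/proc` path is Linux-specific. On Linux, a request above `rmem_max` is capped, which is why raising the sysctl matters.

```python
import socket
from typing import Optional

RMEM_MAX_PATH = "/proc/sys/net/core/rmem_max"  # Linux-only sysctl file


def read_rmem_max(path: str = RMEM_MAX_PATH) -> Optional[int]:
    """Return the current net.core.rmem_max value, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def granted_rcvbuf(requested: int) -> int:
    """Ask for a UDP receive buffer and return the size the kernel reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Comparing `granted_rcvbuf(26214400)` before and after running the `sysctl -w` command should show whether the larger buffer is actually being granted on a given node.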

Revision history for this message
Mike Scherbakov (mihgen) wrote :

So should we fix it somehow in the code?

Revision history for this message
Dima Shulyak (dshulyak) wrote :

It was required only on slow hardware, but if this fix won't affect deployment, we can use it.
I will ask the library folks to review it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Dmitry Shulyak (<email address hidden>) on branch: master
Review: https://review.openstack.org/88284
Reason: will restore if needed

Dima Shulyak (dshulyak)
Changed in fuel:
status: Incomplete → Won't Fix
Dima Shulyak (dshulyak)
Changed in fuel:
status: Won't Fix → Confirmed
Dmitry Ilyin (idv1985)
summary: - Network verification failed on Neutron Vlan with Vlan tagging
+ [neutron] Network verification failed on Neutron Vlan with Vlan tagging
Dima Shulyak (dshulyak)
Changed in fuel:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/110232

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I'm pretty sure now that the problem is in the Python libpcap bindings, because dumping traffic with tcpdump works for me, but
python-pcap misses a lot of packets.

Probably the reason tcpdump works well is that it does some preparation behind the scenes, like increasing the queue buffer size.

I will need to rewrite net_probe to use tcpdump instead of the python-pcap bindings.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/110298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/110298
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=24472577c08199bd10dc9b55cd44b29a1073e64a
Submitter: Jenkins
Branch: master

commit 24472577c08199bd10dc9b55cd44b29a1073e64a
Author: Dima Shulyak <email address hidden>
Date: Tue Jul 29 17:08:29 2014 +0300

    Add tcpdump dependency to nailgun-net-check

    Change-Id: I56714c2e4ee98d0ba936f5b8b695d6f2c53580e8
    Related-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/110232
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=49585312d511622746c446dc5676e4bce70d652c
Submitter: Jenkins
Branch: master

commit 49585312d511622746c446dc5676e4bce70d652c
Author: Dima Shulyak <email address hidden>
Date: Tue Jul 29 17:03:40 2014 +0300

    Use tcpdump for traffic dumping in net_probe

    I found out that tcpdump much more reliable than
    python libpcap bindings, and works well under load

    - start tcpdump listeners for each iface
      (will catch both tagged/untagged traffic)
    - deserealize pcap data with help of scapy rdpcap util
    - all pcap file will be stored in /var/run/pcap_dir by default
      and maybe attached to diagnostic snapshot

    Change-Id: Id8320d4a05c84687c7a0a3d0ddbce4d05a7115ea
    Closes-Bug: 1306705
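
The approach described in this commit message can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from the review: the helper names and the `<iface>.pcap` file naming are hypothetical, `/var/run/pcap_dir` is taken from the commit message, and `rdpcap` comes from the third-party scapy package, imported lazily so the rest of the sketch works without it.

```python
import os
import subprocess
from typing import List


def tcpdump_command(iface: str, pcap_dir: str = "/var/run/pcap_dir") -> List[str]:
    """Build the tcpdump command line for one interface."""
    pcap_file = os.path.join(pcap_dir, f"{iface}.pcap")  # hypothetical naming
    return [
        "tcpdump",
        "-i", iface,      # capture on this interface (tagged and untagged frames)
        "-U",             # write each packet to the file as soon as it is captured
        "-w", pcap_file,  # save raw packets instead of printing them
    ]


def start_listener(iface: str, pcap_dir: str = "/var/run/pcap_dir") -> subprocess.Popen:
    """Start a tcpdump listener for one interface (needs root and tcpdump)."""
    os.makedirs(pcap_dir, exist_ok=True)
    return subprocess.Popen(tcpdump_command(iface, pcap_dir))


def read_capture(pcap_file: str):
    """Deserialize a capture file into a list of scapy packets."""
    from scapy.all import rdpcap  # third-party dependency, imported on demand
    return rdpcap(pcap_file)
```

A caller would start one listener per interface, let the senders run, terminate the listeners, and then pass each capture file to `read_capture` to extract the VLAN IDs that arrived.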

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/111207

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111211

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/111207
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=36d2f118c78e55b3cc54aed8afb25024d823d78c
Submitter: Jenkins
Branch: master

commit 36d2f118c78e55b3cc54aed8afb25024d823d78c
Author: Dima Shulyak <email address hidden>
Date: Fri Aug 1 11:51:52 2014 +0300

    Add pcap dumps in diagnostic snapshot

    Change-Id: Ifa47ece58813316d3b043702be93ea592f75c049
    Related-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/111211
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=746b1a654f1a7a85393eaaa154029790de07c307
Submitter: Jenkins
Branch: master

commit 746b1a654f1a7a85393eaaa154029790de07c307
Author: Dima Shulyak <email address hidden>
Date: Tue Jul 29 11:11:42 2014 +0300

    Generate traffic for a given amount of time

    Sending prefefined amount of packets for each interface:vlan pair
    can still result in random verify_networks failures on heavily loaded
    environments

    - sender will generate traffic based on time provided with --duration option
      or from config file, defaults to 20 sec
    - repeat option will be used to configure amount of packets per iface:vlan
      pair sended in each iteration, defaults to 2 packets

    Change-Id: Ie92a3ea175ca2ae9f43e79f66b449366a1b68126
    Partial-Bug: 1306705
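
As a rough sketch of the time-based generation this commit describes: instead of sending a fixed number of packets, the sender keeps cycling over every interface:vlan pair until a deadline passes. The function and parameter names below are hypothetical; only the defaults (20 seconds, 2 packets per pair per iteration) come from the commit message. The clock is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Iterable, Tuple


def send_for_duration(send_packet: Callable[[str, int], None],
                      pairs: Iterable[Tuple[str, int]],
                      duration: float = 20.0,
                      repeat: int = 2,
                      interval: float = 0.0,
                      clock: Callable[[], float] = time.monotonic) -> int:
    """Send `repeat` packets per (iface, vlan) pair until `duration` elapses.

    Returns the total number of packets handed to `send_packet`.
    """
    pairs = list(pairs)
    deadline = clock() + duration
    sent = 0
    while clock() < deadline:
        for iface, vlan in pairs:
            for _ in range(repeat):
                send_packet(iface, vlan)
                sent += 1
        if interval:
            time.sleep(interval)  # optional pause between iterations
    return sent
```

In real use, `send_packet` would emit a tagged probe frame on the given interface; here it can be any callable, for example one that appends to a list for inspection.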

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/111341

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/111341
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=fdf45e1db0d1a509c2a151d84a4c2cb4c133e2a9
Submitter: Jenkins
Branch: master

commit fdf45e1db0d1a509c2a151d84a4c2cb4c133e2a9
Author: Dima Shulyak <email address hidden>
Date: Fri Aug 1 20:09:09 2014 +0300

    Increase timeout for network verification

    Timeout for verify_networks increased from 2 to 5 min

    Change-Id: I24702d2a6058d4bf43151f74382450f7729f1a1d
    Related-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/111462

Mike Scherbakov (mihgen)
Changed in fuel:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/111462
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=ce86172e77661026c91fdf1ff8066d7df1f7d89d
Submitter: Jenkins
Branch: master

commit ce86172e77661026c91fdf1ff8066d7df1f7d89d
Author: Dima Shulyak <email address hidden>
Date: Sat Aug 2 13:07:32 2014 +0300

    Increase mcollective timeout for net_probe tasks

    Adding +2 minutes for each net_probe call

    Change-Id: I38092e65684523023c73ccdd84d751ac49693b2f
    Related-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111545

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/111545
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=8db571f2bad2b601dc15211a152664d34cac450c
Submitter: Jenkins
Branch: master

commit 8db571f2bad2b601dc15211a152664d34cac450c
Author: Dima Shulyak <email address hidden>
Date: Sun Aug 3 08:49:05 2014 +0300

    Tune net_check generator params

    Huge load of traffic could not be processed under time
    constraints that we have either in system tests/orchestrator

    - change duration of traffic generation to 5 sec
    - send only 1 packets in each iteration

    Change-Id: Ia503c2df72bca82952c51ac794615faa374ebf21
    Closes-Bug: #1306705

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Egor Kotko (ykotko) wrote :

Reproduced on the last system tests:
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_2/131/testReport/%28root%29/simple_flat_blocked_vlan/simple_flat_blocked_vlan/

{"build_id": "2014-08-06_02-01-17", "ostf_sha": "be71965998364bf8e6415bd38b75c84b63aab867", "build_number": "405", "auth_required": true, "api": "1.0", "nailgun_sha": "f64b06c788e2b92fcb8e678ea6d0c9b86e8d0ab7", "production": "docker", "fuelmain_sha": "124ea87f1ac1c06e27613fe3b31fd5fc6b39e82d", "astute_sha": "99a790ad1b7526cbbd5bf8add0cb2b4e503fccd4", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "513ec5cdcdef74c7419d5bae967b9edc7da8dbd7"}

Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Dima Shulyak (dshulyak) wrote :

AFAIK this test blocks a VLAN on the bridge with ebtables. I'm not sure where the issue is, but it is not related to this bug.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Looks like it started to fail on Jenkins for the 5.0 release.

Revision history for this message
Dima Shulyak (dshulyak) wrote :

To fix this for 5.0-based environments, we need to update the nailgun-net-check package to a recent version.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (stable/5.0)

Related fix proposed to branch: stable/5.0
Review: https://review.openstack.org/115645

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/115648

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/5.0)

Change abandoned by Dmitry Shulyak (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/115648

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/117124

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/117125

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (stable/5.0)

Reviewed: https://review.openstack.org/115645
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=dfe7e8478526714c6318ad3db813f74492cc1709
Submitter: Jenkins
Branch: stable/5.0

commit dfe7e8478526714c6318ad3db813f74492cc1709
Author: Dima Shulyak <email address hidden>
Date: Tue Jul 29 17:08:29 2014 +0300

    Add tcpdump dependency to nailgun-net-check

    Change-Id: I56714c2e4ee98d0ba936f5b8b695d6f2c53580e8
    Related-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/5.0)

Reviewed: https://review.openstack.org/117124
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=eea6b54d225f1304a51a9cedb3be5e4f96656f12
Submitter: Jenkins
Branch: stable/5.0

commit eea6b54d225f1304a51a9cedb3be5e4f96656f12
Author: Dima Shulyak <email address hidden>
Date: Tue Jul 29 17:03:40 2014 +0300

    Use tcpdump for traffic dumping in net_probe

    I found out that tcpdump much more reliable than
    python libpcap bindings, and works well under load

    - start tcpdump listeners for each iface
      (will catch both tagged/untagged traffic)
    - deserealize pcap data with help of scapy rdpcap util
    - all pcap file will be stored in /var/run/pcap_dir by default
      and maybe attached to diagnostic snapshot

    Change-Id: Id8320d4a05c84687c7a0a3d0ddbce4d05a7115ea
    Closes-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/117125
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=9a65c23980ca7d119c260787e93e4cf154dd9cf6
Submitter: Jenkins
Branch: stable/5.0

commit 9a65c23980ca7d119c260787e93e4cf154dd9cf6
Author: Dima Shulyak <email address hidden>
Date: Tue Jul 29 11:11:42 2014 +0300

    Generate traffic for a given amount of time

    Sending prefefined amount of packets for each interface:vlan pair
    can still result in random verify_networks failures on heavily loaded
    environments

    - sender will generate traffic based on time provided with --duration option
      or from config file, defaults to 20 sec
    - repeat option will be used to configure amount of packets per iface:vlan
      pair sended in each iteration, defaults to 2 packets

    Change-Id: Ie92a3ea175ca2ae9f43e79f66b449366a1b68126
    Partial-Bug: 1306705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/115648
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=c18a21381843dffe807b254a4ff96eec259953cb
Submitter: Jenkins
Branch: stable/5.0

commit c18a21381843dffe807b254a4ff96eec259953cb
Author: Dima Shulyak <email address hidden>
Date: Sun Aug 3 08:49:05 2014 +0300

    Tune net_check generator params

    Huge load of traffic could not be processed under time
    constraints that we have either in system tests/orchestrator

    - change duration of traffic generation to 5 sec
    - send only 1 packets in each iteration

    Change-Id: Ia503c2df72bca82952c51ac794615faa374ebf21
    Closes-Bug: #1306705
