Intermittent 3rd party CI failures: 400 Bad Request [Errno 101] ENETUNREACH
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tempest | Expired | Undecided | Unassigned |
Bug Description
Our 3rd party CI tests for Cinder have been failing intermittently.
For example: http://
Analysis:
The trigger is the VolumesV1ActionsTest run: while it executes, nova network sets up the br100 bridge, moving the host's networking off eth0 (see the log excerpt at the end of this report). During that time, no new sockets can be created; httplib2 returns socket error [Errno 101] ENETUNREACH.
This in turn causes the external HTTP REST clients to fail with an HTTP 400 Bad Request, which propagates to Cinder as a failure.
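The failure mode can be checked from the shell: during the cutover, a TCP connect to the API endpoint fails immediately with "Network is unreachable" rather than timing out. Here is a minimal probe loop in the same spirit as the monitoring script below; the endpoint address and port (cinder's default 8776) are assumptions for illustration, not taken from the failing run:
while true; do
    # /dev/tcp is a bash pseudo-device; during the eth0 -> br100 cutover
    # the connect fails instantly with errno 101 instead of timing out
    if ! timeout 2 bash -c 'exec 3<>/dev/tcp/10.52.0.1/8776 && exec 3>&-' 2>/dev/null; then
        echo "$(date +"%H:%M:%S,%3N") connect failed"
    fi
    sleep 0.1
done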
This issue was diagnosed by running Wireshark and observing unexpected ARP packets at the time of failure.
Running a simple script during the test run to continuously dump the ARP table revealed that the interface changed from eth0 to br100 at the time of failure.
This was further traced back to nova networking via time correlation.
#!/bin/bash
# Log local and UTC timestamps, the ARP table, and a socket summary in a
# tight loop so interface changes can be time-correlated with test failures.
while true; do
    echo "$(date +"%H:%M:%S,%3N") $(date -u +"%H:%M:%S,%3N")"
    arp -v
    ss -s
    sleep 0.1
done
Tue Mar 3 04:52:55 UTC 2015 Tue Mar 3 04:52:55 UTC 2015
Address        HWtype  HWaddress          Flags Mask  Iface
10.52.23.33    ether   9c:b6:54:86:f4:50  C           eth0
10.52.0.1      ether   3c:4a:92:26:6d:00  C           eth0
Entries: 2 Skipped: 0 Found: 2
Total: 394 (kernel 0)
TCP: 959 (estab 201, closed 728, orphaned 0, synrecv 0, timewait 727/0), ports 0

Transport Total  IP   IPv6
*         0      -    -
RAW       0      0    0
UDP       15     13   2
TCP       231    169  62
INET      246    182  64
FRAG      0      0    0

Tue Mar 3 04:53:00 UTC 2015 Tue Mar 3 04:53:00 UTC 2015
Address        HWtype  HWaddress          Flags Mask  Iface
10.52.23.33    ether   9c:b6:54:86:f4:50  C           br100
Entries: 1 Skipped: 0 Found: 1
Total: 394 (kernel 0)
TCP: 959 (estab 201, closed 728, orphaned 0, synrecv 0, timewait 727/0), ports 0

Transport Total  IP   IPv6
*         0      -    -
RAW       0      0    0
UDP       15     13   2
TCP       231    169  62
INET      246    182  64
FRAG      0      0    0
Workaround: run the tests sequentially. This avoids the possibility of one test thread triggering nova network changes while another test thread tries to open a socket.
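As a sketch, assuming a testr-based tempest checkout (CI wrappers often pass --parallel by default), a sequential run can be forced like this; the test path is illustrative:
# testr runs serially unless --parallel is given; if the CI wrapper always
# passes --parallel, pin the worker count to 1 instead:
testr run tempest.api.volume
testr run --parallel --concurrency=1 tempest.api.volume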
It is not clear why br100 only comes up later in the test run; when running devstack directly, it happens right away.
FYI, here is an example of where the br100 creation is triggered:
2015-03-05 00:26:40.653 DEBUG nova.network.manager [req-d77d1000-863c-40a8-a5f2-c8f6dc12526b VolumesV1ActionsTest-1345011238 VolumesV1ActionsTest-1034163737] [instance: e853e451-9944-40b6-bfe1-2dfdb938889a] networks retrieved for instance: |[{'bridge': u'br100', 'should_create_bridge': True, 'bridge_interface': u'eth0', 'tenant_id': None, 'label': u'private', 'id': '781d4bce-66cd-45e0-aa40-8f6006052abb'}]| allocate_for_instance /opt/stack/new/nova/nova/network/manager.py:501
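For context, a rough sketch of the interface change behind that log line: nova-network's linux bridge driver (ensure_bridge() in nova/network/linux_net.py) creates br100, enslaves eth0, and migrates the interface's addresses and routes onto the bridge, which is the eth0 -> br100 flip seen in the ARP snapshots above. The shell equivalent is roughly the following; the host address is hypothetical:
brctl addbr br100                     # create the bridge
brctl addif br100 eth0                # enslave the physical interface
ip addr del 10.52.23.40/24 dev eth0   # hypothetical host address...
ip addr add 10.52.23.40/24 dev br100  # ...re-homed onto the bridge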