Intermittent 3rd party ci failures: 400 Bad Request [Errno 101] ENETUNREACH 101

Bug #1427833 reported by Ramy Asselin
This bug affects 2 people
Affects: tempest
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Our 3rd party CI tests for Cinder have been failing intermittently.

For example: http://15.126.198.151/59/134759/13/check/3par-fc-driver-master-client-pip-eos10-dsvm/c3e9b0b/

Analysis:
VolumesV1ActionsTest creates nova instances, which triggers nova-network setup: eth0 is taken down and its IPs are moved to br100.
During that window, no new sockets can be created, and httplib2 returns socket error: [Errno 101] ENETUNREACH.

This in turn causes the external HTTP REST clients to report an HTTP 400 Bad Request, which propagates to cinder as a test failure.
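
As a minimal sketch (not taken from the CI run) of the failure mode described above: while the address is being moved from eth0 to br100 there is briefly no route out, so a plain TCP connect fails immediately with errno 101 instead of timing out. The host and port below are illustrative only.

import errno
import socket

def probe(host, port, timeout=2.0):
    """Try to open a TCP connection; return None on success or the connect errno on failure."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return None
    except socket.error as exc:  # socket.error is an alias of OSError on Python 3
        return exc.errno

# Illustrative endpoint (Cinder API port); any address routed via eth0 behaves the same way.
if probe("10.52.23.33", 8776) == errno.ENETUNREACH:
    print("[Errno 101] ENETUNREACH: no route while nova-network moves eth0 to br100")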

The issue was diagnosed by running wireshark and observing unexpected ARP packets at the time of failure.
Running a simple script during the test run to continuously dump the ARP table revealed the interface changing from eth0 to br100 at the moment of failure.
This was then traced back to nova networking by time correlation.

#!/bin/bash
# Poll the ARP table and socket statistics every 100 ms, with local and UTC
# timestamps, so interface changes can be time-correlated with test failures.

while true; do
    echo "$(date +"%H:%M:%S,%3N") $(date -u +"%H:%M:%S,%3N")"
    arp -v
    ss -s
    sleep 0.1
done

Tue Mar 3 04:52:55 UTC 2015 Tue Mar 3 04:52:55 UTC 2015
Address HWtype HWaddress Flags Mask Iface
10.52.23.33 ether 9c:b6:54:86:f4:50 C eth0
10.52.0.1 ether 3c:4a:92:26:6d:00 C eth0
Entries: 2 Skipped: 0 Found: 2
Total: 394 (kernel 0)
TCP: 959 (estab 201, closed 728, orphaned 0, synrecv 0, timewait 727/0), ports 0

Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 15 13 2
TCP 231 169 62
INET 246 182 64
FRAG 0 0 0

Tue Mar 3 04:53:00 UTC 2015 Tue Mar 3 04:53:00 UTC 2015
Address HWtype HWaddress Flags Mask Iface
10.52.23.33 ether 9c:b6:54:86:f4:50 C br100
Entries: 1 Skipped: 0 Found: 1
Total: 394 (kernel 0)
TCP: 959 (estab 201, closed 728, orphaned 0, synrecv 0, timewait 727/0), ports 0

Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 15 13 2
TCP 231 169 62
INET 246 182 64
FRAG 0 0 0

Workaround: run the tests sequentially. This removes the possibility of one test thread triggering nova network changes while another test thread is trying to open a socket.

Revision history for this message
Ramy Asselin (ramy-asselin) wrote :

Not sure why br100 only comes up later in the test run; when running devstack, it happens right away.

FYI, here's an example of where br100 gets created:

2015-03-05 00:26:40.653 DEBUG nova.network.manager [req-d77d1000-863c-40a8-a5f2-c8f6dc12526b VolumesV1ActionsTest-1345011238 VolumesV1ActionsTest-1034163737] [instance: e853e451-9944-40b6-bfe1-2dfdb938889a] networks retrieved for instance: |[{'bridge': u'br100', 'should_create_bridge': True, 'bridge_interface': u'eth0', 'tenant_id': None, 'label': u'private', 'id': '781d4bce-66cd-45e0-aa40-8f6006052abb'}]| allocate_for_instance /opt/stack/new/nova/nova/network/manager.py:501

Revision history for this message
Masayuki Igawa (igawa) wrote :

Is there a log link for the "During that time, no new sockets can be created. Httplib2 returns socket error: [Errno 101] ENETUNREACH 101" error message?
I looked for it in http://15.126.198.151/59/134759/13/check/3par-fc-driver-master-client-pip-eos10-dsvm/c3e9b0b/logs/ , but I couldn't find it.

Changed in tempest:
status: New → Incomplete
Revision history for this message
Ramy Asselin (ramy-asselin) wrote :

I don't have a link for that specifically, since it was done in debug mode.
The "official" logs just show a 400 Bad Request error with no body.
Here's the fork of httplib2 I was using to debug, which surfaces the actual socket error: https://github.com/rasselin/httplib2/tree/addSocketMsg

In particular, this line: https://github.com/rasselin/httplib2/blob/addSocketMsg/python2/httplib2/__init__.py#L919 (a rough sketch of the idea is shown after this comment).

I added the debug log file.
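
For reference, here is a rough sketch of the idea behind that fork (an assumption about its behavior, not the actual patch): catch the socket error around the httplib2 request and log the errno before re-raising, instead of letting the caller see only a bodiless 400 Bad Request. The wrapper name is made up for illustration.

import errno
import socket
import httplib2

def request_with_socket_diagnostics(uri, method="GET", **kwargs):
    """Call httplib2.Http.request() and log the underlying socket errno on failure."""
    http = httplib2.Http(timeout=10)
    try:
        return http.request(uri, method, **kwargs)
    except socket.error as exc:
        if exc.errno == errno.ENETUNREACH:
            print("socket error [Errno 101] ENETUNREACH while connecting to %s" % uri)
        raise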

Revision history for this message
Ramy Asselin (ramy-asselin) wrote :

For my reference to find originals: 29030/6 stack@csimlog1:/srv/static/logs/30/29030/6/check/3par-iscsi-driver-icehouse-client-pip-megatron-dsvm/e3f38b3/logs$

Added the nova-network log file that goes with the c-vol log attached in the previous comment.

And a snippet of my stats.txt showing the time correlation: the ARP table entry moves from eth0 to br100 right when the socket connection fails:

Tue Mar 3 04:51:47 UTC 2015 Tue Mar 3 04:51:47 UTC 2015
Address HWtype HWaddress Flags Mask Iface
10.52.22.33 ether 9c:b6:54:86:df:f8 C eth0
10.52.0.1 ether 3c:4a:92:26:6d:00 C eth0
Entries: 2 Skipped: 0 Found: 2
Total: 394 (kernel 0)
TCP: 1036 (estab 206, closed 801, orphaned 0, synrecv 0, timewait 800/0), ports 0

Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 12 10 2
TCP 235 172 63
INET 247 182 65
FRAG 0 0 0

Tue Mar 3 04:51:47 UTC 2015 Tue Mar 3 04:51:47 UTC 2015
Address HWtype HWaddress Flags Mask Iface
10.52.22.33 ether 9c:b6:54:86:df:f8 C eth0
10.52.0.1 ether 3c:4a:92:26:6d:00 C eth0
Entries: 2 Skipped: 0 Found: 2
Total: 393 (kernel 0)
TCP: 952 (estab 197, closed 724, orphaned 0, synrecv 0, timewait 723/0), ports 0

Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 15 13 2
TCP 228 165 63
INET 243 178 65
FRAG 0 0 0

Tue Mar 3 04:51:52 UTC 2015 Tue Mar 3 04:51:52 UTC 2015
Address HWtype HWaddress Flags Mask Iface
10.52.22.33 ether 9c:b6:54:86:df:f8 C br100
Entries: 1 Skipped: 0 Found: 1
Total: 392 (kernel 0)
TCP: 951 (estab 197, closed 724, orphaned 0, synrecv 0, timewait 723/0), ports 0

Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 15 13 2
TCP 227 164 63
INET 242 177 65
FRAG 0 0 0

Tue Mar 3 04:51:52 UTC 2015 Tue Mar 3 04:51:52 UTC 2015
Address HWtype HWaddress Flags Mask Iface
10.52.22.33 ether 9c:b6:54:86:df:f8 C br100
Entries: 1 Skipped: 0 Found: 1
Total: 392 (kernel 0)
TCP: 951 (estab 197, closed 724, orphaned 0, synrecv 0, timewait 723/0), ports 0

Transport Total IP IPv6
* 0 - -
RAW 0 0 0
UDP 15 13 2
TCP 227 164 63
INET 242 177 65
FRAG 0 0 0

Revision history for this message
Ramy Asselin (ramy-asselin) wrote :

I added some debugging during a devstack-gate run. Perhaps the issue is configuration?

https://review.openstack.org/#/c/162713/
http://logs.openstack.org/13/162713/1/check/check-tempest-dsvm-full/b291f71/

br100 is created here: http://logs.openstack.org/13/162713/1/check/check-tempest-dsvm-full/b291f71/logs/screen-n-net.txt.gz#_2015-03-09_19_21_12_122

Snippet of http://logs.openstack.org/13/162713/1/check/check-tempest-dsvm-full/b291f71/logs/stats.txt.gz:

19:21:12,974 19:21:12,976
Address HWtype HWaddress Flags Mask Iface
104.239.140.3 ether e4:c7:22:1c:da:c1 C eth0
104.239.140.1 ether 00:00:0c:9f:f0:01 C eth0
Entries: 2 Skipped: 0 Found: 2
19:21:12,981 19:21:12,983
Entries: 0 Skipped: 0 Found: 0

...

< during this time, no new socket connections can be made >

...

19:21:13,338 19:21:13,342
Entries: 0 Skipped: 0 Found: 0
19:21:13,350 19:21:13,353
Address HWtype HWaddress Flags Mask Iface
104.239.140.1 ether 00:00:0c:9f:f0:01 C br100
Entries: 1 Skipped: 0 Found: 1

So it seems br100 is created early in the official devstack-gate jobs. In the cinder 3rd party CI we only run the tempest.api.volume.* test cases, so br100 isn't created until much later in the run.

Revision history for this message
Ramy Asselin (ramy-asselin) wrote :

OK, it seems this is the 'warning' we're running into:

https://github.com/openstack-dev/devstack/blob/master/lib/nova#L134

# **DHCP Warning**: If your flat interface device uses DHCP, there will be a
# hiccup while the network is moved from the flat interface to the flat network
# bridge. This will happen when you launch your first instance. Upon launch
# you will lose all connectivity to the node, and the VM launch will probably
# fail.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for tempest because there has been no activity for 60 days.]

Changed in tempest:
status: Incomplete → Expired