Intermittent 3rd party CI failures: 400 Bad Request [Errno 101] ENETUNREACH
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tempest | Expired | Undecided | Unassigned |
Bug Description
Our 3rd party CI tests for Cinder have been failing intermittently.
For example: http://
Analysis:
The trigger is the VolumesV1ActionsTest run: while it executes, nova network sets up the br100 bridge, moving the host's networking off eth0 (see the log excerpt at the end of this report). During that time, no new sockets can be created; httplib2 returns socket error [Errno 101] ENETUNREACH.
This in turn causes the external HTTP REST clients to fail with an HTTP 400 Bad Request, which propagates to Cinder as a failure.
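The failure mode can be checked from the shell: during the cutover, a TCP connect to the API endpoint fails immediately with "Network is unreachable" rather than timing out. Here is a minimal probe loop in the same spirit as the monitoring script below; the endpoint address and port (cinder's default 8776) are assumptions for illustration, not taken from the failing run:
while true; do
    # /dev/tcp is a bash pseudo-device; during the eth0 -> br100 cutover
    # the connect fails instantly with errno 101 instead of timing out
    if ! timeout 2 bash -c 'exec 3<>/dev/tcp/10.52.0.1/8776 && exec 3>&-' 2>/dev/null; then
        echo "$(date +"%H:%M:%S,%3N") connect failed"
    fi
    sleep 0.1
done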
This issue was diagnosed by running Wireshark and observing unexpected ARP packets at the time of failure.
Running a simple script during the test run to continuously dump the ARP table revealed that the interface changed from eth0 to br100 at the time of failure.
This was further traced back to nova networking via time correlation.
#!/bin/bash
# Log local and UTC timestamps, the ARP table, and a socket summary in a
# tight loop so interface changes can be time-correlated with test failures.
while true; do
    echo "$(date +"%H:%M:%S,%3N") $(date -u +"%H:%M:%S,%3N")"
    arp -v
    ss -s
    sleep 0.1
done
Tue Mar 3 04:52:55 UTC 2015 Tue Mar 3 04:52:55 UTC 2015
Address        HWtype  HWaddress          Flags Mask  Iface
10.52.23.33    ether   9c:b6:54:86:f4:50  C           eth0
10.52.0.1      ether   3c:4a:92:26:6d:00  C           eth0
Entries: 2 Skipped: 0 Found: 2
Total: 394 (kernel 0)
TCP: 959 (estab 201, closed 728, orphaned 0, synrecv 0, timewait 727/0), ports 0

Transport Total  IP   IPv6
*         0      -    -
RAW       0      0    0
UDP       15     13   2
TCP       231    169  62
INET      246    182  64
FRAG      0      0    0

Tue Mar 3 04:53:00 UTC 2015 Tue Mar 3 04:53:00 UTC 2015
Address        HWtype  HWaddress          Flags Mask  Iface
10.52.23.33    ether   9c:b6:54:86:f4:50  C           br100
Entries: 1 Skipped: 0 Found: 1
Total: 394 (kernel 0)
TCP: 959 (estab 201, closed 728, orphaned 0, synrecv 0, timewait 727/0), ports 0

Transport Total  IP   IPv6
*         0      -    -
RAW       0      0    0
UDP       15     13   2
TCP       231    169  62
INET      246    182  64
FRAG      0      0    0
Workaround: run the tests sequentially. This avoids the possibility of one test thread triggering nova network changes while another test thread tries to open a socket.
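As a sketch, assuming a testr-based tempest checkout (CI wrappers often pass --parallel by default), a sequential run can be forced like this; the test path is illustrative:
# testr runs serially unless --parallel is given; if the CI wrapper always
# passes --parallel, pin the worker count to 1 instead:
testr run tempest.api.volume
testr run --parallel --concurrency=1 tempest.api.volume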
It is not clear why br100 only comes up later in the test run; when running devstack directly, it happens right away.
FYI, here is an example of where the br100 creation is triggered:
2015-03-05 00:26:40.653 DEBUG nova.network.manager [req-d77d1000-863c-40a8-a5f2-c8f6dc12526b VolumesV1ActionsTest-1345011238 VolumesV1ActionsTest-1034163737] [instance: e853e451-9944-40b6-bfe1-2dfdb938889a] networks retrieved for instance: |[{'bridge': u'br100', 'should_create_bridge': True, 'bridge_interface': u'eth0', 'tenant_id': None, 'label': u'private', 'id': '781d4bce-66cd-45e0-aa40-8f6006052abb'}]| allocate_for_instance /opt/stack/new/nova/nova/network/manager.py:501
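For context, a rough sketch of the interface change behind that log line: nova-network's linux bridge driver (ensure_bridge() in nova/network/linux_net.py) creates br100, enslaves eth0, and migrates the interface's addresses and routes onto the bridge, which is the eth0 -> br100 flip seen in the ARP snapshots above. The shell equivalent is roughly the following; the host address is hypothetical:
brctl addbr br100                     # create the bridge
brctl addif br100 eth0                # enslave the physical interface
ip addr del 10.52.23.40/24 dev eth0   # hypothetical host address...
ip addr add 10.52.23.40/24 dev br100  # ...re-homed onto the bridge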