MiniDNS TCP connections stop being accepted

Bug #1549980 reported by Rahman Syed on 2016-02-25
This bug affects 2 people
Affects     Importance   Assigned to
Designate   Critical     Rahman Syed
Kilo        Critical     Kiall Mac Innes
Liberty     Critical     Kiall Mac Innes

Bug Description

During normal operation, MiniDNS stops serving requests over TCP, although it continues to respond over UDP. The only way to recover is to restart the service.

Rahman Syed (rsyed) wrote :

This was later found to be reproducible with a simple curl request against the mdns port.

 Traceback (most recent call last):
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
     timer()
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
     cb(*args, **kw)
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 154, in _timeout
     current.throw(exc)
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
     result = function(*args, **kwargs)
   File "/opt/designate/designate/local/lib/python2.7/site-packages/designate/service.py", line 269, in _dns_handle_tcp
     errname = errno.errorcode[e.args[0]]
 KeyError: 'timed out'

An explanation of the root cause can be found in the commit message for the fix.
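
The failing lookup in the traceback can be reproduced in isolation. This is a minimal sketch that assumes only the Python standard library (no eventlet). `errname_for` is a hypothetical helper that mirrors the lookup shown at line 269 of service.py; it is not Designate's code. `errno.errorcode` is keyed by integer errno values, while a timeout exception constructed with only a message carries the string 'timed out' as its first argument:

```python
import errno
import socket


def errname_for(exc):
    # Hypothetical helper: same lookup as the line shown in the traceback.
    return errno.errorcode[exc.args[0]]


# A timeout built with only a message: args[0] is a string, not an errno.
timeout_exc = socket.timeout('timed out')
try:
    errname_for(timeout_exc)
    missing_key = None
except KeyError as key_err:
    # The dict has no entry for the string, so the lookup itself fails.
    missing_key = key_err.args[0]

# An ordinary socket error carries an integer errno, so the lookup works.
refused_exc = socket.error(errno.ECONNREFUSED, 'Connection refused')
refused_name = errname_for(refused_exc)

print(missing_key, refused_name)
```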

Changed in designate:
assignee: nobody → Rahman Syed (rahman-syed-w)
status: New → In Progress
Kiall Mac Innes (kiall) wrote :
Changed in designate:
importance: Undecided → Critical
milestone: none → mitaka-3

Reviewed: https://review.openstack.org/284912
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=d5d0706705c64dba847cea5a30b4a6be39ecd63f
Submitter: Jenkins
Branch: master

commit d5d0706705c64dba847cea5a30b4a6be39ecd63f
Author: Rahman Syed <email address hidden>
Date: Thu Feb 25 14:12:53 2016 -0600

    Improve error handling for TCP connections

    In the abstract DNSService's _dns_handle_tcp method, error handling
    is broken in a way that stops the main loop for handling TCP
    connections.

    Because socket.timeout is a subclass of socket.error, the error
    handling block for socket.timeout is never reached.

    Because of this, error handling of a TCP timeout is sent to the
    socket.error block. Due to the way eventlet hijacks these errors,
    the errorcode is not available and a KeyError is raised. This
    KeyError interferes with the main loop because it is not caught.

    Further improvement may include ensuring that these main loops
    can never die due to unexpected exceptions.

    Many thanks to Erik Andersson for pointing out the issue, which
    was seemingly innocuous but ended up being the cause of our
    problems.

    Closes-bug: 1549980
    Change-Id: I47e1260a0818cc42cbd56e4d296e083f8fcbbae5
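
The ordering problem described in the commit message can be shown with a short standalone sketch. The two functions below are hypothetical illustrations, not the actual Designate handler: Python tries `except` clauses from top to bottom, so a handler for the parent class `socket.error` placed first also catches `socket.timeout`, and the later, more specific clause never runs.

```python
import errno
import socket


def classify_broken(exc):
    # Parent class listed first: it also matches socket.timeout.
    try:
        raise exc
    except socket.error:
        return 'socket.error'
    except socket.timeout:
        # Never reached, because the clause above already matched.
        return 'socket.timeout'


def classify_fixed(exc):
    # Most specific class listed first, parent class afterwards.
    try:
        raise exc
    except socket.timeout:
        return 'socket.timeout'
    except socket.error:
        return 'socket.error'


timeout_exc = socket.timeout('timed out')
reset_exc = socket.error(errno.ECONNRESET, 'Connection reset by peer')

print(classify_broken(timeout_exc))
print(classify_fixed(timeout_exc))
print(classify_fixed(reset_exc))
```

Listing the subclass before its parent is the general rule for any exception hierarchy; it is assumed here, not confirmed by the report, that the fix takes this shape.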

Changed in designate:
status: In Progress → Fix Released
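
The commit message also suggests making the main loops resilient to unexpected exceptions. One possible shape for that is sketched below, under stated assumptions: `serve_loop`, `accept`, `handle`, and `max_iterations` are hypothetical names, the loop is bounded so the example terminates, and no eventlet or Designate code is involved. The idea is that a catch-all handler logs the error and lets the loop continue instead of letting an uncaught exception end it.

```python
import logging
import socket

LOG = logging.getLogger(__name__)


def serve_loop(accept, handle, max_iterations):
    """Call accept() and handle(client) repeatedly; return the handled count.

    accept and handle are hypothetical callables supplied by the caller.
    """
    handled = 0
    for _ in range(max_iterations):
        try:
            client = accept()
            handle(client)
            handled += 1
        except socket.timeout:
            # Expected condition: nothing arrived in time, keep looping.
            continue
        except Exception:
            # Unexpected error: record it, but do not let the loop die.
            LOG.exception('Unhandled exception in TCP loop; continuing')
    return handled
```

With a handler like this, a stray `KeyError` such as the one in the traceback would be logged and the loop would go on to accept the next connection.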

Reviewed: https://review.openstack.org/286555
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=a42f0bab4b64978f00ed05d1b6700751b51c4607
Submitter: Jenkins
Branch: stable/liberty

commit a42f0bab4b64978f00ed05d1b6700751b51c4607
Author: Rahman Syed <email address hidden>
Date: Thu Feb 25 14:12:53 2016 -0600

    Improve error handling for TCP connections

Reviewed: https://review.openstack.org/286557
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=8de1f180c215be651095cc6ef7dac0c2a13d66eb
Submitter: Jenkins
Branch: stable/kilo

commit 8de1f180c215be651095cc6ef7dac0c2a13d66eb
Author: Rahman Syed <email address hidden>
Date: Thu Feb 25 14:12:53 2016 -0600

    Improve error handling for TCP connections

This issue was fixed in the openstack/designate 1.0.2 release.
