MiniDNS TCP connections stop being accepted

Bug #1549980 reported by Rahman Syed
This bug affects 2 people
Affects    Status         Importance  Assigned to      Milestone
Designate  Fix Released   Critical    Rahman Syed
Kilo       Fix Committed  Critical    Kiall Mac Innes
Liberty    Fix Committed  Critical    Kiall Mac Innes

Bug Description

During normal operations, MiniDNS stops serving requests over TCP, although the service continues to respond over UDP. The only way to recover is to restart the service.

Rahman Syed (rsyed) wrote :

This was later found to be reproducible with a simple curl request against the mdns port.

 Traceback (most recent call last):
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
     timer()
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
     cb(*args, **kw)
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 154, in _timeout
     current.throw(exc)
   File "/opt/designate/designate/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
     result = function(*args, **kwargs)
   File "/opt/designate/designate/local/lib/python2.7/site-packages/designate/service.py", line 269, in _dns_handle_tcp
     errname = errno.errorcode[e.args[0]]
 KeyError: 'timed out'

An explanation of the root cause can be found in the commit message for the fix.
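
The failing lookup on the last line of the traceback can be reproduced without Designate or eventlet. The following is a minimal sketch, assuming only the Python standard library; it illustrates why the lookup fails and is not the code from service.py. The variable names are hypothetical.

```python
import errno
import socket

# A timeout raised without an OS error number carries only a message string.
exc = socket.timeout("timed out")

# socket.timeout is a subclass of socket.error, so an
# `except socket.error` clause also catches timeouts.
caught_as_socket_error = isinstance(exc, socket.error)

# errno.errorcode maps integer error numbers to names such as 'ECONNRESET'.
# Indexing it with the message string raises KeyError, as in the traceback.
try:
    errname = errno.errorcode[exc.args[0]]
except KeyError as key_error:
    errname = None
    failed_key = key_error.args[0]
```

Here `failed_key` ends up holding the same string that appears in the traceback's final line.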

Changed in designate:
assignee: nobody → Rahman Syed (rahman-syed-w)
status: New → In Progress
Kiall Mac Innes (kiall) wrote :
Changed in designate:
importance: Undecided → Critical
milestone: none → mitaka-3
OpenStack Infra (hudson-openstack) wrote : Fix merged to designate (master)

Reviewed: https://review.openstack.org/284912
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=d5d0706705c64dba847cea5a30b4a6be39ecd63f
Submitter: Jenkins
Branch: master

commit d5d0706705c64dba847cea5a30b4a6be39ecd63f
Author: Rahman Syed <email address hidden>
Date: Thu Feb 25 14:12:53 2016 -0600

    Improve error handling for TCP connections

    In the abstract DNSService's _dns_handle_tcp method, error handling
    is broken in a way that stops the main loop for handling TCP
    connections.

    Because socket.timeout is a subclass of socket.error, the error
    handling block for socket.timeout is never reached.

    Because of this, error handling of a TCP timeout is sent to the
    socket.error block. Due to the way eventlet hijacks these errors,
    the errorcode is not available and a KeyError is raised. This
    KeyError interferes with the main loop because it is not caught.

    Further improvement may include ensuring that these main loops
    can never die due to unexpected exceptions.

    Many thanks to Erik Andersson for pointing out the issue, which
    was seemingly innocuous but ended up being the cause of our
    problems.

    Closes-bug: 1549980
    Change-Id: I47e1260a0818cc42cbd56e4d296e083f8fcbbae5
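
The ordering problem described in the commit message, and the suggested catch-all guard, can be sketched as follows. This is a hedged illustration and not the merged patch: `handle_connections`, `accept`, `handle`, and the outcome labels are hypothetical names, and the real method runs under eventlet rather than in a plain loop.

```python
import errno
import socket


def handle_connections(accept, handle, iterations):
    """Hypothetical accept loop that survives per-connection failures."""
    outcomes = []
    for _ in range(iterations):
        try:
            client = accept()
            handle(client)
            outcomes.append("ok")
        except socket.timeout:
            # Listed first: socket.timeout is a subclass of socket.error,
            # so the broader clause below would otherwise shadow this one.
            outcomes.append("timeout")
        except socket.error as e:
            # .get() avoids a KeyError when args[0] is not an errno integer.
            code = e.args[0] if e.args else None
            outcomes.append(errno.errorcode.get(code, "unknown"))
        except Exception:
            # Catch-all so one unexpected error cannot stop the loop.
            outcomes.append("unexpected")
    return outcomes
```

Placing the narrower `except socket.timeout` clause first is what makes it reachable; the final `except Exception` corresponds to the "further improvement" the commit message mentions.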

Changed in designate:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to designate (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/286555

OpenStack Infra (hudson-openstack) wrote : Fix proposed to designate (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/286557

OpenStack Infra (hudson-openstack) wrote : Fix merged to designate (stable/liberty)

Reviewed: https://review.openstack.org/286555
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=a42f0bab4b64978f00ed05d1b6700751b51c4607
Submitter: Jenkins
Branch: stable/liberty

commit a42f0bab4b64978f00ed05d1b6700751b51c4607
Author: Rahman Syed <email address hidden>
Date: Thu Feb 25 14:12:53 2016 -0600

OpenStack Infra (hudson-openstack) wrote : Fix merged to designate (stable/kilo)

Reviewed: https://review.openstack.org/286557
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=8de1f180c215be651095cc6ef7dac0c2a13d66eb
Submitter: Jenkins
Branch: stable/kilo

commit 8de1f180c215be651095cc6ef7dac0c2a13d66eb
Author: Rahman Syed <email address hidden>
Date: Thu Feb 25 14:12:53 2016 -0600

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/designate 1.0.2

This issue was fixed in the openstack/designate 1.0.2 release.
