dhcp agent error reading lease file

Bug #1788556 reported by Antonio Ojea on 2018-08-23
This bug affects 2 people
Affects: neutron
Importance: Medium
Assigned to: Dirk Mueller

Bug Description

With a large number of VMs, at some point, the dhcp agent throws this index error trying to read the lease file:

2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent [req-cf1c7a8e-b718-4c46-b7be-999747c7e526 afe623c9e78e47febd76617008b9138e c4f22248feb9430093858a0404b779d5 - - -] Unable to reload_allocations dhcp for ef71f918-dc0d-4a6e-8d37-0f5f0720e295.: IndexError: list index out of range
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent File "/opt/stack/venv/neutron-20180718T154642Z/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent File "/opt/stack/venv/neutron-20180718T154642Z/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 512, in reload_allocations
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent self._release_unused_leases()
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent File "/opt/stack/venv/neutron-20180718T154642Z/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 825, in _release_unused_leases
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent v6_leases = self._read_v6_leases_file_leases(leases_filename)
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent File "/opt/stack/venv/neutron-20180718T154642Z/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 810, in _read_v6_leases_file_leases
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent (iaid, ip, client_id) = parts[1], parts[2], parts[4]
2018-08-15 16:17:40.771 40391 ERROR neutron.agent.dhcp.agent IndexError: list index out of range

When this happens, the agent calls sync_state to fully resync the agent state, which is a serious problem when dealing with a lot of ports in a scale environment.

Is it possible to avoid a full resync of all ports?
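
For illustration only (this is not the neutron code): a minimal sketch of a line parser that skips entries with too few fields instead of raising IndexError. The field positions 1, 2 and 4 are taken from the traceback above; the function name, the minimum of 5 fields, the handling of a leading "duid" line, and the sample data are assumptions.

```python
def parse_lease_lines(lines, min_fields=5):
    """Return (leases, skipped) for an iterable of lease-file lines.

    Hypothetical helper: lines with fewer than `min_fields`
    whitespace-separated fields are counted and skipped rather than indexed.
    """
    leases = {}
    skipped = 0
    for line in lines:
        parts = line.strip().split()
        # Assumption: blank lines and a "duid ..." header line carry no lease.
        if not parts or parts[0] == "duid":
            continue
        if len(parts) < min_fields:
            # A truncated entry: do not touch parts[4], just count it.
            skipped += 1
            continue
        iaid, ip, client_id = parts[1], parts[2], parts[4]
        leases[ip.strip("[]")] = {"iaid": iaid, "client_id": client_id}
    return leases, skipped


if __name__ == "__main__":
    sample = [
        "1534349860 755752236 [2001:db8::a] host-a 00:01:00:01:aa",
        "1534349860 755752237 [2001:db8::b",  # cut off mid-entry
    ]
    print(parse_lease_lines(sample))
```

A caller could log the skipped count rather than fail, which would avoid the exception that leads to the full resync. Entries that are truncated inside the last field would still pass the length check.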

Changed in neutron:
assignee: nobody → Antonio Ojea (itsuugo)
status: New → In Progress
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: Antonio Ojea (itsuugo) → Stephen Ma (stephen-ma)
Changed in neutron:
assignee: Stephen Ma (stephen-ma) → Brian Haley (brian-haley)
Hongbin Lu (hongbin.lu) on 2018-08-24
tags: added: l3-ipam-dhcp
tags: added: rocky-backport-potential
tags: added: queens-backport-potential
tags: added: pike-backport-potential
Antonio Ojea (itsuugo) wrote :

@brian-haley Although we are ignoring lines with an incorrect number of fields with the partial fix proposed, we are also likely to miss the entire trailing section of the file or to have truncated entries in the client_id field.

What are the implications of this for the dhcp agent?

Brian Haley (brian-haley) wrote :

Yes, some items could be missed, but I don't think that's fatal for the dhcp-agent. The only change I can think of to get past this is to trigger the retry code when we encounter an invalid lease. For example:

_read_leases_file_leases() sees a line with < 5 fields
  inserts a "fake" entry to signify it

_release_unused_leases() checks for this "fake" lease
  if found, triggers a retry loop, which reads the file a second time

Looking at the code it's not that simple, but it could be done.
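
A rough sketch of that idea, under stated assumptions: the sentinel key, both function names, the retry count, the delay, and the choice of fields are hypothetical and are not taken from neutron.

```python
import time

# Hypothetical marker for "at least one line was incomplete".
INCOMPLETE = "__incomplete__"


def read_leases(path):
    """Read a lease file; flag short lines with the sentinel entry."""
    leases = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if len(parts) < 5:
                leases[INCOMPLETE] = True  # the "fake" entry
                continue
            leases[parts[2]] = parts[4]  # ip -> client id
    return leases


def read_leases_with_retry(path, attempts=3, delay=0.1, reader=read_leases):
    """Re-read the file while the sentinel is present, up to `attempts` times."""
    leases = {}
    for attempt in range(attempts):
        leases = reader(path)
        if INCOMPLETE not in leases:
            return leases
        if attempt < attempts - 1:
            time.sleep(delay)
    # Still incomplete after the last attempt: return what was parsed.
    leases.pop(INCOMPLETE, None)
    return leases
```

The retry is bounded, so a file that is rewritten constantly cannot keep the reader looping; it falls back to the partial result instead.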

Antonio Ojea (itsuugo) wrote :

The problem is that the bigger the file, the higher the chance of hitting this bug, and I'm afraid the situation could get worse if we start retrying and see more "incomplete entries".
Off the top of my head, the only solution is to guarantee atomicity of the file operations: dnsmasq writing and the dhcp agent reading the same file is the main problem.
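
To illustrate what atomicity could mean here, a generic Python 3 sketch of the write-to-a-temporary-file-then-rename pattern. It makes no claim about how dnsmasq writes its lease file; the function name and the sample content are hypothetical.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Replace `path` with `lines` so a reader sees the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("".join(line + "\n" for line in lines))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "leases")
    write_atomically(target, ["1534349860 fa:16:3e:00:00:01 10.0.0.2 host-a *"])
    with open(target) as f:
        print(f.read(), end="")
```

With a writer like this, a concurrent reader never observes a half-written line, which is the property the discussion above is after.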

@<email address hidden> it would be useful if you can give us more details

Reviewed: https://review.openstack.org/595235
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8a3ff8a19ec39630d24b71cec86740b6b9f16bbe
Submitter: Zuul
Branch: master

commit 8a3ff8a19ec39630d24b71cec86740b6b9f16bbe
Author: aojeagarcia <email address hidden>
Date: Wed Aug 22 10:41:14 2018 +0200

    Parse dhcp leases file in a more robust way

    It turns out that in environments with a big number of VMs, sometimes
    the neutron dhcp agent fails to read the dhcp lease file because some
    lines with the ipv4/ipv6 entries don't have enough fields and causes the
    dhcp agent to fail.

    When this happens the agent calls sync_state to
    fully resync the agent state, that causes a serious performance problems
    in scale environments.

    We need to be more robust reading the file to handle these exceptions.

    Co-authored-by: stephen-ma
    Partial-Bug: #1788556

    Change-Id: Ia681a5e929df5bf8c97ae9445876c306c34061b5

Reviewed: https://review.openstack.org/604319
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=011d0fbf7c085f9f9f24bac67cf74b05ed91bdda
Submitter: Zuul
Branch: stable/queens

commit 011d0fbf7c085f9f9f24bac67cf74b05ed91bdda
Author: aojeagarcia <email address hidden>
Date: Wed Aug 22 10:41:14 2018 +0200

    Parse dhcp leases file in a more robust way

    It turns out that in environments with a big number of VMs, sometimes
    the neutron dhcp agent fails to read the dhcp lease file because some
    lines with the ipv4/ipv6 entries don't have enough fields and causes the
    dhcp agent to fail.

    When this happens the agent calls sync_state to
    fully resync the agent state, that causes a serious performance problems
    in scale environments.

    We need to be more robust reading the file to handle these exceptions.

    Co-authored-by: stephen-ma
    Partial-Bug: #1788556

    Change-Id: Ia681a5e929df5bf8c97ae9445876c306c34061b5
    (cherry picked from commit 8a3ff8a19ec39630d24b71cec86740b6b9f16bbe)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/604321
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=04b7f80f999c0fec456f5e905ce9407ab9ec5261
Submitter: Zuul
Branch: stable/pike

commit 04b7f80f999c0fec456f5e905ce9407ab9ec5261
Author: aojeagarcia <email address hidden>
Date: Wed Aug 22 10:41:14 2018 +0200

    Parse dhcp leases file in a more robust way

    It turns out that in environments with a big number of VMs, sometimes
    the neutron dhcp agent fails to read the dhcp lease file because some
    lines with the ipv4/ipv6 entries don't have enough fields and causes the
    dhcp agent to fail.

    When this happens the agent calls sync_state to
    fully resync the agent state, that causes a serious performance problems
    in scale environments.

    We need to be more robust reading the file to handle these exceptions.

    Co-authored-by: stephen-ma
    Partial-Bug: #1788556

    Change-Id: Ia681a5e929df5bf8c97ae9445876c306c34061b5
    (cherry picked from commit 8a3ff8a19ec39630d24b71cec86740b6b9f16bbe)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/604320
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ffcb22f6338ae62c56b2f6e0d531b61f33349167
Submitter: Zuul
Branch: stable/rocky

commit ffcb22f6338ae62c56b2f6e0d531b61f33349167
Author: aojeagarcia <email address hidden>
Date: Wed Aug 22 10:41:14 2018 +0200

    Parse dhcp leases file in a more robust way

    It turns out that in environments with a big number of VMs, sometimes
    the neutron dhcp agent fails to read the dhcp lease file because some
    lines with the ipv4/ipv6 entries don't have enough fields and causes the
    dhcp agent to fail.

    When this happens the agent calls sync_state to
    fully resync the agent state, that causes a serious performance problems
    in scale environments.

    We need to be more robust reading the file to handle these exceptions.

    Co-authored-by: stephen-ma
    Partial-Bug: #1788556

    Change-Id: Ia681a5e929df5bf8c97ae9445876c306c34061b5
    (cherry picked from commit 8a3ff8a19ec39630d24b71cec86740b6b9f16bbe)

tags: added: in-stable-rocky

Fix proposed to branch: master
Review: https://review.openstack.org/632568

Changed in neutron:
assignee: Brian Haley (brian-haley) → Dirk Mueller (dmllr)

Reviewed: https://review.openstack.org/632563
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb3dbf371ca127918355524120c080a9c4bafc1f
Submitter: Zuul
Branch: stable/ocata

commit bb3dbf371ca127918355524120c080a9c4bafc1f
Author: aojeagarcia <email address hidden>
Date: Wed Aug 22 10:41:14 2018 +0200

    Parse dhcp leases file in a more robust way

    It turns out that in environments with a big number of VMs, sometimes
    the neutron dhcp agent fails to read the dhcp lease file because some
    lines with the ipv4/ipv6 entries don't have enough fields and causes the
    dhcp agent to fail.

    When this happens the agent calls sync_state to
    fully resync the agent state, that causes a serious performance problems
    in scale environments.

    We need to be more robust reading the file to handle these exceptions.

    Conflicts:
        neutron/agent/linux/dhcp.py: due to
        If9aa76fcf121c0e61a7c08088006c5873faee56e missing
            (translation guideline changed in pike -
            _LW is still there in ocata)
        neutron/tests/unit/agent/linux/test_dhcp.py:
        due to not having Ic1864f7efbc94db1369ac7f3e2879fda86f95a11 in ocata

    Co-authored-by: stephen-ma
    Partial-Bug: #1788556

    Change-Id: Ia681a5e929df5bf8c97ae9445876c306c34061b5
    (cherry picked from commit 8a3ff8a19ec39630d24b71cec86740b6b9f16bbe)

tags: added: in-stable-ocata

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/632568
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
