neutron should forbid configuring agent_down_time values that are known to crash due to a CPython epoll limitation

Bug #2028724 reported by Lewis Denny
Affects       Status        Importance  Assigned to   Milestone
neutron       Fix Released  Low         Lewis Denny
oslo.service  New           Undecided   Unassigned

Bug Description

This bug is created to improve neutron so that it does not allow configuring agent_down_time to values that are known to misbehave because of a limitation of CPython's epoll interface, which does not support timeouts larger than (2^32 / 2 - 1) milliseconds for green-thread waits.

We can either truncate the value or raise an error on an invalid value (the former is probably preferable).

Also, we may want to consider patching oslo.service (?) to apply a similar truncation to values passed through the loopingcall module. If the library is patched to do the truncation, then the neutron-side enforcement won't be needed.
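
For illustration, a minimal sketch of what such truncation could look like; the helper name and constant are hypothetical, not part of the actual oslo.service API:

```python
# Hypothetical truncation helper; not actual oslo.service code.
# epoll takes its timeout as an int number of milliseconds, so the
# longest representable wait is INT_MAX ms; round down to whole
# seconds to stay safely below the limit.
EPOLL_MAX_TIMEOUT_SEC = (2**31 - 1) // 1000  # 2147483


def truncate_wait(seconds):
    """Clamp a wait interval to the longest timeout epoll can handle."""
    return min(seconds, EPOLL_MAX_TIMEOUT_SEC)
```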

To reproduce, set agent_down_time to a value (in seconds) larger than (2^32 / 2 - 1)/1000 and check the neutron server log for an error like:
```
05:28:58.327 39 ERROR oslo_service.threadgroup [req-39043291-6236-4d9b-a1e5-45b6cfc7eb2d - - - - -] Error waiting on thread.: OverflowError: timeout is too large
```
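
The underlying CPython behaviour can also be reproduced without neutron at all. A minimal standalone sketch (Linux-only, since it relies on epoll):

```python
import os
import select

# Register an fd that is already readable so poll() returns immediately
# instead of actually sleeping.
r, w = os.pipe()
os.write(w, b"x")

ep = select.epoll()
ep.register(r, select.EPOLLIN)

# epoll.poll() takes its timeout in seconds; CPython converts it to an
# int number of milliseconds, which must fit into a 32-bit C int.
ep.poll(2147483)      # 2147483000 ms, below INT_MAX: works

try:
    ep.poll(2147484)  # 2147484000 ms, above INT_MAX
except OverflowError as exc:
    print(exc)        # timeout is too large
```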

This bug is applicable to all current versions of Neutron and can be reproduced on a master devstack.

Lewis Denny (lewisdenny)
Changed in neutron:
assignee: nobody → Lewis Denny (lewisdenny)
Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi,

Thanks for the report! I'm not really sure we need to protect ourselves against an admin maintaining the configuration, but at the same time I don't see a problem with the validation either. By the way, let me link your patch here:

https://review.opendev.org/c/openstack/neutron/+/889373

Changed in neutron:
status: New → Triaged
status: Triaged → In Progress
importance: Undecided → Low
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

@Bence, an admin may not be aware that a particular config option gets passed down to epoll, which has this (undocumented) limit. But we are aware, so oslo.service (and neutron) could help admins by failing early, by making insane intervals raise an exception.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The admins in the scenario of this bug were setting `agent_down_time` to a really high value because they wanted to effectively disable the periodic checks / updates of the agent status in neutron. While it may be argued that they shouldn't have done that for other reasons (there is a good reason these checks should happen periodically and not be disabled), that is nevertheless what they did, and nothing stopped them. This bug is to make sure they are stopped at a limit that we know is broken.

Perhaps neutron may also want to tighten the option even further to make sure that the periodic task is executed more often, but that's a different discussion. This bug is about the theoretical limit defined by the common libc implementation.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

@Ihar: Thanks for the explanation!

tags: added: low-hanging-fruit
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/889373
Committed: https://opendev.org/openstack/neutron/commit/6fef1e65250dbda057206e1c2ee64f59b21d490f
Submitter: "Zuul (22348)"
Branch: master

commit 6fef1e65250dbda057206e1c2ee64f59b21d490f
Author: Lewis Denny <email address hidden>
Date: Mon Jul 31 16:38:22 2023 +1000

    Add max limit to agent_down_time

    The agent_down_time value ends up being passed to an eventlet green
    thread; under the hood, this uses a CPython interface that is limited
    to INT_MAX (2^32 / 2 - 1, as defined in C, where int is usually
    32 bits) milliseconds.

    I have set the max value to (2^32 / 2 - 1)/1000, as agent_down_time
    is configured in seconds; this ends up being 2147483.

    This patch is required as passing a larger number
    causes this error: OverflowError: timeout is too large

    If a user currently has a value larger than (2^32 / 2 - 1)/1000 set,
    Neutron Server will fail to start and will print out a very helpful
    error message.

    Closes-Bug: #2028724
    Change-Id: Ib5b943344cddbd468c00768461ba1ee00a2b4c58
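
The cap can be expressed directly in the option definition using oslo.config's built-in bounds checking. A simplified sketch of the approach (the actual patch is in the review linked above; the 75-second default is neutron's existing one):

```python
from oslo_config import cfg

# INT_MAX milliseconds expressed in whole seconds: (2**31 - 1) // 1000.
AGENT_DOWN_TIME_MAX = (2**31 - 1) // 1000  # 2147483

OPTS = [
    cfg.IntOpt('agent_down_time',
               default=75,  # seconds
               max=AGENT_DOWN_TIME_MAX,
               help='Seconds before an agent is considered down. Larger '
                    'values would overflow the epoll timeout (INT_MAX ms).'),
]
```

With `max` set, oslo.config raises an error for out-of-range values when the configuration is parsed, which matches the "Neutron Server will fail to start" behaviour described in the commit message.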

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b3

This issue was fixed in the openstack/neutron 23.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/905347

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/905329

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/905330

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/905331

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/905332

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/905347
Committed: https://opendev.org/openstack/neutron/commit/94cf7a4c281413b18ff90cf16c45d2f0df436b44
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 94cf7a4c281413b18ff90cf16c45d2f0df436b44
Author: Lewis Denny <email address hidden>
Date: Mon Jul 31 16:38:22 2023 +1000

    Add max limit to agent_down_time

    The agent_down_time value ends up being passed to an eventlet green
    thread; under the hood, this uses a CPython interface that is limited
    to INT_MAX (2^32 / 2 - 1, as defined in C, where int is usually
    32 bits) milliseconds.

    I have set the max value to (2^32 / 2 - 1)/1000, as agent_down_time
    is configured in seconds; this ends up being 2147483.

    This patch is required as passing a larger number
    causes this error: OverflowError: timeout is too large

    If a user currently has a value larger than (2^32 / 2 - 1)/1000 set,
    Neutron Server will fail to start and will print out a very helpful
    error message.

    Conflicts:
          neutron/conf/agent/database/agents_db.py

    Closes-Bug: #2028724
    Change-Id: Ib5b943344cddbd468c00768461ba1ee00a2b4c58
    (cherry picked from commit 6fef1e65250dbda057206e1c2ee64f59b21d490f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/905329
Committed: https://opendev.org/openstack/neutron/commit/2c8076dd28c1e51e7d340d3ee5f9056b40396c0c
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 2c8076dd28c1e51e7d340d3ee5f9056b40396c0c
Author: Lewis Denny <email address hidden>
Date: Mon Jul 31 16:38:22 2023 +1000

    Add max limit to agent_down_time

    The agent_down_time value ends up being passed to an eventlet green
    thread; under the hood, this uses a CPython interface that is limited
    to INT_MAX (2^32 / 2 - 1, as defined in C, where int is usually
    32 bits) milliseconds.

    I have set the max value to (2^32 / 2 - 1)/1000, as agent_down_time
    is configured in seconds; this ends up being 2147483.

    This patch is required as passing a larger number
    causes this error: OverflowError: timeout is too large

    If a user currently has a value larger than (2^32 / 2 - 1)/1000 set,
    Neutron Server will fail to start and will print out a very helpful
    error message.

    Conflicts:
          neutron/conf/agent/database/agents_db.py

    Closes-Bug: #2028724
    Change-Id: Ib5b943344cddbd468c00768461ba1ee00a2b4c58
    (cherry picked from commit 6fef1e65250dbda057206e1c2ee64f59b21d490f)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/yoga)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/905330
Reason: stable/yoga branch of openstack/neutron is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/yoga if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/905332
Committed: https://opendev.org/openstack/neutron/commit/96c207c5f575e37882f78aa6079b2e3e1d0824e0
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 96c207c5f575e37882f78aa6079b2e3e1d0824e0
Author: Lewis Denny <email address hidden>
Date: Mon Jul 31 16:38:22 2023 +1000

    Add max limit to agent_down_time

    The agent_down_time value ends up being passed to an eventlet green
    thread; under the hood, this uses a CPython interface that is limited
    to INT_MAX (2^32 / 2 - 1, as defined in C, where int is usually
    32 bits) milliseconds.

    I have set the max value to (2^32 / 2 - 1)/1000, as agent_down_time
    is configured in seconds; this ends up being 2147483.

    This patch is required as passing a larger number
    causes this error: OverflowError: timeout is too large

    If a user currently has a value larger than (2^32 / 2 - 1)/1000 set,
    Neutron Server will fail to start and will print out a very helpful
    error message.

    Conflicts:
          neutron/conf/agent/database/agents_db.py

    Closes-Bug: #2028724
    Change-Id: Ib5b943344cddbd468c00768461ba1ee00a2b4c58
    (cherry picked from commit 6fef1e65250dbda057206e1c2ee64f59b21d490f)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/905331
Committed: https://opendev.org/openstack/neutron/commit/e18837a7e01cea95477beb5a4ab58c80f39cce37
Submitter: "Zuul (22348)"
Branch: stable/xena

commit e18837a7e01cea95477beb5a4ab58c80f39cce37
Author: Lewis Denny <email address hidden>
Date: Mon Jul 31 16:38:22 2023 +1000

    Add max limit to agent_down_time

    The agent_down_time value ends up being passed to an eventlet green
    thread; under the hood, this uses a CPython interface that is limited
    to INT_MAX (2^32 / 2 - 1, as defined in C, where int is usually
    32 bits) milliseconds.

    I have set the max value to (2^32 / 2 - 1)/1000, as agent_down_time
    is configured in seconds; this ends up being 2147483.

    This patch is required as passing a larger number
    causes this error: OverflowError: timeout is too large

    If a user currently has a value larger than (2^32 / 2 - 1)/1000 set,
    Neutron Server will fail to start and will print out a very helpful
    error message.

    Conflicts:
          neutron/conf/agent/database/agents_db.py

    Closes-Bug: #2028724
    Change-Id: Ib5b943344cddbd468c00768461ba1ee00a2b4c58
    (cherry picked from commit 6fef1e65250dbda057206e1c2ee64f59b21d490f)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron wallaby-eom

This issue was fixed in the openstack/neutron wallaby-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron xena-eom

This issue was fixed in the openstack/neutron xena-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.2.1

This issue was fixed in the openstack/neutron 21.2.1 release.
