TestRPC unit tests intermittently fail with "'>' not supported between instances of 'NoneType' and 'datetime.datetime'" - maybe due to "Fatal Python error: Cannot recover from stack overflow."

Bug #1825435 reported by Matt Riedemann on 2019-04-18
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Stephen Finucane

Bug Description

Seen here:

http://logs.openstack.org/45/649345/7/check/openstack-tox-py36/ba15c17/job-output.txt.gz#_2019-04-18_18_12_52_121182

2019-04-18 18:12:52.121182 | ubuntu-bionic | {3} nova.tests.unit.test_rpc.TestRPC.test_get_transport_url_null [] ... inprogress
2019-04-18 18:12:52.138024 | ubuntu-bionic | '>' not supported between instances of 'NoneType' and 'datetime.datetime'
2019-04-18 18:12:52.353474 | ubuntu-bionic | ERROR: InvocationError for command /home/zuul/src/git.openstack.org/openstack/nova/.tox/py36/bin/stestr run (exited with code 1)

This appears to kill both the test and the stestr runner.

There is also a stack overflow here:

http://logs.openstack.org/45/649345/7/check/openstack-tox-py36/ba15c17/job-output.txt.gz#_2019-04-18_18_10_53_423952

2019-04-18 18:10:53.423952 | ubuntu-bionic | Fatal Python error: Cannot recover from stack overflow.
2019-04-18 18:10:53.423999 | ubuntu-bionic |
2019-04-18 18:10:53.424162 | ubuntu-bionic | Current thread 0x00007f34d2bcc740 (most recent call first):
2019-04-18 18:10:53.424472 | ubuntu-bionic | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2614 in _get
2019-04-18 18:10:53.424794 | ubuntu-bionic | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2183 in __getattr__
...
2019-04-18 18:10:53.476982 | ubuntu-bionic | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2183 in __getattr__
2019-04-18 18:10:53.477236 | ubuntu-bionic | ...
2019-04-18 18:10:53.477293 | ubuntu-bionic | Aborted

The stack overflow appears to be hitting only nova, starting around April 15:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22%7C%20Fatal%20Python%20error%3A%20Cannot%20recover%20from%20stack%20overflow.%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

I don't see any related nova changes around April 15, but perhaps something released to upper-constraints on April 15 is responsible:

https://github.com/openstack/requirements/commit/a96ee0e258aafb2880149b3e25dd5959f7b37c09

Nothing in there looks obvious though.

Matt Riedemann (mriedem) wrote :

The "'>' not supported between instances of 'NoneType' and 'datetime.datetime'" message doesn't show up in every run that fails with TestRPC stuck in "inprogress":

http://logs.openstack.org/34/649534/8/gate/openstack-tox-lower-constraints/1fe6021/job-output.txt.gz#_2019-04-18_17_41_14_279335

The stack overflow does though:

http://logs.openstack.org/34/649534/8/gate/openstack-tox-lower-constraints/1fe6021/job-output.txt.gz#_2019-04-18_17_39_13_745618

Matt Riedemann (mriedem) wrote :

I don't think this is due to oslo.config or oslo.messaging changes, because it's hitting in both the lower-constraints and py36 jobs, where the versions of those packages differ. It definitely started around 4/15, but it's not obvious to me which nova changes around that time might be triggering this.

summary: TestRPC unit tests intermittently fail with "'>' not supported between
- instances of 'NoneType' and 'datetime.datetime'"
+ instances of 'NoneType' and 'datetime.datetime'" - maybe due to "Fatal
+ Python error: Cannot recover from stack overflow."
Changed in nova:
importance: Undecided → High
Jay Pipes (jaypipes) wrote :

Note that the error is being hit across many source CI providers and multiple regions in each source provider, so this is not isolated to one particular CI provider.

Jay Pipes (jaypipes) wrote :

Matt, could this patch have anything to do with this you think?

https://github.com/openstack/oslo.config/commit/6c951ed373f443727a8ea2d10d346ad769285aa2

It's changing out the base class for ConfigOpts and GroupAttr. When I looked at the collections.abc implementations in Python2 and Python3.7, I noticed there were some differences in how the base classes that comprise collections.abc.Mapping "calculate" attributes.

In the case of Python 2, there is a _hasattr() module-level function that is used to determine whether a particular attribute *or* method exists. For example, one of the abstract base classes that collections.Mapping derives from is Container, which uses this _hasattr() function to determine whether the subclass overrides the __contains__ magic method:

    @classmethod
    def __subclasshook__(cls, C):
        if cls is Container:
            if _hasattr(C, "__contains__"):
                return True
        return NotImplemented

https://github.com/python/cpython/blob/2.7/Lib/_abcoll.py#L116

Compare this to the Python3.7 Container class's similar implementation:

    @classmethod
    def __subclasshook__(cls, C):
        if cls is Container:
            return _check_methods(C, "__contains__")
        return NotImplemented

https://github.com/python/cpython/blob/3.7/Lib/_collections_abc.py#L385

If we look at Python2's _hasattr() compared to Python3.7's _check_methods() function, we see some differences:

Python2:

def _hasattr(C, attr):
    try:
        return any(attr in B.__dict__ for B in C.__mro__)
    except AttributeError:
        # Old-style class
        return hasattr(C, attr)

Python 3.7:

def _check_methods(C, *methods):
    mro = C.__mro__
    for method in methods:
        for B in mro:
            if method in B.__dict__:
                if B.__dict__[method] is None:
                    return NotImplemented
                break
        else:
            return NotImplemented
    return True
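
To illustrate the practical difference (this is my own sketch, not code from the bug report): under Python 3's _check_methods, a class can opt out of an ABC by setting the magic method to None, whereas Python 2's _hasattr would have counted that same entry as a hit and reported the class as a Container:

```python
# Illustration of the Python 3 _check_methods behaviour described above
# (hypothetical example, not from the bug report): a magic method set to
# None makes _check_methods return NotImplemented, so the class is NOT
# treated as a virtual subclass of the ABC. Python 2's _hasattr only
# checked for the attribute's presence in __dict__, so the same class
# would have been accepted.
from collections.abc import Container

class HasContains:
    def __contains__(self, item):
        return True

class OptedOut:
    __contains__ = None  # opts out of the Container ABC in Python 3

print(issubclass(HasContains, Container))  # True
print(issubclass(OptedOut, Container))     # False
```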

Given the stacktrace that seems to be occurring, I think it might be worthwhile to revert https://github.com/openstack/oslo.config/commit/6c951ed373f443727a8ea2d10d346ad769285aa2 and see if that might indeed be the culprit here.

Just a thought,

-jay

Matt Riedemann (mriedem) wrote :

I don't see how https://github.com/openstack/oslo.config/commit/6c951ed373f443727a8ea2d10d346ad769285aa2 could be involved: it hasn't been released yet, and even if it had been, we're hitting this in both the py36 (normal, with upper-constraints) and lower-constraints jobs, which use different versions of oslo.config. Maybe it's a latent bug in oslo.config that something recently changed in nova is tickling, or there is some untracked-in-constraints library release around 4/15 causing issues (maybe something in python3.6 itself in the bionic repo?).

Matt Riedemann (mriedem) wrote :

http://changelogs.ubuntu.com/changelogs/pool/main/p/python3-defaults/python3-defaults_3.6.7-1~18.04/changelog doesn't suggest there were any recent changes to the python3.6 binary in the bionic-updates repo.

Related fix proposed to branch: master
Review: https://review.opendev.org/655843

Changed in nova:
assignee: nobody → Stephen Finucane (stephenfinucane)
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/655843
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1a4a521fefb71aa0817770f265063a9150653743
Submitter: Zuul
Branch: master

commit 1a4a521fefb71aa0817770f265063a9150653743
Author: Stephen Finucane <email address hidden>
Date: Fri Apr 26 12:10:11 2019 +0100

    test_rpc: Stop f****** with global state

    We're occasionally seeing stacktraces like this in our tests:

      Fatal Python error: Cannot recover from stack overflow.

      Current thread 0x00007fe66549f740 (most recent call first):
        File "...nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2614 in _get
        File "...nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2183 in __getattr__
        File "...nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2614 in _get
        File "...nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2183 in __getattr__
        File "...nova/.tox/py36/lib/python3.6/site-packages/oslo_config/cfg.py", line 2614 in _get
        ...

    From a look at the oslo.config source, this seems to be occurring
    because 'ConfigOpts.__cache' is somehow undefined, which results in the
    '_get' method attempting to call '__getattr__' [1], which calls '_get'
    [2], which calls '__getattr__' and so on.

    The exact reason this is happening isn't clear, but what is clear is
    that how we handle global config options in the tests that are failing
    is very strange and potentially subject to race conditions. We have a
    clear pattern for mocking everything and anything - the mock module -
    and we should be using this here. Start doing so, reworking a lot of the
    tests in the process, in order to avoid messing with oslo.config and
    triggering the issue entirely.

    [1] https://github.com/openstack/oslo.config/blob/6.8.1/oslo_config/cfg.py#L2614
    [2] https://github.com/openstack/oslo.config/blob/6.8.1/oslo_config/cfg.py#L2183

    Change-Id: I468cef94185a1b59f379ca527050450e03664c67
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1825435
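
The mutual recursion described in the commit message can be reproduced with a toy model (my own simplification, not the actual oslo.config code): if the instance's internal cache attribute is missing, every attribute access falls through to __getattr__, which calls _get, which touches the cache attribute again, and so on. In pure Python this surfaces as a RecursionError; in the failing test runs it escalated to the fatal "Cannot recover from stack overflow" abort.

```python
# Toy model (hypothetical, NOT the real oslo.config ConfigOpts) of the
# _get <-> __getattr__ mutual recursion described in the commit message.
class ConfigOptsModel:
    def __init__(self, populate_cache=True):
        if populate_cache:
            self._cache = {}  # normally always set up during __init__

    def _get(self, name):
        # Reading self._cache triggers __getattr__ if it is absent.
        if name in self._cache:
            return self._cache[name]
        raise KeyError(name)

    def __getattr__(self, name):
        # Attribute lookup falls back to option lookup, as in oslo.config.
        return self._get(name)

ok = ConfigOptsModel()
try:
    ok.missing_option
except KeyError:
    print("normal path: KeyError")

broken = ConfigOptsModel(populate_cache=False)
try:
    broken.anything  # _cache is missing: __getattr__ -> _get -> __getattr__ ...
except RecursionError:
    print("broken path: RecursionError")
```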

Changed in nova:
status: In Progress → Fix Released

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/654468
Reason: https://review.opendev.org/655843
