[queens][OSP13] An overcloud reboot will sometimes leave nova_api broken

Bug #1882094 reported by Herve Beraud
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | In Progress | Undecided | melanie witt |
Train | In Progress | Undecided | melanie witt |
Ussuri | Fix Released | Undecided | melanie witt |
Victoria | Fix Released | Undecided | melanie witt |

Bug Description

Description of problem:

This is a composable HA overcloud with tls-everywhere with 2020-05-28.2 compose.
We reboot the overcloud one node at a time, and from time to time (not consistently) we see that the nova_api container is in an unhealthy state and returns the following in a loop:
```
[Thu Jun 04 07:00:29.332162 2020] [:error] [pid 19] [remote 172.17.1.36:180] mod_wsgi (pid=19): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-api'.
[Thu Jun 04 07:00:29.332185 2020] [:error] [pid 19] [remote 172.17.1.36:180] Traceback (most recent call last):
[Thu Jun 04 07:00:29.332208 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/var/www/cgi-bin/nova/nova-api", line 54, in <module>
[Thu Jun 04 07:00:29.332271 2020] [:error] [pid 19] [remote 172.17.1.36:180] application = init_application()
[Thu Jun 04 07:00:29.332286 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/wsgi.py", line 20, in init_application
[Thu Jun 04 07:00:29.332317 2020] [:error] [pid 19] [remote 172.17.1.36:180] return wsgi_app.init_application(NAME)
[Thu Jun 04 07:00:29.332327 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi_app.py", line 78, in init_application
[Thu Jun 04 07:00:29.332351 2020] [:error] [pid 19] [remote 172.17.1.36:180] config.parse_args([], default_config_files=conf_files)
[Thu Jun 04 07:00:29.332367 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/nova/config.py", line 35, in parse_args
[Thu Jun 04 07:00:29.332385 2020] [:error] [pid 19] [remote 172.17.1.36:180] log.register_options(CONF)
[Thu Jun 04 07:00:29.332401 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 250, in register_options
[Thu Jun 04 07:00:29.332433 2020] [:error] [pid 19] [remote 172.17.1.36:180] conf.register_cli_opts(_options.common_cli_opts)
[Thu Jun 04 07:00:29.332461 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2440, in __inner
[Thu Jun 04 07:00:29.332490 2020] [:error] [pid 19] [remote 172.17.1.36:180] result = f(self, *args, **kwargs)
[Thu Jun 04 07:00:29.332503 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2662, in register_cli_opts
[Thu Jun 04 07:00:29.332523 2020] [:error] [pid 19] [remote 172.17.1.36:180] self.register_cli_opt(opt, group, clear_cache=False)
[Thu Jun 04 07:00:29.332532 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2444, in __inner
[Thu Jun 04 07:00:29.332550 2020] [:error] [pid 19] [remote 172.17.1.36:180] return f(self, *args, **kwargs)
[Thu Jun 04 07:00:29.332559 2020] [:error] [pid 19] [remote 172.17.1.36:180] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2654, in register_cli_opt
[Thu Jun 04 07:00:29.332577 2020] [:error] [pid 19] [remote 172.17.1.36:180] raise ArgsAlreadyParsedError("cannot register CLI option")
[Thu Jun 04 07:00:29.332603 2020] [:error] [pid 19] [remote 172.17.1.36:180] ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option
```

It looks like we should always reset CONF when starting the wsgi app [1].

Version-Release number of selected component (if applicable):
openstack-nova-api-17.0.13-7.el7ost.noarch
puppet-nova-12.5.0-8.el7ost.noarch
python-nova-17.0.13-7.el7ost.noarch
openstack-nova-common-17.0.13-7.el7ost.noarch
python2-novaclient-10.1.1-1.el7ost.noarch

[1] https://github.com/openstack/nova/blob/stable/queens/nova/api/openstack/wsgi_app.py#L78

Tags: api
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/733627

Changed in nova:
assignee: nobody → Herve Beraud (herveberaud)
status: New → In Progress
melanie witt (melwitt)
tags: added: api
melanie witt (melwitt) wrote :

We have been discussing this on the associated downstream bug [1] and what we have found from the nova perspective is that it wouldn't be the best approach to try to make all of nova's global state "reloadable" to allow it to endure re-running of the WSGI script *without* actually restarting the python interpreter. Nova has A Lot of global state within it (unlike placement) and an approach involving reload-ifying all global data would be an exercise in whack-a-mole and vulnerable to future issues if/when more global state is added.

This mod_wsgi doc about considerations when reloading WSGI scripts in Embedded Mode [2] describes the problem we have here fairly well: "The first issue is that when the script file is imported, if the code makes modifications to sys.path or other global data structures and the changes are additive, checks should first be made to ensure that the change has not already been made, else duplicate data will be added every time the script file is reloaded."

Now, in our poorly behaving scenario, we are *not* running in Embedded Mode but rather in Daemon Mode. However, what we are observing in our downstream apache + mod_wsgi environment is that if the WSGI script fails to load for any reason (in this case it's a DBConnectionError, expected because the DB is rebooting while nova-api is coming up -- we establish connection to the DB as part of the WSGI init script), something (mod_wsgi?) will re-run the script *without* reloading the daemon process at all. And that is where we run into the blowups with the various global state.
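
To make the failure mode above concrete, here is a small stdlib-only simulation. It is only a sketch: `FakeConf`, `run_script` and both exception classes are hypothetical stand-ins written for this illustration, not oslo.config or nova code. It assumes, as described above, that the init script is run a second time inside the same Python process after the first run failed.

```python
class ArgsAlreadyParsedError(Exception):
    """Stand-in for the oslo.config error of the same name."""


class DBConnectionError(Exception):
    """Stand-in for the DB error raised while the database is rebooting."""


class FakeConf:
    """Hypothetical, heavily simplified model of a process-global config."""

    def __init__(self):
        self.cli_opts = []
        self.parsed = False

    def register_cli_opts(self, opts):
        # CLI options cannot be added once the arguments have been parsed.
        if self.parsed:
            raise ArgsAlreadyParsedError("cannot register CLI option")
        self.cli_opts.extend(opts)

    def parse_args(self):
        self.parsed = True


CONF = FakeConf()  # module-level state: it survives a re-run of the script


def init_application(db_is_up):
    CONF.register_cli_opts(["debug"])
    CONF.parse_args()
    if not db_is_up:
        raise DBConnectionError("database is still rebooting")
    return "wsgi application"


def run_script(db_is_up):
    """Run the init once and report the outcome as a string."""
    try:
        return init_application(db_is_up)
    except Exception as exc:
        return type(exc).__name__


first = run_script(db_is_up=False)   # DB down: the expected, transient error
second = run_script(db_is_up=True)   # DB back, but CONF was already parsed
print(first, second)
```

In this model the second run fails with a different error than the first, even though the database is back, which matches the traceback looping in the description.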

Instead of chasing each piece of global data and making it reloadable, we're thinking there are a couple of other options:

(1) Moving the DB connection establishment out of our WSGI init script so that the init script is dead simple and global data access stays outside in normal python modules

or

(2) Instead of letting exceptions go uncaught in our WSGI init script, catch them and log a message and sys.exit() the python process as we have failed to do the bare minimum of initialization needed for nova-api to work properly

For now I'm favoring option (1) as is also described in [2]: "One should therefore be cautious of what data is kept in a script file. Preferably the script file should only act as a bridge to code and data residing in a normal Python module imported from an entirely different directory."

But I'm not 100% sure how this will behave, so I need to do some local testing with it. I will try some things out and comment again afterward.
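
For reference, option (2) could look roughly like the sketch below. `init_or_exit` is a hypothetical helper name, not something in nova, and how apache/mod_wsgi reacts to the resulting `SystemExit` is not verified here; that would be part of the local testing mentioned above.

```python
import logging
import sys

LOG = logging.getLogger(__name__)


def init_or_exit(init_func):
    """Call init_func(); on any failure, log the error and exit the process."""
    try:
        return init_func()
    except Exception:
        # The bare minimum of initialization failed, so give up instead of
        # leaving half-initialized global state behind for a later re-run.
        LOG.exception("WSGI application init failed, exiting")
        sys.exit(1)
```

The WSGI script would then do something like `application = init_or_exit(init_application)`.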

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1843798
[2] https://modwsgi.readthedocs.io/en/develop/user-guides/reloading-source-code.html#reloading-in-embedded-mode

Changed in nova:
assignee: Herve Beraud (herveberaud) → melanie witt (melwitt)
melanie witt (melwitt) wrote :

I thought about this more and realized that doing (1) and simply trying to move init of global data outside of the init_application method would have the same end result as grouping the global data init into a single method and ensuring that method could only run once per python interpreter instance ... and the latter seemed much simpler after I had another look around the code. I've proposed a new PS with this approach and will add a test soon to see if it would solve the problem.
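
A minimal sketch of the "run only once per python interpreter instance" idea, using only the standard library. The decorator and the function names below are illustrative assumptions and not the code from the proposed patch set. The `called` flag is set before the wrapped function runs, so a failed first attempt is not retried either.

```python
import functools


def run_once(func):
    """Let func run on the first call only; later calls are no-ops."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if wrapper.called:
            return None
        wrapper.called = True  # set first so a failure is not retried
        return func(*args, **kwargs)

    wrapper.called = False
    return wrapper


setup_calls = []


@run_once
def init_global_data():
    # Stands in for config parsing, logging setup and similar global init.
    setup_calls.append("init")


def init_application(db_is_up=True):
    init_global_data()
    if not db_is_up:
        raise RuntimeError("database unavailable")
    return "wsgi application"
```

With this shape, a first `init_application(db_is_up=False)` raises, and a later retry in the same process skips the global init and can return the application.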

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "melanie witt <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/785059
Reason: Squashed into https://review.opendev.org/c/openstack/nova/+/785060

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "melanie witt <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/785060
Reason: squashed into https://review.opendev.org/c/openstack/nova/+/785059

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/nova/+/785059
Committed: https://opendev.org/openstack/nova/commit/e3085fa6310ddeaafa493c3f718aab0ce64f0994
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit e3085fa6310ddeaafa493c3f718aab0ce64f0994
Author: Hervé Beraud <email address hidden>
Date: Thu Jun 4 09:49:59 2020 +0200

    Initialize global data separately and run_once in WSGI app init

    NOTE(melwitt): This is a combination of two changes to avoid
    intermittent test failure that was introduced by the original bug fix,
    and was fixed by change I2bd360dcc6501feea7baf02d4510b282205fc061.

    We have discovered that if an exception is raised at any point during
    the running of the init_application WSGI script in an apache/mod_wsgi
    Daemon Mode environment, it will prompt apache/mod_wsgi to re-run the
    script without starting a fresh python process. Because we initialize
    global data structures during app init, subsequent runs of the script
    blow up as some global data do *not* support re-initialization. It is
    anyway not safe to assume that init of global data is safe to run
    multiple times.

    This mod_wsgi behavior appears to be a special situation that does not
    behave the same as a normal reload in Daemon Mode as the script file is
    being reloaded upon failure instead of the daemon process being
    shutdown and restarted as described in the documentation [1].

    In order to handle this situation, we can move the initialization of
    global data structures to a helper method that is decorated to run only
    once per python interpreter instance. This way, we will not attempt to
    re-initialize global data that are not safe to init more than once.

    Co-Authored-By: Michele Baldessari <email address hidden>
    Co-Authored-By: melanie witt <email address hidden>

    Conflicts:
        nova/api/openstack/wsgi_app.py

    NOTE(melwitt): The conflict is because change
    If4783adda92da33d512d7c2834f0bb2e2a9b9654 (Support sys.argv in wsgi
    app) is not in Victoria.

    Closes-Bug: #1882094

    [1] https://modwsgi.readthedocs.io/en/develop/user-guides/reloading-source-code.html#reloading-in-daemon-mode

    Reset global wsgi app state in unit test

    Since I2bd360dcc6501feea7baf02d4510b282205fc061 there is a global state
    set during the wsgi_app init making our unit test cases
    non-deterministic based on the order of them. This patch makes sure
    that the global state is reset for each test case.

    Closes-Bug: #1921098
    (cherry picked from commit bc2c19bb2db901af0c48d34fb15a335f4e343361)

    Change-Id: I2bd360dcc6501feea7baf02d4510b282205fc061
    (cherry picked from commit 7c9edc02eda45aafbbb539b759e6b92f7aeb5ea8)
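
The test-ordering problem described in the second half of this commit message can be illustrated with a small stdlib-only sketch. The `reset()` hook, `init_global_data` and `InitTestCase` are hypothetical names chosen for this example and are not taken from the nova change itself.

```python
import unittest


def run_once(func):
    """Run func on the first call only; expose reset() for tests."""
    def wrapper():
        if not wrapper.called:
            wrapper.called = True
            func()

    def reset():
        wrapper.called = False

    wrapper.called = False
    wrapper.reset = reset
    return wrapper


calls = []


@run_once
def init_global_data():
    calls.append("init")


class InitTestCase(unittest.TestCase):
    def setUp(self):
        # Without this reset, only whichever test happens to run first
        # would see init_global_data() actually execute.
        init_global_data.reset()
        calls.clear()

    def test_first(self):
        init_global_data()
        self.assertEqual(["init"], calls)

    def test_second(self):
        init_global_data()
        init_global_data()
        self.assertEqual(["init"], calls)
```

Resetting in `setUp` makes each test case independent of the order in which the cases run.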

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "melanie witt <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/785061
Reason: Using https://review.opendev.org/c/openstack/nova/+/785061 instead.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.2.2

This issue was fixed in the openstack/nova 22.2.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/nova/+/785061
Committed: https://opendev.org/openstack/nova/commit/59249697bf09ca1a560defdd550be7e3c439b5b7
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 59249697bf09ca1a560defdd550be7e3c439b5b7
Author: Hervé Beraud <email address hidden>
Date: Thu Jun 4 09:49:59 2020 +0200

    Initialize global data separately and run_once in WSGI app init

    NOTE(melwitt): This is a combination of two changes to avoid
    intermittent test failure that was introduced by the original bug fix,
    and was fixed by change I2bd360dcc6501feea7baf02d4510b282205fc061.

    We have discovered that if an exception is raised at any point during
    the running of the init_application WSGI script in an apache/mod_wsgi
    Daemon Mode environment, it will prompt apache/mod_wsgi to re-run the
    script without starting a fresh python process. Because we initialize
    global data structures during app init, subsequent runs of the script
    blow up as some global data do *not* support re-initialization. It is
    anyway not safe to assume that init of global data is safe to run
    multiple times.

    This mod_wsgi behavior appears to be a special situation that does not
    behave the same as a normal reload in Daemon Mode as the script file is
    being reloaded upon failure instead of the daemon process being
    shutdown and restarted as described in the documentation [1].

    In order to handle this situation, we can move the initialization of
    global data structures to a helper method that is decorated to run only
    once per python interpreter instance. This way, we will not attempt to
    re-initialize global data that are not safe to init more than once.

    Co-Authored-By: Michele Baldessari <email address hidden>
    Co-Authored-By: melanie witt <email address hidden>

    Conflicts:
      nova/test.py

    NOTE(melwitt): The conflict is because change
    I1fea14d5be10bb4e884f52e0ae8be722519ddd3f (Poison
    netifaces.interfaces() in tests) is not in Ussuri.

    Closes-Bug: #1882094

    [1] https://modwsgi.readthedocs.io/en/develop/user-guides/reloading-source-code.html#reloading-in-daemon-mode

    Reset global wsgi app state in unit test

    Since I2bd360dcc6501feea7baf02d4510b282205fc061 there is a global state
    set during the wsgi_app init making our unit test cases
    non-deterministic based on the order of them. This patch makes sure
    that the global state is reset for each test case.

    Closes-Bug: #1921098
    (cherry picked from commit bc2c19bb2db901af0c48d34fb15a335f4e343361)

    Change-Id: I2bd360dcc6501feea7baf02d4510b282205fc061
    (cherry picked from commit 7c9edc02eda45aafbbb539b759e6b92f7aeb5ea8)
    (cherry picked from commit e3085fa6310ddeaafa493c3f718aab0ce64f0994)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.2.3

This issue was fixed in the openstack/nova 21.2.3 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/785064
Reason: The stable/train branches of the nova projects have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.
