nova_placement service start not coordinated with api db sync on multiple controllers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Low | Lee Yarwood |
Rocky | Fix Released | Low | Martin Schuppert |
tripleo | Fix Released | High | Martin Schuppert |
Bug Description
On a loaded HA / galera environment using VMs, I can fairly consistently reproduce a race condition where the nova_placement service is started on controllers where the database is not yet available. The nova_placement service does not tolerate this condition at startup and then fails to recover. Mitigation could either involve synchronizing these conditions or making nova-placement more resilient.
The symptoms of overcloud deploy failure look like two out of three controllers having the nova_placement container in an unhealthy state:
TASK [Debug output for task which failed: Check for unhealthy containers after step 3] ***
Saturday 28 July 2018 10:19:29 +0000 (0:00:00.663) 0:30:26.152 *********
fatal: [stack2-
"failed_
"outputs.
]
}
fatal: [stack2-
"failed_
"outputs.
]
}
ok: [stack2-
"failed_
"outputs.
}
ok: [stack2-
"failed_
"outputs.
}
NO MORE HOSTS LEFT *******
inspecting placement_
[Sat Jul 28 10:17:06.525018 2018] [:error] [pid 14] [remote 10.1.20.15:0] mod_wsgi (pid=14): Target WSGI script '/var/www/
[Sat Jul 28 10:17:06.525067 2018] [:error] [pid 14] [remote 10.1.20.15:0] mod_wsgi (pid=14): Exception occurred processing WSGI script '/var/www/
[Sat Jul 28 10:17:06.525101 2018] [:error] [pid 14] [remote 10.1.20.15:0] Traceback (most recent call last):
[Sat Jul 28 10:17:06.525124 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/var/www/
[Sat Jul 28 10:17:06.525165 2018] [:error] [pid 14] [remote 10.1.20.15:0] application = init_application()
[Sat Jul 28 10:17:06.525174 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/
[Sat Jul 28 10:17:06.525198 2018] [:error] [pid 14] [remote 10.1.20.15:0] return deploy.
[Sat Jul 28 10:17:06.525205 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/
[Sat Jul 28 10:17:06.525300 2018] [:error] [pid 14] [remote 10.1.20.15:0] update_database()
[Sat Jul 28 10:17:06.525310 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/
[Sat Jul 28 10:17:06.525329 2018] [:error] [pid 14] [remote 10.1.20.15:0] resource_
[Sat Jul 28 10:17:06.525337 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/
[Sat Jul 28 10:17:06.526277 2018] [:error] [pid 14] [remote 10.1.20.15:0] _trait_sync(ctx)
...
[Sat Jul 28 10:17:06.531950 2018] [:error] [pid 14] [remote 10.1.20.15:0] raise errorclass(errno, errval)
[Sat Jul 28 10:17:06.532049 2018] [:error] [pid 14] [remote 10.1.20.15:0] ProgrammingError: (pymysql.
The log then continues with the following exception repeated every 30 seconds, which is likely the service trying to recover, but it hits a problem with how oslo.config is used within the process:
[Sat Jul 28 10:18:36.916617 2018] [:error] [pid 14] [remote 10.1.20.15:148] mod_wsgi (pid=14): Target WSGI script '/var/www/
[Sat Jul 28 10:18:36.916646 2018] [:error] [pid 14] [remote 10.1.20.15:148] mod_wsgi (pid=14): Exception occurred processing WSGI script '/var/www/
[Sat Jul 28 10:18:36.916664 2018] [:error] [pid 14] [remote 10.1.20.15:148] Traceback (most recent call last):
[Sat Jul 28 10:18:36.916681 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/var/www/
[Sat Jul 28 10:18:36.916711 2018] [:error] [pid 14] [remote 10.1.20.15:148] application = init_application()
[Sat Jul 28 10:18:36.916719 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916742 2018] [:error] [pid 14] [remote 10.1.20.15:148] _parse_args([], default_
[Sat Jul 28 10:18:36.916748 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916760 2018] [:error] [pid 14] [remote 10.1.20.15:148] logging.
[Sat Jul 28 10:18:36.916766 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916779 2018] [:error] [pid 14] [remote 10.1.20.15:148] conf.register_
[Sat Jul 28 10:18:36.916785 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916807 2018] [:error] [pid 14] [remote 10.1.20.15:148] result = f(self, *args, **kwargs)
[Sat Jul 28 10:18:36.916813 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916823 2018] [:error] [pid 14] [remote 10.1.20.15:148] self.register_
[Sat Jul 28 10:18:36.916828 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916838 2018] [:error] [pid 14] [remote 10.1.20.15:148] return f(self, *args, **kwargs)
[Sat Jul 28 10:18:36.916843 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/
[Sat Jul 28 10:18:36.916852 2018] [:error] [pid 14] [remote 10.1.20.15:148] raise ArgsAlreadyPars
[Sat Jul 28 10:18:36.916894 2018] [:error] [pid 14] [remote 10.1.20.15:148] ArgsAlreadyPars
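The repeating failure looks like mod_wsgi re-running init_application() against module-global oslo.config state that the first, failed attempt already partially initialized. Below is a minimal standalone sketch of that behaviour, assuming nothing beyond stock oslo.config; the option name is made up purely for illustration:

    # Sketch only, not placement code: oslo.config refuses to register CLI
    # options on a ConfigOpts object whose arguments have already been parsed,
    # which is roughly what a second init_application() call runs into.
    from oslo_config import cfg

    conf = cfg.ConfigOpts()
    conf([], project='demo')  # first initialization parses the (empty) CLI args

    try:
        # A second initialization pass tries to register CLI options again;
        # 'example-opt' is a hypothetical option used only for this sketch.
        conf.register_cli_opt(cfg.StrOpt('example-opt'))
    except cfg.ArgsAlreadyParsedError as exc:
        print('second init fails: %s' % exc)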
In nova_placement.yaml we can see that the placement service is started earlier than the others, in step_3:

    docker_config:
      step_2:
        get_attr: [NovaPlacementLogging, docker_config, step_2]
      # start this early so it is up before computes start reporting
      step_3:
        nova_placement:
          image: {get_param: DockerNovaPlacementImage}
          net: host
          user: root
However, in nova-api.yaml we can see that the API DB sync is also run in step_3; it does include start_order: 0, but I'm going to guess that the "start_order" mechanism is not coordinated across controllers:
    docker_config:
      step_2:
        get_attr: [NovaApiLogging, docker_config, step_2]
      step_3:
        nova_api_db_sync:
          start_order: 0
          image: &nova_api_image {get_param: DockerNovaApiImage}
          net: host
          detach: false
          user: root
          volumes:
            list_concat:
              - {get_attr: [ContainersCommon, volumes]}
              - {get_attr: [NovaApiLogging, volumes]}
              -
                - /var/lib/
                - /var/lib/
The placement service always starts correctly on the "bootstrap" controller, where start_order does coordinate the DB sync and the placement start, but that ordering does not carry over to the other two controllers.
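One way to make the non-bootstrap controllers safe would be to gate the placement container start on the API database actually being ready, rather than relying on per-node start_order. Below is a minimal sketch of that idea using pymysql directly; the connection parameters and the check against the traits table (the table the failing _trait_sync() call needs) are assumptions for illustration, not what TripleO actually configures:

    import time

    import pymysql

    def wait_for_api_db(host, user, password, database, timeout=300):
        """Poll the API database until placement's schema is usable."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                conn = pymysql.connect(host=host, user=user,
                                       password=password, database=database)
                try:
                    with conn.cursor() as cur:
                        # The traits table only exists once the API db sync
                        # has run, so this doubles as a schema check.
                        cur.execute("SELECT 1 FROM traits LIMIT 1")
                    return True
                finally:
                    conn.close()
            except pymysql.MySQLError:
                time.sleep(5)
        return False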
affects: | nova (Ubuntu) → ubuntu |
affects: | ubuntu → nova |
tags: | added: placement |
Changed in tripleo: | |
status: | New → Triaged |
milestone: | none → rocky-rc1 |
importance: | Undecided → High |
Changed in nova: | |
status: | New → Triaged |
importance: | Undecided → Low |
Changed in tripleo: | |
milestone: | rocky-rc1 → stein-1 |
Changed in tripleo: | |
assignee: | nobody → Martin Schuppert (mschuppert) |
Changed in tripleo: | |
milestone: | stein-1 → stein-2 |
We could probably consider handling either of the two errors within placement itself, but I think it would be (much) better to simply let placement die (all the way dead, not whatever mod_wsgi is doing that causes the repeating oslo.config issue) and manage retrying externally.
However, it also looks like the wsgi-script support in PBR needs a way to separate the configuration stages and other setup from creating the WSGI "application" symbol.
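As a rough sketch of what that separation could enable, the WSGI entry point could initialize lazily and retry on a later request instead of failing permanently at import time. The init_application import below assumes the entry point shown in the traceback above, and the 503 handling is illustrative only; note this only helps if the configuration setup inside init_application is idempotent, which is exactly the separation being asked for here:

    # Sketch only: initialize the real application lazily and retry on a later
    # request instead of caching a failed import-time initialization forever.
    from nova.api.openstack.placement.wsgi import init_application

    _application = None

    def application(environ, start_response):
        global _application
        if _application is None:
            try:
                _application = init_application()
            except Exception:
                # Database not ready yet (or some other init failure): answer
                # 503 and let a subsequent request retry from scratch.
                start_response('503 Service Unavailable',
                               [('Content-Type', 'text/plain')])
                return [b'placement API is not ready\n']
        return _application(environ, start_response)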