Apparent lack of locking in conductor logs
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Incomplete
|
Undecided
|
Unassigned | ||
oslo.log |
Opinion
|
Undecided
|
Unassigned |
Bug Description
I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example:
===
2017-09-14 15:54:39.689 120626 ERROR nova.servicegro
2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegro
2017-09-14 15:54:39.690 120562 ERROR nova.servicegro
===
Notice how a new log starts part way through the second line above. This also results in log entries in the wrong sort order:
===
2017-09-14 15:54:39.690 120562 ERROR nova.servicegro
2017-09-14 15:54:39.690 120562 ERROR nova.servicegro
2017-09-14 15:54:39.689 120626 ERROR nova.servicegro
2017-09-14 15:54:39.689 120626 ERROR nova.servicegro
===
Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful.
We also see truncated lines with no timestamp which aren't a continuation of the previous line:
===
2017-09-14 15:54:39.690 120607 ERROR nova.servicegro
2017-09-14 15:54:39.690 120607 ERROR nova.servicegro
elf._execute_
===
I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging.
summary: |
- Apparent lack of locking in logger + Apparent lack of locking in conductor logs |
After some brief discussion in #openstack-nova I've moved this to oslo.log. The issue here appears to be that we spawn multiple separate conductor processes writing to the same nova-conductor.log file. We don't want to stop doing this, as it would break people.
It seems that by default python logging uses thread logs rather than external locks:
https:/ /docs.python. org/2/library/ multiprocessing .html#logging
Suggest the fix might be to explicitly use multiprocessing .get_logger( ), or at least provide an option to do this when we know it's required.