Apparent lack of locking in conductor logs

Bug #1718439 reported by Matthew Booth
This bug affects 2 people
Affects                    Status      Importance  Assigned to  Milestone
OpenStack Compute (nova)   Incomplete  Undecided   Unassigned
oslo.log                   Opinion     Undecided   Unassigned

Bug Description

I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example:

===
2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams)
2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status
2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last):
===

Notice how a new log starts part way through the second line above. This also results in log entries in the wrong sort order:

===
2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams)
2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect
2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs)
2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__
===

Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful.

We also see truncated lines with no timestamp which aren't a continuation of the previous line:

===
2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']
2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db
elf._execute_and_instances(context)
===

I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging.
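A deterministic simulation of that suspicion (hypothetical record contents, and the chunked-write pattern is an assumption about how buffered I/O can split a record): Python's logging module serializes writes with a per-process thread lock, so nothing stops two processes' partial writes to the same file from interleaving.

```python
import io

def chunks(text, size):
    """Split one log record into the pieces buffered I/O might write separately."""
    return [text[i:i + size] for i in range(0, len(text), size)]

shared = io.StringIO()  # stands in for the shared nova-conductor.log
rec_a = "2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup ...\n"
rec_b = "2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup ...\n"

# Each process's thread lock protects only its own chunks; across process
# boundaries the kernel is free to interleave the write() calls like this:
for a, b in zip(chunks(rec_a, 20), chunks(rec_b, 20)):
    shared.write(a)
    shared.write(b)

print(shared.getvalue())
```

The output shows one record starting part-way through the other, matching the corruption quoted above.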

Matthew Booth (mbooth-9)
summary: - Apparent lack of locking in logger
+ Apparent lack of locking in conductor logs
Revision history for this message
Matthew Booth (mbooth-9) wrote :

After some brief discussion in #openstack-nova I've moved this to oslo.log. The issue here appears to be that we spawn multiple separate conductor processes writing to the same nova-conductor.log file. We don't want to stop doing this, as it would break people.

It seems that by default python logging uses thread locks rather than any cross-process locking:

  https://docs.python.org/2/library/multiprocessing.html#logging

Suggest the fix might be to explicitly use multiprocessing.get_logger(), or at least provide an option to do this when we know it's required.

affects: nova → oslo.log
Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

It sounds like the change to multiprocessing.get_logger() would be in nova, not the library. Is that right?

Revision history for this message
Ben Nemec (bnemec) wrote :

Yeah, since we stopped calling get_logger from oslo.log I don't think we can fix this in the library. I'm going to add Nova back to the bug so they can re-triage based on this information.

Changed in oslo.log:
status: New → Opinion
Revision history for this message
melanie witt (melwitt) wrote :

Ben, what do you mean when you say "stopped calling get_logger from oslo.log"? I couldn't find any commits in oslo.log containing get_logger with 'git log -S get_logger'.

In the conductor manager in nova, for example, we call getLogger from oslo.log [1] and then internally oslo.log calls getLogger from python's logging module [2]. Does this not show that we need to provide some option or fix in oslo.log? Apologies if I'm missing something here.

[1] https://github.com/openstack/nova/blob/f5d0d85/nova/conductor/manager.py#L22-L55
[2] https://github.com/openstack/oslo.log/blob/3b43002/oslo_log/log.py#L428-L449

Changed in nova:
status: New → Incomplete