apache-log-parser should parse one file at a time

Bug #622765 reported by Michael Nelson
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Undecided
Michael Nelson

Bug Description

With 10.07 we fixed bug 588288 which allowed us to set the maximum number of lines of (each) log file that will be parsed.

Initially we'd thought this would help us solve the memory issue when running the parser against the backlog of ppa access logs, but after trialling with logparser_max_parsed_lines set to 100, the PPA log parser still had to be killed as it consumed too much memory.

Going back to the code, there are a lot of other improvements that could be made. One that stands out is that currently *all* log files with new lines to parse are opened during get_files_to_parse(), being returned in a dict. This means that when we run the parser against the backlog of ppa access files, there are over 2600 files being opened at once. It would be great to instead use a generator, and limit the number of lines processed to all files, rather than for each file.

Note: there is also a comment related to the librarian logfile parser in the docstring at:

cronscripts/parse-librarian-apache-access-logs.py:

which, applying it to the PPA log file parser, implies that we could additionally update the script to clear the storm cache
(store._cache.clear()) at some regular interval (such as after each file is processed). This will reduce the benefit of the cache of course, but will limit the amount of ram storm consumes during the process.

Also, with this knowledge, we can QA such a change on dogfood or locally by simply copying the log file that we have many times before running the script.

Related branches

description: updated
Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Revision history for this message
William Grant (wgrant) wrote :

I'm not sure that the given QA strategy is correct: the log parser infrastructure identifies files by their first line, not their filename.

Revision history for this message
Michael Nelson (michael.nelson) wrote : Re: [Bug 622765] Re: apache-log-parser should parse one file at a time

On Tue, Aug 24, 2010 at 12:13 AM, William Grant <email address hidden> wrote:
> I'm not sure that the given QA strategy is correct: the log parser
> infrastructure identifies files by their first line, not their filename.
>

Yes, we'd need to copy and modify the first line to ensure all the
files were held open. Thanks William.

Benji York (benji)
Changed in launchpad-foundations:
assignee: nobody → Benji York (benji)
status: New → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
milestone: none → 10.09
tags: added: qa-needstesting
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Revision history for this message
Michael Nelson (michael.nelson) wrote :

Hi Benji! Thanks for converting files_to_parse() to consume a generator, that will be a big help.

Note: AFAICS, this won't yet ensure that the logparser_max_parsed_lines config option will be respected? Currently, if logparser_max_parsed_lines is set to 1000, then 1000 lines will be parsed from *each* logfile with lines to parse, rather than in total. I'd meant to make that more obvious in the description.

Revision history for this message
Benji York (benji) wrote : Re: [Bug 622765] Re: apache-log-parser should parse one file at a time

On Mon, Aug 30, 2010 at 4:07 AM, Michael Nelson
<email address hidden> wrote:
> Note: AFAICS, this won't yet ensure that the logparser_max_parsed_lines
> config option will be respected? Currently, if
> logparser_max_parsed_lines is set to 1000, then 1000 lines will be
> parsed from *each* logfile with lines to parse, rather than in total.
> I'd meant to make that more obvious in the description.

Yeah, I totally didn't get that from the bug description. :)

Here's a change that makes logparser_max_parsed_lines respected across
multiple files:
https://code.edge.launchpad.net/~benji/launchpad/bug-622765/+merge/34077
--
Benji York

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Wonderful, thanks Benji!

Revision history for this message
Gary Poster (gary) wrote :

Julian and I agreed last week that Foundations would do the desired changes, and the Soyuz team would do the QA on dogfood. We're ready for that QA. :-)

Changed in launchpad-foundations:
assignee: Benji York (benji) → Michael Nelson (michael.nelson)
Revision history for this message
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Revision history for this message
Julian Edwards (julian-edwards) wrote :

It seems ok on dogfood, albeit with some odd output

"Finished parsing <closed file '/srv/launchpad.net/ppa-apache-logs/ppa.dogfood.launchpad.net.access-log', mode 'r' at 0xa0fff50>"

for example. I've pasted a few runs of the code below.

launchpad@mawson:/srv/launchpad.net/codelines/trunk$ cronscripts/parse-ppa-apache-access-logs.py -vv
2010-09-07 14:13:49 DEBUG Cronscript control file not found at /srv/launchpad.net/codelines/trunk/cronscripts.ini
2010-09-07 14:13:49 DEBUG Enabled by DEFAULT section
2010-09-07 14:13:49 INFO Creating lockfile: /var/lock/launchpad-parse-ppa-apache-logs.lock
2010-09-07 14:14:28 INFO Parsed 1000000 lines resulting in 20 download stats.
2010-09-07 14:14:29 INFO Finished parsing <closed file '/srv/launchpad.net/ppa-apache-logs/ppa.dogfood.launchpad.net.access-log', mode 'r' at 0xa0fff50>
2010-09-07 14:14:29 INFO Done parsing apache log files
2010-09-07 14:14:29 DEBUG Removing lock file: /var/lock/launchpad-parse-ppa-apache-logs.lock
launchpad@mawson:/srv/launchpad.net/codelines/trunk$ cronscripts/parse-ppa-apache-access-logs.py -vv
2010-09-07 14:14:41 DEBUG Cronscript control file not found at /srv/launchpad.net/codelines/trunk/cronscripts.ini
2010-09-07 14:14:41 DEBUG Enabled by DEFAULT section
2010-09-07 14:14:41 INFO Creating lockfile: /var/lock/launchpad-parse-ppa-apache-logs.lock
2010-09-07 14:15:17 INFO Parsed 1000000 lines resulting in 20 download stats.
2010-09-07 14:15:18 INFO Finished parsing <closed file '/srv/launchpad.net/ppa-apache-logs/ppa.dogfood.launchpad.net.access-log', mode 'r' at 0xa0fff50>
2010-09-07 14:15:18 INFO Done parsing apache log files
2010-09-07 14:15:18 DEBUG Removing lock file: /var/lock/launchpad-parse-ppa-apache-logs.lock

tags: added: qa-ok
removed: qa-needstesting
Revision history for this message
Benji York (benji) wrote : Re: [Bug 622765] Re: apache-log-parser should parse one file at a time

On Tue, Sep 7, 2010 at 10:16 AM, Julian Edwards
<email address hidden> wrote:
> It seems ok on dogfood, albeit with some odd output
>
> "Finished parsing <closed file '/srv/launchpad.net/ppa-apache-
> logs/ppa.dogfood.launchpad.net.access-log', mode 'r' at 0xa0fff50>"

Yeah, I thought that was an odd thing to log too, but didn't want to
remove it in case it was important to someone.

I suspect a bug opened about the odd output would get addressed before
too long.
--
Benji York

Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.