beagle-build-index uses 100% of cpu when started from cron.daily

Bug #89487 reported by Øyvind Grønnesby
Affects: beagle (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

Binary package hint: beagle

When the cron job to build the indexes starts, beagle-build-index takes up as much CPU as it can. SIGTERM will not stop the process; it must be sent SIGKILL to actually quit.
It is not hung, but the indexing process is very resource intensive.

Version: 0.2.16-0ubuntu4

ProblemType: Bug
Date: Sat Mar 3 16:54:32 2007
DistroRelease: Ubuntu 7.04
Uname: Linux othello.oyving.org 2.6.20-9-generic #2 SMP Sun Feb 25 22:59:06 UTC 2007 x86_64 GNU/Linux

Revision history for this message
dBera (dbera-web) wrote :

1) Does it happen every time, i.e. can you reproduce it?
2) Can you run the same command manually from a terminal with an additional "--debug" argument? When it hangs, kill it and paste the last 15-20 lines from the terminal.

Revision history for this message
Øyvind Grønnesby (oyving) wrote : Re: [Bug 89487] Re: beagle-build-index hangs and must be killed

dBera wrote:
> 1) Does it happen every time, i.e. can you reproduce it?

Yes, but see below.

> 2) Can you run the same command manually from a terminal with an additional "--debug" argument? When it hangs, kill it and paste the last 15-20 lines from the terminal.

It seems that it's not spinning out of control; the debug output shows that
it is working and busy scanning the file system. But trying to kill the
indexer shows

  Debug: Shutdown Requested

and it either completely ignores the "request" or takes an incredibly
long time before it actually shuts down. This, of course, is hard to see
without any logging :-)

--
Øyvind Grønnesby

Revision history for this message
Kevin Kubasik (kkubasik) wrote : Re: beagle-build-index hangs and must be killed

This is a bug; I have the same issue.

Changed in beagle:
status: Unconfirmed → Confirmed
Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

I have had the same problem here, but I cannot reproduce it. I stopped beagle-build-index using 'killall beagle-build-index', and the logs confirm that it stopped normally. But anyway, it was eating all CPU, apparently building a new index because I have recently upgraded to Feisty.

I attach the two logs hoping this can help, but if you find a way to reproduce it, I will try. This is not a terrible bug, since the process is likely to finish successfully, but it's annoying and does not correspond to Beagle's standard behavior.

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :
Revision history for this message
dBera (dbera-web) wrote : Re: [Bug 89487] Re: beagle-build-index hangs and must be killed

> I have had the same problem here, but I cannot reproduce it. I stopped
> beagle-build-index using 'killall beagle-build-index', and the logs
> confirm that it stopped normally. But anyway, it was eating all CPU,
> apparently building a new index because I have recently upgraded to
> Feisty.
>
> I attach the two logs hoping this can help, but if you find a way to

Unfortunately beagle-build-index is different from beagled (you
attached the logs for beagled). beagled works on your home directory
and other directories containing your personal files and data;
beagle-build-index works on system/common files only.

To reproduce this, find out what is starting beagle-build-index (it's
probably one of the cron scripts). Look at the full command line (ps
auxw) to see all the parameters. Then kill beagle-build-index and run
the same command from a terminal (you might need to sudo as the
correct user). Watch the terminal and you will see which file is
causing the CPU drain.
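
A minimal sketch of those steps; the PID, user name, and indexer arguments below are placeholders to be replaced with whatever ps actually reports:

  # 1. find the running indexer and its full command line
  ps auxw | grep '[b]eagle-build-index'

  # 2. stop it, using the PID from the output above (SIGKILL if SIGTERM is ignored)
  sudo kill <PID>

  # 3. re-run the exact same command from a terminal, as the same user, with debug output
  sudo -u <user> /usr/sbin/beagle-build-index --debug <same arguments as shown by ps>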

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote : Re: beagle-build-index hangs and must be killed

Indeed, the indexing process goes fine; the only issue is that it uses 100% CPU! (Sometimes it only uses half, though.) It looks like it doesn't care about that. I think you may experience the same thing by running /etc/cron.daily/beagle-crawl-system manually.

We can hope this occurs only once and that subsequent indexing runs will be quicker. I see two solutions:
- adding a way to tell the user that the system is indexing data, e.g. via a notification icon (or a dialog), maybe allowing them to pause it
- providing a ready-made index corresponding to a standard Feisty root, which would save a lot of work on every computer since they're almost all identical; beagle would only have to adjust it (I don't know if that's possible)

In both cases, it would be good to limit CPU usage, because for now it's really annoying.

PS: could you remove the two useless attachments, since many files from my home directory are listed in them? Thanks.

Revision history for this message
dBera (dbera-web) wrote :

I don't think I can remove the attachments; I don't have the permission.

Kerry (the KDE client) has a notification system. There is no mechanism to pause it, though.

Adding a ready-made index is a good idea. Any Feisty dev listening???

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

The strange thing is that the index builder seems to remove files and re-add them to the index:

Debug: -file:///usr/share/doc/python/python-policy.html/ap-build_dependencies.html
[.....]
Debug: +file:///usr/share/doc/python/python-policy.html/ap-build_dependencies.html

This is why it always uses lots of CPU... it does not check whether the file has already been indexed!

description: updated
Revision history for this message
Nicolò Chieffo (yelo3) wrote :

If this is the situation, the script should be moved from cron.daily to cron.weekly at least, in my opinion...

Revision history for this message
dBera (dbera-web) wrote :

> the strange thing is that it seems that the index builder removes
> files and readds them into the index

Beagle is supposed to check whether a file is already indexed; if not, it prints the -file:/// and +file:/// lines and proceeds to add the file to the index. The actual file is indexed between these two lines, so the last -file:/// printed will tell you which file is currently being indexed.

So, (1) by observing the output during the 100% CPU usage, can you find out which document is taking up all the CPU? It is possibly a bug in some beagle file parser.
(2) Do you think beagle is re-adding some files that it indexed before even though they were not changed in between? Some bug could have been introduced here lately. Can you output two consecutive runs of beagle-build-index to a log file and attach it here or send it to me?
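
A minimal sketch of capturing two consecutive runs for comparison; the user name and the argument placeholder here are illustrative, use whatever the cron job actually runs:

  # run the indexer twice with identical arguments, capturing all output
  sudo -u beagleindex /usr/sbin/beagle-build-index --debug <cron arguments> > run1.log 2>&1
  sudo -u beagleindex /usr/sbin/beagle-build-index --debug <cron arguments> > run2.log 2>&1

  # -file:///.../+file:///... pairs that show up in both logs point at files being re-indexed
  grep '^Debug: [-+]file://' run1.log | head -20
  diff run1.log run2.log | less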

Revision history for this message
Nicolò Chieffo (yelo3) wrote : Re: [Bug 89487] Re: beagle-build-index uses 100% of cpu when started from cron.daily

The CPU is always used at 100%, no matter which file is indexed.
Since the indexing operation always lasts more than 20 minutes, I
think that the index is completely rebuilt every time. I'm producing
the log files now.

There are some errors when trying to index screensaver files (the
extension is .desktop).

I will attach the debug logs sooner or later.

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

OK, wait: there are two indexing processes. The first indexes
applications (and it finishes in a few minutes) and the second indexes
documentation (that one takes very long).
I'm attaching the application indexing log, as I think it is
enough to understand the bug.

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

The two files are very similar, so I think that the index is deleted and rebuilt without any check.

Revision history for this message
Nicolò Chieffo (yelo3) wrote :
Revision history for this message
dBera (dbera-web) wrote :

Actually, I would rather see the documentation one.

The application index is working OK. The reason you see files being repeated is that build-index does not have any filter (parser) for screensaver desktop files, so when the indexer gets such a file it says "Error: Could not filter file: No desktop entry" and moves on to the next file. On the next run, beagle retries to index this file (since you might have installed a new filter for screensaver desktop files in the meantime) and fails again. This goes on in every run. The index doesn't change, and the crawler has to crawl all files anyway, so nobody loses. It's also fast.

The documentation log might reveal whether some document is taking a long time to index or is actually being re-parsed over and over again.

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

So do you mean that every time, beagle retries to index the files that
it didn't manage to index the previous time, because of missing filters?

If so, this might be the problem with the documentation files: maybe
every time it tries to index those files, it doesn't have the correct
filter...

This could also be the cause of the 15 minutes that I have to wait
every time I log in before the warning in the search box disappears
(caused by beagled indexing my personal files).

I'm building the two log files now; let's hope we find something!

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

This is the command line used to start the documentation index (run as
the user beagleindex):

  /usr/sbin/beagle-build-index --target /var/cache/beagle/indexes/documentation \
      --disable-directories --recursive --allow-pattern *.xml,*.html,*.docbook \
      /usr/share/doc /usr/local/share/doc /opt/kde3/share/doc \
      /opt/gnome/share/gnome/help /usr/share/gnome/help \
      /opt/gnome/share/gtk-doc/html /usr/share/gtk-doc/html \
      /usr/share/gnome/html > documentation-index.1

This time they lasted 5 minutes each. Attaching the log files.

Revision history for this message
Nicolò Chieffo (yelo3) wrote :
Revision history for this message
Nicolò Chieffo (yelo3) wrote :
Revision history for this message
Nicolò Chieffo (yelo3) wrote :

In the cron launcher there is also a reference to a binary called "ionice" from the universe package "schedutils".
Maybe this binary lets the process be executed with less resource greed... I have not tested it yet.

Revision history for this message
dBera (dbera-web) wrote :

Yeah, ionice is supposed to lower the IO priority. Actually, if you check the first two lines of the log, you will see

  Debug: Set best effort IO priority to lowest level (7)
  Debug: Reniced process to 19

which means beagle is already trying to play fair.

Could you attach one of the recrawled XML files? I could probably get any of them from the web, but I just wanted to make sure I have the exact same one that you have. I will try to reproduce this here.
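
For reference, a rough sketch of a wrapper invocation that combines both mechanisms mentioned above (ionice and renice); this is illustrative only, with a trimmed argument list, and not necessarily the exact line used in /etc/cron.daily/beagle-crawl-system:

  # IO: best-effort class (-c2) at its lowest priority (-n7); CPU: niceness 19
  ionice -c2 -n7 nice -n 19 /usr/sbin/beagle-build-index \
      --target /var/cache/beagle/indexes/documentation --recursive /usr/share/doc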

Revision history for this message
Nicolò Chieffo (yelo3) wrote :
Revision history for this message
Nicolò Chieffo (yelo3) wrote :

There are some strange characters (non-European)... could that be it?

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

Here is another file.

Revision history for this message
dBera (dbera-web) wrote :

I figured out the problem. The docbook parser is failing to parse those files, so those files get filtered again on re-indexing. This really cannot be avoided, since users can always install updated filters before re-running build-index.

The bad part is that the docbook filter decides success or failure only after parsing the whole file. Someone needs to look at the docbook filter, but other than that, everything else is working OK.

Revision history for this message
Nicolò Chieffo (yelo3) wrote :

all right... so we just need to wait...

Changed in beagle:
importance: Undecided → Medium
Revision history for this message
Saul D Beniquez (saullawl) wrote :

Erm.. it's been over a year and nothing has been done about this...

Revision history for this message
Saul D Beniquez (saullawl) wrote :

(Sorry for the double post, clicked "save" by accident)

Have there been any updates?

Revision history for this message
Thomas Hotz (thotz-deactivatedaccount) wrote :

Thank you for taking the time to report this bug and for helping to make Ubuntu better. As far as I know, Beagle development stopped in 2010. I will close this bug report.

Changed in beagle (Ubuntu):
status: Confirmed → Invalid