Indexing consumes too much memory

Bug #1038178 reported by Siegfried Schweizer
This bug affects 1 person
Affects: Goobi.Presentation
Status: Fix Released
Importance: Low
Assigned to: Sebastian Meyer

Bug Description

After upgrading to 1.1.3, I tried to reindex all of my metadata and noticed extremely heavy memory consumption during that process. top tells me:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 4992 root 20 0 7164m 3.5g 2284 D 10 93.1 1:12.21 cli_dispatch.ph

Also, the output of the Python script I am using for automated remote indexing via WebDAV shows errors being thrown by PHP:

 PHP Fatal error: Out of memory (allocated 39583744) (tried to allocate 1 bytes) in /srv/www/htdocs/presentation/typo3conf/ext/dlf/common/class.tx_dlf_document.php on line 1689
 PHP Fatal error: Out of memory (allocated 37748736) (tried to allocate 72 bytes) in /srv/www/htdocs/presentation/typo3conf/ext/dlf/common/class.tx_dlf_document.php on line 1699
 PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 523800 bytes) in /srv/www/typo3_src-4.5.2/t3lib/class.t3lib_befunc.php on line 963
 PHP Fatal error: Out of memory (allocated 70778880) (tried to allocate 72 bytes) in /srv/www/htdocs/presentation/typo3conf/ext/dlf/common/class.tx_dlf_document.php on line 1805
 PHP Fatal error: Out of memory (allocated 39059456) (tried to allocate 1 bytes) in /srv/www/htdocs/presentation/typo3conf/ext/dlf/common/class.tx_dlf_document.php on line 1689
 PHP Fatal error: Out of memory (allocated 31457280) (tried to allocate 524288 bytes) in /srv/www/htdocs/presentation/typo3conf/ext/dlf/common/class.tx_dlf_document.php on line 1801

The PHP memory limit is 128 MB (134217728 bytes), which should in all cases be more than sufficient. Since nothing in my Apache/PHP and Tomcat setup changed before or after the upgrade to 1.1.3, this must be deemed a bug in the latter.

description: updated
Changed in goobi-presentation:
importance: Undecided → Critical
milestone: none → 1.1.4
assignee: nobody → Sebastian Meyer (sebastian-meyer)
Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

Maybe I should mention that this problem already existed in 1.1.3 in a similar form, but I didn't report it then because 1.1.4 was due shortly afterwards, and I assumed the problem might already be fixed in it.

Before, in 1.0.4, indexing worked very well, even when using remote metadata sources accessed via HTTP and remote Solr hosts.

Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

Sorry, 1.1.2 and 1.1.3, respectively.

Revision history for this message
Sebastian Meyer (sebastian-meyer) wrote :

I will investigate this, but it won't be in the next couple of days. Maybe you could look into this, too, and at least narrow down which methods consume that much memory?

This might also be a general TYPO3 problem, because it seems to be an issue only in backend processes. A frontend process doing the same (i.e. showing the table of contents, full metadata, navigation and images, which triggers nearly the same methods as indexing) consumes less than 32 MB of memory (according to the XHProf PHP profiler).

Revision history for this message
Sebastian Meyer (sebastian-meyer) wrote :

Note: Of course the memory consumption depends mostly on the METS file. The bigger the file, the bigger its memory footprint.

Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

Again, this problem did not exist in 1.0.4, using the very same Apache/PHP/TYPO3/whatever setup and an identical set of METS files, where indexing ~2500 files could be accomplished in about half an hour.

I'll see if I can narrow down the problem; unfortunately I can't simply go back to 1.0.4.

Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

Unfortunately, I won't be able to investigate anything prior to September 9th due to holidays.

Changed in goobi-presentation:
status: New → Incomplete
assignee: Sebastian Meyer (sebastian-meyer) → nobody
Changed in goobi-presentation:
assignee: nobody → Sebastian Meyer (sebastian-meyer)
Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

Now there's something new. I indexed a bunch of 700 METS files into a fresh index, and the problem did not occur anymore. The significant difference was the following: There was an error in our METS files for volumes of multivolume works. In these files under

./mets:structMap[@TYPE="LOGICAL"]/mets:div[@TYPE="multivolume_work"]/mets:mptr[@LOCTYPE="URL"]/xlink:href

there should normally be the URL of the "anchor" METS file, which references the main title the volume belongs to. In our case, however, it had previously been the URL of the volume's own METS file! This error was fixed in the new batch of METS files, and indexing then ran straight through without any errors.
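For reference, a check for exactly this kind of self-referencing pointer could be sketched as follows. This is a standalone illustration, not part of Goobi.Presentation; it only assumes the standard METS and XLink namespaces, and takes the serialized METS plus the volume's own URL:

```python
# Hypothetical sanity check (not part of the extension): flag volume METS
# files whose multivolume_work <mets:mptr> points back at the volume file
# itself instead of at the anchor file.
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"

def self_referencing_mptrs(mets_xml, own_url):
    """Return the mptr hrefs in the logical structMap's multivolume_work
    div that equal the volume's own URL (i.e. the broken case)."""
    root = ET.fromstring(mets_xml)
    bad = []
    for struct_map in root.iter(METS + "structMap"):
        if struct_map.get("TYPE") != "LOGICAL":
            continue
        for div in struct_map.iter(METS + "div"):
            if div.get("TYPE") != "multivolume_work":
                continue
            for mptr in div.iter(METS + "mptr"):
                href = mptr.get(XLINK + "href")
                if mptr.get("LOCTYPE") == "URL" and href == own_url:
                    bad.append(href)
    return bad
```

Run over a batch of volume files before indexing, such a check would have caught the broken anchor references up front.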

In our old GDZ-based presentation this issue did not lead to misbehaviour, but we stumbled upon it while trying to track down errors in displaying our multivolume works in the DFG Viewer.

So, could that have been the cause of the heavy memory consumption?

Revision history for this message
Sebastian Meyer (sebastian-meyer) wrote :

That's quite possible, because the indexer analyzes the logical structure map of a METS file and automatically imports each referenced parent METS file, too. So in your case this obviously led to an infinite loop, because every volume triggered the import of itself again and again.

This problem should be solved in 1.2, because it introduces a check for already indexed documents in tx_dlf_indexing.

Changed in goobi-presentation:
status: Incomplete → Invalid
Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

OK, that looks good. And what does "a check for already indexed documents" really mean? I assume that already indexed documents are being updated, in Solr as well as in the Typo3 database?

Revision history for this message
Sebastian Meyer (sebastian-meyer) wrote : AW: [Bug 1038178] Re: Indexing consumes too much memory

You are right. What I meant to say is that there is a check for already processed documents in each indexing run. So if a document gets indexed, each document referenced in the logical structMap gets indexed, too. At this point there is a new check which prevents infinite looping by checking if a document was already indexed within the same indexing run. If that's the case, the document is skipped.

Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

I see. But is this really fixed in 1.2? In 1.2b1 I still noticed the issue. My posting from 2013-02-14 refers to r157.

Revision history for this message
Siegfried Schweizer (siegfried-schweizer) wrote :

It seems to me that at least one thing is not updated when reindexing: the value in the database field tx_dlf_documents.location stays the same after the respective METS file has been reindexed via cli_dispatch.phpsh with a -doc parameter that differs from the one used in the first indexing run.

Do you deem this a bug too?

Revision history for this message
Sebastian Meyer (sebastian-meyer) wrote :

I can confirm the bug that infinite looping can still occur under certain circumstances.

Regarding the second bug: please file this separately.

Changed in goobi-presentation:
milestone: 1.1.4 → 1.2.b2
importance: Critical → Low
status: Invalid → Confirmed
Changed in goobi-presentation:
status: Confirmed → Fix Committed
Changed in goobi-presentation:
status: Fix Committed → Fix Released