file:read-text-lines() blocking

Bug #921458 reported by William Candillon
This bug affects 1 person
Affects: Zorba
Status: Fix Released
Importance: Critical
Assigned to: Matthias Brantner
Milestone: (none shown)

Bug Description

I wrote the following query:
import module namespace file ="http://expath.org/ns/file";

for $line at $i in file:read-text-lines("doc.xml")
return
  if($i lt 1104869) then () else concat($line, "
")

Here, doc.xml is a large document.
The query never seems to finish, and its memory footprint is huge.
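
For reference, a minimal C++ sketch of what a streaming evaluation of the query above would amount to: read one line at a time, count it, and emit only the lines at or beyond the threshold, so memory use is bounded by the longest line rather than by the file size. The function name copy_lines_from and its signature are hypothetical; they are not part of Zorba or of the EXPath file module.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not a Zorba API): copies every line of `in` whose
// 1-based position is >= first_kept to `out`, followed by "\n".
// Only the current line is held in memory. Returns the number of lines written.
std::size_t copy_lines_from(std::istream& in, std::ostream& out,
                            std::size_t first_kept) {
    assert(first_kept >= 1);  // positions are 1-based, like "at $i" in the query
    std::string line;
    std::size_t position = 0;
    std::size_t written = 0;
    while (std::getline(in, line)) {
        ++position;
        if (position < first_kept) continue;  // the "then ()" branch
        out << line << '\n';                  // the concat($line, newline) branch
        ++written;
    }
    return written;
}
```

Called with first_kept set to 1104869, this mirrors the condition in the query: lines before that position are skipped and the rest are written out as they are read.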

Changed in zorba:
importance: Undecided → High
milestone: none → 2.2
William Candillon (wcandillon) wrote :
summary: - file:read-text-lines() not streaming?
+ file:read-text-lines() blocking
description: updated
Changed in zorba:
importance: High → Critical
Matthias Brantner (matthias-brantner) wrote :

Paul, could you please investigate why William's program doesn't stream after applying the patch below?
I understand why it didn't stream in the trunk (i.e. with fn:tokenize) but using the string:split function, there should be nothing that prevents it from streaming. I guess the problem is somewhere in the implementation of the split function (in src/runtime/strings/strings_impl.cpp).

=== modified file 'modules/org/expath/ns/file.xq'
--- modules/org/expath/ns/file.xq 2011-10-19 05:09:31 +0000
+++ modules/org/expath/ns/file.xq 2012-02-01 20:39:39 +0000
@@ -23,6 +23,8 @@
  :)
 module namespace file = "http://expath.org/ns/file";

+import module namespace string = "http://www.zorba-xquery.com/modules/string";
+
 import schema namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
 declare namespace ann = "http://www.zorba-xquery.com/annotations";
 declare namespace ver = "http://www.zorba-xquery.com/options/versioning";
@@ -424,8 +426,7 @@
   $encoding as xs:string
 ) as xs:string*
 {
- let $content := file:read-text($file, $encoding)
- return fn:tokenize($content, "\n")
+ string:split(file:read-text($file, $encoding), "f")
 };

 (:~
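
To illustrate what the patch is aiming for, here is a rough C++ sketch of a pull-based splitter that reads its input in fixed-size chunks and hands back one token per call, instead of loading the whole file and tokenizing afterwards. It is only an illustration under stated assumptions: StreamingSplitter, its constructor parameters, and the choice to return the text after the last separator as a final (possibly empty) token are all hypothetical, and this is not the code in src/runtime/strings/strings_impl.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pull-based splitter (not Zorba code). It keeps only the
// unconsumed tail of the input in memory, never the whole stream.
class StreamingSplitter {
public:
    StreamingSplitter(std::istream& in, std::string sep,
                      std::size_t chunk_size = 4096)
        : in_(in), sep_(std::move(sep)), io_(chunk_size) {
        assert(!sep_.empty() && chunk_size > 0);
    }

    // Stores the next token in `token`. Returns false once the input is
    // exhausted and the final token has already been handed out.
    bool next(std::string& token) {
        if (done_) return false;
        for (;;) {
            const std::size_t pos = buf_.find(sep_, search_from_);
            if (pos != std::string::npos) {
                token.assign(buf_, 0, pos);
                buf_.erase(0, pos + sep_.size());
                search_from_ = 0;
                return true;
            }
            // No separator yet: skip the part already searched, but keep
            // enough overlap to catch a separator that straddles two chunks.
            search_from_ = buf_.size() >= sep_.size()
                               ? buf_.size() - sep_.size() + 1
                               : 0;
            in_.read(io_.data(), static_cast<std::streamsize>(io_.size()));
            const std::streamsize got = in_.gcount();
            if (got <= 0) {
                // End of input: whatever is left is the last token.
                done_ = true;
                token.swap(buf_);
                buf_.clear();
                return true;
            }
            buf_.append(io_.data(), static_cast<std::size_t>(got));
        }
    }

private:
    std::istream& in_;
    std::string sep_;
    std::vector<char> io_;   // reusable read buffer
    std::string buf_;        // text read but not yet returned
    std::size_t search_from_ = 0;
    bool done_ = false;
};
```

A consumer that calls next() in a loop and discards each token after use holds at most one token plus one chunk in memory, which is the property being asked of file:read-text-lines() here.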

Changed in zorba:
assignee: nobody → Paul J. Lucas (paul-lucas)
William Candillon (wcandillon) wrote :

Even without streaming, this program doesn't seem to work.

Paul J. Lucas (paul-lucas) wrote :

In the original query, where does the value of $i come from?

Matthias Brantner (matthias-brantner) wrote :

for $line at _$i_

Paul J. Lucas (paul-lucas) wrote :

It doesn't stream because the StreamableString is materializing the string. Sticking an assert() in materialize() shows the stack trace:

#0 0x00007fff8157482a in __kill ()
#1 0x00007fff830e0a9c in abort ()
#2 0x00007fff831135de in __assert_rtn ()
#3 0x000000010aa19df4 in zorba::simplestore::StreamableStringItem::materialize (this=0x7fc67a6865a0) at atomic_items.cpp:1772
#4 0x000000010aa1d596 in zorba::simplestore::StreamableStringItem::getStringValue2 (this=0x7fc67a6865a0, val=@0x7fff68c3b018) at atomic_items.cpp:1693
#5 0x000000010a1e137e in zorba::FnTokenizeIterator::nextImpl (this=0x7fc67a6aede0, result=@0x7fff68c3be80, planState=@0x7fc67a679940) at strings_impl.cpp:1706
#6 0x000000010a1d2e2c in zorba::Batcher<zorba::FnTokenizeIterator>::produceNext (this=0x7fc67a6aede0, result=@0x7fff68c3be80, planState=@0x7fc67a679940) at plan_iterator.h:531
#7 0x000000010a248815 in zorba::PlanIterator::consumeNext (result=@0x7fff68c3be80, iter=0x7fc67a6aede0, planState=@0x7fc67a679940) at plan_iterator.cpp:124
#8 0x0000000109ec835a in zorba::FunctionTraceIterator::nextImpl (this=0x7fc67a682b10, result=@0x7fff68c3be80, aPlanState=@0x7fc67a679940) at other_diagnostics_impl.cpp:43
#9 0x00000001093df3c8 in zorba::Batcher<zorba::FunctionTraceIterator>::produceNext (this=0x7fc67a682b10, result=@0x7fff68c3be80, planState=@0x7fc67a679940) at plan_iterator.h:531
#10 0x000000010a248815 in zorba::PlanIterator::consumeNext (result=@0x7fff68c3be80, iter=0x7fc67a682b10, planState=@0x7fc67a679940) at plan_iterator.cpp:124
#11 0x000000010a346b54 in zorba::UDFunctionCallIterator::nextImpl (this=0x7fc67a6a17c0, result=@0x7fff68c3be80, planState=@0x7fc67a679a20) at fncall_iterator.cpp:490
#12 0x000000010a350804 in zorba::Batcher<zorba::UDFunctionCallIterator>::produceNext (this=0x7fc67a6a17c0, result=@0x7fff68c3be80, planState=@0x7fc67a679a20) at plan_iterator.h:531
#13 0x000000010a248815 in zorba::PlanIterator::consumeNext (result=@0x7fff68c3be80, iter=0x7fc67a6a17c0, planState=@0x7fc67a679a20) at plan_iterator.cpp:124
#14 0x000000010a310ccb in zorba::flwor::FLWORIterator::bindVariable (this=0x7fc67a6e6b50, varNo=0, iterState=0x7fc67a675c50, planState=@0x7fc67a679a20) at flwor_iterator.cpp:1199
#15 0x000000010a31162a in zorba::flwor::FLWORIterator::nextImpl (this=0x7fc67a6e6b50, result=@0x7fff68c3c478, planState=@0x7fc67a679a20) at flwor_iterator.cpp:948
#16 0x000000010a31c3ce in zorba::Batcher<zorba::flwor::FLWORIterator>::produceNext (this=0x7fc67a6e6b50, result=@0x7fff68c3c478, planState=@0x7fc67a679a20) at plan_iterator.h:531
#17 0x000000010a248815 in zorba::PlanIterator::consumeNext (result=@0x7fff68c3c478, iter=0x7fc67a6e6b50, planState=@0x7fc67a679a20) at plan_iterator.cpp:124
#18 0x000000010a246e91 in zorba::PlanWrapper::next (this=0x7fc67b002660, result=@0x7fff68c3c478) at plan_wrapper.cpp:149
#19 0x000000010916e306 in zorba::serializer::serialize (this=0x7fff68c3c808, aObject=@0x7fff68c3c730, aOStream=@0x7fff72b43f70, aHandler=0x0) at serializer.cpp:2257
#20 0x000000010916e5d5 in zorba::serializer::serialize (this=0x7fff68c3c808, aObject=@0x7fff68c3c8d8, aOStream=@0x7fff72b43f70) at serializer.cpp:2215
#21 0x000000010909fe8f in zorba::XQueryImpl::se...
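
The trace above shows a tokenize call reaching materialize() through getStringValue2(). A minimal C++ sketch of that pattern follows, assuming a string item that wraps a stream and buffers it whole only when asked for its value. LazyStreamString and its members are hypothetical names, not Zorba's StreamableStringItem; the materialized() flag plays the role of the assert() used to catch the call.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a streamable string item (not Zorba code).
class LazyStreamString {
public:
    explicit LazyStreamString(std::istream& in) : in_(in) {}

    // Streaming access: the caller reads incrementally, nothing is buffered.
    std::istream& stream() { return in_; }

    // Materializing access: pulls everything still unread into memory.
    const std::string& value() {
        materialize();
        return value_;
    }

    // Lets a caller check whether some code path forced materialization.
    bool materialized() const { return materialized_; }

private:
    void materialize() {
        if (materialized_) return;
        std::ostringstream all;
        all << in_.rdbuf();  // copies the rest of the stream in one go
        value_ = all.str();
        materialized_ = true;
    }

    std::istream& in_;
    std::string value_;
    bool materialized_ = false;
};
```

Any consumer that goes through value() pays for the whole remaining input at once, while a consumer that sticks to stream() does not, which is the distinction the stack trace is being used to check.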

Paul J. Lucas (paul-lucas) wrote :

Oh wait... you said *with* the patch. It would help if one of you attached the patch to the bug as a file.

Paul J. Lucas (paul-lucas) wrote :

As an aside: it's not clear to me why tokenize() would need to materialize the string. It shouldn't need to call getStringValue2().

Paul J. Lucas (paul-lucas) wrote :

*With* William's patch, the query works for me and the string is never materialized (where x.xq is William's query):

$ time bin/zorba -f -q /tmp/x.xq
# ... output elided ...

real 0m58.719s
user 0m57.956s
sys 0m0.483s

It does run to completion, in just under a minute. That is still far slower than it should be, considering:

$ time wc -l /tmp/doc.xml
 1798219 /tmp/doc.xml

real 0m0.085s
user 0m0.069s
sys 0m0.016s

Paul J. Lucas (paul-lucas) wrote :

Oh, since the string isn't materialized, it *must* be streaming, so it's not clear there's anything to fix. Assigning back to you.

Changed in zorba:
assignee: Paul J. Lucas (paul-lucas) → Matthias Brantner (matthias-brantner)
Changed in zorba:
assignee: Matthias Brantner (matthias-brantner) → William Candillon (wcandillon)
William Candillon (wcandillon) wrote :

I cannot find the patch you're mentioning. Where can I find it?

William Candillon (wcandillon) wrote :

The problem is that I can't get the query to finish, but I seem to be the only one with that problem?

Changed in zorba:
milestone: 2.2 → 2.5
Changed in zorba:
assignee: William Candillon (wcandillon) → Matthias Brantner (matthias-brantner)
Changed in zorba:
status: New → In Progress
Changed in zorba:
status: In Progress → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released