Regression in 3.32.0: Search/replace not working in format conversion

Bug #1796578 reported by Jonas Christian
This bug affects 3 people
Affects: calibre
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

I upgraded to 3.32.0 and noticed that the "Search & replace" functionality in format conversion no longer works. When converting a document from PDF to EPUB, the text is passed as-is from the PDF input to the EPUB output, with no replacements applied.

Using the exact same settings on the exact same book in 3.31.0 works fine and does the search/replace as it always has.

Kovid Goyal (kovid) wrote : Re: calibre bug 1796578

Post the conversion log from the problem conversion (you can get it by
clicking the rotating jobs button in the bottom right corner of the
calibre window)

 status incomplete

Changed in calibre:
status: New → Incomplete

Jonas Christian (jonasvp) wrote :

```
Convert book 1 of 1 (The Long Descent: A User's Guide to the End of the Industrial Age)
Conversion options changed from defaults:
  cover: u'/tmp/calibre_3.32.0_tmp_2mFCNf/kayB2U.jpeg'
  verbose: 2
  output_profile: 'cybook_opus'
  read_metadata_from_opf: u'/tmp/calibre_3.32.0_tmp_2mFCNf/PFnrNH.opf'
  search_replace: '[["<hr/>\\n<a id=\\"p\\\\d+\\"></a>\\\\d+ <br>\\nThe Long Descent<br>", ""], ["<hr/>\\n<a id=\\"p\\\\d+\\"></a> <br>\\n[^<]+ <br>\\n\\\\d+<br>", ""], ["^(.{60,}?)-?<br>\\\\s+", "\\\\1"]]'
Resolved conversion options
calibre version: 3.32.0
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0.0,
 'book_producer': None,
 'change_justification': u'original',
 'chapter': u"//*[((name()='h1' or name()='h2') and re:test(., '\\s*((chapter|book|section|part)\\s+)|((prolog|prologue|epilogue)(\\s+|$))', 'i')) or @class = 'chapter']",
 'chapter_mark': u'pagebreak',
 'comments': None,
 'cover': u'/tmp/calibre_3.32.0_tmp_2mFCNf/kayB2U.jpeg',
 'debug_pipeline': None,
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_split_on_page_breaks': False,
 'duplicate_links_in_toc': False,
 'embed_all_fonts': False,
 'embed_font_family': None,
 'enable_heuristics': False,
 'epub_flatten': False,
 'epub_inline_toc': False,
 'epub_toc_at_end': False,
 'epub_version': u'2',
 'expand_css': False,
 'extra_css': None,
 'extract_to': None,
 'filter_css': u'',
 'fix_indents': True,
 'flow_size': 260,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x7fb49af3a8d0>,
 'insert_blank_line': False,
 'insert_blank_line_size': 0.5,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0.0,
 'linearize_tables': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'new_pdf_engine': False,
 'no_chapters_in_toc': False,
 'no_default_epub_cover': False,
 'no_images': False,
 'no_inline_navbars': False,
 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.CybookOpusOutput object at 0x7fb49af3ac50>,
 'page_breaks_before': u"//*[name()='h1' or name()='h2']",
 'prefer_metadata_cover': False,
 'preserve_cover_aspect_ratio': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': u'/tmp/calibre_3.32.0_tmp_2mFCNf/PFnrNH.opf',
 'remove_fake_margins': True,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': u'',
 'search_replace': '[["<hr/>\\n<a id=\\"p\\\\d+\\"></a>\\\\d+ <br>\\nThe Long Descent<br>", ""], ["<hr/>\\n<a id=\\"p\\\\d+\\"></a> <br>\\n[^<]+ <br>\\n\\\\d+<br>", ""], ["^(.{60,}?)-?<br>\\\\s+", "\\\\1"]]',
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': None,
 'sr1...
```
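The `search_replace` value in the log above is easier to read once decoded. The following is a minimal Python 3 sketch, assuming the option is a JSON-encoded list of `[pattern, replacement]` pairs, which is how it appears in the log. Only the third pair is reproduced here, and the variable names are illustrative, not calibre's.

```python
import json
import re

# Third pair of the 'search_replace' value, copied from the log's repr.
raw = '[["^(.{60,}?)-?<br>\\\\s+", "\\\\1"]]'

# Assumption: the option is plain JSON, as its appearance in the log suggests.
pairs = json.loads(raw)
pattern, replacement = pairs[0]

print(pattern)      # the regex as the regex engine would see it
print(replacement)  # the replacement template

# Compiling confirms the decoded pattern is a valid Python regex.
compiled = re.compile(pattern)
```

Decoding strips one level of escaping, so `\\\\s` in the log corresponds to `\s` in the regex itself.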

Kovid Goyal (kovid) wrote :

I tried it with a PDF file in my library, works for me. Attach the PDF file demonstrating/reproducing the problem to this bug report. You can do that by clicking the "Add attachment or patch" link at the bottom of the bug's page. If the file you are attaching is copyrighted, mark the bug as private. You can do this by clicking the tiny yellow icon next to "This report contains Public information" in the top right area of the bug's page.

 status incomplete

LEONARDO TREVISAN LOMBARDI (ltlombardi) wrote :

This is happening to me too. I tried both the 32-bit and 64-bit builds of v3.32, with the same result: in a PDF to MOBI conversion, regex search and replace does not work. I downgraded to V3.22 and it works fine.

Jonas Christian (jonasvp) wrote :

It doesn't seem to depend on the PDF at all. Attached is one example of a document where it fails. I tried the most current version (3.33.1) and it still doesn't work, so I'm staying on 3.31.0 for now.

Kovid Goyal (kovid) wrote :

Works for me with that PDF. I tried the search expression:

limiting

and replace

XXX

The word "limiting" was replaced as expected. Can you also post a search/replace expression that fails with the PDF?

Jonas Christian (jonasvp) wrote :

Ok, this is interesting. I can reproduce that your replacement works, also using regular expressions such as "l.miting". The replacement I'm testing is replacing "^(.{50,})<br>" with "\1" in order to remove line breaks on long lines. That does not work from 3.32.0 onwards - the line breaks are still there. It works in 3.31.0.
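For reference, the replacement described above can be tried outside calibre. This is a minimal Python 3 sketch, not calibre's code: the sample text is made up, and the use of `re.MULTILINE` (so that `^` matches at the start of every line) is an assumption about how the pattern is meant to be applied.

```python
import re

# Hypothetical sample resembling PDF-derived HTML; not taken from the attached file.
html = (
    "This is a long line of text extracted from a PDF page that keeps going<br>\n"
    "short line<br>\n"
)

# The expression from the comment: a line of 50+ characters ending in <br>.
pattern = r"^(.{50,})<br>"

# Assumption: MULTILINE mode, so ^ anchors at each line start, not only at
# the start of the whole string.
result = re.sub(pattern, r"\1", html, flags=re.MULTILINE)

print(result)
```

In this sketch the `<br>` is dropped from the long line only; the short line is under 50 characters and is left alone.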

Kovid Goyal (kovid) wrote :

Does it work in the wizard (click the magic wand icon next to the search
field)?

Jonas Christian (jonasvp) wrote :

Yes, it works in the wizard (all versions).

Kovid Goyal (kovid) wrote :

I don't see how it could possibly work in the wizard and not in the actual conversion. And I cannot reproduce the failure on my Linux system. I'll test on Windows as well, when I am on a Windows computer.

Kovid Goyal (kovid) wrote :

I tried it on my Windows machine as well with the above file and search and replace expressions, and the line breaks were successfully removed. I'm afraid without some means to reproduce the issue, there is not much I can do, sorry.

Changed in calibre:
status: Incomplete → Invalid

Jonas Christian (jonasvp) wrote :

Sorry to keep bothering you about this but the problem persists up until the current version 3.36.0. For what it's worth, I'm on Ubuntu 18.04.

I tried narrowing it down and I think the culprit is trying to match an HTML tag. For instance, matching "limiting" or "l.m.ting" works fine but trying to match "<b>The </b>" (right after the title) works _only_ in the wizard, not when actually converting.

As I said, it worked up until 3.31.0. Was there a change in escaping the angle brackets or something like that?

I'd be very grateful if you could have another look. Let me know if there's anything I can do to help.

Kovid Goyal (kovid) wrote :

As I said, I cannot replicate the issue. Without some way to replicate the issue, I have no way to help.

Jonas Christian (jonasvp) wrote :

What exactly have you tried replicating? From your comment above it seems you only tried search/replace on a word, not on an HTML tag.

Also, could you point me to the general area in the code where the conversion happens? I could try having a look myself.

Kovid Goyal (kovid) wrote :

I tried replicating it with the <br> search expression and the file that was posted. Search and replace happens in preprocess.py.

Mike Bayer (zzzeek) wrote :

Hi there -

I'm having exactly the same problem, using 3.36 on Fedora Linux. The wizard successfully finds all of the page numbers I'm looking for, of the form "\d+ <br>", and highlights them in yellow. But when I run the conversion to EPUB, the search and replace does nothing at all: all the markup that the wizard claimed would be matched is unaffected.

Mike Bayer (zzzeek) wrote :

When I run with debug, I can see that what is shown in the wizard as:

2 <br>

looks in the file input/index.html as:

2&#160;<br>

I tried using the regex \d+\&#160;<br> instead, which I tested in Python 2 to make sure the interpreter matches it, which it does, but this works neither in the wizard nor in the output.
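The mismatch between the two forms can be illustrated in isolation. The following is a minimal Python 3 sketch, not calibre's code: it takes the two strings as quoted above and adds a hypothetical, more tolerant pattern that accepts a plain space, the `&#160;` character reference, or a literal non-breaking space (U+00A0). Whether the wizard shows a plain space or a non-breaking one is not known from this report, so both are covered.

```python
import re

wizard_text = "2 <br>"       # as displayed in the wizard, per the comment
file_text = "2&#160;<br>"    # as found in input/index.html, per the comment
nbsp_text = "2\u00a0<br>"    # the same markup with the entity already decoded

# The original expression: digits, one plain space, then <br>.
plain = re.compile(r"\d+ <br>")

# Hypothetical tolerant variant: accept any of the three spellings.
tolerant = re.compile(r"\d+(?:&#160;|\xa0| )<br>")

for text in (wizard_text, file_text, nbsp_text):
    print(repr(text), bool(plain.search(text)), bool(tolerant.search(text)))
```

The plain pattern only matches the first string, which is consistent with an expression that matches in one view of the document but not in another.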

Is it possible that calibre's dependencies, such as the Python version or some other third-party library in use, can affect its behavior here, such that you're not able to reproduce it?

Kovid Goyal (kovid) wrote :

Use the official calibre binaries, not the distro calibre package, and
you will be fine. They come with all needed dependencies and are actually
up-to-date as well.
