[Enhancement] Full-text search for doc/docx/zip (containing htm/html)

Bug #2100891 reported by sacharja
This bug affects 1 person
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi,

It would be awesome if full-text indexing could include the formats doc/docx/zip (containing htm/html). I don't know Python, but I found the following examples, so it seems to be technically possible:

DOCX

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a Word file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\input.docx")

# Get text from the entire document
text = doc.GetText()

# Print result
print(text)
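
As an alternative that needs no third-party package, here is a minimal sketch using only the Python standard library. It relies on the fact that a DOCX file is a ZIP archive whose main body is stored in word/document.xml. The function name docx_to_text is hypothetical, and this is not how calibre itself reads DOCX; it ignores headers, footers, footnotes and text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a DOCX file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper for illustration only.)
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text within the paragraph
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text("input.docx") would return the text as a single string, comparable to what doc.GetText() returns in the example above.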

DOC (2003 format)

Maybe possible via Antiword, which is GPL-licensed and cross-platform but no longer developed:
https://en.wikipedia.org/wiki/Antiword

ZIP (containing htm / html)

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract() # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on runs of two spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
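
The example above fetches a page from a URL rather than reading a ZIP file. A minimal sketch of the ZIP case, using only the Python standard library, could look like the following. The names _TextExtractor and zip_html_to_text are hypothetical, UTF-8 is an assumed default encoding, and this is not calibre's implementation.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style elements
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def zip_html_to_text(source, encoding="utf-8"):
    """Map each .htm/.html member of a ZIP archive to its plain text.

    `source` may be a file path or a binary file-like object.
    `encoding` is an assumption; undecodable bytes are replaced.
    """
    result = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".htm", ".html")):
                parser = _TextExtractor()
                parser.feed(archive.read(name).decode(encoding, errors="replace"))
                parser.close()
                result[name] = "\n".join(parser.parts)
    return result
```

Other archive members (images, CSS, text files) are skipped, so only the HTML content would be passed to an indexer.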

Greetz

Kovid Goyal (kovid) wrote :

Full-text indexing works for any format for which calibre has an input
format plugin, that is, a format calibre can convert *from*.
DOCX is one of them already. DOC is not supported. ZIP with HTML is also supported.

Changed in calibre:
status: New → Invalid
sacharja (c3521333) wrote :

Right, DOCX is working, but HTML is definitely not being indexed.

Example attached:
1. drag & drop into Calibre (html is added via zip)
2. restart, so that it's indexed
3. full text search for "ipsum" --> no results

Kovid Goyal (kovid) wrote :

Then convert the ZIP file to EPUB and you will be fine.

Kovid Goyal (kovid) wrote :

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

Changed in calibre:
status: Invalid → Fix Released
sacharja (c3521333) wrote :

Unfortunately, this is not fixed in Calibre 8.0.1; htm/html is still not indexed.

Kovid Goyal (kovid) wrote :

Works for me. You will need to *re-add* any books you want indexed or re-run the indexing.

sacharja (c3521333) wrote :

Strange, did both on Windows:

- installed a fresh Calibre portable
- added via drag & drop: https://bugs.launchpad.net/calibre/+bug/2100891/+attachment/5862625/+files/lorem.htm
- fulltext search with completed indexing for "lorem" or "ipsum"

--> no result

Kovid Goyal (kovid) wrote :

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.
