[Enhancement] Full-text search for doc/docx/zip (containing htm/html)
| Affects | Status | Importance | Assigned to | Milestone | |
|---|---|---|---|---|---|
| calibre | Fix Released | Undecided | Unassigned | | |
Bug Description
Hi,
It would be awesome if full-text indexing could include the formats doc/docx/zip (containing htm/html). I don't know Python, but I found the following examples, so it seems to be technically possible:
DOCX
from spire.doc import *
from spire.doc.common import *
# Create a Document object
doc = Document()
# Load a Word file
doc.LoadFromFil
# Get text from the entire document
text = doc.GetText()
# Print result
print(text)
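The snippet above depends on the third-party Spire.Doc package. Since a DOCX file is a ZIP archive whose main text lives in `word/document.xml`, a rough standard-library-only sketch is also possible. The function name `docx_to_text` is made up for illustration, and this is not how calibre itself reads DOCX; it only reads the main body and ignores headers, footers and footnotes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a DOCX file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a DOCX file, one paragraph per line.

    `source` is a path or a binary file-like object (hypothetical helper,
    not part of any library). Only word/document.xml is read.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

Calling `print(docx_to_text("some.docx"))` would print the paragraphs of a local file. Documents with text boxes nest paragraphs inside paragraphs, so their text may appear twice with this simple approach.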
DOC (2003 format)
Maybe possible via Antiword (GPL, cross-platform, no longer developed):
https:/
ZIP (containing htm / html)
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://
html = urlopen(url).read()
soup = BeautifulSoup(html, features=
# kill all script and style elements
for script in soup(["script", "style"]):
script.
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
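The snippet above fetches a page from a URL and needs BeautifulSoup. For HTML files sitting inside a local ZIP, a minimal sketch using only the standard library could look like the following. The names `_TextExtractor` and `zip_html_to_text` are invented for this example, UTF-8 is assumed as the default encoding, and none of this reflects calibre's own implementation.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def zip_html_to_text(source, encoding="utf-8"):
    """Map each .htm/.html member of a ZIP archive to its plain text."""
    texts = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".htm", ".html")):
                parser = _TextExtractor()
                raw = archive.read(name)
                parser.feed(raw.decode(encoding, errors="replace"))
                parser.close()
                texts[name] = "\n".join(parser.parts)
    return texts
```

The result is a dictionary keyed by member name, so each HTML file could be indexed separately. Members with other extensions are ignored.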
Greetz

Full-text indexing works for any format for which calibre has an input
format plugin, that is, any format calibre can convert *from*.
DOCX is already one of them, and ZIP files containing HTML are supported as well. DOC is not supported.