Overview

Summary: Search for Python code (libraries) for extracting text from file attachments and/or implement a pure Python filter
Count: 1
Label: Research

Short Description

Currently we use external programs for several mimetypes to create the search index for xapian search (see MoinMoin.filter). By using them we add extra dependencies, and these programs might not be easily available on every platform moin runs on.

Thus we would like to replace those external binary programs by python code we can bundle with moin, to get rid of those dependencies and to make installation of moin easier.

This applies, for example, to application/msword, application/pdf and text/rtf. Just have a look at the MoinMoin.filter package - we want to either replace the code there that uses external programs or write new pure Python filters.

Document every piece of Python code that could do the job.

pyPdf sounds like it could be used to replace pdftotext. If you like, you could try to write a new filter, or use pyPdf instead of pdftotext to implement the pdf filter.

You have to deliver a list (at least 10 entries, in wiki markup) of Python libraries/modules that qualify to be used for this. For every piece of code you find, briefly describe how it could be used.

We estimate that this task takes 10 hours of work, and you must complete it within 7 days.

Detailed Description

Discussion


Answer from tuxella:

excel

After some research on the internet, it appears that two Python libraries can open .xls files: xlrd and pyExcelerator.

It seems that xlrd is compatible with more versions of the Excel file format (according to the documentation) and is less crash prone than pyExcelerator when it finds quirks in files. I have written a filter based on this library:

application_vnd_ms_excel.py - Moreover, xlrd is pure Python. The filter is based on the example shipped with xlrd, so it might work quite well.
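A minimal sketch of the core of such a filter, using the documented xlrd API (the attached filter may look different; the function name here is only for illustration):

import xlrd

def extract_text(filename):
    # Collect every non-empty cell value of the workbook as one text blob.
    book = xlrd.open_workbook(filename)
    chunks = []
    for sheet in book.sheets():
        for row in range(sheet.nrows):
            for col in range(sheet.ncols):
                value = sheet.cell_value(row, col)
                if value not in (None, ''):
                    chunks.append(unicode(value))
    return u' '.join(chunks)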

word

Here is another issue... the Word file format is definitely more complex than the Excel one, and there isn't any pure Python library that can read and extract text from a .doc file. However, libgsf, used by AbiWord and KWord, is able to parse any OLE2 file. There seem to be Python bindings in the package (--with-python in the configure script and a python directory), but nothing functional. Moreover, according to what I found while googling, these bindings were written a long time ago for some tests and haven't been maintained since, which could explain why I didn't manage to get anything working...

pdf

As advised in the issue description, I began using pyPdf, and anyway there isn't any other working library. I have written a filter for this file format too. But when I tried it on various files, it turned out there are still some problems with the extractText method (as the documentation itself admits). For instance, I tried a file in which the text is embedded in a frame, and in that case the text isn't extracted. Moreover, pyPdf is pure Python. So, here is the filter (definitely simpler than the xls one):

application_pdf.py
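For reference, a minimal sketch of what the extraction boils down to with pyPdf (the attached filter may differ; note the extractText() limitation described above):

from pyPdf import PdfFileReader

def extract_text(filename):
    # Concatenate whatever text pyPdf can recover from each page.
    reader = PdfFileReader(file(filename, 'rb'))
    pages = []
    for pagenum in range(reader.getNumPages()):
        # extractText() misses text in some constructs, e.g. frames (see above)
        pages.append(reader.getPage(pagenum).extractText())
    return u'\n'.join(pages)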


Some questions:

rtf

There isn't any Python library that can read RTF files. However, the RTF file format isn't very complicated, so with just a few regexps (slightly more complicated than the one used in the OOo filter) we could remove the formatting from RTF documents and then extract the text. It would require getting the list of formatting tokens from the Microsoft web page and then removing the keywords that appear between braces, possibly followed by a parameter.
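A very rough sketch of that regexp idea (it ignores escaped characters, \uN unicode control words and destination groups such as the font table, so a real filter would need to handle the full token list):

import re

def rtf_to_text(rtf):
    # Strip control words like \par or \fs24 (optional numeric parameter, optional trailing space).
    text = re.sub(r'\\[a-zA-Z]+-?[0-9]* ?', '', rtf)
    # Strip control symbols such as \' or \~, then the remaining group braces.
    text = re.sub(r'\\[^a-zA-Z]', '', text)
    return re.sub(r'[{}]', '', text)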

others

hachoir-metadata

How to use it:
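A minimal usage sketch, assuming the createParser / extractMetadata / exportPlaintext API of the hachoir packages:

from hachoir_parser import createParser
from hachoir_metadata import extractMetadata

def metadata_text(filename):
    # hachoir expects unicode filenames; return an empty string if nothing could be parsed
    parser = createParser(unicode(filename))
    if parser is None:
        return u''
    metadata = extractMetadata(parser)
    if metadata is None:
        return u''
    return u'\n'.join(metadata.exportPlaintext())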

pros:

cons:

Supported file formats (33 in total):

Archive

Audio

Container

Image

Misc

Program

Video

kaa-metadata

Syntax: info = kaa.metadata.parse(fileName)
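For example (the file name is made up; parse() returning None for unrecognized formats is an assumption based on typical usage):

import kaa.metadata

info = kaa.metadata.parse('example.avi')
if info is not None:
    # printing the object dumps whatever metadata kaa-metadata found
    print info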

pros:

cons:

Supported file formats: 27 in total.

extractor

These are the Python bindings for the GNU libextractor project. They are pretty straightforward to use too.
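A usage sketch; the Extractor class and extract() method names are assumptions based on the old libextractor 0.5.x bindings, and the file name is made up:

import extractor

xtract = extractor.Extractor()
# extract() is assumed to return (keyword_type, keyword) pairs
for keyword_type, keyword in xtract.extract('example.pdf'):
    print keyword_type, keyword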

Pros:

Cons:

Supported file formats (28 in total):

HTML, PDF, PS, OLE2 (DOC, XLS, PPT), Open Office (sxw), Star Office (sdw), DVI, MAN, MP3 (ID3v1 and ID3v2), NSF (NES Sound Format), SID, OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, REAL, RIFF (AVI), MPEG, QT and ASF.

Text files

http://chardet.feedparser.org/ is a library that tries to guess the encoding of a file. It returns the guessed encoding and the confidence you can place in its prediction. It's a pure Python port of the algorithm used in the Mozilla browser.

Here is an example of how you can use it to guess the encoding of a file:

from chardet.universaldetector import UniversalDetector

def guessEncoding(filename):
    # Feed the file line by line until the detector is confident enough.
    detector = UniversalDetector()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result

Typical results look like:

{'confidence': 1.0, 'encoding': 'ascii'}
{'confidence': 0.99, 'encoding': 'utf-8'}
{'confidence': 0.9355, 'encoding': 'windows-1251'}
{'confidence': 1.0, 'encoding': 'UTF-16BE'}

So it is then easy to decide whether or not to use the encoding guessed by this library.

Some content copied here from the Google GHOP tracker

I had already looked at these candidates, but none of them deserve to be in my report:
