PDFMiner has evolved into a terrific tool. It provides access to PDF files at the lowest level, allowing direct control over the extraction of data. Combined with document-writing, recognition, and image manipulation tools, as well as a little math, it delivers much of the power of commercial tools for all but the most complex tasks. I plan on writing later about the use of OCR, Harris corner detection, and contour analysis in OpenCV, homebrew code, and Tesseract.
However, there is little documentation beyond basic extraction, and no Python package listing. The methods are discoverable but not listed in full. In fact, the existing documentation consists mainly of examples, despite the many different modules and classes designed to complete a multitude of tasks. The aim of this article, the first in a hopefully two-part series, is to help with the extraction of information. The next step is the creation of a PDF document using a tool such as pisa or ReportLab, since PDFMiner only performs extraction.
There are several imports that will nearly always be used for document extraction. All are under the main pdfminer package. The list of imports can get quite large.
Some imports are meant to perform extraction and others are meant to check for and support the extraction of different types.
Visual Representation of Outputs
The following image is taken from pdfminer’s limited documentation.
Source: Carleton University
Imports for Extraction
The following table goes over the imports that perform actions on the document.
|Import|Description|
|---|---|
|PDFParser|The parser class, normally passed to the PDFDocument, that helps obtain elements from the document.|
|PDFResourceManager|Helps with aggregation. Performs some tasks between the interpreter and the device.|
|PDFPageInterpreter|Obtains PDF objects for extraction. Together with the PDFResourceManager, it turns page objects into instructions for the device.|
|PDFPageAggregator|A device that takes in LAParams and a PDFResourceManager and returns the layout of individual pages.|
|PDFDocument|Holds the parser and allows direct actions to be taken on the document.|
|PDFDevice|The base device class, which receives the interpreter's instructions and turns them into output.|
|LAParams|Holds the layout analysis parameters used during extraction.|
Imports that act as Types
Other imports serve as types to check against, giving access to the properties of the other classes. The PDFDocument contains a variety of PDF objects, each of which holds its own information. That information includes the type, the coordinates, and the text displayed in the document. Images can also be handled.
The objects include:
These types are useful for pulling information from tables, as explained by Julian Todd.
Creating the Document
Creating the document requires instantiating each of the parts present in the diagram above. The order for setting up the document is to create a parser and then the document with the parser. The resource manager and LAParams are accepted as arguments by the PDFPageAggregator device used for this task. The PDFPageInterpreter accepts the aggregator and the resource manager. This setup is typical of all parsers and also forms part of the PDF writers.
Writing output to an in-memory StringIO buffer makes extraction run more quickly; in Python 2, the cStringIO variant is implemented in C.
The get_result() method returns the layout of the page that was just processed. The results are passed to the ParsePage function. For pure text extraction, the StringIO method .getvalue() can be used instead.
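The setup order described in this section can be sketched as follows. This is a minimal sketch rather than the original code from this article: the function name iter_page_layouts is made up for illustration, and the module paths assume the post-2013 package layout (also used by the pdfminer.six fork).

```python
def iter_page_layouts(path, password=""):
    """Yield the analyzed layout of each page in the PDF at `path`."""
    # Imported inside the function so that merely defining it does not
    # require pdfminer to be installed. Paths assume the post-2013 layout.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as handle:
        parser = PDFParser(handle)                # 1. the parser
        document = PDFDocument(parser, password)  # 2. the document, built on the parser
        rsrcmgr = PDFResourceManager()            # 3. shared resource manager
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())  # 4. the device
        interpreter = PDFPageInterpreter(rsrcmgr, device)         # 5. the interpreter
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            yield device.get_result()             # layout of the page just processed
```

Each yielded layout can then be handed to a page-parsing function, one page at a time.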
The PDF Text Based Objects
The layout received from get_result() parses the strings into separate objects. These objects have several key components. They are the type, the coordinates (startingx, startingy, endingx, endingy), and the content.
The type can be found using type(object) and compared to the class directly (e.g. type(object) == LTRect). For a rectangle, the comparison to LTRect returns True.
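The difference between an exact type comparison and an inheritance-aware check matters here, because in pdfminer LTRect derives from LTCurve. The sketch below uses small stand-in classes with the same names so that it runs without pdfminer; they are illustrative only, and the helper describe is hypothetical.

```python
class LTCurve:
    """Stand-in for illustration, not the real pdfminer.layout.LTCurve."""
    def __init__(self, bbox):
        self.bbox = bbox  # (starting x, starting y, ending x, ending y)


class LTRect(LTCurve):
    """Stand-in; mirrors the fact that pdfminer's LTRect subclasses LTCurve."""


def describe(obj):
    """Return the type name and coordinates of a layout object."""
    x0, y0, x1, y1 = obj.bbox
    return "%s from (%s, %s) to (%s, %s)" % (type(obj).__name__, x0, y0, x1, y1)


rect = LTRect((10, 20, 110, 70))
print(type(rect) == LTRect)       # exact type comparison
print(type(rect) == LTCurve)      # type() ignores inheritance
print(isinstance(rect, LTCurve))  # isinstance() follows inheritance
print(describe(rect))
```

Use the exact comparison when subclasses must be excluded, and isinstance() when any object of a family (for example every curve-like shape) should match.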
Getting the Output
Output is obtained and parsed through a series of method calls. The following example shows how to extract content.
This code treats the object stack as a plain list, which already provides a pop method. Python does have a collections package (import collections) with more specialized data structures such as deque, but a list is flexible enough here.
This example is a modification of Julian Todd’s code since I could not find solid documentation for pdfminer. It takes the objects from the layout, reverses them since they are placed in the layout as if it were a stack, and then iterates down the stack, finding anything with text and expanding it or taking text lines and adding them to the list that stores them.
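The traversal described above can be sketched with stand-in classes so that it runs without a PDF at hand. TextLine and Container are hypothetical substitutes for pdfminer's text line and aggregate objects (pages, figures, text boxes); this is not Julian Todd's code, only the same stack idea.

```python
class TextLine:
    """Hypothetical stand-in for a pdfminer text line object."""
    def __init__(self, bbox, text):
        self.bbox = bbox
        self._text = text

    def get_text(self):
        return self._text


class Container:
    """Hypothetical stand-in for an aggregate (page, figure, text box)."""
    def __init__(self, children):
        self._objs = list(children)

    def __iter__(self):
        return iter(self._objs)


def collect_text_lines(layout):
    """Walk the layout with an explicit stack and return its text lines in order."""
    tcols = []
    stack = list(reversed(list(layout)))  # reversed so pop() yields the first object
    while stack:
        obj = stack.pop()
        if isinstance(obj, Container):
            stack.extend(reversed(list(obj)))  # expand the aggregate in place
        elif isinstance(obj, TextLine):
            tcols.append(obj)
    return tcols


page = Container([
    TextLine((50, 700, 120, 712), "Title\n"),
    Container([
        TextLine((50, 680, 120, 692), "first\n"),
        TextLine((50, 660, 120, 672), "second\n"),
    ]),
    TextLine((50, 640, 120, 652), "footer\n"),
])
for line in collect_text_lines(page):
    print(line.bbox, repr(line.get_text()))
```

Reversing before pushing keeps the document order intact, since a list used as a stack pops from the end.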
The resulting list (tcols) looks much like other pure extractions that can be performed in a variety of tools, including Java's PDFBox, pypdf, and even pdfminer itself. However, each object also carries its position in bbox (the bounding box coordinate list), and its text is accessible from .get_text().
Images are handled using the LTImage type, which has a few attributes in addition to coordinates and data. The image contains bits, colorspace, height, imagemask, name, srcsize, stream, and width.
Extracting an image works as follows:
PDFMiner only seems to extract JPEG objects. However, xpdf extracts all images.
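A minimal sketch of saving a JPEG image, assuming the image object exposes name and stream as listed above and that the stream offers get_rawdata(), as pdfminer's stream objects do. The function name save_jpeg and the magic-byte check are illustrative choices, not part of PDFMiner.

```python
import os

JPEG_MAGIC = b"\xff\xd8"  # every JPEG file starts with these two bytes


def save_jpeg(image, folder):
    """Write a JPEG-encoded image object to `folder`.

    Returns the written path, or None when the raw stream is not JPEG data.
    """
    raw = image.stream.get_rawdata()  # undecoded stream bytes
    if not raw or not raw.startswith(JPEG_MAGIC):
        return None
    path = os.path.join(folder, image.name + ".jpg")
    with open(path, "wb") as out:
        out.write(raw)
    return path
```

For JPEG images the raw stream is already a complete JPEG file, which is why no decoding step is needed; other encodings would need conversion first.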
A more automated and open source solution would be to use subprocess.Popen() to call a Java program that extracts images to a specific or provided folder.
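A sketch of the subprocess.Popen() approach follows. The jar name image-extractor.jar and its two positional arguments are hypothetical; substitute the command line of whichever Java extractor is actually used.

```python
import subprocess

# Hypothetical command; replace with the real Java image extractor.
DEFAULT_COMMAND = ("java", "-jar", "image-extractor.jar")


def run_extractor(pdf_path, out_dir, command=DEFAULT_COMMAND):
    """Run an external extractor and return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(
        list(command) + [pdf_path, out_dir],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = proc.communicate()  # wait for the program to finish
    return proc.returncode, stdout, stderr
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.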
Handling the Output
Handling the output is fairly simple and forms the crux of this article's benefits, besides combining a variety of resources in a single place.
Just iterate down the stack and pull out the objects as needed. It is possible to reconstruct the entire structure using the coordinates. The bounding box allows objects to be placed into a new data structure in the same positions they occupy in the PDF document. With some analysis, generic algorithms are possible. Given the lack of documentation, it may be a good idea to experiment with some code first.
The following extracts specific columns of an existing PDF. The bounding box list/array is set up as follows: bbox[0] is the starting x coordinate, bbox[1] is the starting y coordinate, bbox[2] is the ending x coordinate, and bbox[3] is the ending y coordinate.
This code uses Python's list comprehension. The reason for the inequalities is that slight differences exist in the placement of objects. The newline escape character represents an underline in this case.
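The column selection can be sketched as below. Line is a hypothetical stand-in for a text line object with a bbox and text; the coordinates, the tolerance value, and the function name column_text are invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a text line: bbox is (x0, y0, x1, y1).
Line = namedtuple("Line", ["bbox", "text"])


def column_text(lines, x_start, x_end, tolerance=2.0):
    """Return the text of lines lying within a column, top of page first."""
    picked = [
        line for line in lines
        # Inequalities with a tolerance absorb small placement differences.
        if line.bbox[0] >= x_start - tolerance
        and line.bbox[2] <= x_end + tolerance
    ]
    # PDF coordinates start at the bottom left, so larger y means higher up.
    picked.sort(key=lambda line: -line.bbox[1])
    return [line.text.strip() for line in picked]


lines = [
    Line((50.4, 700.0, 120.0, 712.0), "Name\n"),
    Line((200.1, 700.0, 260.0, 712.0), "Amount\n"),
    Line((49.8, 680.0, 118.5, 692.0), "Widget\n"),
    Line((199.7, 680.0, 255.0, 692.0), "12.50\n"),
]
print(column_text(lines, 50.0, 120.0))
print(column_text(lines, 200.0, 260.0))
```

The tolerance is a judgment call: too small and slightly offset rows are dropped, too large and neighbouring columns bleed together.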
Pure Text Extraction
In order to see how to perform pure text extraction and move to a better understanding of the code, analyze the following code.
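A minimal sketch of pure text extraction into a StringIO buffer. It assumes the pdfminer.six fork running on Python 3 and the post-2013 module layout; under the original Python 2 release the buffer would come from the StringIO or cStringIO module instead. The function name extract_text is made up for this example.

```python
from io import StringIO


def extract_text(path, password=""):
    """Return all text of the PDF at `path` as a single string."""
    # Imported inside the function so that merely defining it does not
    # require pdfminer to be installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    buffer = StringIO()
    rsrcmgr = PDFResourceManager()
    # TextConverter writes the text of each processed page into the buffer.
    device = TextConverter(rsrcmgr, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as handle:
        document = PDFDocument(PDFParser(handle), password)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
    device.close()
    return buffer.getvalue()  # everything written so far
```

Compared with the aggregator-based approach, this discards coordinates and returns only the text.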
PDFMiner is a useful tool that can read PDFs along with their actual formatting. The tool is flexible and makes strings easy to control. Extracting data is made much easier compared to some full text analysis, which can produce garbled and misplaced lines. Not all PDFs are made equal.
Python PDF parser and analyzer
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
- Written entirely in Python. (for version 2.6 or newer)
- Parse, analyze, and convert PDF documents.
- PDF-1.7 specification support. (well, almost)
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- PDF to HTML conversion (with a sample converter web app).
- Outline (TOC) extraction.
- Tagged contents extraction.
- Reconstruct the original layout by grouping text chunks.
PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.
Where to Ask
How to Install
Install Python 2.6 or newer. (Python 3 is not supported.)
Download the PDFMiner source.
For CJK languages
In order to process CJK languages, you need to take an additional step during installation:
On Windows machines which don’t have the make command, paste the following commands on a command line prompt:
Command Line Tools
PDFMiner comes with two handy tools:
pdf2txt.py extracts text contents from a PDF file. It extracts all the text that is to be rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images that would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when access is restricted. You cannot extract any text from a PDF document which does not have extraction permission.
Note: Not all characters in a PDF can be safely converted to Unicode.
Specifies the output format. The following formats are currently supported.
text: TEXT format. (Default)
html: HTML format. Not recommended for extraction purposes because the markup is messy.
xml: XML format. Provides the most information.
tag: “Tagged PDF” format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (See §10.7 “Tagged PDF”).
These are the parameters used for layout analysis. In an actual PDF file, text portions might be split into several chunks in the middle of their running, depending on the authoring software. Therefore, text extraction needs to splice text chunks. In the figure below, two text chunks whose distance is closer than the char_margin (shown as M) are considered continuous and get grouped into one. Also, two lines whose distance is closer than the line_margin (L) are grouped as a text box, which is a rectangular area that contains a “cluster” of text portions. Furthermore, it may be required to insert blank characters (spaces) as necessary if the distance between two words is greater than the word_margin (W), as a blank between words might not be represented as a space, but indicated by the positioning of each word.
Each value is specified not as an actual length, but as a proportion of the length to the size of each character in question. The default values are M = 2.0, L = 0.5, and W = 0.1, respectively.
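The role of char_margin and word_margin can be illustrated with a deliberately simplified, one-dimensional sketch. This is not PDFMiner's actual layout algorithm: it only looks at horizontal gaps on a single line, measures each margin against the width of the current character, and the function name join_chars is invented for the example.

```python
def join_chars(chars, char_margin=2.0, word_margin=0.1):
    """Group characters on one line into text chunks.

    `chars` is a list of (text, x0, x1) tuples sorted from left to right.
    """
    chunks = []
    current = ""
    prev_x1 = None
    for text, x0, x1 in chars:
        width = x1 - x0
        if prev_x1 is None:
            current = text
        else:
            gap = x0 - prev_x1
            if gap >= char_margin * width:
                # Too far apart to be continuous: start a new chunk.
                chunks.append(current)
                current = text
            else:
                if gap > word_margin * width:
                    # Close enough to group, far enough to imply a space.
                    current += " "
                current += text
        prev_x1 = x1
    if current:
        chunks.append(current)
    return chunks


# Characters are 10 units wide; the gaps are 0, 3, and 27 units.
sample = [("a", 0, 10), ("b", 10, 20), ("c", 23, 33), ("d", 60, 70)]
print(join_chars(sample))
```

Because both margins are proportions of the character size, the same settings behave consistently for small and large fonts.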
Specifies how the page layout should be preserved. (Currently only applies to HTML format.)
exact: preserve the exact location of each individual character (a large and messy HTML).
normal: preserve the location and line breaks in each text block. (Default)
loose: preserve the overall location of each text block.
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it’s also possible to extract some meaningful contents (such as images).
-i options are accepted.
-p options are accepted. Note that page numbers start at one, not zero.
Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified.
With the -r option, the “raw” stream contents are dumped without decompression. With the -b option, the decompressed contents are dumped as a binary blob. With the -t option, the decompressed contents are dumped in a text format, similar to the output of repr(). When the -b option is given, no stream header is displayed for the ease of saving it to a file.
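The difference between the three stream output modes can be illustrated without dumppdf.py itself. The sketch below assumes a stream compressed with zlib (the Flate filter commonly used in PDFs); the sample content is invented and the variable names are only labels for the three views.

```python
import zlib

content = b"BT /F1 12 Tf (Hello) Tj ET"  # invented sample page content

raw = zlib.compress(content)   # bytes as stored in the file (the "raw" view)
binary = zlib.decompress(raw)  # decompressed binary blob
text = repr(binary)            # printable text form, escaped like repr()

print(len(raw), len(binary))
print(text)
```

The raw view is what must be saved for formats that are already complete files, while the decompressed views are the ones useful for reading page content.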
- 2014/03/28: Further bugfixes.
- 2014/03/24: Bugfixes and improvements for faulty PDFs. API changes:
PDFDocument.initialize() method is removed and no longer needed. A password is given as an argument of a PDFDocument constructor.
- 2013/11/13: Bugfixes and minor improvements. As of November 2013, there were a few changes made to the PDFMiner API prior to October 2013. This is the result of code restructuring. Here is a list of the changes:
PDFDocument class is moved to
PDFDocument class now takes a PDFParser object as an argument.
PDFPage class is moved to
process_pdf function is implemented as
- 2013/10/22: Sudden resurge of interests. API changes. Incorporated a lot of patches and robust handling of broken PDFs.
- 2011/05/15: Speed improvements for layout analysis.
- 2011/05/15: API changes.
- 2011/04/20: API changes. LTPolygon class was renamed as LTCurve.
- 2011/04/20: LTLine now represents horizontal/vertical lines only. Thanks to Koji Nakagawa.
- 2011/03/07: Documentation improvements by Jakub Wilk. Memory usage patch by Jonathan Hunt.
- 2011/02/27: Bugfixes and layout analysis improvements. Thanks to fujimoto.report.
- 2010/12/26: A couple of bugfixes and minor improvements. Thanks to Kevin Brubeck Unhammer and Daniel Gerber.
- 2010/10/17: A couple of bugfixes and minor improvements. Thanks to standardabweichung and Alastair Irving.
- 2010/09/07: A minor bugfix. Thanks to Alexander Garden.
- 2010/08/29: A couple of bugfixes. Thanks to Sahan Malagi, pk, and Humberto Pereira.
- 2010/07/06: Minor bugfixes. Thanks to Federico Brega.
- 2010/06/13: Bugfixes and improvements on CMap data compression. Thanks to Jakub Wilk.
- 2010/04/24: Bugfixes and improvements on TOC extraction. Thanks to Jose Maria.
- 2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.
- 2010/03/22: Improved layout analysis. Added regression tests.
- 2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.
- 2010/02/27: Changed the way of internal layout handling. (LTTextItem -> LTChar)
- 2010/02/15: Several bugfixes. Thanks to Sean.
- 2010/02/13: Bugfix and enhancement. Thanks to André Auzi.
- 2010/02/07: Several bugfixes. Thanks to Hiroshi Manabe.
- 2010/01/31: JPEG image extraction supported. Page rotation bug fixed.
- 2010/01/04: Python 2.6 warning removal. More doctest conversion.
- 2010/01/01: CMap bug fix. Thanks to Winfried Plappert.
- 2009/12/24: RunLengthDecode filter added. Thanks to Troy Bollinger.
- 2009/12/20: Experimental polygon shape extraction added. Thanks to Yusuf Dewaswala for reporting.
- 2009/12/19: CMap resources are now the part of the package. Thanks to Adobe for open-sourcing them.
- 2009/11/29: Password encryption bug fixed. Thanks to Yannick Gingras.
- 2009/10/31: SGML output format is changed and renamed as XML.
- 2009/10/24: Charspace bug fixed. Adjusted for 4-space indentation.
- 2009/10/04: Another matrix operation bug fixed. Thanks to Vitaly Sedelnik.
- 2009/09/12: Fixed rectangle handling. Able to extract image boundaries.
- 2009/08/30: Fixed page rotation handling.
- 2009/08/26: Fixed zlib decoding bug. Thanks to Shon Urbas.
- 2009/08/24: Fixed a bug in character placing. Thanks to Pawan Jain.
- 2009/07/21: Improvement in layout analysis.
- 2009/07/11: Improvement in layout analysis. Thanks to Lubos Pintes.
- 2009/05/17: Bugfixes, massive code restructuring, and simple graphic element support added. setup.py is supported.
- 2009/03/30: Text output mode added.
- 2009/03/25: Encoding problems fixed. Word splitting option added.
- 2009/02/28: Robust handling of corrupted PDFs. Thanks to Troy Bollinger.
- 2009/02/01: Various bugfixes. Thanks to Hiroshi Manabe.
- 2009/01/17: Handling a trailer correctly that contains both /XrefStm and /Prev entries.
- 2009/01/10: Handling Type3 font metrics correctly.
- 2008/12/28: Better handling of word spacing. Thanks to Christian Nentwich.
- 2008/09/06: A sample pdf2html webapp added.
- 2008/08/30: ASCII85 encoding filter support.
- 2008/07/27: Tagged contents extraction support.
- 2008/07/10: Outline (TOC) extraction support.
- 2008/06/29: HTML output added. Reorganized the directory structure.
- 2008/04/29: Bugfix for Win32. Thanks to Chris Clark.
- 2008/04/27: Basic encryption and LZW decoding support added.
- 2008/01/07: Several bugfixes. Thanks to Nick Fabry for his vast contribution.
- 2007/12/31: Initial release.
- 2004/12/24: Start writing the code out of boredom...
- PEP-8 and PEP-257 conformance.
- Better documentation.
- Better text extraction / layout analysis. (writing mode detection, Type1 font file analysis, etc.)
- Crypt stream filter support. (More sample documents are needed!)
Terms and Conditions
(This is so-called MIT/X License)
Copyright (c) 2004-2013 Yusuke Shinyama <yusuke at cs dot nyu dot edu>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Yusuke Shinyama (yusuke at cs dot nyu dot edu)