xtim
Thursday, March 26, 2009
 
All a bit complicated
Lots of work going on at the moment with the text extraction from a troublesome file. I need to keep notes to keep my head straight...hello blog!

The goal

Get our text extraction to work with this file just as it does with other PDFs.

The problem

Copy and paste of the text from within Acrobat works fine.

Copy and paste of the text from within Preview generates high-ascii gibberish.

The wordmap generation tool generates intelligible text - but the character order is backwards. tsom ddo.

Investigation

The file's using a Type 3 font. This confuses OS X's PDFKit, so we will have to use a hand-rolled text processor no matter what.

The glyph names in the font are /40, /41, /42 etc. I *think* this is outside the PDF spec, which expects the glyph names to be from the Adobe Standard Encoding or the Symbol font. We'll let that pass because Acrobat interprets them so we should too.

The file sets the TextMatrix to reverse characters before they're drawn. Don't know why, but that's what's confusing the word map generator. I can only assume that the glyph shapes within the font are themselves reversed and rotated so that it all cancels out - the bounding boxes seem to bear this out as they're mainly below the line.
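
To keep the geometry straight in my head, here's a minimal Python sketch of how a mirrored text matrix reverses the order in which glyphs land on the page. The matrix values, the `apply_matrix` helper and the fixed advance of 10 units are all made up for illustration; none of them are copied from the troublesome file.

```python
def apply_matrix(m, x, y):
    """Map (x, y) through a PDF-style matrix [a b c d e f]."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

# Hypothetical text matrix: mirror the x axis, then move to (100, 700).
mirrored = (-1, 0, 0, 1, 100, 700)

# Glyphs in the order the content stream shows them, each advancing
# 10 units in text space (made-up values).
stream_order = "tsom"
placed = []
for i, ch in enumerate(stream_order):
    x, y = apply_matrix(mirrored, i * 10, 0)
    placed.append((x, y, ch))

# Sorting by page-space x recovers the left-to-right reading order.
reading_order = "".join(ch for x, y, ch in sorted(placed))
print(reading_order)  # most
```

If something like this is what the file is doing, a word map generator that trusts stream order will come out backwards, while one that sorts by position on the page should not.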

I have no idea why the file's structured like that.

The fix

We will have to process the text using our own tools, built from the PDF stream upwards. This includes extracting the linear text and generating the word / link maps too. It'll be good to bring these all together into a single codebase, but as always we could do with more time...

Progress

The new text processor extracts the correct Unicode text from the file (by assuming for example that /47 = character 47 in Adobe's standard encoding).
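
Roughly, that assumption looks like the sketch below. It's Python rather than our real code, `glyph_name_to_unicode` is a made-up name, and reading the digits as decimal is a guess on my part (hence the `base` parameter). In the printable range, StandardEncoding matches ASCII apart from the two quote characters, so that's all the table needs.

```python
# Codes where StandardEncoding differs from ASCII in the printable range.
STANDARD_ENCODING_EXCEPTIONS = {
    39: "\u2019",  # quoteright
    96: "\u2018",  # quoteleft
}

def glyph_name_to_unicode(name, base=10):
    """Turn a numeric glyph name such as '/47' into a Unicode string.

    Assumes the digits are a character code in StandardEncoding; the base
    is a guess (decimal by default). Returns None for anything unmapped.
    """
    digits = name.lstrip("/")
    try:
        code = int(digits, base)
    except ValueError:
        return None
    if code in STANDARD_ENCODING_EXCEPTIONS:
        return STANDARD_ENCODING_EXCEPTIONS[code]
    if 32 <= code <= 126:
        return chr(code)
    return None

print(glyph_name_to_unicode("/47"))  # prints "/"
```

Anything outside the printable range comes back as `None`, so the caller can decide whether to skip it or give up on the font.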

Because we now need to generate the wordmaps from our own tool too, we will need to use the font metrics specified in the PDF file (currently the new text processor just uses a system font, so the positions won't line up).

I've implemented a Font object which pulls the glyph widths out of the font's definition in the PDF and also generates the appropriate toUnicodeMap.
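
For the record, a simple font's dictionary carries /FirstChar, /LastChar and a /Widths array, and the width for a character code sits at index code minus FirstChar. A minimal Python sketch of that lookup, assuming the dictionary has already been parsed into a plain dict (the `glyph_width` name, the `default` fallback and the sample numbers are all hypothetical):

```python
def glyph_width(font, code, default=0.0):
    """Look up the advance width for a single-byte character code."""
    first = font["FirstChar"]
    last = font["LastChar"]
    if first <= code <= last:
        return font["Widths"][code - first]
    return default

# Made-up font dictionary, as a parser might hand it over.
font = {"FirstChar": 40, "LastChar": 43, "Widths": [500, 500, 620, 480]}
print(glyph_width(font, 42))  # 620
```

The widths are in glyph space, so for a Type 3 font they still need to go through the font's own matrix before they mean anything on the page.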

It doesn't currently calculate the bounding box for each glyph, just assumes it's a square with sides the length of the glyph's advance.
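
In other words, something like this Python sketch. `naive_bbox` is a made-up name, and having the square sit on the baseline with its left edge at the glyph origin is my assumption for illustration.

```python
def naive_bbox(x, y, advance):
    """Square box with its lower-left corner at the glyph origin (x, y).

    Each side is as long as the glyph's advance. Returns
    (left, bottom, right, top).
    """
    return (x, y, x + advance, y + advance)

print(naive_bbox(100, 700, 10))  # (100, 700, 110, 710)
```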

Type 0 fonts are not yet supported, just "simple" fonts where each glyph is keyed by a single byte.

We fall back to the PDFKit approach when the new processor encounters something it doesn't understand, except for Form XObjects which we will need to handle but don't yet.

Next

Retrieve the actual bounding box of each glyph as we're accumulating the text. The rotation/reflection of this font means that most of the glyph is below the line and so our naïvely-calculated bounding boxes are some way out.
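
A Type 3 glyph description can declare its own box with the d1 operator, so one plausible route is to take that box and push it through whatever matrices are in force. Because a reflection or rotation swaps which corner is which, the safe approach is to transform all four corners and take the extremes. Here's a minimal Python sketch; the helper names, the flip matrix and the glyph box are invented for illustration.

```python
def apply_matrix(m, x, y):
    """Map (x, y) through a PDF-style matrix [a b c d e f]."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def transformed_bbox(m, bbox):
    """Axis-aligned box enclosing bbox after mapping it through matrix m."""
    llx, lly, urx, ury = bbox
    corners = [apply_matrix(m, x, y) for x in (llx, urx) for y in (lly, ury)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical matrix: flip the glyph vertically and place it at (100, 700).
flip = (1, 0, 0, -1, 100, 700)

# Hypothetical glyph box: 10 wide and 8 tall before the flip.
print(transformed_bbox(flip, (0, 0, 10, 8)))  # (100, 692, 110, 700)
```

With the flip in place the box runs from 692 up to the baseline at 700, which is the "mostly below the line" behaviour a square perched on the baseline gets wrong.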

Then run the debug version of the extractor on a few test pages, making sure that the drawn text boxes are correctly aligned with the text.

Once happy with that, get the TextBoxes to track individual words so we can kick out a wordmap file.
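
A first stab at the word tracking might look like this Python sketch: walk the glyphs in reading order, grow a box around each run of non-space characters, and close the word off at whitespace. The `words_from_glyphs` name and the (char, box) tuple shape are hypothetical, and it says nothing about what the wordmap file itself should look like.

```python
def words_from_glyphs(glyphs):
    """Group (char, bbox) pairs into (word, bbox) pairs, splitting on spaces.

    Each bbox is (left, bottom, right, top); a word's box is the union of
    the boxes of its glyphs.
    """
    words = []
    text, box = "", None
    for ch, (x0, y0, x1, y1) in glyphs:
        if ch.isspace():
            # Whitespace ends the current word, if there is one.
            if text:
                words.append((text, box))
            text, box = "", None
            continue
        text += ch
        if box is None:
            box = (x0, y0, x1, y1)
        else:
            box = (min(box[0], x0), min(box[1], y0),
                   max(box[2], x1), max(box[3], y1))
    if text:
        words.append((text, box))
    return words

# Made-up glyph positions.
glyphs = [
    ("h", (0, 0, 5, 10)),
    ("i", (5, 0, 8, 10)),
    (" ", (8, 0, 10, 10)),
    ("x", (10, -2, 15, 8)),
]
print(words_from_glyphs(glyphs))
```

Splitting only on space characters won't be enough if a file positions words without emitting spaces, so a gap threshold between boxes would probably be needed as well.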

T
