xtim
Tuesday, March 03, 2009
 
Catching up elsewhere
Got as far as item 3 on the test plan: the text scanner now falls back to the old PDFKit approach whenever the new system can't handle the supplied PDF. Support for the WinAnsi and MacRoman encodings turned out to be required for pretty much every file, so both are in there now.
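
As a rough illustration of the fallback idea (not the actual scanner code), here is a minimal Python sketch. The names UnsupportedPDF, decode_text, extract_text and legacy_extract are all hypothetical, and Python's cp1252 and mac_roman codecs are only used as close approximations of the PDF WinAnsiEncoding and MacRomanEncoding tables, which differ in a few code points.

```python
class UnsupportedPDF(Exception):
    """Raised when the new scanner meets something it cannot handle."""


# Assumption: Python's built-in codecs stand in for the PDF encoding tables.
# They are close to, but not identical with, the tables in the PDF spec.
SUPPORTED_ENCODINGS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def decode_text(raw: bytes, encoding_name: str) -> str:
    """Decode a string operand using one of the supported encodings."""
    codec = SUPPORTED_ENCODINGS.get(encoding_name)
    if codec is None:
        raise UnsupportedPDF(f"no support for {encoding_name}")
    # cp1252 leaves a few bytes undefined, so replace rather than fail.
    return raw.decode(codec, errors="replace")


def extract_text(raw: bytes, encoding_name: str, legacy_extract) -> str:
    """Try the new path first; fall back to the old extractor if it bails."""
    try:
        return decode_text(raw, encoding_name)
    except UnsupportedPDF:
        return legacy_extract(raw)
```

The point of raising a dedicated exception is that any part of the new scanner can bail at any depth, and the caller only needs one place to route the file back to the old extractor.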

Other wrinkles: some titles embed all their page text within a Form XObject. This isn't parsed as part of the PDF's page content stream, so we don't get to see that text. We can work around this by creating a new scanner for each Form XObject, but we're not doing that yet. Instead we bail and fall back on PDFKit when we detect one of these.
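
One way the detection step could look, sketched in Python under some loud assumptions: the page's content stream is already decompressed, and the page resources have been reduced to a plain dict mapping XObject names to their /Subtype (both hypothetical inputs, as is the function name). XObjects are painted with the `Do` operator, so the sketch looks for `/Name Do` and checks whether the name refers to a Form.

```python
import re

# Matches a name operand followed by the Do operator, e.g. b"/Fm0 Do".
# Simplification: a literal string that happens to contain "/Fm0 Do"
# would also match; a real tokenizer would skip string operands.
DO_OPERATOR = re.compile(rb"/([^\s/\[\]()<>]+)\s+Do\b")


def uses_form_xobject(content_stream: bytes, xobject_subtypes: dict) -> bool:
    """Return True if the stream paints any XObject whose subtype is Form."""
    for match in DO_OPERATOR.finditer(content_stream):
        name = match.group(1).decode("ascii", errors="replace")
        if xobject_subtypes.get(name) == "Form":
            return True
    return False
```

Image XObjects use the same operator, which is why the subtype lookup is needed before bailing.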

Page Rotation: when a PDFPage specifies an output rotation (e.g. rotate the page 90 degrees before display), it confuses our text blocks, as their notion of top-to-bottom and left-to-right isn't rotated. Needs fixing, as this affects our existing text extraction.
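
A possible fix is to map each text block into display space before sorting. The Python sketch below is only an illustration: it assumes the rotation is a clockwise multiple of 90 degrees, that the page box starts at (0, 0), and that coordinates have the origin at bottom-left with y pointing up. The function names are hypothetical.

```python
def rotate_point(x, y, width, height, rotate):
    """Map a point on the unrotated page to its displayed position.

    width and height describe the unrotated page; rotate is clockwise degrees.
    """
    rotate %= 360
    if rotate == 0:
        return (x, y)
    if rotate == 90:
        return (y, width - x)
    if rotate == 180:
        return (width - x, height - y)
    if rotate == 270:
        return (height - y, x)
    raise ValueError("rotation must be a multiple of 90 degrees")


def rotate_rect(rect, width, height, rotate):
    """Rotate a block (x0, y0, x1, y1) and re-normalise its corners."""
    x0, y0, x1, y1 = rect
    ax, ay = rotate_point(x0, y0, width, height, rotate)
    bx, by = rotate_point(x1, y1, width, height, rotate)
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))
```

With the blocks in display space, top-to-bottom and left-to-right mean what the reader sees, so the ordering logic itself doesn't need to know about rotation.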

The font issue (we're only using one font when we calculate text positions on the page) is turning out to be more important than I thought. The longer lines in multi-column layouts reduce the width of the vertical gaps in the layout. In turn this means our block slicer is more likely to choose horizontal cuts and get the reading order wrong. This is difficult to detect, so we can't just fall back on PDFKit automatically.
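
To make the failure mode concrete, here is a toy Python sketch. It assumes the slicer works roughly like a recursive XY-cut that compares the widest vertical gutter with the widest horizontal gap and cuts along the larger one; that rule, the function names, and the numbers are all assumptions made for illustration, not a description of the real slicer.

```python
def widest_gap(intervals):
    """Return the widest empty gap between merged 1-D intervals (0 if none)."""
    best = 0.0
    reach = None
    for start, end in sorted(intervals):
        if reach is not None and start > reach:
            best = max(best, start - reach)
        reach = end if reach is None else max(reach, end)
    return best


def choose_cut(boxes):
    """Pick a cut direction for line boxes given as (x0, y0, x1, y1)."""
    gutter = widest_gap([(x0, x1) for x0, _, x1, _ in boxes])
    leading = widest_gap([(y0, y1) for _, y0, _, y1 in boxes])
    return "vertical" if gutter >= leading else "horizontal"


# Two columns of two lines each, measured with the right font metrics.
accurate = [
    (50, 700, 290, 712), (50, 670, 290, 682),
    (320, 700, 560, 712), (320, 670, 560, 682),
]
# Same layout, but the left column's lines are measured 20 units too wide.
overestimated = [
    (50, 700, 310, 712), (50, 670, 310, 682),
    (320, 700, 560, 712), (320, 670, 560, 682),
]
```

In this toy layout the gutter shrinks from 30 units to 10 once the left column is over-measured, which drops it below the 18-unit gap between lines, so the cut flips from vertical to horizontal and the two columns would be read across instead of down.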

Checking everything in, but not yet enabling the new extractor. There are a few other projects which require attention.

T
