xtim: 08/01/2009 - 09/01/2009

Tuesday, August 11, 2009

Type 0 fonts
are finally handled natively within our text extractor. Previously we'd fall back to PDFKit's text extraction whenever we came across one of these beasts, but one of the titles we're working on at the moment breaks when we do that.

We rely on the position of the characters on the page when deciding where to break words/lines/blocks in the extracted text, and PDFKit doesn't always report them reliably. In this case it was reporting that characters butted up against each other even when they were in consecutive words.
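For anyone wondering what "relying on the position" looks like in practice, here's a rough sketch of the idea in Python. It is not our extractor's code: the `Glyph` type, the `join_with_word_breaks` function and the `gap_ratio` threshold are all made up for illustration, and it assumes the glyphs of a single line arrive sorted left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in page units (hypothetical field)
    width: float  # advance width of the glyph, in page units (hypothetical field)


def join_with_word_breaks(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Rebuild one line of text, adding a space wherever the horizontal gap
    between two neighbouring glyphs is larger than gap_ratio times the width
    of the left-hand glyph. The threshold is an assumed value, not a tuned one."""
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            # Distance from the right edge of the previous glyph to this one.
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)
```

If the reported positions say every glyph butts up against its neighbour, the gap is always zero and no spaces are ever inserted, which is exactly the failure described above.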

Ironically, the affected text wasn't even in a Type 0 font - but there was one on the page, which meant the whole page reverted to PDFKit extraction. We then had lineswithnospacesbetweenthewords.

So, we now handle basic Type 0 fonts; those which use an encoding other than Identity-H will still kick us over to PDFKit, but on the samples I've tried this afternoon it all gets handled natively. Word breaks seem to be more accurate, which is encouraging.
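Identity-H is the easy case because each character code in the string is simply two bytes, big-endian, and the code is used directly as the CID. A minimal sketch of that step, again in Python and again not our real code: `decode_identity_h` and `cids_to_text` are hypothetical names, and the `to_unicode` dictionary stands in for whatever CID-to-Unicode mapping the font supplies (for example its ToUnicode CMap).

```python
from typing import Dict, List


def decode_identity_h(data: bytes) -> List[int]:
    """Split a string shown in an Identity-H font into CIDs.
    Under Identity-H every code is 2 bytes, big-endian, and equals the CID."""
    if len(data) % 2 != 0:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cids_to_text(cids: List[int], to_unicode: Dict[int, str]) -> str:
    """Map CIDs to text using an assumed CID-to-Unicode table.
    CIDs missing from the table become U+FFFD (replacement character)."""
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)
```

Other encodings need a real CMap to work out how many bytes make up each code, which is why they still get kicked over to PDFKit for now.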

T

Labels: pdf, text

- posted by Tim Bruce @ 5:22 PM

OS X 10.5.8
Just updated. The process went smoothly, but the machine seems pretty slow. Maybe it's just filling up caches after the reboot.

PDF rendering bugs still not fixed...

T

Labels: mac

- posted by Tim Bruce @ 12:14 PM