Tuesday, August 11, 2009
Type 0 fonts
are finally handled natively within our text extractor. Previously we'd fall back to the PDFKit text extraction whenever we came across one of these beasts, but one of the titles we're working on at the moment breaks when we do that.
We rely on the position of the characters on the page when deciding where to break words/lines/blocks in the extracted text, and PDFKit doesn't always report them reliably. In this case it was reporting that characters butted up against each other even when they were in consecutive words.
Ironically, this wasn't even in a Type 0 font - but there was one on the page, which meant the whole thing reverted to PDFKit extraction. We then had lineswithnospacesbetweenthewords.
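To make the dependence on character positions concrete, here is a minimal sketch of gap-based word breaking. It is not the extractor's actual code: the `Glyph` record, the `join_with_word_breaks` function and the `gap_ratio` threshold are all hypothetical names and values chosen for illustration. It assumes each glyph reports a left edge and an advance width, and inserts a space whenever the gap to the next glyph is a noticeable fraction of the previous glyph's width.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical per-character record: the character, its left edge, its advance width."""
    char: str
    x: float
    width: float


def join_with_word_breaks(glyphs, gap_ratio=0.25):
    """Join glyphs on one line, adding a space wherever the horizontal gap is large.

    gap_ratio is an assumed threshold, expressed as a fraction of the
    previous glyph's width; a real extractor would need to tune it.
    """
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        # Distance between the right edge of the previous glyph and the left edge of this one.
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * prev.width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# Accurate positions: a visible gap between "hi" and "yo".
good = [Glyph("h", 0, 5), Glyph("i", 5, 5), Glyph("y", 13, 5), Glyph("o", 18, 5)]
# Positions reported as butted together: the gap disappears.
butted = [Glyph("h", 0, 5), Glyph("i", 5, 5), Glyph("y", 10, 5), Glyph("o", 15, 5)]

print(join_with_word_breaks(good))
print(join_with_word_breaks(butted))
```

With the second set of positions the gap is zero, so no break is ever inserted, which is the failure described above.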
So, we now handle basic Type 0 fonts; those which use an encoding other than Identity-H will still kick us over to PDFKit, but on the samples I've tried this afternoon it all gets handled natively. Word breaks seem to be more accurate, which is encouraging.
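For anyone wondering why Identity-H is the easy case: it treats the string as a run of 2-byte codes, high-order byte first, and each code is used directly as the CID. A rough sketch of that step follows. The function names and the `to_unicode` dictionary are hypothetical stand-ins (a real extractor would build the mapping from the font's ToUnicode CMap), and this is an illustration rather than the code we actually shipped.

```python
def decode_identity_h(data: bytes) -> list:
    """Split a string shown in an Identity-H font into CIDs.

    Each character code is two bytes, big-endian, and maps to the same CID value.
    """
    if len(data) % 2 != 0:
        raise ValueError("Identity-H strings must contain an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cids_to_text(cids, to_unicode):
    """Map CIDs to text using an assumed CID-to-string table.

    Unmapped CIDs become U+FFFD so that gaps in the table stay visible.
    """
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)


# Hypothetical mapping, standing in for a parsed ToUnicode CMap.
to_unicode = {36: "A", 37: "B"}
cids = decode_identity_h(b"\x00\x24\x00\x25")
print(cids, cids_to_text(cids, to_unicode))
```

Other encodings need a proper CMap lookup to get from byte sequences to CIDs, which is why they are harder to handle than this case.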