Wednesday, February 25, 2009
Text, actually
Fonts and text handling in PDF seem to be basically fractal - the closer you get to them, the more detail emerges.
The latest kink to emerge is the handling of ActualText passages. Some displayed text within the PDF may be enclosed within a BDC/EMC section which supplies the "ActualText" as it should be extracted by a text processor.
According to the PDF spec, you might use this, for example, to supply the unhyphenated original of a word where page layout has imposed a hyphen mid-word.
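To make the hyphenation case concrete, here is a minimal Python sketch that walks a hand-written, heavily simplified content stream and records which run of displayed characters each ActualText entry is meant to replace. The stream, the regular expression and the name collect_substitutions are all made up for illustration. It assumes plain literal strings shown with Tj, with no escapes, nested parentheses, hex strings, TJ arrays or font encodings, so it is nowhere near a real PDF parser.

```python
import re

# Hypothetical, simplified content stream: the hyphenated "ol-ive" is wrapped
# in a marked-content section whose ActualText gives the unhyphenated word.
STREAM = """
BT
(green ) Tj
/Span << /ActualText (olive) >> BDC
(ol-) Tj
(ive) Tj
EMC
( oil) Tj
ET
"""

# Three token shapes only: the start of an ActualText span, the EMC that
# closes it, and a literal string shown with Tj.
TOKEN = re.compile(
    r"/Span\s*<<\s*/ActualText\s*\((?P<actual>[^()]*)\)\s*>>\s*BDC"
    r"|(?P<emc>EMC)"
    r"|\((?P<shown>[^()]*)\)\s*Tj"
)

def collect_substitutions(stream):
    """Return (displayed text, [(start, end, actual_text), ...]).

    Offsets are half-open ranges into the displayed text as this function
    builds it, not into any string produced by another library.
    """
    shown_parts = []
    substitutions = []
    position = 0        # number of displayed characters seen so far
    open_span = None    # (start offset, actual text) while inside BDC..EMC
    for match in TOKEN.finditer(stream):
        if match.group("actual") is not None:
            open_span = (position, match.group("actual"))
        elif match.group("emc"):
            if open_span is not None:
                start, actual = open_span
                substitutions.append((start, position, actual))
                open_span = None
        else:
            text = match.group("shown")
            shown_parts.append(text)
            position += len(text)
    return "".join(shown_parts), substitutions

shown, subs = collect_substitutions(STREAM)
print(shown)  # green ol-ive oil
print(subs)   # [(6, 12, 'olive')]
```

The displayed text keeps the hyphen, while the recorded substitution says that characters 6 up to (but not including) 12 should read "olive".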
Some PDF authoring packages seem to be using this feature in weirdly convoluted ways, where the to-Unicode mapping of a font is broken for certain characters, but this is worked around by supplying an ActualText replacement for the affected characters when they're used on the page. Why not just make the Unicode map accurate and skip the ActualText complication? There's something there I don't understand.
All of the above matters because we're obtaining our page text using Apple's PDFKit, and as far as I can tell this ignores ActualText markup. This means the wrong Unicode chars go into our text processor and Bad Things Happen.
If you step outside PDFKit's high-level functions, you can dig deeply into the PDF using CGPDFScanner. This allows you to set up callbacks for the individual PDF operations you want to handle and then parse the page's complete stream.
I've written a class which uses this to extract the ActualText annotations from a page, along with a record of which bits of displayed text they should replace (expressed as a range of characters - for example, replace characters 15-20 with the word "olive").
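The record of substitutions can be thought of as a list of (start, end, replacement) entries. Below is a small Python sketch of applying such a list to a string. It is not the class described above: the name apply_substitutions and the half-open range convention are assumptions made for the example, and it assumes the ranges don't overlap.

```python
def apply_substitutions(text, substitutions):
    """Replace each half-open range [start, end) of text with its replacement.

    Assumes the ranges do not overlap. They are applied from the last range
    to the first, so replacing one range never shifts the offsets of the
    ranges still waiting to be applied.
    """
    result = text
    for start, end, replacement in sorted(substitutions, reverse=True):
        result = result[:start] + replacement + result[end:]
    return result

displayed = "ol-ive oil and vin-egar"
recorded = [(0, 6, "olive"), (15, 23, "vinegar")]
print(apply_substitutions(displayed, recorded))  # olive oil and vinegar
```

Working backwards matters because the replacements here are shorter than the text they replace. Applied front to back, the first substitution would shift every later offset by one.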
I had hoped to be able to reconcile this list of substitutions against the string of page text obtained from PDFKit. Then it would be a simple matter of making the substitutions as we build our text-block view of the page. Alas, the string we get from PDFKit isn't the same as the one we build while handling the text stream. There's some extra processing going on in PDFKit which means our recorded offsets aren't going to apply.
So: halfway there. I think having a foot on each pontoon would be a bit precarious anyway (reconciling offsets from the CG functions against PDFKit) - even if we try to reproduce the processing to correct the offsets it would only take an OS update for them to drift apart again. The solution seems to be to discard PDFKit for this and generate the text blocks directly from our CGPDFScanner's processing of the page so that everything stays in sync.
Today will be mainly coffee, Spotify and matrices.
T