xtim
Friday, February 27, 2009
Preparing the new text handler for production
The latest, highly experimental build of RubberStamp now generates its text view of the page directly from the PDF page stream - no PDFKit shortcuts and therefore more control when things go wrong. Also, alas, more scope for things to go wrong...

We're building a ToUnicode map as required so that the character codes in the show-text instructions get mapped to the correct unicode characters. We're also looking out for ActualText substitutions and replacing text as required. This all feeds through our text block system as before to get the text layout into a sensible linear order. W00t!, or indeed FTW!
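
In sketch form (Swift here for brevity, with invented names rather than anything lifted from RubberStamp), the single-byte case boils down to a lookup over the bytes of each show-text operand:

    // Hypothetical single-byte ToUnicode map: character code -> Unicode scalar.
    let toUnicode: [UInt8: Unicode.Scalar] = [
        0x24: "f", 0x25: "i", 0x26: "\u{FB01}",  // e.g. an fi ligature at 0x26
    ]

    // Decode the raw bytes of a Tj operand through the map. Unmapped codes
    // become U+FFFD so gaps in the map stay visible downstream.
    func decode(_ bytes: [UInt8]) -> String {
        bytes.map { String(toUnicode[$0] ?? "\u{FFFD}") }.joined()
    }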

There are limitations:


  1. We're not using the fonts specified in the PDF yet, just a single font for everything. This means that long lines of text will diverge slightly from the correct position due to variations in character widths between our font and the requested font. We can't use the new code to generate wordmaps until this is implemented, but for the extraction of text without explicit position information it's good enough.

  2. The ToUnicode mapper is basic: it only supports single-byte input character codes and doesn't support the import of existing CMaps. I'm hoping that this won't prove too limiting for now as we're mainly tackling European text. (There's a rough sketch of the single-byte parsing after this list.)

  3. If a font dictionary doesn't have a ToUnicode entry then we're sunk - there's no support for plain Encoding entries at the moment. We'll see how big a limitation that is as we process more content.
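
For a flavour of what even the basic mapper has to do, here's a rough sketch (not RubberStamp's actual code) of pulling bfchar and bfrange entries out of a ToUnicode CMap, with exactly the limits above baked in: one-byte source codes, BMP-only targets, no usecmap imports and no array-form bfrange destinations.

    // Naive ToUnicode CMap parse over whitespace-separated tokens.
    // A real parser needs a proper PostScript-style tokenizer.
    func parseToUnicode(_ cmap: String) -> [UInt8: Unicode.Scalar] {
        var map: [UInt8: Unicode.Scalar] = [:]

        // Hex value of a <..> token, or nil for anything else.
        func hex(_ token: Substring) -> UInt32? {
            guard token.hasPrefix("<"), token.hasSuffix(">") else { return nil }
            return UInt32(token.dropFirst().dropLast(), radix: 16)
        }

        let tokens = cmap.split(whereSeparator: { $0.isWhitespace })
        var i = 0
        while i < tokens.count {
            if tokens[i] == "beginbfchar" {
                i += 1
                // <src> <dst> pairs until endbfchar.
                while i + 1 < tokens.count, tokens[i] != "endbfchar" {
                    if let src = hex(tokens[i]), src <= 0xFF,
                       let dst = hex(tokens[i + 1]), let scalar = Unicode.Scalar(dst) {
                        map[UInt8(src)] = scalar
                    }
                    i += 2
                }
            } else if tokens[i] == "beginbfrange" {
                i += 1
                // <lo> <hi> <base> triples until endbfrange.
                while i + 2 < tokens.count, tokens[i] != "endbfrange" {
                    if let lo = hex(tokens[i]), let hi = hex(tokens[i + 1]),
                       lo <= hi, hi <= 0xFF, let base = hex(tokens[i + 2]) {
                        for code in lo...hi {
                            if let scalar = Unicode.Scalar(base + (code - lo)) {
                                map[UInt8(code)] = scalar
                            }
                        }
                    }
                    i += 3
                }
            } else {
                i += 1
            }
        }
        return map
    }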



So, the plan:


  1. Make the new system aware of its shortcomings and get it to throw an exception when it encounters something it can't handle.

  2. Get RubberStamp to catch these exceptions and fall back on our existing PDFKit approach when required (sketched after this list).

  3. Regenerate the text of a few currently-broken titles to check output.

  4. Make the enhanced text handling available to our "Features page" generator so everything's consistent.

  5. Deploy to the production servers.
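
Steps 1 and 2 amount to a throw-and-fall-back pattern, roughly like this sketch - ExtractionError, extractTextDirectly and pageText are made-up stand-ins, and only PDFPage.string is the real PDFKit call:

    import PDFKit

    // Invented error cases matching the limitations listed earlier.
    enum ExtractionError: Error {
        case multiByteCharacterCodes
        case noToUnicodeEntry(fontName: String)
    }

    // Stand-in for the new stream-based extractor: step 1 is that it throws
    // rather than guessing when it meets something it can't handle yet.
    func extractTextDirectly(from page: PDFPage) throws -> String {
        throw ExtractionError.multiByteCharacterCodes
    }

    // Step 2: catch anything the new path refuses and fall back to PDFKit.
    func pageText(for page: PDFPage) -> String? {
        do {
            return try extractTextDirectly(from: page)
        } catch {
            return page.string   // the existing PDFKit route
        }
    }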



T



Wednesday, February 25, 2009
Rendering
Some progress - the easiest way to work out what the text processor was seeing seemed to be to render it to the screen. So now we've got the basics of our own PDF renderer. NB: really basic and incomplete at this stage - but the possibilities are encouraging. Some text and strokes are appearing in the right places.

It was actually a project for later in the year to look into writing our own PDF renderer so we can have complete control over any display options and can fix any bugs as they arise. It's encouraging that we can use Apple's PDF scanner and CGContext methods to achieve a lot of this quickly.
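
As a taste of how directly the scanner callbacks map onto drawing, here's a sketch of two operators pushed straight into a CGContext - the CG calls are real, but RenderTarget and the callback names are invented:

    import CoreGraphics

    // Holds the destination context; a pointer to this travels through the
    // scanner's info parameter.
    final class RenderTarget {
        let context: CGContext
        init(context: CGContext) { self.context = context }
    }

    // "re" appends a rectangle to the current path. Operands pop in reverse
    // order: the stream pushes x y w h, so h comes off the stack first.
    let rectCallback: CGPDFOperatorCallback = { scanner, info in
        let target = Unmanaged<RenderTarget>.fromOpaque(info!).takeUnretainedValue()
        var x: CGPDFReal = 0, y: CGPDFReal = 0, w: CGPDFReal = 0, h: CGPDFReal = 0
        guard CGPDFScannerPopNumber(scanner, &h),
              CGPDFScannerPopNumber(scanner, &w),
              CGPDFScannerPopNumber(scanner, &y),
              CGPDFScannerPopNumber(scanner, &x) else { return }
        target.context.addRect(CGRect(x: x, y: y, width: w, height: h))
    }

    // "S" strokes whatever path has been built up.
    let strokeCallback: CGPDFOperatorCallback = { _, info in
        Unmanaged<RenderTarget>.fromOpaque(info!).takeUnretainedValue()
            .context.strokePath()
    }

    // Registered like any other operator:
    //   CGPDFOperatorTableSetCallback(table, "re", rectCallback)
    //   CGPDFOperatorTableSetCallback(table, "S", strokeCallback)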

The priority for the moment is sorting out the text extraction, so we'll focus on completing the text rendering for now. We need to make sure all chars are in the right places, then feed that information through to our text block sorter along with the ActualText substitutions which kicked this whole project off. The rest of the rendering work can wait for a wet weekend or two later in the year.

Spotify: also good for things you used to have on cassette.

T



Text, actually
Fonts and text handling in PDF seem to be basically fractal - the closer you get to them, the more detail emerges.

The latest kink to emerge is the handling of ActualText passages. Some displayed text within the PDF may be enclosed within a BDC/EMC section which supplies the "ActualText" as it should be extracted by a text processor.

According to the PDF spec, you might use this, for example, to supply the unhyphenated original of a word where page layout has imposed a hyphen mid-word.
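
In the content stream, a marked-content section carrying an ActualText replacement looks something like this (a hand-written illustration, not taken from a real file):

    /Span << /ActualText (pontoon) >> BDC
      (pon-) Tj
      0 -14 Td
      (toon) Tj
    EMC

A text processor that honours the markup extracts "pontoon"; one that ignores it gets "pon-toon".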

Some PDF authoring packages seem to be using this feature in weirdly convoluted ways, where the to-Unicode mapping of a font is broken for certain characters, but this is worked around by supplying an ActualText replacement for the affected characters when they're used on the page. Why not just make the Unicode map accurate and skip the ActualText complication? There's something there I don't understand.

All of the above matters because we're obtaining our page text using Apple's PDFKit, and as far as I can tell this ignores ActualText markup. This means the wrong Unicode chars go into our text processor and Bad Things Happen.

If you step outside PDFKit's high-level functions, you can dig deeply into the PDF using CGPDFScanner. This allows you to set up callbacks for the individual PDF operations you want to handle and then parse the page's complete stream.
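
In outline the setup looks like this - sketched in Swift for brevity (the API itself is plain C), with TextCollector and scanPage as invented names:

    import CoreGraphics

    // Collects the raw bytes of every show-text operand on the page.
    final class TextCollector {
        var bytes: [UInt8] = []
    }

    // C-style callback for the "Tj" operator: pop the operand string and
    // stash its bytes. The ', " and TJ operators show text too and would
    // need their own callbacks in a complete version.
    let showTextCallback: CGPDFOperatorCallback = { scanner, info in
        var pdfString: CGPDFStringRef?
        guard CGPDFScannerPopString(scanner, &pdfString), let str = pdfString,
              let info = info else { return }
        let collector = Unmanaged<TextCollector>.fromOpaque(info).takeUnretainedValue()
        let buffer = UnsafeBufferPointer(start: CGPDFStringGetBytePtr(str),
                                         count: CGPDFStringGetLength(str))
        collector.bytes.append(contentsOf: buffer)
    }

    func scanPage(_ page: CGPDFPage) -> [UInt8] {
        let collector = TextCollector()
        let table = CGPDFOperatorTableCreate()!
        CGPDFOperatorTableSetCallback(table, "Tj", showTextCallback)

        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, table,
                                         Unmanaged.passUnretained(collector).toOpaque())
        _ = CGPDFScannerScan(scanner)   // walks the whole content stream
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
        CGPDFOperatorTableRelease(table)
        return collector.bytes
    }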

I've written a class which uses this to extract the ActualText annotations from a page, along with a record of which bits of displayed text they should replace (expressed as a range of characters - for example, replace characters 15-20 with the word "olive").
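
The records themselves needn't be anything fancy - something along these lines (a guess at the shape, not the actual class), applied back-to-front so the earlier offsets stay valid:

    // One recorded substitution: a character range in the extracted text
    // plus the ActualText that should replace it, e.g. 15...20 -> "olive".
    struct ActualTextRun {
        let range: ClosedRange<Int>
        let replacement: String
    }

    // Assumes the runs don't overlap and fall within the text. Working from
    // the end backwards means earlier offsets are untouched by replacements.
    func applying(_ runs: [ActualTextRun], to text: String) -> String {
        var chars = Array(text)   // [Character]; the ranges index characters
        for run in runs.sorted(by: { $0.range.lowerBound > $1.range.lowerBound }) {
            chars.replaceSubrange(run.range, with: run.replacement)
        }
        return String(chars)
    }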

I had hoped to be able to reconcile this list of substitutions against the string of page text obtained from PDFKit. Then it would be a simple matter of making the substitutions as we build our text-block view of the page. Alas, the string we get from PDFKit isn't the same as the one we build while handling the text stream. There's some extra processing going on in PDFKit which means our recorded offsets aren't going to apply.

So: halfway there. I think having a foot on each pontoon (reconciling offsets from the CG functions against PDFKit) would be a bit precarious anyway - even if we try to reproduce the processing to correct the offsets, it would only take an OS update for them to drift apart again. The solution seems to be to discard PDFKit for this and generate the text blocks directly from our CGPDFScanner's processing of the page so that everything stays in sync.

Today will be mainly coffee, Spotify and matrices.

T
