xtim
Friday, February 27, 2009
 
Preparing the new text handler for production
The latest, highly experimental build of RubberStamp now generates its text view of the page directly from the PDF page stream - no PDFKit shortcuts and therefore more control when things go wrong. Also, alas, more scope for things to go wrong...

We're building a ToUnicode map as required so that the character codes in the show-text instructions get mapped to the correct Unicode characters. We're also looking out for ActualText substitutions and replacing text as required. This all feeds through our text block system as before to get the text layout into a sensible linear order. W00t!, or indeed FTW!
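
To make the mapping step concrete, here is a minimal sketch in Python. It is not RubberStamp's actual code. It assumes the ToUnicode stream has already been decompressed to text, and it only reads bfchar sections with single-byte source codes (bfrange sections are ignored). The names parse_tounicode and decode_show_text are made up for illustration.

```python
import re

# One "<src> <dst>" pair inside a beginbfchar ... endbfchar section.
_BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
_BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)


def parse_tounicode(cmap_text):
    """Build {character code: Unicode string} from bfchar sections only."""
    mapping = {}
    for section in _BFCHAR_SECTION.findall(cmap_text):
        for src_hex, dst_hex in _BFCHAR_PAIR.findall(section):
            if len(src_hex) != 2:
                # Assumption for this sketch: single-byte codes only.
                raise ValueError("only single-byte character codes are supported")
            # The destination is UTF-16BE and may hold several code units,
            # e.g. a ligature glyph that maps to "fi".
            mapping[int(src_hex, 16)] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return mapping


def decode_show_text(operand, mapping):
    """Map each byte of a show-text string operand through the ToUnicode map."""
    return "".join(mapping[code] for code in operand)
```

With a two-entry map such as "<01> <0048>" and "<02> <00660069>", the operand bytes 01 02 come out as "Hfi". A code missing from the map raises KeyError here, which is the sort of case the handler would need to report rather than guess at.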

There are limitations:


  1. We're not using the fonts specified in the PDF yet, just a single font for everything. This means that long lines of text will diverge slightly from the correct position due to variations in character widths between our font and the requested font. We can't use the new code to generate wordmaps until this is implemented, but for the extraction of text without explicit position information it's good enough.

  2. The ToUnicode mapper is basic: it only supports single-byte input character codes and doesn't support the import of existing CMaps. I'm hoping that this won't prove too limiting for now as we're mainly tackling European text.

  3. If a font dictionary doesn't have a ToUnicode entry then we're sunk - there's no support for plain Encoding entries at the moment. We'll see how big a limitation that is as we process more content.
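
The drift described in the first limitation is just accumulated width error. The sketch below shows the arithmetic, assuming glyph widths are given in thousandths of the font size (as they are for most simple PDF fonts) and ignoring character spacing, word spacing and horizontal scaling. The function name and the width tables are hypothetical.

```python
def line_drift(text, actual_widths, substitute_widths, font_size):
    """Return the accumulated horizontal error after each glyph in text."""
    drift = 0.0
    errors = []
    for ch in text:
        # Widths are in thousandths of the font size, so scale to text units.
        drift += (substitute_widths[ch] - actual_widths[ch]) * font_size / 1000.0
        errors.append(drift)
    return errors
```

For example, if the substitute font's "i" is 50 units wider than the requested font's, then at a font size of 10 each "i" pushes the following glyphs a further 0.5 units to the right. Errors of opposite sign cancel, which is why short or mixed runs stay close while long lines wander.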



So, the plan:


  1. Make the new system aware of its shortcomings and get it to throw an exception when it encounters something it can't handle.

  2. Get RubberStamp to catch these exceptions and fall back on our existing PDFKit approach when required.

  3. Regenerate the text of a few currently-broken titles to check output.

  4. Make the enhanced text handling available to our "Features page" generator so everything's consistent.

  5. Deploy to the production servers.
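
Steps 1 and 2 of the plan amount to a try-and-fall-back pattern, sketched here in Python. This is an illustration rather than RubberStamp's code: the exception class, the function names and the dictionary standing in for a font dictionary are all hypothetical, and the legacy extractor is passed in as a plain callable in place of the PDFKit route.

```python
class UnsupportedPDFFeature(Exception):
    """Raised when the new text handler meets something it cannot process."""


def extract_with_new_handler(font_dict):
    # Hypothetical check: refuse fonts that lack a ToUnicode entry.
    if "ToUnicode" not in font_dict:
        raise UnsupportedPDFFeature("font has no ToUnicode entry")
    # Placeholder for the real page-stream extraction.
    return "text from page stream"


def extract_text(font_dict, legacy_extractor):
    """Try the new handler first; fall back to the legacy route on failure."""
    try:
        return extract_with_new_handler(font_dict)
    except UnsupportedPDFFeature:
        return legacy_extractor()
```

Catching only the dedicated exception type means genuine bugs in the new handler still surface instead of being hidden by the fallback.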



T


