xtim
Monday, October 27, 2008
 
Text progress
The text view is now available (* selected titles only...).

RubberStamp is now extracting the page text directly. The tricky thing is turning a multi-column magazine layout into sensible linear text - particularly when you can have multiple articles on one page, with photo captions and quotations to integrate into the text flow.

It's difficult to do this perfectly, and often there are a number of possible "correct" interpretations. Sometimes even humans get confused about where the text flows next...

Our approach is to:

1. Iterate through the page text as supplied by the OS X API.

2. Break this text into blocks based on each character's position on the page.

3. Combine any blocks which should really be a single block (in particular, drop capitals and their associated paragraphs).

4. Recursively split the blocks into two lists - "before" and "after". At each stage, draw an imaginary horizontal or vertical line through the blocks such that the line crosses no block and there is at least one block on either side. Left or above is "before", right or below is "after".

5. Combine the text from the blocks in the ordering established in the previous step.
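
To make steps 4 and 5 concrete, here is a minimal Python sketch of the recursive split. It is not RubberStamp's code: the Block type, the function names, the top-left coordinate origin (y growing downwards), the "try a vertical line first" preference and the fallback sort are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    # Hypothetical block record; y grows downwards (an assumption).
    text: str
    left: float
    top: float
    right: float
    bottom: float


def find_gap(blocks: List[Block], axis: str) -> Optional[float]:
    """Return a coordinate where a line crosses no block and has at least
    one block on either side, or None if no such line exists.
    axis "x" means a vertical line, axis "y" a horizontal one."""
    if axis == "x":
        spans = sorted((b.left, b.right) for b in blocks)
    else:
        spans = sorted((b.top, b.bottom) for b in blocks)
    reach = spans[0][1]  # furthest edge covered so far
    for start, end in spans[1:]:
        if start > reach:  # clear gap between reach and start
            return (reach + start) / 2
        reach = max(reach, end)
    return None


def reading_order(blocks: List[Block]) -> List[Block]:
    """Recursively split blocks into "before" and "after" lists."""
    if len(blocks) <= 1:
        return list(blocks)
    for axis in ("x", "y"):  # assumed preference: columns first
        cut = find_gap(blocks, axis)
        if cut is None:
            continue
        if axis == "x":
            before = [b for b in blocks if b.right <= cut]
            after = [b for b in blocks if b.left >= cut]
        else:
            before = [b for b in blocks if b.bottom <= cut]
            after = [b for b in blocks if b.top >= cut]
        return reading_order(before) + reading_order(after)
    # No clean line (overlapping blocks): crude fallback, top-to-bottom
    # then left-to-right. The real handling is not described here.
    return sorted(blocks, key=lambda b: (b.top, b.left))


# A headline spanning the page above two columns, given in scrambled order.
page = [
    Block("Col2 para2", 55, 55, 100, 90),
    Block("Headline", 0, 0, 100, 10),
    Block("Col1 para2", 0, 55, 45, 90),
    Block("Col2 para1", 55, 15, 100, 50),
    Block("Col1 para1", 0, 15, 45, 50),
]
print("\n".join(b.text for b in reading_order(page)))
```

With this layout no vertical line clears the headline, so the first cut is horizontal (headline first), and the remaining blocks then split into the left and right columns.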

Item 4 is the tricky one - there may well be overlapping blocks so you can't draw a clean line, or there may be many possible lines to draw. The order in which the lines are drawn will affect the order of the reconstructed text.
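
A tiny, self-contained Python illustration of why the choice of line matters: four blocks in a 2x2 grid can be split by either a vertical or a horizontal line, and the two choices give different reading orders. The tuple layout, names and axis constants below are hypothetical, and y is assumed to grow downwards.

```python
from typing import List, Optional, Tuple

# Hypothetical block format: (text, left, top, right, bottom).
Box = Tuple[str, float, float, float, float]

VERTICAL = (1, 3)    # compare left/right edges: a vertical line
HORIZONTAL = (2, 4)  # compare top/bottom edges: a horizontal line


def split(boxes: List[Box], lo: int, hi: int) -> Optional[Tuple[List[Box], List[Box]]]:
    """Split at the first clear gap along one axis, or return None."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    reach = ordered[0][hi]
    for i, box in enumerate(ordered[1:], start=1):
        if box[lo] > reach:
            return ordered[:i], ordered[i:]
        reach = max(reach, box[hi])
    return None


def order(boxes: List[Box], axes: List[Tuple[int, int]]) -> List[str]:
    """Recursive before/after split, trying the axes in the given order."""
    if len(boxes) < 2:
        return [b[0] for b in boxes]
    for lo, hi in axes:
        parts = split(boxes, lo, hi)
        if parts:
            return order(parts[0], axes) + order(parts[1], axes)
    # No clean line available: fall back to a simple positional sort.
    return [b[0] for b in sorted(boxes, key=lambda b: (b[2], b[1]))]


grid = [
    ("A", 0, 0, 40, 40), ("B", 60, 0, 100, 40),
    ("C", 0, 60, 40, 100), ("D", 60, 60, 100, 100),
]
print(order(grid, [VERTICAL, HORIZONTAL]))  # ['A', 'C', 'B', 'D']
print(order(grid, [HORIZONTAL, VERTICAL]))  # ['A', 'B', 'C', 'D']
```

Both orderings are defensible: the first reads the page as two columns, the second as two rows. Which one is "correct" depends on the layout's intent, which the geometry alone doesn't reveal.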

We've got a working solution in place for the problem of overlapping blocks, but at the moment it's pot luck which line gets chosen first when a few are available. I'm going to make that slightly smarter, and also see if we can improve the handling of rotated text (vertical picture attributions, for example). After that, we'll roll this out to a few more titles.

The presentation side is already in place - there's an expandable div that presents the text for the page / spread you're reading, and a link to the printable rendition which invokes JavaScript's window.print() method.

T
