xtim
Wednesday, September 02, 2009
Reading Order Editor
Latest release of RubberStamp is now live in production. This adds a reading-order editor so we can hand tune the text extraction where required - for example in some of our broadsheet publications where stories are mingled across multiple columns.
Now back to the iPhone for the next generation of apps.
T
Tuesday, August 11, 2009
Type 0 fonts
are finally handled natively within our text extractor. Previously we'd fall back to the PDFKit text extraction whenever we came across one of these beasts, but one of the titles we're working on at the moment breaks when we do that.
We rely on the position of the characters on the page when deciding where to break words/lines/blocks in the extracted text, and PDFKit doesn't always report them reliably. In this case it was reporting that characters butted up against each other even when they were in consecutive words.
Ironically, this wasn't even in a Type 0 font - but there was one on the page, which meant the whole thing reverted to PDFKit extraction. We then had lineswithnospacesbetweenthewords.
So, we now handle basic Type 0 fonts; those which use an encoding other than Identity-H will still kick us over to PDFKit, but on the samples I've tried this afternoon it all gets handled natively. Word breaks seem to be more accurate, which is encouraging.
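For my own notes, a minimal sketch of why Identity-H is the easy case: every character code is two bytes, big-endian, and the code is the CID. This is Python rather than the extractor's own code, and the names decode_identity_h and cids_to_text are made up for illustration. The to_unicode dict stands in for whatever the font's ToUnicode map provides.

```python
def decode_identity_h(data):
    """Split a shown string into 2-byte big-endian codes.

    With the Identity-H CMap each 2-byte code is used directly as the CID.
    """
    if len(data) % 2:
        raise ValueError("Identity-H strings must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cids_to_text(cids, to_unicode):
    """Map CIDs to text through a ToUnicode-style dict (hypothetical shape).

    CIDs missing from the map become U+FFFD so gaps stay visible.
    """
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)
```

Any other encoding needs a real CMap parser, which is why those still go to PDFKit.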
T
OS X 10.5.8
Just updated. The process went smoothly, but the machine seems pretty slow. Maybe it's just filling up caches after the reboot.
PDF rendering bugs still not fixed...
T
Labels: mac
Monday, April 13, 2009
Monday, March 30, 2009
RubberStamp 2.0
Profiled, tidied up some of the layout handling related to rotated text and created the installer.
RubberStamp 2.0 is live!
Next things to fix:
- Handle Type 0 fonts
- Handle Form XObjects
A quick diversion first to handle territorial restrictions (not all books will be available everywhere).
T
Labels: pdf, release, text, Tools
Links and other bits
Link annotations now supported - web, go-to and EE JavaScript links.
Reinstated support for ActualText replacements.
Drop caps are now merged into the body text.
Last bits before first release: profile and (ideally) create an installer. I expect we'll release several revisions this week and an installer's going to help.
T
Friday, March 27, 2009
Words, lines and blocks
The text system now tracks individual words and lines so that we can
a) generate the wordmap directly from within RubberStamp for consistency with the searchable text, and
b) handle hyphenation intelligently at the ends of lines.
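A rough sketch of the hyphenation idea in (b), in Python rather than the extractor's own code. The rule here is an assumption, not necessarily what RubberStamp does: a line ending in a hyphen followed by a line starting with a lower-case letter is treated as one split word. join_lines is a made-up name.

```python
def join_lines(lines):
    """Join extracted lines into one string.

    Heuristic (assumed): a trailing hyphen followed by a lower-case
    letter is a soft hyphen, so the hyphen is dropped and the halves joined.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if out.endswith("-") and line[0].islower():
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

The obvious weakness is a genuine compound such as "reading-order" broken at its hyphen, which this would join into one word. Fixing that properly needs a dictionary check.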
Generated wordmaps are generally fine - a couple of pixels off here and there but definitely useable, and now there's just one place to fix bugs.
That problematic PDF is now just about tamed. We can extract the text and place highlights over search results.
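To pin down what "tracking words" means for the highlights, here is a small Python sketch. It assumes the text processor hands over (character, box) pairs in reading order with boxes as (llx, lly, urx, ury); the real wordmap file format isn't shown here, and word_boxes is a made-up name.

```python
def word_boxes(glyphs):
    """Group (char, box) pairs into (word, box) pairs.

    A word's box is the union of its glyph boxes; whitespace ends a word.
    """
    words = []
    chars, box = [], None
    for ch, (llx, lly, urx, ury) in glyphs:
        if ch.isspace():
            if chars:
                words.append(("".join(chars), box))
            chars, box = [], None
            continue
        chars.append(ch)
        if box is None:
            box = (llx, lly, urx, ury)
        else:
            box = (min(box[0], llx), min(box[1], lly),
                   max(box[2], urx), max(box[3], ury))
    if chars:
        words.append(("".join(chars), box))
    return words
```

A search hit then only needs the boxes of the matching words to draw its highlight.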
The last things to do before release are
1. extract the link annotations into the map files.
2. profile and optimise the code a bit. It's not yet refined for speed.
Big switch will get thrown on Monday.
In the meantime, the weekend.
T
Improved spacing
Part of the problem I was seeing with the spacing was that we guess the spaces based on glyph bounding boxes, and I was passing the font's declared bounding box for each glyph. Swapped to using the explicitly-declared width for each glyph (while retaining the bounding-box height) and that improved things a great deal.
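In sketch form (Python, not the real code), the change amounts to the following. It assumes the usual 1000-units-per-em glyph space of a simple font; a Type 3 font would go through its own FontMatrix instead. glyph_box is a made-up name.

```python
def glyph_box(x, y, glyph_width, font_bbox, font_size):
    """Box for one glyph whose origin sits at text position (x, y).

    Horizontal extent comes from the glyph's declared width; vertical
    extent comes from the font's bounding box. Both are assumed to be
    in 1/1000 em units.
    """
    llx, lly, urx, ury = font_bbox  # only lly and ury are used
    scale = font_size / 1000.0
    return (x, y + lly * scale, x + glyph_width * scale, y + ury * scale)
```

With the font-wide box every glyph looked as wide as the widest one, so neighbouring boxes always overlapped and the space guesser never saw a gap.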
Surprised to see that the bounding box of some of the glyphs is actually wider than the font's declared bounding box. Not sure why that would be. Maybe a buggy font?
Also using the text-position advance arguments of the TJ operator to generate implicit spaces when the position is advanced by more than a quarter of a bounding box.
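A minimal Python sketch of the TJ rule, for the record. TJ numbers are thousandths of a text-space unit and are subtracted from the position, so a negative number opens a gap. The function name and the plain-list representation of the array are my own for illustration, and horizontal scaling is ignored.

```python
def text_from_tj(array, font_size, bbox_width, threshold=0.25):
    """Concatenate the strings of a TJ array, adding implicit spaces.

    array      -- strings and numbers, as in [(He) -300 (llo)] TJ
    bbox_width -- bounding-box width in text-space units at font_size
    A space is added when a number advances the position by more than
    threshold * bbox_width.
    """
    parts = []
    for item in array:
        if isinstance(item, str):
            parts.append(item)
            continue
        # Negative adjustments move the next glyph to the right.
        advance = -item / 1000.0 * font_size
        if advance > threshold * bbox_width and parts and not parts[-1].endswith(" "):
            parts.append(" ")
    return "".join(parts)
```

Small kerning adjustments stay below the threshold and leave the word intact, while a word-sized jump produces a space.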
Added a basic check to look for paragraph breaks (based on line indentation) and use that to split text into separate blocks.
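The indentation check is roughly this, sketched in Python. It assumes each line arrives as (left_x, text) in reading order and that the left margin is the smallest x seen; the tolerance value is a guess, and split_paragraphs is a made-up name.

```python
def split_paragraphs(lines, indent_tolerance=2.0):
    """Group (left_x, text) lines into blocks.

    A line indented past the left margin by more than indent_tolerance
    starts a new block; everything else continues the current block.
    """
    if not lines:
        return []
    margin = min(x for x, _ in lines)
    blocks = []
    for x, text in lines:
        if not blocks or x - margin > indent_tolerance:
            blocks.append([text])
        else:
            blocks[-1].append(text)
    return blocks
```

It won't catch paragraphs separated only by extra leading, and a whole column of indented lines would each become their own block.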
Next: track individual word boundaries and generate word maps.
T
Thursday, March 26, 2009
Bounds line up
Woooooooooooohoooooooooo!
The Font object is now passing the font's bounding box back to the text processor. The text processor applies the current text transform and lines up the glyph's origin with the current text position before passing that transformed box into our TextBlock system. The boxes line up!
I had to fix another bug about the interpretation of offsets in the TJ operator - the offsets weren't getting transformed by the current text matrix so the spacing wandered out of sync.
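For reference, the transform itself sketched in Python. A PDF matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f). The function names are mine, and the result is the axis-aligned bounds of the transformed box rather than the rotated quad.

```python
def apply_matrix(m, x, y):
    """Apply a PDF-style matrix (a, b, c, d, e, f) to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_box(m, box):
    """Axis-aligned bounds of an (llx, lly, urx, ury) box after applying m."""
    llx, lly, urx, ury = box
    corners = [apply_matrix(m, x, y) for x in (llx, urx) for y in (lly, ury)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

All four corners have to go through the matrix because a mirrored or rotated text matrix swaps which corner ends up lower-left. The same matrix has to be applied to the TJ offsets, which is the step that was missing.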
I'm now fairly confident that the text processor knows
- Which characters are on the page, and
- Where each one is.
Funny things are happening though when the TextBlock system tries to guess where to insert spaces based on character position. That's the next thing to fix.
T
All a bit complicated
Lots of work going on at the moment with the text extraction from a troublesome file. I need to keep notes to keep my head straight...hello blog!
The goal
Get our text extraction to work with this file just as it does with other PDFs.
The problem
Copy and paste of the text from within Acrobat works fine.
Copy and paste of the text from within Preview generates high-ascii gibberish.
The wordmap generation tool generates intelligible text - but the character order is backwards. tsom ddo.
Investigation
The file's using a Type 3 font. This confuses OS X's PDFKit, so we will have to use a hand-rolled text processor no matter what.
The glyph names in the font are /40, /41, /42 etc. I *think* this is outside the PDF spec, which expects the glyph names to be from the Adobe Standard Encoding or the Symbol font. We'll let that pass because Acrobat interprets them so we should too.
The file sets the TextMatrix to reverse characters before they're drawn. Don't know why, but that's what's confusing the word map generator. I can only assume that the glyph shapes within the font are themselves reversed and rotated so that it all cancels out - the bounding boxes seem to bear this out as they're mainly below the line.
I have no idea why the file's structured like that.
The fix
We will have to process the text using our own tools, built from the PDF stream upwards. This includes extracting the linear text and generating the word / link maps too. It'll be good to bring these all together into a single codebase, but as always we could do with more time...
Progress
The new text processor extracts the correct Unicode text from the file (by assuming for example that /47 = character 47 in Adobe's standard encoding).
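A Python sketch of that glyph-name assumption. It only covers codes 0x20 to 0x7E, where Adobe's standard encoding matches ASCII apart from the two quote characters. The base parameter is there because these notes don't record whether the names are decimal, octal or hex; decimal is just the default here, and the function name is made up.

```python
# In StandardEncoding, 0x27 is quoteright and 0x60 is quoteleft;
# every other code in 0x20-0x7E matches ASCII.
_STANDARD_OVERRIDES = {0x27: "\u2019", 0x60: "\u2018"}


def glyph_name_to_char(name, base=10):
    """Read a numeric glyph name such as '/47' as a StandardEncoding code.

    base is an assumption (see above). Returns None for codes outside
    0x20-0x7E, which this sketch does not cover.
    """
    code = int(name.lstrip("/"), base)
    if code in _STANDARD_OVERRIDES:
        return _STANDARD_OVERRIDES[code]
    if 0x20 <= code <= 0x7E:
        return chr(code)
    return None
```

A full version needs the upper half of the encoding table as well.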
Because we now need to generate the wordmaps from our own tool too we will need to use the font metrics specified in the PDF file (currently the new text processor just uses a system font, but the positions won't line up).
I've implemented a Font object which pulls the glyph widths out of the font's definition in the PDF and also generates the appropriate toUnicodeMap.
It doesn't currently calculate the bounding box for each glyph, just assumes it's a square with sides the length of the glyph's advance.
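Roughly what the Font object does at this stage, sketched in Python. It assumes the widths come from the font dictionary's /FirstChar and /Widths entries; the class and method names are made up for illustration.

```python
class SimpleFont:
    """Glyph widths for a single-byte font, from /FirstChar and /Widths."""

    def __init__(self, first_char, widths, missing_width=0):
        self.first_char = first_char
        self.widths = widths
        self.missing_width = missing_width

    def width(self, code):
        """Advance width for a character code, or missing_width if out of range."""
        index = code - self.first_char
        if 0 <= index < len(self.widths):
            return self.widths[index]
        return self.missing_width

    def naive_box(self, code):
        """Square box with sides equal to the advance, origin at (0, 0)."""
        w = self.width(code)
        return (0, 0, w, w)
```

The square always sits on the baseline and extends upward, which is exactly wrong for a font whose glyphs hang below the line.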
Type 0 fonts are not yet supported, just "simple" fonts where each glyph is keyed by a single byte.
We fall back to the PDFKit approach when the new processor encounters something it doesn't understand, except for Form XObjects which we will need to handle but don't yet.
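The fallback has this shape, sketched in Python with stand-in callables. UnsupportedFeature, native and fallback are all hypothetical names; the real code calls into PDFKit rather than a passed-in function. Form XObjects are not modelled here.

```python
class UnsupportedFeature(Exception):
    """Raised by the native processor when it meets something it can't handle."""


def extract_text(page, native, fallback):
    """Try the native extractor and hand the whole page to the fallback
    if it reports an unsupported feature."""
    try:
        return native(page)
    except UnsupportedFeature:
        return fallback(page)
```

The fallback works on a whole page at a time, so one unsupported font sends every string on that page down the other path.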
Next
Retrieve the actual bounding box of each glyph as we're accumulating the text. The rotation/reflection of this font means that most of the glyph is below the line and so our naïvely-calculated bounding boxes are some way out.
Then run the debug version of the extractor on a few test pages, making sure that the drawn text boxes are correctly aligned with the text.
Once happy with that, get the TextBoxes to track individual words so we can kick out a wordmap file.
T
Thursday, March 05, 2009
Bugfixes
Two releases in one day...
Release 7.3.1 has just gone up. This neatens up the interaction of global preview with
- Rolling archives (where everything up to N months ago is free to all anyway), and
- Gateway forms, where you can complete a form to obtain unrestricted access to a particular issue.
We also fixed
- Google ads when displayed against fit-to-width pages. They now stay in place rather than getting squashed below the page images.
and added
- Support for access to titles through /isbn/XXXXXXXX . That's not a real ISBN.
T
Global Preview
Release 7.3 has just gone live. This moves everyone over to the new access control model (parameter-based rather than cookies).
Now that that's in place we've also graduated to "Global Preview" - people browsing in our shop can search the complete archives of all of our titles. Results are presented in our standard list of cropped page images. The click-through takes you to the thumbnail view of the pages, with an invitation to buy a subscription for full access.
All front covers and contents pages are available at full size so you can see what you're missing. The text of these also feeds into Google to help readers find the content.
T