xtim
Monday, March 30, 2009
 
RubberStamp 2.0
Profiled, tidied up some of the layout handling related to rotated text and created the installer.

RubberStamp 2.0 is live!

Next things to fix:

  1. Handle Type 0 fonts
  2. Handle Form XObjects

A quick diversion first to handle territorial restrictions (not all books will be available everywhere).

T



 
Links and other bits
Link annotations now supported - web, go-to and EE JavaScript links.

Reinstated support for ActualText replacements.

Drop-caps getting merged into the body text.

Last bits before first release: profile and (ideally) create an installer. I expect we'll release several revisions this week and an installer's going to help.

T



Friday, March 27, 2009
 
Words, lines and blocks
The text system now tracks individual words and lines so that we can

a) generate the wordmap directly from within RubberStamp for consistency with the searchable text, and

b) handle hyphenation intelligently at the ends of lines.
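
On (b), the check is roughly this shape - a minimal sketch, not the actual RubberStamp code, with the line and word structures invented for illustration:

    # Merge words across line breaks: a line ending in a hyphen usually
    # continues on the next line. A smarter version would consult a
    # dictionary to keep genuine compounds like "self-evident" intact.
    def merge_lines(lines):
        words, carry = [], None
        for line in lines:
            line = list(line)
            if carry and line:
                line[0] = carry + line[0]  # rejoin the split word
                carry = None
            if line and line[-1].endswith('-'):
                carry = line.pop()[:-1]    # drop the hyphen, hold the stem
            words.extend(line)
        if carry:
            words.append(carry)
        return words

    # merge_lines([['the', 'sub-'], ['scription', 'ends']])
    # -> ['the', 'subscription', 'ends']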

Generated wordmaps are generally fine - a couple of pixels off here and there but definitely usable, and now there's just one place to fix bugs.

That problematic PDF is now just about tamed. We can extract the text and place highlights over search results.

The last things to do before release are

1. extract the link annotations into the map files.

2. profile and optimise the code a bit. It's not yet refined for speed.

Big switch will get thrown on Monday.

In the meantime, the weekend.

T



 
Improved spacing
Part of the problem I was seeing with the spacing was that we guess the spaces based on glyph bounding boxes, and I was passing the font's declared bounding box for each glyph. Swapped to using the explicitly-declared width for each glyph (while retaining the bounding-box height) and that improved things a great deal.
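
The heuristic, roughly, sketched in Python (names and the gap factor are illustrative, not the actual code):

    # Guess inter-word spaces from the horizontal gap between one
    # glyph's right edge and the next glyph's left edge. With per-glyph
    # advance widths the edges sit snugly, so genuine gaps stand out;
    # the font-wide bounding box overstated every glyph's width.
    def insert_spaces(glyphs, gap_factor=0.25):
        # glyphs: list of (char, x_origin, advance_width) in text space
        out = []
        for prev, cur in zip(glyphs, glyphs[1:]):
            out.append(prev[0])
            gap = cur[1] - (prev[1] + prev[2])
            if gap > gap_factor * prev[2]:
                out.append(' ')
        if glyphs:
            out.append(glyphs[-1][0])
        return ''.join(out)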

Surprised to see that the bounding box of some of the glyphs is actually wider than the font's declared bounding box. Not sure why that would be. Maybe a buggy font?

Also using the text-position advance arguments of the TJ operator to generate implicit spaces when the position is advanced by more than a quarter of a bounding box.
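
In outline (a sketch - the quarter-of-a-box threshold is from above, the rest is illustrative):

    # The TJ operator takes an array mixing strings and numbers. Each
    # number is a displacement in thousandths of text-space units,
    # subtracted from the pen position; a big enough backward kick is
    # really an inter-word gap that was never encoded as a space glyph.
    def scan_tj(array, font_size, box_width, emit):
        for element in array:
            if isinstance(element, str):
                emit(element)
            else:
                shift = -element / 1000.0 * font_size
                if shift > 0.25 * box_width:
                    emit(' ')  # implicit space

    # scan_tj(['He', 250, 'llo', -1200, 'world'], 10.0, 8.0, print)
    # emits 'He', 'llo', ' ', 'world'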

Added a basic check to look for paragraph breaks (based on line indentation) and use that to split text into separate blocks.
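
Something like this (illustrative; the threshold would need tuning per title):

    # Start a new block when a line's left edge jumps inward relative
    # to the previous line - the classic first-line paragraph indent.
    def split_blocks(lines, indent_jump=10.0):
        # lines: list of (left_x, line_text) pairs, top to bottom
        blocks, current, prev_x = [], [], None
        for left_x, text in lines:
            if prev_x is not None and left_x - prev_x > indent_jump:
                blocks.append(current)   # indented line: new paragraph
                current = []
            current.append(text)
            prev_x = left_x
        if current:
            blocks.append(current)
        return blocks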

Next to track individual word boundaries and generate word maps.

T



Thursday, March 26, 2009
 
Bounds line up
Woooooooooooohoooooooooo!

The Font object is now passing the font's bounding box back to the text processor. The text processor applies the current text transform and lines up the glyph's origin with the current text position before passing that transformed box into our TextBlock system. The boxes line up!

I had to fix another bug in the interpretation of offsets in the TJ operator - the offsets weren't being transformed by the current text matrix, so the spacing wandered out of sync.
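
The transform step, in outline (a sketch of the geometry; the real code composes with the page's CTM as well):

    # Apply a PDF matrix [a b c d e f] to a point. The glyph's box is
    # taken through all four corners, since rotation turns an upright
    # box into a quadrilateral. TJ displacements live in the same
    # unscaled text space, so they must go through this matrix too -
    # the bug fixed above.
    def transform(m, x, y):
        a, b, c, d, e, f = m
        return (a * x + c * y + e, b * x + d * y + f)

    def glyph_quad(m, bbox):
        x0, y0, x1, y1 = bbox  # glyph-space box at the text origin
        return [transform(m, x, y)
                for x, y in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]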

I'm now fairly confident that the text processor knows

  1. Which characters are on the page, and
  2. Where each one is.

Funny things are happening though when the TextBlock system tries to guess where to insert spaces based on character position. That's the next thing to fix.

T



 
All a bit complicated
Lots of work going on at the moment with the text extraction from a troublesome file. I need to keep notes to keep my head straight...hello blog!

The goal

Get our text extraction to work with this file just as it does with other PDFs.

The problem

Copy and paste of the text from within Acrobat works fine.

Copy and paste of the text from within Preview generates high-ascii gibberish.

The wordmap generation tool generates intelligible text - but the character order is backwards. tsom ddo.

Investigation

The file's using a Type 3 font. This confuses OS X's PDFKit, so we will have to use a hand-rolled text processor no matter what.

The glyph names in the font are /40, /41, /42 etc. I *think* this is outside the PDF spec, which expects the glyph names to be from the Adobe Standard Encoding or the Symbol font. We'll let that pass because Acrobat interprets them so we should too.

The file sets the TextMatrix to reverse characters before they're drawn. Don't know why, but that's what's confusing the word map generator. I can only assume that the glyph shapes within the font are themselves reversed and rotated so that it all cancels out - the bounding boxes seem to bear this out, as they're mainly below the line.
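
For a concrete picture of what such a matrix does (numbers invented for illustration, not taken from the file):

    # A text matrix of [-1 0 0 -1 tx ty] rotates everything 180 degrees
    # about the text origin, so advancing the pen by +10 units moves
    # the next glyph LEFT on the page. If the glyph shapes are
    # pre-reversed inside the font the page still renders correctly,
    # but naive position tracking sees the characters running backwards
    # and the boxes hanging below the line.
    def transform(m, x, y):
        a, b, c, d, e, f = m
        return (a * x + c * y + e, b * x + d * y + f)

    m = (-1, 0, 0, -1, 100, 700)
    print(transform(m, 0, 0))   # (100, 700) - first glyph origin
    print(transform(m, 10, 0))  # (90, 700)  - next glyph lands to the left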

I have no idea why the file's structured like that.

The fix

We will have to process the text using our own tools, built from the PDF stream upwards. This includes extracting the linear text and generating the word / link maps too. It'll be good to bring these all together into a single codebase, but as always we could do with more time...

Progress

The new text processor extracts the correct Unicode text from the file (by assuming, for example, that /47 = character 47 in Adobe's standard encoding).

Because we now need to generate the wordmaps from our own tool too, we will need to use the font metrics specified in the PDF file (currently the new text processor just uses a system font, so the positions won't line up).

I've implemented a Font object which pulls the glyph widths out of the font's definition in the PDF and also generates the appropriate toUnicodeMap.

It doesn't currently calculate the bounding box for each glyph, just assumes it's a square with sides the length of the glyph's advance.
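
In skeleton form (a sketch; the dictionary keys are standard PDF ones, the rest - including treating glyph names like /47 as plain character codes - follows the assumption described above):

    # Skeleton of the Font object: pull per-glyph widths out of the
    # font dictionary, build a code -> unicode map, and fake each
    # glyph's bounding box as a square sized by its advance.
    class Font:
        def __init__(self, font_dict):
            self.first_char = font_dict['FirstChar']
            self.widths = font_dict['Widths']    # one entry per code
            self.to_unicode = {code: chr(code)   # /47 -> chr(47), etc.
                               for code in range(
                                   self.first_char,
                                   self.first_char + len(self.widths))}

        def advance(self, code):
            # widths are in thousandths of text-space units
            return self.widths[code - self.first_char] / 1000.0

        def bbox(self, code):
            w = self.advance(code)
            return (0.0, 0.0, w, w)  # naive square, to be replaced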

Type 0 fonts are not yet supported, just "simple" fonts where each glyph is keyed by a single byte.

We fall back to the PDFKit approach when the new processor encounters something it doesn't understand, except for Form XObjects, which we will need to handle but don't yet.

Next

Retrieve the actual bounding box of each glyph as we're accumulating the text. The rotation/reflection of this font means that most of the glyph is below the line and so our naïvely-calculated bounding boxes are some way out.

Then run the debug version of the extractor on a few test pages, making sure that the drawn text boxes are correctly aligned with the text.

Once happy with that, get the TextBoxes to track individual words so we can kick out a wordmap file.

T



Thursday, March 05, 2009
 
Bugfixes
Two releases in one day...

Release 7.3.1 has just gone up. This neatens up the interaction of global preview with



We also fixed



and added



T



 
Global Preview
Release 7.3 has just gone live. This moves everyone over to the new access control model (parameter-based rather than cookies).

Now that that's in place we've also graduated to "Global Preview" - people browsing in our shop can search the complete archives of all of our titles. Results are presented in our standard list of cropped page images. The click-through takes you to the thumbnail view of the pages, with an invitation to buy a subscription for full access.

All front covers and contents pages are available at full size so you can see what you're missing. The text of these also feeds into Google to help readers find the content.

T



Wednesday, March 04, 2009
 
Parameter-based access control
is now up and running on the site - but applies only to admins at this point. Will roll it out to everyone once I've tested everything.

Had to update the clipper so that it knows how to retrieve the page images under the new approach.

It's quite cool to be able to grab a page image URL and access it directly in another browser - for about a minute. After that the timestamp parameter automatically expires the link and we deny access. There might be opportunities there for use in promotions: we could generate links to page images with one-month expiry etc., and these could then be served without any login preamble.

T



Tuesday, March 03, 2009
 
Expiry and signature
Generation on the Java side seems to be working fine. Next: get perl to interpret the parameters and grant or deny access.
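
The perl side's job, sketched here in Python for brevity (the parameter names and signing recipe are placeholders, not the real scheme):

    import hashlib, hmac, time

    SECRET = b'shared-between-java-and-perl'  # placeholder key

    def allow(path, expires, signature):
        if int(expires) < time.time():
            return False  # link has expired
        # recompute the signature over path + expiry and compare
        expected = hmac.new(SECRET, (path + expires).encode(),
                            hashlib.sha1).hexdigest()
        return hmac.compare_digest(expected, signature)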

T



 
Large cookies
Best kind.

There are currently two codebases which work together to serve our web pages and page images.

By far the biggest is the Java application which runs within Tomcat and knows about all our abstractions - members, subscriptions, books and magazines.

The smaller piece of code is an Apache perl module which grants or denies access to the page images depending on what you're allowed to see.

These two pieces of code communicate through cookies sent via your browser - the Java code knows what you're allowed to see and it sets a cookie to reflect that. The browser supplies this cookie when requesting a page image and the perl module decodes it before deciding whether to send you the file you asked for.

There are complications like digital signatures in there to maintain the cookie's integrity through this conversation.

The problem is that some accounts (for example our UK shop) now have many subscriptions. It becomes inefficient to squash all this information into a cookie, and we're occasionally encountering hard limits on cookie length.

Time for a new solution.

The new approach follows Amazon's S3 model, where each image request will include an expiry date and a cryptographic signature. The perl module can just check the expiry date against the current system time, verify the signature and serve the file if all is ok. No cookies required.

On the Java side we have to generate the expiry timestamps and signatures for each request.

Java first.
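
The generation side is the mirror image - roughly this, sketched in Python though the real code is Java, using the same placeholder names and recipe as the sketch in the post above:

    import hashlib, hmac, time
    from urllib.parse import urlencode

    SECRET = b'shared-between-java-and-perl'  # placeholder key

    def signed_url(path, lifetime=60):
        expires = str(int(time.time()) + lifetime)
        sig = hmac.new(SECRET, (path + expires).encode(),
                       hashlib.sha1).hexdigest()
        return '%s?%s' % (path, urlencode({'expires': expires,
                                           'signature': sig}))

    # signed_url('/pages/some-title/p42.png')
    # -> '/pages/some-title/p42.png?expires=1238500000&signature=...'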

T



 
Catching up elsewhere
Got as far as item 3 on the test plan - the text scanner falls back to the old PDFKit approach whenever the new system can't handle the supplied PDF. It turned out that support for WinAnsi and MacRoman encodings was required for pretty much every file, so those are in there now.
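
For the simple single-byte cases this is just a table lookup; Python happens to ship near-equivalents of both tables, which makes for an easy sketch (a real implementation must also honour the font's /Differences array):

    # cp1252 is a close stand-in for WinAnsiEncoding and mac_roman for
    # MacRomanEncoding; both map single bytes straight to unicode.
    CODECS = {'WinAnsiEncoding': 'cp1252',
              'MacRomanEncoding': 'mac_roman'}

    def decode(raw, encoding_name):
        return raw.decode(CODECS[encoding_name], errors='replace')

    # decode(b'\x93quoted\x94', 'WinAnsiEncoding') -> '\u201cquoted\u201d'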

Other wrinkles: some titles embed all their page text within a Form XObject. This isn't parsed as part of the PDF's page stream, so we don't get to see that text. Can work around this by creating a new scanner for each Form XObject, but we're not doing that yet. Instead we bail and fall back on PDFKit when we detect one of these.
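
The eventual fix is conceptually simple - when the content stream invokes a form with the Do operator, recurse into it (a sketch; the parser and handler names are invented):

    # On "/SomeForm Do", fetch the named XObject from the resources;
    # if it's a form, scan its own stream with a child scanner,
    # composing the form's /Matrix onto the current transform.
    def scan_stream(stream, resources, ctm, emit):
        for op, args in parse_ops(stream):        # parse_ops: assumed
            if op == 'Do':
                xobj = resources['XObject'][args[0]]
                if xobj.subtype == 'Form':
                    scan_stream(xobj.stream, xobj.resources,
                                compose(ctm, xobj.matrix), emit)
                    continue
            handle(op, args, ctm, emit)           # assumed handler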

Page Rotation: when a PDFPage specifies an output rotation (e.g. rotate the page 90 degrees before display) it confuses our text blocks, as their notion of top-to-bottom and left-to-right isn't rotated. Needs fixing, as this affects our existing text extraction.

The font issue (we're only using one font when we calculate text positions on the page) is turning out to be more important than I thought. The longer lines in multi-column layouts reduce the width of the vertical gaps between columns. In turn this means our block slicer is more likely to choose horizontal cuts and get the reading order wrong. This is difficult to catch, so we can't just fall back on PDFKit automatically.
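
For reference, the slicer is essentially a recursive cut over whitespace gaps - roughly this shape, with the gap-finding helpers elided (illustrative, not the shipping code):

    # Split a set of word boxes at the widest whitespace gap, vertical
    # (a column gutter) or horizontal (a band between rows), and
    # recurse. When long lines squeeze the gutters, the horizontal
    # gaps win and the columns get interleaved - the failure described
    # above.
    def slice_blocks(boxes):
        vgap = widest_vertical_gap(boxes)    # assumed: (width, x)
        hgap = widest_horizontal_gap(boxes)  # assumed: (width, y)
        if vgap[0] <= 0 and hgap[0] <= 0:
            return [boxes]                   # no gap left: one block
        if vgap[0] >= hgap[0]:
            left, right = partition_x(boxes, vgap[1])
            return slice_blocks(left) + slice_blocks(right)
        top, bottom = partition_y(boxes, hgap[1])
        return slice_blocks(top) + slice_blocks(bottom)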

Checking everything in, but not yet enabling the new extractor. There are a few other projects which require attention.

T



