xtim
Friday, October 31, 2008
 
Extracting page text on Tiger
Looks as if there might be a bug in PDFKit on Tiger?

Calling


[page attributedString]


is throwing a range exception for some pages. On Leopard the console records a warning message to the same effect but the exception itself is not thrown.

This was rare, but it killed the import process when it happened. We're now explicitly catching the exception and retrying with


[page string]


instead, retrieving the plain text when the first call fails. This means we can't use the string attributes to help deduce the text's layout, but we can at least recover the content.
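In outline the fallback looks something like this (a sketch rather than our exact import code - textForPage: is just an illustrative name):

#import <Quartz/Quartz.h>

// Returns the attributed string when PDFKit cooperates, or the plain
// string wrapped up in one when it doesn't.
- (NSAttributedString *)textForPage:(PDFPage *)page
{
    @try {
        // The attributed string carries the font and position
        // information we use to deduce layout.
        return [page attributedString];
    }
    @catch (NSException *e) {
        // Tiger's PDFKit occasionally throws a range exception here;
        // settle for plain text so the import can carry on.
        NSLog(@"-attributedString threw %@; retrying with -string", [e name]);
        NSString *plain = [page string];
        if (plain == nil)
            return nil;
        return [[[NSAttributedString alloc] initWithString:plain] autorelease];
    }
}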



 
Release 6.9.7
Adds support for downloadable PDFs, plus some basket tweaks.



Thursday, October 30, 2008
 
Promotions
Tweaked the basket handler so that you can only apply a single promotion code to each purchase - we've got some "overlapping" promotions on the way and you shouldn't be able to use them all at once.



 
Downloading
Added a new link to download the complete PDF, for titles that offer this option. The front cover of the downloaded PDF is watermarked with the username of the downloader - low-tech, but a slight disincentive for casual distribution of the file.
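For the curious, a minimal PDFKit version of the stamp might look like this (a sketch, not our production code - pdfURL, username and outPath stand in for the real values):

PDFDocument *doc = [[PDFDocument alloc] initWithURL:pdfURL];
PDFPage *cover = [doc pageAtIndex:0];
NSRect box = [cover boundsForBox:kPDFDisplayBoxCropBox];

// A strip along the bottom edge of the front cover.
NSRect stampRect = NSMakeRect(NSMinX(box) + 10.0, NSMinY(box) + 10.0,
                              NSWidth(box) - 20.0, 20.0);

PDFAnnotationFreeText *stamp =
    [[[PDFAnnotationFreeText alloc] initWithBounds:stampRect] autorelease];
[stamp setContents:[NSString stringWithFormat:@"Downloaded by %@", username]];
[stamp setFont:[NSFont systemFontOfSize:12.0]];
[cover addAnnotation:stamp];

[doc writeToFile:outPath];
[doc release];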



Wednesday, October 29, 2008
 
Downsampling
We're going to make PDFs available for some titles - obviously we don't want to release the full-sized production PDFs as their size (often in the hundreds of megabytes) makes them unwieldy.

RubberStamp will now kick out a recompressed PDF as part of the import process. This uses the astoundingly handy Quartz Filter system of OS X. We've created a filter which downsamples all images to "screen resolution" and applies fairly hefty JPEG compression, bringing the typical file size down to a few MB.

You can apply filters directly in OS X 10.5 as part of the PDFDocument writeToFile options; a couple of our import machines are still running 10.4, so there we have to fire up an external task to invoke the filter: the quartzfilter utility in /System/Library/Printers/Libraries.
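Roughly, the two paths look like this (a sketch - doc, filterPath, inputPath and outputPath are stand-ins, and the @"QuartzFilter" options key is the trick in circulation rather than anything documented):

// 10.5: apply the filter in-process while writing the PDF.
QuartzFilter *filter =
    [QuartzFilter quartzFilterWithURL:[NSURL fileURLWithPath:filterPath]];
[doc writeToFile:outputPath
     withOptions:[NSDictionary dictionaryWithObject:filter
                                              forKey:@"QuartzFilter"]];

// 10.4: shell out to the quartzfilter utility instead
// (argument order assumed: input file, filter, output file).
NSTask *task = [[[NSTask alloc] init] autorelease];
[task setLaunchPath:@"/System/Library/Printers/Libraries/quartzfilter"];
[task setArguments:
    [NSArray arrayWithObjects:inputPath, filterPath, outputPath, nil]];
[task launch];
[task waitUntilExit];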

One further complication: quartzfilter seems to be picky on 10.4 about where it will pick up filters. I tried embedding the filter into RubberStamp as an application resource, and this works fine on 10.5 but is ignored by quartzfilter on 10.4. Putting the filter in ~/Library/Filters works fine on both, so that's what we're doing.

Filtering the PDF also seems to strip out the pages' CropBox and TrimBox settings, so RubberStamp now collects the TrimBox for each page, applies the filter and then resets the CropBoxes to match the original TrimBoxes. Most PDF readers use the CropBox to clip the page, so we want it to match the intended display area.
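The fix-up is straightforward with PDFKit's box accessors - something like this, where doc is the original document and filteredDoc the recompressed copy:

// Remember each page's TrimBox before filtering...
NSMutableArray *trimBoxes = [NSMutableArray array];
unsigned int i;
for (i = 0; i < [doc pageCount]; i++) {
    NSRect trim = [[doc pageAtIndex:i] boundsForBox:kPDFDisplayBoxTrimBox];
    [trimBoxes addObject:[NSValue valueWithRect:trim]];
}

// ...then restore them as CropBoxes on the filtered copy, so that
// readers clip each page to the intended display area.
for (i = 0; i < [filteredDoc pageCount]; i++) {
    NSRect trim = [[trimBoxes objectAtIndex:i] rectValue];
    [[filteredDoc pageAtIndex:i] setBounds:trim
                                    forBox:kPDFDisplayBoxCropBox];
}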

Finally we strip out our production-specific annotations to give a clean, small PDF ready for the public.
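That strip is a loop over each page's annotations - sketched below, with isProductionAnnotation: standing in for the test that recognises our own annotation types:

unsigned int i;
for (i = 0; i < [doc pageCount]; i++) {
    PDFPage *page = [doc pageAtIndex:i];
    // Work on a copy: we're removing annotations as we iterate.
    NSArray *annotations = [[[page annotations] copy] autorelease];
    NSEnumerator *e = [annotations objectEnumerator];
    PDFAnnotation *annotation;
    while ((annotation = [e nextObject])) {
        if ([self isProductionAnnotation:annotation])
            [page removeAnnotation:annotation];
    }
}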

Now to deliver it...



Monday, October 27, 2008
 
Final tweaks?
The block-slicing algorithm has had its final refinement for the time being. Rather than choosing an available cut at random, we assign a score to all available cuts based on the width of the channels between text blocks. The cut with the highest score (i.e. the widest gap) is used.
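The score itself is cheap to compute once the candidate cuts are known - a sketch for horizontal cuts, assuming flipped page coordinates and NSValue-wrapped block rectangles:

#import <Foundation/Foundation.h>
#include <float.h>

// Score a horizontal cut at y by the width of the empty channel it
// runs through: the distance from the nearest block edge above the
// cut to the nearest block edge below it. Assumes y is a legal cut,
// i.e. at least one block lies wholly on each side.
static float scoreForHorizontalCut(float y, NSArray *blocks)
{
    float bottomOfTextAbove = -FLT_MAX; // largest maxY above the cut
    float topOfTextBelow = FLT_MAX;     // smallest minY below the cut
    NSEnumerator *e = [blocks objectEnumerator];
    NSValue *v;
    while ((v = [e nextObject])) {
        NSRect r = [v rectValue];
        if (NSMaxY(r) <= y && NSMaxY(r) > bottomOfTextAbove)
            bottomOfTextAbove = NSMaxY(r);
        if (NSMinY(r) >= y && NSMinY(r) < topOfTextBelow)
            topOfTextBelow = NSMinY(r);
    }
    return topOfTextBelow - bottomOfTextAbove; // the channel width
}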

Also added debug output so we can see images of the pages with the text block boundaries drawn on top. This revealed that all our blocks were padded to the right, where OS X supplies a space-newline sequence to mark the end of each line. We're now ignoring those characters when sizing the text blocks, which gives the cut-scoring algorithm better information to work with.
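The debug images are cheap to produce with PDFKit and AppKit - roughly this, where blocks holds the NSValue-wrapped block rects, debugPath is a stand-in, and any offset between page space and image space is glossed over:

// Render the page, frame each text block, and save out a TIFF.
NSImage *image = [[[NSImage alloc]
    initWithData:[page dataRepresentation]] autorelease];
[image lockFocus];
[[NSColor redColor] set];
NSEnumerator *e = [blocks objectEnumerator];
NSValue *v;
while ((v = [e nextObject]))
    NSFrameRect([v rectValue]);
[image unlockFocus];
[[image TIFFRepresentation] writeToFile:debugPath atomically:YES];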

Text re-calculated and re-uploaded.



 
Vertical Text fixed
Found the cause of the problem with vertical text.

Our text was getting split into many separate blocks, when it should really have been a single column. The algorithm responsible for block-splitting takes into account the dimensions of each character's rectangle on the page, and this was getting confused. For some characters (notably 'i's, 'l's and 'f's), the PDF handling routines were reporting a rectangle of tiny (and negative) width - millionths of a point wide.

Now using the larger of the reported height and width for the calculation instead. This seems much more reliable, and I'm regenerating the text for our first text-enabled title with the new algorithm.
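The change amounts to a line in the sizing code - assuming the rectangles come from PDFPage's characterBoundsAtIndex:, the call that reports each character's bounds:

// Some glyphs report a width of millionths of a point (sometimes
// negative), so take the larger reported dimension as the
// character's effective size.
NSRect charRect = [page characterBoundsAtIndex:charIndex];
float charSize = MAX(NSWidth(charRect), NSHeight(charRect));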



 
Text progress
The text view is now available (* selected titles only...).

RubberStamp is now extracting the page text directly. The tricky thing is turning a multi-column magazine layout into sensible linear text - particularly when you can have multiple articles on one page, with photo captions and quotations to integrate into the text flow.

It's difficult to do this perfectly, and often there are a number of possible "correct" interpretations. Sometimes even humans get confused about where the text flows next...

Our approach is to:

1. Iterate through the page text as supplied by the OS X API.

2. Break this text into blocks based on each character's position on the page.

3. Combine any blocks which should really be a single block (in particular, drop capitals and their associated paragraphs).

4. Recursively split the blocks into two lists - "before" and "after". At each stage, draw an imaginary horizontal or vertical line through the blocks such that the line crosses no block and there is at least one block on either side. Left or above is "before", right or below is "after".

5. Combine the text from the blocks in the ordering established in the previous step.

Item 4 is the tricky one - there may well be overlapping blocks so you can't draw a clean line, or there may be many possible lines to draw. The order in which the lines are drawn will affect the order of the reconstructed text.

We've got a working solution in place to the problem of overlapping blocks, but at the moment it's pot luck which line gets chosen first when a few are available. I'm going to make that slightly smarter, and also see if we can improve the handling of rotated text (vertical picture attributions, for example). Then to roll this out to a few more titles.
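For reference, the recursive split in step 4 reads roughly like this - TextBlock, Cut, cutForBlocks: and isBlockBefore: are illustrative names, and cutForBlocks: is where the (currently pot luck) choice of line happens:

- (NSArray *)orderedBlocks:(NSArray *)blocks
{
    if ([blocks count] <= 1)
        return blocks;

    // Find a horizontal or vertical line that crosses no block and
    // has at least one block on each side.
    Cut *cut = [self cutForBlocks:blocks];
    if (cut == nil)
        return blocks; // overlapping blocks: no clean line to draw

    NSMutableArray *before = [NSMutableArray array];
    NSMutableArray *after = [NSMutableArray array];
    NSEnumerator *e = [blocks objectEnumerator];
    TextBlock *block;
    while ((block = [e nextObject])) {
        // Left of a vertical cut, or above a horizontal one, is "before".
        if ([cut isBlockBefore:block])
            [before addObject:block];
        else
            [after addObject:block];
    }

    // Order each side recursively, then concatenate.
    NSMutableArray *ordered =
        [NSMutableArray arrayWithArray:[self orderedBlocks:before]];
    [ordered addObjectsFromArray:[self orderedBlocks:after]];
    return ordered;
}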

The presentation side is already in place - there's an expandable div that presents the text for the page/spread you're reading, and a link to the printable rendition which invokes JavaScript's window.print method.



Thursday, October 16, 2008
 
Text plan
Current plan:


  1. Move all text extraction into RubberStamp so we don't need to maintain a separate app and scripts.

  2. Add per-title "show text" flag.

  3. Display text in an expandable area on the web page.





 
Text next
Live servers are now patched to make curl work properly with Shibboleth.

Release 6.9.5 has just gone live; this fixes a cookie problem for those with lots of subscriptions (say a couple of hundred).

Next project is to make the text of certain magazine/book pages available directly on our web pages; good for accessibility, Google and flexible printing.

Side note of things to be added to the next web server AMI:


  1. curl patch

  2. tweaked sendmail config





 
Shibboleth, curl and NSS
One thing which needs tweaking: Shibboleth-sp uses curl to request attributes from the identity provider if they've not been pushed in the original communication.

The new servers are using a later version of Fedora, in which curl has been built to use the NSS libraries for SSL support. This breaks the attribute request mechanism, and in the shibd logs you can find errors like this:


Shibboleth.AttributeResolver.Query [5]: exception during SAML query to
https://typekey.sdss.ac.uk:8443/typekey/AA: CURLSOAPTransport failed
while contacting SOAP responder: Unknown cipher in
list: ALL:!aNULL:!LOW:!EXPORT:!SSLv2


A solution is to rebuild curl from the source RPM with different configuration options, as described in this very helpful post.



 
301: Moved permanently
It is a bad plan that admits of no modification.

Publilius Syrus (~100 BC), Maxims


We're now hosted completely on AWS, though we're not yet serving our page images directly from S3. Instead, our old hosting config is replicated almost byte-for-byte on EC2 and EBS (the elastic block store which enables storage to persist between virtual machine restarts).

In the long term there's definitely an argument to be made for moving all the page images onto S3, particularly when Amazon launches its Content Delivery Network on top of S3 later in the year. In the short term it's better to move one step at a time - doing it this way means we don't have to revise our import, logging, mail and hosting systems all at once. The mechanisms are in place to migrate the content over once we're ready.

So far (crossing all possible fingers) the new hosting is working fine. It's substantially cheaper than our previous arrangement and much more flexible. I've already upgraded the database machine, which involved:


  1. Starting a new machine instance

  2. Shutting down the db

  3. Disconnecting the drive from the previous instance

  4. Connecting the drive to the new instance

  5. Restarting the db

  6. Telling the web servers to talk to the new machine

  7. Terminating the original machine



All in all, 3 minutes' work and all done from the command line - no raising tickets or paying for unused hardware (running instances are billed by the hour). Awesome.

Also very cool are elastic IPs, which enable you to re-map your public IPs between running instances. This enables you to repeat the trick above with a web server. Will be trying that later in the month.

We're now running scheduled "snapshots" of our EBS drives. Every two hours the whole thing gets backed up to S3 - it only stores the deltas so the storage cost is low, and you can create a new EBS volume from any snapshot. Of course, all the snapshots are still within AWS so we still need an external backup.

NB: the title is a geek joke. Our URLs haven't changed at all; they're just answered by different machines.



Monday, October 06, 2008
 
Still here
More soon.
