xtim
Friday, January 30, 2004
HighlightEngine is no longer static, it keeps lists in fields to stop creating them all the time.
More optimizations on the list handling in the highlighter. Stuck on a dumb bug to do with concurrent modifications, but that's fixed now and I'm going home!
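For reference, a minimal sketch of the general kind of concurrent-modification bug that turns up once a list lives in a field and is edited while it is being walked. This is a guess at the shape of the problem, not the actual HighlightEngine code: ReusedListDemo, longTokens and the token data are all invented for illustration, and it uses modern Java syntax.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only; not the real HighlightEngine.
class ReusedListDemo {
    // Kept in a field and cleared between calls instead of being reallocated.
    private final List<String> scratch = new LinkedList<>();

    // Returns the tokens longer than minLength. The result is the shared
    // scratch list, so callers must finish with it before the next call.
    List<String> longTokens(List<String> tokens, int minLength) {
        scratch.clear();
        scratch.addAll(tokens);
        // Removing through the iterator is safe. Calling scratch.remove(...)
        // inside this loop instead would make the next it.next() throw
        // ConcurrentModificationException.
        for (Iterator<String> it = scratch.iterator(); it.hasNext();) {
            if (it.next().length() <= minLength) {
                it.remove();
            }
        }
        return scratch;
    }

    public static void main(String[] args) {
        ReusedListDemo demo = new ReusedListDemo();
        System.out.println(demo.longTokens(Arrays.asList("a", "china", "gdp"), 2));
    }
}
```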
Class name Instance count Difference
-------------------------------------------------------------------- -------------- ----------
java.lang.String 108330 + 18496
char[] 74652 + 15151
java.util.LinkedList$Entry 14496 + 14326
java.util.LinkedList$ListItr 13286 + 13286
java.lang.StringBuffer 5734 + 5706
com.xrefer.lucene.entitytokenizer.Token 4651 + 4651
Object[] 6458 + 4069
org.apache.lucene.analysis.Token 3323 + 3322
int[] 3763 + 3095
byte[] 2435 + 1804
java.nio.HeapCharBuffer 1138 + 1138
short[] 1838 + 1070
java.io.IOException 1049 + 1049
T
Optimized the highlighter so it doesn't call String.replaceAll unless it knows it needs to:
Class name Instance count Difference
-------------------------------------------------------------------- -------------- ----------
java.util.LinkedList$Entry 19721 + 19576
java.lang.String 108322 + 18488
char[] 74642 + 15141
java.util.LinkedList$ListItr 13286 + 13286
java.util.LinkedList 5938 + 5935
java.lang.StringBuffer 5732 + 5704
com.xrefer.lucene.entitytokenizer.Token 4651 + 4651
Object[] 6462 + 4069
org.apache.lucene.analysis.Token 3322 + 3322
int[] 3763 + 3095
byte[] 2433 + 1800
java.nio.HeapCharBuffer 1138 + 1138
short[] 1842 + 1070
java.io.IOException 1049 + 1049
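A minimal sketch of the replaceAll guard, assuming the text can be screened with a cheap indexOf first. String.replaceAll compiles a fresh Pattern on every call, so skipping it when nothing can match avoids the regex objects altogether. The GuardedReplace class, the stripBoldTags method and the tag being stripped are hypothetical, not the highlighter's real code.

```java
// Hypothetical helper; the real highlighter's patterns are not shown in these notes.
class GuardedReplace {
    // Strips <b> and </b> tags, but only invokes the regex when a '<' is present.
    static String stripBoldTags(String text) {
        if (text.indexOf('<') < 0) {
            return text; // cheap check failed: no Pattern or Matcher is created
        }
        return text.replaceAll("</?b>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripBoldTags("no markup here"));
        System.out.println(stripBoldTags("a <b>hit</b> in the cell"));
    }
}
```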
T
Tweaked HighlightEngine a little to do lazy instantiation of some Lists which might not always be needed.
Also got it to recycle lists where possible.
New counts:
Class name Instance count Difference
-------------------------------------------------------------------- -------------- ----------
char[] 80302 + 20801
java.lang.String 109588 + 19754
java.util.LinkedList$Entry 19721 + 19576
java.util.LinkedList$ListItr 13286 + 13286
int[] 8827 + 8159
java.lang.StringBuffer 6998 + 6970
java.util.LinkedList 5938 + 5935
Object[] 7728 + 5335
com.xrefer.lucene.entitytokenizer.Token 4651 + 4651
org.apache.lucene.analysis.Token 3322 + 3322
byte[] 2431 + 1800
java.util.regex.Pattern$Slice 1338 + 1338
java.util.regex.Pattern 1337 + 1337
java.util.regex.Matcher 1337 + 1337
java.util.regex.Pattern$BnM 1335 + 1335
java.nio.HeapCharBuffer 1138 + 1138
short[] 1838 + 1070
java.io.IOException 1049 + 1049
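The lazy-instantiation and recycling idea, sketched under the assumption that the list is only needed when at least one match turns up. LazyListHolder and its methods are invented names, not the HighlightEngine API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; field and method names are illustrative.
class LazyListHolder {
    private List<String> matches; // stays null until the first match arrives

    void addMatch(String term) {
        if (matches == null) {
            matches = new ArrayList<>(); // allocated on first use only
        }
        matches.add(term);
    }

    int matchCount() {
        return matches == null ? 0 : matches.size();
    }

    // Recycle the existing list for the next run instead of discarding it.
    void reset() {
        if (matches != null) {
            matches.clear();
        }
    }

    boolean isAllocated() {
        return matches != null;
    }

    public static void main(String[] args) {
        LazyListHolder holder = new LazyListHolder();
        System.out.println(holder.isAllocated());
        holder.addMatch("china");
        holder.reset();
        System.out.println(holder.isAllocated() + " " + holder.matchCount());
    }
}
```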
T
Refactored the highlight process so that callers can create a single token matcher and then apply it repeatedly. This allows the Highlighter to process all table cells/column names/etc. without re-parsing the query for each one.
The new object counts are:
Class name Instance count Difference
-------------------------------------------------------------------- -------------- ----------
java.util.LinkedList$Entry 26386 + 26241
char[] 80306 + 20806
java.lang.String 109591 + 19758
java.util.LinkedList$ListItr 13286 + 13286
java.util.LinkedList 12603 + 12600
int[] 8827 + 8159
java.lang.StringBuffer 6999 + 6971
Object[] 7724 + 5335
com.xrefer.lucene.entitytokenizer.Token 4651 + 4651
org.apache.lucene.analysis.Token 3322 + 3322
byte[] 2433 + 1802
java.util.regex.Pattern$Slice 1338 + 1338
java.util.regex.Pattern 1337 + 1337
java.util.regex.Matcher 1337 + 1337
java.util.regex.Pattern$BnM 1335 + 1335
java.nio.HeapCharBuffer 1138 + 1138
short[] 1838 + 1070
java.io.IOException 1049 + 1049
which looks better, but could still do with some improvement.
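A rough sketch of the parse-once, apply-many shape of that refactoring. This TokenMatcher is a stand-in invented for illustration: the real one works on Lucene tokens, whereas this one just splits on spaces and wraps hits in bold tags.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the token matcher described above.
class TokenMatcher {
    private final Set<String> terms;

    // The query is parsed exactly once, here.
    TokenMatcher(String query) {
        terms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
    }

    // Applied repeatedly to each cell, column name, etc. without re-parsing.
    String highlight(String cell) {
        StringBuilder out = new StringBuilder();
        for (String word : cell.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean hit = terms.contains(word.toLowerCase());
            out.append(hit ? "<b>" + word + "</b>" : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TokenMatcher matcher = new TokenMatcher("china gdp");
        for (String cell : new String[] {"China exports", "Partners", "GDP growth"}) {
            System.out.println(matcher.highlight(cell));
        }
    }
}
```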
T
Profiling doesn't reveal anything too nasty apart from the hit from highlighting dtable contents after a search. Reckon we can improve that, and with the new release coming up on Tuesday I'd like to get the changes in today.
So, perform a search for "china", select the "exports - partners" link on the result list then count object creation for a page refresh.
Currently:
Class name Instance count Difference
-------------------------------------------------------------------- -------------- ----------
java.util.LinkedList$Entry 35710 + 35565
char[] 94294 + 34793
java.lang.String 114254 + 24420
java.util.LinkedList$ListItr 20612 + 20612
Object[] 22384 + 19987
java.util.LinkedList 17265 + 17262
int[] 16816 + 16151
java.lang.StringBuffer 13659 + 13631
org.apache.lucene.analysis.standard.Token 6012 + 6012
org.apache.lucene.analysis.Token 5320 + 5320
short[] 5838 + 5066
java.io.IOException 5045 + 5045
java.util.Vector 4857 + 4722
com.xrefer.lucene.entitytokenizer.Token 4651 + 4651
java.io.StringReader 2673 + 2673
org.apache.lucene.analysis.standard.StandardFilter 2672 + 2672
org.apache.lucene.analysis.LowerCaseFilter 2672 + 2672
com.xrefer.lucene.LoggingFilter 2672 + 2672
org.apache.lucene.search.BooleanClause 2705 + 2672
com.xrefer.parser.util.EntityLegacyReplacer 2005 + 2005
com.xrefer.parser.util.EntityLegacyReader 2005 + 2005
com.xrefer.parser.util.EntityUnicodeHashMapReplacer 2004 + 2004
org.apache.lucene.search.TermQuery 2032 + 2004
org.apache.lucene.analysis.standard.FastCharStream 2004 + 2004
org.apache.lucene.analysis.standard.StandardTokenizerTokenManager 2004 + 2004
org.apache.lucene.analysis.standard.StandardTokenizer 2004 + 2004
com.xrefer.parser.util.EntityIndexingReader 2004 + 2004
org.apache.lucene.index.Term 3563 + 2004
byte[] 2436 + 1803
java.util.regex.Pattern$Slice 1338 + 1338
java.util.regex.Pattern 1337 + 1337
java.util.regex.Matcher 1337 + 1337
org.apache.lucene.search.BooleanQuery 1342 + 1336
java.util.regex.Pattern$BnM 1335 + 1335
java.nio.HeapCharBuffer 1139 + 1139
T
Cache fix is in. Added logging output to highlighter and ContentDAO.
Now profiling before check-in.
We will have to update the site jar to incorporate these changes, so I'll bring February's jar release forward a week to next Tuesday (3rd February).
T
Thursday, January 29, 2004
Highlighting oddness is down to the cache I put in for supposedly-immutable properties.
Will now cache a non-highlighted version and process that if required for a given retrieval.
T
Fixed dtable indexing - col headers are now indexed alongside bodies so that all hits can be found in the extract generation process.
Need to update import jar and reindex the book for this to take effect.
Next problem - the highlight code sometimes affects the chart legends.
T
There's a collection of bad search results if you search for "cia gdp" - no result trimming is taking place so all the results are very long.
T
Wednesday, January 28, 2004
Fixed the value axis labelling for GDP etc. Added new units to handle trillions, which are displayed in scientific notation so we don't run out of zeros.
cia factbook is now imported and we're fine-tuning the display.
Widened chart border on the right-hand side to prevent clipping of the axis text.
GDP chart creates bad axis labelling - hundreds of overlapping labels. The values go up into the trillions, so perhaps we're exceeding a range somewhere. Looking into it.
T
Tuesday, January 27, 2004
Table export now prompts for a destination filename and saves the data. Content type is now text/tab-separated-values and the file is named table{entry_id}.tsv
T
Monday, January 26, 2004
Chart deletion is now working.
At the end of the day, we are left with these still to do:
3. Submit labelling fix to jfreechart in case they want to adopt it.
6. Test on live servers.
7. move dtable handling from entry_dt.jsp to entry.jsp
8. Get export to prompt with file/save dialog rather than display contents.
So, arrange import tomorrow ready to test (6) on wednesday. 7 and 8 can be done ready for tomorrow's site update. 3 can follow.
T
Added that reverse-lookup which has speeded things up a little but not much - we're down to 6 seconds.
Swapped to 2D rather than 3D rendering because I think it looks clearer.
Added to the list: 9. Trim the large empty areas above and below the data section.
T
Tracked down a few rogue processes on my machine which were eating all the memory. Killed them and profiling is now working.
16% of the time to serve the chart goes in looking up index values of keys in arrays. At the moment that's implemented using a linear search, so if we have to look up n keys we're looking at n-squared time. Will replace that with a reverse-lookup hashtable to speed it up.
Another 16% goes in initializing the two fonts as I noted earlier.
15% goes on retrieving the dtable.
10% goes to the initialisation of java word-breaking classes so the chart can build text blocks.
At the moment, it takes about 7 seconds to retrieve a chart.
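The linear search versus the reverse-lookup table, as a sketch. KeyIndex and the sample keys are made up; the chart library's own data structures are not shown here. Each linear lookup is O(n), so n lookups cost O(n²), while the map is built once in O(n) and each lookup after that is expected O(1).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two lookup strategies.
class KeyIndex {
    // Current approach: scan the array for every lookup.
    static int linearIndexOf(String[] keys, String key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    // Proposed approach: build key -> index once, then look up in constant time.
    // putIfAbsent keeps the first occurrence, matching the linear scan.
    static Map<String, Integer> buildReverseLookup(String[] keys) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            index.putIfAbsent(keys[i], i);
        }
        return index;
    }

    public static void main(String[] args) {
        String[] keys = {"China", "India", "USA"};
        Map<String, Integer> index = buildReverseLookup(keys);
        System.out.println(linearIndexOf(keys, "India") + " " + index.get("India"));
    }
}
```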
T
Aha - did manage to get a little information from the process profiler, and it seems that writing the PNG is taking a lot of time. Will switch back to jpeg and see if that's comparable.
Also, 8. Get export to prompt with file/save dialog rather than display contents.
T
Tried profiling the chart routines, but there's not much I can do to speed them up. There are loads of allocations of char[] objects, most of which come from Font allocations in the axis constructors. We could cache the table data in the jsp engine, but it's not that much of a cost to retrieve compared to the charting.
The processor activity profiler can't record a session without crashing, so that's not much help.
T
Today is the day for final tweaks to dtables, their exports and their charts. Would like to release our first dtables this week in the CIA and move on to something else (probably the release of mapper v2).
Things I know need doing:
1. Provide easy access to chart cosmetics so that James can modify colours, etc.
2. Profile and speed chart generation if possible.
3. Submit labelling fix to jfreechart in case they want to adopt it.
4. Move chart generation from testim.jsp into an appropriate file and embed the call to it within another page.
5. Arrange deletion of chart temp files.
6. Test on live servers.
6 will have to wait until we've performed an import of the cia data.
T
Friday, January 23, 2004
Fixed the chart routines - category axis labels are now drawn correctly.
Sending chart as png rather than jpg for improved quality.
Tidied up the image generation jsp.
T
Thursday, January 22, 2004
OK, the demo horizontal bar chart now reproduces the problem. I've raised the number of series from 4 to 50, and the labels now look very odd. Expanding the window doesn't fix the problem, even when there's room to display them properly. So, on to the fix!
T
Working on the chart labels.
Have downloaded and compiled clean copies of both jars.
Demos working fine.
Plan is to alter one of the demo charts to reproduce the problem, then fix it within that framework before testing on our data.
T
Tracing link weirdness with the new philips enc. Amit's launched a relink and I'm watching it to ensure it completes successfully - the last one may have been interrupted. No problems so far.
Tested the dtable data in Excel to see what it does with the charts. Not too impressive - it skips labels from the category axis so that none overlap, but you have to do a lot of tweaking to make them all visible. Got a plan of how to modify the jfreechart code to do what we want.
Double-checked the XML site yesterday for secid behaviour and it already behaves as I described below, so there's nothing to change. Just need to update the documentation.
T
Wednesday, January 21, 2004
Added export function to dtables, so that we can generate some excel-based charts for comparison. We export two copies of numeric columns - the first with the cell text, the second with the actual numeric values used for sorting / charting.
T
Stats are fixed. I had to tweak the xsl a little further:
a/b/c[last()]
doesn't get you the last c in the document which matches a/b/c. It gets you the last c under the first a and first b. I actually wanted
a[last()]/b/c
which does the job nicely (there are many a's but only one b and c for each one).
Published stats.
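The [last()] behaviour can be reproduced outside the stylesheet with the JDK's own XPath engine. In this sketch the a/b/c element names come from the note above, but the stats root element and the text values are invented. Evaluating as a string takes the first matching node in document order, much as xsl:value-of does in XSLT 1.0.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical test document; only the a/b/c shape is taken from the notes.
class LastPredicateDemo {
    static final String XML =
        "<stats><a><b><c>old name</c></b></a><a><b><c>new name</c></b></a></stats>";

    // Returns the string-value of the first node the expression selects.
    static String eval(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // [last()] applies per parent b, so every c qualifies; the first one wins.
        System.out.println(eval("/stats/a/b/c[last()]"));
        // Picking the last a first gives the c we actually wanted.
        System.out.println(eval("/stats/a[last()]/b/c"));
    }
}
```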
T
Tuesday, January 20, 2004
The stats problem occurred only in the annual summaries. It's because the XSLT used to retrieve the client's name grabs the first one listed for the year. I've modified it to use the last listed name for the year and restarted stats generation. Should be finished by tomorrow when I'll put them up.
T
Tracked down the problem with the chart labelling. The text breaking is actually working OK, it's just being given much too small an area into which to fit the text. The calling routine doesn't take account of whether the category labels are perpendicular or parallel to the axis - it always assumes they're parallel, and so gives you axislength / num categories pixels space, which is tiny. Our labels are perpendicular, so that quantity actually corresponds to the height of the labels.
It's going to be a little involved to fix this, but I've tested it with a hardcoded value and the results look worth it. I'll concentrate on getting it to work with horizontally-labelled vertical axes and generalise if it turns out that we need it. I've got both jars building now and debug code in them so we can change whatever we need. If the fix looks generally useful I'll submit it to the maintainers.
Also got to fix an anomaly reported by Claire, which is that the client names on our stats pages don't match their system names if the latter have been updated.
Will additionally modify section id handling on the XML site so that users never have to think about it and can just hand us the secid unmodified from the result page to the entry page. At the moment they may need to add an extra hyphen to reproduce our site behaviour. Will document the system once I've changed this. I don't want to change it today because I want to ensure we've not overlooked any new weirdness introduced by the site software update.
T
Back to the chart work. Text processing (which is giving us the inappropriate line breaks in labels) happens in an external jar. Got that building, now to modify and test.
T
Release wasn't entirely smooth - this is an argument for sticking to your original plans. For a while, entry pages weren't coming back from the site.
The new release depends on changes to the database structure for dtables.
I applied these changes to rsw1 last week in anticipation of the release - the planned cluster-update method would ensure the software went across hand-in-hand with the data changes.
Today, with the disk space issues on rsw1, I went ahead manually - oh dear. Database changes hadn't propagated by that point.
Applied db changes manually and it's all back up.
Problems will have varied in length from server to server, but it was about 15 mins for w2 and 3 for w5. I'm now banging my head against the wall as required.
T
Release day - disk space is low on rsw1 and I don't want to perform a cluster update until we know all sql/reindexing is complete on that machine. Amit was in the middle of updating the machine when it sponked out, and he's away today. I'll update the machines by hand.
T
Monday, January 19, 2004
Deployed new jar on the import system, ready to tackle the world factbook.
The main site release is coming up tomorrow. Until then I'm going to dive into the jfreechart code to see if we can pretty up the labels a bit.
T
Added a new "test" option to the client type field. This is for all the accounts like xdev which aren't really system accounts but are very useful. You know, for testing.
T
Put the new xml pages live. These add support for auto-login based on IP address. Tested the new feature from the office, where it works fine, and also verified that the existing syndication clients work as normal. Will go back to re-check these throughout the week to ensure that token-expiry is handled by our pages as it should be.
Can now give meta-searchers the go-ahead to search based on client-ip.
T
Friday, January 16, 2004
ip-authentication to go live on the xml site on Monday morning, for the benefit of metasearchers.
More tweaks on dtable link presentation.
T
The codebase passed the junit and jsitetest checks, and it's up and running on didcot for user testing.
Tagged the release "temple", as part of our continuing journey around the circle line.
T
Preparing for the release next Tuesday. Brought local codebase up-to-date, now performing a full rebuild and junit test.
T
Thursday, January 15, 2004
Column-link sort now complete. We do this on the page rather than at the server end so that clients have still got access to the table's original column order if they want it.
Links are now in a four-column table. Hmm. Not sure if I like it.
T
Modified the default-column handling in the dtable indexer as described below. Search results and default views both look useful now.
Setting up IP-based login for T meta-search.
Next, sort available-column links alphabetically on page and put them into a table.
T
A rethink on dtable default columns. We axed the default marker on the population column, because it was annoying to have it pop up in unrelated search results. However, this makes the default view of the table very boring.
So, we're going to mark a few cols as default to liven up the basic table display. For indexing, we'll ignore default cols and index yourcol + firstcol. This assumes that the first column will contain useful row-identification text, but that seems fair enough as the erm, default.
T
The perling to fix millions/billions isn't working out - there's too much flexibility (eg the presence of non-table tags) in the human-readable text to do it reliably, and it would be a mammoth job to verify that all has worked properly. Will have to ask the data agency to fix it.
T
An idea for an idle moment: I quite fancy using junit to run a set of live-site diagnostics. Just a sanity check which ensures that all the machines are up, all have a running database, httpd, etc. Maybe collect some table stats from the dbs as well so that we can check consistency.
T
Applying the dtable database changes to rsw1 in preparation for next week's release.
Next: use a bit of perl to fix/verify the cia numeric fields (millions/billions etc).
T
Spent the rest of yesterday hooking up charting for dtables. I'm using the jfreechart toolkit, which seems excellent. There's an oddity with the way it draws axis labels, but it's open-source (yay!) so I can get in there and tweak it to xrefer's own ends. You can now populate and sort your dynamic table, then hit a link at the bottom of the page to chart it. The first column is used for x-axis labels and all numeric columns are charted. The server prepares a jpg and this is sent from a jsp.
Because we're using jserv rather than tomcat, our servlet classes are several versions out of date. This means we can't rely on the jfreechart auto-deletion of temporary files - something to handle manually. Could be worth re-testing tomcat now that it's more mature. When we last looked at it (in about 2002) it just couldn't handle the load jserv manages.
T
Wednesday, January 14, 2004
Applied new topic alterations to xplus. Script is ready to go against the live site when approved.
T
Tuesday, January 13, 2004
Tables looking good. Going to add a link to the "default view" and remove the default flag from the population column.
val attributes need looking at to handle millions and billions.
T
Amended column short-names for cia rml source. Removed all chars apart from lowercase letters and digits - makes secids easier to pass in URLs.
Copied the field dtable into my test file and ran an import. Tidied up some sort code. Looking better, going to ask James and Carl to check it out.
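The short-name clean-up amounts to something like the sketch below. One assumption to flag: it lowercases before stripping, whereas the note above only says that everything except lowercase letters and digits was removed. ShortNames, toShortName and the sample column names are invented.

```java
import java.util.Locale;

// Hypothetical helper; the real change was made in the rml source, not in Java.
class ShortNames {
    // Keeps only a-z and 0-9 so the result can go into a URL without escaping.
    // Lowercasing first is an assumption, not something the notes confirm.
    static String toShortName(String columnName) {
        return columnName.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(toShortName("Exports - partners"));
        System.out.println(toShortName("GDP (per capita)"));
    }
}
```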
T
Non-default sorting has no impact on the stats. Reckon we're done on optimizations for this baby.
Next:
1. check the rml for the cia factbook and ensure dtable markup looks ok.
2. update subject titles and move books around for dewey-based classification.
3. update import process to record vol edition field.
T
Running the same test but with a highlight phrase in place ups the numbers somewhat:
char[] 35k
java.util.LinkedList$Entry 27k
Object[] 27k
etc. The vast majority of these are allocated within Lucene's search functions so I'm going to leave this as it is for now and re-examine it when we update to Lucene 1.3. It's only the first time the user views the table from a result list that they'll see these performance hits - thereafter we don't highlight search terms.
T
Not much to go on with the cpu timings. Only 50% of the overall retrieval time is within retrieveDTable, and almost all of that is in preparing the sql/calling getString on the resultSet. We've gone as far as we can with this test case. The rough timings are down to 2.5 seconds for the page generation and it's not going to get much better.
Will now check for non-default sort and highlighting.
T
Back to dtable optimization. I'm now only retrieving the numerical value of cells for columns with sort order NUM. The new object counts are:
char[] 3.5k
String 3k
byte[] 2k
StringBuffer 1k
I don't think these will get much lower - the percentages of each allocated within retrieveDTable itself are:
char[] 17%
String 20%
byte[] 72%
StringBuffer 1%
so there's only byte[] which really stands out against the background radiation. All of that 72% allocation occurs within the psql driver, so unless we trim the sql call down even further there's nothing to do. Diminishing returns on that. I'll look at the cpu time used instead before moving on.
T
Fixed a problem with November's stats for one client. Their list of top search terms featured a high-ascii character as an attempted wildcard, and that threw off the xml parser - replaced it with a "." and reprocessed.
T
Monday, January 12, 2004
Restructured the data retrieval so that we sort and obtain all the cell data with a single SQL call.
New object totals are:
char[] 4k
byte[] 3k
String 3k
HeapCharBuffer 1k
and it feels faster.
The char and byte arrays are from the single remaining db retrieval and I don't think we can reduce those without further caching. More profiling tomorrow to see if we can get the other totals down. The nokia took up this morning so I'm going to delay release until next Tuesday.
The nokia's set up with putty and working fine. Hurrah for open source.
T
Have added a cache to the ContentDAO which records the invariant DTable properties. Things like num_rows and title are cached.
This reduces the totals for a retrieval to
byte[] 5k
char[] 4k
String 3k
HeapCharBuffer 1k
assuming that there's a cache hit (and it's anticipated that most dtable retrievals will re-use information). Would still like to reduce these further.
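A minimal sketch of an invariant-property cache of this kind. The names are invented (DTablePropertyCache, Props, the loader callback) and the real ContentDAO reads from the database rather than from a lambda. The point is just that the expensive load runs once per entry id.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; not the real ContentDAO cache.
class DTablePropertyCache {
    // Immutable holder for the properties that never change after import.
    static final class Props {
        final String title;
        final int numRows;

        Props(String title, int numRows) {
            this.title = title;
            this.numRows = numRows;
        }
    }

    private final Map<Integer, Props> cache = new HashMap<>();
    private int loads;

    // synchronized because a DAO like this would be shared between request threads.
    synchronized Props get(int entryId, Function<Integer, Props> loader) {
        Props props = cache.get(entryId);
        if (props == null) {
            props = loader.apply(entryId); // the expensive database read goes here
            loads++;
            cache.put(entryId, props);
        }
        return props;
    }

    synchronized int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        DTablePropertyCache cache = new DTablePropertyCache();
        cache.get(327, id -> new Props("Exports - partners", 200));
        cache.get(327, id -> new Props("Exports - partners", 200));
        System.out.println(cache.loadCount());
    }
}
```

Only values that really are the same for every request belong in a cache like this. Anything that varies per retrieval, such as highlighted text, has to stay out of it.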
T
Top object allocations during table retrieval are
byte[] 7k
char[] 5k
String 3k
HeapCharBuffer 2k
and they all occur during the database retrieval calls. Might be possible to bundle more information into each call and reduce the overall number of retrievals.
T
The re-installed ssh client on the Nokia 9210 crashes once it's actually logged you in to your destination server. Can't find out why, and SSH have stopped selling/supporting it since October last year. Blugh.
There is now an open source alternative at: http://s2putty.sourceforge.net/ , but sourceforge isn't allowing downloads at the moment. Will try again later.
Back to dtables in the meantime.
T
Rearranged desks, again.
Can't think of a good reason to keep the results extract entities in plaintext - will look at this again before upgrading them to full rendering.
Psion 9210 has reset itself - now to reinstall ssh for site support.
T
Friday, January 09, 2004
Fixed the handling of default sort within dtables. The section path you retrieve from the returned dtable includes the appropriate sort parameter which you would need to regenerate the table you're seeing.
T
December's stats and the annual summaries are done, cleaned up and published.
Next, review project list before getting back to dtables.
T
Fixing stats stuff ready for publication. Need to trim out search modifiers from the listed top search terms (eg. ":us ").
T
To fix: if you're looking at a table with the default sort and hit the "sort" link, the order doesn't reverse as it should.
T
Thursday, January 08, 2004
Result truncation now fixed: the reassembly of the dtable entry includes some newlines which weren't there before. This doesn't matter to the xml parsers, but the result extract process uses regexes to select text. Now fixed using the word/nonword technique described earlier.
T
Wednesday, January 07, 2004
Got dtable to include the table cells when recreating the table source.
Was getting transaction timeouts during import, but I think that's just down to the vast amount of log information it was publishing. I've lowered the log levels and the import works fine.
More regexp weirdness, this time in DirectSearcher's extract method. The method wasn't handling newlines well, I've fixed it as before with the word/non-word match. Results look OKish but occasionally over-truncated. Will track this down tomorrow.
T
Tuesday, January 06, 2004
Basic reassembly is now working - dtable, pre, post and header rows taken care of. Will tackle the table body tomorrow.
Parser gets the DTable bean to reassemble the markup - this is much more efficient but will need to be moved back towards the parser if we ever do generalise the dtable feature across different markups (see below).
T
Needed a small readjustment to get the ordering correct in EntryBean - was storing dtables before establishing the correct entry id. Now fixed.
T
Preview sorted out - for the moment. The paragraph styles weren't working because the p elements were marked runon="on", which effectively removes them. The section display didn't need a hyphen - it's never expected one. Instead the lower level of sections should be marked b="s" so that they open whenever displayed. Now fixed.
Back to assembleSource before lunch.
T
Quick diversion to sort out some display oddities on the preview site: Paragraph indent attributes aren't getting passed through, and the link to open a section isn't appending a hyphen to the new section path. These are both for James testing the CIA factbook before import.
T
Just can't seem to get the java replaceAll function to match "any number of any character, including newlines" using my standard perl form, [.\n]* so I've fallen back to [\w\W]* (any word or non-word character), and that's working OK. Nng.
So, the disassembleSource function now works, On to the assembleSource function.
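A note on why [.\n]* fails: inside a character class the dot is a literal full stop, so that pattern only matches runs of dots and newlines. The sketch below shows the three variants side by side. The dtable tags and sample text are stand-ins, and (?s) (DOTALL mode) is an alternative to the [\w\W] trick rather than what the code here actually uses.

```java
// Hypothetical demo; the tag name and sample text are made up.
class DotInCharClassDemo {
    static final String SOURCE = "<dtable>\nrow one\nrow two\n</dtable>";

    // Replaces the whole block, using bodyRegex to match what lies between the tags.
    static String strip(String bodyRegex) {
        return SOURCE.replaceAll("<dtable>" + bodyRegex + "</dtable>", "[removed]");
    }

    public static void main(String[] args) {
        // Literal dots and newlines only: no match, source comes back unchanged.
        System.out.println(strip("[.\\n]*").equals(SOURCE));
        // Any word or non-word character: matches across newlines.
        System.out.println(strip("[\\w\\W]*"));
        // DOTALL flag: '.' now matches newlines too.
        System.out.println(strip("(?s).*"));
    }
}
```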
T
Parser split working apart from frustrating problem with regexp. Can't see why it's failing to match. Now testing bit-by-bit.
T
Monday, January 05, 2004
Amended setText appropriately and removed the other per-field mutators.
Next, the parser interaction which will split/reassemble the source.
T
Test import looks ok! Here's a sample:
FINEST: Updated entry info: EntryBean:
m_nId = 0
m_nVolId = 327
m_nLocalEntryId = 4
m_svText =
This is a small footstool. Not a table.
m_nTextChunk = 0
m_svHeading = again, not a table (pachinko)
m_svHeadingAux = uvavu
m_svAlphabeticalHeading = againnotatablepachinkouvavu
m_nContextEntry = 0
m_nLanguageId = 1
m_et = EntryType [m_svIdentifier=standard, m_nOrdinal=0]
m_bIsModified = true
Now, lunch.
T
Added a new method to EntryBean, updateEntryInfo, which updates all the fields which we can deduce from the entry source. This is publicly available (eg for when we want to update headings using a new entity map) and is called by EntryBean itself during a create. Creation is now much simpler for the client, as they don't have to calculate any of these fields themselves. Will run a test import, then add the call into setText as well and delete the existing setX methods for the fields affected. 3rd parties shouldn't be able to set fields which differ from those specified in the source!
Once that's done, add a similar method to split/recreate the source for dtables and others.
T
Been thinking about the design over the weekend and I reckon it's OK. I'm going to make a few changes so that the entry takes a more active role in managing its data. At the moment, it's the import parser which runs the text through an entry parser to obtain the heading, entry type etc. I'd like to make that the responsibility of the entry itself. That way we avoid nasty race conditions about setting entry text before entry type when the entry text will be split off into different objects.
So, to summarise the current plan:
1. Entries will take responsibility for finding their own parser and applying it to the entry text. This means their create methods can be simplified.
2. Import parser will be simplified, as it can let the entry decide its own heading and type.
3. When the entry's text is changed, it will run it through an appropriate parser to create any secondary objects (eg dtables) and only store the potentially modified source.
4. When getText is called on the entry, it will reassemble the original text using the parser.
This way, all markup-specific work is performed by the parser layer. The entry's get/set text calls deal in pure raw data, as originally intended. We get an efficient storage mechanism for entries which contain vast amounts of data which is better represented elsewhere (eg dtables).
It's a plan. It's a Monday. I've moved my desk. Time to get going!
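Points 3 and 4 of the plan, sketched in miniature. Everything here is hypothetical apart from the getText/setText and disassembleSource/assembleSource names that appear in these notes: the real EntryBean is an EJB, and the toy parser below just cuts out an assumed dtable element and leaves a placeholder behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser contract: split secondary objects out, put them back later.
interface EntryParser {
    String disassembleSource(String source, List<String> extracted);
    String assembleSource(String stored, List<String> extracted);
}

// Toy parser; the tag names and placeholder are assumptions for illustration.
class ToyDTableParser implements EntryParser {
    private static final String OPEN = "<dtable>";
    private static final String CLOSE = "</dtable>";
    private static final String HOLE = "<dtable/>";

    public String disassembleSource(String source, List<String> extracted) {
        int start = source.indexOf(OPEN);
        int end = source.indexOf(CLOSE);
        if (start < 0 || end < start) {
            return source; // nothing to split off
        }
        int stop = end + CLOSE.length();
        extracted.add(source.substring(start, stop));
        return source.substring(0, start) + HOLE + source.substring(stop);
    }

    public String assembleSource(String stored, List<String> extracted) {
        return extracted.isEmpty() ? stored : stored.replace(HOLE, extracted.get(0));
    }
}

// The entry owns its parser, stores only the stripped text, and rebuilds on demand.
class SketchEntry {
    private final EntryParser parser;
    private final List<String> secondary = new ArrayList<>();
    private String stored = "";

    SketchEntry(EntryParser parser) {
        this.parser = parser;
    }

    void setText(String source) {
        secondary.clear();
        stored = parser.disassembleSource(source, secondary);
    }

    String getText() {
        return parser.assembleSource(stored, secondary);
    }

    String storedText() {
        return stored;
    }

    public static void main(String[] args) {
        SketchEntry entry = new SketchEntry(new ToyDTableParser());
        entry.setText("intro <dtable>cells</dtable> outro");
        System.out.println(entry.storedText());
        System.out.println(entry.getText());
    }
}
```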
T
Friday, January 02, 2004
So anyway, the new system (as it stands) does put the responsibility on the parser of deciding what type an entry is, including whether or not it's a dtable. That seems correct. During import, we create the entry and then check with the parser as to whether it's a dtable. If so, we invoke the dtable parser on it. That's what takes care of creating a matching DTableBean.
This is where the model breaks slightly: there's only one dtable parser. There should really be one for every type of markup which can support dtables. As it happens, no other markups can support them so that's OK, but there's nothing in the way the dtable parser is written to remind you that it only knows about RML31. Perhaps I'd feel happier if we did group it under the parsers rather than in its own package. That might help a lot - the rml specific stuff would then be back where it belongs.
Let's say we do that; how do we stop the EntryBean from storing and retrieving the dtable data? What about the import parser getting the now-rml-specific dtable parser to handle the data as usual, then using it to strip the dtable data from the entry text? EntryBean gets updated with stripped text, job's a good 'un. On retrieval, entries can check what type they are and get the appropriate parser to reconstitute themselves.
I think I'm happier with that. Will think it through again over the weekend.
T
This feels really awkward, as it's cutting through the nice neat object model which has worked well so far. Time for a review.
The EntryBean looks after data storage. It associates a piece of data with each entry id, but knows nothing about how to interpret the data.
Each entry has an associated markup id. This identifies which parser to use with an entry. The parser knows how to go from the entry's data to our domain-specific objects: headings, bodies, links, etc.
Nothing else knows this. Everything else in the system uses the domain objects. This means that we can revise RML or introduce completely different markup (eg non-xml) without rewriting the system. Just add a new parser which understands the markup, associate it with the new entries and you're away. Linking, indexing, all the rest will just work. That's useful.
Then we come to the problem of dtables. In theory, these could work on exactly the same lines as the existing structure. The EntryBean would retrieve the marked-up data, the parser would decode it and all would be well. There are two problems:
1. we want to offer flexibility which doesn't sit well with that approach: users can sort and hide columns or use the data in graphs.
2. the full entry data is huge.
So it's not practical to parse the full table every time we want to display a view of it.
The solution we've developed is to pre-parse the entry on import and store the table fields separately in the database. This gives us random access to the cell data, column list, etc. We can create custom views of the table quite efficiently. This happens through a dedicated API call, separate from the standard entry retrieval call. Unfortunately, the user doesn't know to call that function until they've called the standard method and discovered that the entry is, in fact, a table.
It's too expensive to process the whole entry source at that point (particularly as we're not going to use it, we're going to use the dedicated table data instead). Ideally we'd strip the dtable information out of the entry before it's written to the db.
Problem: if we make that the responsibility of the EntryBean, it's starting to interpret the data and that wasn't part of the deal. If we make it the responsibility of the parser then everything which ever calls getText on the EntryBean will have to handle the possibility that it's not getting the full text.
Hmm.
Will continue in next post.
Oh dear. The test Lucene index under /tmp has disappeared. In trying to re-create it, I've come up against a problem: with the dtable edited out of the entry, there's not much to index.
This will be a drag if we follow this approach in the live system. Once an entry has been imported, indexed, linked and dtabled, we can theoretically remove the dtable element. If we do that, however, we've got to reinsert it whenever we reindex or relink the vol.
Alternatives:
1. Do remove it, but get the indexing system to work directly from the dtable information rather than the source rml.
2. Don't remove it, and hope that the de-chunked entry retrieval is sufficiently fast.
I don't hold out much hope for option 2. I envisage quite large dtables - the CIA example is already huge. Having to wheel all that data around every time we want to display the entry doesn't appeal.
I think we should go for dtable-aware tools and invoke them when required. Ultimately it gives us more flexibility, at the cost of some complexity. They should all live within the dtable package. It may be that we develop sibling packages in future to deal with other specialised entry types (atlas, video, etc).
So, what are we going to need?
1. dtable-aware indexer.
2. dtable-aware linkers.
3. ability to apply the right tool to the right entry.
But hang on! Surely the point of all the original work with the parser layer was to isolate changes in the data representation from the downstream tools. Can't we hide the changes at that level and leave the tools as they are? How about if the Entry takes on the job of spotting that it's a dtable and splitting the RML appropriately? The EJBs are only used during import and maintenance, so it can hide the dtable handling within its own encapsulation for that (where speed isn't so important). For display, where we use the DAO directly, we can end up with a nice compact de-dtabled entry text.
Sounds good. Now to re-appraise the entry import process and see if it's feasible to hide all dtable splitting within EntryBean.
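A minimal sketch of the split itself, assuming the table lives in a single, non-nested element literally named dtable (the real RML tag and attributes are not given in these notes). DtableSplitter and SplitEntry are hypothetical names, and plain string searching stands in for a proper XML parser.

```java
// Hypothetical result of splitting an entry into its table and non-table parts.
final class SplitEntry {
    final String entryText;   // entry markup with the table element removed
    final String tableMarkup; // the removed element, or null if none was found

    SplitEntry(String entryText, String tableMarkup) {
        this.entryText = entryText;
        this.tableMarkup = tableMarkup;
    }
}

final class DtableSplitter {
    // Assumed element name; the real RML tag is not given in these notes.
    private static final String OPEN = "<dtable";
    private static final String CLOSE = "</dtable>";

    static SplitEntry split(String rml) {
        int start = rml.indexOf(OPEN);
        if (start < 0) {
            return new SplitEntry(rml, null); // not a table entry
        }
        int end = rml.indexOf(CLOSE, start);
        if (end < 0) {
            return new SplitEntry(rml, null); // unbalanced markup: leave untouched
        }
        end += CLOSE.length();
        String table = rml.substring(start, end);
        String rest = rml.substring(0, start) + rml.substring(end);
        return new SplitEntry(rest, table);
    }
}
```

The compact entryText is what display would read; the tableMarkup half would go to the dtable parser on import and could be spliced back in for reindexing or relinking.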
T
Last year's notes seem to make sense. Removing the highlight code again and getting the tests back up and running. Will check for consistency.
These are the notes from the optimisation work I was doing before Christmas:
Profiling dtable retrieval
mark object counts after search (india export)
and retrieval of entry.jsp
count objects created by delivery of entry_dt.jsp
287,000 instances of LinkedList$Entry
99.8% from HighlightEngine.highlightTerms
247,520 instances of LinkedList$ListItr
97% from highlightTerms.
180,018 instances of String
92% from highlightTerms.
So, highlighting could be a big hit.
1. Time 10 retrievals with existing code.
2. Remove highlight call and time again.
10 refreshes took 1 minute 20 seconds
Now with highlighting disabled:
50 seconds (saved 37.5%).
So, disable it for now and reintroduce later on. 5 seconds per request is still too slow.
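For the record, the arithmetic behind those two figures, as a small sketch (TimingComparison is just an illustrative helper, not part of the codebase): 80 seconds down to 50 is a saving of 30/80, and 50 seconds over 10 retrievals is 5 seconds each.

```java
final class TimingComparison {
    // Time saved, as a percentage of the baseline.
    static double percentSaved(double baselineSeconds, double newSeconds) {
        return (baselineSeconds - newSeconds) / baselineSeconds * 100.0;
    }

    // Average time per request over a timed batch.
    static double secondsPerRequest(double totalSeconds, int requests) {
        return totalSeconds / requests;
    }
}
```

With the numbers above, percentSaved(80, 50) gives 37.5 and secondsPerRequest(50, 10) gives 5.0.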
Test in profiler again for new object counts.
Much better! Highest is 16k for char[].
48% of these allocated in TextChunker.read - is there any way we can remove this?
not really, unless we don't store the table content in the entry.
OK, if we cut the dtable element from the stored RML then:
1. it might be much faster
2. we can regenerate it if required from the representation in the dtable SQL tables.
Let's try by hand-editing the stored entry text and see how much faster it is.
Other big hits before we rewrite:
15k byte[]
12k char[]
10k String
7k HeapCharBuffer
7k HeapByteBuffer
7k GetPropertyAction
4k Object[]
Now to edit the db and try again.
Frankly, no better. Why's that?
Changes weren't saved!
Will try again.
Amended transaction attribute in ejb-jar.xml and redeployed.
OK, it was actually a bug in EntryBean, which must have been there since the start. If you shorten an entry below the chunk threshold, the chunk field is never cleared. Now fixed.
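A sketch of the shape of that bug and its fix, using a hypothetical ChunkedEntry class rather than the real EntryBean (whose fields are not shown here). The threshold is tiny purely for illustration. The important line is the unconditional clear at the top of setText: without it, shortening an entry below the threshold leaves the old chunks behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a chunked entry record; not the real EntryBean.
final class ChunkedEntry {
    static final int CHUNK_THRESHOLD = 8; // tiny, for illustration only

    private String inlineText = "";
    private final List<String> chunks = new ArrayList<>();

    void setText(String text) {
        // Always clear old chunks first, so shortening cannot leave stale data.
        chunks.clear();
        if (text.length() <= CHUNK_THRESHOLD) {
            inlineText = text;
            return;
        }
        inlineText = "";
        for (int i = 0; i < text.length(); i += CHUNK_THRESHOLD) {
            chunks.add(text.substring(i, Math.min(text.length(), i + CHUNK_THRESHOLD)));
        }
    }

    String getText() {
        if (chunks.isEmpty()) {
            return inlineText;
        }
        StringBuilder joined = new StringBuilder();
        for (String chunk : chunks) {
            joined.append(chunk);
        }
        return joined.toString();
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

In this sketch, a reader that prefers chunks whenever they exist would keep returning the old long text if the clear were skipped, which matches the "changes weren't saved" symptom.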
Back to the optim.
new figures are
8k char[]
7k byte[]
5k HeapCharBuffer
2k StringBuffer
better but still high!
10 retrievals now take 24 seconds. Much better.
Putting the highlight code back in for pre-Christmas-break check-in.
Right then - first workday of 2004 is a Friday. Time for a bit of planning before the office fills up next week.
My current to-do list looks like this:
Task                                                   Owner      Status           Date
-----------------------------------------------------  ---------  ---------------  ----------
unchunk entries                                        tim bruce  unassigned       2004-01-02
January release                                        tim bruce  unassigned       2004-01-02
fix long extract                                       tim bruce  unassigned       2004-01-02
Fix the link to your search results                    tim bruce  unassigned       2003-12-19
ZDOTB                                                  tim bruce  commenced        2003-12-09
optimise dtables for release                           tim bruce  commenced        2003-12-03
indent on p tag                                        tim bruce  waiting_signoff  2003-12-01
checkout visualisation magazine                        tim bruce  commenced        2003-11-25
handle russian search                                  tim bruce  unassigned       2003-11-24
checkout shibboleth                                    tim bruce  commenced        2003-11-13
clear superseded vols                                  tim bruce  commenced        2003-11-13
summarise outstanding v2 work on hawkins blog          tim bruce  unassigned       2003-11-05
next mapper iteration                                  tim bruce  commenced        2003-10-13
person find by descriptor to use full forenames        tim bruce  commenced        2003-09-30
fix contributor links for mapper                       tim bruce  unassigned       2003-09-26
speed up indexing                                      tim bruce  unassigned       2003-09-24
adam/paper                                             tim bruce  unassigned       2003-09-18
Upgrade to Athens 3.6                                  tim bruce  unassigned       2003-09-15
evaluate java 1.4.2                                    tim bruce  commenced        2003-07-03
annual stats for clients                               tim bruce  waiting_signoff  2003-06-16
derive alphabetical heading from simple heading        tim bruce  unassigned       2003-05-30
Upgrade to lucene 1.3 when released                    tim bruce  unassigned       2003-04-28
switch to UIL for admin jms                            tim bruce  commenced        2003-04-08
add num xreferences to meta fields for search          tim bruce  unassigned       2003-04-03
speed up list clients page on admin                    tim bruce  unassigned       2003-03-10
add account expiry field to admin system               tim bruce  unassigned       2003-02-18
Remove withdrawn titles                                tim bruce  waiting_signoff  2003-02-10
web service interface                                  tim bruce  unassigned       2003-01-15
integrate media update into import                     tim bruce  unassigned       2002-12-05
integrate generation of search thumbnails into import  tim bruce  unassigned       2002-12-05
The plan is to clear the decks with the January release, which will be going up on Tuesday the 13th. I'd like to put dynamic tables live at that point and include any server-side changes we anticipate for the next version of the mapper. I'll include all other fixes which I can fit in. Once the January release is up, I'll finish work on version 2 of the mapper, put that live and move on to look at the atlas implementation.
Unchunking entries will speed up entry display and simplify administration (in particular the removal of withdrawn books). The chunker was put in to work around limitations on field size with postgres 6.5 - now that we're up to 7.3 it's no longer needed. Will write a small app to de-chunk the data after this month's release.
T
Hi!
This is Tim's workbook, set up in a flurry of efficiency at the start of 2004. It's for all the notes which I would otherwise lose somewhere on my desk.
We're not confidential here, but neither are we fascinating. You probably want the next blog along.
T