xtim
Thursday, September 30, 2004
 
Tracked down the stats issue - jacket pages don't count as entry retrievals. Resolution postponed until new stats site design.

Now, back to the search mechanism.

T
 
Not the most interesting morning: regenerated stats for a college which had none recorded for August - don't know why.

Also tracked down an Athens login error to a typo in the account setup.

This afternoon, one more stats thing to look at before getting back to struts.

T
Wednesday, September 29, 2004
 
Added basic area-searching to the new struts pages.

Modified pawn2 to put it live on the office servers (not a big change, just the *.do forwarding).

Looks useful. Next step is to move the search handling itself up into the Action and beef up the ActionForm.

T
 
search and results actions in place.

A quirk: struts.jar is at the webapp level. xrefer.jar is at the common level. If our struts actions sit in our regular jar, they get loaded by the root classloader and so can't see the struts classes. Split our jar into xrefer.jar and xrefer_struts.jar - one global, one local. Modified build.xml appropriately.

We could do with a spring-clean of the build process sometime soon.

T
 
Struts 1.2.4 is out! Installed it ready for the work on the new search mechanism.

Unfortunately, I couldn't get it to do anything. The config files were getting read, as I could introduce errors and get them logged. No requests generated debug output. Finally twigged that it was apache's config I needed to tweak; it's currently set to forward *.jsp to tomcat - now it has to forward *.do as well.

*.do done.

Next to set up the new framework - we want to handle search.do and results.do. I'll build the basic pages under those for existing search before factoring in our new choices.

T


Tuesday, September 28, 2004
 
Now hosting the central copies of the site in my local tomcat (as we used to do with xlocal and jserv).

Beefy, strutty search tomorrow.

T
 
FC2 installed and updated. Looks good.

Modified pick-and-mix system so that we list books in a consistent order between web forms and emails.

T
Monday, September 27, 2004
 
My machine's sufficiently out-of-date that I can't run the master copy of the site locally.

Fedora Core 2, here I come.

T
 
Live site memory allocations - accident free for 4 days!

T
Wednesday, September 22, 2004
 
No OOMs today!

Marked all unused volumes as withdrawn locally and on w1, ready for the next cluster update.

Created BOBToken system to handle back-of-book subscription deals. You can create tokens and check validity. Junit tests are in place.
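For illustration only, the create/check pair could be sketched like this. Everything here is an assumption: BobTokenRegistry, the token alphabet and the in-memory set are hypothetical stand-ins, not the BOBToken code itself, which would presumably persist tokens somewhere durable.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: issues random tokens and remembers them in memory.
final class BobTokenRegistry {
    // 24 letters plus 8 digits, leaving out the easily confused 0/O and 1/I.
    private static final String ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

    private final SecureRandom random = new SecureRandom();
    private final Set<String> issued = new HashSet<>();
    private final int length;

    BobTokenRegistry(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be at least 1");
        }
        this.length = length;
    }

    /** Creates a token that has not been issued before and records it. */
    synchronized String create() {
        String token;
        do {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
            }
            token = sb.toString();
        } while (!issued.add(token)); // retry on the rare collision
        return token;
    }

    /** A token is valid only if this registry issued it; case and padding are ignored. */
    synchronized boolean isValid(String candidate) {
        if (candidate == null) {
            return false;
        }
        return issued.contains(candidate.trim().toUpperCase(Locale.ROOT));
    }
}
```

Usage would be along the lines of `String t = registry.create();` followed by `registry.isValid(t)` when the voucher is presented.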

T
Tuesday, September 21, 2004
 
Tweaked pool config on the live servers - now max 10 resources per pool, 3 min spare.

This means that even at maximum usage we shouldn't run OOM. Tested lots today, refreshing multiple search tabs simultaneously and they get serialised as expected. This is a good thing. We also run at a lower overhead overall.
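The idea of a hard cap with waiting callers can be sketched as below. This is not our pool manager: BoundedPool and its methods are hypothetical, it uses java.util.concurrent from later JDKs, and it only pre-creates the spare resources rather than maintaining that number over time.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of a pool that never creates more than max resources.
final class BoundedPool<T> {
    private final BlockingQueue<T> idle;
    private final Supplier<T> factory;
    private final int max;
    private final AtomicInteger created = new AtomicInteger();

    BoundedPool(int max, int minSpare, Supplier<T> factory) {
        if (max < 1 || minSpare < 0 || minSpare > max) {
            throw new IllegalArgumentException("need max >= 1 and 0 <= minSpare <= max");
        }
        this.max = max;
        this.factory = factory;
        this.idle = new ArrayBlockingQueue<>(max);
        for (int i = 0; i < minSpare; i++) {
            created.incrementAndGet();
            idle.add(factory.get());
        }
    }

    /** Returns a resource, or null if none became free within timeoutMillis. */
    T acquire(long timeoutMillis) {
        T resource = idle.poll();
        if (resource != null) {
            return resource;
        }
        // Nothing idle: create a new resource only while we are under the cap.
        while (true) {
            int n = created.get();
            if (n >= max) {
                break;
            }
            if (created.compareAndSet(n, n + 1)) {
                try {
                    return factory.get();
                } catch (RuntimeException e) {
                    created.decrementAndGet(); // give the slot back
                    throw e;
                }
            }
        }
        // At the cap: wait for a release, which serialises the excess requests.
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Hands a resource back so a waiting caller can pick it up. */
    void release(T resource) {
        idle.offer(resource);
    }

    int created() {
        return created.get();
    }
}
```

With `new BoundedPool<>(10, 3, factory)` the eleventh simultaneous request waits for a release instead of opening another set of indexes, so memory use is bounded by ten resources.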

Updated stats pages to fix image link.

Updated admin pages similarly.

Integrated lucene 1.4 mergefactor changes into current codebase, redeployed on import server. Index merge now takes 1 hour, which is still not great, but better.

Extended search opportunities looking promising. Next project is to move the whole search operation to struts. We're looking for:

1. Compatibility with published URLs.
2. A no-hits page which remembers what you're after.
3. A pre-search filter which simplifies your search if it's just unreasonable.
4. Support for extended search options.

Matt did his thing; updated US server won't boot? Fax over a bash script and talk the salesman through it. That boy is a diamond.

In other news, today marks two years since Johanna and I got married. Gone by in a flash.

T
Monday, September 20, 2004
 
Right, we're drawing a line under this.

The configuration update last Tuesday seems to have solved the long-term OOM errors. The servers have been running since then without a recurrence of the problem.

We have, however, seen a couple of OOMs - two of them this morning on w3. In each case we've traced the cause to "aberrant behaviour" from the clients - weird searches (or many simultaneous near-identical searches) which operate as a DoS attack on the server concerned.

So then, the things to do:

1. tweak pool config on each machine so mem can hold theoretical max size of pool.
2. refactor search handling via struts and include a sanity checker for search requests - too many clauses or too odd and you get a simplified search (or none at all).
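
A rough sketch of what the sanity checker in item 2 might look like. The class name, the term limit and the rules for what counts as "too odd" are all assumptions for illustration; the real rules are still to be decided.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a pre-search filter.
final class SearchSanity {
    private SearchSanity() {}

    /**
     * Returns the query unchanged if it looks reasonable, a simplified query
     * (the first maxTerms plain terms) if it has too many terms, or null if
     * there is nothing searchable in it at all.
     */
    static String simplify(String raw, int maxTerms) {
        if (raw == null || maxTerms < 1) {
            return null;
        }
        List<String> terms = new ArrayList<>();
        for (String token : raw.trim().split("\\s+")) {
            String upper = token.toUpperCase(Locale.ROOT);
            if (upper.equals("AND") || upper.equals("OR") || upper.equals("NOT")) {
                continue; // boolean operators are not terms
            }
            // Strip anything that is not a letter or digit (wildcards, quotes, etc.).
            String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        if (terms.isEmpty()) {
            return null; // too odd: no search at all
        }
        if (terms.size() <= maxTerms) {
            return raw.trim(); // reasonable: leave it alone
        }
        return String.join(" ", terms.subList(0, maxTerms)); // too many clauses: simplify
    }
}
```

For example, `SearchSanity.simplify("cat AND dog", 8)` leaves the query alone, while a query made only of punctuation comes back as null and the caller can show a no-search page.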

T
 
OK, I think we've identified the problem. I'm trying to recreate it on xplus now.

If you submit a number of complex search queries quickly, so that the pool resources all get used in handling them, the pool manager tries to extend the pool to cope. Either it can't grow fast enough or some internal resource fight occurs, and the size of the pool goes up with 0 free. Each resource opens a full set of lucene indexes, which will eat memory very quickly.

Reproduction of this problem is very dull - I'm watching xplus do nothing, hoping eventually for a maintenance page. If it does reproduce the problem I'll set about solving it. First thing would be to find out why searchBySubject is taking so long.

Reindexing of rml3 vols continuing.

Modified index-merge code to back-port the speedup from 1.4. Will deploy onto the import server this afternoon, merge and then test the new search possibilities.

Also on the cards for today: number generator for back-of-book account vouchers.

T
 
In a word, no. We can't walk the heap and dump stats beyond a basic total mem/free mem, which we're now including on the maint page mails.
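
The total/free figures come from java.lang.Runtime, whose totalMemory() and freeMemory() calls are standard JDK. A minimal sketch of the kind of line that could go into the mail, assuming a hypothetical MemoryReport helper rather than the actual maint page code:

```java
// Hypothetical helper: formats the JVM-wide heap figures for a notification mail.
final class MemoryReport {
    private MemoryReport() {}

    /** Reads the current figures from the running JVM. */
    static String summary() {
        Runtime rt = Runtime.getRuntime();
        return format(rt.totalMemory(), rt.freeMemory());
    }

    /** Formats byte counts as whole megabytes; used = total - free. */
    static String format(long totalBytes, long freeBytes) {
        long mb = 1024L * 1024L;
        return "heap total=" + (totalBytes / mb) + "MB"
                + " free=" + (freeBytes / mb) + "MB"
                + " used=" + ((totalBytes - freeBytes) / mb) + "MB";
    }
}
```

These are whole-heap numbers only, so they say how much memory has gone but not what is holding it.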

Quiet weekend overall, but this morning w3 spannered at about 7:30. I restarted services but it went again at 9am. This is actually good news - that seems way too fast for it to be a hard-to-trace tiny-but-incremental leak. Instead, something in that short period ate all the memory. Today I'm going through the logs to find out what.

T
Friday, September 17, 2004
 
w5 again last night, again a couple of OOMs from which it recovered.

Lifetime analysis of a single DQT yesterday didn't reveal any obvious leaks.

Is it possible to walk the heap when we intercept an OOM and dump to a file? That, ladies, is today's question.

T
Thursday, September 16, 2004
 
w5 served two maint pages last night due to memory problems. It recovered and continued, but I'm disappointed it's still happening.

Spent today analysing page retrievals in optimizeit.

orac itself keeps running out of memory while testing, which makes it tricky.

Now running against a reduced index to make things more manageable.

Resources are expiring and recreating fine, there are no lingering references to DQTs. That was the big concern.

How about throughout a DQT's lifetime?

T
Wednesday, September 15, 2004
 
Spent most of yesterday working on a fix for our OutOfMemoryError problems. Came up with a plan which we rolled out yesterday at 3pm - it's been quiet since then and I've got everything crossed. Essentially we've shifted our stuff from the per-webapp deployment areas into a common area. This gets the singletons (in particular our pool manager) to work as per-server singletons rather than per-application singletons. It at least halves our memory requirement and brings it closer to the tried-and-tested jserv configuration.

Added support for "scit" citation elements, linking them to the specified full citation.

Added new SectionTypeCitation which handles display quirks for our new section type.

Performed a test import and all seems good.

Added new meta-information indexing to broaden our search scope.

T


Monday, September 13, 2004
 
Assembled a more detailed breakdown of the memory errors, mailed to Carl.

Replied to concerns from DL of G.

Tackle remaining email requests, then focus on finding fixes for memory exceptions.

T
 
OK, next thing to tackle is the mass of maint page notifications.

I've had a quick scan through and they break down as follows:

1. vast majority are "out of memory" exceptions on searches.
2. one from a topic page redirection. I suspect this is a phantom (observed before; page is served correctly despite maintenance notification). I put in a controller last week so there should be few direct accesses to topic which require redirection - perhaps someone has bookmarked the converter?
3. Unattributed maints from the free site. This would indicate there's no exception available to report. Can't think why that would be (given that even out-of-memory exceptions are getting reported elsewhere) - perhaps direct requests? Though I can't think why. Added extra handling to maint page which explicitly reports whether an exception's available.

T
 
repeated pm submissions not getting handled because the admin system's down (running, not serving pages and who knows what else is wrong in there). Restarted.

pm_check now coming up clean.

T
 
Also, can't ssh to orac though I can ping it. Using emperor for now.

T
 
Back from holiday. Italy was excellent and I'm looking forward to getting back to things electronic.

First job: review inbox. There are quite a few notifications of maint pages, some unmatched pm requests and a few items of non-automated mail.

pm requests first.

T
Thursday, September 02, 2004
 
Added support for vol type (ready/specialist) to the client api.

T
 
Spent the rest of yesterday testing and refining tomcat sites on w3.

w3 went live in the afternoon and seems in good health. Matt's rolling the config/pages out to the other machines.

Fixed admin system so we no longer generate reams and reams of SQL when syncing a central account.

T
Wednesday, September 01, 2004
 
Altered banner jsp so that the link to the G site is hidden if the user entered via partner login.

T
