faster now?

Jonas Bosson jonas at illuminet.se
Mon Feb 20 16:11:12 CET 2006


On Mon, 20 Feb 2006, Carsten Svaneborg wrote:

> On Sunday 19 February 2006 18:37, Carsten Svaneborg wrote:
> > If I can find the time, i will chop the adddelupdate script to pieces,
> 
> I have now more or less done this. It should be about as functional
> as the old script. But with more flexibility for future load optimization.

Great news. The new index also seems to speed up the wiki requests 
considerably. The old primary key for the patents table was a combination 
of state and idx; adding an index on just idx seems to have made things 
easier on PostgreSQL.
 
> There are now the following tables/bots:
> 
> bot_scheduleupdate.pl analyses the DB and finds patents, which
> are added to the update queue.  (run daily)
> 
> Add/update  queues, handled by bot_addupdate.pl. They contain
> idx, userID based on the wiki. It basically shifts patents into the
> Download queue.
> 
> More detail: Patents to be updated are moved to the Download queue
> (idx,filename). Patents in the add queue are discarded if they are
> already known, otherwise a filename based on userid is generated,
> and they are added to the download queue.
> 
> Download queue bot_download.pl  downloads patent html files from
> espacenet, and schedules import, which takes quite a bit of time, but
> has no DB overhead.
> 
> Import queue bot_import.pl  imports patents into the DB, which is 
> expensive in terms of DB overhead.
> 
> Del queue for patents to be deleted, handled by bot_delete.pl.
> This is also expensive for the DB.
> 
> Now the DB intensive scripts can be restricted to running at night,
> and they can be modified to terminate automatically if the CPU load
> becomes too high.
> 
> It should also be possible to run 2 or 3 bot_download scripts in
> parallel, since they just make HTTP connections to espacenet. Sometimes
> Espacenet is not available due to maintenance of their DB, and in that
> case the bot_download script just terminates.
> 
> Now, I just have to fix the crontab such that they are run on a daily basis,
> and automatically terminate.
> 


Cool. Erik and I discussed last week how to create a soft ranking, where 
we don't actually do deletes, but rather down-rank or up-rank patents from 
different statistical perspectives or user views on them. Hence the 
suggestions I made for patentsclass tables some time ago without further 
explanations.

This also means that we should decouple the patent-file handling from the 
add/delete lists and use a generic patent-cache when the contents of the 
patents are required. That means that less frequently requested patents 
will be dropped on a garbage collection basis rather than in a 
black-and-white manner.
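
A minimal sketch of what such a garbage collection could look like, assuming 
the cache is a flat directory of files and that "less requested" is 
approximated by the file's last access time. The function name 
collect_garbage and the max_bytes limit are hypothetical, not part of the 
current code, and access times are not reliable on filesystems mounted with 
noatime.

```python
import os


def collect_garbage(cache_dir, max_bytes):
    """Delete least recently accessed cache files until the cache fits.

    Returns the list of removed paths, oldest access first.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            st = os.stat(path)
            # (access time, size, path): sorting puts the stalest file first
            entries.append((st.st_atime, st.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

Run from cron, this would drop the patents nobody has looked at for the 
longest time first, instead of deleting by an explicit list.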

I think that one way to do this from the current state is to find a single 
minimal point of access and then expand to use the same method system wide. 

In this case view.py could load from a gz-xml-patent-file and, if none is 
available, create one from the Patents table (and if the patent is not 
there either, put it in the download queue).
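
The lookup order above could be sketched roughly like this. The names 
load_patent_xml, fetch_from_db and enqueue_download are hypothetical 
placeholders: the two callables stand in for whatever reads the Patents 
table and whatever adds an idx to the download queue, and the 
"<idx>.xml.gz" file naming is an assumption.

```python
import gzip
import os


def load_patent_xml(idx, cache_dir, fetch_from_db, enqueue_download):
    """Return the patent XML as a string, or None if it had to be queued.

    fetch_from_db(idx) should return the XML string or None if unknown.
    enqueue_download(idx) should put idx in the download queue.
    """
    path = os.path.join(cache_dir, "%s.xml.gz" % idx)

    # 1. Cache hit: read the gzipped XML file.
    if os.path.exists(path):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            return fh.read()

    # 2. Cache miss: rebuild the file from the Patents table.
    xml = fetch_from_db(idx)
    if xml is not None:
        os.makedirs(cache_dir, exist_ok=True)
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            fh.write(xml)
        return xml

    # 3. Unknown patent: schedule a download and report nothing yet.
    enqueue_download(idx)
    return None
```

With this as the single point of access, the cache files can be deleted at 
any time, since they are recreated on the next request.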

I have started to collect/prototype py-code to do this. 

Suggestions/critique are appreciated. 

/jonas



