Error from automatic update
zqex at mpipks-dresden.mpg.de
Mon Feb 13 15:52:30 CET 2006
On Monday 13 February 2006 14:56, you wrote:
> I bet they would be faster if we broke out the text from the Patents
> table and changed the boolean filtering.
Hmm. There are many things that could be done, and I have absolutely
no idea what gives the most bang for the buck. I wish we had a lot of metrics
about what fails and what makes it fail, because I worry about optimising
in the dark.
I agree that the DB does indeed need to be redesigned.
The adddelupdate process also needs to be redesigned.
Currently the adddelupdate process downloads each patent and then imports it.
It is started every weekend and then runs for weeks. The download step
can require a handful of HTTP connects per patent to fetch the bibliography,
claims, abstract and epoline data, which makes it slow.
It should be redesigned so that the patents are downloaded in one step, and
the newly downloaded patents are then imported into the DB during the night.
E.g. rather than scheduling updates every weekend, and then running
the script to completion, there should be an update bot downloading patents
on a daily basis (possibly on a different server), and an import bot that does
the import running at night. That way we can precisely control the number of
simultaneous download and import processes and when they are running.
That would lead to zero maintenance load during the daytime.
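The split described above could be sketched roughly as follows. This is only a sketch of the idea; `fetch_patent` and `import_patent` are hypothetical placeholders, not the real adddelupdate code:

```python
def fetch_patent(patent_id):
    """Hypothetical: do all HTTP connects (bibliography, claims,
    abstract, epoline data) for one patent in a single step."""
    return {"id": patent_id, "fetched": True}

def import_patent(record, db):
    """Hypothetical: import one already-downloaded patent into the DB."""
    db.append(record)

def download_bot(pending_ids, spool):
    # Runs daily, possibly on a different server: downloads only,
    # appending finished records to a spool for later import.
    for pid in pending_ids:
        spool.append(fetch_patent(pid))

def import_bot(spool, db):
    # Runs at night: imports whatever the download bot has spooled,
    # without doing any network traffic itself.
    while spool:
        import_patent(spool.pop(0), db)
```

Because the two bots only share the spool, how many of each run simultaneously, and when, can be controlled independently.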
Secondly, rather than being stored in the format of an HTML table, patents
should be stored in XML. This is a bit problematic since there are HTML tags
embedded in the text, so perhaps some more care needs to be taken when
extracting data while parsing the espacenet/epoline data.
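One way to keep the XML well-formed despite the embedded HTML is simply to escape the markup inside the text nodes, so it can be recovered unchanged when the record is read back. A minimal sketch (the element names here are illustrative, not a proposed schema):

```python
import xml.etree.ElementTree as ET

def patent_to_xml(patent_id, abstract_html):
    # ElementTree escapes the embedded HTML (<, >, &) on serialisation,
    # so the markup survives as plain character data inside the XML.
    root = ET.Element("patent", id=patent_id)
    ET.SubElement(root, "abstract").text = abstract_html
    return ET.tostring(root, encoding="unicode")

def abstract_from_xml(xml_text):
    # Parsing the XML back unescapes the entities again, returning
    # the original text with its embedded HTML tags intact.
    return ET.fromstring(xml_text).find("abstract").text
```

The round trip is lossless: the HTML tags go in as text, come out as text, and never confuse the XML parser.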
> separate machines, such as one logging changes on EPO and one fetching
> patents when needed to update the gauss-patent-cache. etc.
OK, I am only seeing this now. But I guess we agree on that. ;*)