[Gauss-parl] Gauss: implementation issues
Carsten Svaneborg
zqex at mpipks-dresden.mpg.de
Tue Mar 15 14:27:27 CET 2005
On Tuesday 15 March 2005 09:01, Roland Orre wrote:
> the script pleech.pl works well, but I'll probably change it somewhat.
This is fine, but let's make sure that we still have one script at
the end of the day, which also means that changes have to be
propagated to the HTML file format and the import.pl script.
> * First, what does the blacklist mean?
> I tried e.g. to download my own EP0520400, but then it says it
> is blacklisted, it is listed as
> /scratch/pat/people/wagner/EP520400.html
> in the blacklist.
> That patent is a pure mathematical/software solution.
Look at the structure in ~zqex/PatData/pat
If I do a search, typically 50-90% of the patents are already
downloaded somewhere else in the tree in prior searches. And I
don't want to waste bandwidth downloading them repeatedly,
just to have to bother to find and delete them.
This is solved by the blocklist. Its semantics are "patent has
already been downloaded".
It is not watertight, as I usually run 5-8 download processes in
parallel and only update the blocklist every once in a while, but
it is still a factor of 2-10 reduction in the number of HTTP connections.
Use -nbl to disable it.
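To illustrate the idea, here is a minimal sketch in Python (not the code in pleech.pl). It assumes the blocklist is a plain list of file paths, one per line, and that leading zeros in the number can be ignored when comparing, since EP0520400 shows up in the list as EP520400.html. The function names and the normalisation rule are my assumptions.

```python
import os
import re

def patent_key(name):
    """Reduce a path or patent number to a comparable key, e.g. 'EP520400'."""
    base = os.path.basename(name).split(".")[0]   # drop directory and '.html'
    m = re.match(r"([A-Za-z]+)0*(\d+)", base)     # country code, zeros, number
    if not m:
        return base.upper()
    return m.group(1).upper() + m.group(2)

def load_blocklist(lines):
    """Build the set of keys for patents that are already on disk."""
    return {patent_key(line.strip()) for line in lines if line.strip()}

def needs_download(patent, blocklist, use_blocklist=True):
    """use_blocklist=False plays the role of the -nbl switch."""
    if not use_blocklist:
        return True
    return patent_key(patent) not in blocklist

blocklist = load_blocklist(["/scratch/pat/people/wagner/EP520400.html"])
print(needs_download("EP0520400", blocklist))                       # False
print(needs_download("EP0520400", blocklist, use_blocklist=False))  # True
```

With several download processes running in parallel, each one would only see the blocklist as of its last reload, which is why duplicates still slip through.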
> * When I download the patent I see that the claims are not included,
That is a bug. The problem is that I have the DL scripts in three places,
and the copy on gauss.ffii.org is the least updated. They are now all up to date.
However, diff shows the only change was:
- if (/href="(textclaim[^"]*)"/i) { $urlCLAIM=$1; }
+ if (/href="(textclam[^"]*)"/i) { $urlCLAIM=$1; }
(correcting spelling mistakes without a thought is a bad thing)
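The effect of that one-letter change can be shown with the same pattern in Python. This is only a sketch: the HTML line and its query string are made up for the example, and only the "textclam" link name comes from the diff.

```python
import re

# Same pattern as the fixed Perl line, case-insensitive like the /i flag.
CLAIM_LINK = re.compile(r'href="(textclam[^"]*)"', re.IGNORECASE)
# The "spell-corrected" variant that silently stopped matching.
WRONG_LINK = re.compile(r'href="(textclaim[^"]*)"', re.IGNORECASE)

def find_claim_url(html, pattern=CLAIM_LINK):
    """Return the relative URL of the claims page, or None if no link matches."""
    m = pattern.search(html)
    return m.group(1) if m else None

# Hypothetical line from a patent page; the query string is invented.
line = '<a href="textclam?IDX=EP0520400">Claims</a>'

print(find_claim_url(line))              # textclam?IDX=EP0520400
print(find_claim_url(line, WRONG_LINK))  # None
```

Since a failed match just leaves the claims URL unset, the script carried on without complaint and the claims were simply missing.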
> I'll start with abstracts anyway, but intend to use descriptions
> and claims as well. It could be interesting to see what
> classifications on abstract, description, and claims may give.
For swpat probability maybe, but for classifying I think
using only the claims might produce the best result.
Ideally it should be possible to parse the claims and map them
onto some logical structure of relations.
Claimed is object X characterised by property Y. When X uses a
Z and X has a W, where Z may be Q and E, Q or E and so on.
In that case one could display the patent claims graphically as a
kind of flow diagram of relations between objects, their properties,
and groups of these.
This would require some high-powered natural language parsing.
But it would significantly simplify the process of reading claims.
(maybe we should apply for a patent for that idea?)
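Just to make the target structure concrete, here is a small Python sketch of the X/Y/Z example above as a list of relations. The relations are written by hand, so the hard part (the language parsing) is skipped entirely, and the "and/or" grouping of Q and E is not modelled. All names are invented for the illustration.

```python
from collections import defaultdict

# Hand-written (subject, relation, object) triples for the example claim.
relations = [
    ("X", "characterised by", "Y"),
    ("X", "uses", "Z"),
    ("X", "has", "W"),
    ("Z", "may be", "Q"),
    ("Z", "may be", "E"),
]

def build_graph(triples):
    """Group the outgoing relations by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)

def render(graph, root, indent=0, seen=None):
    """Return an indented text outline of everything reachable from root."""
    seen = set() if seen is None else seen
    if root in seen:          # guard against circular claims
        return []
    seen.add(root)
    lines = []
    for relation, obj in graph.get(root, []):
        lines.append("  " * indent + f"{root} --{relation}--> {obj}")
        lines.extend(render(graph, obj, indent + 1, seen))
    return lines

print("\n".join(render(build_graph(relations), "X")))
```

A graphical view would draw the same triples as labelled arrows between boxes.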
> * The tabular structure you have made is nice and clear, but do
> you mind if I convert it to an SGML/XML structure? That feels
> more future safe, as it will be easy to add information later
> without affecting the structure. I also want more structure
> e.g. for the claims later, to make them easier to handle
> for future similarity checks between different claims.
I like to do e.g. cat patentlist | xargs -n1 w3m -dump | less
which provides an easily readable ASCII version of the patent
that can be searched with less.
But XML is fine as long as it is as easy to parse.
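As a rough idea of what "easy to parse" could look like, here is a sketch using Python's standard xml.etree.ElementTree module. The element and attribute names are invented for the example; no such format has been agreed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: one <patent> with an abstract and numbered claims.
record = """<patent number="EP0520400">
  <abstract>Abstract text.</abstract>
  <claims>
    <claim n="1">First claim text.</claim>
    <claim n="2">Second claim text.</claim>
  </claims>
</patent>"""

def claims_of(xml_text):
    """Return the claims as a list of (number, text) pairs."""
    root = ET.fromstring(xml_text)
    return [(int(c.get("n")), c.text.strip()) for c in root.iter("claim")]

for number, text in claims_of(record):
    print(number, text)
```

New elements can be added later without breaking a reader like this, since it only looks at the tags it knows.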
> * Further on I will clean out e.g. the self reference
> "Description of EP0520400" as it is unnecessary to
> create keywords which will be used once.
No problem.
> * I hardly understand Perl at all, I have avoided it because
> the syntax is too obscure for me
This is true. Perl is pretty relaxed wrt. syntax. A true Perl
programmer would probably complain that my Perl programs
smell of C. ;*)
> but as I have understood it is rather "easy" to convert from
> Perl to Python, at least it is claimed so here :-)
I'm a newbie to Python.
> but I'll start to modify the Perl I think, that is most efficient.
Agreed.
> Do you have any hint about a http downloader in python similar
> to what is done in pleech.pl?
No. I implemented my own HTTP socket IO to be able to spoof a
browser exactly, and also control cookies, which is used in
psearch.pl. (the real reason was probably that I was too lazy to
read the manual for the perl HTTP lib)
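For reference, the Python standard library can cover the same two needs (browser-like headers and cookie handling) without hand-written socket code. A minimal sketch, assuming Python 3's urllib.request and http.cookiejar. The User-Agent string and the function names are placeholders, and this is not a port of pleech.pl or psearch.pl.

```python
import http.cookiejar
import urllib.request

# Placeholder headers; use whatever the browser being imitated really sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux i686) Gecko/20050301 Firefox/1.0",
    "Accept": "text/html",
}

def make_opener():
    """Create an opener that stores and resends cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def build_request(url):
    """Wrap a URL in a request carrying the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

def fetch(opener, url, timeout=30):
    """Download one page; cookies set by the server end up in the jar."""
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.read()

opener, jar = make_opener()
request = build_request("http://example.org/")   # nothing is sent yet
print(request.get_header("User-agent"))
```

Calling fetch(opener, url) repeatedly with the same opener keeps the session cookies, which is the part a search script would need.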
> You mean those parts written in "write only languages" :?)
> Is there a complete list of EPO patents somewhere?
> I didn't find any at esp@cenet.
What do you mean? Granted EPO patents?
--
Mvh. Carsten