script to copy all the files of an "Online Public File
ehj at ffii.org
Mon Jul 31 20:31:00 CEST 2006
I used to find nice and clean documents containing EN, DE and FR
versions of granted EP patents, but I can't find them anymore.
I think we only need to retrieve and save the claims of granted
patents in Gauss (if space is a problem).
Rufus Pollock wrote:
> I'm not sure you need js support. I've successfully used python
> mechanize module (port of the perl original) to manipulate js type links
> and forms.
> See <http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html>
> "Embedded script is messing up my web-scraping. What do I do?
> It is possible to embed script in HTML pages (sandwiched between
> <script> and </script> tags): ECMAScript, VBScript, or even Python.
> These scripts can do all sorts of
> things, including causing cookies to be set in a browser, submitting or
> filling in parts of forms in response to user actions, changing link
> colours as the mouse moves over a link, etc.
> If you come across this in a page you want to automate, you have four
> options. Here they are, roughly in order of simplicity.
> * Simply figure out what the embedded script is doing and emulate it
> in your Python code: for example, by manually adding cookies to your
> CookieJar instance, calling methods on HTMLForms, calling urlopen, etc.
> * Dump mechanize and ClientForm and automate a browser instead. For
> example use MS Internet Explorer via its COM automation interfaces,
> using the Python for Windows extensions, aka pywin32, aka win32all (e.g.
> simple function, pamie; pywin32 chapter from the O'Reilly book) or
> ctypes (example: may be out of date, since ctypes' COM support is still
> evolving). This kind of thing may also come in useful on Windows for
> cases where the automation API is lacking. pyphany is a binding to the
> epiphany web browser, allowing both plugins and automation code to be
> written in Python. XXX Mozilla automation & XPCOM / PyXPCOM, Konqueror &
> DCOP / KParts / PyKDE).
> * Get ambitious and automatically delegate the work to an
> support is "very alpha", though!)."
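The first option in the FAQ (emulate what the script does, e.g. by adding cookies to a CookieJar yourself) could look roughly like the sketch below. It uses the Python standard library (http.cookiejar, urllib.request) rather than mechanize. The cookie name "session_ok", its value, and the host example.org are made up for illustration; I have not checked what the EPO pages actually set.

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, domain, path="/"):
    """Build a session cookie like one a page script might have set.

    The name, value and domain are supplied by the caller; nothing here
    is specific to any real site.
    """
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def build_opener_with_cookie(cookie):
    """Return (opener, jar); the opener sends the cookie on matching requests."""
    jar = http.cookiejar.CookieJar()
    jar.set_cookie(cookie)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


# Hypothetical cookie that a script on the page would normally set.
cookie = make_cookie("session_ok", "1", "example.org")
opener, jar = build_opener_with_cookie(cookie)

# Check offline which Cookie header a request to that host would carry.
request = urllib.request.Request("http://example.org/some/file.pdf")
jar.add_cookie_header(request)
print(request.get_header("Cookie"))
```

After that, opener.open(url) fetches a page with the cookie attached, which is often enough when the script only exists to set it.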
> Benjamin Henrion wrote:
>> Hi Gauss,
>> I am looking for a way to copy all the PDFs docs linked from this page:
>> Unfortunately, the EPO website makes links like this:
>> <a href="#"
>> false;" onMouseOver="window.status='ODOCP'; return true"
>> onMouseOut="window.status=''; return true">Patent document cited
>> during the opposition procedure</a>
>> which is a disaster if you want Google to index the files.
>> Do you know a way to suck those files by scripting a browser that has JS
>> support? I don't want to write a specific script for that purpose...
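Before scripting a whole browser, it may be worth listing what those script-driven links actually call. Below is a minimal sketch with Python's html.parser that collects every <a href="#"> carrying an onclick handler. The onclick body in the sample (openDoc('ODOCP')) is invented for illustration, since the real handler is not visible in the snippet above.

```python
from html.parser import HTMLParser


class ScriptLinkCollector(HTMLParser):
    """Collect <a href="#"> links whose target is hidden in an onclick handler."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished links: {"onclick": ..., "text": ...}
        self._current = None  # link being read, or None outside such a link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        if attributes.get("href") == "#" and "onclick" in attributes:
            self._current = {"onclick": attributes["onclick"], "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse line breaks and repeated spaces in the link text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.links.append(self._current)
            self._current = None


def collect_script_links(html):
    """Return the script-driven links found in an HTML string."""
    collector = ScriptLinkCollector()
    collector.feed(html)
    return collector.links


# Sample markup; the onClick content is a hypothetical stand-in.
SAMPLE = """<a href="#"
onClick="openDoc('ODOCP'); return false;"
onMouseOver="window.status='ODOCP'; return true"
onMouseOut="window.status=''; return true">Patent document cited
during the opposition procedure</a>"""

for link in collect_script_links(SAMPLE):
    print(link["text"], "->", link["onclick"])
```

Once the handler arguments are known, the request the script makes can usually be rebuilt by hand and fetched directly.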
>> Benjamin Henrion <bhenrion at ffii.org>
>> FFII Brussels - +32-484-566109 - +32-2-4148403
>> Gauss-parl maillist
>> subscribe via http://aktiv.ffii.org/
>> fine-tune via http://lists.ffii.org/mailman/listinfo/gauss-parl