script to copy all the files of an "Online Public File Inspection" page?

Erik Josefsson ehj at ffii.org
Mon Jul 31 20:31:00 CEST 2006


I used to find nice and clean documents containing EN, DE and FR 
versions of granted EP patents, but I can't find them anymore.

If space is a problem, I think it is enough to retrieve and save only 
the claims of granted patents in Gauss.

//Erik

Rufus Pollock wrote:
> I'm not sure you need JS support. I've successfully used the Python 
> mechanize module (a port of the Perl original) to handle JavaScript-style 
> links and forms.
> 
> See <http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html>
> 
> "Embedded script is messing up my web-scraping. What do I do?
> 
> It is possible to embed script in HTML pages (sandwiched between 
> <SCRIPT>here</SCRIPT> tags, and in javascript: URLs) - JavaScript / 
> ECMAScript, VBScript, or even Python. These scripts can do all sorts of 
> things, including causing cookies to be set in a browser, submitting or 
> filling in parts of forms in response to user actions, changing link 
> colours as the mouse moves over a link, etc.
> 
> If you come across this in a page you want to automate, you have four 
> options. Here they are, roughly in order of simplicity.
> 
>     * Simply figure out what the embedded script is doing and emulate it 
> in your Python code: for example, by manually adding cookies to your 
> CookieJar instance, calling methods on HTMLForms, calling urlopen, etc.
>     * Dump mechanize and ClientForm and automate a browser instead. For 
> example, use MS Internet Explorer via its COM automation interfaces, 
> using the Python for Windows extensions, aka pywin32, aka win32all (e.g. 
> simple function, pamie; pywin32 chapter from the O'Reilly book) or 
> ctypes (example: may be out of date, since ctypes' COM support is still 
> evolving). This kind of thing may also come in useful on Windows for 
> cases where the automation API is lacking. pyphany is a binding to the 
> epiphany web browser, allowing both plugins and automation code to be 
> written in Python. (XXX Mozilla automation & XPCOM / PyXPCOM, Konqueror & 
> DCOP / KParts / PyKDE).
>     * Use Java's httpunit from Jython, since it knows some JavaScript.
>     * Get ambitious and automatically delegate the work to an 
> appropriate interpreter (Mozilla's JavaScript interpreter, for 
> instance). This approach is the one taken by DOMForm (the JavaScript 
> support is "very alpha", though!)."
> 
> ~rufus
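
A minimal sketch of the first FAQ option above (emulating the script in
Python), assuming that the page's NewWindow() call does nothing more than
open its first argument as a URL. The function name extract_pdf_urls and the
regular expression are illustrative only; they are not part of mechanize or
of the EPO site. The link markup in the sample is the one from Benjamin's
message.

```python
import re
from urllib.parse import urljoin

# Captures the first quoted argument of a NewWindow('...') call, as found in
# the onclick attributes of the page. Assumption: the argument is a URL path.
NEWWINDOW_RE = re.compile(r"NewWindow\(\s*'([^']+)'")


def extract_pdf_urls(html, base_url):
    """Return absolute URLs for every NewWindow('...') target in html."""
    urls = []
    for match in NEWWINDOW_RE.finditer(html):
        # Attribute values may carry HTML-escaped ampersands.
        relative = match.group(1).replace("&amp;", "&")
        urls.append(urljoin(base_url, relative))
    return urls


if __name__ == "__main__":
    page_url = ("http://ofi.epoline.org/view/GetDossier"
                "?dosnum=&pubnum=EP1146701&lang=EN#")
    sample = (
        '<a href="#" onclick="javascript:NewWindow('
        "'/view/BrowsePdfServlet?NPL=Y&objectId=EKSMD5Z21148J10&lang=EN',3);"
        'return false;">Patent document cited during the opposition '
        "procedure</a>"
    )
    for url in extract_pdf_urls(sample, page_url):
        print(url)
```

Each returned URL could then be fetched with urllib.request.urlopen. Whether
the server also expects a session cookie from the dossier page is not
something this thread establishes.
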
> 
> Benjamin Henrion wrote:
>> Hi Gauss,
>>
>> I am looking for a way to copy all the PDF documents linked from this page:
>>
>> http://ofi.epoline.org/view/GetDossier?dosnum=&pubnum=EP1146701&lang=EN#
>>
>> Unfortunately, the EPO website makes links like this:
>>
>> <a href="#" 
>> onclick="javascript:NewWindow('/view/BrowsePdfServlet?NPL=Y&objectId=EKSMD5Z21148J10&lang=EN',3);return 
>> false;" onMouseOver="window.status='ODOCP'; return true" 
>> onMouseOut="window.status=''; return true">Patent document cited 
>> during the opposition procedure</a>
>>
>> which makes it very hard for Google to index the files.
>>
>> Do you know a way to download those files by scripting a browser that has JS
>> support? I would rather not write a dedicated script just for that purpose...
>>
>> -- 
>> Benjamin Henrion <bhenrion at ffii.org>
>> FFII Brussels - +32-484-566109 - +32-2-4148403
>>
>>
>> _______________________________________________
>> Gauss-parl maillist
>> subscribe via http://aktiv.ffii.org/
>> fine-tune via http://lists.ffii.org/mailman/listinfo/gauss-parl



