[Gauss-parl] ocr etc

Carsten Svaneborg zqex at mpipks-dresden.mpg.de
Sun Mar 27 17:53:43 CEST 2005


On Saturday 26 March 2005 01:41, Erik Josefsson wrote:
> A lot of /Comments shows that we have a number of patents marked as
> non-swpat. What is the plan for them? Just deletion or feeding
> softassassin with false positives?

Deletion means removing the patent from the DB and moving its HTML/XML
file into a special subdirectory, where it can be used as spam for
Bayesian filtering (once the delete function and the Bayesian filtering
are both implemented).
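
A rough sketch of what such a delete step could look like, only to make
the idea concrete. Nothing here is taken from the gauss code: the table
name `patents`, the `id` column, the one-file-per-patent layout and the
`non_swpat` directory name are all assumptions.

```python
import shutil
import sqlite3
from pathlib import Path


def delete_patent(conn, pages_dir, patent_id, spam_subdir="non_swpat"):
    """Remove a patent from the DB and keep its page files for filter training.

    Hypothetical schema and layout: a table `patents` keyed by `id`, and
    page files named <patent_id>.html / <patent_id>.xml in `pages_dir`.
    """
    pages_dir = Path(pages_dir)
    spam_dir = pages_dir / spam_subdir
    spam_dir.mkdir(exist_ok=True)

    moved = []
    # Move the HTML/XML files aside instead of deleting them, so they
    # remain available as negative examples for a Bayesian filter.
    for suffix in (".html", ".xml"):
        src = pages_dir / (patent_id + suffix)
        if src.exists():
            dst = spam_dir / src.name
            shutil.move(str(src), str(dst))
            moved.append(dst)

    # Remove the DB row; the connection context manager commits on success.
    with conn:
        conn.execute("DELETE FROM patents WHERE id = ?", (patent_id,))
    return moved
```

Moving the files before touching the DB means a failed move leaves the
patent still listed, which seems the safer failure mode.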

> so I suggested (volunteers) upload OCR stuff to gauss wikipages:
>     http://gauss.ffii.org/PatentView/EP1228451/OCR
> but 1) it seems you can't make subpages to patent pages and 2) it might
> be a very stupid idea (OCR not in db).

What is the problem? Each patent page already has a link to the original
PDF file, which presumably is the source for the OCR. I don't see that we
gain anything by having the OCR text.

Currently only the _Comments subpages are left to the wiki; everything
else is handled by the virtual page script. If you want more subpages,
let me know their names (and why).
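
To illustrate the split between wiki and virtual page script, here is a
minimal sketch of the routing decision. It is not the actual script: the
function name `route`, the `WIKI_SUBPAGES` set, and the assumption that a
subpage is a separate path segment after the patent number are all
hypothetical.

```python
# Assumption: subpage names the wiki is allowed to serve itself.
# Only _Comments is mentioned in this mail; more names would be added here.
WIKI_SUBPAGES = {"_Comments"}


def route(path):
    """Return "wiki" or "virtual" for a request path under /PatentView/."""
    parts = [p for p in path.split("/") if p]
    # Expected shape (assumed): PatentView / <patent number> / <subpage>
    if len(parts) >= 3 and parts[0] == "PatentView" and parts[2] in WIKI_SUBPAGES:
        return "wiki"
    # Everything else goes to the virtual page script.
    return "virtual"
```

With a scheme like this, supporting an extra subpage such as OCR would
mean adding its name to the set, which is why the names are needed.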

> Third, google does not find the virtual page
>     http://gauss.ffii.org/PatentView/EP1228451

This, presumably, is a question of time.

Google returns 68,100 page hits on gauss, which is roughly half the number
of patent pages.

-- 
  Mvh. Carsten



