[Gauss-parl] ocr etc
zqex at mpipks-dresden.mpg.de
Sun Mar 27 17:53:43 CEST 2005
On Saturday 26 March 2005 01:41, Erik Josefsson wrote:
> A lot of /Comments shows that we have a number of patents marked as
> non-swpat. What is the plan for them? Just deletion or feeding
> softassassin with false positives?
Deletion means removing the entry from the DB and moving the HTML/XML file
into a special subdirectory, where it can be used as spam for Bayesian filtering.
(when the delete function is implemented, and when Bayesian filtering is
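
To make the intended deletion step concrete, here is a minimal sketch and not the actual gauss code. It assumes a SQLite DB with a hypothetical `patents` table keyed by `patent_id`, and page files named `<patent_id>.html` / `<patent_id>.xml`; none of these names come from the real system.

```python
import shutil
import sqlite3
from pathlib import Path


def delete_patent(conn, patent_id, page_dir, spam_dir):
    """Drop a patent from the DB and move its page files aside.

    Assumptions (hypothetical, for illustration only):
    - table `patents` with a `patent_id` column
    - page files live in `page_dir` as <patent_id>.html / <patent_id>.xml
    - `spam_dir` is the subdirectory later used as the spam corpus
    """
    page_dir, spam_dir = Path(page_dir), Path(spam_dir)
    spam_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for suffix in (".html", ".xml"):
        src = page_dir / f"{patent_id}{suffix}"
        if src.exists():
            dst = spam_dir / src.name
            # Keep the file: it becomes training material for the filter.
            shutil.move(str(src), str(dst))
            moved.append(dst)

    # `with conn` commits on success and rolls back on error.
    with conn:
        conn.execute("DELETE FROM patents WHERE patent_id = ?", (patent_id,))
    return moved
```

The files are moved before the row is deleted, so a failed move leaves the DB entry in place and the patent can simply be retried.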
> so I suggested (volunteers) upload OCR stuff to gauss wikipages:
> but 1) it seems you can't make subpages to patent pages and 2) it might
> be a very stupid idea (OCR not in db).
What is the problem? Each patent page already has a link to the original
PDF file, which presumably is the source for the OCR. I don't see that we
gain anything by having the OCR text.
Currently only _Comments subpages are left to the wiki, everything else
is handled by the virtual page script, if you want more subpages let me
know their names (and why).
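
The split between the wiki and the virtual page script could be pictured roughly as follows. This is only an illustrative sketch: the assumption that a subpage is recognised by a name suffix, and the names `WIKI_SUBPAGE_SUFFIXES` and `handler_for`, are hypothetical and not taken from the actual script.

```python
# Subpage suffixes that are left to the wiki; per the message above,
# currently only _Comments. Further names would be added here.
WIKI_SUBPAGE_SUFFIXES = ("_Comments",)


def handler_for(page_name):
    """Return 'wiki' for wiki-managed subpages, 'virtual' for everything else."""
    if page_name.endswith(WIKI_SUBPAGE_SUFFIXES):
        return "wiki"
    return "virtual"
```

Under this scheme, adding a new wiki-managed subpage would amount to adding one more suffix to the tuple.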
> Third, google does not find the virtual page
This, presumably, is a question of time.
Google returns 68,100 page hits on gauss, which is roughly half the number
of patent pages.