[Gauss-parl] Pres from epip 2005 (+ comments)

Carsten Svaneborg zqex at mpipks-dresden.mpg.de
Mon Mar 14 02:44:28 CET 2005


On Sunday 13 March 2005 05:01, Roland Orre wrote:
> You may also be surprised about the front page describing CRM114 :)
> but the CRM114 concept is actually not so bad, as it is using higher
> order features, that is, scanning windows of different widths which
> can catch features like: "software", "software process", "software
> process and", "software process and method" etc.

One of the project proposals is a way to identify keywords for
patents, such that these can be used to identify related patents.
I was wondering precisely how to handle the problem of e.g.
"wireless networking" vs. "wireless".
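
To make the scanning-window idea concrete, here is a minimal sketch that
emits every contiguous word window up to a given width, so that both
"wireless" and "wireless networking" become features. This is only an
illustration of the description quoted above, not CRM114's actual feature
scheme; the function name window_features and the max_width parameter are
assumptions made for the example.

```python
def window_features(text, max_width=4):
    """Return all contiguous word windows of width 1..max_width."""
    words = text.lower().split()
    features = []
    for width in range(1, max_width + 1):
        # Slide a window of the current width across the word list.
        for start in range(len(words) - width + 1):
            features.append(" ".join(words[start:start + width]))
    return features


features = window_features("software process and method")
```

With windows of several widths, the longer phrase and the single word are
separate features, so a later step can weight them differently.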

Maybe the keywords could be tied into some hierarchical structure
(networking, networking/browsing, networking/wireless, networking/lan etc.)
which would provide an entirely new way of browsing and searching
patents, one that would closely match the way a software developer
would think.
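
As a rough sketch of what such a hierarchy could look like, the code
below indexes patents under slash-separated keyword paths, so that
browsing "networking" also finds everything filed under
"networking/wireless". The patent identifiers and the function name
build_index are hypothetical and only serve the example.

```python
from collections import defaultdict


def build_index(patent_keywords):
    """Map every keyword path and each of its ancestors to patent ids."""
    index = defaultdict(set)
    for patent_id, paths in patent_keywords.items():
        for path in paths:
            parts = path.split("/")
            # Register the patent under each prefix of the path.
            for depth in range(1, len(parts) + 1):
                index["/".join(parts[:depth])].add(patent_id)
    return index


# Hypothetical patents with hypothetical keyword paths.
patents = {
    "patent-A": ["networking/wireless"],
    "patent-B": ["networking/lan"],
    "patent-C": ["networking/browsing"],
}
index = build_index(patents)
```

Looking up index["networking"] then returns all three patents, while
index["networking/wireless"] returns only the first.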


> As I understand it, the actual implementation is not yet using a
> Bayesian classifier, but as I find it very hard to understand
> how this can be done without a Bayesian classifier

The majority of patents have been found by making searches
for software-related keywords. So some refining is in all
likelihood required.

> (which I'm also a kind of expert in, have worked with
> Bayesian classification methods for the last 10 years,
> my thesis is about classification as well) I presented
> the only solution that I could really understand.

Great! I would be interested in reading it. I have read various
reviews on Bayesian statistics and E. T. Jaynes's book, but mostly
focusing on hypothesis testing and data analysis and not classification.


> I have several suggestions about extensions of the project,
> where I can get involved with the methods I'm currently
> working on, e.g.
> * similarity measures between different patents
> * clustering of patents

This is precisely the goal of the project proposal.
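
One simple similarity measure, assuming each patent has already been
reduced to a set of keywords, is the Jaccard index: the number of shared
keywords divided by the number of distinct keywords in both patents. This
is just one possible choice for illustration, not the measure used in the
proposal or in Roland's papers; the function name jaccard is an
assumption.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard similarity of two keyword collections, in [0, 1]."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        # Two empty keyword sets: define the similarity as 0.
        return 0.0
    return len(a & b) / len(a | b)


score = jaccard({"wireless", "networking"}, {"wireless", "lan"})
```

A pairwise matrix of such scores could then be fed to a standard
clustering method.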

> (both relating to a recent project I've been working on, can
>  provide papers, three of my latest papers are highly relevant
>  for this:
I would be interested in reading them.
(the ones most relevant, I already have stacks of unread articles)

> I also consider it natural to extend the outcome of the
> classification with multiple classes, and we should actually
> copy the whole EPO data base to be able to run this on all
> patents, and later also to the USPTO I think.

I'm not sure about this, since I expect the non-swpat found
by matching keywords to be clustered in "patent space". So
the spam for the Bayesian classifier would be "close" to the ham.

Secondly, there is a practical constraint: the DB is currently 7 GB.

> (my speciality is fast data mining of huge data sets, so this
> would be a perfect challenge for me).

Great!!

> One of my future ideas is to transform patent descriptions
> to lambda calculus, which of course is quite a hard problem,
> but then we would be able to compare the actual functionality
> of a patent description

How would you map patent claims into functions?

> (relating to the Dilbert strip on
> my "ThankYou" page) and to a comment relating to skidbladnir
> http://www.gnu.org/brave-gnu-world//issue-49.en.html
> that is, when they had investigated 40000 patents they found
> 40 different solutions to technical problems. I think we
> could be able to do something similar for software patents.

Sounds interesting.

-- 
  Mvh. Carsten




