[Debtags-devel] Re: Recent progress

Peter Rockai (mornfall) mornfall@kalyxo.org
Thu, 24 Mar 2005 11:58:22 +0000


On Tuesday 08 March 2005 12:15, Hervé Eychenne wrote:
> Here is what comes to my mind without thinking too much about it.
>
> The user types keywords. First, these keywords are associated to tags.
> If a keyword _is_ a tag name, give a maximum score to this tag.
> If it's a tag synonym, give it a slightly inferior score.
> Now look at the tag descriptions, and deduce possible tags from the
> keyword, giving the inferred tags an inferior score.
> Compute the list of packages with these tags, and give a score to
> the packages, with a score per package that reflects the score
> of its tags. You now have a list of packages, sorted by score
> (possible adequacy to the keyword list, nothing will ever be perfect).
>
> You also can combine this with a full text search in package descriptions
> with the same scoring strategy, and combine the results.
I think I suggested this a while back, so yes, I agree that a search like
this would make sense. It in effect turns tags into hyperlinks and the whole
data set into a semantic web, and it gives us a way to do relevance-based
keyword search on the Debian archive. I also suggested using fuzzy tags to
facilitate this further (in other words, add scores to the tag/package
relations via a Bayesian tagger or something like that, and then use that
score when scoring a match). This _may_ be a bit off, but it may still be
worth trying: I don't believe fully manual tagging of the archive is
feasible or even sustainable. And the Bayesian tagger can use much more
input data than it does today to get better precision out of it. It is not
going to be perfect either way. But Google is freaking far from perfect
too, and it is still probably the single most useful web search tool.
>
>  Hervé
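
To make the scoring idea above concrete, here is a rough sketch in Python.
Everything in it is an assumption made for illustration: the score levels
(1.0 for a tag name, 0.8 for a synonym, 0.4 for a description match), the
sample tags and packages, and the function names are all hypothetical and
are not taken from debtags. The per-relation weights stand in for the
"fuzzy tag" scores a Bayesian tagger might produce.

```python
# Illustrative sketch only: names, data and score constants are hypothetical.

EXACT, SYNONYM, INFERRED = 1.0, 0.8, 0.4  # assumed score levels

# Hypothetical tag vocabulary: name -> synonyms and a short description.
TAGS = {
    "mail": {"synonyms": {"email"}, "description": "Electronic mail handling"},
    "editor": {"synonyms": set(), "description": "Text editing program"},
}

# Hypothetical fuzzy tag/package relations: package -> {tag: weight in [0, 1]}.
PACKAGE_TAGS = {
    "mutt": {"mail": 1.0},
    "emacs": {"editor": 1.0, "mail": 0.3},
    "vim": {"editor": 1.0},
}


def keyword_tag_scores(keywords, tags):
    """Map each matching tag to the best score any keyword gives it."""
    scores = {}
    for kw in (k.lower() for k in keywords):
        for name, info in tags.items():
            if kw == name:
                score = EXACT          # keyword is the tag name
            elif kw in info["synonyms"]:
                score = SYNONYM        # keyword is a synonym of the tag
            elif kw in info["description"].lower().split():
                score = INFERRED       # keyword appears in the description
            else:
                continue
            scores[name] = max(scores.get(name, 0.0), score)
    return scores


def rank_packages(keywords, tags, package_tags):
    """Return (package, score) pairs, best first, ties broken by name."""
    tag_scores = keyword_tag_scores(keywords, tags)
    ranked = []
    for pkg, weights in package_tags.items():
        # A package's score is the sum of tag score times relation weight.
        total = sum(tag_scores.get(t, 0.0) * w for t, w in weights.items())
        if total > 0:
            ranked.append((pkg, total))
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for pkg, score in rank_packages(["email"], TAGS, PACKAGE_TAGS):
        print(pkg, round(score, 2))
```

A full-text search over package descriptions could feed the same ranking
by adding its own per-package score to the total before sorting; how to
weight the two sources against each other is left open here.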

-- 
Peter Rockai | mornfall()kalyxo!org | prockai()redhat!com | +421907533216
  http://blog.mornfall.net

"In My Egotistical Opinion, most people's C programs should be
 indented six feet downward and covered with dirt."
     -- Blair P. Houghton on the subject of C program indentation