Questions regarding "Smart Search" and Tagging via the Webinterface

Thu Apr 9 21:24:44 UTC 2009

On Thu, Apr 09, 2009 at 02:21:25PM +0200, Benjamin Mesing wrote:

> > In 'debtags smartsearch', every time you type keywords you're just
> > generating a new set of tags to choose.  The search results depend on
> > what tags you have chosen.
> That's what my question was about: So depending on the tags
> you've chosen until then, the full text search searches only the
> packages in the result set for those tags and then determines the most
> significant tags for the search result. Right?

No, it's sillier than that.  The packages that you see are the packages
that have the tags that you have selected.  The keywords that you enter
do not influence the list of packages that you will obtain, but only the
list of tags that you can choose.

"debtags smartsearch" was an experiment on using keywords only to find
tags to be prompted to the user.  The tags are in turn used to select
packages, and are the only thing that actually selects packages.  This is
not the best way to search for packages, and smartsearch is in fact more
of an experiment in ways to turn a keyword search into a list of tags.
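
To make the selection step concrete, here is a minimal sketch of "the
chosen tags are the only thing that selects packages".  The sample data
and the select_packages name are made up for illustration and are not
taken from the debtags code:

```python
# Hypothetical sample: package name -> set of tags (not the real database).
PACKAGES = {
    "mutt": {"role::program", "works-with::mail", "interface::text-mode"},
    "thunderbird": {"role::program", "works-with::mail", "interface::x11"},
    "vim": {"role::program", "use::editing", "interface::text-mode"},
}


def select_packages(packages, chosen_tags):
    """Return the sorted names of the packages that carry every chosen tag.

    Keywords play no part here: only the chosen tags filter the packages.
    """
    chosen = set(chosen_tags)
    return sorted(name for name, tags in packages.items() if chosen <= tags)


if __name__ == "__main__":
    print(select_packages(PACKAGES, {"works-with::mail"}))
    print(select_packages(PACKAGES, {"works-with::mail", "interface::text-mode"}))
```

Each extra tag can only narrow the list, which is why the keywords end
up mattering only for which tags get offered.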

It actually shows two lists of tags: the "relevant" and the
"discriminant" tags.  The "relevant" tags are as described above: the
most frequent tags of the packages selected by the keyword search.
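
A rough sketch of the "relevant" list, assuming the keyword search has
already returned the tag sets of the matching packages.  The
relevant_tags name, the limit parameter and the alphabetical
tie-breaking are assumptions made for this example:

```python
from collections import Counter


def relevant_tags(matched_tagsets, limit=5):
    """Return the most frequent tags among the matched packages.

    matched_tagsets: one set of tags per package found by the keyword search.
    Ties are broken alphabetically, only to keep the output deterministic.
    """
    counts = Counter(tag for tags in matched_tagsets for tag in tags)
    ranked = sorted(counts, key=lambda tag: (-counts[tag], tag))
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical result of a keyword search for something like "mail".
    matched = [
        {"works-with::mail", "interface::text-mode"},
        {"works-with::mail", "interface::x11"},
        {"works-with::mail", "interface::x11", "protocol::imap"},
    ]
    print(relevant_tags(matched, limit=2))
```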

The "discriminant" tags are those tags that select about 50% of the
packages (their score is proportional to
abs(50-[number of packages selected])).
The idea is that given a list of the top "discriminant" tags, you can
quickly shorten the result set just by a sequence of choices like "want
this"/"don't want this".  Those tags end up being particularly
significant, and the result is remarkably clever given such a simple
scoring system.
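
A minimal sketch of that scoring, under two assumptions that the
description above does not spell out: the count is taken as a percentage
of the current result set, and the tags closest to 50% are listed first.
The discriminant_tags name is made up for this example:

```python
from collections import Counter


def discriminant_tags(tagsets, limit=5):
    """Rank tags by how close they come to selecting half of the packages.

    tagsets: one set of tags per package in the current result set.
    The distance abs(50 - percentage selected) is the quantity from the
    text; ranking the smallest distance first is an assumption.
    """
    total = len(tagsets)
    if total == 0:
        return []
    counts = Counter(tag for tags in tagsets for tag in tags)

    def distance(tag):
        return abs(50.0 - 100.0 * counts[tag] / total)

    # Alphabetical tie-breaking keeps the output deterministic.
    return sorted(counts, key=lambda tag: (distance(tag), tag))[:limit]


if __name__ == "__main__":
    result_set = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"a"}]
    # "a" is on every package, so choosing it would not narrow anything.
    print(discriminant_tags(result_set))
```

A tag carried by every package, or by almost none, tells you little when
you answer "want this"/"don't want this", which is why it ends up last.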

> > The same mechanism is used in the tag editor, when you pick the
> > Available tags / Search function. 
> I see, so even there the search for available tags relies on a full-text
> packagesearch. I wasn't aware of that.

It's not obvious, is it?  I've always considered that a hint that
the algorithm works well: it gives good results and one can't figure
out how it got there; that's probably good enough to call it 'smart'.

> > >  2. For the tag-editor (web), how are the suggested tags computed?
> > >     By AI-methods?
> > Same as the smart search.  Specifically, it uses Xapian: first it does a
> > full text search on the packages, 
> I don't understand this. What is the search term for the full text
> search? We have the package name and the package description of the
> package being tagged, both would often lead to an empty result set.

Sorry, I misread your question.  The suggested tags are indeed computed
like in the smart search, but the keywords used for the full text search
are *all the keywords in the description of the current package, ORed
together*.

Xapian is smart enough to give best matches first, so even if we OR
together a lot of terms we still get good results.  Also, it's fast
enough that it manages to compute the result on the fly without any
issue.
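
To illustrate why ORing many terms can still put good matches first,
here is a pure-Python stand-in.  It does not use Xapian and its weighting
is far cruder than what Xapian does; the tokenize and or_search names
and the sample descriptions are assumptions made for this sketch:

```python
import math
import re


def tokenize(text):
    """Split text into a set of lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def or_search(descriptions, query_text, limit=3):
    """Rank packages sharing at least one term with query_text (an OR query).

    Rare terms weigh more than common ones, so packages that share the
    more specific words of the query come out first.
    """
    docs = {name: tokenize(text) for name, text in descriptions.items()}
    query = tokenize(query_text)
    total = len(docs)
    # Document frequency of each query term.
    df = {t: sum(1 for words in docs.values() if t in words) for t in query}
    scored = []
    for name, words in docs.items():
        score = sum(math.log(1 + total / df[t]) for t in query & words)
        if score > 0:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


if __name__ == "__main__":
    descriptions = {
        "mutt": "text based mail reader",
        "thunderbird": "graphical mail and news client",
        "vim": "text editor",
    }
    # The "query" is simply the description of the package being tagged.
    print(or_search(descriptions, "small text mail client"))
```

The tags of the top-ranked packages would then be fed to the same
"relevant tags" counting used by the smart search.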

> Btw. are there any i18n efforts for the vocabulary under way?

People have wanted to do something about it, but I'm not aware of any
effort that has actually started.

It wouldn't take much to turn the vocabulary into a potfile; how to use
the potfile is a different issue, and I don't know enough of gettext to
work out all the details by myself, so we really need someone to take
care of pushing for this if we want to see it happening.

Ciao,

Enrico

-- 
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico at debian.org>