Using the new editor

Olly Betts olly at survex.com
Sat Nov 25 00:06:17 CET 2006


On 2006-11-18, Enrico Zini <enrico at enricozini.org> wrote:
> On Sat, Nov 18, 2006 at 02:07:19AM +0100, Erich Schubert wrote:
>> What algorithm are you using for calculating the suggestions?
>
> I'm asking xapian to give me a list of packages similar to the one
> being edited, then I take their tags, sorted by how many times they
> appear in the list.

You might do better to use Xapian's built-in relevance feedback feature.
This does much the same thing as you're doing by hand, but it also
factors in the overall frequency of each tag, so it prefers tags which
are common in the similar packages but not so common overall.  I think
that will tend to give better suggestions.  The formulae used are
derived from Bayes' theorem (of conditional probabilities).
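
To see why the overall frequency matters, here is a toy sketch in plain
C++ (no Xapian needed).  It scores tags with a Robertson/Sparck Jones
style relevance weight multiplied by r, which is one well-known
Bayes-derived weighting; treat it as an illustration and not as exactly
what Xapian computes internally.  The TagStats struct, the function
names and all the counts are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical per-tag counts; the struct and field names are invented
// for this sketch and are not part of Xapian.
struct TagStats {
    std::string tag;
    double r;  // similar packages (out of R) which carry the tag
    double n;  // packages in the whole collection (out of N) which carry it
};

// Robertson/Sparck Jones style relevance weight, scaled by r.
// Assumed formula for illustration, not taken from Xapian's source.
double feedback_score(const TagStats &s, double R, double N) {
    assert(s.r <= R && s.r <= s.n && s.n <= N && R - s.r <= N - s.n);
    double num = (s.r + 0.5) * (N - s.n - R + s.r + 0.5);
    double den = (s.n - s.r + 0.5) * (R - s.r + 0.5);
    return s.r * std::log(num / den);
}

// Return the tag names ordered best suggestion first.
std::vector<std::string> rank_tags(std::vector<TagStats> stats,
                                   double R, double N) {
    std::sort(stats.begin(), stats.end(),
              [R, N](const TagStats &a, const TagStats &b) {
                  return feedback_score(a, R, N) > feedback_score(b, R, N);
              });
    std::vector<std::string> names;
    for (const TagStats &s : stats) names.push_back(s.tag);
    return names;
}

// Made-up example: 5 similar packages out of 10000 packages overall.
std::vector<std::string> demo_ranking() {
    return rank_tags({{"role::program", 5.0, 6000.0},
                      {"use::searching", 4.0, 50.0}},
                     5.0, 10000.0);
}
```

With these made-up counts, role::program appears in all 5 similar
packages and would win a plain count, but use::searching comes out on
top here because it is rare in the collection as a whole.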

After you've got your MSet, and assuming the terms for tags have prefix
XTAG, you want something like this (if you're using C++):

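    #include <iostream>
    #include <string>
    #include <xapian.h>

    // ExpandDecider subclass which only picks Debtags terms.
    class TagEDecider : public Xapian::ExpandDecider {
      public:
        // Must be const to override ExpandDecider::operator().
        bool operator()(const std::string &t) const {
            return t.size() > 4 && t.compare(0, 4, "XTAG") == 0;
        }
    };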

    // And then after you've got your MSet...

    // Add (up to) 5 best matching documents to the rset (the set
    // of relevant documents).
    Xapian::RSet rset;
    for (Xapian::doccount i = 0; i < mset.size() && i < 5; ++i) {
        rset.add_document(*mset[i]);
    }

    // Use relevance feedback to suggest up to 10 tags.
    TagEDecider tagedecider;
    Xapian::ESet eset = enquire.get_eset(10, rset, &tagedecider);
    for (Xapian::ESetIterator i = eset.begin(); i != eset.end(); ++i) {
        // Strip the 4 character "XTAG" prefix before printing.
        std::cout << "Suggest tag " << (*i).substr(4) << std::endl;
    }

The ESet is ordered with the best suggestions first.  By default, any
terms which were already in the query won't be included.

Obviously you can add more or fewer than 5 documents to the RSet and ask
for more or fewer than 10 suggested tags.

Cheers,
    Olly




More information about the Debtags-devel mailing list