[Debtags-devel] "tagcoll findspecials"

Enrico Zini enrico@enricozini.org
Fri, 20 May 2005 15:23:00 +0200


Hello,

I'm doing some research on finding ways of automatically creating a list
of "toplevel facets", defined as the minimum set of facets that one
could display as starting points for a search and still be able to find
all packages from there.

In this quest, I implemented a 'facetcoll' function in 'debtags', that
can be used to generate a tagged collection in which every package is
tagged with just the facets of its tags.  For example, debtags
(implemented-in::c++, interface::commandline, suite::debian,
use::searching) would become tagged with just "implemented-in,
interface, suite, use".
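
As a rough illustration of the facet-only transformation described
above, here is a small awk sketch wrapped in a shell function.  It is not
the actual 'debtags facetcoll' code: the name facets_only is made up for
this example, and the "package: tag, tag" input layout is an assumption.

```shell
# Sketch only: approximates what 'debtags facetcoll' is described as doing.
# Assumes input lines shaped like "package: facet::tag, facet::tag, ...".
facets_only() {
  awk '{
    pos = index($0, ": ")                 # first ": " ends the package name
    if (pos == 0) { print; next }         # no tag list: pass the line through
    pkg = substr($0, 1, pos - 1)
    n = split(substr($0, pos + 2), tags, /, */)
    split("", seen)                       # reset the per-line set of facets
    out = ""
    for (i = 1; i <= n; i++) {
      facet = tags[i]
      sub(/::.*/, "", facet)              # keep only the part before "::"
      if (facet != "" && !(facet in seen)) {
        seen[facet] = 1
        out = (out == "" ? facet : out ", " facet)
      }
    }
    print pkg ": " out
  }'
}

# The debtags example from the text above:
printf '%s\n' \
  'debtags: implemented-in::c++, interface::commandline, suite::debian, use::searching' |
  facets_only
# prints "debtags: implemented-in, interface, suite, use"
```

Each facet is printed once per package, in the order it first appears.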

Once I have this collection, I can run tagcoll on it and get something
fun:

  # Count the number of facets
  $ debtags facetcoll | tagcoll reverse | wc -l
  31

  # Get the list of toplevel facets if we created a smart hierarchy with
  # the facets only
  debtags facetcoll | tagcoll hierarchy | cut -f2 -d/ | cut -f1 -d: |
  sort | uniq | wc -l
  26

And that's a first narrowing step: from 31, we got down to 26.  Having
a look around, I think that 26 could become much better.  The toplevel
facets include stuff like 'web', 'x11' or 'uitoolkit' which I feel could
get out of there somehow.  In fact, 'uitoolkit' should all be inside
'interface' somehow.  What are the packages that have
uitoolkit::something and not interface::something?

  debtags facetcoll | grep uitoolkit | grep -v interface | cut -d: -f1 |
  sort | uniq | wc -l
  2140

2140 packages that probably need some love.  Some examples: xvncviewer
(should be interface::x11), wesnoth (should be interface::sdl) and so
on.  (I just sent a tag patch for xvncviewer).
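
The grep pipeline above matches anywhere on the line, so a package whose
name happens to contain "interface" would be dropped as well.  A stricter
variant could compare whole facet names instead.  The sketch below
assumes facet-only lines of the form "package: facet, facet, ..."; the
function name has_without and the sample package names are made up for
illustration.

```shell
# Hypothetical helper: list packages that carry facet $1 but not facet $2.
# Reads facet-only lines ("package: facet, facet, ...") on stdin.
has_without() {
  awk -v want="$1" -v lack="$2" '{
    pos = index($0, ": ")
    if (pos == 0) next                    # skip lines without a facet list
    pkg = substr($0, 1, pos - 1)
    n = split(substr($0, pos + 2), f, /, */)
    w = 0; l = 0
    for (i = 1; i <= n; i++) {
      if (f[i] == want) w = 1             # exact facet match, not substring
      if (f[i] == lack) l = 1
    }
    if (w && !l) print pkg
  }' | sort -u
}

# Made-up sample data; only pkg-a has both facets.
printf '%s\n' \
  'pkg-a: interface, uitoolkit' \
  'pkg-b: role, uitoolkit' \
  'libinterface-demo: devel, uitoolkit' |
  has_without uitoolkit interface
```

On real data this would be used as
"debtags facetcoll | has_without uitoolkit interface | wc -l".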

This is a nice way of seeing where work is needed.  But we can have
more.

I then implemented a new 'findspecials' feature in 'tagcoll': it creates
a smart hierarchy, and then for each toplevel node it shows which
packages made it a toplevel node instead of letting it be placed inside
some other node.

  debtags facetcoll | tagcoll findspecials
  (see results in http://www.enricozini.org/store/specials.txt)

Look at 'dbtech' there: only 5 special items!  That hardly seems enough
to justify a toplevel facet, does it?

  wget -qO- http://www.enricozini.org/store/specials.txt | grep -v '^ '
  special: 4425 items, 0 special items:
  devel: 3783 items, 3736 special items:
  role: 2676 items, 2072 special items:
  uitoolkit: 2257 items, 1231 special items:
  use: 2009 items, 923 special items:
  langdevel: 1951 items, 646 special items:
  suite: 1585 items, 124 special items:
  media: 1459 items, 323 special items:
  interface: 1020 items, 109 special items:
  protocol: 578 items, 89 special items:
  game: 577 items, 13 special items:
  format: 415 items, 32 special items:
  implemented-in: 395 items, 17 special items:
  hardware: 390 items, 90 special items:
  debian-edu: 375 items, 29 special items:
  culture: 340 items, 101 special items:
  field: 306 items, 59 special items:
  data: 263 items, 27 special items:
  x11: 259 items, 45 special items:
  admin: 241 items, 82 special items:
  web: 240 items, 11 special items:
  security: 221 items, 33 special items:
  sound: 158 items, 20 special items:
  dbtech: 153 items, 5 special items:
  accessibility: 55 items, 21 special items:

That's another place that could use some love!  Look at game, format,
implemented-in, data, web, dbtech, sound, accessibility...
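
To spot such candidates without reading the whole list, the summary
lines can be ranked by their number of special items.  This is only a
sketch: rank_specials is a made-up name, and it assumes lines shaped
exactly like the listing above ("name: N items, M special items:").

```shell
# Sketch: rank toplevel facets by how few special items keep them toplevel.
# Output columns: facet, items, special items, special items as % of items.
rank_specials() {
  awk '$3 == "items," && $5 == "special" {
    name = $1
    sub(/:$/, "", name)                   # drop the trailing colon
    if ($2 > 0)
      printf "%s %d %d %.1f\n", name, $2, $4, 100 * $4 / $2
  }' | sort -k3,3n                        # fewest special items first
}

# Three lines copied from the listing above:
printf '%s\n' \
  'devel: 3783 items, 3736 special items:' \
  'web: 240 items, 11 special items:' \
  'dbtech: 153 items, 5 special items:' |
  rank_specials
```

With these three lines, dbtech comes first, then web, then devel.  On the
full data the input would come from the wget | grep -v '^ ' pipeline
shown earlier.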

To wrap it up, this looks like a good way to go.  It is not working well
yet, not because the algorithm is bad, but because the data could be
better.  Plus, we now have a way of spotting what needs more work.

/me is considering auto-generating some HTML pages with TODO-lists of
packages pointing at Erich's packagebrowser.  Let me hack a bit into
it...


Ciao,

Enrico

--
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico@debian.org>
