Faceted tags

Enrico Zini zinie@cs.unibo.it
Tue, 6 Apr 2004 23:52:52 +0200


On Tue, Apr 06, 2004 at 11:03:05PM +0200, Erich Schubert wrote:

> >  - they are semantically invariant (for example, the property "colour"
> >    can assume many different values, but it's an invariant concept: an
> >    object will always have a colour)
> IMHO this is a bad example. Objects might have multiple colours, and
> which colour would you assign to glass?

"Transparent".  However, the idea is not to refer to "the specific
colour", but to the idea of "colour" itself.  So, "colour" is a facet,
while "red" is a value of that facet.


> > I'd thus define "facet" as the "dimension", or "axis" of categorization
> > and "tag" as the value along the dimension/axis.
> If you are talking about a dimension or axis you expect things to have
> exactly one value upon the axis; maybe an interval. You expect an axis
> to be ordered, too.

Sure: I only used those words to try to explain the concept and purpose
of having tags grouped in facets.

And what I wrote before mentioning "dimensions" and "axes" was a
definition of "facets" I found in the literature, but I'd like to use
something more flexible for us, such as allowing a package NOT to have
a given facet (for example, there are packages which may not have
anything to do with the "Devel" facet).


> > So, in our debtags domain, possible facets and relative tags can be:
> >  - Supported file formats
> >    (MP3, OGG, PDF...)
> ... and here you certainly will have applications which "have to do
> with" mp3 as well as ogg.

Sure.  But you can tag that application along the "File format" facet
with both values.  Then you can tag it along the "Use" facet as "play",
"record", "compress", "convert", "store", "organize"... to say more
about how it "has to do" with them.


> I think we should consider each tag to be an axis itself. Most will be
> binary, either applies or doesn't apply. There are a few cases where this is
> different, for example "maturity", and maybe "freedom-of-use".

I understand your reasoning, and I think it's formally correct; however,
I don't think it's what we need.  More later.


> >  - User interface toolkit
> >    (GTK, QT, GNUStep...)
> Be careful with that one; someone was very unhappy with us putting
> GNUStep in the UI/DE section; it's an application framework. ;-)

Yes, I understand that.  But as an application framework, it also is a
user interface toolkit.  They'll then be happy to see GNUStep also in
the Suite facet, and maybe also listed as a "devel::framework".

The trick is, the categorization possibilities will not be the various
tags we define, but the result of the interference between the values of
the various facets.


> I think the term "namespaces" is better for the outer terms; facets for
> the items is okay.

No: that's not the meaning of facet.  Think of "facet" like "point of
view": "from the point of view of Purpose, this package serves to Chat";
"from the point of view of Technology, this package uses IRC".


> > For example, now we know that a first set of facets and tags can be
> > defined now, and that facets can be added and expanded later.  So, there
> > is no need to define a special proper set of tags.  It'd be interesting,
> > instead, to make a good work to define a good initial set of facets to
> > start working with.
> 
> Actually that is one of the things I learned from my experiments: adding
> tags later on does work, but you would need to re-tag lots of
> applications. Adding tags should be avoided; if possible it should be
> done in a batch job so you can actually re-tag everything.
> Changing tags is even more of a hassle, so IMHO we really should spend a
> lot of time on writing a proper tag set that contains like 99% of the
> tags we are going to have at the end.
> Removing tags is _way_ easier.

And here's a very subtle trick: defining the facet, the point of view,
reduces the waterfall effect in case of tag changes.

So, if a package uses HTTP technology, it'll always use the "HTTP
technology" tag no matter what.  I can decide to add a new facet, but on
the specific "Technology" facet, that package will always have at least
the HTTP tag.

Facets define a well-specified context for a property: the property is
that, and that alone.  Tags in facets become as atomic as possible:
refactoring an "html-reader" tag is hard, but you don't have to refactor
a "tech::html", because it's got a well-defined, atomic meaning.  If you
want to model what a package can read, you add a "reads" facet, and
tag the package with a "reads::html" tag, but "tech::html" is still valid.

You add new dimensions (please allow me the word), but the existing
dimensions don't have to change (unless, of course, you defined as a
facet something that is not one).
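
A minimal sketch of this point, assuming tags are stored as plain
"facet::value" strings.  The helper name and the example package are
invented for illustration, they are not part of debtags or libtagcoll:

```python
# Hypothetical sketch: tags are plain "facet::value" strings.
def tags_in_facet(tags, facet):
    """Return the tags that belong to the given facet."""
    prefix = facet + "::"
    return {tag for tag in tags if tag.startswith(prefix)}

# Invented example package, tagged only along the "tech" facet at first.
package_tags = {"tech::html", "tech::http"}
tech_before = tags_in_facet(package_tags, "tech")

# Later a new "reads" facet is introduced; nothing in "tech" has to change.
package_tags.add("reads::html")
tech_after = tags_in_facet(package_tags, "tech")

print(tech_before == tech_after)  # prints True
```

Adding the "reads" facet only adds information: the view of the package
from the "tech" facet is the same before and after.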


> > And here we know that if GNUStep is something inbetween of a widget
> > toolkit and a desktop environment, then it should be categorized along
> > at least two facets/dimensions/axis: "Widget Toolkit" and "Desktop
> > Environment".
> GNUStep is not a package. We categorize packages.
> It means our "GNUStep" tag has to be split and renamed.

Let me rephrase (sometimes I forget I'm a messy Italian talking to a
very precise German mathematician :) :

 And here we know that if GNUStep is something in between a widget
 toolkit and a desktop environment, then GNUStep applications should be
 categorized along at least two facets: "Widget Toolkit" and "Desktop
 Environment".

Facets don't allow you to have a "GNUStep" tag without a context.  And
the context makes the GNUStep tag mean a specific, well-defined property
of the object, be it the "GNUStep toolkit", the "GNUStep desktop" or
the "GNUStep libraries".

Yes, we did that already with namespaces.  Basically, transitioning to
facets boils down to just mandating namespaces.
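
As a rough illustration of what "mandating namespaces" could mean in
practice, here is a sketch that rejects any tag lacking a facet.  The
"::" separator follows the examples in this thread; the function names
and the sample tags are my own assumptions:

```python
# Hypothetical sketch: a tag is only valid if it carries its facet.
def split_tag(tag):
    """Split 'facet::value' into (facet, value); reject bare tags."""
    facet, sep, value = tag.partition("::")
    if not (sep and facet and value):
        raise ValueError("tag has no facet: %r" % tag)
    return facet, value

def bare_tags(tags):
    """Return the tags that would be rejected once facets are mandatory."""
    rejected = []
    for tag in tags:
        try:
            split_tag(tag)
        except ValueError:
            rejected.append(tag)
    return sorted(rejected)

# Invented tag names: a context-free "gnustep" next to faceted variants.
print(bare_tags(["gnustep", "uitoolkit::gnustep", "suite::gnustep"]))
# prints ['gnustep']
```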


> > I find that grouping tags in facets/dimensions/axes makes the
> > cataloguing work much easier, because the meaning of each tag is
> > much, much more clear, having the context attached.
> Yeah, i started to do this with my namespaces; instead of putting them
> into "dimensions" or "axes" i had put them into a tree (actually a
> network, a DAG) hierarchy using implications. This is more flexible.

Definitely.  But we're talking about the same thing: in reading the
literature about facets, I think I just found an interesting frame of
reference in which to see our work.

There is a difference between using facets and implications, though: if
you have "http" implying "net", then you have a "net" tag which means
too much, or almost nothing.  Facets don't allow you to have over-broad
tags (which, by the way, are the big refactoring headaches).

Package identities are captured not by creating specific tags, but by
interference.  If a new package comes out which is difficult to
categorize, then it may be a hint that we need a new facet/point of
view from which to look at packages, leaving the existing ones unchanged.
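
To make "interference" concrete, here is a small sketch in which a
package is selected by combining values from several facets rather
than by one specific tag.  The package names and the "purpose" and
"tech" tags are invented for the example:

```python
# Hypothetical sketch: select packages by combining tags from several facets.
packages = {
    "irc-client": {"purpose::chat", "tech::irc"},
    "jabber-client": {"purpose::chat", "tech::jabber"},
    "irc-server": {"purpose::serve", "tech::irc"},
}

def select(packages, wanted):
    """Return the names of packages carrying every tag in `wanted`."""
    return sorted(name for name, tags in packages.items() if wanted <= tags)

# No "irc-chat-client" tag exists: the combination identifies the package.
print(select(packages, {"purpose::chat", "tech::irc"}))  # prints ['irc-client']
```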


> > I plan to write some special support for facets in libtagcoll and the
> > various related applications, as for example one may want to query for
> > administration::* or for *::html, or to list all tags in a given facet.
> > But yes, all existing tools already work great today!
> You don't need this kind of special handling if you use implications.
> administration::something should imply "top-level" administration,
> whereas "file-format::html" should imply "html".
> Using implications also allows to group for example html, txt, tex as
> "text file formats" and mp3, ogg, wav as "audio file formats".

It's this last grouping that creates problems, IMO: "ogg" is an audio
file format, but also a video file format.  It really is a multimedia
container which is commonly used to carry "vorbis" audio data, and can
carry data encoded with other codecs.

Instead of defining "audio file formats", you define "file format" and
you define "media" and "technology".  Then you have file-format::ogg,
media::audio and codec::vorbis.  If the OGG people decide to put
images or ELF objects inside OGG files, then all the existing facets
will still be valid, and we'll start having applications tagged with
"file-format::ogg", "media::raster-image" and "codec::jpg", and maybe
"file-format::ogg", "devel::linker".

By categorizing with interference, we support a huge number of cases we
haven't thought of from the beginning.  I see this as extremely
important, as I definitely want to assume that we can't think of
everything from the beginning.
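
Here is a sketch of the kind of facet-aware query I have in mind
(something like administration::* or *::html), applied to the OGG
example.  It uses shell-style patterns from Python's standard fnmatch
module; the package names and the "use::edit" tag are made up, and
this is not how libtagcoll does or will do it:

```python
import fnmatch

# Hypothetical packages tagged along independent facets.
packages = {
    "vorbis-player": {"file-format::ogg", "media::audio", "codec::vorbis"},
    "image-muxer": {"file-format::ogg", "media::raster-image", "codec::jpg"},
    "html-editor": {"file-format::html", "use::edit"},
}

def match(packages, pattern):
    """Return the names of packages with at least one tag matching `pattern`."""
    return sorted(
        name
        for name, tags in packages.items()
        if any(fnmatch.fnmatchcase(tag, pattern) for tag in tags)
    )

print(match(packages, "file-format::ogg"))  # prints ['image-muxer', 'vorbis-player']
print(match(packages, "*::html"))           # prints ['html-editor']
```

Putting images inside OGG files only required new values in the
"media" and "codec" facets: the "file-format::ogg" tag and the queries
on it stay exactly as they were.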


Ciao,

Enrico