Debtags for science

Sat Sep 6 17:37:22 UTC 2008

[damn!  I postponed this since I wrote it on a bus, but then I forgot to
send it]

On Fri, Aug 01, 2008 at 06:01:40PM +0100, Chris Walker wrote:

> I started a thread on package classification for science packages on
> the debian-science mailing list [1]. There seems consensus that
> debtags ought to be the answer (or at least an answer to some problems).
> Input from debtags gurus would be very welcome.
> [1] http://lists.debian.org/debian-science/2008/06/msg00060.html and
> continuing the next month at
> http://lists.debian.org/debian-science/2008/07/msg00000.html 

Thanks Chris, I've now dived into that thread and I'm soaked with all
the ideas: we are in deep waters because the subject is very tricky, but
I'm confident we'll make it to dry land.  Note that since I'm neither
an academic nor a library scientist I risk slipping from the boat in the
following discussion, using the wrong terminology and making myself
look ridiculous, but I hope it won't happen often.  If in doubt
about the meaning of a phrase, please take the one that you think makes
the most sense.

We have many dangers to be aware of:

 * A mismatch of intention

Most categorisation efforts we are aware of (and science has a lot) have
the goal of being universal, or precise and complete.  Debtags on the
other hand has only the goal of being useful for searching and
navigating Debian packages.

This makes the job easier for Debtags, but many habits that come from
the use of serious classification systems will get in the way and lead
us to mistakes like overengineering.

For example, for a debtags tag to be useful there should be at least 6
or 7 packages that use it.  If we categorised packages like scientific
books, we would end up with lots of tags attached to very few packages,
even more tags attached to no packages at all, and the result, once
we removed those categories with no packages, would probably look like a
crappy taxonomy effort with lots of holes.

Also, we need to be careful to stick to the debtags intention of
satisfying specific use cases, because that allows us to take advantage
of debtags peculiarities.  For example, debtags allows us to have a facet
roughly corresponding to the common tasks of a group of users (like the
game::* or the admin::* facets), and that gives us the
luxury of using the terminology and in general the way of thinking of
that group of users, safely ignoring the worry that it might not be the
same way of thinking and terminology of a different group.

Also, don't worry about making a category that has *exactly* 7 packages:
a tag is too broad for debtags only if it applies to the *whole archive*
except maybe 7 packages.
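
As a rough illustration of these two bounds, here is a minimal Python
sketch.  The package names and tags in it are made up, MIN_PACKAGES is
just the rough figure of 7 mentioned above, and tag_usefulness is a
hypothetical helper, not part of any debtags tool.

```python
from collections import Counter

MIN_PACKAGES = 7  # rough figure from the text, not a hard rule


def tag_usefulness(package_tags, min_packages=MIN_PACKAGES):
    """Classify each tag as 'too rare', 'too broad' or 'ok'.

    package_tags maps a package name to the set of tags attached to it.
    """
    total = len(package_tags)
    counts = Counter(tag for tags in package_tags.values() for tag in tags)
    verdicts = {}
    for tag, n in counts.items():
        if n < min_packages:
            # Attached to too few packages to help anyone search.
            verdicts[tag] = "too rare"
        elif total - n < min_packages:
            # Applies to nearly the whole archive, so it filters nothing.
            verdicts[tag] = "too broad"
        else:
            verdicts[tag] = "ok"
    return verdicts


# Made-up archive of 20 packages, for illustration only.
archive = {f"pkg{i}": {"example::everywhere"} for i in range(20)}
for i in range(10):
    archive[f"pkg{i}"].add("example::half")
archive["pkg0"].add("example::lonely")

print(tag_usefulness(archive))
```

Anything between the two extremes counts as fine here, however many
packages it covers.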

7 is a rough lower bound, but there is no upper bound to speak of.  Look
at the "role" facet: most of its tags can apply to thousands of packages,
and that's good.  The role facet is in fact particularly good, as *any*
tag you pick will weed out a substantial percentage of the packages in
Debian from your list of potential packages to look at.  Of course,
searching is an iterative process that doesn't stop when you pick a tag:
you then pick another one, from a different facet, and you keep
going until you have few enough packages in front of you that it becomes
easy and quick to read through their short descriptions one by one.
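
The iterative narrowing described above can be sketched in a few lines
of Python.  This is only an illustration: the packages are invented, the
tag names are merely plausible examples, and narrow is a hypothetical
helper rather than a function of any debtags library.

```python
def narrow(package_tags, chosen_tags):
    """Return the sorted names of packages carrying every chosen tag."""
    wanted = set(chosen_tags)
    return sorted(pkg for pkg, tags in package_tags.items() if wanted <= tags)


# Invented packages and illustrative tags from three different facets.
packages = {
    "plotter": {"role::program", "field::physics", "interface::x11"},
    "libplot": {"role::shared-lib", "field::physics"},
    "wavesim": {"role::program", "field::physics", "interface::commandline"},
    "editor": {"role::program", "interface::x11"},
}

# Each step adds one tag from a different facet and shrinks the list.
step1 = narrow(packages, ["role::program"])
step2 = narrow(packages, ["role::program", "field::physics"])
step3 = narrow(packages, ["role::program", "field::physics", "interface::x11"])
print(step1, step2, step3)
```

Each extra tag comes from a different facet, which is what lets every
step cut the list down independently of the previous ones.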

 * Taxonomies vs faceted classification

In the scientific world we often see taxonomies (think of the
classification of living beings) where things are grouped in large
groups and the things inside every large group are divided into smaller
groups, and so on.  In Debtags instead we identify an interesting point of
view/aspect/facet of packages and we try to categorise them from that
point of view *only*, ignoring any other aspect.

This way of working can sometimes seem quite counterintuitive, as it's
difficult to look at only one aspect without feeling like we're
"throwing away" lots of information, and we risk ending up with a facet
that also tries to describe everything else that we noticed during the
process.

 * Levels of abstraction

The same aspect of a package can sometimes be seen at different levels
of abstraction; for example, if we look at packages from a "use it for
science" point of view, we can easily switch from very high levels like
"modeling/sampling/publishing" to very low levels like "it uses this
algorithm".  All such levels can be important, but when designing a
facet it's important to stick to one level; if another level is also
important, there can be a new facet for it.

 * Specialisation/compartmentalisation

Scientists are themselves a taxonomy: in the debian-science discussion
I've seen people identifying themselves in the most diverse ways:
there's the quantum people, the crystallography people, and I work with
meteorologists: these three examples show very different areas of
physics, and even different levels of abstraction.  And the
meteorologists I work with are likely to consider it reductive to
be defined just as meteorologists: some are specialised in sea waves,
some in radar processing, some in satellite imagery, and they probably
could go on categorising themselves until they end up having more
categories than people, and I'm sure it's the same in pretty much every
branch of science.  Therefore, don't pick a category that describes what
YOU do, because it will never be specific enough (unless, of course, you
name your category after yourself, for example "Chris Walker" or
"Enrico Zini", but then it wouldn't be a very useful classification
system, or at least it wouldn't solve the problem that you're trying to
solve here).

Rather than describe what you do, describe what everyone in your faculty
does: that probably gives you a good level of abstraction from which to
look at things (note: this isn't exactly my idea, and came out of a
conversation with Chris).

********

Now that I've put up the big red boring verbose warning sign about the
possible pitfalls, I can make a practical proposal to start a discussion
as well as some work.  Talking with Chris we came up almost by accident
with this possible new facet, which I rather like:

  Facet: science
  Description: Science

  Tag: science::modelling
  Description: Modelling

  Tag: science::data-acquisition
  Description: Data acquisition

  Tag: science::plotting
  Description: Plotting

  Tag: science::bibliography
  Description: Bibliography

  Tag: science::publishing
  Description: Publishing

This more or less models the point of view of "how that package can be
useful for research work" from the level of abstraction of "don't say
what you do, say what every scholar in your faculty does", and it
consequently gives a facet that is potentially useful to everyone in
your faculty, and therefore a big win.
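
For anyone who wants to play with the proposal, here is a minimal
Python sketch that reads stanzas laid out like the ones above and checks
that every tag belongs to a declared facet.  It is a simplified reading
of that layout (blank-line separated "Key: value" paragraphs, no
continuation lines), not the real debtags vocabulary parser;
parse_vocabulary and SAMPLE are names invented for this example, and the
sample only includes part of the proposed facet.

```python
import re


def parse_vocabulary(text):
    """Parse blank-line separated 'Key: value' stanzas into dicts."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            # Split on the first colon only, so 'science::plotting'
            # stays intact in the value.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        stanzas.append(fields)
    return stanzas


# Part of the proposed facet, used as sample input.
SAMPLE = """\
Facet: science
Description: Science

Tag: science::modelling
Description: Modelling

Tag: science::plotting
Description: Plotting
"""

stanzas = parse_vocabulary(SAMPLE)
facets = [s["Facet"] for s in stanzas if "Facet" in s]
tags = [s["Tag"] for s in stanzas if "Tag" in s]

# Every tag name should start with the name of a declared facet.
orphans = [t for t in tags if t.split("::")[0] not in facets]
print(facets, tags, orphans)
```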

I propose we give this facet a look, see if there's anything wrong
(missing things can always be added later) and give it a go.  Then,
since changing a tool also changes the nature of the work done with the
tool, after we are satisfied that these tags are properly put into use,
we can restart the discussion and see where it leads.

Ciao,

Enrico

-- 
GPG key: 1024D/797EBFAB 2000-12-05 Enrico Zini <enrico at debian.org>