yada yada, new_deb822.py

Mon Aug 21 01:34:47 UTC 2006

* John Wright [Sun, 20 Aug 2006 18:55:49 -0600]:

> Wow; I hadn't bothered to see how long it took deb822 to parse
> unstable's Packages...  I've used ParseTagFile before, but I was always
> frustrated with the lack of documentation.  Still, wrapping deb822
> around it isn't a bad idea.  I'd be willing to work on integration.
> Would you prefer that I wait for a new_deb822 branch, or can I start
> with new_deb822.py above?

Ah, my plan was for the latter, since I didn't see myself having the
energy to do all the surgery needed to merge new_deb822.py back. But
alas, I ended up doing it anyway, and forgot to publish the patches. It
differs from the original new_deb822.py in that apt_pkg is used only for
iter_paragraphs(), not for e.g. __init__(). Bundles are attached, and the
branch is available here:

  http://people.debian.org/~adeodato/code/branches/deb822/new_deb822
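
The lookup order described in the first bundle's commit message (writes go
to a private dict, reads fall back to the read-only parsed data) can be
sketched roughly as follows. This is only an illustration: OverlayParagraph
and its attribute names are hypothetical, and a plain dict stands in for the
section object that apt_pkg would provide.

```python
class OverlayParagraph(object):
    """Hypothetical stand-in for the lookup scheme, not the real Deb822."""

    def __init__(self, parsed=None):
        self._dict = {}          # local modifications live here
        self._parsed = parsed    # treated as read-only; may be None
        # Canonical, ordered list of the keys that are present.
        self._keys = list(parsed.keys()) if parsed is not None else []

    def __setitem__(self, key, value):
        if key not in self._keys:
            self._keys.append(key)
        self._dict[key] = value  # never touches self._parsed

    def __getitem__(self, key):
        try:
            return self._dict[key]
        except KeyError:
            if self._parsed is not None:
                return self._parsed[key]
            raise

    def keys(self):
        return list(self._keys)


# A plain dict plays the role of the parsed, read-only section here.
section = {'Package': 'foo', 'Version': '1.0'}
para = OverlayParagraph(section)
para['Version'] = '1.1'   # stored in the overlay only
```

After the assignment, para['Version'] comes from the overlay while
section['Version'] is left untouched, which is the point of keeping the
two stores separate.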

You'll quickly note that this is again a DictMixin. There's no
particular rationale for this, other than it gave me a very
straightforward way to get a dict-like object, without having to care
about ensuring, when subclassing dict, that every possible access method
works okay (e.g. remember your comment in CaseInsensitiveDict.get :-P).

Because of this, if you'd really like for it to be a dict subclass, feel
free to fight with the details and move it back to a real dict subclass. ;-)
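
To make the trade-off concrete, here is a small sketch. It assumes Python 3,
where collections.abc.MutableMapping plays roughly the role that
UserDict.DictMixin plays in the attached code; CaseInsensitiveFields and
NaiveDict are hypothetical names made up for the comparison. With the mixin,
get(), items(), "in" and friends are all derived from the few methods you
write. With a dict subclass, dict.get() does not go through an overridden
__getitem__, so each access method has to be handled separately.

```python
from collections.abc import MutableMapping


class CaseInsensitiveFields(MutableMapping):
    """Hypothetical mapping whose keys compare case-insensitively."""

    def __init__(self):
        self._data = {}   # lowercased key -> (original key, value)

    def __getitem__(self, key):
        return self._data[key.lower()][1]

    def __setitem__(self, key, value):
        # Keep the spelling that was used on first assignment.
        original = self._data.get(key.lower(), (key, None))[0]
        self._data[key.lower()] = (original, value)

    def __delitem__(self, key):
        del self._data[key.lower()]

    def __iter__(self):
        return iter([original for original, _ in self._data.values()])

    def __len__(self):
        return len(self._data)


class NaiveDict(dict):
    """Dict subclass that only overrides __getitem__."""

    def __getitem__(self, key):
        return dict.__getitem__(self, key.lower())


fields = CaseInsensitiveFields()
fields['Package'] = 'foo'

naive = NaiveDict(package='foo')
```

Here fields.get('PACKAGE') finds the value because the mixin's get() calls
__getitem__, whereas naive.get('PACKAGE') returns None even though
naive['PACKAGE'] works.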

I also added a somewhat lengthier README file and a TODO file.
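
The README's caveat about iter_paragraphs() (yielded objects share storage,
so they can't be kept across iterations) is easy to trip over. The sketch
below is only an analogy: iter_shared is a hypothetical generator that
reuses one dict for every item, and it does not use apt_pkg at all.

```python
def iter_shared(chunks):
    """Hypothetical generator that reuses a single dict for every item."""
    current = {}
    for chunk in chunks:
        current.clear()
        current.update(chunk)
        yield current   # always the same object


data = [{'Package': 'a'}, {'Package': 'b'}, {'Package': 'c'}]

# Keeping the yielded objects: every entry is the same dict, which by
# now holds the last paragraph.
kept = list(iter_shared(data))
names_wrong = [p['Package'] for p in kept]

# Using each object only inside its own iteration is fine.
names_right = [p['Package'] for p in iter_shared(data)]

# Copying before keeping also works, at the cost of the copy.
copies = [dict(p) for p in iter_shared(data)]
```

This is the situation that passing shared_storage=False is meant to avoid,
at the price of falling back to the slower pure-Python parser.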

Cheers,

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
                                 Listening to: Mirafiori - Cinco minutos
-------------- next part --------------
# Bazaar revision bundle v0.8
#
# message:
#   Support using fast apt_pkg from python-apt in iter_paragraphs().
#   
#   A short discussion of the implementation: __init__ now accepts a private
#   parameter _parsed which, if set, is assumed to be a read-only dict-like
#   object with already parsed data. If not present, the original parser
#   routine in python is used (refactored to Deb822._internal_parser).
#   
#   The __keys member always holds the canonical list of present keys, in
#   the appropriate order. Retrieving values is always attempted against
#   __dict first, and if that fails, against __parsed (this is because
#   modifications go to __dict, since __parsed is R/O).
#   
# committer: Adeodato Simó <dato at net.com.org.es>
# date: Mon 2006-08-21 03:19:15.778000116 +0200

=== modified file deb822.py

--- deb822.py
+++ deb822.py
@@ -1,10 +1,11 @@
+# vim: fileencoding=utf-8
 #
 # A python interface for various rfc822-like formatted files used by Debian
 # (.changes, .dsc, Packages, Sources, etc)
 #
-# Written by dann frazier <dannf at dannf.org>
 # Copyright (C) 2005-2006  dann frazier <dannf at dannf.org>
 # Copyright (C) 2006       John Wright <john at movingsucks.org>
+# Copyright (C) 2006       Adeodato Simó <dato at net.com.org.es>
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License
@@ -19,52 +20,120 @@
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
-#
+
+
+try:
+    import apt_pkg
+    _have_apt_pkg = True
+except ImportError:
+    _have_apt_pkg = False
 
 import re
 import StringIO
-
-class Deb822(dict):
-    def iter_paragraphs(cls, sequence, fields=None):
-        """Generator that yields an object for each paragraph in sequence."""
-
-        iterable = iter(sequence)
-        done_one = False
-
+import UserDict
+
+
+class Deb822(object, UserDict.DictMixin):
+
+    def __init__(self, sequence=None, fields=None, _parsed=None):
+        """Create a new Deb822 instance.
+
+        :param sequence: a string, or any object that returns a line of
+            input each time, normally a file().
+
+        :param fields: if given, it is interpreted as a list of fields that
+            should be parsed (the rest will be discarded).
+
+        :param _parsed: internal parameter.
+        """
+        self.__dict = {}
+        self.__keys = []
+        self.__parsed = None
+
+        if sequence is not None:
+            try:
+                self._internal_parser(sequence, fields)
+            except EOFError:
+                pass
+        elif _parsed is not None:
+            self.__parsed = _parsed
+            if fields is None:
+                self.__keys.extend(self.__parsed.keys())
+            else:
+                self.__keys.extend([ f for f in fields if self.__parsed.has_key(f) ])
+
+    def iter_paragraphs(cls, sequence, fields=None, shared_storage=True):
+        """Generator that yields a Deb822 object for each paragraph in sequence.
+
+        :param sequence: same as in __init__.
+
+        :param fields: likewise.
+
+        :param shared_storage: if sequence is a file(), apt_pkg will be used 
+            if available to parse the file, since it's much much faster. On the
+            other hand, yielded objects will share storage, so they can't be
+            kept across iterations. (Also, PGP signatures won't be stripped
+            with apt_pkg.) Set this parameter to False to disable using apt_pkg. 
+        """
+
+        # TODO Think about still using apt_pkg even if shared_storage is False,
+        # by somehow instructing the constructor to make a copy of the data. (If
+        # this is still faster.)
+
+        if _have_apt_pkg and shared_storage and isinstance(sequence, file):
+            parser = apt_pkg.ParseTagFile(sequence)
+            while parser.Step() == 1:
+                yield cls(fields=fields, _parsed=parser.Section)
+        else:
+            iterable = iter(sequence)
+            x = cls(iterable, fields)
+            while len(x) != 0:
+                yield x
+                x = cls(iterable, fields)
+
+    iter_paragraphs = classmethod(iter_paragraphs)
+
+    ###
+
+    # Methods for DictMixin
+
+    def __setitem__(self, key, value):
+        if not key in self.__keys:
+            self.__keys.append(key)
+        self.__dict[key] = value
+
+    def __getitem__(self, key):
         try:
-            while True:
-                yield cls(iterable, fields)
-                done_one = True
-        except EOFError:
-            if not done_one:
+            return self.__dict[key]
+        except KeyError:
+            if self.__parsed is not None:
+                return self.__parsed[key]
+            else:
                 raise
 
-    iter_paragraphs = classmethod(iter_paragraphs)
-
-    def __init__(self, sequence, fields=None):
-        """Create a new Deb822 instance from a sequence's contents
-
-        sequence may be any object that returns a line of input each time, like
-        a file() or an array of strings.
-
-        If not None, the fields parameter must be a list of the fields that
-        should be parsed (the rest will be discarded).
-        
-        You can initialize a Deb822 object from another with:
-        
-            Deb822(one_deb822.dump().splitlines())
-        """
-        
+    def __delitem__(self, key):
+        if self.__dict.has_key(key):
+            del self.__dict[key]
+        try:
+            self.__keys.remove(key)
+        except ValueError:
+            raise KeyError(key)
+
+    def keys(self):
+        return list(self.__keys)
+
+    ###
+
+    def _internal_parser(self, sequence, fields=None):
         single = re.compile("^(?P<key>\S+)\s*:\s*(?P<data>\S.*?)\s*$")
         multi = re.compile("^(?P<key>\S+)\s*:\s*$")
         multidata = re.compile("^\s(?P<data>.+?)\s*$")
 
         wanted_field = lambda f: fields is None or f in fields
 
-        # Storing keys here is redundant, but it allows us to keep track of the
-        # original order.
-        self._keys = []
-        
+        if isinstance(sequence, basestring):
+            sequence = sequence.splitlines()
+
         curkey = None
         content = ""
         for line in self.gpg_stripped_paragraph(sequence):
@@ -79,7 +148,6 @@
 
                 curkey = m.group('key')
                 self[curkey] = m.group('data')
-                self._keys.append(curkey)
                 content = ""
                 continue
 
@@ -94,7 +162,6 @@
 
                 curkey = m.group('key')
                 self[curkey] = ""
-                self._keys.append(curkey)
                 content = ""
                 continue
 
@@ -106,26 +173,32 @@
         if curkey:
             self[curkey] += content
 
-    def __delitem__(self, key):
-        dict.__delitem__(self, key)
-        self._keys.remove(key)
+    ###
+
+    def __str__(self):
+        return self.dump()
+
+    def __repr__(self):
+        return '{%s}' % ', '.join(['%r: %r' % (k, v) for k, v in self.items()])
 
     def dump(self, fd=None):
         """Dump the the contents in the original format
 
         If fd is None, return a string.
         """
-        
+
         if fd is None:
             fd = StringIO.StringIO()
             return_string = True
         else:
             return_string = False
-        for key in self.keys():
-            fd.write(key + ": " + self[key] + "\n")
+        for key, value in self.iteritems():
+            fd.write('%s: %s\n' % (key, value))
         if return_string:
             return fd.getvalue()
 
+    ###
+
     def isSingleLine(self, s):
         if s.count("\n"):
             return False
@@ -140,7 +213,7 @@
             return s1
         if not s1:
             return s2
-        
+
         if self.isSingleLine(s1) and self.isSingleLine(s2):
             ## some fields are delimited by a single space, others
             ## a comma followed by a space.  this heuristic assumes
@@ -154,7 +227,7 @@
             L.sort()
 
             prev = merged = L[0]
-            
+
             for item in L[1:]:
                 ## skip duplicate entries
                 if item == prev:
@@ -162,7 +235,7 @@
                 merged = merged + delim + item
                 prev = item
             return merged
-            
+
         if self.isMultiLine(s1) and self.isMultiLine(s2):
             for item in s2.splitlines(True):
                 if item not in s1.splitlines(True):
@@ -170,7 +243,7 @@
             return s1
 
         raise ValueError
-    
+
     def mergeFields(self, key, d1, d2 = None):
         ## this method can work in two ways - abstract that away
         if d2 == None:
@@ -201,6 +274,7 @@
             return None
 
         return merged
+    ###
 
     def gpg_stripped_paragraph(sequence):
         lines = []
@@ -241,20 +315,19 @@
 
     gpg_stripped_paragraph = staticmethod(gpg_stripped_paragraph)
 
-    def keys(self):
-        # Override keys so that we can give the correct order
-        other_keys = dict.keys(self)
-        for key in self._keys:
-            other_keys.remove(key)
-        return self._keys + other_keys
+###
 
 class _multivalued(Deb822):
     """A class with (R/W) support for multivalued fields."""
-    def __init__(self, fp):
-        Deb822.__init__(self, fp)
+
+    def __init__(self, *args, **kwargs):
+        Deb822.__init__(self, *args, **kwargs)
 
         for field, fields in self._multivalued_fields.items():
-            contents = self.get(field, '')
+            try:
+                contents = self[field]
+            except KeyError:
+                continue
 
             if self.isMultiLine(contents):
                 self[field] = []
@@ -296,21 +369,29 @@
         if return_string:
             return fd.getvalue()
 
+
+###
+
 class Dsc(_multivalued):
     _multivalued_fields = {
         "Files": [ "md5sum", "size", "name" ],
     }
-# Sources files have the same multivalued format as dsc files
-Sources = Dsc
+
 
 class Changes(_multivalued):
     _multivalued_fields = {
         "Files": [ "md5sum", "size", "section", "priority", "name" ],
     }
 
+
 class PdiffIndex(_multivalued):
     _multivalued_fields = {
         "SHA1-Current": [ "SHA1", "size" ],
         "SHA1-History": [ "SHA1", "size", "date" ],
         "SHA1-Patches": [ "SHA1", "size", "date" ],
     }
+
+###
+
+Sources = Dsc
+Packages = Deb822

=== modified file test_deb822.py
--- test_deb822.py
+++ test_deb822.py
@@ -161,7 +161,7 @@
         for k, v in dict_.items():
             self.assertEqual(v, deb822_[k])
 
-        self.assertEqual(0, dict.__cmp__(dict_, deb822_))
+        self.assertEqual(0, deb822_.__cmp__(dict_))
 
     def deb822_from_format_string(self, string, dict_=PARSED_PACKAGE, cls=deb822.Deb822):
         """Construct a Deb822 object by formatting string with % dict.
@@ -213,11 +213,11 @@
                 self.assertWellParsed(d, PARSED_PACKAGE)
 
     def test_parser_empty_input(self):
-        self.assertRaises(EOFError, deb822.Deb822, [])
+        self.assertEqual({}, deb822.Deb822([]))
 
     def test_iter_paragraphs_empty_input(self):
         generator = deb822.Deb822.iter_paragraphs([])
-        self.assertRaises(EOFError, generator.next)
+        self.assertRaises(StopIteration, generator.next)
 
     def test_parser_limit_fields(self):
         wanted_fields = [ 'Package', 'MD5sum', 'Filename', 'Description' ]

# revision id: dato at net.com.org.es-20060821011915-22cd5784f3aea2b1
# sha1: b08fbd21cc60037effde90c2fc2d44ad52105168
# inventory sha1: 3d7e2b9008bf337e9e761366f7993e0bc188be58
# parent ids:
#   john at movingsucks.org-20060821005735-205500fba5663389
# base id: john at movingsucks.org-20060821005735-205500fba5663389
# properties:
#   branch-nick: new_deb822

-------------- next part --------------
# Bazaar revision bundle v0.8
#
# message:
#   Add top-level README and TODO files.
#   
# committer: Adeodato Simó <dato at net.com.org.es>
# date: Mon 2006-08-21 03:19:46.796000004 +0200

=== added file README // file-id:readme-20060819223922-lizz4h5sh03vqgnk-1
--- /dev/null
+++ README
@@ -0,0 +1,98 @@
+deb822.py README
+================
+
+The Python deb822 module aims to provide a dict-like interface to various
+rfc822-like Debian data formats, like Packages/Sources, .changes/.dsc, pdiff
+Index files, etc. The benefit is that deb822 knows about special fields that
+contain whitespace-separated sub-fields, and provides named access to them. For
+example, the "Files" field in Source packages, which has three subfields, or
+"Files" in .changes files, which has five. These are known as "multifields".
+
+deb822 has no external dependencies, but can use python-apt if available to
+parse the data, which gives a very significant performance boost when iterating
+over big Packages files.
+
+Key lookup in Deb822 objects and their multifields is case-insensitive, but the
+original case is always preserved, e.g. when printing the object. [XXX TODO]
+
+
+Classes
+=======
+
+Here is a list of the types deb822 knows about:
+
+  * Deb822 (aliases: Packages) - base class with no multifields.
+
+  * Dsc (aliases: Sources) - class to represent .dsc files / Sources paragraphs.
+
+    - Multivalued fields:
+
+      · Files: md5sum, size, name
+
+  * Changes - class to represent a .changes file
+
+    - Multivalued fields:
+
+      · Files: md5sum, size, section, priority, name
+
+  * PdiffIndex - class to represent a pdiff Index file
+
+    - Multivalued fields:
+
+      · SHA1-Current: SHA1, size
+      · SHA1-History: SHA1, size, date
+      · SHA1-Patches: SHA1, size, date
+
+
+Input
+=====
+
+Deb822 objects are normally initialized from a file() object, from which
+at most one paragraph is read, or a string.
+
+Alternatively, any sequence that returns one line of input at a time may
+be used, e.g. an array of strings.
+
+PGP signatures, if present, will be stripped.
+
+
+Iteration
+=========
+
+All classes provide an "iter_paragraphs" class method to easily go over
+each stanza in a file with multiple entries, like a Packages.gz file.
+For example:
+
+    f = file('/mirror/debian/dists/sid/main/source/Sources')
+
+    for src in Sources.iter_paragraphs(f):
+        print src['Package'], src['Version']
+
+This method uses python-apt if available to parse the file, since it
+significantly boosts performance. The downside, though, is that yielded
+objects share storage, so they should never be kept across iterations.
+To prevent this behavior, pass a "shared_storage=False" keyword argument
+to the iter_paragraphs() function.
+
+
+Sample usage (TODO: Improve)
+============
+
+    import deb822
+
+    d = deb822.Dsc(file('foo_1.1.dsc'))
+    source = d['Source']
+    version = d['Version']
+
+    for f in d['Files']:
+        print 'Name:', f['name']
+        print 'Size:', f['size']
+        print 'MD5sum:', f['md5sum']
+
+    # If we like, we can change fields
+    d['Standards-Version'] = '3.7.2'
+
+    # And then dump the new contents
+    new_dsc = open('foo_1.1.dsc2', 'w')
+    d.dump(new_dsc)
+    new_dsc.close()

=== added file TODO // file-id:todo-20060819225804-v557fg5zokp18rvv-1
--- /dev/null
+++ TODO
@@ -0,0 +1,2 @@
+* Revamp the test suite.
+* Case insensitive key lookups.

# revision id: dato at net.com.org.es-20060821011946-835bbe10b744ff37
# sha1: c317326683ae657f2825a45e6893b6fb68664352
# inventory sha1: 53f39c2317c902752f8701528c06a89a693af91b
# parent ids:
#   dato at net.com.org.es-20060821011915-22cd5784f3aea2b1
# base id: dato at net.com.org.es-20060821011915-22cd5784f3aea2b1
# properties:
#   branch-nick: new_deb822