yada yada, new_deb822.py
Adeodato Simó
dato at net.com.org.es
Mon Aug 21 01:34:47 UTC 2006
* John Wright [Sun, 20 Aug 2006 18:55:49 -0600]:
> Wow; I hadn't bothered to see how long it took deb822 to parse
> unstable's Packages... I've used ParseTagFile before, but I was always
> frustrated with the lack of documentation. Still, wrapping deb822
> around it isn't a bad idea. I'd be willing to work on integration.
> Would you prefer that I wait for a new_deb822 branch, or can I start
> with new_deb822.py above?
Ah, my plan was for the latter, since I didn't see myself having the
energy to apply all the surgery needed to merge new_deb822.py back. But
alas, I ended up doing it, and forgot to publish the patches. It
differs from the original new_deb822.py in that apt_pkg is used only for
iter_paragraphs(), not for e.g. __init__(). Bundles attached, branch
available here:
http://people.debian.org/~adeodato/code/branches/deb822/new_deb822
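A minimal sketch of why objects yielded from shared storage cannot be kept across iterations, as the iter_paragraphs() docstring in the first bundle warns. It is written for current Python 3 rather than the Python 2 of the patch. FakeTagParser, View and iter_views are hypothetical names: FakeTagParser only imitates a parser that reuses one section buffer and is not the python-apt API. The copying branch follows the idea in the patch's TODO comment; the patch itself simply falls back to the pure-Python parser when shared_storage is False.

```python
class FakeTagParser:
    """Hypothetical stand-in for a parser that reuses one section buffer
    for every paragraph (this is NOT the python-apt API)."""

    def __init__(self, paragraphs):
        self._pending = list(paragraphs)
        self.section = {}  # reused and mutated in place on every step

    def step(self):
        if not self._pending:
            return False
        self.section.clear()
        self.section.update(self._pending.pop(0))
        return True


class View:
    """Thin wrapper that keeps a reference to the parsed data, not a copy."""

    def __init__(self, parsed):
        self._parsed = parsed

    def __getitem__(self, key):
        return self._parsed[key]


def iter_views(parser, shared_storage=True):
    while parser.step():
        # Copying yields independent objects, at the cost of extra work.
        yield View(parser.section if shared_storage else dict(parser.section))


data = [{"Package": "foo"}, {"Package": "bar"}]

# Keeping the yielded objects around: every shared view ends up showing
# whatever paragraph the parser read last.
shared = list(iter_views(FakeTagParser(data)))
copied = list(iter_views(FakeTagParser(data), shared_storage=False))

shared_names = [v["Package"] for v in shared]
copied_names = [v["Package"] for v in copied]
```

Using each object inside the loop body and then dropping it avoids the problem in both modes.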
You'll quickly note that this is again a DictMixin. There's no
particular rationale for this, other than that it gave me a very
straightforward way to get a dict-like object, without having to
ensure, when subclassing dict, that every possible access method
works okay (e.g. remember your comment in CaseInsensitiveDict.get :-P).
That said, if you'd really like it to be a dict subclass, feel
free to fight with the details and move it back to a real dict subclass. ;-)
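To illustrate the trade-off: a mixin derives the whole dict interface (get, items, "in", and so on) from a handful of primitive methods, so there is only one lookup path to get right. Here is a rough sketch in Python 3, where collections.abc.MutableMapping plays the role that UserDict.DictMixin plays in the Python 2 patch. OverlayDict and its attribute names are hypothetical, loosely following the lookup order described in the commit message of the first bundle (own dict first, then the read-only parsed data); it is not the Deb822 class itself.

```python
from collections.abc import MutableMapping


class OverlayDict(MutableMapping):
    """Hypothetical illustration; not the Deb822 class from the patch."""

    def __init__(self, parsed=None):
        self._own = {}          # local modifications go here
        self._parsed = parsed   # optional read-only mapping, never written
        self._keys = list(parsed) if parsed is not None else []

    def __getitem__(self, key):
        try:
            return self._own[key]
        except KeyError:
            # Fall back to the parsed data only for keys still present.
            if self._parsed is not None and key in self._keys:
                return self._parsed[key]
            raise

    def __setitem__(self, key, value):
        if key not in self._keys:
            self._keys.append(key)
        self._own[key] = value

    def __delitem__(self, key):
        try:
            self._keys.remove(key)
        except ValueError:
            raise KeyError(key) from None
        self._own.pop(key, None)

    def __iter__(self):
        # Iterate over a copy so callers may modify the mapping meanwhile.
        return iter(list(self._keys))

    def __len__(self):
        return len(self._keys)


parsed = {"Package": "foo", "Version": "1.0"}
d = OverlayDict(parsed)
d["Version"] = "2.0"     # shadows the parsed value, parsed stays untouched
d["Section"] = "utils"   # new key, appended in insertion order
del d["Package"]         # hidden, even though parsed still contains it
```

Methods such as get(), items() and the "in" operator are inherited from the mixin and all go through __getitem__ and __iter__, which is the convenience a direct dict subclass does not give for free.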
I added a somewhat lengthier README file, and a TODO file.
Cheers,
--
Adeodato Simó dato at net.com.org.es
Debian Developer adeodato at debian.org
Listening to: Mirafiori - Cinco minutos
-------------- next part --------------
# Bazaar revision bundle v0.8
#
# message:
# Support using fast apt_pkg from python-apt in iter_paragraphs().
#
# A short discussion of the implementation: __init__ now accepts a private
# parameter _parsed which, if set, is assumed to be a read-only dict-like
# object with already parsed data. If not present, the original parser
# routine in python is used (refactored to Deb822._internal_parser).
#
# The __keys member always holds the canonical list of present keys, in
# the appropriate order. Retrieving values is always attempted against
# __dict first, and if that fails, against __parsed (this is because
# modifications go to __dict, since __parsed is R/O).
#
# committer: Adeodato Simó <dato at net.com.org.es>
# date: Mon 2006-08-21 03:19:15.778000116 +0200
=== modified file deb822.py
--- deb822.py
+++ deb822.py
@@ -1,10 +1,11 @@
+# vim: fileencoding=utf-8
#
# A python interface for various rfc822-like formatted files used by Debian
# (.changes, .dsc, Packages, Sources, etc)
#
-# Written by dann frazier <dannf at dannf.org>
# Copyright (C) 2005-2006 dann frazier <dannf at dannf.org>
# Copyright (C) 2006 John Wright <john at movingsucks.org>
+# Copyright (C) 2006 Adeodato Simó <dato at net.com.org.es>
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
@@ -19,52 +20,120 @@
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
-#
+
+
+try:
+ import apt_pkg
+ _have_apt_pkg = True
+except ImportError:
+ _have_apt_pkg = False
import re
import StringIO
-
-class Deb822(dict):
- def iter_paragraphs(cls, sequence, fields=None):
- """Generator that yields an object for each paragraph in sequence."""
-
- iterable = iter(sequence)
- done_one = False
-
+import UserDict
+
+
+class Deb822(object, UserDict.DictMixin):
+
+ def __init__(self, sequence=None, fields=None, _parsed=None):
+ """Create a new Deb822 instance.
+
+ :param sequence: a string, or any object that returns a line of
+ input each time, normally a file().
+
+ :param fields: if given, it is interpreted as a list of fields that
+ should be parsed (the rest will be discarded).
+
+ :param _parsed: internal parameter.
+ """
+ self.__dict = {}
+ self.__keys = []
+ self.__parsed = None
+
+ if sequence is not None:
+ try:
+ self._internal_parser(sequence, fields)
+ except EOFError:
+ pass
+ elif _parsed is not None:
+ self.__parsed = _parsed
+ if fields is None:
+ self.__keys.extend(self.__parsed.keys())
+ else:
+ self.__keys.extend([ f for f in fields if self.__parsed.has_key(f) ])
+
+ def iter_paragraphs(cls, sequence, fields=None, shared_storage=True):
+ """Generator that yields a Deb822 object for each paragraph in sequence.
+
+ :param sequence: same as in __init__.
+
+ :param fields: likewise.
+
+ :param shared_storage: if sequence is a file(), apt_pkg will be used
+ if available to parse the file, since it's much much faster. On the
+ other hand, yielded objects will share storage, so they can't be
+ kept across iterations. (Also, PGP signatures won't be stripped
+ with apt_pkg.) Set this parameter to False to disable using apt_pkg.
+ """
+
+ # TODO Think about still using apt_pkg even if shared_storage is False,
+ # by somehow instructing the constructor to make a copy of the data. (If
+ # this is still faster.)
+
+ if _have_apt_pkg and shared_storage and isinstance(sequence, file):
+ parser = apt_pkg.ParseTagFile(sequence)
+ while parser.Step() == 1:
+ yield cls(fields=fields, _parsed=parser.Section)
+ else:
+ iterable = iter(sequence)
+ x = cls(iterable, fields)
+ while len(x) != 0:
+ yield x
+ x = cls(iterable, fields)
+
+ iter_paragraphs = classmethod(iter_paragraphs)
+
+ ###
+
+ # Methods for DictMixin
+
+ def __setitem__(self, key, value):
+ if not key in self.__keys:
+ self.__keys.append(key)
+ self.__dict[key] = value
+
+ def __getitem__(self, key):
try:
- while True:
- yield cls(iterable, fields)
- done_one = True
- except EOFError:
- if not done_one:
+ return self.__dict[key]
+ except KeyError:
+ if self.__parsed is not None and key in self.__keys:
+ return self.__parsed[key]
+ else:
raise
- iter_paragraphs = classmethod(iter_paragraphs)
-
- def __init__(self, sequence, fields=None):
- """Create a new Deb822 instance from a sequence's contents
-
- sequence may be any object that returns a line of input each time, like
- a file() or an array of strings.
-
- If not None, the fields parameter must be a list of the fields that
- should be parsed (the rest will be discarded).
-
- You can initialize a Deb822 object from another with:
-
- Deb822(one_deb822.dump().splitlines())
- """
-
+ def __delitem__(self, key):
+ if self.__dict.has_key(key):
+ del self.__dict[key]
+ try:
+ self.__keys.remove(key)
+ except ValueError:
+ raise KeyError(key)
+
+ def keys(self):
+ return list(self.__keys)
+
+ ###
+
+ def _internal_parser(self, sequence, fields=None):
single = re.compile("^(?P<key>\S+)\s*:\s*(?P<data>\S.*?)\s*$")
multi = re.compile("^(?P<key>\S+)\s*:\s*$")
multidata = re.compile("^\s(?P<data>.+?)\s*$")
wanted_field = lambda f: fields is None or f in fields
- # Storing keys here is redundant, but it allows us to keep track of the
- # original order.
- self._keys = []
-
+ if isinstance(sequence, basestring):
+ sequence = sequence.splitlines()
+
curkey = None
content = ""
for line in self.gpg_stripped_paragraph(sequence):
@@ -79,7 +148,6 @@
curkey = m.group('key')
self[curkey] = m.group('data')
- self._keys.append(curkey)
content = ""
continue
@@ -94,7 +162,6 @@
curkey = m.group('key')
self[curkey] = ""
- self._keys.append(curkey)
content = ""
continue
@@ -106,26 +173,32 @@
if curkey:
self[curkey] += content
- def __delitem__(self, key):
- dict.__delitem__(self, key)
- self._keys.remove(key)
+ ###
+
+ def __str__(self):
+ return self.dump()
+
+ def __repr__(self):
+ return '{%s}' % ', '.join(['%r: %r' % (k, v) for k, v in self.items()])
def dump(self, fd=None):
"""Dump the the contents in the original format
If fd is None, return a string.
"""
-
+
if fd is None:
fd = StringIO.StringIO()
return_string = True
else:
return_string = False
- for key in self.keys():
- fd.write(key + ": " + self[key] + "\n")
+ for key, value in self.iteritems():
+ fd.write('%s: %s\n' % (key, value))
if return_string:
return fd.getvalue()
+ ###
+
def isSingleLine(self, s):
if s.count("\n"):
return False
@@ -140,7 +213,7 @@
return s1
if not s1:
return s2
-
+
if self.isSingleLine(s1) and self.isSingleLine(s2):
## some fields are delimited by a single space, others
## a comma followed by a space. this heuristic assumes
@@ -154,7 +227,7 @@
L.sort()
prev = merged = L[0]
-
+
for item in L[1:]:
## skip duplicate entries
if item == prev:
@@ -162,7 +235,7 @@
merged = merged + delim + item
prev = item
return merged
-
+
if self.isMultiLine(s1) and self.isMultiLine(s2):
for item in s2.splitlines(True):
if item not in s1.splitlines(True):
@@ -170,7 +243,7 @@
return s1
raise ValueError
-
+
def mergeFields(self, key, d1, d2 = None):
## this method can work in two ways - abstract that away
if d2 == None:
@@ -201,6 +274,7 @@
return None
return merged
+ ###
def gpg_stripped_paragraph(sequence):
lines = []
@@ -241,20 +315,19 @@
gpg_stripped_paragraph = staticmethod(gpg_stripped_paragraph)
- def keys(self):
- # Override keys so that we can give the correct order
- other_keys = dict.keys(self)
- for key in self._keys:
- other_keys.remove(key)
- return self._keys + other_keys
+###
class _multivalued(Deb822):
"""A class with (R/W) support for multivalued fields."""
- def __init__(self, fp):
- Deb822.__init__(self, fp)
+
+ def __init__(self, *args, **kwargs):
+ Deb822.__init__(self, *args, **kwargs)
for field, fields in self._multivalued_fields.items():
- contents = self.get(field, '')
+ try:
+ contents = self[field]
+ except KeyError:
+ continue
if self.isMultiLine(contents):
self[field] = []
@@ -296,21 +369,29 @@
if return_string:
return fd.getvalue()
+
+###
+
class Dsc(_multivalued):
_multivalued_fields = {
"Files": [ "md5sum", "size", "name" ],
}
-# Sources files have the same multivalued format as dsc files
-Sources = Dsc
+
class Changes(_multivalued):
_multivalued_fields = {
"Files": [ "md5sum", "size", "section", "priority", "name" ],
}
+
class PdiffIndex(_multivalued):
_multivalued_fields = {
"SHA1-Current": [ "SHA1", "size" ],
"SHA1-History": [ "SHA1", "size", "date" ],
"SHA1-Patches": [ "SHA1", "size", "date" ],
}
+
+###
+
+Sources = Dsc
+Packages = Deb822
=== modified file test_deb822.py
--- test_deb822.py
+++ test_deb822.py
@@ -161,7 +161,7 @@
for k, v in dict_.items():
self.assertEqual(v, deb822_[k])
- self.assertEqual(0, dict.__cmp__(dict_, deb822_))
+ self.assertEqual(0, deb822_.__cmp__(dict_))
def deb822_from_format_string(self, string, dict_=PARSED_PACKAGE, cls=deb822.Deb822):
"""Construct a Deb822 object by formatting string with % dict.
@@ -213,11 +213,11 @@
self.assertWellParsed(d, PARSED_PACKAGE)
def test_parser_empty_input(self):
- self.assertRaises(EOFError, deb822.Deb822, [])
+ self.assertEqual({}, deb822.Deb822([]))
def test_iter_paragraphs_empty_input(self):
generator = deb822.Deb822.iter_paragraphs([])
- self.assertRaises(EOFError, generator.next)
+ self.assertRaises(StopIteration, generator.next)
def test_parser_limit_fields(self):
wanted_fields = [ 'Package', 'MD5sum', 'Filename', 'Description' ]
# revision id: dato at net.com.org.es-20060821011915-22cd5784f3aea2b1
# sha1: b08fbd21cc60037effde90c2fc2d44ad52105168
# inventory sha1: 3d7e2b9008bf337e9e761366f7993e0bc188be58
# parent ids:
# john at movingsucks.org-20060821005735-205500fba5663389
# base id: john at movingsucks.org-20060821005735-205500fba5663389
# properties:
# branch-nick: new_deb822
-------------- next part --------------
# Bazaar revision bundle v0.8
#
# message:
# Add top-level README and TODO files.
#
# committer: Adeodato Simó <dato at net.com.org.es>
# date: Mon 2006-08-21 03:19:46.796000004 +0200
=== added file README // file-id:readme-20060819223922-lizz4h5sh03vqgnk-1
--- /dev/null
+++ README
@@ -0,0 +1,98 @@
+deb822.py README
+================
+
+The Python deb822 module aims to provide a dict-like interface to various
+rfc822-like Debian data formats, like Packages/Sources, .changes/.dsc, pdiff
+Index files, etc. The benefit is that deb822 knows about special fields that
+contain whitespace-separated sub-fields, and provides named access to them. For
+example, the "Files" field in source packages, which has three subfields, or
+"Files" in .changes files, which has five. These are known as "multifields".
+
+deb822 has no external dependencies, but can use python-apt if available to
+parse the data, which gives a very significant performance boost when iterating
+over big Packages files.
+
+Key lookup in Deb822 objects and their multifields is case-insensitive, but the
+original case is always preserved, e.g. when printing the object. [XXX TODO]
+
+
+Classes
+=======
+
+Here is a list of the types deb822 knows about:
+
+ * Deb822 (aliases: Packages) - base class with no multifields.
+
+ * Dsc (aliases: Sources) - class to represent .dsc files / Sources paragraphs.
+
+ - Multivalued fields:
+
+ · Files: md5sum, size, name
+
+ * Changes - class to represent a .changes file
+
+ - Multivalued fields:
+
+ · Files: md5sum, size, section, priority, name
+
+ * PdiffIndex - class to represent a pdiff Index file
+
+ - Multivalued fields:
+
+ · SHA1-Current: SHA1, size
+ · SHA1-History: SHA1, size, date
+ · SHA1-Patches: SHA1, size, date
+
+
+Input
+=====
+
+Deb822 objects are normally initialized from a string, or from a file()
+object, from which at most one paragraph is read.
+
+Alternatively, any sequence that returns one line of input at a time may
+be used, e.g. an array of strings.
+
+PGP signatures, if present, will be stripped.
+
+
+Iteration
+=========
+
+All classes provide an "iter_paragraphs" class method to easily go over
+each stanza in a file with multiple entries, like a Packages.gz file.
+For example:
+
+    f = file('/mirror/debian/dists/sid/main/source/Sources')
+
+    for src in Sources.iter_paragraphs(f):
+        print src['Package'], src['Version']
+
+This method uses python-apt if available to parse the file, since it
+significantly boosts performance. The downside, though, is that yielded
+objects share storage, so they should never be kept across iterations.
+To prevent this behavior, pass a "shared_storage=False" keyword-argument
+to the iter_paragraphs() function.
+
+
+Sample usage (TODO: Improve)
+============================
+
+    import deb822
+
+    d = deb822.Dsc(file('foo_1.1.dsc'))
+    source = d['Source']
+    version = d['Version']
+
+    for f in d['Files']:
+        print 'Name:', f['name']
+        print 'Size:', f['size']
+        print 'MD5sum:', f['md5sum']
+
+    # If we like, we can change fields
+    d['Standards-Version'] = '3.7.2'
+
+    # And then dump the new contents
+    new_dsc = open('foo_1.1.dsc2', 'w')
+    d.dump(new_dsc)
+    new_dsc.close()
=== added file TODO // file-id:todo-20060819225804-v557fg5zokp18rvv-1
--- /dev/null
+++ TODO
@@ -0,0 +1,2 @@
+* Revamp the test suite.
+* Case insensitive key lookups.
# revision id: dato at net.com.org.es-20060821011946-835bbe10b744ff37
# sha1: c317326683ae657f2825a45e6893b6fb68664352
# inventory sha1: 53f39c2317c902752f8701528c06a89a693af91b
# parent ids:
# dato at net.com.org.es-20060821011915-22cd5784f3aea2b1
# base id: dato at net.com.org.es-20060821011915-22cd5784f3aea2b1
# properties:
# branch-nick: new_deb822