[Pkg-bazaar-commits] ./bzr/unstable r391: - split out notes on storing annotations in revfiles

Fri Apr 10 07:52:09 UTC 2009

------------------------------------------------------------
revno: 391
committer: Martin Pool <mbp at sourcefrog.net>
timestamp: Fri 2005-05-06 13:20:15 +1000
message:
  - split out notes on storing annotations in revfiles
added:
  doc/revfile-annotation.txt
modified:
  doc/index.txt
  doc/revfile.txt
-------------- next part --------------
=== modified file 'doc/index.txt'

--- a/doc/index.txt	2005-05-05 05:46:54 +0000
+++ b/doc/index.txt	2005-05-06 03:20:15 +0000
@@ -104,6 +104,8 @@
 
 * `Revfiles <revfile.html>`__ store the text history of files.
 
+* `Revfiles storing annotations <revfile-annotation.html>`__
+
 * `Revision syntax <revision-syntax.html>`__ -- ``hello.c at 12``, etc.
 
 * `Roll-up commits <rollup.html>`__ -- a single revision incorporates

=== added file 'doc/revfile-annotation.txt'
--- a/doc/revfile-annotation.txt	1970-01-01 00:00:00 +0000
+++ b/doc/revfile-annotation.txt	2005-05-06 03:20:15 +0000
@@ -0,0 +1,155 @@
+==============================
+Extension to store annotations
+==============================
+
+We might extend the revfile format in a future version to also store
+annotations.  *This is not implemented yet.*
+
+In previous versions, the index file identified texts by their
+SHA-1 digest.  This was unsatisfying for two reasons.  Firstly it
+assumes that SHA-1 will not collide, which is not an assumption we
+wish to make in long-lived files.  Secondly for annotations we need
+to be able to map from file versions back to a revision.
+
+Texts are identified by the name of the revfile and a UUID
+corresponding to the revision in which they were first
+introduced.  This means that given a text we can identify which
+revision it belongs to, and annotations can use the index within the
+revfile to identify where a region was first introduced.
+
+  We cannot identify texts by the integer revision number, because
+  that would limit us to only referring to a file in a particular
+  branch.
+
+  I'd like to just use the revision-id, but those are variable-length
+  strings, and I'd like the revfile index to be fixed-length and
+  relatively short.  UUIDs can be encoded in binary as only 16 bytes.
+  Perhaps we should just use UUIDs for revisions and be done?
+
+
+
+Annotations
+-----------
+
+Annotations indicate which revision of a file first inserted a line
+(or region of bytes).
+
+Given a string, we can write annotations on it like so: a sequence of
+*(index, length)* pairs, giving the *index* of the revision which
+introduced the next run of *length* bytes.  The sum of the lengths
+must equal the length of the string.  For text files the regions will
+typically fall on line breaks.  This can be transformed in memory to
+other structures, such as a list of *(index, content)* pairs.
+
+When a line was inserted from a merge revision then the annotation for
+that line should still be the source in the merged branch, rather than
+just being the revision in which the merge took place.
+
+Annotations can cheaply be calculated when inserting a new text, but are
+expensive to calculate after the fact because that requires searching
+back through all previous text and all texts which were merged in.  It
+therefore seems sensible to calculate them once and store them.
+
+To do this we need two operators which update an existing annotated
+file:
+
+A. Given an annotated file and a working text, update the annotation to
+   mark regions inserted in the working file as new in this revision.
+
+B. Given two annotated files, merge them to produce an annotated
+   result.  When there are conflicts, both texts should be included
+   and annotated.
+
+These may be repeated: after a merge there may be another merge, or
+there may be manual fixups or conflict resolutions.
+
+So what we require is a way, given a diff or a diff3 between two
+files, to map the regions of bytes changed into corresponding updates
+to the origin annotations.
+
+Annotations can also be delta-compressed; we only need to add new
+annotation data when there is a text insertion.
+
+    (It is possible in a merge to have a change of annotation when
+    there is no text change, though this seems unlikely.  This can
+    still be represented as a "pointless" delta, plus an update to the
+    annotations.)
+
+Index file
+----------
+
+In a proposed (not implemented) storage with annotations, the index
+file is a series of fixed-length records::
+
+  byte[16]     UUID of revision
+  byte[20]     SHA-1 of expanded text (as binary, not hex)
+  uint32       flags: 1=zlib compressed
+  uint32       sequence number this is based on, or -1 (0xffffffff) for full text
+  uint32       offset in text file of start
+  uint32       length of compressed delta in text file
+  uint32[3]    reserved
+
+Total 64 bytes.
+
+The header is also 64 bytes, for tidiness and easy calculation.  For
+this format the header must be ``bzr revfile v2\n`` padded with
+``\xff`` to 64 bytes.
+
+The first record after the header is index 0.  A record's base index
+must be less than its own index.
+
+The SHA-1 is redundant with the inventory but stored just as a check
+on the compression methods and so that the file can be validated
+without reference to any other information.
+
+Each byte in the text file should be included by at most one delta.
+
+
+Deltas
+------
+
+In a proposed (not implemented) storage with annotations, deltas to
+the text are stored as a series of variable-length records::
+
+  uint32        idx
+  uint32        m
+  uint32        n
+  uint32        l
+  byte[l]       new
+
+This describes a change originally introduced in the revision
+described by *idx* in the index.
+
+This indicates that the region [m:n] of the input file should be
+replaced by the text *new*.  If m==n this is a pure insertion of l
+bytes.  If l==0 this is a pure deletion of (n-m) bytes.
+
+
+
+
+
+Open issues
+-----------
+
+
+* Storing the annotations with the text is reasonably simple and
+  compact, but means that we always need to process the annotation
+  structure even when we only want the text.  In particular it means
+  that full-texts cannot just simply be copied out but rather composed
+  from chunks.  That seems inefficient since it is probably common to
+  only want the text.
+
+* Should annotations also indicate where text was deleted?
+
+* This design calls for only one annotation per line, which seems
+  standard.  However, this is lacking in at least two cases:
+
+  - Lines which originate in the same way in more than one revision,
+    through being independently introduced.  In this case we would
+    apparently have to make an arbitrary choice; I suppose branches
+    could prefer to assume lines originated in their own history.
+
+  - It might be useful to directly indicate which mergers included
+    which lines.  We do have that information in the revision history
+    though, so there seems no need to store it for every line.
+
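Note on the index layout in doc/revfile-annotation.txt above: the
64-byte header and 64-byte record could be packed and unpacked roughly
as in the sketch below.  This is only an illustration of the proposed
(not implemented) format, not bzr code.  The byte order (big-endian
here), the constant names and the helper functions are all assumptions;
the notes themselves do not specify them.

```python
import hashlib
import struct
import uuid

# Assumption: big-endian fields; the notes do not state a byte order.
# 16s = UUID, 20s = binary SHA-1, 4 x uint32, 12 pad bytes = uint32[3] reserved.
INDEX_RECORD = struct.Struct(">16s20sIIII12x")

# Header text from the notes, padded with 0xff to 64 bytes.
HEADER = b"bzr revfile v2\n".ljust(64, b"\xff")

NO_BASE = 0xFFFFFFFF   # "-1" as it would appear in a uint32 field
FLAG_ZLIB = 1          # flags: 1 = zlib compressed


def pack_record(rev_uuid, text, flags, base, offset, length):
    """Build one index record; `text` is the fully expanded text."""
    sha = hashlib.sha1(text).digest()
    return INDEX_RECORD.pack(rev_uuid.bytes, sha, flags, base, offset, length)


def unpack_record(data):
    """Split a 64-byte record back into its fields."""
    raw_uuid, sha, flags, base, offset, length = INDEX_RECORD.unpack(data)
    return uuid.UUID(bytes=raw_uuid), sha, flags, base, offset, length


def record_position(idx):
    """Byte offset in the index file of record number idx (0-based)."""
    return len(HEADER) + idx * INDEX_RECORD.size


def base_is_valid(idx, base):
    """A record is either a full text or based on an earlier record."""
    return base == NO_BASE or base < idx
```

Because both the header and each record are 64 bytes, record idx
starts at byte 64 * (idx + 1) under these assumptions.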

=== modified file 'doc/revfile.txt'
--- a/doc/revfile.txt	2005-05-06 03:13:26 +0000
+++ b/doc/revfile.txt	2005-05-06 03:20:15 +0000
@@ -128,134 +128,6 @@
 
 
 
-Extension to store annotations
-==============================
-
-We might extend the revfile format in a future version to also store
-annotations.  *This is not implemented yet.*
-
-In previous versions, the  index file identified texts by their
-SHA-1 digest.  This was unsatisfying for two reasons.  Firstly it
-assumes that SHA-1 will not collide, which is not an assumption we
-wish to make in long-lived files.  Secondly for annotations we need
-to be able to map from file versions back to a revision.
-
-Texts are identified by the name of the revfile and a UUID
-corresponding to the first revision in which they were first
-introduced.  This means that given a text we can identify which
-revision it belongs to, and annotations can use the index within the
-revfile to identify where a region was first introduced.
-
-  We cannot identify texts by the integer revision number, because
-  that would limit us to only referring to a file in a particular
-  branch.
-
-  I'd like to just use the revision-id, but those are variable-length
-  strings, and I'd like the revfile index to be fixed-length and
-  relatively short.  UUIDs can be encoded in binary as only 16 bytes.
-  Perhaps we should just use UUIDs for revisions and be done?
-
-
-
-Annotations
------------
-
-Annotations indicate which revision of a file first inserted a line
-(or region of bytes).
-
-Given a string, we can write annotations on it like so: a sequence of
-*(index, length)* pairs, giving the *index* of the revision which
-introduced the next run of *length* bytes.  The sum of the lengths
-must equal the length of the string.  For text files the regions will
-typically fall on line breaks.  This can be transformed in memory to
-other structures, such as a list of *(index, content)* pairs.
-
-When a line was inserted from a merge revision then the annotation for
-that line should still be the source in the merged branch, rather than
-just being the revision in which the merge took place.
-
-They can cheaply be calculated when inserting a new text, but are
-expensive to calculate after the fact because that requires searching
-back through all previous text and all texts which were merged in.  It
-therefore seems sensible to calculate them once and store them.
-
-To do this we need two operators which update an existing annotated
-file:
-
-A. Given an annotated file and a working text, update the annotation to
-   mark regions inserted in the working file as new in this revision.
-
-B. Given two annotated files, merge them to produce an annotated
-   result.    When there are conflicts, both texts should be included
-   and annotated.
-
-These may be repeated: after a merge there may be another merge, or
-there may be manual fixups or conflict resolutions.
-
-So what we require is given a diff or a diff3 between two files, map
-the regions of bytes changed into corresponding updates to the origin
-annotations.
-
-Annotations can also be delta-compressed; we only need to add new
-annotation data when there is a text insertion.
-
-    (It is possible in a merge to have a change of annotation when
-    there is no text change, though this seems unlikely.  This can
-    still be represented as a "pointless" delta, plus an update to the
-    annotations.)
-
-Index file
-----------
-
-In a proposed (not implemented) storage with annotations, the index
-file is a series of fixed-length records::
-
-  byte[16]     UUID of revision
-  byte[20]     SHA-1 of expanded text (as binary, not hex)
-  uint32       flags: 1=zlib compressed
-  uint32       sequence number this is based on, or -1 for full text
-  uint32       offset in text file of start
-  uint32       length of compressed delta in text file
-  uint32[3]    reserved
-
-Total 64 bytes.
-
-The header is also 64 bytes, for tidyness and easy calculation.  For
-this format the header must be ``bzr revfile v2\n`` padded with
-``\xff`` to 64 bytes.
-
-The first record after the header is index 0.  A record's base index
-must be less than its own index.
-
-The SHA-1 is redundant with the inventory but stored just as a check
-on the compression methods and so that the file can be validated
-without reference to any other information.
-
-Each byte in the text file should be included by at most one delta.
-
-
-Deltas
-------
-
-In a proposed (not implemented) storage with annotations, deltas to
-the text are stored as a series of variable-length records::
-
-  uint32        idx
-  uint32        m
-  uint32        n
-  uint32        l
-  byte[l]       new
-
-This describes a change originally introduced in the revision
-described by *idx* in the index.
-
-This indicates that the region [m:n] of the input file should be
-replaced by the text *new*.  If m==n this is a pure insertion of l
-bytes.  If l==0 this is a pure deletion of (n-m) bytes.
-
-
-
-
 
 Open issues
 ===========
@@ -273,26 +145,4 @@
   be fixed by creating the fixed repository as a separate branch, into
   which only the preserved revisions are exported.
 
-* Should annotations also indicate where text was deleted?
-
-* This design calls for only one annotation per line, which seems
-  standard.  However, this is lacking in at least two cases:
-
-  - Lines which originate in the same way in more than one revision,
-    through being independently introduced.  In this case we would
-    apparently have to make an arbitrary choice; I suppose branches
-    could prefer to assume lines originated in their own history.
-
-  - It might be useful to directly indicate which mergers included
-    which lines.  We do have that information in the revision history
-    though, so there seems no need to store it for every line.
-
 * Should we also store full-texts as a transitional step?
-
-* Storing the annotations with the text is reasonably simple and
-  compact, but means that we always need to process the annotation
-  structure even when we only want the text.  In particular it means
-  that full-texts cannot just simply be copied out but rather composed
-  from chunks.  That seems inefficient since it is probably common to
-  only want the text.
-
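Note on the delta and annotation sections moved out of revfile.txt in
this change: the sketch below shows one way a single delta record
(idx, m, n, l, new) might be parsed and applied to a text while keeping
the *(index, length)* annotation runs in step.  It is an illustration
only.  The byte order, the function names and the choice to apply one
delta at a time against the input text are assumptions, not something
the notes specify.

```python
import struct

# Assumption: big-endian uint32 fields idx, m, n, l, followed by l bytes.
DELTA_HEADER = struct.Struct(">IIII")


def parse_delta(buf, pos=0):
    """Read one delta record at pos; return (delta, position after it)."""
    idx, m, n, l = DELTA_HEADER.unpack_from(buf, pos)
    start = pos + DELTA_HEADER.size
    return (idx, m, n, buf[start:start + l]), start + l


def split_runs(runs, at):
    """Split (index, length) runs at byte offset `at`."""
    before, after = [], []
    pos = 0
    for index, length in runs:
        if pos + length <= at:
            before.append((index, length))
        elif pos >= at:
            after.append((index, length))
        else:
            # This run straddles the split point; cut it in two.
            before.append((index, at - pos))
            after.append((index, pos + length - at))
        pos += length
    return before, after


def apply_delta(text, runs, delta):
    """Replace text[m:n] with `new`, attributing the new bytes to idx."""
    idx, m, n, new = delta
    if not (0 <= m <= n <= len(text)):
        raise ValueError("delta region out of range")
    head, rest = split_runs(runs, m)
    _, tail = split_runs(rest, n - m)      # drop runs covering [m:n]
    middle = [(idx, len(new))] if new else []
    return text[:m] + new + text[n:], head + middle + tail
```

With m == n this inserts len(new) bytes, and with an empty `new` it
deletes n - m bytes.  In both cases the run lengths still sum to the
length of the resulting text, as the annotation notes require.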