[Pkg-bazaar-commits] ./bzr/unstable r391: - split out notes on storing annotations in revfiles
Martin Pool
mbp at sourcefrog.net
Fri Apr 10 07:52:09 UTC 2009
------------------------------------------------------------
revno: 391
committer: Martin Pool <mbp at sourcefrog.net>
timestamp: Fri 2005-05-06 13:20:15 +1000
message:
- split out notes on storing annotations in revfiles
added:
doc/revfile-annotation.txt
modified:
doc/index.txt
doc/revfile.txt
-------------- next part --------------
=== modified file 'doc/index.txt'
--- a/doc/index.txt 2005-05-05 05:46:54 +0000
+++ b/doc/index.txt 2005-05-06 03:20:15 +0000
@@ -104,6 +104,8 @@
* `Revfiles <revfile.html>`__ store the text history of files.
+* `Revfiles storing annotations <revfile-annotation.txt>`__
+
* `Revision syntax <revision-syntax.html>`__ -- ``hello.c at 12``, etc.
* `Roll-up commits <rollup.html>`__ -- a single revision incorporates
=== added file 'doc/revfile-annotation.txt'
--- a/doc/revfile-annotation.txt 1970-01-01 00:00:00 +0000
+++ b/doc/revfile-annotation.txt 2005-05-06 03:20:15 +0000
@@ -0,0 +1,155 @@
+==============================
+Extension to store annotations
+==============================
+
+We might extend the revfile format in a future version to also store
+annotations. *This is not implemented yet.*
+
+In previous versions, the index file identified texts by their
+SHA-1 digest. This was unsatisfying for two reasons. Firstly it
+assumes that SHA-1 will not collide, which is not an assumption we
+wish to make in long-lived files. Secondly for annotations we need
+to be able to map from file versions back to a revision.
+
+Texts are identified by the name of the revfile and a UUID
+corresponding to the first revision in which they were first
+introduced. This means that given a text we can identify which
+revision it belongs to, and annotations can use the index within the
+revfile to identify where a region was first introduced.
+
+ We cannot identify texts by the integer revision number, because
+ that would limit us to only referring to a file in a particular
+ branch.
+
+ I'd like to just use the revision-id, but those are variable-length
+ strings, and I'd like the revfile index to be fixed-length and
+ relatively short. UUIDs can be encoded in binary as only 16 bytes.
+ Perhaps we should just use UUIDs for revisions and be done?
+
+
+
+Annotations
+-----------
+
+Annotations indicate which revision of a file first inserted a line
+(or region of bytes).
+
+Given a string, we can write annotations on it like so: a sequence of
+*(index, length)* pairs, giving the *index* of the revision which
+introduced the next run of *length* bytes. The sum of the lengths
+must equal the length of the string. For text files the regions will
+typically fall on line breaks. This can be transformed in memory to
+other structures, such as a list of *(index, content)* pairs.
+
+When a line was inserted from a merge revision then the annotation for
+that line should still be the source in the merged branch, rather than
+just being the revision in which the merge took place.
+
+They can cheaply be calculated when inserting a new text, but are
+expensive to calculate after the fact because that requires searching
+back through all previous text and all texts which were merged in. It
+therefore seems sensible to calculate them once and store them.
+
+To do this we need two operators which update an existing annotated
+file:
+
+A. Given an annotated file and a working text, update the annotation to
+ mark regions inserted in the working file as new in this revision.
+
+B. Given two annotated files, merge them to produce an annotated
+ result. When there are conflicts, both texts should be included
+ and annotated.
+
+These may be repeated: after a merge there may be another merge, or
+there may be manual fixups or conflict resolutions.
+
+So what we require is given a diff or a diff3 between two files, map
+the regions of bytes changed into corresponding updates to the origin
+annotations.
+
+Annotations can also be delta-compressed; we only need to add new
+annotation data when there is a text insertion.
+
+ (It is possible in a merge to have a change of annotation when
+ there is no text change, though this seems unlikely. This can
+ still be represented as a "pointless" delta, plus an update to the
+ annotations.)
+
+Index file
+----------
+
+In a proposed (not implemented) storage with annotations, the index
+file is a series of fixed-length records::
+
+ byte[16] UUID of revision
+ byte[20] SHA-1 of expanded text (as binary, not hex)
+ uint32 flags: 1=zlib compressed
+ uint32 sequence number this is based on, or -1 for full text
+ uint32 offset in text file of start
+ uint32 length of compressed delta in text file
+ uint32[3] reserved
+
+Total 64 bytes.
+
+The header is also 64 bytes, for tidyness and easy calculation. For
+this format the header must be ``bzr revfile v2\n`` padded with
+``\xff`` to 64 bytes.
+
+The first record after the header is index 0. A record's base index
+must be less than its own index.
+
+The SHA-1 is redundant with the inventory but stored just as a check
+on the compression methods and so that the file can be validated
+without reference to any other information.
+
+Each byte in the text file should be included by at most one delta.
+
+
+Deltas
+------
+
+In a proposed (not implemented) storage with annotations, deltas to
+the text are stored as a series of variable-length records::
+
+ uint32 idx
+ uint32 m
+ uint32 n
+ uint32 l
+ byte[l] new
+
+This describes a change originally introduced in the revision
+described by *idx* in the index.
+
+This indicates that the region [m:n] of the input file should be
+replaced by the text *new*. If m==n this is a pure insertion of l
+bytes. If l==0 this is a pure deletion of (n-m) bytes.
+
+
+
+
+
+Open issues
+-----------
+
+
+* Storing the annotations with the text is reasonably simple and
+ compact, but means that we always need to process the annotation
+ structure even when we only want the text. In particular it means
+ that full-texts cannot just simply be copied out but rather composed
+ from chunks. That seems inefficient since it is probably common to
+ only want the text.
+
+* Should annotations also indicate where text was deleted?
+
+* This design calls for only one annotation per line, which seems
+ standard. However, this is lacking in at least two cases:
+
+ - Lines which originate in the same way in more than one revision,
+ through being independently introduced. In this case we would
+ apparently have to make an arbitrary choice; I suppose branches
+ could prefer to assume lines originated in their own history.
+
+ - It might be useful to directly indicate which mergers included
+ which lines. We do have that information in the revision history
+ though, so there seems no need to store it for every line.
+
=== modified file 'doc/revfile.txt'
--- a/doc/revfile.txt 2005-05-06 03:13:26 +0000
+++ b/doc/revfile.txt 2005-05-06 03:20:15 +0000
@@ -128,134 +128,6 @@
-Extension to store annotations
-==============================
-
-We might extend the revfile format in a future version to also store
-annotations. *This is not implemented yet.*
-
-In previous versions, the index file identified texts by their
-SHA-1 digest. This was unsatisfying for two reasons. Firstly it
-assumes that SHA-1 will not collide, which is not an assumption we
-wish to make in long-lived files. Secondly for annotations we need
-to be able to map from file versions back to a revision.
-
-Texts are identified by the name of the revfile and a UUID
-corresponding to the first revision in which they were first
-introduced. This means that given a text we can identify which
-revision it belongs to, and annotations can use the index within the
-revfile to identify where a region was first introduced.
-
- We cannot identify texts by the integer revision number, because
- that would limit us to only referring to a file in a particular
- branch.
-
- I'd like to just use the revision-id, but those are variable-length
- strings, and I'd like the revfile index to be fixed-length and
- relatively short. UUIDs can be encoded in binary as only 16 bytes.
- Perhaps we should just use UUIDs for revisions and be done?
-
-
-
-Annotations
------------
-
-Annotations indicate which revision of a file first inserted a line
-(or region of bytes).
-
-Given a string, we can write annotations on it like so: a sequence of
-*(index, length)* pairs, giving the *index* of the revision which
-introduced the next run of *length* bytes. The sum of the lengths
-must equal the length of the string. For text files the regions will
-typically fall on line breaks. This can be transformed in memory to
-other structures, such as a list of *(index, content)* pairs.
-
-When a line was inserted from a merge revision then the annotation for
-that line should still be the source in the merged branch, rather than
-just being the revision in which the merge took place.
-
-They can cheaply be calculated when inserting a new text, but are
-expensive to calculate after the fact because that requires searching
-back through all previous text and all texts which were merged in. It
-therefore seems sensible to calculate them once and store them.
-
-To do this we need two operators which update an existing annotated
-file:
-
-A. Given an annotated file and a working text, update the annotation to
- mark regions inserted in the working file as new in this revision.
-
-B. Given two annotated files, merge them to produce an annotated
- result. When there are conflicts, both texts should be included
- and annotated.
-
-These may be repeated: after a merge there may be another merge, or
-there may be manual fixups or conflict resolutions.
-
-So what we require is given a diff or a diff3 between two files, map
-the regions of bytes changed into corresponding updates to the origin
-annotations.
-
-Annotations can also be delta-compressed; we only need to add new
-annotation data when there is a text insertion.
-
- (It is possible in a merge to have a change of annotation when
- there is no text change, though this seems unlikely. This can
- still be represented as a "pointless" delta, plus an update to the
- annotations.)
-
-Index file
-----------
-
-In a proposed (not implemented) storage with annotations, the index
-file is a series of fixed-length records::
-
- byte[16] UUID of revision
- byte[20] SHA-1 of expanded text (as binary, not hex)
- uint32 flags: 1=zlib compressed
- uint32 sequence number this is based on, or -1 for full text
- uint32 offset in text file of start
- uint32 length of compressed delta in text file
- uint32[3] reserved
-
-Total 64 bytes.
-
-The header is also 64 bytes, for tidyness and easy calculation. For
-this format the header must be ``bzr revfile v2\n`` padded with
-``\xff`` to 64 bytes.
-
-The first record after the header is index 0. A record's base index
-must be less than its own index.
-
-The SHA-1 is redundant with the inventory but stored just as a check
-on the compression methods and so that the file can be validated
-without reference to any other information.
-
-Each byte in the text file should be included by at most one delta.
-
-
-Deltas
-------
-
-In a proposed (not implemented) storage with annotations, deltas to
-the text are stored as a series of variable-length records::
-
- uint32 idx
- uint32 m
- uint32 n
- uint32 l
- byte[l] new
-
-This describes a change originally introduced in the revision
-described by *idx* in the index.
-
-This indicates that the region [m:n] of the input file should be
-replaced by the text *new*. If m==n this is a pure insertion of l
-bytes. If l==0 this is a pure deletion of (n-m) bytes.
-
-
-
-
Open issues
===========
@@ -273,26 +145,4 @@
be fixed by creating the fixed repository as a separate branch, into
which only the preserved revisions are exported.
-* Should annotations also indicate where text was deleted?
-
-* This design calls for only one annotation per line, which seems
- standard. However, this is lacking in at least two cases:
-
- - Lines which originate in the same way in more than one revision,
- through being independently introduced. In this case we would
- apparently have to make an arbitrary choice; I suppose branches
- could prefer to assume lines originated in their own history.
-
- - It might be useful to directly indicate which mergers included
- which lines. We do have that information in the revision history
- though, so there seems no need to store it for every line.
-
* Should we also store full-texts as a transitional step?
-
-* Storing the annotations with the text is reasonably simple and
- compact, but means that we always need to process the annotation
- structure even when we only want the text. In particular it means
- that full-texts cannot just simply be copied out but rather composed
- from chunks. That seems inefficient since it is probably common to
- only want the text.
-
More information about the Pkg-bazaar-commits
mailing list