[Pkg-bazaar-commits] ./bzr/unstable r328: - more documentation of revfile+annotation

Fri Apr 10 07:51:52 UTC 2009

------------------------------------------------------------
revno: 328
committer: Martin Pool <mbp at sourcefrog.net>
timestamp: Tue 2005-05-03 12:39:45 +1000
message:
  - more documentation of revfile+annotation
modified:
  bzrlib/revfile.py
  doc/revfile.txt
=== modified file 'bzrlib/revfile.py'

--- a/bzrlib/revfile.py	2005-04-15 01:31:21 +0000
+++ b/bzrlib/revfile.py	2005-05-03 02:39:45 +0000
@@ -52,22 +52,6 @@
 is that sequence numbers are stable references.  But not every
 repository in the world will assign the same sequence numbers,
 therefore the SHA-1 is the only universally unique reference.
-
-This is meant to scale to hold 100,000 revisions of a single file, by
-which time the index file will be ~4.8MB and a bit big to read
-sequentially.
-
-Some of the reserved fields could be used to implement a (semi?)
-balanced tree indexed by SHA1 so we can much more efficiently find the
-index associated with a particular hash.  For 100,000 revs we would be
-able to find it in about 17 random reads, which is not too bad.
-
-This performs pretty well except when trying to calculate deltas of
-really large files.  For that the main thing would be to plug in
-something faster than difflib, which is after all pure Python.
-Another approach is to just store the gzipped full text of big files,
-though perhaps that's too perverse?
-
 The iter method here will generally read through the whole index file
 in one go.  With readahead in the kernel and python/libc (typically
 128kB) this means that there should be no seeks and often only one

=== modified file 'doc/revfile.txt'
--- a/doc/revfile.txt	2005-05-03 01:40:58 +0000
+++ b/doc/revfile.txt	2005-05-03 02:39:45 +0000
@@ -160,6 +160,75 @@
 the regions of bytes changed into corresponding updates to the origin
 annotations.
 
+Annotations can also be delta-compressed; we only need to add new
+annotation data when there is a text insertion.
+
+    (It is possible in a merge to have a change of annotation when
+    there is no text change, though this seems unlikely.  This can
+    still be represented as a "pointless" delta, plus an update to the
+    annotations.)
+
+
+
+Tools
+-----
+
+The revfile module can be invoked as a program to give low-level
+access for data recovery, debugging, etc.
+
+
+
+Format
+======
+
+Index file
+----------
+
+The index file is a series of fixed-length records::
+
+  byte[16]     UUID of revision
+  byte[20]     SHA-1 of expanded text (as binary, not hex)
+  uint32       flags: 1=zlib compressed
+  uint32       sequence number this is based on, or 0xFFFFFFFF (-1) for full text
+  uint32       offset in text file of start
+  uint32       length of compressed delta in text file
+  uint32[3]    reserved
+
+Total 64 bytes.
+
+The header is also 64 bytes, for tidiness and easy calculation.  For
+this format the header must be ``bzr revfile v2\n`` padded with
+``\xff`` to 64 bytes.
+
+The first record after the header is index 0.  A record's base index
+must be less than its own index.
+
+The SHA-1 is redundant with the inventory but stored just as a check
+on the compression methods and so that the file can be validated
+without reference to any other information.
+
+Each byte in the text file should be included by at most one delta.
+
+
+Deltas
+------
+
+Deltas to the text are stored as a series of variable-length records::
+
+  uint32        idx
+  uint32        m
+  uint32        n
+  uint32        l
+  byte[l]       new
+
+Each record describes a change originally introduced in the revision
+described by *idx* in the index.
+
+It indicates that the region [m:n] of the input file should be
+replaced by the l bytes of text *new*.  If m==n this is a pure
+insertion of l bytes.  If l==0 this is a pure deletion of (n-m) bytes.
+
+
 
 Open issues
 ===========
@@ -190,3 +259,12 @@
   - It might be useful to directly indicate which mergers included
     which lines.  We do have that information in the revision history
     though, so there seems no need to store it for every line.
+
+* Should we also store full-texts as a transitional step?
+
+* Storing the annotations with the text is reasonably simple and
+  compact, but means that we always need to process the annotation
+  structure even when we only want the text.  In particular it means
+  that full-texts cannot simply be copied out but must instead be
+  composed from chunks.  That seems inefficient, since wanting only
+  the text is probably the common case.
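
The index and delta layouts documented in doc/revfile.txt above can be sketched in Python roughly as follows. This is an illustration only and is not code from this commit or from bzrlib; every name in it is hypothetical. It also assumes several things the patch does not state: fields are big-endian with no padding, reserved fields are written as zero, "-1" is stored as 0xFFFFFFFF, and all delta offsets refer to the base text with records sorted by m and not overlapping.

```python
import hashlib
import struct

HEADER_SIZE = 64
# Magic string padded with 0xff to 64 bytes, as described in the doc.
HEADER = b"bzr revfile v2\n".ljust(HEADER_SIZE, b"\xff")

# Assumption: big-endian, unpadded.  16 + 20 + 4*4 + 12 reserved = 64 bytes.
INDEX_RECORD = struct.Struct(">16s20sIIII12x")
DELTA_HEADER = struct.Struct(">IIII")   # idx, m, n, l

NO_BASE = 0xFFFFFFFF   # assumption: "-1 for full text" as an unsigned value
FLAG_ZLIB = 1


def pack_index_record(uuid, sha1, flags, base, offset, length):
    """Build one 64-byte index record; reserved bytes are zero-filled."""
    return INDEX_RECORD.pack(uuid, sha1, flags, base, offset, length)


def read_index_record(index_bytes, i):
    """Return record i (0 is the first record after the header) as a dict."""
    if index_bytes[:HEADER_SIZE] != HEADER:
        raise ValueError("bad revfile index header")
    start = HEADER_SIZE + i * INDEX_RECORD.size
    uuid, sha1, flags, base, offset, length = INDEX_RECORD.unpack_from(
        index_bytes, start)
    # A record's base index must be less than its own index.
    if base != NO_BASE and base >= i:
        raise ValueError("record %d has invalid base %d" % (i, base))
    return {"uuid": uuid, "sha1": sha1, "flags": flags,
            "base": None if base == NO_BASE else base,
            "offset": offset, "length": length}


def parse_deltas(data):
    """Yield (idx, m, n, new) tuples from a run of delta records."""
    pos = 0
    while pos < len(data):
        idx, m, n, l = DELTA_HEADER.unpack_from(data, pos)
        pos += DELTA_HEADER.size
        new = data[pos:pos + l]
        if len(new) != l:
            raise ValueError("truncated delta record")
        pos += l
        yield idx, m, n, new


def apply_deltas(base_text, records):
    """Replace each region [m:n] of base_text with the record's new bytes."""
    out = []
    cursor = 0
    for idx, m, n, new in records:
        if not (cursor <= m <= n <= len(base_text)):
            raise ValueError("delta regions must be sorted and in range")
        out.append(base_text[cursor:m])   # unchanged bytes before the region
        out.append(new)                   # empty for a pure deletion
        cursor = n                        # m == n means a pure insertion
    out.append(base_text[cursor:])
    return b"".join(out)


def text_matches(text, record):
    """Check expanded text against the binary SHA-1 stored in the index."""
    return hashlib.sha1(text).digest() == record["sha1"]
```

Because every index record is a fixed 64 bytes, record i can be located by arithmetic alone (64 + 64*i), which is presumably what "easy calculation" refers to. The idx field of each delta record is passed through untouched here; using it to build annotations is left out of the sketch.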