[Pkg-bazaar-commits] ./bzr/unstable r390: - Update revfile docs; most of what's in there is speculative

Fri Apr 10 07:52:09 UTC 2009

------------------------------------------------------------
revno: 390
committer: Martin Pool <mbp at sourcefrog.net>
timestamp: Fri 2005-05-06 13:13:26 +1000
message:
  - Update revfile docs; most of what's in there is speculative
    about storage with annotations.
modified:
  doc/revfile.txt
-------------- next part --------------
=== modified file 'doc/revfile.txt'

--- a/doc/revfile.txt	2005-05-03 07:48:16 +0000
+++ b/doc/revfile.txt	2005-05-06 03:13:26 +0000
@@ -5,6 +5,9 @@
 The unit for compressed storage in bzr is a *revfile*, whose design
 was suggested by Matt Mackall.
 
+This document describes version 1 of the file, and has some notes on
+what might be done in version 2.
+
 
 Requirements
 ============
@@ -22,6 +25,9 @@
 
 * Storage of files of at least a few hundred MB.
 
+* Lossless in useful ways: we can extract a series of texts and write
+  them back out without losing any information.
+
 
 Design
 ======
@@ -74,26 +80,13 @@
 the data file is much longer and only the relevant bits of it,
 identified by the index file, need to be read.
 
-  In previous versions, the  index file identified texts by their
-  SHA-1 digest.  This was unsatisfying for two reasons.  Firstly it
-  assumes that SHA-1 will not collide, which is not an assumption we
-  wish to make in long-lived files.  Secondly for annotations we need
-  to be able to map from file versions back to a revision.
-
-Texts are identified by the name of the revfile and a UUID
-corresponding to the first revision in which they were first
-introduced.  This means that given a text we can identify which
-revision it belongs to, and annotations can use the index within the
-revfile to identify where a region was first introduced.
-
-  We cannot identify texts by the integer revision number, because
-  that would limit us to only referring to a file in a particular
-  branch.
-
-  I'd like to just use the revision-id, but those are variable-length
-  strings, and I'd like the revfile index to be fixed-length and
-  relatively short.  UUIDs can be encoded in binary as only 16 bytes.
-  Perhaps we should just use UUIDs for revisions and be done?
+  This design is similar to that of Netscape `mail summary files`_, in
+  that there is a small index which can always be read into memory and
+  that quickly identifies where to look in the main file.  They differ
+  in many other ways though, most particularly that the index is not
+  just a cache but holds precious data in its own right.
+
+.. _`mail summary files`: http://www.jwz.org/doc/mailsum.html
 
 This is meant to scale to hold 100,000 revisions of a single file, by
 which time the index file will be ~4.8MB and a bit big to read
@@ -102,7 +95,9 @@
 Some of the reserved fields could be used to implement a (semi?)
 balanced tree indexed by SHA1 so we can much more efficiently find the
 index associated with a particular hash.  For 100,000 revs we would be
-able to find it in about 17 random reads, which is not too bad.
+able to find it in about 17 random reads, which is not too bad.  On
+the other hand, that would compromise the append-only indexing, and
+100,000 revs is a fairly extreme case.
 
 This performs pretty well except when trying to calculate deltas of
 really large files.  For that the main thing would be to plug in
@@ -111,6 +106,10 @@
 though perhaps that's too perverse?
 
 
+Identifying texts
+-----------------
+
+In the current version, texts are identified by their SHA-1 digest.
 
 
 Skip-deltas
@@ -121,6 +120,43 @@
 too many deltas to reproduce a particular file.  
 
 
+Tools
+-----
+
+The revfile module can be invoked as a program to give low-level
+access for data recovery, debugging, etc.
+
+
+
+Extension to store annotations
+==============================
+
+We might extend the revfile format in a future version to also store
+annotations.  *This is not implemented yet.*
+
+In previous versions, the index file identified texts by their
+SHA-1 digest.  This was unsatisfying for two reasons.  Firstly it
+assumes that SHA-1 will not collide, which is not an assumption we
+wish to make in long-lived files.  Secondly for annotations we need
+to be able to map from file versions back to a revision.
+
+Texts are identified by the name of the revfile and a UUID
+corresponding to the revision in which they were first
+introduced.  This means that given a text we can identify which
+revision it belongs to, and annotations can use the index within the
+revfile to identify where a region was first introduced.
+
+  We cannot identify texts by the integer revision number, because
+  that would limit us to only referring to a file in a particular
+  branch.
+
+  I'd like to just use the revision-id, but those are variable-length
+  strings, and I'd like the revfile index to be fixed-length and
+  relatively short.  UUIDs can be encoded in binary as only 16 bytes.
+  Perhaps we should just use UUIDs for revisions and be done?
+
+
+
 Annotations
 -----------
 
@@ -168,23 +204,11 @@
     still be represented as a "pointless" delta, plus an update to the
     annotations.)
 
-
-
-Tools
------
-
-The revfile module can be invoked as a program to give low-level
-access for data recovery, debugging, etc.
-
-
-
-Format
-======
-
 Index file
 ----------
 
-The index file is a series of fixed-length records::
+In a proposed (not implemented) storage with annotations, the index
+file is a series of fixed-length records::
 
   byte[16]     UUID of revision
   byte[20]     SHA-1 of expanded text (as binary, not hex)
@@ -213,7 +237,8 @@
 Deltas
 ------
 
-Deltas to the text are stored as a series of variable-length records::
+In a proposed (not implemented) storage with annotations, deltas to
+the text are stored as a series of variable-length records::
 
   uint32        idx
   uint32        m
@@ -230,6 +255,8 @@
 
 
 
+
+
 Open issues
 ===========