[build-path-prefix-map-spec] 33/50: Update spec wording and move some stuff to the wiki

Fri Mar 10 15:17:21 UTC 2017

This is an automated email from the git hooks/post-receive script.

infinity0 pushed a commit to branch master
in repository build-path-prefix-map-spec.

commit 0c806d8e64bb36eaec83c06eab072c17edb3fb9a
Author: Ximin Luo <infinity0 at debian.org>
Date:   Wed Feb 22 21:11:41 2017 +0100

    Update spec wording and move some stuff to the wiki
---
 spec-draft.rst | 237 +++++++++++++++++++++++++--------------------------------
 1 file changed, 102 insertions(+), 135 deletions(-)

diff --git a/spec-draft.rst b/spec-draft.rst
index 821d491..9fa383d 100644
--- a/spec-draft.rst
+++ b/spec-draft.rst
@@ -1,31 +1,45 @@
-TODO: define producers, consumer
-
-Encoding specification
-======================
-
-This describes an environment variable that encodes a list-of-pairs where each
-pair holds two strings. We'll use the terms "left"- and "right end of the
-list", to respectively refer to the parts of the structure (or related), that
-was originally parsed from the left (start) and right (end) ends of the value.
-
-Generally the data types of environment variables are platform-dependent, but
-we'll assume here that for each platform these are the same for environment
-variable values as it is for file paths. On POSIX systems, they are strings of
-octets (bytes), and on Windows they are strings of 16-bit wide words
-(``wchar_t``) which may be valid or invalid UTF-16.
-
-Since our encoding only deals with ASCII-compatible characters, and UTF-16 uses
-surrogate pairs to encode code points not in the BMP, it should be possible to
-implement our encoding by "naively" operating on string units, regardless of
-whether a unit is an 8-bit octet (e.g. POSIX C), 16-bit wchar_t (e.g. Windows
-C++), or an actual decoded Unicode code point (e.g. Python 3). However in
-practise, this is only possible when your language provides APIs that do not
-attempt to automatically decode environment variables or filesystem paths, or
-does this in a reversible (non-standard) way. See the "language-specific
-implementation notes" section below for details and examples.
-
-For more details on the Windows situation, see the "Windows system strings"
-section further below.
+Specification
+=============
+
+This specification describes an environment variable ``BUILD_PATH_PREFIX_MAP``
+which may be used by build tools to generate reproducible output that does not
+include any paths that are dependent on the build-time filesystem layout.
+
+A *producer* is a program that knows how to determine appropriate values for
+this environment variable, such as a top-level distribution package builder,
+and which can then pass these values to child processes that consume them.
+
+A *consumer* is a program that relies on appropriate values for this variable
+to be set by a higher level build tool, and which then can generate output that
+is reproducible, independent of the filesystem layout of the build machine.
+
+The actual value of this environment variable MUST NOT be saved into any output
+meant to form part of a reproducible binary artefact.
+
+
+Encoding and decoding the variable
+----------------------------------
+
+This section describes a data structure encoding, from a list-of-pairs where
+each pair holds two strings, into a single string.
+
+We use the phrases "left end" and "right end" of the list to refer,
+respectively, to the parts of the list that correspond to the left (start) and
+right (end) ends of the string that it was parsed from, and vice versa.
+
+On POSIX systems these strings are sequences of 8-bit bytes. On Windows
+systems these strings are sequences of 16-bit ``wchar_t`` words. On both
+platforms, these string types are the types of both filesystem paths and
+environment variables on that platform.
+
+When implementing this data structure encoding, either (a) you MUST directly
+operate on the string types described above *without* also decoding or encoding
+them using a character encoding (e.g. UTF-8 or UTF-16); or (b) if you must use
+a character encoding e.g. because your language's standard libraries force you
+to, then you MUST ensure that the overall encode+decode and decode+encode
+operations always exactly preserve the original structure or value, even if it
+contains data that was invalid for the character encoding that was used. See
+[TODO link] for further details and guidance on how to do this.
 
 The encoding is as follows:
 
@@ -97,11 +111,11 @@ The encoding is as follows:
      we don't anticipate this to be a major use-case
 
 
-Setting the variable
-====================
+Setting the encoded value
+-------------------------
 
 Producers SHOULD NOT overwrite existing values; instead they should append
-their mappings onto the right of any existing value.
+their new mappings onto the right of any existing value.
 
 Producers who build *general software* that uses this envvar, MUST NOT expect
 any special contracts on the output emitted by *general consumers* based on
@@ -111,27 +125,25 @@ changes and the value of this envvar is changed to match the new paths.
 On the other hand, if you know you will only support a limited set of
 consumers, you may expect that they apply these mappings in specific ways.
 
-(See also the definition in the next part.)
+See also the requirements for consumers in the next part for guidance.
 
 
-Applying the variable
-=====================
+Applying the decoded structure
+------------------------------
 
-Consumers MUST ensure that, at minimum: for all ("source", "target") prefix
+Consumers MUST ensure that, at minimum: for all (*source*, *target*) prefix
 pairs in the parsed list, with rightmost pairs taking priority: strings in the
 final build output, that represent build-time paths derived from "source",
 instead appear to represent potential run-time paths derived from "target".
 
-(As a corollary, consumers MUST NOT require producers append a directory
-separator to a source prefix, to define mappings related to that directory.)
-
-Implementation notes:
+As a consequence, consumers MUST apply mappings as above, regardless of whether
+the *source* prefix ends with a directory separator or not.
 
-This definition specifically does not define "derived from", since this may be
-different for different consumers (languages, buildsystems, etc), and a more
-specific definition might conflict with their idea of what that means.
+We do not define "derived from" more specifically, since this may be different
+for different consumers (languages, buildsystems, etc), and a more specific
+definition might conflict with their idea of what that means.
 
-In practice, we recommend one of the following algorithms:
+Consumers SHOULD implement one of the following algorithms:
 
 1. For each (source, target) prefix pair in the list-of-pairs, going from right
    to left: if the subject path starts with the source prefix, then replace
@@ -141,121 +153,76 @@ In practice, we recommend one of the following algorithms:
 2. As in (1) but with "starts with" replaced by "starts with, restricted to
    whole-path components". So for example,
 
-   ``/path/to/a/b/c`` "starts with" ``/path/to/a``
-   ``/path/to/aa/b/c`` does not "start with" ``/path/to/a``
+   - ``/path/to/a/b/c`` "starts with" ``/path/to/a``
+   - ``/path/to/aa/b/c`` does not "start with" ``/path/to/a``
 
-   (This has more robust semantics but is slightly more complex to implement.)
+   This has more robust semantics but is slightly more complex to implement.
 
 
-Language-specific implementation notes
-======================================
+Test vectors
+============
 
-Some high-level languages do not provide easy direct access to the underlying
-environment variable value, in the string-type of the platform.
+TODO
 
-For example, on Python 3, ``os.getenv`` and the path functions normally return
-a unicode string (where each unit is a decoded Unicode code point), unless you
-specifically use ``os.getenvb`` instead or give "bytes"-type path arguments.
 
-Luckily on Python 3.3+ one can implement our encoding without duplicating code,
-in a cross-platform way. Yes, paths and environment variables are presented as
-(unencoded) Unicode strings. However on POSIX where the underlying OS values
-are bytes, values which cannot be UTF-8 decoded to valid Unicode are instead
-decoded (by default) into a lone "low surrogate" character (Python calls this
-the "surrogateescope" encoding) which is not present in "normal" Unicode. The
-resulting string, when UTF-8 encoded back into bytes, preserves the original
-byte value - which is invalid UTF-8 but that doesn't matter to a POSIX OS.
-Therefore, it is correct to implement a "naive" algorithm that operates on
-Python unicode strings even when the OS type is bytes, and the benefit is that
-the same code will also work on Windows.
+External links
+==============
 
-This type of "accidentally-correct" situation may not be true for all languages
-however, so you should understand these issues carefully and check it.
+Detailed implementation notes and advice are available at
+`<https://wiki.debian.org/ReproducibleBuilds/BuildPathProposal>`_.
 
-For example, in Rust the ``OsString`` type is platform-dependent and opaque;
-one must write platform-specific code to either convert this to an array of
-[u8] (for POSIX) or an array of [u16] (for Windows). In the latter case, u16
-units that are invalid UTF-16  are represented internally as WTF-8 [3]_, but
-this is only an implementation detail not exposed to Rust stdlib API users.
+Example source code is available on the above page, as well as in runnable form
+on `<https://github.com/infinity0/rb-prefix-map>`_. FIXME use alioth link
 
-.. [3] https://simonsapin.github.io/wtf-8/
 
-For example, in NodeJS (as of v4.6.1), non-UTF-8 bytes in environment variables
-are *not supported* - they will get replaced by U+FFFD instead. Best to file a
-bug against them, if you need to map non-UTF-8 paths.
+References
+==========
 
-Our testcases/ includes a non-UTF-8 case, so you can test how to make this work
-(or not) in your favourite language. (Unfortunately, we do not yet have invalid
-UTF-16 test cases for windows.)
+POSIX system strings
+--------------------
 
+References:
 
-Windows system strings
-======================
+- `Definitions (no HTTPS)
+  <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html>`_
+  - see "Pathname", "String" and "Byte".
 
-Windows environment variables and filesystem paths are "supposed" to be UTF-16
-and they are commonly advertised as "UTF-16", however the kernel does not check
-the validity of the 16-bit strings passed to it and only *some* system APIs
-enforce it. In other words, it is UCS-2, but this term is deprecated. [4]_
+- `Environment Variables (no HTTPS)
+  <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html>`_
+  for the type of ``environ``.
 
-So in practice, user code should not assume that the strings are valid UTF-16,
-and should be able to deal with invalid UTF-16 strings. The easiest way to do
-this, is to treat these things as opaque 16-bit sequences with no encoding.
+- `limits.h - implementation-defined constants (no HTTPS)
+  <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/limits.h.html>`_
+  for the definition of ``CHAR_BIT``.
 
-See [test code TODO link] for some examples.
+Windows system strings
+----------------------
 
-.. [4] http://unicode.org/faq/utf_bom.html#utf16-1 no HTTPS unfortunately
+Windows strings are commonly advertised as "UTF-16", however for environment
+variable values and filesystem paths the system APIs do not enforce validity of
+the 16-bit strings passed to them. In other words, it is UCS-2, but this term `is
+deprecated (no HTTPS) <http://unicode.org/faq/utf_bom.html#utf16-1>`_.
+
+So in practice, user code should not assume that these system strings are valid
+UTF-16, and should be able to deal with invalid UTF-16 strings. The easiest way
+to do this is to treat them as opaque 16-bit sequences with no encoding.
 
 References:
 
+- `File Management > About File Management > Creating, Deleting, and Maintaining Files
+  <https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx>`_
+
 - `Visual C++ / Documentation / C Runtime Library / [..] / CRT Alphabetical
   Function Reference / getenv_s, _wgetenv_s
   <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/getenv-s-wgetenv-s>`_
 
+- `... > C Language Reference > ... > C Identifiers > Multibyte and Wide Characters
+  <https://msdn.microsoft.com/en-us/library/z207t55f.aspx>`_ Note that what
+  Microsoft calls "Wide Characters" and "Unicode" is actually valid-or-invalid
+  UTF-16 as described above, *not* decoded Unicode code points.
+
 - `Unicode and Character Sets > About Unicode and Character Sets > Character Sets
   <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx>`_
-
-- `File Management > About File Management > Creating, Deleting, and Maintaining Files
-  <https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx>`_
-
-- `path: Windows paths may contain non-utf8-representable sequences #12056
-  <https://github.com/rust-lang/rust/issues/12056>`_
-
-
-Transmitting these values
-=========================
-
-Our encoding only transforms sequences of printable ASCII characters. If you
-have reason to believe that you need to escape or encode your file paths (e.g.
-because they contain non-printable or non-ASCII characters) before transmitting
-it across your chosen medium, it should suffice to simply apply the same escape
-or encoding mechanism to this environment variable as well. This is an entirely
-separate concern from anything else mentioned in this document, and the code to
-do this should be clearly separated from code that implements this document.
-
-
-Rejected options
-================
-
-- Simple-split using semi-common characters like ':', because it loses the
-  ability to map paths containing those characters.
-
-- Simple-split using never-in-path characters like '\t' or '0x1E RECORD
-  SEPARATOR', because they make the output unconditionally non-printable.
-
-- Any variant of backslash-escape, because it's not clean to implement in
-  high-level languages. (Need to use regex or an explicit loop.)
-
-- Any variant of hex-encoding, because different languages decode hex codes
-  >127 in different ways, when inserting it back into a string.
-
-- Any variant of url-encoding: as for hex-encoding, and additionally because
-  the original perceived gain (being able to use existing decoders) did not
-  work out in the end:
-
-  - Extra characters like "+" ";" need to be encoded.
-
-  - Decoders in many languages only decode to a { key → value list }; there is
-    no way to turn this into a list-of-pairs preserving the original ordering.
-
-- Mapping % into %% (or \ into \\, etc), because this causes differences when
-  decoding sequences like "%%+" via different strategies.
+  This often-cited page is not actually relevant to filesystem paths or
+  environment variable values, and instead refers to how Windows
+  applications deal with userland, not system, character data.

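Illustration (not part of the commit): the two prefix-matching algorithms
listed under "Applying the decoded structure" could be sketched in Python
roughly as follows. The names map_prefix, whole_components and sep are
hypothetical and do not come from the spec. The end of step 1 is cut off by
the hunk boundary in the diff above, so this sketch assumes that the matched
source prefix is replaced by the target and that the first (rightmost) match
wins. It operates on an already-decoded list of (source, target) pairs and
does not implement the variable's encoding.

```python
def map_prefix(path, pairs, whole_components=False, sep="/"):
    """Apply a decoded list of (source, target) pairs to a path.

    Hypothetical helper for illustration only. Pairs are tried from right
    to left, so rightmost pairs take priority.
    """
    for source, target in reversed(pairs):
        if not path.startswith(source):
            continue
        rest = path[len(source):]
        if (whole_components and rest
                and not source.endswith(sep)
                and not rest.startswith(sep)):
            # Algorithm 2: the match must end on a path component boundary,
            # so /path/to/aa/b/c does not "start with" /path/to/a.
            continue
        # Assumption: replace the source prefix and stop at the first match.
        return target + rest
    return path  # no mapping applies; leave the path unchanged


pairs = [("/path/to", "/usr/src"), ("/path/to/a", "/opt/a")]
print(map_prefix("/path/to/a/b/c", pairs))
print(map_prefix("/path/to/aa/b/c", pairs))
print(map_prefix("/path/to/aa/b/c", pairs, whole_components=True))
```

With whole_components=False this corresponds to algorithm 1, so the second
call also matches /path/to/a as a plain string prefix. With
whole_components=True that pair is skipped and the next pair to the left,
/path/to, is used instead. A source prefix that ends with a directory
separator is accepted in both modes, in line with the requirement that
consumers apply mappings whether or not the source ends with a separator.
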
-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/reproducible/build-path-prefix-map-spec.git