Data Description
================

PMM is primarily for describing metadata associated with datasets.
However, it is also useful for it to be able to describe how data
is stored. The data section allows this.

Optional key: ``flatfile``
--------------------------

This describes data stored in a flat file (often referred to generically
as "CSV" files, originally standing for *comma-separated file*, but now
sometimes used even for files with other separators).

name
++++

The name describes where to find the data on disk.
It can be any string, and can be a full path
(e.g. ``"/usr/local/data/hillstrom.txt"``) or a relative path
(such as (``"hillstrom.txt"``)).

format
++++++

This describes information about how the data is stored in the flat file.

There are three mandatory keys:

    * ``encoding`` (string): ``"UTF-8"`` is recommended.
    * ``separator`` (string): Usually a single character, often ``","``.
    * ``headerrowcount`` (integer): number of rows in the flat file before the data (most commonly containing field names).

There are also optional keys:

    * ``quote`` (string): Usually empty (``""``),
       a single quote (``"'"``) or a double quote (``"\""``).
       There is no provision for multiple quote characters at this stage.
       It is recommended that flat-file readers based on PMM interpret
       always accept unquoted strings even if a quote character is present.

    * ``escape`` (string): Usually backslash (``"\\"``). The escape character
      is used primarily to all the quote character and itself to be
      included in strings by preceding them with the escape character.
      If set to the empty string, there is no escaping.

      Regardless of the setting of escape, flat file readers may choose to
      accept escape sequences for special characters, especially ``\n``
      for newline.

    * ``nullmarker`` (string): This is used to describe how nulls (missing
      values) are listed in the file. If not specifed at all, null values
      are not allowed in the file. Often set to the empty string (``""``)
      or a marker such as ``NULL``.

    * ``dateformat`` (string): This describes the datestamp format used
      in the file. When writing data, it is normally preferable to
      use the same format for all datestamp fields. However, individual
      fields can also specify a date format, which will override this
      overall setting. This is necessary if a file already contains dates
      in mixed formats.

Future extensions
+++++++++++++++++

  * End-of-line nulls (cf. excel)

  * Bad encoding handling?

  * Byte-order markers

  * Strictness specifiers

  * Literal newlines in files

