~bitfehler/m2dir

This document describes the m2dir format for storing a collection of emails on disk. For more information about m2dir, see the project page.

#Status of this document

This specification is considered a draft. Changes, even breaking ones, are possible if feedback from actual implementations indicates they are necessary. This status will be updated accordingly once it stabilizes.

#Goals

M2dir provides a standardized way to store a collection of email messages as files. It is similar to Maildir/Maildir++, but aims to be simpler and more thoroughly specified.

Its goal is to support both synchronization with other hierarchical remote mail stores (e.g. an IMAP account or another m2dir on a remote host) and delivery of new messages (e.g. SMTP delivery or system notifications).

M2dir only specifies the storage mechanism. Any indexing of messages (for their mapping to remote messages, full-text search, etc.) is left to applications.

#Terminology

The name of this specification is m2dir. It mainly defines two things:

  • The m2dir format for a single directory that contains a collection of emails, without any further context
  • The m2store directory layout, which specifies how a collection of m2dirs is organized to make up a hierarchical mail store

#Overview

The m2dir format has the following defining features:

  • Each message is a file, with a static name
  • A simple, standardized directory hierarchy for synchronization from or to hierarchical remote mail storage
  • A human-centric filename part to facilitate usage of standard tools for searching or managing emails
  • Supports arbitrary message flags

#Directory structure

An m2dir-compatible directory structure consists of a root directory (called the m2store root) and any number of folders (simply called m2dirs).

Such a directory structure in its entirety is called an m2store.

When synchronizing an m2store with remote mail storage, the folders must accurately reflect the remote's hierarchy, nested according to the remote's hierarchy delimiter. Specifically, this implies that an m2store mirroring an IMAP account must not contain any emails in its m2store root. Instead, the root will contain an m2dir INBOX.

The specification does not preclude an m2store root from also being an m2dir. However, at the current version of the specification, applications are strongly recommended to avoid such a setup.

The only restriction is that a folder name must not start with a period (.) and any directory starting with a period must be ignored by m2dir-compliant applications.

An m2store root must contain an empty marker file .m2store to enable discovery by other m2dir-compatible applications.

The .m2store marker file must be empty. However, applications should merely check for the file's presence. Future versions of the spec may use the marker file's content, e.g. to indicate support for a revised version of the spec.
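
To illustrate the rules above, folder discovery in an m2store could be sketched in Python as follows. The function name list_m2dirs is hypothetical. The sketch assumes that an m2dir is recognized by its .m2dir marker file (see M2dir content), and it does not report the root itself even if the root is also an m2dir.

```python
import os

def list_m2dirs(root):
    """Return the relative paths of all m2dirs below an m2store root.

    Hypothetical helper, not part of the specification.
    """
    if not os.path.exists(os.path.join(root, ".m2store")):
        raise ValueError(f"{root} is not an m2store root (no .m2store marker)")
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Directories starting with a period must be ignored. Pruning
        # dirnames in place keeps os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        # Only the presence of the marker is checked, not its content.
        if dirpath != root and ".m2dir" in filenames:
            found.append(os.path.relpath(dirpath, root))
    return found
```

For the example layout shown under Example directory layout, this would be expected to report INBOX, Lists/srht-dev, Lists/srht-discuss, Sent and Work, but not Lists itself, which carries no marker.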

#Folder names

The only constraints imposed on folder names by the m2dir specification are that they must be representable as a valid UTF-8 string, must not be empty, and must not start with a dot (.).

However, further constraints may be imposed by the underlying filesystem and/or operating system. In such circumstances, an application creating a folder may choose to perform percent-encoding of certain characters, as described in RFC 3986, section 2. In the name of legibility of directory names on the filesystem, applications should be conservative in their choice of characters to encode.

Due to the above rule, a percent sign (%) in a folder name must always be percent-encoded (%25).

When creating a folder, an application may choose to throw an error instead, if the underlying filesystem does not accept a folder name. However, if an application chooses to do any kind of encoding, it must be percent-encoding. All applications performing synchronization to any kind of remote mail store must support percent-encoded folder names.

As stated in the RFC,

For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.
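
A minimal Python sketch of such an encoding is shown below. The function names and the default set of encoded characters (only the path separator /) are assumptions made for illustration; the specification leaves the choice of characters to the application. The percent sign is always encoded, and hexadecimal digits are uppercase.

```python
from urllib.parse import unquote

def encode_folder_name(name, unsafe="/"):
    """Percent-encode only the characters in `unsafe`, plus '%' itself.

    Hypothetical helper; `unsafe` is an application choice.
    """
    if not name or name.startswith("."):
        raise ValueError("folder names must be non-empty and must not start with a dot")
    to_encode = set(unsafe) | {"%"}
    return "".join(
        # Encode each UTF-8 byte of the character as %XX with uppercase hex.
        "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        if ch in to_encode else ch
        for ch in name
    )

def decode_folder_name(encoded):
    """Reverse the encoding; invalid UTF-8 raises UnicodeDecodeError."""
    return unquote(encoded, errors="strict")
```

With these definitions, encode_folder_name("Work/100%") yields Work%2F100%25, and decode_folder_name restores the original name.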

#Default delivery target

An m2store root may contain an entry .delivery to indicate the user's desired default folder for incoming mail. If present, the entry must meet one of the following criteria:

  • It must be a symbolic link to an existing m2dir underneath the m2store root, or
  • It must be a regular file containing only the normalized, relative path to an existing m2dir underneath the m2store root (e.g. INBOX, not ./INBOX or ~/Mail/INBOX)

Applications must support the link variant. The regular file variant is intended as a backup solution for platforms or filesystems that do not support links. Applications are strongly recommended to support both.

The treatment of the default delivery target is covered in the Mail delivery section. If a valid .delivery entry is present, the m2store root is a valid delivery target, even if it is not itself an m2dir. Otherwise, applications must be configured to deliver to a valid m2dir.

The purpose of this is to allow the following hypothetical setup: a system administrator configures an m2dir-compatible mail delivery agent to deliver mails to ~/mail for all users. With the described mechanism, each user can direct incoming mails to the folder of their choice.

#Example directory layout

The on-disk representation of an m2store that gets synchronized with a typical IMAP account (but also allows for local delivery of new mail) might therefore look like this:

\_ mail/
  \_ .m2store
  \_ .delivery -> INBOX/
  \_ INBOX/
    \_ .m2dir
    \_ .meta/
  \_ Sent/
    \_ .m2dir
    \_ .meta/
  \_ Work/
    \_ .m2dir
    \_ .meta/
  \_ Lists/
    \_ srht-dev/
      \_ .m2dir
      \_ .meta/
    \_ srht-discuss/
      \_ .m2dir
      \_ .meta/

Note: the name mail is just an example; the name of the m2store root is user-defined.

#Rootless m2dirs

For more advanced use cases, an m2dir can exist outside the context of an m2store. An example could be backing up one specific mailbox of an IMAP account into a user-specified directory. An m2dir-compliant application can still work with the emails in that directory, but must not make any assumptions about the folder name. Synchronization of changes from or to that directory would for example require that the user explicitly specify the remote mailbox.

#M2dir content

A directory that stores emails in m2dir format according to this specification must contain a marker file .m2dir.

The .m2dir marker file must be empty. However, applications must merely check for the file's presence. Future versions of the spec may use the marker file's content, e.g. to indicate support for a revised version of the spec.

Every file in the m2dir represents an email. Files starting with a period (.) must be ignored, unless they are specified in this document.

Email metadata (such as flags) is stored in a subdirectory .meta (see Metadata below). This directory may not exist, even in the presence of emails in the m2dir, if no metadata about these emails has been recorded yet.

All directories in an m2dir should be ignored, unless the m2dir is embedded in an m2dir-compliant m2store directory structure with a known m2store root. Directories whose name starts with a period (.) must be ignored, unless they are specified in this document.
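
As a sketch of these rules, the messages of a single m2dir could be enumerated in Python as follows. The function name list_messages is hypothetical, and the sketch treats the m2dir as rootless, so all subdirectories are skipped.

```python
import os

def list_messages(m2dir):
    """Return the sorted filenames of all messages in an m2dir.

    Hypothetical helper, not part of the specification.
    """
    if not os.path.exists(os.path.join(m2dir, ".m2dir")):
        raise ValueError(f"{m2dir} is not an m2dir (no .m2dir marker)")
    names = []
    with os.scandir(m2dir) as entries:
        for entry in entries:
            # Skips the marker, .meta and temporary files alike.
            if entry.name.startswith("."):
                continue
            # Directories are ignored; every remaining file is an email.
            if entry.is_file():
                names.append(entry.name)
    return sorted(names)
```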

New files must be created according to the Mail delivery section below.

#Filenames

A message's filename is structured as follows:

<HUMAN_CENTRIC_PART>,<UNIQUE_ID>

The unique ID part is structured as follows:

<CHECKSUM>[.<COUNTER>]

The checksum must be an RFC 4648 base64url-encoded string representing 12 bytes of data (see Unique ID below). This implies that it must not contain any padding characters and must contain only the non-padding characters from the RFC's "URL and Filename safe" Base64 alphabet ([A-Za-z0-9_-]).

To handle checksum collisions, an integer greater than zero can be appended to the checksum, separated by a dot (.).

The unique ID of a message must be generated according to the rules described in the Unique ID section.

#Parsing filenames

An m2dir-compliant application must locate the unique ID by searching backwards from the end of the filename for the first comma (,), that is, the last comma in the filename. This is necessary because the human-centric part may itself contain commas.

Note that applications must not attempt to parse the human-centric part or derive any properties from it.
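
The parsing rule can be sketched in Python as follows. The function name parse_filename is hypothetical. The sketch assumes that the counter is written in decimal without leading zeros, which the specification does not state explicitly.

```python
import re

# 16 base64url characters encode the 12 checksum bytes. The optional
# counter is an integer greater than zero (assumed: no leading zeros).
UNIQUE_ID_RE = re.compile(r"([A-Za-z0-9_-]{16})(?:\.([1-9][0-9]*))?")

def parse_filename(filename):
    """Split a filename into (human-centric part, checksum, counter).

    The counter is None if the unique ID has no counter suffix.
    """
    # rpartition splits at the LAST comma, as required.
    human, sep, unique_id = filename.rpartition(",")
    if not sep:
        raise ValueError("filename contains no comma")
    match = UNIQUE_ID_RE.fullmatch(unique_id)
    if match is None:
        raise ValueError(f"invalid unique ID: {unique_id!r}")
    checksum, counter = match.groups()
    # The human-centric part is returned as an opaque string only.
    return human, checksum, int(counter) if counter else None
```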

#Example filenames

Example filename of an email in an m2dir, using the specification's example naming scheme for the human-centric part of the filename:

2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R

Storing the same message twice leads to a hash collision. Therefore, the next copy would have the filename:

2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R.1

#Metadata

Metadata about emails is stored in separate files in the .meta subdirectory of an m2dir. Each type of metadata is stored in its own file, following the naming convention:

.meta/<UNIQUE_ID>.<EXTENSION>

Currently, the following types of metadata are defined:

  • Flags: .meta/<UNIQUE_ID>.flags

#Unique ID

The unique ID must be generated according to the following specification.

The value S is defined as the little-endian representation of the size of the message in bytes, as a 32-bit integer.

The entire message is hashed, using the FNV64a hash function, salted with S.

The final checksum is the base64url-encoded representation of the four bytes of S concatenated with the eight bytes of hash output (which is also assumed to be little-endian).

As the input for the base64url-encoding is exactly 12 bytes, the resulting string will be 16 characters long and not contain any padding.

If, and only if, a checksum collision is detected (which likely means a duplicate message), the ID is made unique by appending a dot (.) followed by the first integer starting with 1 that will prevent a collision.

Example:

  • The first message with a checksum X gets the ID X
  • The second message with a checksum X gets the ID X.1
  • The third message with a checksum X gets the ID X.2

With this scheme, changes to a message (which should not occur) can be detected by re-computing the checksum and comparing it to the value extracted from the filename.
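
The following Python sketch shows one possible reading of this procedure. Two points are assumptions rather than statements of the specification: "FNV64a" is taken to mean the 64-bit FNV-1a function, and "salted with S" is taken to mean that the four bytes of S are fed to the hash before the message bytes. All function names are hypothetical.

```python
import base64
import struct

FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3

def fnv1a_64(data, state=FNV64_OFFSET_BASIS):
    """64-bit FNV-1a over `data`, continuing from `state`."""
    for byte in data:
        state ^= byte
        state = (state * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return state

def checksum(message):
    """Return the 16-character checksum for a message given as bytes."""
    # S: message size as a little-endian 32-bit integer. struct.pack
    # raises struct.error if the size does not fit into 32 bits.
    s = struct.pack("<I", len(message))
    # Assumption: salting means hashing S first, then the message.
    digest = fnv1a_64(message, fnv1a_64(s))
    raw = s + struct.pack("<Q", digest)  # 4 + 8 = 12 bytes
    # 12 bytes encode to exactly 16 characters without padding.
    return base64.urlsafe_b64encode(raw).decode("ascii")

def unique_id(message, existing_ids):
    """Append the first free counter only if the checksum is taken."""
    base = checksum(message)
    if base not in existing_ids:
        return base
    counter = 1
    while f"{base}.{counter}" in existing_ids:
        counter += 1
    return f"{base}.{counter}"
```

Here existing_ids stands for the set of unique IDs already present in the target m2dir, which an application would obtain by parsing the existing filenames.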

#Flags

M2dir allows associating a set of arbitrary flags with a message. These flags are considered metadata and stored in a separate file as defined in the Metadata section. This section defines the format of this file.

The flags file must contain a set of flags, one flag per line, lines separated by a single newline character (ASCII character LF, 0x0A). The empty set of flags may be represented either by an empty file or the absence of a flags file. Each flag must be a valid, non-empty UTF-8 string. Flags must not contain any control characters.
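
Reading and writing a flags file could be sketched in Python as follows. The function names are hypothetical. The sketch assumes that control characters are those in Unicode category Cc, writes no trailing newline, and tolerates one when reading; none of these details are fixed by the specification.

```python
import os
import unicodedata

def flags_path(m2dir, unique_id):
    return os.path.join(m2dir, ".meta", unique_id + ".flags")

def read_flags(m2dir, unique_id):
    """Return the set of flags; a missing file means the empty set."""
    try:
        with open(flags_path(m2dir, unique_id), encoding="utf-8", newline="") as f:
            data = f.read()
    except FileNotFoundError:
        return set()
    # Empty segments (empty file, trailing newline) carry no flag.
    return {line for line in data.split("\n") if line}

def write_flags(m2dir, unique_id, flags):
    """Write one flag per line, separated by a single LF."""
    for flag in flags:
        # Assumption: "control characters" means Unicode category Cc.
        if not flag or any(unicodedata.category(c) == "Cc" for c in flag):
            raise ValueError(f"invalid flag: {flag!r}")
    # The .meta directory may not exist yet.
    os.makedirs(os.path.join(m2dir, ".meta"), exist_ok=True)
    with open(flags_path(m2dir, unique_id), "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(sorted(flags)))
```

Sorting the flags is not required; it merely keeps the file content stable across rewrites.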

The m2dir specification is only concerned with storage. Therefore, it imposes no further restrictions on the permitted flag names, but it is strongly recommended that applications limit the flag names to a conservative subset (such as alphanumeric ASCII characters only, or the allowed characters for IMAP keywords), for interoperability.

Similarly, with m2dir being concerned with storage only, it treats flags as case-sensitive. It is up to an application to normalize flags or compare them in a case-insensitive manner if the use-case calls for it (IMAP for example considers flags case-insensitive).

While the m2dir specification does allow arbitrary flags, it also specifies a set of standard flags for very common use-cases. It is strongly recommended that an application synchronizing a remote mail store with an m2store map whatever flags the remote storage may be using for these common use-cases to the ones defined here (and vice versa). This will help to preserve semantics, even if mail were to be replicated to yet another remote store that potentially uses different flags.

These flags are used for special purposes and are usually not presented verbatim to the user (though they may trigger certain visual cues in the presentation, such as the highlighting of unread messages). As such, they start with a dollar sign ($) to avoid conflicts with user-defined flags. Note that it is technically possible to have user-defined flags starting with a dollar sign, but it is strongly recommended that applications do not allow this.

The standard flags are all IANA-defined IMAP keywords, verbatim, minus the reserved $recent, plus the IMAP flag \Deleted, but with the leading \ replaced with a $.

At the time of this writing, these are:

  • $seen - Message has been read.
  • $answered - Message has been answered.
  • $Forwarded - Message has been forwarded.
  • $flagged - Message is "flagged" (by the user) for urgent/special attention.
  • $Deleted - Message is marked "deleted", for later removal.
  • $draft - Message has not completed composition (marked as a draft).
  • $Important - Message is marked as "important".
  • $MDNSent - A Message Disposition Notification has been sent.
  • $Junk - Message definitely contains junk.
  • $NotJunk - Message definitely does not contain junk.
  • $Phishing - Message is likely a phishing attempt.

New flags may be defined later. Any new keywords added to the IANA registry automatically become standard m2dir flags.

#Mail delivery

#Default delivery folder

An application delivering a new message which originates from a remote without a well-defined folder hierarchy (for example SMTP-delivery) must perform the following steps to determine the final storage location for the message. It is assumed that the application has a configured target directory for mail for a certain user (e.g. ~/Mail):

  1. Check if target directory is a valid m2dir (contains .m2dir marker file)
    • If yes, deliver message to this directory; done
    • If no, proceed with next step
  2. Check if target directory is a valid m2store (contains .m2store marker file)
    • If no, abort delivery with error
    • If yes, continue with next step
  3. Check if target directory contains an entry .delivery that is valid according to the rules described in the Default delivery target section.
    • If yes, deliver message to target specified by .delivery entry; done
    • If no, abort delivery with error
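
The three steps above can be sketched in Python as follows. The function name resolve_delivery_target is hypothetical. Two details are assumptions: a single trailing newline in a regular-file .delivery entry is tolerated, and the resolved folder must lie strictly below the m2store root.

```python
import os

def resolve_delivery_target(target):
    """Return the m2dir that mail for `target` should be delivered to."""
    # Step 1: the target itself is an m2dir.
    if os.path.exists(os.path.join(target, ".m2dir")):
        return target
    # Step 2: otherwise it must be an m2store root.
    if not os.path.exists(os.path.join(target, ".m2store")):
        raise ValueError(f"{target} is neither an m2dir nor an m2store")
    # Step 3: evaluate the .delivery entry (symbolic link or regular file).
    entry = os.path.join(target, ".delivery")
    if os.path.islink(entry):
        folder = os.path.realpath(entry)
    elif os.path.isfile(entry):
        with open(entry, encoding="utf-8") as f:
            rel = f.read()
        if rel.endswith("\n"):  # assumption: tolerate one trailing newline
            rel = rel[:-1]
        # Only a normalized, relative path such as INBOX is accepted.
        if (not rel or os.path.isabs(rel) or os.path.normpath(rel) != rel
                or rel.split(os.sep)[0] == ".."):
            raise ValueError(f"invalid .delivery content: {rel!r}")
        folder = os.path.realpath(os.path.join(target, rel))
    else:
        raise ValueError("no valid .delivery entry")
    # The target must be an existing m2dir underneath the m2store root.
    root = os.path.realpath(target)
    inside = os.path.commonpath([root, folder]) == root and folder != root
    if not inside or not os.path.exists(os.path.join(folder, ".m2dir")):
        raise ValueError(f".delivery does not point to an m2dir below {target}")
    return folder
```

Every ValueError corresponds to an "abort delivery with error" outcome in the list above.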

#File creation

When delivering a new message into an m2dir, it is first written to a temporary file in the target directory. This temporary file's name must start with a period (.) in order to be ignored by compliant applications. In addition, it is strongly recommended that applications employ established mechanisms for secure temporary file creation (such as mkstemp(3)). Once the file is complete, it is renamed ("moved") to its final destination according to the specification. As the final destination is in the same directory, this operation can reasonably be assumed to be atomic.
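
A minimal Python sketch of this write-then-rename procedure is shown below. The function name deliver is hypothetical, and the final filename is assumed to have been computed beforehand according to the Filenames and Unique ID sections.

```python
import os
import tempfile

def deliver(m2dir, final_name, message):
    """Write `message` (bytes) into `m2dir` under `final_name`."""
    # mkstemp creates the file securely. The "." prefix makes compliant
    # applications ignore it while it is still incomplete.
    fd, tmp_path = tempfile.mkstemp(prefix=".", dir=m2dir)
    final_path = os.path.join(m2dir, final_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(message)
        # Same directory, so the rename can be assumed to be atomic.
        # Note: on POSIX systems, rename silently replaces an existing
        # file, so the caller must have chosen a collision-free unique ID.
        os.rename(tmp_path, final_path)
    except BaseException:
        # Do not leave a stale temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return final_path
```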

#Human-centric part of filename

The purpose of the human-centric part is solely to provide some context to a human operator to differentiate emails in a meaningful way. Applications must not attempt to parse the human-centric part or derive any properties from it. It is purely for human consumption.

The actual contents of the human-centric part of the filename are intentionally unspecified. Applications are free to come up with their own naming schemes, or even offer users a choice between different ones.

The only requirement is that the human-centric part of the filename must not change, unless explicitly requested by the user. This is to prevent unexpected breakage of any index the user may have on top of the message store.

#Example

The following example shall illustrate the purpose of the human-centric part of the filename. It is purely informational.

An application might choose the following naming scheme for the human-centric part:

<DATE>_<FROM>

Where

  • <DATE> is the date from the email's Date header in the following format:
    YYYY-MM-DD_hh:mm
    
    or, in other words, equivalent to the output of date '+%Y-%m-%d_%H:%M'
  • <FROM> is the address part of the email's From header

Example:

2023-09-04_13:47_builds@sr.ht
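
Purely as an illustration, this example naming scheme could be implemented in Python as follows. The function name human_centric_part is hypothetical. The sketch uses the time exactly as written in the Date header, without any time zone conversion, and performs no sanitizing; a real application would at least need to handle characters such as / in addresses.

```python
from email.utils import parseaddr, parsedate_to_datetime

def human_centric_part(date_header, from_header):
    """Build <DATE>_<FROM> from the raw Date and From header values."""
    date = parsedate_to_datetime(date_header)
    # parseaddr returns (display name, address); only the address is used.
    _, address = parseaddr(from_header)
    return f"{date:%Y-%m-%d_%H:%M}_{address}"
```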

Using the date in the specified notation as first part will naturally sort the messages by date if alphabetic sorting is applied (as is common e.g. in the output of ls). Pretty much any email client presents messages sorted by date, so the (easy to establish) alphabetical order of files would nicely match the chronological order which users are used to.

Due to this common presentation, the date is also something that many people "mentally index" their mail by, consciously or not (think e.g. "I got this mail yesterday", or "Rob sent this last week"). Therefore, making the date easily readable would be another human-centric feature.

The "From:" address is also considered (by the author of this example) to be an important distinguishing feature of an email. The idea is, given a moderate number of messages (say fewer than 50), to enable a user to find the right one just by looking at the filenames.

#License

This work is marked with CC0 1.0 🅭 🄍

About this wiki

commit 72f8841a7c39f3ea51418476bd5eae4ccb3cbe5b
Author: Conrad Hoffmann <ch@bitfehler.net>
Date:   2024-04-18T16:20:55+02:00

Specify encoding for folder names if required