Canonicalization



Canonicalization is the process of representing a document unambiguously in a heterogeneous environment, and it plays an important role in signing and verifying digital signatures. When a document passes through the Internet, some forwarding agents modify it to match the way their local systems interpret character sets and line delimiters. The modified document may then be forwarded to the next transfer agent, which may make further modifications before forwarding it again. The final recipient is no longer guaranteed that the document received is identical to the one the sender sent, even though every change along the way seemed "harmless". If the sender digitally signed the document, the recipient cannot verify the signature, because the bytes that were signed no longer match the bytes received. The document therefore has to be canonicalized so that there is only one way in which it can be interpreted across transfer agents and local systems. When the canonicalized document is signed on the sending side, it can be verified on the receiving side.

Transfer agents and local systems may interpret the document differently in ways such as these:

Text-based presentations on local systems delimit lines with a single carriage-return character (CR), a single line-feed character (LF), or the carriage-return line-feed sequence (CRLF).
For text-based presentations, trailing spaces at the end of the line may or may not be stripped.
Character sets in a message body containing binary data may or may not be recognized by local systems.
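
The effect of these differences on a signature can be illustrated with a short Python sketch. The sample document and the choice of SHA-256 are assumptions made for illustration only; the text above does not name a digest algorithm.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical two-line document as the sender wrote it:
# CRLF line delimiters and a trailing space on the first line.
sent = b"first line \r\nsecond line\r\n"

# The same document after "harmless" rewrites by different agents.
variants = {
    "as sent (CRLF)": sent,
    "LF only": sent.replace(b"\r\n", b"\n"),
    "CR only": sent.replace(b"\r\n", b"\r"),
    "trailing spaces stripped": b"\r\n".join(
        line.rstrip(b" ") for line in sent.split(b"\r\n")
    ),
}

digests = {name: sha256_hex(data) for name, data in variants.items()}

for name, value in digests.items():
    print(f"{name:26} {value}")
```

Each variant reads the same to a person, yet every one produces a different digest, so a signature computed over one of them will not verify against any of the others.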

Required Steps for Canonicalization

When a document is canonicalized, it is modified as follows:

Encode the document using a 7-bit character set. To represent binary data unambiguously, it must be expressed in a 7-bit character set whose characters are recognized on all systems. This is normally done by encoding the header and body of the document with the Base64 or Quoted-Printable 7-bit encoding mechanism. The Content-Transfer-Encoding header is then added with the name of the applied encoding mechanism as its value.
For a message containing multiple body parts, each body part is encoded as though it were a separate message. That is, since each body part contains both a header and a body, the body is encoded with the 7-bit encoding mechanism and a Content-Transfer-Encoding header naming that mechanism is added to the same body part. NOTE: The encoding mechanism may differ from one body part to another. For example, one body part may be encoded in Base64 and another body part in the same message may be encoded in Quoted-Printable.
If the document is already defined to be text (7-bit), any single carriage returns (CR) and single line feeds (LF) are converted into the canonical line delimiter, the CRLF sequence. The CR and LF characters have octet values 13 (0x0D) and 10 (0x0A) respectively. This conversion is required only for computing the digital signature; the original document, without the conversion, must be sent to the trading partner. When the trading partner receives the document, the same CRLF conversion must be done before computing the digital signature.
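
The CRLF conversion in the last step can be sketched in Python as follows. This is a minimal illustration, not Framework EDI's implementation: the function names are hypothetical, and SHA-256 stands in for whatever digest the signature algorithm actually uses.

```python
import hashlib
import re

# Matches a CRLF pair, a lone CR, or a lone LF. CRLF is listed first so
# that an existing pair is treated as one delimiter and not as two.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def canonicalize_line_endings(data: bytes) -> bytes:
    """Replace every single CR and single LF with the CRLF sequence."""
    return _LINE_BREAK.sub(b"\r\n", data)


def signing_digest(data: bytes) -> bytes:
    """Digest of the canonical form (SHA-256 is an assumption here).

    The caller still transmits `data` unchanged; the canonical form is
    used only as input to the digest.
    """
    return hashlib.sha256(canonicalize_line_endings(data)).digest()
```

Because sender and recipient both apply the same conversion before computing the digest, `signing_digest(b"x\ny\n")` and `signing_digest(b"x\r\ny\r\n")` are equal, regardless of which line delimiter the document carried in transit.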

Framework EDI Canonicalization

Because data can be represented in many different ways as it is processed and transferred, Framework EDI has to determine whether the data is already canonicalized. A MIME document is composed of two sections: a header and a body. Framework EDI only reads MIME documents whose header is already canonicalized, i.e. all characters in the header are 7-bit and every header line is terminated with the canonical CRLF line delimiter. By the same token, Framework EDI prepares MIME documents with properly canonicalized headers. Therefore, when preparing and processing documents, only the body is canonicalized. Framework EDI is cautious about modifying data automatically, so it takes the following steps to determine whether the body should be canonicalized before doing so.

The body is scanned to check whether it is already 7-bit. In a multipart message, all body parts, including nested body parts, are scanned. If the body is already 7-bit, no canonicalization is performed. This check is done whether the body contains text or binary data: if binary data contains only 7-bit bytes, no canonicalization is done.
If the body is not in 7-bit, the body is encoded using the encoding mechanism that is applicable to the content described by the Content-Type header. Currently, Framework EDI encodes the content types as follows:
- Application/EDI-X12 is encoded to Quoted-Printable.
- Application/Edifact is encoded to Quoted-Printable.
- Application/EDI-Content is encoded to Quoted-Printable.
- All other content types are encoded to Base64.
- If both the Content-Type header and the Content-Transfer-Encoding header are missing, the data is encoded to Base64.
  If the body is described as having Content-Type Text/* or some other 7-bit type, or the Content-Transfer-Encoding header names a 7-bit encoding mechanism, then no transfer encoding is done, and a warning is generated stating that the body contains binary data.
  
  If the Content-Transfer-Encoding header exists and already names the specified encoding mechanism, then no encoding is done, and a warning is generated stating that the body contains binary data.
  
  If the Content-Transfer-Encoding header exists and the body is encoded with the specified encoding mechanism, then the header is assigned the name of the encoding mechanism applied. For example, if the Content-Transfer-Encoding header originally has the value "binary" and the body is encoded in Base64, then the header is assigned the new value "base64".
  
  If the Content-Transfer-Encoding header does not exist and the body is encoded with the specified encoding mechanism, then the header is added and assigned the name of the encoding mechanism applied.
When the message is in 7-bit, the entire message is scanned and all single CR and single LF characters are replaced by the CRLF sequence. The message may have been transfer encoded to arrive at its 7-bit representation, or it may already be defined as text. In either case, the CRLF conversion is still required. This conversion is required only for calculating the digital signature. The text that results from the CRLF conversion must not be sent; only the text prior to the conversion is sent. When the message is received on the other end, the recipient must perform the same conversion, replacing all single CR and single LF characters with the CRLF sequence, before computing the digital signature.
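
The decision procedure for a single body part can be sketched in Python as follows. This is an illustration of the rules described above, not Framework EDI's actual code: every function and constant name is hypothetical, "some 7-bit type" is modelled only as Text/*, and multipart traversal and the final CRLF conversion are left out.

```python
import base64
import quopri
from typing import Optional, Tuple

# Content types the text above lists as encoded to Quoted-Printable.
QP_CONTENT_TYPES = {
    "application/edi-x12",
    "application/edifact",
    "application/edi-content",
}
# Content-Transfer-Encoding values that promise 7-bit data.
SEVEN_BIT_ENCODINGS = {"7bit", "base64", "quoted-printable"}


def is_7bit(body: bytes) -> bool:
    """True if no octet in the body has its high bit set."""
    return all(octet < 0x80 for octet in body)


def media_type_of(content_type: Optional[str]) -> str:
    """Lower-cased type/subtype with any parameters removed."""
    if not content_type:
        return ""
    return content_type.split(";", 1)[0].strip().lower()


def choose_encoding(content_type: Optional[str]) -> str:
    """Quoted-Printable for the listed EDI types, Base64 otherwise."""
    if media_type_of(content_type) in QP_CONTENT_TYPES:
        return "quoted-printable"
    return "base64"


def encode_body(
    body: bytes,
    content_type: Optional[str] = None,
    transfer_encoding: Optional[str] = None,
) -> Tuple[bytes, Optional[str], Optional[str]]:
    """Return (body, Content-Transfer-Encoding value, warning or None)."""
    if is_7bit(body):
        # Already 7-bit: leave the body and header untouched.
        return body, transfer_encoding, None

    declared_7bit = (
        media_type_of(content_type).startswith("text/")
        or (transfer_encoding or "").lower() in SEVEN_BIT_ENCODINGS
    )
    if declared_7bit:
        # The headers claim 7-bit data but the body is not: do not
        # modify it, only warn.
        return body, transfer_encoding, "body contains binary data"

    encoding = choose_encoding(content_type)
    if encoding == "base64":
        encoded = base64.encodebytes(body)
    else:
        encoded = quopri.encodestring(body)
    # The header is added or overwritten with the mechanism applied.
    return encoded, encoding, None
```

For example, `encode_body(b"\xff\x00", "application/octet-stream", "binary")` returns a Base64 body with the header value `"base64"`, while an 8-bit body labelled `text/plain` is returned unchanged together with the warning. The sketch returns the warning as a string so that the caller decides how to report it.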