[whatwg] [mimesniff] An alternative approach to section 9 of Mime Sniffing
poccil14 at gmail.com
Thu May 23 12:01:47 PDT 2013
The pattern mask DF is currently used only in the algorithm for identifying
an unknown MIME type, and even there for identifying only one MIME type,
namely text/html. This can be succinctly covered with the following ABNF:
WHITESPACE = *( %x09 / %x0A / %x0C / %x0D / %x20 )
  ; any number of whitespace bytes
TAGTERM = %x20 / %x3E ; a tag-terminating byte: space or ">"
html = WHITESPACE (
  "<!DOCTYPE HTML" / "<HTML" /
  "<HEAD" / "<SCRIPT" / "<IFRAME" /
  "<H1" / "<DIV" / "<FONT" / "<TABLE" /
  "<A" / "<STYLE" / "<TITLE" / "<B" /
  "<BODY" / "<BR" / "<P" / "<!--"
  ) TAGTERM
  ; Leading whitespace, followed by "<" followed
  ; by a tag, followed by a tag-terminating byte.
  ; All strings are case-insensitive.
Note also that the notes in the example (in my previous message) are
retained as comments in the ABNF, since they clarify what the byte pattern
matches and help eliminate some of the confusion.
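
To make the intent of the html rule concrete, here is a minimal Python sketch of a matcher for it. The names looks_like_html, WHITESPACE, and TAGS are hypothetical, chosen only for illustration, and the sketch assumes the rule ends with a tag-terminating byte as its comment describes. It is not taken from the MIME Sniffing specification.

```python
# Bytes that the WHITESPACE rule allows before the tag.
WHITESPACE = b"\t\n\x0c\r "

# The tag strings from the html rule, in uppercase for comparison.
TAGS = [
    b"<!DOCTYPE HTML", b"<HTML", b"<HEAD", b"<SCRIPT", b"<IFRAME",
    b"<H1", b"<DIV", b"<FONT", b"<TABLE", b"<A", b"<STYLE",
    b"<TITLE", b"<B", b"<BODY", b"<BR", b"<P", b"<!--",
]


def looks_like_html(data: bytes) -> bool:
    """Return True if data starts with optional whitespace, one of the
    tag strings (ASCII case-insensitive), and then a space or ">"."""
    body = data.lstrip(WHITESPACE)
    upper = body.upper()  # bytes.upper() only changes ASCII letters
    for tag in TAGS:
        end = len(tag)
        # The slice is empty if the input ends right after the tag,
        # so a truncated tag is treated as "no match".
        if upper.startswith(tag) and body[end:end + 1] in (b" ", b">"):
            return True
    return False
```

For instance, looks_like_html(b"  \n<html>") holds, while b"<HTMLX" and a bare b"<body" (no terminating byte) do not match.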
What problem am I trying to solve?
For one thing, look at section 5, parsing a MIME type. It's currently an
incomplete and unwieldy list of steps that don't clearly state what a MIME
type should consist of. Showing an ABNF next to the rules will help in this
From: Gordon P. Hemsley
Sent: Thursday, May 23, 2013 11:14 AM
To: Peter Occil
Subject: Re: [whatwg] An alternative approach to section 9 of Mime Sniffing
The pattern matching algorithm is used because certain patterns
require other-than-exact matching. That is why the "pattern mask"
exists. This is particularly important for the "rules for identifying
an unknown MIME type" (defined in 10.1), which matches ASCII
characters case-insensitively; it is also important for a number of
patterns that contain unimportant bytes that should be ignored (like
WebP, in your example).
The algorithm lays out the information in tabular form because that
makes clearer the separation between the important bytes and the
unimportant (or case-insensitive) bytes. Keep in mind that
implementations may read one byte at a time; using ABNF would give
them no benefit, and would likely make things more confusing.
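
As a rough illustration of the byte-pattern-plus-mask idea described above, here is a minimal Python sketch. The function name matches_pattern is hypothetical, and the sketch leaves out details of the real algorithm (such as skipping leading ignored bytes). It only shows how a mask byte of 0xDF folds ASCII case and how a mask byte of 0x00 marks an unimportant byte.

```python
def matches_pattern(data: bytes, pattern: bytes, mask: bytes) -> bool:
    """Compare the start of data with pattern, byte by byte, after
    ANDing each input byte with the corresponding mask byte."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    if len(data) < len(pattern):
        return False
    # zip stops at the end of pattern, so extra input bytes are ignored.
    return all((d & m) == p for d, p, m in zip(data, pattern, mask))


# 0xDF clears bit 0x20, so "h" (0x68) and "H" (0x48) compare equal.
HTML_PATTERN = b"<HTML"
HTML_MASK = b"\xff\xdf\xdf\xdf\xdf"

# 0x00 in the mask makes the four size bytes after "RIFF" unimportant.
WEBP_PATTERN = b"RIFF\x00\x00\x00\x00WEBPVP"
WEBP_MASK = b"\xff" * 4 + b"\x00" * 4 + b"\xff" * 6
```

With these tables, matches_pattern(b"<html>", HTML_PATTERN, HTML_MASK) holds, and so does a WebP header with arbitrary bytes in positions 4 through 7.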
I wonder: What problem are you trying to solve with this proposal?
(In the future, please add "[mimesniff]" to the beginning of your
subject line for MIME Sniffing discussions; this will ensure that I
see them and pay attention to them more quickly.)
On Thu, May 23, 2013 at 2:10 AM, Peter Occil <poccil14 at gmail.com> wrote:
> I propose rewriting section 9 and parts of section 10 in a different way,
> to use the ABNF format in RFC 5234. (Note that ABNFs are already used in
> the current Fetch specification.) With this approach, the definitions for
> "byte pattern", "pattern mask", and the "pattern matching algorithm" can
> be eliminated (all of which are found before section 9.1).
> An example for the image pattern matching algorithm is given below.
> 9.1 Matching an image type pattern
> The image pattern matching algorithm takes a byte sequence as input. The
> algorithm goes through the following image types in the order given. For
> each image MIME type given below, if the start of the byte sequence
> matches its ABNF, return the concatenation of "image/" and the name of the
> ABNF (in lowercase), and terminate the image pattern matching algorithm.
> vnd.microsoft.icon = %x00.00.01.00
> ; A Windows Icon signature.
> bmp = %x42.4D
> ; The string "BM", a BMP signature.
> gif = %x47.49.46.38 (%x37 / %x39) %x61
> ; The string "GIF87a" or "GIF89a", a GIF signature.
> webp = %x52.49.46.46 4OCTET %x57.45.42.50.56.50
> ; The string "RIFF" followed by four bytes followed by the string
> ; "WEBPVP", a WebP signature.
> png = %x89.50.4E.47.0D.0A.1A.0A
> ; The byte 0x89 followed by the string "PNG"
> ; followed by CR LF SUB LF, the PNG signature.
> jpeg = %xFF.D8.FF
> ; The JPEG Start of Image marker followed by the indicator
> ; byte of another marker.
> If the start of the byte sequence doesn't match any ABNF given above,
> return undefined.
> I would appreciate comments.
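
For illustration, the proposed image matching steps can be sketched in Python with one anchored regular expression per signature, tried in the order the proposal lists them. The names IMAGE_SIGNATURES and match_image_type are hypothetical; the GIF and WebP byte values follow the comments that describe them ("GIF87a"/"GIF89a" and "RIFF", four bytes, "WEBPVP"); and None stands in for "undefined".

```python
import re

# (name, compiled pattern) pairs, in the order given in the proposal.
# re.DOTALL lets "." match any byte, including 0x0A, in the WebP gap.
IMAGE_SIGNATURES = [
    ("vnd.microsoft.icon", re.compile(rb"\x00\x00\x01\x00")),
    ("bmp", re.compile(rb"BM")),
    ("gif", re.compile(rb"GIF8[79]a")),
    ("webp", re.compile(rb"RIFF.{4}WEBPVP", re.DOTALL)),
    ("png", re.compile(rb"\x89PNG\r\n\x1a\n")),
    ("jpeg", re.compile(rb"\xff\xd8\xff")),
]


def match_image_type(data: bytes):
    """Return "image/<name>" for the first signature that matches the
    start of data, or None if no signature matches."""
    for name, signature in IMAGE_SIGNATURES:
        # Pattern.match only succeeds at the start of the input.
        if signature.match(data):
            return "image/" + name
    return None
```

A regular expression is used here only because it maps closely onto the ABNF rules; a byte-at-a-time comparison would serve equally well.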
Gordon P. Hemsley
me at gphemsley.org
http://gphemsley.org/ • http://gphemsley.org/blog/