[whatwg] Parsing, syntax, and content model feedback

Ian Hickson ian at hixie.ch
Thu Dec 25 02:41:47 PST 2008


This is a bulk reply to a variety of e-mails on the topic of the 
HTML5 syntax, its parsing rules, and its content models, sent to the 
WHATWG list.

On Sun, 27 Jul 2008, Henri Sivonen wrote:
> > >
> > > 2.3.1.
> > > Since blockquote is so abused that it is useless for AI, allowing 
> > > attribution within the blockquote would be practical.
> > 
> > Attribution isn't part of a quote. How would you distinguish quoting 
> > an attribution from quoting text with an attribution from quoting text 
> > that happens to have its attribution?
> 
> Quotation marks:
> <BLOCKQUOTE><p>“There’s just no nice way to say this: Anyone
> who can’t make a syndication feed that’s well-formed XML
> is an incompetent fool.——Maybe this is unkind and elitist
> of me, but I think that anyone who either can’t or won’t
> implement these measures is, as noted above, a bozo.” –
> <A HREF="http://www.tbray.org/ongoing/When/200x/2004/01/11/PostelPilgrim">Tim
> Bray</A>, co-editor of the XML 1.0 specification</p></BLOCKQUOTE>

I think if we were to allow this we would have to introduce an explicit 
<credit> element. Or maybe <legend>, but then why not just use <figure> to 
link the <blockquote> and its legend together?


> On the topic of foreignObject: Shouldn't HTML5 actually *disallow* 
> <html> as a child of <foreignObject> and make the content model of 
> foreignObject equivalent to the content model of <body>? The commented 
> out SVG-in-text/html functionality doesn't support <html> as a child of 
> <foreignObject>.

Ok, done. (I'm surprised that SVG itself doesn't give any guidance as to 
the contents of <foreignObject>. I had no hooks to use to define this.)


> > > 2.3.4.
> > > "When an element has an ID set through multiple methods (for 
> > > example, if it has both id and xml:id attributes simultaneously 
> > > [XMLID]), then the element has multiple identifiers. User agents 
> > > must use all of an HTML element's identifiers (including those that 
> > > are in error according to their relevant specification) for the 
> > > purposes of ID matching."
> > > 
> > > What does this mean in terms of document conformance?
> > 
> > n/a (is the current text ok?)
> 
> It's OK for the multiple ID issue. However, this sentence is known to be 
> confusing: "The value must be unique in the subtree within which the 
> element finds itself and must contain at least one character." It should 
> say that the ID must be unique within all nodes that are inserted into a 
> document, a document fragment or an interconnected set of nodes that 
> live outside a document or document fragment.

I've tried to make it mean that.


> > > 2.9.10.
> > > I suggest the definition of i be changed to "The i element 
> > > represents anything that is italicized in conventional typography." 
> > > That's pretty much the only real world-compatible definition.
> > > 
> > > Also, I suggest b be included in the spec and defined as "The b 
> > > element represents anything (except headings) that is set in bold 
> > > face in conventional typography."
> > 
> > Is the current text ok?
> 
> Yes, except the advice "The i element should be used as a last resort 
> when no other element is more appropriate. In particular, citations 
> should use the cite element, defining instances of terms should use the 
> dfn element, stress emphasis should use the em element" may not be 
> respectful of the authors' time in the absence of concrete benefits for 
> justifying the advice (other than future potential for unconventional 
> styling).

Changed.


On Thu, 4 Dec 2008, Tommy Thorsen wrote:
>
> Consider the following simple markup:
> 
> <!doctype html></br>
> 
> If I run it through my parser, which is implemented according to the HTML5
> algorithm, the resulting DOM is as follows:
> 
> <html>
>    <head>
>    <body>
> 
> The br end tag is a bit special, and should be handled as if it was a br start
> tag. What happens here is as follows: The "before head" insertion mode will,
> upon receiving a br end tag, create a head node and switch to the "in head"
> insertion mode. "in head" will close the head node and move on to the "after
> head" insertion mode. I was expecting "after head" to see the </br> and do
> like it does on a start tag, which is to create a body node and move to the
> "in body" state, but the </br> is just ignored.
> 
> I've changed my implementation of "after head" to handle </br> just like the
> "in head" insertion mode, which is:
> 
>    An end tag whose tag name is "br"
>        Act as described in the "anything else" entry below.
> 
> This results in the following DOM, for the example above:
> 
> <html>
>    <head>
>    <body>
>       <br>
> 
> This matches Internet Explorer and Opera, but not Firefox and Safari. Then
> again, it looks like Firefox and Safari ignore all </br> tags.

Oops, this was an oversight. Fixed.
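
For the record, here is a toy sketch of the flow in question (my own 
simplification in Python, not spec text; the earlier insertion modes and 
all other tokens are elided): once "after head" handles an end tag named 
"br" like "anything else", the body gets created and the token is 
reprocessed in "in body", which treats </br> like a <br> start tag.

    class ToyParser:
        def __init__(self):
            self.opened = []              # flat trace of what gets created
            self.mode = self.before_head  # insertion mode as a function reference

        def before_head(self, token):
            self.opened.append("head")    # create the head, reprocess the token
            self.mode = self.in_head
            self.mode(token)

        def in_head(self, token):
            self.opened.append("/head")   # close the head, reprocess the token
            self.mode = self.after_head
            self.mode(token)

        def after_head(self, token):
            # The added rule: an end tag named "br" acts like "anything else",
            # i.e. create the body and reprocess the token in "in body".
            self.opened.append("body")
            self.mode = self.in_body
            self.mode(token)

        def in_body(self, token):
            if token == ("end tag", "br"):
                self.opened.append("br")  # </br> acts like a <br> start tag

    p = ToyParser()
    p.mode(("end tag", "br"))
    print(p.opened)   # ['head', '/head', 'body', 'br']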


On Thu, 4 Dec 2008, timeless wrote:
>
> if we're both able to get away with ignoring all </br> tags, wouldn't 
> the ideal forward path be to make </br> always ignored?

We can't remove this from quirks mode, and as others have pointed out, the 
fewer differences the better.


On Thu, 4 Dec 2008, Henri Sivonen wrote:
>
> One option would be making the tokenizer check if an end tag has the 
> name 'br' and turn it into a start tag in the tokenizer. This assumes 
> that SVG and MathML won't be able to introduce an element whose local 
> name is 'br' anyway.

I'd rather not make the tokeniser be aware of specific tag names.


On Thu, 4 Dec 2008, Calogero Alex Baldacchino wrote:
> 
> Section 4.5.3 says,
> 
> "|br| elements must be empty. Any content inside |br| elements must not 
> be considered part of the surrounding text."
> 
> The first part is clearly an authoring rule. But the second part can't 
> clearly be an authoring rule as well, because an author might read it as 
> a reference to a parsing rule discarding anything like 
> <br>Something</br> (but it isn't one). Yet it can't be a parsing rule 
> either, since it is in contrast with the "in body" insertion mode (but 
> not only that), which would turn that markup into <br>Something<br>, 
> thus presenting the content to the end user (and obviously it's unlikely 
> anyone visiting a web page would check the html code looking for content 
> to ignore :-P). For the purpose of validation, the first part should be 
> enough (that is, when a </br> end tag is found, an error may be reported 
> to the author). Perhaps the second sentence should be modified with 
> references to scripts (e.g. to say it is wrong to use a br's .innerHTML 
> or .appendChild() to modify the document) and to styles (e.g. to say 
> it's wrong to expect any font property to affect the surrounding text), 
> to make it more clearly an authoring rule? Or perhaps changed into an 
> example of bad markup? Or removed, if it is a source of confusion with 
> parsing rules?

The second sentence can't be a requirement on authors, since that would 
contradict the first requirement. Thus it's a UA requirement.


> Otherwise, I don't follow its meaning (perhaps I'm the only confused 
> one). I mean, as far as I know, XML-derived languages require a closing 
> tag for every element, while HTML has never had such a requirement per 
> se, but that's a matter of syntax, not semantics. And, semantically 
> speaking, whatever follows an element which can't have children (other 
> than its closing tag) obviously consists of one or more siblings of such 
> an element, while its closing tag (again, that's syntax), if misplaced, 
> or not provided for by the syntax rules at all, causes a parse error 
> (which may, or may not, be handled gracefully by the UA; that's a matter 
> of parsing rules). That is, declaring an element as "empty" should imply 
> per se that the element cannot have any descendants, so its content is 
> not... its content, but a syntax error. Perhaps defining the empty 
> content model that way might avoid misunderstandings. Or am I making 
> some mistake?

The requirements have nothing to do with the syntax here; <br> elements 
can end up with contents in a variety of ways, e.g. through the DOM or 
using XML.
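
As a trivial illustration of the XML case (my own Python example, not 
from the thread), any XML parser will happily hand you a br element with 
text inside it, even though that content is non-conforming under the 
authoring rule quoted above; the second sentence tells UAs what to do 
with such content when it occurs anyway:

    import xml.etree.ElementTree as ET

    br = ET.fromstring('<br xmlns="http://www.w3.org/1999/xhtml">oops</br>')
    assert br.text == "oops"   # well-formed XML, yet the br is not empty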


On Thu, 18 Dec 2008, Giovanni Campagna wrote:
> 2008/12/17 Ian Hickson <ian at hixie.ch>
> >
> > This doesn't cost any time in HTML either, since the tokeniser doesn't 
> > need to worry about what tags have end tags, the tree construction 
> > side just drops unexpected end tags on the floor.
>
> I don't think authors expect tags to disappear.

That's possible, but my point was just related to the performance aspect.

We're very constrained by the legacy of text/html's syntax; sadly, 
usability concerns aren't really able to make us change the language.


> > > don't check for insertion modes
> >
> > Having an insertion mode isn't particularly a performance cost. (It 
> > affects code footprint, but that's about it.)
>
> 1) it needs more code (one per insertion mode): more code is always less 
> performance, even if it is just to load a bigger executable

Sure, but in this case the code footprints are the same, in practice, 
according to reports from browser vendors.


> 2) it needs code to select the insertion mode for the next element (when 
> the spec says to reset the insertion mode): in the worst case it has to 
> compare nodeName 18 times

There are implementation strategies that avoid these problems, e.g. using 
jump tables on interned names, and using function pointers instead of 
insertion mode flags.
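
To illustrate (a rough Python sketch of my own, purely illustrative of 
the shape such code can take): the insertion mode can simply be a 
function reference, and start tags can be dispatched through a table 
keyed on interned tag names, so no mode flag needs to be re-tested and 
no chain of nodeName comparisons is needed.

    import sys

    class TreeBuilder:
        def __init__(self):
            # The insertion mode is just "which function handles the next
            # token" -- there is no mode flag for a switch to re-examine.
            self.mode = self.in_body

        def in_body(self, token):
            # "Jump table": a hash lookup on an already-interned name,
            # standing in for what a C or Java parser would do with
            # interned pointers, rather than repeated string comparisons.
            handler = self.START_TAG_HANDLERS.get(token["name"],
                                                  TreeBuilder.other_start_tag)
            handler(self, token)

        def handle_p(self, token):
            print("open a p element")

        def handle_table(self, token):
            print("open a table element and switch modes")
            self.mode = self.in_table   # switching modes is a single assignment

        def in_table(self, token):
            print("in table mode:", token["name"])

        def other_start_tag(self, token):
            print("generic start tag:", token["name"])

    TreeBuilder.START_TAG_HANDLERS = {
        sys.intern("p"): TreeBuilder.handle_p,
        sys.intern("table"): TreeBuilder.handle_table,
    }

    builder = TreeBuilder()
    builder.mode({"name": sys.intern("p")})
    builder.mode({"name": sys.intern("table")})
    builder.mode({"name": sys.intern("td")})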

> > > just parses the input, completely ignoring any semantic or particular 
> > > behaviour associated with any tag. Then, when the DOMElement or 
> > > DOMAttr or DOM-whatever are built, they get the appropriate 
> > > interface (eg. HTMLElement) depending on the namespace.
> >
> > That's the same as HTML.
>
> No it is not. HTML defines special behaviour for the following elements: 
> address, area, article, aside, base, basefont, bgsound, blockquote, 
> body, br, center, col, colgroup, command, datagrid, dd, details, dialog, 
> dir, div, dl, dt, embed, eventsource, fieldset, figure, footer, form, 
> frame, frameset, h1, h2, h3, h4, h5, h6, head, header, hr, iframe, img, 
> input, isindex, li, link, listing, menu, meta, nav, noembed, noframes, 
> noscript, ol, p, param, plaintext, pre, script, section, select, spacer, 
> style, tbody, textarea, tfoot, thead, title, tr, ul, and wbr. I think 
> that is far too many to say that it is like XML

I don't wish to pretend that XML is like HTML; my point was that what you 
said above -- that elements get interfaces depending on the tag -- is the 
same in HTML and XML.


> > There are a number of HTML5 parser implementations, and data suggests 
> > that there is no particular performance gain.
>
> There are no actual HTML5 parser implementations, only HTML4-compatible 
> ones with new syntax. (Are you sure that closed-source browsers really do 
> what is written in the specification?)

As others have noted, there are in fact a variety of actual HTML5 parser 
implementations. Even browsers are working on implementations.


> > There's no guessing in HTML either; all input streams have very 
> > specific and required results.
>
> Actually, there's nothing that really says that <div><p>some 
> text</p><p>some more text</p></div> is more correct than <div><p>some 
> text<p>some more text</p></p></div>

Yes there is; the HTML5 syntax section defines that the latter is wrong 
(as does the HTML4 DTD, for that matter), and the HTML5 parser spec 
defines how conformance checkers are to catch that as a parse error.


> It is only when writing the specification that you guess that the first 
> possibility is what the author thought. You are guessing, not the browser.

The goal is not to guess what the author meant when the author makes a 
mistake; the goal is to have interoperable, predictable, defined behavior 
for all input.


> > Validating code is certainly an important QA point, but once you've 
> > shipped code, the presence of an error is not helpful to the end user. 
> > Often errors in XML files weren't present when the file was created, 
> > but appear later when new text is merged in automatically.
> 
> As Nils pointed out, it is itself an error to have any content be 
> automatically merged into a stream.

I'm not attempting to make any value judgements here, I'm just stating 
where XML errors are most frequently found.


> It is like opening a random file, executing it and expecting no errors. 
> Every input, even from the most trustworthy source, must be parsed for 
> errors and then checked after publishing.

Agreed, but that doesn't mean people do it, or that they catch all errors. 
For example, blogs that use XML often fail to catch invalid UTF-8 bytes 
that, when used, cause the page to be malformed XML.


> And if an end user finds an error, he probably will report it to the 
> owner of the web site, who in turn will report it (quite angrily) to the 
> web designer.

That only works if the site is actively maintained. It's also not clear 
that most users really would report the error. I rarely report errors when 
I come across a site that doesn't work right for whatever reason; I just 
find another site. It thus becomes a competitive advantage to use a 
technology that doesn't expose errors to the user.


> > Well, they've ignored it for the past 7 years, so why would they change?
>
> Nobody told the user that he was browsing a deprecated web site. If something
> like the IE7 information bar (i.e. a non-modal bar, deactivatable and not
> annoying the user, but immediately visible) could appear on a web site sent
> with text/html, I think companies wouldn't like their sites tagged as
> "deprecated" and would port them to application/xhtml+xml in no time (can you
> imagine what "deprecated" could mean on a news web site?)
> And don't forget that the most common browser was IE6, not very
> standards-oriented...

What advantage would there be to a Web browser to tell users that the 
entire Web was deprecated?

Web browser vendors have clearly said they're not interested in doing 
this.


> > Anyway, it isn't clear that we would _want_ to deprecate HTML, even if 
> > we had any real choice in the matter.
> 
> I'm not sure if I understood your sentence (sorry, English is not my 
> mother language). Anyway, you just have to put an "authoring 
> requirement" for text/html
> 
> 1) user agent will just ignore it and implement the HTML algorithm (we 
> don't want to "break the web", using Microsoft terms)
>
> 2) standard-oriented authors will convert their sites to 
> application/xhtml+xml (if they didn't before)
>
> 3) other authors will keep their tag soup (and get their sites 
> yellow-barred)
>
> 4) company owners will make their decision between 2 and 3

Why would we want to get rid of text/html?


> Gradually, n° 3 will disappear, because there's no actual need for HTML.

There's no actual need for XML. :-)


On Sun, 21 Dec 2008, Philipp Kempgen wrote:
> Ian Hickson wrote:
> 
> > Deprecating HTML thus seems like vain effort. (We already tried over 
> > the past few years with XHTML 1.x, and it didn't work.)
> 
> I'd say it _did_ work.  :-)

To a first order of magnitude, nobody uses XHTML. (A few people -- around 
15% last I checked -- label their HTML documents as XHTML, but even those 
documents, if processed as XHTML, likely wouldn't work, for reasons such 
as those given in [1].)

[1] http://hixie.ch/advocacy/xhtml

So I'm not really sure under what measure you would say it worked. Could 
you elaborate on what data and criteria you are basing this evaluation on?


On Sun, 21 Dec 2008, Nils Dagsson Moskopp wrote:
>
> I'd say so too: the worst abominations, such as the <font> element or 
> frames, have disappeared (for new sites, that is) through deprecating 
> them.

That has nothing to do with XHTML.


> Fact: deprecating stuff takes it out of (X)HTML books, and howtos like 
> Selfhtml warn against it, thus ensuring less use by novices.

Tutorials can certainly help guide authors towards best practices, and 
that is why things like <font> aren't in HTML5. That, though, is 
independent of things like XHTML vs HTML.


> Does anyone remember <marquee> ?

<marquee> is used so much (primarily in Asian markets) that all Web 
browsers have been forced to copy IE and support it even though it was 
never in a standard.


On Sun, 21 Dec 2008, Giovanni Campagna wrote:
> 
> As I discovered lately, the main problem of HTML5 is that its design is 
> oriented towards keeping features that are distributed across browsers, 
> that work, or that are a simple way to solve a big problem. Actually, they 
> are a bunch of different features somehow not integrated with one another.
>
> Instead, programmers (please note, I use the word programmer, not author 
> or web designer) developing *new* applications may prefer a more 
> structured and logical organization, like XHTML modularization.

Could you elaborate on how spec design like XHTML modularisation has any 
impact on developers of Web applications? I was under the impression that 
the only benefit was in the development of other specs based on the 
modules (and that only if those needs happened to mesh with the particular 
modules picked).


> [Some] HTML5 features, summed in big groups, [..] can be achieved 
> without any of HTML5, for example
>
> 1) common syntax for the most used datatypes.

I assume you mean in forms?

> 1) use XMLSchema datatypes

It's unclear how XML Schema datatypes would work with HTML forms and how 
they would be better than what we have in forms in HTML5 now.


> 2) additional DOM interfaces, which include HTMLElement - HTMLCollection -
> HTMLFormsControlCollection - HTMLOptionsCollection - DOMTokenList -
> DOMStringMap
>
> 2) you don't need HTMLElement: markup insertion and attribute querying can 
> be done using DOM3Core (which in the latest browsers is even more performant, 
> as no parser is involved), events are far better handled by DOM3Events, and 
> styling is included in the CSSOM
>
> you don't need the collections either: just use appropriate DOMNodeLists, 
> while for DOMStringMap you may use binding-specific features (all Objects 
> are hash maps in ECMAScript3): it works this way even in HTML5

Both HTMLElement and collections are in DOM2 HTML (even DOM1 HTML).

DOMStringMap is basically nothing but a binding-specific feature.


> 3) Elements and Content Models
>
> 3) use XHTML2, which is extensible because modularized

XHTML2 is not backwards compatible, and was a big part of the motivation 
behind starting the HTML5 effort.

Extensibility is an anti-feature -- we specifically don't _want_ people to 
extend HTML without working with the wider community. That way lies 
fragmentation of the language and lack of interoperability. Indeed, what 
little non-centralised extension HTML has seen -- <spacer>, <blink>, 
<marquee> -- has been widely decried as a disaster.


> 4) Element types: metadata - structure - sectioning - grouping - text -
> editing - embedding - table - forms - interactive - scripting elements
> 
> 4) metadata is better handled by the XHTML2 Meta Attributes module, which
> fully integrates the RDF module into any element;

It's not clear that that is better, but that is an open issue that I will 
deal with separately.


> structure, sectioning, grouping are the same;

It's unclear why you think XHTML2's features in this area are better than 
HTML5's. Can you elaborate?


> text is very similar: you don't have time, but you can have <span
> datatype="xsd:date" content="2008-12-21">Today</span> as in HTML5 you have
> <time value="2008-12-21">Today</time>;

Why is that better? It seems far worse.


> for progress and meter semantic you can use role attribute (for styling 
> you always use CSS);

That would have a terrible accessibility story as far as I can tell.


> editing is the same, but you have an attribute instead of an element, so 
> you don't have the issue that ins and del can contain everything, even a 
> whole document (not including <html>);

This is an area where we are mostly just constrained by legacy -- <ins> 
and <del> are from HTML4, not new in HTML5.


> embedding is much more powerful as any element can be replaced by 
> embedded content;

This isn't more powerful, it's more buggy. Just compare <object> with 
<img>. Making things general is something that language designers often 
feel is a good way to solve many problems at once, but usually it just ends 
up not solving any of the problems well. For example, XHTML2 doesn't have 
anything like <video>'s APIs.


> tables are the same (you don't have tables API; but you can still use
> DOM3Core);

Tables in HTML5 are mostly unchanged from HTML4.


> XForms are actually more powerful than WebForms2, since you divide
> presentation from data from action (that is implemented declaratively);

XForms were the original motivation behind HTML5 -- they don't solve the 
problem that HTML5 tries to solve, which is the ability to add new 
features to existing documents.


> interactive elements are not needed at all: details is better implemented as
> it is now (ECMAScript3 + CSS3),

That has a terrible accessibility story.


> datagrid is just a way to put data in a tree model: use plain XML for 
> that;

How would that be used by, for example, GMail? Gmail today doesn't have a 
way to show a list control of all your e-mail without loading all your 
mail; <datagrid> allows it.


> command and a are implemented in XHTML2 on any element using the href 
> attribute; menu is mostly a ul with some style;

I think you misunderstand <command> and <menu>. Does XHTML2 have anything 
for context menus and native toolbars?


> scripting uses XMLEvents and handler: it looks the same, but it is 
> different as it is more event oriented (scripts are not executed by 
> default, they're executed when some event fires)

Scripting in HTML5 is mostly just describing what we have today, which we 
need to keep for backwards-compatibility.


> [5, 6, 7 not listed in original e-mail]
>
> 8) HTML Syntax
>
> 8) as I said before, use XML for that

This doesn't seem to solve many of the problems being faced by developers 
today, and ignores many of our requirements, such as backwards- 
compatibility, and the desire for incremental improvements only.


> What I am asking now is to "modularize HTML": copy those features 
> into separate, interoperable modules, removing legacy features (like 
> the window.on-whatever event listeners)
>
> A copy of those will remain in HTML5, because browsers implement them at 
> the moment, and the HTML5 goal is that all browsers implement the same 
> things in the same ways
> 
> Instead, some web developers in the future will think that a modularized 
> and less redundant API is more usable, like I personally do, and switch 
> to that, without mixing with HTML5: actually, I wonder what a Database 
> API is doing inside HTML.

Some parts of HTML5 are indeed going to be split out into separate specs, 
but unless you know someone who can actually edit these other specs, it's 
not going to happen any time soon.

See also:

   http://lists.w3.org/Archives/Public/public-html/2008Oct/0127.html


On Thu, 18 Dec 2008, Elliotte Harold wrote:
>
> "However, if the element is found within an XSLT transformation sheet 
> (assuming the UA also supports XSLT), then the processor would instead 
> treat the script element as an opaque element that forms part of the 
> transform."
> 
> "transformation sheet" is not a term in common use, and I don't think it 
> appears in the relevant specs. I suggest the action be taken of changing 
> this to either "stylesheet" or "style sheet". (stylesheet is the form in 
> the XSLT 1.0 spec, but both forms with and without space are seen in 
> practice.)

Fixed.

I used "transformation expressed in XSLT", to avoid confusion with the 
term "stylesheet". XSLT 1.0 defines the two terms as equivalent.


On Thu, 18 Dec 2008, Elliotte Harold wrote:
>
> "The nodes representing HTML elements in the DOM must implement, and 
> expose to scripts, the interfaces listed for them in the relevant 
> sections of this specification. This includes HTML elements in XML 
> documents, even when those documents are in another context (e.g. inside 
> an XSLT transform)."
> 
> I find this very questionable. If an XSLT processor is parsing a 
> stylesheet, including a browser-hosted XSLT processor, there is no 
> reason or expectation for it to treat HTML elements specially in the 
> context of the stylesheet.

It's a requirement because of the case where a script gets hold of a node 
in an XSLT transformation and inserts it into an HTML document, removing 
any association with the XSLT transformation. Since it's the same node the 
whole time, it can't change interface or class, yet once it is in the HTML 
document it must behave like an HTML element. Thus it has to have been an 
HTML element the whole time.


> Possibly doing so would lead to violations of the XSLT spec, especially 
> given the error recovery littered throughout the HTML 5 spec. And of 
> course XSLT is just one example. There are others where similar issues 
> may apply.

I don't understand the relevance of the error handling you mention to 
implementing the DOM interfaces. (And why is the error handling any more 
relevant here than the rules that apply even in the absence of errors? I 
don't understand why the distinction matters here.)

Note that the spec very clearly says that the rules in the HTML spec can 
be overridden by other specs, and even mentions XSLT as an example of this, 
as you quote:

> I think something along the lines of section 2.2 would be more 
> reasonable. "Web browsers that support XHTML must process elements and 
> attributes from the HTML namespace found in XML documents as described 
> in this specification, so that users can interact with them, *unless the 
> semantics of those elements have been overridden by other 
> specifications.*"
> 
> What's missing in 3.3.2 is something along the lines of "unless the 
> semantics of those elements have been overridden by other 
> specifications."

Could you give an example of such a spec for this case?


> I'm not sure exactly what language we need here. Maybe something like
> 
> "The nodes representing HTML elements in the DOM must implement, and 
> expose to scripts, the interfaces listed for them in the relevant 
> sections of this specification. This includes HTML elements in XML 
> documents unless those documents are in another context (e.g. inside an 
> XSLT transform)."
> 
> That is, change "even when" to "unless". It would also be helpful here 
> to define exactly what "another context" means. That is, what is the 
> context where the HTML DOM is appropriate and what are its limits?  
> That wasn't clear to me from reading the preceding sections. However 
> whatever those limits are, I think they should stop well short of 
> applying to an XSLT stylesheet.

As noted above, if the interfaces are ever not exposed, then this leads to 
impossible situations or contradictions that would be highly confusing to 
authors.


On Sun, 21 Dec 2008, Kartikaya Gupta wrote:
> Section 8.2.4.1, for the '<' input, says:
> 
> > When the content model flag is set to either the RCDATA state or the 
> > CDATA state and the escape flag is false: switch to the tag open 
> > state.
> 
> I think the lack of commas in this sentence makes it ambiguous: it can 
> either be interpreted as "(cmf == RCDATA or cmf == CDATA) and (escape 
> flag == false)" or "(cmf == RCDATA) or (cmf == CDATA and escape flag == 
> false)". Adding a comma either after "RCDATA state" or after "CDATA 
> state" would fix the ambiguity. Other similar sentences already have 
> commas.

Fixed.


On Sun, 21 Dec 2008, Philip Taylor wrote:
>  On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson <ian at hixie.ch> wrote:
> > On Sat, 20 Dec 2008, Edward Z. Yang wrote:
> >>
> >> 1. Given an input stream that is known to be valid UTF-8, is it 
> >> possible to implement the tokenization algorithm with byte-wise 
> >> operations only? I think it's possible, since all of the character 
> >> matching parts of the algorithm map to characters in ASCII space.
> >
> > Yes. (At least, that's the intent; if you find anything that 
> > contradicts that, please let me know.)
> 
> I think there are some cases where it still should work but you might 
> have to be a little careful - e.g. "<table>foo" notionally results in 
> three parse errors according to the spec (one for each character token 
> which gets foster-parented), so "<table>☹" results in one if you work 
> with Unicode characters but three if you treat each UTF-8 byte as a 
> separate character token.
> 
> But in practice, tokenisers emit sequence-of-many-characters tokens 
> instead of single-character tokens, so they only emit one parse error 
> for "<table>foo", and the html5lib test cases assume that behaviour, and 
> it should work identically if you have sequence-of-many-bytes tokens 
> instead.
> 
> (Apparently only the distinction between 0 and more-than-0 parse errors 
> is important as far as the spec is concerned, since that has an effect 
> on whether the document is conforming; but it seems useful for 
> implementors to share test cases that are precise about exactly where 
> all the parse errors are emitted, since that helps find bugs, and so the 
> parse error count is relevant.)

I considered changing this in the spec, but it doesn't really seem to 
matter. UAs are already allowed to do all kinds of things with parse 
errors that would make these cases indistinguishable.
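
Incidentally, here is a quick sanity check (my own Python, not spec text) 
of the property that byte-wise implementations rely on: in UTF-8, every 
byte of a non-ASCII character has the high bit set, so comparing bytes 
against the ASCII characters the tokeniser cares about ("<", "&", "/", 
letters) can never match in the middle of a multi-byte sequence.

    snippet = "<table>☹".encode("utf-8")
    assert all(b >= 0x80 for b in "☹".encode("utf-8"))
    assert snippet.count(b"<") == 1   # only the real ASCII "<" matches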


On Mon, 22 Dec 2008, Edward Z. Yang wrote:
>
> 8.2.4.4 Close tag open state
>
> The condition here is reaaaally long. Is there any way we can make it 
> shorter?

On Mon, 22 Dec 2008, Henri Sivonen wrote:
> 
> Not really, but it's possible to flatten out the lookahead by adding 
> states so that the condition in each state becomes simpler. (In fact, 
> it's possible to remove lookahead from the tokenizer altogether by 
> adding more states.) See Tokenizer.java in the Validator.nu HTML Parser.

I haven't changed anything here. I'm not sure what is really being 
proposed.


On Mon, 22 Dec 2008, Edward Z. Yang wrote:
> 
> I think EOF should be handled explicitly in the states after we "Consume 
> the U+0023 NUMBER SIGN," since the spec as it stands right now implies 
> that there will always be another character after the number sign. Or am 
> I being a little redundant?

On Mon, 22 Dec 2008, Philip Taylor wrote:
>
> EOF is always treated as if it were a character, e.g. lots of places say 
> "Consume the next input character: ... EOF -> ... Reconsume the EOF 
> character in the data state". If you have "&#" at the end of a file, the 
> next character is the EOF character, which is not 'x' or 'X' and so it 
> is "anything else". So it seems consistent and unambiguous to me.

On Mon, 22 Dec 2008, Edward Z. Yang wrote:
> 
> That seems fair, although most implementations won't have an actual end 
> of file character; they'll be checking their string index to see if 
> they've gone out of bounds. But the spec is internally consistent (I'm 
> just used to seeing an EOF special case on almost every state).

I haven't changed the spec here either, for the reason Philip gave.

(Personally, I would recommend actually using an EOF character, as it makes 
the code much simpler and can in fact make it faster. There are several 
characters you can use to denote an EOF (since several have to be stripped 
early); I personally favour U+0000.)
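
A minimal sketch of what I mean (my own illustration; the sentinel must of 
course be a character that the preprocessing stage has already stripped 
from the real input):

    EOF_CHAR = "\u0000"   # my preference; any early-stripped character works

    def tokenize(text):
        # The replace() stands in for the preprocessing stage here, just
        # to keep the sketch self-contained.
        text = text.replace(EOF_CHAR, "") + EOF_CHAR
        pos, tokens = 0, []
        while True:
            c = text[pos]
            pos += 1
            if c == EOF_CHAR:              # EOF is just another character
                tokens.append(("eof",))
                return tokens
            tokens.append(("character", c))

    print(tokenize("a&#"))   # no bounds checks needed anywhere above

Every state can then treat end-of-input like any other input character, 
with no out-of-bounds checks.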


On Tue, 23 Dec 2008, Kartikaya Gupta wrote:
>
> For the steps under 'A start tag whose tag name is "textarea"' in 
> 8.2.5.10 (in body insertion mode), step 3 seems wrong to me, since step 
> 1 already includes an append operation. As specified now, it will cause 
> two textarea elements to be added (assuming "new element" refers to the 
> textarea).

Fixed. Thanks. (It wouldn't cause two elements, though, it would just cause 
the one element to be inserted twice.)


On Mon, 22 Dec 2008, Edward Z. Yang wrote:
> 
> When I'm consuming a character reference, when does the ampersand get 
> consumed? This doesn't seem to be obvious from the documentation, which 
> talks of consuming character references and number hash signs, but never 
> the ampersand.

On Tue, 23 Dec 2008, Philip Taylor wrote:
> 
> They're consumed in the state that comes before the character
> reference state, e.g.:
> 
>   "8.2.4.1 Data state
>   Consume the next input character:
>    -> U+0026 AMPERSAND (&) ... switch to the character reference data state."

No change, based on Philip's comment.


On Mon, 22 Dec 2008, Edward Z. Yang wrote:
>
> "in the range 0x0000 to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 
> 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF"
> 
> U+000B is not a range.

While this is technically true, I don't really see a better way to phrase 
this that isn't verbose (e.g. "ranges and codepoints" or some such).

If it helps, consider the whole set of subranges and code points to be a 
single discontinuous range, hence the use of the singular "range". :-)
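
If it helps further, here is one way (my own illustration, not spec text) 
to represent that discontinuous range in code, with lone code points such 
as U+000B as degenerate one-code-point subranges:

    FORBIDDEN = [
        (0x0000, 0x0008),
        (0x000B, 0x000B),   # the lone code point in question
        (0x000E, 0x001F),
        (0x007F, 0x009F),
        (0xD800, 0xDFFF),
        (0xFDD0, 0xFDDF),
    ]

    def in_forbidden_range(codepoint):
        return any(lo <= codepoint <= hi for lo, hi in FORBIDDEN)

    assert in_forbidden_range(0x000B) and not in_forbidden_range(0x0041)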


On Tue, 23 Dec 2008, Edward Z. Yang wrote:
>
> In section 8.2.4.26 the spec says:
> 
> > If the next six characters are an ASCII case-insensitive match for the 
> > word "PUBLIC", then consume those characters and switch to the before 
> > DOCTYPE public identifier state.
> 
> The P has already been consumed at the beginning of this section. Thus, 
> I believe it should read:
> 
> If this character and the next five characters are an ASCII 
> case-insensitive match for the word "PUBLIC", etc.
> 
> Same goes for the match for SYSTEM.

You're still checking the next input character at that point, so "P" is 
still the "next input character", so the next six are "PUBLIC".

At least, that's how I'm defending what the spec says. :-)
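
In code terms, the pragmatic reading is simply a case-insensitive 
comparison of a six-character window that starts at the character the 
state is currently looking at (my own sketch, not spec text):

    def next_chars_match(stream, pos, keyword):
        # stream[pos] is the "next input character" being examined.
        return stream[pos:pos + len(keyword)].upper() == keyword.upper()

    doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'
    assert next_chars_match(doctype, doctype.index(" P") + 1, "public")

If the window matches, those six characters are consumed and the tokeniser 
moves on to the before DOCTYPE public identifier state.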

In practice I think having the text be clear ("PUBLIC") is less confusing 
than having it be pedantic ("P" and "UBLIC" or "this and the next five" or 
some such). It's not like people are going to assume the spec is allowing 
"XPUBLIC" or "*PUBLIC" and so forth, right?

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

