[whatwg] Comments on Web Forms 2.0

Fri Aug 27 02:25:46 PDT 2004

On Sun, 22 Aug 2004, Henri Sivonen wrote:
> > > >
> > > > 2.5. Extensions to file upload controls
> > > 
> > > >     * UAs should use the list of acceptable types in constructing a
> > > > filter
> > > > for a file picker, if one is provided to the user.
> > > 
> > > That feature is not likely to be reliably implementable considering that
> > > real-world systems do not have comprehensive ways of mapping between file
> > > system type data and MIME types.
> > 
> > I am told modern systems do, now.
> 
> Which modern systems?

Windows, Mac, Gnome, etc.
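
As a rough illustration of the mapping involved, here is a minimal Python
sketch that turns an accept list into filename extensions for a file
picker filter. It relies on the standard mimetypes module, which consults
whatever type database the platform provides. The function name
extensions_for_accept is hypothetical, and the expansion of wildcards such
as "image/*" is an assumption about how a UA might choose to do it.

```python
import mimetypes


def extensions_for_accept(accept):
    """Return sorted filename extensions for a comma-separated accept list."""
    mimetypes.init()  # load the platform's type database, if any
    wanted = [t.strip().lower() for t in accept.split(",") if t.strip()]
    found = set()
    for mime in wanted:
        if mime.endswith("/*"):
            # Assumed wildcard handling: "image/*" matches every known image type.
            prefix = mime[:-1]  # "image/*" -> "image/"
            for ext, known in mimetypes.types_map.items():
                if known.startswith(prefix):
                    found.add(ext)
        else:
            found.update(mimetypes.guess_all_extensions(mime))
    return sorted(found)


if __name__ == "__main__":
    print(extensions_for_accept("image/png, text/*"))
```

The result is only as complete as the type database on the machine, which
is the reliability concern raised above.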

> > > >     For text input controls it specifies the maximum length of the
> > > > input, in
> > > > terms of numbers of characters. For details on counting string lengths,
> > > > see
> > > > [CHARMOD].
> > > 
> > > Should UAs use NFC for submissions?
> > 
> > I don't know, should they?
> 
> I am inclined to think that NFC SHOULD be used in order to accommodate
> transitional systems that treat Unicode as "wide ASCII". For example, a
> server-side system written in PHP4 may not have Unicode normalization
> facilities available to it and might send the data to Mozilla later. If a UA
> had posted content in NFD to the server and the server naïvely sent the
> content to the OS X version of Mozilla, text in common European languages
> would break in an ugly way.
> 
> I would hesitate to make NFC a MUST, though, because I don't know whether small
> devices can hold the data that is needed in order to carry out Unicode
> normalization. Requiring desktop apps to normalize shouldn't be a big deal. At
> least OS X and Gnome provide normalization facilities and ICU can be thrown in
> as a cross-platform solution.
> 
> In any case, robust server-side systems should not trust that the input is in
> a particular normalization form and should normalize the data themselves. The
> point is accommodating systems that are not robust.

Ok, NFC and SHOULD it is.
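
For a concrete picture of what that asks of a UA, here is a minimal Python
sketch that normalizes a value to NFC before submission, using the
standard unicodedata module. The function name normalize_for_submission
is hypothetical; where in the submission pipeline a real UA would do this
is not specified here.

```python
import unicodedata


def normalize_for_submission(value):
    """Return the value in Normalization Form C, ready to be encoded and sent."""
    return unicodedata.normalize("NFC", value)


if __name__ == "__main__":
    # 'a' followed by COMBINING DIAERESIS, as an NFD-producing system would emit it.
    decomposed = "a\u0308iti"
    composed = normalize_for_submission(decomposed)
    print(len(decomposed), len(composed))
```

A server that wants to be robust would apply the same call to incoming
data rather than trusting the UA.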

> > > > To prevent an attribute from being processed in this way, put a 
> > > > non-breaking zero-width space character (&#xFEFF;) at the start of 
> > > > the attribute.
> > > 
> > > Isn't the use of that char as anything but the BOM deprecated or at 
> > > least considered harmful?
> > 
> > Arguably, it _is_ a BOM here.
> > 
> > I'm not overly fond of this either, but it's the only solution I could 
> > find that was relatively harmless (the BOM can always be dropped at 
> > the start of strings)
> 
> Exactly. Which is why tools used for generating the page might drop it 
> on the server!

That's fine. When put at the start of the string, it should be dropped. 

> Actually, I am distributing one such tool myself. Is the tool broken?
> http://iki.fi/hsivonen/php-utf8/

It depends. If it drops the BOM in the middle of the string, then yes.

I expect this to be used so that you first output the attribute with this 
"BOM", then the user-derived string, then the rest of the document:

   ...
   print("<input value=\"\xFEFF");
   print(escape(data));
   print("\">");
   ...
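
The same idea as a minimal runnable Python sketch, assuming the page is
built as a Unicode string and encoded later. Note that in Python the
character has to be written "\ufeff"; the helper names are hypothetical.
The second function shows the kind of BOM stripping that is harmless
here: it only touches the very start of the string.

```python
import html

BOM = "\ufeff"  # U+FEFF as a single code point


def protected_value_attribute(data):
    """Build an input tag whose value starts with U+FEFF, then escaped user data."""
    return '<input value="' + BOM + html.escape(data, quote=True) + '">'


def strip_leading_bom(text):
    """Drop U+FEFF only at the very start of the string, never in the middle."""
    return text[1:] if text.startswith(BOM) else text


if __name__ == "__main__":
    page = protected_value_attribute('user "data"')
    # The U+FEFF inside the attribute is not at the start of the page,
    # so a tool that strips only a leading BOM leaves it alone.
    print(strip_leading_bom(page) == page)
```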

> My immediate thought is ZWNJ, but I'm not sure if using it is a good 
> idea.

I think that would be worse than the BOM.

> > > > Note that a string containing the codepoint's value itself (for 
> > > > example, the six-character string "U+263A" or the seven-character 
> > > > string "&#9786;") is not considered to be human readable and must 
> > > > not be used as a transliteration.
> > > 
> > > Do you expect UAs that already do this to change their behavior with 
> > > the legacy submission types?
> > 
> > We can hope.
> 
> FWIW, there may be CMS input form handlers that expect the prohibited 
> behavior. I have been involved in developing one myself. (Not that I 
> recommend relying on such things. Obviously, UTF-8 is the way to go.)

Yeah. Google, for one. I've also seen login forms where people typed in 
characters not in the form's submission set, and thus got a username that 
was not the one they thought it was, so when they switched to another UA 
that did things differently, it broke. It's madness.
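
To make the ambiguity concrete, here is a small Python sketch of the
prohibited behavior. It assumes ISO-8859-1 as the form's submission
character set, and it uses Python's "xmlcharrefreplace" error handler as
a stand-in for what some UAs do; the function name legacy_encode is
hypothetical.

```python
def legacy_encode(value, charset="iso-8859-1"):
    """Encode a form value, replacing unencodable characters with numeric references."""
    return value.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    typed_smiley = legacy_encode("user\u263a")    # the user typed U+263A
    typed_literal = legacy_encode("user&#9786;")  # the user typed seven characters
    # Both arrive at the server as the same bytes.
    print(typed_smiley == typed_literal)
```

Once the two inputs collapse to the same bytes, the server cannot tell
which username was actually typed.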

> > > > which has a root element named "submission", with no prefix, 
> > > > defining a default namespace 
> > > > uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32.
> > > 
> > > I think that is an inappropriate attempt to micromanage the 
> > > syntactic details that are in the realm of a lower-level spec. I 
> > > think the submission format should either allow all the syntactic 
> > > sugar that comes with Namespaces in XML or be layered directly on 
> > > top of XML 1.0 without namespace support.
> > 
> > The reason it is micromanaged is to make it possible to use either a 
> > pure XML 1.0 parser _or_ an XML 1.0 with namespaces parser on the 
> > server side without getting into any complications.
> 
> I was able to guess that that was the rationale behind the requirement. 
> But why is the ability to use a namespace-unaware XML processor a 
> requirement? The only reason I can come up with is that PHP4 is borked 
> by default but widely used.

There are various people using non-namespace-aware parsers. I don't really 
want to force namespace-aware parsing when in fact the document is anyway 
guaranteed to only have one namespace.

> Processing namespaced XML with tools that don't support namespaces is 
> clueless and just plain wrong. If tools that don't support namespaces 
> are to be accommodated, wouldn't the natural way be to spec that the 
> elements are not in a namespace and the namespace processing layer is 
> not used? That way you wouldn't endorse behavior that is clueless and 
> just plain wrong.

It's actually more the other way around. This is a non-namespaced 
document, but to accommodate people who are going to be using it in 
namespace-aware environments, possibly merging it into other documents, 
etc, it makes sense to actually give it a namespace.

For example, the same data format is later used for seeding forms. If on 
the server you stack the data into a huge XML file containing other data 
too, it would make sense to be able to just yank out that namespaced 
subtree and just use it for preseeding too.
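
A minimal Python sketch of why the fixed, unprefixed form works for both
kinds of consumer. It reads one document with expat with namespace
processing switched off, and again with ElementTree, which is namespace
aware. The "field" child element and its attribute are hypothetical
sample content, as are the helper names.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

NS = "uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32"

# The <field> child and its attribute are hypothetical sample content.
DOC = '<submission xmlns="' + NS + '"><field name="q">test</field></submission>'


def names_without_namespaces(document):
    """Collect element names as a parser with no namespace processing sees them."""
    names = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace_separator given
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(document, True)
    return names


def names_with_namespaces(document):
    """Collect element names as a namespace-aware parser sees them."""
    return [element.tag for element in ET.fromstring(document).iter()]


if __name__ == "__main__":
    print(names_without_namespaces(DOC))
    print(names_with_namespaces(DOC))
```

Because there is no prefix, the namespace-unaware reader can match on the
plain name "submission", while the namespace-aware reader can still pick
the subtree out of a larger document by its namespace.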

> 1) The current best practice for dispatching on the type of an XML 
> document is dispatching on the namespace. If there was no namespace, one 
> would have to fall back on dispatching on the content type. This is not 
> a real problem with this particular vocabulary because this vocabulary 
> has a distinct content type from the start.

It does during submission. But when the data is flying about after 
submission, who knows.

> 2) You couldn't mix the vocabulary with other vocabularies using 
> namespaces. This is a theoretical problem but probably not a real one, 
> because the vocabulary is limited to a specific case of client-server 
> interaction.

It's only limited _if_ it doesn't have a namespace.

Also, it is later used for preseeding forms.

> Besides, the way you limit the use of namespaces in the current spec 
> language would also preclude creative augmentations to the submission 
> vocabulary.

Well, extensions would be non-compliant, yes. But at least there is a 
clear mechanism for experimentation.

> > > > but must include a BOM.
> > > 
> > > I think that is not a legitimate requirement when UTF-8 is used.
> > 
> > Why not?
> 
> It is a requirement that applies to the XML serialization, but the 
> requirement is not present in the XML spec. The requirement would mean 
> that you could not use any arbitrary but conforming XML serializer.
> 
> The use of the BOM as a UTF-8 signature is a Microsoftism that was only 
> allowed in XML 1.0 second edition, because fighting Microsoft text 
> editors would have been futile. Still, if you pick a non-Microsoft XML 
> serializer off the shelf, chances are it does not emit a BOM in the 
> UTF-8 mode.
> 
> Is there a good reason to limit the use of arbitrary but conforming 
> off-the-shelf XML serializers?

I guess that makes sense. And the BOM isn't really needed anyway. Ok, I've 
made it optional for UTF-8.
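
On the receiving side, tolerating both forms is cheap. A minimal Python
sketch, assuming the server gets the payload as raw bytes; the function
name decode_submission is hypothetical. Python's "utf-8-sig" codec skips
a leading BOM if one is present and otherwise behaves like plain UTF-8.

```python
import codecs


def decode_submission(payload):
    """Decode a UTF-8 payload, accepting an optional leading BOM."""
    return payload.decode("utf-8-sig")


if __name__ == "__main__":
    body = '<submission xmlns="uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32"/>'
    plain = body.encode("utf-8")
    with_bom = codecs.BOM_UTF8 + plain
    print(decode_submission(plain) == decode_submission(with_bom))
```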

> > > > UAs may use either CDATA blocks, entities, or both in escaping the 
> > > > contents of attributes and elements, as appropriate.
> > > 
> > > In order not to imply that this spec could restrict the ways 
> > > characters are escaped, that sentence should be a note rather than 
> > > part of the normative prose. (Of course, only the pre-defined 
> > > entities are available. Then there are NCRs.)
> > 
> > This spec _could_ restrict the ways characters are escaped. It needs 
> > to not be a note so that the "may" has normative value. No?
> 
> The spec could restrict the escaping in the same sense the HTTP spec could 
> restrict how you choose TCP sequence numbers.
> 
> In general, please see section 4.3 of RFC 3470.

Yes, indeed. That's why WF2 specifically _doesn't_ restrict this.
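
To show that the choice makes no difference to a conforming consumer,
here is a minimal Python sketch that parses the same content escaped four
ways. The "field" element name is hypothetical sample content, and
field_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

VARIANTS = [
    "<field>a &lt; b &amp; c</field>",          # predefined entities
    "<field><![CDATA[a < b & c]]></field>",     # CDATA block
    "<field>a &#60; b &#x26; c</field>",        # numeric character references
    "<field>a &lt; <![CDATA[b & c]]></field>",  # entities and CDATA mixed
]


def field_text(document):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    print({field_text(variant) for variant in VARIANTS})
```

All four variants yield the same character data once parsed, so a server
using a real XML parser never sees which escaping the UA picked.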

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'