[whatwg] Comments on Web Forms 2.0

Henri Sivonen hsivonen at iki.fi
Sun Aug 22 07:32:50 PDT 2004

On Aug 17, 2004, at 16:37, Ian Hickson wrote:

> On Tue, 13 Jul 2004, Henri Sivonen wrote:
>>> 2.5. Extensions to file upload controls
>>>     * UAs should use the list of acceptable types in constructing a
>>>       filter for a file picker, if one is provided to the user.
>> That feature is not likely to be reliably implementable considering
>> that real-world systems do not have comprehensive ways of mapping
>> between file system type data and MIME types.
> I am told modern systems do, now.

Which modern systems?

>>>     For text input controls it specifies the maximum length of the
>>> input, in terms of numbers of characters. For details on counting
>>> string lengths, see [CHARMOD].
>> Should UAs use NFC for submissions?
> I don't know, should they?

I am inclined to think that NFC SHOULD be used in order to accommodate 
transitional systems that treat Unicode as "wide ASCII". For example, a 
server-side system written in PHP4 may not have Unicode normalization 
facilities available to it and might send the data to Mozilla later. If 
a UA had posted content in NFD to the server and the server naïvely 
sent the content to the OS X version of Mozilla, text in common 
European languages would break in an ugly way.

I would hesitate to make NFC a MUST, though, because I don't know 
whether small devices can hold the data that is needed in order to 
carry out Unicode normalization. Requiring desktop apps to normalize 
shouldn't be a big deal. At least OS X and Gnome provide normalization 
facilities and ICU can be thrown in as a cross-platform solution.

In any case, robust server-side systems should not trust that the input 
is in a particular normalization form and should normalize the data 
themselves. The point is accommodating systems that are not robust.
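
To make the normalization and length-counting points concrete, here is a 
minimal sketch in Python using the standard unicodedata module. The 
language, the sample strings and the helper name normalize_submission 
are assumptions for illustration only; nothing in the draft prescribes 
them.

```python
import unicodedata


def normalize_submission(value: str) -> str:
    """Hypothetical server-side helper: coerce incoming form data to NFC."""
    return unicodedata.normalize("NFC", value)


nfc = "\u00e4"   # a-umlaut as one precomposed code point
nfd = "a\u0308"  # 'a' followed by COMBINING DIAERESIS

# The two look the same to a reader but differ as code point sequences,
# so a naive comparison or a character count disagrees between them.
assert nfc != nfd
assert (len(nfc), len(nfd)) == (1, 2)

# Once the server normalizes, both submissions compare equal.
assert normalize_submission(nfd) == nfc
assert normalize_submission(nfc) == nfc
```

A system that skips the normalization step would store and echo back 
whichever form the UA happened to send.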

>>> To prevent an attribute from being processed in this way, put a
>>> non-breaking zero-width space character (U+FEFF) at the start of
>>> the attribute.
>> Isn't the use of that char as anything but the BOM deprecated or at
>> least considered harmful?
> Arguably, it _is_ a BOM here.
> I'm not overly fond of this either, but it's the only solution I could
> find that was relatively harmless (the BOM can always be dropped at the
> start of strings)

Exactly. Which is why tools used for generating the page might drop it 
on the server!

Actually, I am distributing one such tool myself. Is the tool broken?
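
As a sketch of the failure mode I have in mind (Python, with a 
hypothetical clean_string helper standing in for whatever a page 
generation tool might do; it is not the code of any particular tool):

```python
BOM = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark


def clean_string(value: str) -> str:
    """Hypothetical tool behavior: drop a BOM found at the start of a string."""
    return value[1:] if value.startswith(BOM) else value


# An attribute value escaped the way the draft suggests.
escaped = BOM + "literal value"

# The tool removes the marker, so the UA would never see the escape.
assert clean_string(escaped) == "literal value"

# Python's own "utf-8-sig" codec does the same when decoding bytes.
assert escaped.encode("utf-8").decode("utf-8-sig") == "literal value"
```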

> and yet did the job. Better suggestions are welcome
> though.

My immediate thought is ZWNJ, but I'm not sure if using it is a good 

>>> Note that a string containing the codepoint's value itself (for
>>> example, the six-character string "U+263A" or the seven-character
>>> string "&#9786;") is not considered to be human readable and must
>>> not be used as a transliteration.
>> Do you expect UAs that already do this to change their behavior with
>> the legacy submission types?
> We can hope.

FWIW, there may be CMS input form handlers that expect the prohibited 
behavior. I have been involved in developing one myself. (Not that I 
recommend relying on such things. Obviously, UTF-8 is the way to go.)
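
For illustration, this is roughly the round trip such a handler relies 
on, sketched in Python with the standard library. The choice of 
ISO-8859-1 as the legacy form encoding is an assumption for the example.

```python
import html

smiley = "\u263a"  # WHITE SMILING FACE, not representable in ISO-8859-1

# Sender side: the unencodable character is replaced by a numeric
# character reference, which is the behavior the draft prohibits.
encoded = smiley.encode("iso-8859-1", errors="xmlcharrefreplace")
assert encoded == b"&#9786;"

# Receiver side: a handler that expects this unescapes the reference.
decoded = html.unescape(encoded.decode("iso-8859-1"))
assert decoded == smiley
```

The handler cannot tell this apart from a user who literally typed the 
seven characters, which is one more reason to prefer UTF-8.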

>>> which has a root element named "submission", with no prefix,
>>> defining a default namespace
>>> uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32.
>> I think that is an inappropriate attempt to micromanage the syntactic
>> details that are in the realm of a lower-level spec. I think the
>> submission format should either allow all the syntactic sugar that
>> comes with Namespaces in XML or be layered directly on top of XML 1.0
>> without namespace support.
> The reason it is micromanaged is to make it possible to use either a
> pure XML 1.0 parser _or_ an XML 1.0 with namespaces parser on the
> server side without getting into any complications.

I was able to guess that that was the rationale behind the requirement. 
But why is the ability to use a namespace-unaware XML processor a 
requirement? The only reason I can come up with is that PHP4 is borked 
by default but widely used.

Processing namespaced XML with tools that don't support namespaces is 
clueless and just plain wrong. If tools that don't support namespaces 
are to be accommodated, wouldn't the natural way be to spec that the 
elements are not in a namespace and the namespace processing layer is 
not used? That way you wouldn't endorse behavior that is clueless and 
just plain wrong.

I can see three problems with namespacelessness:
1) The current best practice for dispatching on the type of an XML 
document is dispatching on the namespace. If there was no namespace, 
one would have to fall back on dispatching on the content type. This is 
not a real problem with this particular vocabulary because this 
vocabulary has a distinct content type from the start.

2) You couldn't mix the vocabulary with other vocabularies using 
namespaces. This is a theoretical problem but probably not a real one, 
because the vocabulary is limited to a specific case of client-server 
interaction. Besides, the way you limit the use of namespaces in the 
current spec language would also preclude creative augmentations to the 
submission vocabulary.

3) You intend to submit the spec to a consortium that shall not be 
named and you know the powers that be in the consortium that shall not 
be named would veto any spec that builds directly on top of XML 1.0 
without the namespace layer in between.

So of the three problems only the last one is significant and it is a 
political problem and not a technical one. Sadly, political problems 
may be more difficult to overcome than technical problems.
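
To show what dispatching on the namespace looks like in practice, and 
why a namespace-unaware check is fragile, here is a minimal sketch using 
Python's xml.etree.ElementTree. The function names are hypothetical and 
the string-prefix check is a deliberately naive stand-in for a tool 
without namespace support.

```python
import xml.etree.ElementTree as ET

SUBMISSION_NS = "uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32"


def is_submission(document: str) -> bool:
    """Namespace-aware check: compare the expanded name of the root."""
    root = ET.fromstring(document)
    return root.tag == "{%s}submission" % SUBMISSION_NS


def naive_is_submission(document: str) -> bool:
    """Namespace-unaware check: look at the literal element name only."""
    return document.startswith("<submission")


# The form the draft mandates: no prefix, default namespace on the root.
plain = '<submission xmlns="%s"/>' % SUBMISSION_NS
# An equivalent document as far as Namespaces in XML is concerned.
prefixed = '<s:submission xmlns:s="%s"/>' % SUBMISSION_NS

assert is_submission(plain) and is_submission(prefixed)
assert naive_is_submission(plain) and not naive_is_submission(prefixed)
```

The second assertion is the case the micromanagement is meant to rule 
out: the two documents are the same to a namespace-aware processor but 
not to a naive one.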

>>> but must include a BOM.
>> I think that is not a legitimate requirement when UTF-8 is used.
> Why not?

It is a requirement that applies to the XML serialization, but the 
requirement is not present in the XML spec. The requirement would mean 
that you could not use any arbitrary but conforming XML serializer.

The use of the BOM as a UTF-8 signature is a Microsoftism that was only 
allowed in XML 1.0 second edition, because fighting Microsoft text 
editors would have been futile. Still, if you pick a non-Microsoft XML 
serializer off the shelf, chances are it does not emit a BOM in the 
UTF-8 mode.

Is there a good reason to limit the use of arbitrary but conforming 
off-the-shelf XML serializers?
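
As one data point, here is a small Python sketch with the standard 
library serializer. The "field" element and its "name" attribute are 
made up for the example and are not taken from the draft.

```python
import codecs
import xml.etree.ElementTree as ET

# Build a tiny document; the element and attribute names are hypothetical.
root = ET.Element("submission")
ET.SubElement(root, "field", name="q").text = "\u263a"

# Serialize in UTF-8 mode; the serializer does not write a BOM.
data = ET.tostring(root, encoding="utf-8")
assert not data.startswith(codecs.BOM_UTF8)

# Meeting a "must include a BOM" rule needs an extra step outside the
# serializer, which is exactly the kind of special-casing I mean.
with_bom = codecs.BOM_UTF8 + data
assert with_bom.startswith(codecs.BOM_UTF8)
```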

>>> UAs may use either CDATA blocks, entities, or both in escaping the
>>> contents of attributes and elements, as appropriate.
>> In order not to imply that this spec could restrict the ways
>> characters are escaped, that sentence should be a note rather than
>> part of the normative prose. (Of course, only the pre-defined
>> entities are available. Then there are NCRs.)
> This spec _could_ restrict the ways characters are escaped. It needs to
> not be a note so that the "may" has normative value. No?

The spec could restrict the escaping in the same sense the HTTP spec could 
restrict how you choose TCP sequence numbers.
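
To illustrate why the choice of escaping belongs to the XML layer, here 
is a minimal Python sketch (the "field" element name is an assumption). 
A conforming parser hands the application the same text whichever 
escaping the UA picked.

```python
import xml.etree.ElementTree as ET

variants = [
    "<field><![CDATA[a < b & c]]></field>",  # CDATA section
    "<field>a &lt; b &amp; c</field>",       # pre-defined entities
    "<field>a &#60; b &#38; c</field>",      # numeric character references
]

# Parse each variant and collect the character data the application sees.
texts = [ET.fromstring(v).text for v in variants]

# All three serializations carry identical content.
assert texts == ["a < b & c"] * 3
```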

In general, please see section 4.3 of RFC 3470.

Henri Sivonen
hsivonen at iki.fi
