[html5] r6648 - [e] (0) Try to tidy up some more of the Unicode/code unit mess with a probably o [...]
whatwg at whatwg.org
Thu Oct 6 16:24:39 PDT 2011
Author: ianh
Date: 2011-10-06 16:24:38 -0700 (Thu, 06 Oct 2011)
New Revision: 6648
Modified:
complete.html
index
source
Log:
[e] (0) Try to tidy up some more of the Unicode/code unit mess with a probably over-reaching definition (there are over 2000 uses of the word 'character' in the text, so I didn't check that all of them use this new definition... hopefully it works out; otherwise, we'll just have to try something else again).
Fixing http://www.w3.org/Bugs/Public/show_bug.cgi?id=13676
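Illustration (not part of the commit): the new definition of "character" can be read as a rule for walking a string of 16-bit code units. The sketch below is one possible reading, not text from the specification. The function names splitIntoCharacters and codePointLength are hypothetical and chosen only for this example; codePointLength mirrors the term as the diff defines it, which counts code units.

```typescript
// Hypothetical helper: split a string of UTF-16 code units into
// "characters" in the sense of the new definition. A high surrogate
// immediately followed by a low surrogate counts as one character;
// any other surrogate is isolated and counts as one character by itself.
function splitIntoCharacters(s: string): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < s.length) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      const isLow = next >= 0xdc00 && next <= 0xdfff;
      if (isLow) {
        // Surrogate pair: both code units form a single character.
        out.push(s.slice(i, i + 2));
        i += 2;
        continue;
      }
    }
    // Non-surrogate code unit, or an isolated surrogate.
    out.push(s.charAt(i));
    i += 1;
  }
  return out;
}

// Hypothetical helper: the "code-point length" as worded in the diff,
// i.e. the number of code units in the string.
function codePointLength(s: string): number {
  return s.length;
}
```

Under this reading, a string holding "a", one surrogate pair, and "b" has three characters but four code units, and a low surrogate followed by a high surrogate is two isolated characters rather than a pair.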
Modified: complete.html
===================================================================
--- complete.html 2011-10-06 22:41:30 UTC (rev 6647)
+++ complete.html 2011-10-06 23:24:38 UTC (rev 6648)
@@ -3365,11 +3365,27 @@
<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
+ <p>The term <dfn id=character>character</dfn>, when not qualified as
+ <em>Unicode</em> character, means a <a href=#unicode-character>Unicode character</a>
+ where possible, or a surrogate code point when not: when an
+ algorithm that processes strings is defined in terms of characters,
+ a pair of <span title="code unit">code units</span> consisting of a
+ high surrogate followed by a low surrogate must be treated as a
+ single character, but isolated surrogates must each be treated as a
+ single character also.</p>
+ <p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
+ <span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
+ <p class=note>This complexity results from the historical decision
+ to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+ unit">code units</span>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
+
+
+
<h3 id=conformance-requirements><span class=secno>2.2 </span>Conformance requirements</h3>
<p>All diagrams, examples, and notes in this specification are
@@ -4457,9 +4473,6 @@
whitespace</dfn> from a string, the user agent must remove all <a href=#space-character title="space character">space characters</a> that are at the
start or end of the string.</p>
- <p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
- <span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
-
<p>When a user agent has to <dfn id=strictly-split-a-string>strictly split a string</dfn> on a
particular delimiter character <var title="">delimiter</var>, it
must use the following algorithm:</p>
@@ -33917,9 +33930,9 @@
</ol><p>The <dfn id=webvtt-cue-text-tokenizer>WebVTT cue text tokenizer</dfn> is as follows. It emits
a token, which is either a string (whose value is a sequence of
- Unicode characters), a start tag (with a tag name, a list of
- classes, and optionally an annotation), an end tag (with a tag
- name), or a timestamp tag (with a tag value).</p>
+ characters), a start tag (with a tag name, a list of classes, and
+ optionally an annotation), an end tag (with a tag name), or a
+ timestamp tag (with a tag value).</p>
<ol><li><p>Let <var title="">input</var> and <var title="">position</var> be the same variables as those of the same
name in the algorithm that invoked these steps.</li>
Modified: index
===================================================================
--- index 2011-10-06 22:41:30 UTC (rev 6647)
+++ index 2011-10-06 23:24:38 UTC (rev 6648)
@@ -3365,11 +3365,27 @@
<p>The term <dfn id=unicode-character>Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
is not a surrogate code point). <a href=#refsUNICODE>[UNICODE]</a></p>
+ <p>The term <dfn id=character>character</dfn>, when not qualified as
+ <em>Unicode</em> character, means a <a href=#unicode-character>Unicode character</a>
+ where possible, or a surrogate code point when not: when an
+ algorithm that processes strings is defined in terms of characters,
+ a pair of <span title="code unit">code units</span> consisting of a
+ high surrogate followed by a low surrogate must be treated as a
+ single character, but isolated surrogates must each be treated as a
+ single character also.</p>
+ <p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
+ <span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
+ <p class=note>This complexity results from the historical decision
+ to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+ unit">code units</span>, rather than in terms of <a href=#unicode-character title="Unicode character">Unicode characters</a>.</p>
+
+
+
<h3 id=conformance-requirements><span class=secno>2.2 </span>Conformance requirements</h3>
<p>All diagrams, examples, and notes in this specification are
@@ -4457,9 +4473,6 @@
whitespace</dfn> from a string, the user agent must remove all <a href=#space-character title="space character">space characters</a> that are at the
start or end of the string.</p>
- <p>The <dfn id=code-point-length>code-point length</dfn> of a string is the number of
- <span title="code unit">code units</span> in that string. <a href=#refsWEBIDL>[WEBIDL]</a></p>
-
<p>When a user agent has to <dfn id=strictly-split-a-string>strictly split a string</dfn> on a
particular delimiter character <var title="">delimiter</var>, it
must use the following algorithm:</p>
@@ -33917,9 +33930,9 @@
</ol><p>The <dfn id=webvtt-cue-text-tokenizer>WebVTT cue text tokenizer</dfn> is as follows. It emits
a token, which is either a string (whose value is a sequence of
- Unicode characters), a start tag (with a tag name, a list of
- classes, and optionally an annotation), an end tag (with a tag
- name), or a timestamp tag (with a tag value).</p>
+ characters), a start tag (with a tag name, a list of classes, and
+ optionally an annotation), an end tag (with a tag name), or a
+ timestamp tag (with a tag value).</p>
<ol><li><p>Let <var title="">input</var> and <var title="">position</var> be the same variables as those of the same
name in the algorithm that invoked these steps.</li>
Modified: source
===================================================================
--- source 2011-10-06 22:41:30 UTC (rev 6647)
+++ source 2011-10-06 23:24:38 UTC (rev 6648)
@@ -2242,8 +2242,26 @@
is not a surrogate code point). <a
href="#refsUNICODE">[UNICODE]</a></p>
+ <p>The term <dfn>character</dfn>, when not qualified as
+ <em>Unicode</em> character, means a <span>Unicode character</span>
+ where possible, or a surrogate code point when not: when an
+ algorithm that processes strings is defined in terms of characters,
+ a pair of <span title="code unit">code units</span> consisting of a
+ high surrogate followed by a low surrogate must be treated as a
+ single character, but isolated surrogates must each be treated as a
+ single character also.</p>
+ <p>The <dfn>code-point length</dfn> of a string is the number of
+ <span title="code unit">code units</span> in that string. <a
+ href="#refsWEBIDL">[WEBIDL]</a></p>
+ <p class="note">This complexity results from the historical decision
+ to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+ unit">code units</span>, rather than in terms of <span
+ title="Unicode character">Unicode characters</span>.</p>
+
+
+
<!--END dev-html-->
<!--START microdata-->
@@ -3519,10 +3537,6 @@
title="space character">space characters</span> that are at the
start or end of the string.</p>
- <p>The <dfn>code-point length</dfn> of a string is the number of
- <span title="code unit">code units</span> in that string. <a
- href="#refsWEBIDL">[WEBIDL]</a></p>
-
<p>When a user agent has to <dfn>strictly split a string</dfn> on a
particular delimiter character <var title="">delimiter</var>, it
must use the following algorithm:</p>
@@ -37228,9 +37242,9 @@
<p>The <dfn>WebVTT cue text tokenizer</dfn> is as follows. It emits
a token, which is either a string (whose value is a sequence of
- Unicode characters), a start tag (with a tag name, a list of
- classes, and optionally an annotation), an end tag (with a tag
- name), or a timestamp tag (with a tag value).</p>
+ characters), a start tag (with a tag name, a list of classes, and
+ optionally an annotation), an end tag (with a tag name), or a
+ timestamp tag (with a tag value).</p>
<ol>