[whatwg] Exposing spelling/grammar suggestions in contentEditable

Thu Dec 2 13:53:01 PST 2010

On Thu, Dec 2, 2010 at 8:30 PM, Charles Pritchard <chuck at jumis.com> wrote:
> On 11/28/2010 11:30 PM, Benjamin Hawkes-Lewis wrote:
>>
>> Breaches would include:
>>
>>    1. Detecting the user's language (including fine distinctions like
>> British/US English).
>>    2. Fingerprinting the user's system. Different systems likely use
>> different dictionaries with different coverage. You could use
>> dictionary profiles to guess at the user's system (potentially down to
>> operating system and version).
>
> I haven't seen a response on these issues: They're currently exposed via
> window.navigator,
> so I'm just having a hard time seeing what the push-back is actually about.

Note 1: I'm not taking a position here on the appropriateness of
leaking this information.

Note 2: I do not claim any special security expertise.

Some UAs leak language on "navigator.language", but it is not part of
any (proposed or actual) specification AFAIK.

While the user's system /might/ be exposed by "navigator", the HTML5
draft specifically allows standard blank returns or misinformation, so
long as it's not inconsistent with the User-Agent header:

http://www.whatwg.org/specs/web-apps/current-work/multipage/timers.html#client-identification

At the HTTP layer, the "Accept-Language" or "User-Agent" headers might
leak the same information, but both are optional.

HTTP 1.1 warns that: "It might be contrary to the privacy expectations
of the user to send an Accept-Language header with the complete
linguistic preferences of the user in every request".

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

While HTTP 1.1 says UAs "SHOULD" send a User-Agent header, privacy
concerns are arguably a legitimate exclusion to that SHOULD.

UAs are of course notorious for lying in their User-Agent header for
compatibility reasons.

Allowing UAs explicitly to provide information via dedicated optional
fields is different to requiring them to leak it in the course of
providing another service (such as spelling).

> I think a good case was made for NOT exposing actual spelling suggestions. I
> haven't heard one regarding exposing DOM ranges for mis-spelled text.
> Limitations of the <input type="text"> element to a single range, is a
> reasonable issue..

Since you can populate and retrieve the text of a DOM range, what
difference would this make to security?

In one of your other emails, you mentioned establishing high arbitrary
limits on the number of calls to the spelling API (e.g. 1000) to
protect against abuse.

I suspect you would not need 1000 queries to identify language or
systems by dictionary - you just need to know the critical identifying
differences in advance?

In any case, if you had an API that told you the misspelled ranges
with one query, could you not supply an input DOM with 2000 words to a
single query, in order to build a dictionary profile? Perhaps I'm
misunderstanding your proposal here?
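
To make the single-query concern concrete, here is a minimal Python
sketch. Everything in it is hypothetical: "misspelled_ranges" stands in
for the kind of range-reporting API under discussion (no such API is
specified anywhere), and the two tiny dictionaries and the probe words
are invented for illustration.

```python
import re

# Hypothetical dictionaries standing in for two different systems.
# Real dictionaries are far larger; these are invented for illustration.
DICTIONARIES = {
    "system-a (en-GB)": {"the", "colour", "of", "my", "neighbour", "aluminium"},
    "system-b (en-US)": {"the", "color", "of", "my", "neighbor", "aluminum"},
}

# Probe words chosen in advance because the dictionaries disagree on them.
PROBE_WORDS = ["colour", "color", "neighbour", "neighbor", "aluminium", "aluminum"]


def misspelled_ranges(text, dictionary):
    """Hypothetical stand-in for a range-reporting spelling API.

    Returns (start, end) offsets of every word not found in `dictionary`.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[A-Za-z]+", text)
        if m.group().lower() not in dictionary
    ]


def profile(dictionary):
    """Build a dictionary profile from a single query over all probe words."""
    text = " ".join(PROBE_WORDS)
    flagged = {text[s:e] for s, e in misspelled_ranges(text, dictionary)}
    # One boolean per probe word: True means the dictionary accepted it.
    return tuple(word not in flagged for word in PROBE_WORDS)


def guess_system(observed):
    """Return the names of known systems whose profile matches `observed`."""
    return [name for name, d in DICTIONARIES.items() if profile(d) == observed]
```

Here "profile" makes exactly one call to the stand-in API regardless of
how many probe words it tests, so a cap on the number of calls would not
by itself bound the size of the profile.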

> Has there been a fundamental discussion about security regarding locale
> fingerprinting?
>
> At this point, we're talking about language codes as a level of personal
> privacy we reserve for a person's name, home address, etc.

That's a bit of a strawman.

Identifying a potential breach is not the same as equating the
seriousness of that breach with other potential breaches.

For myself, I wouldn't say leaking a system plus locale is as bad as
leaking a home address, for example.

> Has this point, and the potential for abuse, actually been discussed by experts?

Perhaps these meet that bar:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec15.html#sec15.1.4

http://panopticlick.eff.org/browser-uniqueness.pdf

http://www.schneier.com/blog/archives/2010/01/tracking_your_b.html

FWIW, for one implementor perspective, WebKit's Maciej has said (of
exposing navigator.acceptLanguage): "I think the privacy concern is
minimal".

http://www.mail-archive.com/webkit-dev@lists.webkit.org/msg07686.html

> I can tell you, that blocking the issue does have real usability costs:

Would you agree those usability costs are:

   - If using a webservice for spellchecking, the user might be shown
words as misspelt that they have already added to their system
dictionary.
   - Spellchecking-on-demand would be subject to connection speed.
   - Web applications providing their own spelling UI backed by a
webservice would need to fall back to the browser's UI, backed by the
browser's spelling service, when offline.

(You mentioned clientside storage as another possibility; I doubt you
get enough storage for whole dictionaries, however.)

Have I missed anything out?

> blocking the issue without expert review, means that we're weighing actual,
> measurable usability costs with perceived insecurities. That doesn't seem
> reasonable to me.

You mention measuring and weighing. What metrics would you propose?
How would you balance them?

> FWIW: It's reasonably simple to use a minimum of scripting code to detect an
> input language, given only a sentence or two of data.

Indeed. However:

   - That requires the user actively contribute that input (rather
than spellchecking hidden, generated text, for example).
   - That only tells you what language the user typed in, which might
be different to their system language.
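
For a sense of how little code the quoted claim implies, here is a
rough Python sketch of stopword-based language guessing. The stopword
lists are tiny and hand-picked for illustration, and "guess_language" is
a made-up name; a real detector would use much larger word lists or
character n-gram statistics.

```python
import re

# Tiny, hand-picked stopword lists; invented for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "des", "un", "une"},
}


def guess_language(text):
    """Guess the language of `text` by counting known stopwords.

    Returns a language code, or None if no stopword matched at all.
    """
    # Split into runs of letters (Unicode-aware), ignoring digits/underscores.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Note that this only ever sees text the user actually typed, so it is
subject to both limitations listed above.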

> I understand that
> there are situations where language use is regulated, but those situations
> carry so many other reductions in freedom: I highly doubt that exposing
> input locale would be anything but trivial in comparison to other issues.

The WebKit thread above provides a real-world example of how language
information may be used, even in liberal democracies:

"for a while eBay treated users as if they were browsing from Germany
if the German language appeared anywhere in the list, blocking certain
auctions."

http://www.mail-archive.com/webkit-dev@lists.webkit.org/msg07693.html

(I suspect that reflects eBay attempting to comply with German
legislation against sale of Nazi memorabilia.)
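
As an illustration of the difference between "appears anywhere in the
list" and "is the user's preferred language", here is a small Python
sketch that parses an Accept-Language value. The function names are
made up for this example, the parsing is simplified (it does not treat
q=0 specially), and it is not a description of how eBay actually
implemented its check.

```python
def parse_accept_language(header):
    """Return the language tags in an Accept-Language value, best first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip().lower() for p in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0  # A missing q parameter means q=1.
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by original position.
        entries.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(entries)]


def mentions_language(header, primary):
    """True if `primary` (e.g. "de") appears anywhere in the list."""
    return any(t.split("-")[0] == primary for t in parse_accept_language(header))


def prefers_language(header, primary):
    """True only if `primary` is the top-ranked language."""
    tags = parse_accept_language(header)
    return bool(tags) and tags[0].split("-")[0] == primary
```

For a value such as "en-GB,en;q=0.8,de;q=0.3", "mentions_language"
reports German while "prefers_language" does not, which is the kind of
gap the quoted behaviour would fall into.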

> Can I get some further, reasonable discussion, on this issue? It's fine that
> Benjamin brought up that such data could be exposed,
> but when looked at in context of the current scripting environment: that
> data is already exposed.

… optionally, explicitly, configurably.

--
Benjamin Hawkes-Lewis