[whatwg] Comments on the definition of a valid e-mail address
Ian Hickson
ian at hixie.ch
Sun Aug 30 22:53:47 PDT 2009
On Sun, 23 Aug 2009, Aryeh Gregor wrote:
>
> Section 4.10.4.1.5 defines a valid e-mail address as follows:
>
> "A valid e-mail address is a string that matches the production
> dot-atom-text "@" dot-atom-text where dot-atom-text is defined in RFC
> 5322 section 3.2.3. [RFC5322]"
>
> This is much more restrictive than the full range of e-mail addresses
> allowed by RFC 5322 et al. I've been considering whether to use <input
> type=email> in MediaWiki, and whether to change our server-side e-mail
> address validation to match. Historically, MediaWiki has mostly just
> required that an @ symbol be present in the address. Originally we used
> a simplistic regex, but when users complained, we looked into the RFCs
> and decided it was too complicated to bother with validation beyond
> checking for an @ sign.
>
> So before switching us over, I decided to do some research on how many
> users' addresses would be invalidated. I used the database for the
> English Wikipedia. Over all registered users, I found 3,088,880
> confirmed addresses, not necessarily all distinct. ("Confirmed" here
> means that in theory, modulo bugs, the user followed a confirmation link
> in the e-mail they received, so the address probably works in practice.)
> Of those, 3,255 (~0.1%) failed HTML 5 validation, as determined using
> the following regex-based database query:
>
> root@rosemary:enwiki> SELECT COUNT(*) FROM user WHERE
> user_email_authenticated IS NOT NULL AND user_email NOT REGEXP
> '^[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+)*@[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+)*$'
> AND user_email != '';
> +----------+
> | COUNT(*) |
> +----------+
> | 3255 |
> +----------+
> 1 row in set (16 min 10.80 sec)
Thanks for this research, this is exactly the kind of hard data that is
most useful when writing the spec.
> (Someone please tell me if my regex doesn't match HTML 5 here.)
If we let
X = [-a-zA-Z0-9!#$%&\'*+/=?^_`{|}~]+
...then the regexp is:
^X(\.X)*@X(\.X)*$
I believe this is correct, yes.
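For anyone who wants to try the pattern outside MySQL, here is a minimal Python sketch of the same production. The names ATOM, DOT_ATOM_ADDRESS and matches_original_grammar are made up for this illustration and come from neither the spec nor MediaWiki. It uses fullmatch() rather than ^...$ because Python's "$" would also accept a trailing newline.

```python
import re

# X from above: one or more atext characters (RFC 5322 section 3.2.3).
ATOM = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+"

# X(\.X)*@X(\.X)*, i.e. dot-atom-text "@" dot-atom-text.
DOT_ATOM_ADDRESS = re.compile(rf"{ATOM}(?:\.{ATOM})*@{ATOM}(?:\.{ATOM})*")


def matches_original_grammar(address: str) -> bool:
    """Return True if the whole string matches dot-atom-text "@" dot-atom-text."""
    return DOT_ATOM_ADDRESS.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["user+tag@example.com", "foo.@example.com",
                      "user@example.com ", "tkent@!!!!"]:
        print(repr(candidate), matches_original_grammar(candidate))
```

Under this production a trailing dot in the local part and a trailing space are both rejected, while something like "tkent@!!!!" is accepted, since the domain side is only required to be dot-atom-text.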
> Inspection showed that the overwhelming majority of the failures were
> due to the presence of excess whitespace, often a single trailing space,
> or a space inserted before or after the @ sign. When I adjusted the
> regex to ignore those failures, I got a smaller list, 202 (about 0.007%
> of the total): [...]
>
> Some of these were clearly wrong, and shouldn't have been confirmed to
> begin with. Some even didn't have an @ sign, so probably were submitted
> in some window when we did no validation at all (and I have no idea how
> they got confirmed). Of the ones that possibly work, I identified two
> major categories:
>
> 1) Addresses in the form "foo <bar@baz.example>", or similar. These
> mostly match RFC 5322's name-addr production instead of addr-spec (some
> have trailing semicolons, or are missing the initial <, etc.). I assume
> these were copy-pasted from a mail application.
These are intentionally not allowed, since it is expected that the name
will be taken from elsewhere, and the e-mail address will then be pasted
into a template along the lines of "$name <$email>".
> 2) Addresses with dots in incorrect places, in either the local part
> or the domain name part. For instance, multiple consecutive dots, or
> leading/trailing dots. These don't match RFC 5322 at all AFAICT, but
> I asked one of the users with an invalid address of the form
> <foo.@example.com>, and he said it worked fine for him. GNU mail gave
> a syntax error when I tried to send mail to that address, but Gmail
> sent it without complaint, and the user received it successfully.
I've changed the grammar to allow a trailing dot in the username part.
> I should also note that this was only the English Wikipedia, and it
> might be that speakers of other languages are more prone to use other
> types of addresses that don't meet HTML 5's specification. When looking
> at the Swedish and German databases, for instance, I found one or two
> addresses that had apparently been confirmed but contained non-ASCII
> characters. I didn't know the users with those addresses, and I didn't
> want to send them unsolicited mail, so I wasn't able to establish
> whether those addresses actually worked or the confirmation was bogus.
I'll leave it as requiring ASCII for now; I expect UAs to do IDNA
processing on the UI end for the domain side. I'm not sure what is
supposed to happen on the username side.
> Conclusions: At a minimum, I suggest that HTML 5 require that user
> agents strip all whitespace from e-mails, not just newlines. Roughly
> 0.1% of the addresses from my sample were valid except for extraneous
> whitespace. It's a small additional change that would cut the number of
> illegitimately invalid addresses in my sample by a factor of more than
> ten.
This is a UI issue -- if the user enters whitespace, the user agent is
allowed to trim it. It won't submit with whitespace, so user agents are
likely to want to do this.
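A rough sketch of what such a cleanup step might look like, in Python. This is only an assumption about one possible behaviour: the thread does not settle whether a user agent would trim just the ends or remove all whitespace, and the function names below are hypothetical. The sketch removes all whitespace (which covers the trailing-space and space-around-@ cases from the data) and then checks the result against the dot-atom-text "@" dot-atom-text production.

```python
import re

# One or more atext characters (RFC 5322 section 3.2.3).
ATOM = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+"
ADDRESS = re.compile(rf"{ATOM}(?:\.{ATOM})*@{ATOM}(?:\.{ATOM})*")


def strip_all_whitespace(raw: str) -> str:
    """Remove every run of whitespace, wherever it occurs in the input."""
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces, so joining the pieces removes all whitespace.
    return "".join(raw.split())


def is_valid_after_cleanup(raw: str) -> bool:
    """Validate the input after the (assumed) whitespace-removal step."""
    return ADDRESS.fullmatch(strip_all_whitespace(raw)) is not None


if __name__ == "__main__":
    for raw in ["user@example.com ", "user @ example.com", "   "]:
        print(repr(raw), is_valid_after_cleanup(raw))
```

Removing interior whitespace is the more aggressive choice; trimming only the ends with str.strip() would leave "user @example.com" invalid.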
> Beyond that, although it's safe to say that quoted-string or
> domain-literal or even entirely invalid addresses are extraordinarily
> rare, there are *some* real people who do use them. Unless something is
> so completely invalid that it's obviously impossible that any mail
> server would even try to send it anywhere, you're probably going to be
> cutting out some small number of users.
Do you have any more details on what types of addresses we need to allow?
> So why not have the spec say that in the case of e-mail addresses, the
> browser may warn the user, but should permit them to submit the address
> anyway? If the user is willing to override the warning, then it's
> likely that they personally know that the e-mail address works, e.g.,
> because they use it.
I dunno; your data had a number of "obviously wrong" e-mail addresses. I
would expect users to just click through warnings without checking.
> Alternatively, you could just loosen the restrictions even further, and
> only ban input that doesn't contain an @ sign. (Or that doesn't match
> ^[^@]+@[^@]+\.[^@]+$, or whatever.) Or just don't ban anything at all,
> like with type=tel. type=email differs from most of the other types
> with validity constraints (like month, number, etc.) in that the
> difference between valid and invalid values is a purely pragmatic
> question (what will actually work?) that the user can often answer
> better than the application. It doesn't seem like a good idea for the
> standard to tell users that the e-mail addresses they've actually been
> using are invalid.
I'm not quite ready to give up yet!
On Sun, 23 Aug 2009, Aryeh Gregor wrote:
> . . . and I should add that I think it might be useful to have a note
> recommending that application authors not do any validation beyond what
> the spec ends up mandating as required (preferably almost nothing).
> I've had a lot of problems with sites that think + isn't valid in e-mail
> addresses, including pretty major sites that should know better. You
> don't really know if it will work anyway until you try actually sending
> mail to it -- maybe the local part was mistyped or invented -- so why
> not just do that?
This is basically why I want the spec to define how you check for a valid
e-mail address -- so that the authors won't do anything more than basic
sanity checking.
On Sun, 23 Aug 2009, Tab Atkins Jr. wrote:
>
> Unless you avoid validating *entirely*, there's virtually always going
> to be some subset of theoretically valid addresses that you'll flag as
> invalid, though.
I think it's more the theoretically invalid ones (that work anyway) that
we're worried about.
On Mon, 24 Aug 2009, Aryeh Gregor wrote:
>
> The breakdown of the 202 is as follows.
>
> * Single trailing dot in domain part: 100 (prohibited by RFC but
> plausibly deliverable)
Raising an error on these seems ok, the user almost certainly didn't mean
the dot and can just remove it.
> * Single trailing dot in local part: 40 (prohibited by RFC but
> plausibly deliverable)
Now allowed.
> * Valid address in angle brackets (with other junk around it): 21
> (permitted by RFC, kind of, and plausibly deliverable)
Intentionally not allowed.
> * Multiple consecutive dots: 20 (prohibited by RFC but plausibly deliverable)
I've changed the grammar to allow multiple dots in the username part.
> * No @: 9 (unlikely to be deliverable)
> * Comment: 3 (permitted by RFC and plausibly deliverable)
Intentionally not allowed.
> * Miscellaneous: 9 (one containing [NO]@[SPAM], two with trailing >,
> one in "quotes", one with single leading dot in local part, two with
> single leading comma in local part, one with leading ": ", one with
> leading "\")
All but the one with a "." are intentionally disallowed. The one with a
leading "." is now allowed.
So I think that the spec is good now.
On Tue, 25 Aug 2009, TAMURA, Kent wrote:
>
> http://www.whatwg.org/specs/web-apps/current-work/#e-mail-state
> > A valid e-mail address is a string that matches the production
> > dot-atom-text "@" dot-atom-text
> > where dot-atom-text is defined in RFC 5322 section 3.2.3.
> > [RFC5322]<http://www.whatwg.org/specs/web-apps/current-work/#refsRFC5322>
>
> I'd like a stricter rule for it. e.g.
> dot-atom-text "@" 1*(ALPHA / DIGIT) 1*("." 1*(ALPHA / DIGIT))
>
> I understand the current production, dot-atom-text "@" dot-atom-text, is
> a subset of addr-spec of RFC 5322. However dot-atom-text for the
> domain-part is not practical. The production accepts apparently
> unusable email addresses like "tkent@!!!!".
I've restricted the text after the "@" to domain label syntax only.
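To make the combined effect of the changes in this message concrete, here is a hedged Python sketch. The exact revised production is not quoted in this message, so the pattern below is an assumption: the username part is taken to be one or more atext characters or dots in any position, and the part after the "@" is taken to be dot-separated labels of letters, digits and hyphens that neither start nor end with a hyphen. Label length limits are not checked, and the names LOCAL, LABEL, REVISED and matches_revised_sketch are hypothetical.

```python
import re

# Assumed username part: atext characters plus ".", in any order.
LOCAL = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~.]+"

# Assumed domain label: letters, digits and hyphens, with no leading
# or trailing hyphen. Length limits are deliberately not enforced here.
LABEL = r"[a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?"

REVISED = re.compile(rf"{LOCAL}@{LABEL}(?:\.{LABEL})*")


def matches_revised_sketch(address: str) -> bool:
    """Return True if the whole string matches the assumed revised grammar."""
    return REVISED.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["foo.@example.com", "a..b@example.com",
                      ".foo@example.com", "user@example.com.",
                      "tkent@!!!!"]:
        print(repr(candidate), matches_revised_sketch(candidate))
```

With these assumptions, leading, trailing and consecutive dots in the username part are accepted, while a trailing dot in the domain and "tkent@!!!!" are rejected.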
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'