[whatwg] Comments on the definition of a valid e-mail address

Smylers Smylers at stripey.com
Mon Aug 24 01:36:12 PDT 2009


Aryeh Gregor writes:

> Historically, MediaWiki has mostly just required that an @ symbol be
> present in the address.  Originally we used a simplistic regex,

It's relatively well known that a simple regex can't be used to match
e-mail addresses (and not match things that aren't!); Jeffrey Friedl's
'Mastering Regular Expressions' (O'Reilly) included a pattern for this
over a decade ago, but it is exceedingly long:

  http://groups.google.co.uk/group/comp.lang.perl.misc/msg/603ba6fc642a3124
  http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

> ... but when users complained, we looked into the RFCs and decided it
> was too complicated to bother with validation beyond checking for an @
> sign.

It's too complicated for most developers to roll their own validation,
but there are standard libraries available which get it right.

> ... I decided to do some research on how many users' addresses would
> be invalidated [by HTML 5's validation] ...
> 
> 1) Addresses in the form "foo <bar@baz.example>", or similar.  These
> mostly match RFC 5322's name-addr production instead of addr-spec

Forms on websites capturing users' e-mail addresses typically want just
the address part, prompting for the human-readable name in a separate
box, so I think HTML 5's <input type=email> not allowing the above is
helpful.

> 2) Addresses with dots in incorrect places, in either the local part
> or the domain name part.  For instance, multiple consecutive dots, or
> leading/trailing dots.  These don't match RFC 5322 at all AFAICT, but
> I asked one of the users with an invalid address of the form
> <foo.@example.com>, and he said it worked fine for him.  GNU mail gave
> a syntax error when I tried to send mail to that address, but Gmail
> sent it without complaint, and the user received it successfully.

There may actually be several categories of oddly placed dots.  While
an address of the form you give above works, it may be that others,
say those with repeated dots in the hostname part, don't.
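To make those categories concrete, here is a minimal Python sketch
which labels where the odd dots are, so a sample of addresses could be
counted by category.  The function name and the labels are made up for
illustration; it looks only at dot placement and does no other
validation.

```python
def dot_problems(address):
    """Return labels describing oddly placed dots in an address.

    Hypothetical helper: it checks dot placement only, nothing else.
    """
    # Split on the last "@" so the hostname part is taken from the end.
    local, sep, domain = address.rpartition("@")
    if not sep:
        return ["no @ sign"]
    problems = []
    for name, part in (("local part", local), ("domain", domain)):
        if part.startswith("."):
            problems.append(f"leading dot in {name}")
        if part.endswith("."):
            problems.append(f"trailing dot in {name}")
        if ".." in part:
            problems.append(f"consecutive dots in {name}")
    return problems
```

Tallying the labels over a set of rejected addresses would show
whether ".@" really dominates, or whether the other categories occur
too.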

On the specific case of a . immediately before the @, I've seen that
before: this Perl library module extends an RFC-compliant module to
allow just that; its author admits ".@" breaks the RFCs but claims such
breakage is useful in the real world, specifically when dealing with
e-mail addresses for Japanese mobile phones:

  http://search.cpan.org/perldoc?Email::Valid::Loose

That somebody has found this to be a sufficiently widespread problem
with standard Perl e-mail address validation to write and upload a
module which 'fixes' this (and just that; it makes no other changes)
suggests that people will find HTML 5's <input type=email> to be
problematic in precisely the same way.
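As a rough illustration of "relax just that one rule", here is a
Python sketch.  It is not how Email::Valid::Loose is implemented; the
function names are hypothetical, and the strict check is a simplified
dot-atom test that ignores quoted strings, comments, and the other
corners of RFC 5322.

```python
import re

# Simplified dot-atom: runs of atext characters separated by single dots.
# This is an approximation, not a full RFC 5322 addr-spec parser.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
DOT_ATOM = re.compile(rf"{ATEXT}(?:\.{ATEXT})*")


def is_strict(address):
    """Accept only dot-atom@dot-atom (simplified strict check)."""
    local, sep, domain = address.rpartition("@")
    return (bool(sep)
            and DOT_ATOM.fullmatch(local) is not None
            and DOT_ATOM.fullmatch(domain) is not None)


def is_loose(address):
    """As is_strict, but tolerate one dot immediately before the @."""
    local, sep, domain = address.rpartition("@")
    if sep and local.endswith("."):
        # Drop the single trailing dot, then apply the strict rules.
        address = local[:-1] + "@" + domain
    return is_strict(address)
```

With this, "foo.@example.com" fails is_strict but passes is_loose,
while "foo..bar@example.com" and "foo@example..com" fail both.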

> There were other types of addresses that didn't meet HTML 5's
> specification after whitespace was stripped, but none with more than a
> single-digit number of addresses occurring in the sample of three
> million or so that I looked at.

So it may be that there isn't a general problem here of lots of
real-world e-mail addresses which work but don't comply with the RFCs;
the only significant case may simply be ".@".

There isn't a plethora of Email::Valid extensions which relax various
different criteria; there's just the one, which allows ".@".

> Alternatively, you could just loosen the restrictions even further,
> and only ban input that doesn't contain an @ sign.  (Or that doesn't
> match ^[^@]+@[^@]+\.[^@]+$, or whatever.)  Or just don't ban anything
> at all, like with type=tel.  type=email differs from most of the other
> types with validity constraints (like month, number, etc.) in that the
> difference between valid and invalid values is a purely pragmatic
> question (what will actually work?) that the user can often answer
> better than the application.  It doesn't seem like a good idea for the
> standard to tell users that the e-mail addresses they've actually been
> using are invalid.

Users often mis-type e-mail addresses.  It seems useful to be able to
trap as many typos as possible.  Many authors obviously believe this,
given how many employ JavaScript validators.  If HTML 5 were overly
permissive about <input type=email> then it's likely such authors would
continue to use homegrown JavaScript solutions, which somewhat defeats
the purpose of HTML 5 introducing <input type=email>.
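To show how few typos the permissive pattern quoted above would trap,
here is a small Python check using that pattern as given.  The helper
name and the sample typos are my own illustrative assumptions, not
taken from any real data.

```python
import re

# The permissive pattern suggested in the quoted message.
LOOSE = re.compile(r"^[^@]+@[^@]+\.[^@]+$")


def passes_loose(address):
    """True if the address matches the permissive pattern."""
    return LOOSE.match(address) is not None


# Hypothetical typos, chosen only for illustration.
typos = [
    "user@example",       # missing top-level domain
    "user@@example.com",  # doubled @
    "user@example,com",   # comma instead of dot
    "user @example.com",  # stray space
    "user@example..com",  # doubled dot
]

for address in typos:
    verdict = "accepted" if passes_loose(address) else "rejected"
    print(f"{address!r}: {verdict}")
```

The first three are rejected, but the stray space and the doubled dot
get through, which is the sort of typo a stricter check would catch.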

Smylers


