[whatwg] several messages about XML syntax and HTML5
mart at degeneration.co.uk
Fri Dec 8 10:58:10 PST 2006
Alexey Feldgendler wrote:
> LiveJournal, a popular blogging service, inserts hand-authored content into hand-authored templates. While the templates are written by competent authors who (mostly) know how to write proper HTML, blog posts are most often written by people who barely learnt how to use a bunch of tags. LiveJournal makes some simple preprocessing (breaks paragraphs on newlines and strips dangerous markup like <script>) but otherwise leaves the content as is. That's why most blog pages on LiveJournal aren't even close to being valid HTML.
Actually, LiveJournal's HTML sanitizer is not as simple as you
suggest here. It attempts to "fix" various markup errors:
* Auto-closing badly-nested or unclosed elements
* Escaping instances of bare special characters (<, >, &, etc.)
* Adding quotes around all attribute values
Of course, it isn't a validator: apart from some special cases
(filtering <script>, for example) it has no knowledge of the content
models of the various HTML elements. That makes it a good example of
why it's infeasible for a tool such as this to "fix" a user's mess when
dealing with hand-made markup.
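To make the three fixes listed above concrete, here is a minimal sketch
in Python. It is not LiveJournal's code and makes no claim about how
that code works; the names TidySketch, tidy and VOID are hypothetical,
and <script> filtering is left out. It only shows that this kind of
clean-up can produce well-formed output without knowing anything about
content models.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical illustration only; not LiveJournal's sanitizer.
# Elements that never take an end tag (a deliberately short list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TidySketch(HTMLParser):
    """Re-serializes a fragment, applying three mechanical fixes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the cleaned-up output
        self.stack = []  # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # bare attribute, e.g. "checked"
            else:
                # Fix: every attribute value is quoted (and escaped).
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.stack:
            return  # stray end tag: drop it
        # Fix: close anything left open inside this element first.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Fix: bare special characters in text are escaped.
        self.out.append(escape(data, quote=False))

    # Comments, doctypes and processing instructions fall through to
    # the default no-op handlers, so this sketch silently drops them.


def tidy(fragment: str) -> str:
    parser = TidySketch()
    parser.feed(fragment)
    parser.close()
    # Close whatever is still open at the end, innermost first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

For example, tidy('<b>bold <i>both</b> a & b <a href=x.html>link')
gives '<b>bold <i>both</i></b> a &amp; b <a href="x.html">link</a>'.
On the other hand, tidy('<span><div>x</div></span>') comes back
unchanged: it is well-formed, but a block-level <div> inside a <span> is
still invalid, and nothing in a tool like this can know that.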
LiveJournal actually has a WYSIWYG editor in addition to accepting
hand-edited HTML, but since it's based on the in-browser designMode
thing it often generates worse markup than most users.
(It doesn't help that there is a coding standard in force for
LiveJournal which mandates XHTML served as text/html across the board
and that the system itself injects invalid HTML into the otherwise-valid
LiveJournal actually has two sanitizers. One is stream-based and is
used to fix up the template output:
...while the other is a lot more picky and is used for fixing up
content such as user entries and comments:
Links to source code just included in case anyone is interested.