[whatwg] Providing Better Tools
jking at dark-phantasy.com
Sun Dec 3 19:57:07 PST 2006
On Sun, 03 Dec 2006 20:10:34 -0500, Michel Fortin
<michel.fortin at michelf.com> wrote:
> My experience optimizing PHP Markdown, and building the custom mixed
> Markdown/HTML-block pseudo-tokenizer of PHP Markdown Extra, tells me
> that it'll probably stay very slow as long as the implementation is made
> of PHP code.
Yeah, it is. I'm not much of a programmer, but I thought the algorithm
too useful not to try and implement.
> Assuming you've implemented the algorithm in the spec as PHP code, you
> could probably make it faster by using regular expressions in the
> tokenization steps instead of iterating character by character. For
> instance, you could implement many of the tokenizer states by matching
> from the start of a string with a regex. And maybe then it'll also be
> possible to combine a couple of states within the same regex too.
This is precisely what I've done. Before I made that optimization, the
parser would crash more often than not on a document larger than a few
kilobytes on my machine.
> The more we replace PHP code by regular expressions, the faster it'll
> go, but the further we deviate from the processing algorithm described in
> the spec. I wonder how far we could go while keeping the exact same
My pattern optimization is pretty simple: when switching states the parser
first tries matching whatever range of characters will keep the machine in
the same state, and then acts as normal on the first character that
doesn't match. There is, effectively, next to no deviation from the spec
short of emitting one char token per unbroken string rather than one token
per character. Since the tokens are merged into one text node in the tree
builder anyway, the deviation is essentially nil.
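
A minimal sketch of the run-matching idea described above, written in Python rather than PHP purely for illustration. It is not the actual parser code: the function names are hypothetical, only a simplified "data" state is modeled, and it assumes that "<" and "&" are the only characters that move the tokenizer out of that state.

```python
import re

# Assumption: in this simplified "data" state, any character other than
# "<" or "&" keeps the tokenizer in the same state.
DATA_RUN = re.compile(r"[^<&]+")


def chars_one_by_one(text, pos=0):
    """Spec-style loop: emit one character token per character."""
    tokens = []
    while pos < len(text) and text[pos] not in "<&":
        tokens.append(("char", text[pos]))
        pos += 1
    return tokens, pos


def chars_as_run(text, pos=0):
    """Optimized loop: emit one character token per unbroken run."""
    match = DATA_RUN.match(text, pos)
    if match is None:
        # The very next character changes state; handle it as normal.
        return [], pos
    return [("char", match.group())], match.end()


def merge_text(tokens):
    """Stand-in for the tree builder merging character tokens into one text node."""
    return "".join(data for _kind, data in tokens)
```

Both functions stop at the same position, on the first character that would change state, and after merging they yield the same text. The only difference is the number of character tokens emitted along the way.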
> The true good solution would be to have a parser implemented in C and
> available through every standard installation of PHP. It could be used
> by other languages too.
I am keeping my fingers crossed, hoping that someone much more
knowledgeable than I will do this. :)