[Imps] Reasonable limits on buffered values
Henri Sivonen
hsivonen at iki.fi
Wed Jul 4 03:00:19 PDT 2007
On Jul 4, 2007, at 11:55, Ian Hickson wrote:
> On Thu, 28 Dec 2006, Henri Sivonen wrote:
>>
>> My primary strategy against denial of service attacks that target the
>> conformance checking service is to limit the number of bytes accepted
>> as input. This indirectly throttles everything that is proportional to
>> the size of input, which is OK for most stuff that has linear growth
>> behavior. (It doesn't address things like the billion laughs attack,
>> though.)
>>
>> I have additionally placed arbitrary hard limits on the size of
>> particular buffers.
>
> I recommend a simpler and broader strategy: limit the total CPU and
> memory usage of the process.
The case I have is threads running in a shared memory space. Of course,
fork()ing an initialized process and having individually protected
process memory spaces would be nice for this kind of thing (even for
performance, if you never called free() and just let the process die
after completing the HTTP response :-), but Java is what it is.
> After a certain level of CPU or memory usage, possibly monitored by a
> separate, higher priority thread, simply terminate the algorithm and
> explain that the system cannot handle the given document.
The basic problem is documented at
http://java.sun.com/j2se/1.5.0/docs/guide/misc/threadPrimitiveDeprecation.html
One way to use the dangerous stuff less dangerously would be converting
all the actual conformance checking code to be lockless (i.e.
refactoring Jing to use Java 1.2-or-later collections instead of the
Java 1.1 legacy), catching the ThreadDeath in the servlet, and having
calls to read() on the outbound shared HTTP client raise a flag (until
read() returns) that tells the watchdog thread not to kill the thread
while it holds locks in the shared HTTP client.
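Roughly, the flag idea could look like the sketch below. WatchdogGuard
and all the names here are made up for illustration; this is not actual
code in the servlet, just one way to express the mechanism.

// Hypothetical sketch; WatchdogGuard and its methods are invented names.
final class WatchdogGuard {
    private boolean inSharedHttpClient = false; // guarded by this
    private boolean stopRequested = false;      // guarded by this

    // Worker calls this just before read() on the shared HTTP client.
    synchronized void enterSharedHttpClient() {
        inSharedHttpClient = true;
    }

    // Worker calls this when read() returns.
    synchronized void leaveSharedHttpClient() {
        inSharedHttpClient = false;
        if (stopRequested) {
            // The watchdog wanted us dead while we held client locks;
            // die now that no shared locks are held.
            throw new ThreadDeath();
        }
    }

    // Watchdog thread calls this when the request has used too much
    // CPU or memory.
    synchronized void requestStop(Thread worker) {
        stopRequested = true;
        if (!inSharedHttpClient) {
            worker.stop(); // deprecated; tolerable only if the checking code is lockless
        }
    }
}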
Letting the JVM kill a thread with an OutOfMemoryError is not cool,
because an innocent thread might get killed instead of the actual
resource hog.
>> I'm wondering if there's a best practice here. Is there data on how
>> long the non-malicious attribute values that legitimately appear on
>> the Web are?
>
> I have seen (and created) multimegabyte attribute values. (Typically,
> data: URIs of one kind or another, but not always.)
In my working copy, I have made the attribute value buffer grow on
demand. (Allocate a larger buffer and copy memory.) I haven't yet
figured out a shrinking policy. I discard the whole parser after the
parse, but other users of the parsing library might want to release
memory between multiple parses that reuse one parser instance, or
release memory during the parse as soon as possible. I'd be interested
in hearing how robust, reusable Java libraries in general handle the
tradeoff between execution speed and avoiding retaining too much memory
in the form of grown buffers.
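For illustration, a minimal sketch of the grow-on-demand scheme. The
class name and the initial capacity are made up; this is not the actual
parser code.

// Illustrative grow-on-demand attribute value buffer.
final class AttributeValueBuffer {
    private char[] buf = new char[64]; // small initial capacity (made-up number)
    private int length = 0;

    void append(char[] source, int start, int count) {
        if (length + count > buf.length) {
            // Allocate a larger buffer and copy; the old array becomes garbage.
            int newCapacity = Math.max(buf.length * 2, length + count);
            char[] larger = new char[newCapacity];
            System.arraycopy(buf, 0, larger, 0, length);
            buf = larger;
        }
        System.arraycopy(source, start, buf, length, count);
        length += count;
    }

    String takeValue() {
        String value = new String(buf, 0, length);
        length = 0;
        // Open question from above: whether and when to shrink buf back down
        // for long-lived parser instances that are reused across parses.
        return value;
    }
}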
>> At least there can be only one attribute buffer being filled at a
>> time. Buffering of the textContent of <progress> and friends is
>> potentially worse than an attribute buffer, because you could use the
>> leading 1 MB of bytes to establish <progress> start tags (each
>> creating a buffer for content) and then use the trailing 1 MB to fill
>> those buffers simultaneously. Perhaps I should worry about those
>> buffers instead. What might be a reasonable strategy for securing
>> those (short of writing the associated algorithms as automata that
>> don't need buffers)?
>
> In that kind of case, I would recommend having one buffer for all
> decoded "text", and then having all text nodes and text buffers refer
> to start and end points in that buffer.
For a single contiguous buffer this assumes that you know how much
"text" you are going to get before you allocate the buffer.
(Otherwise you keep reallocating and copying.)
The SAX API and my tokenizer use this idea to the extent that character
tokens are reported as ranges pointing into the tokenizer's read
buffer. However, the buffer has a fixed maximum size and gets
overwritten, so the ranges are valid only until control returns to the
tokenizer. (The interface makes no guarantees about how a run of
character tokens gets split into ranges.)
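For concreteness, this is what consuming such ranges looks like on the
standard SAX side; the handler class here is just an illustrative
example, not part of the checker.

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // ch belongs to the parser and may be overwritten once this method
        // returns, so the range must be copied now if it is needed later.
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}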
> This is also remarkably cheap in both CPU and memory; you only have to
> pay the cost of a single copy of the text content, regardless of the
> complexity of the data. It is also basically no overhead compared to
> having individual buffers, since you are still passing around strings.
> For mutable cases (e.g. to support scripting), you can use a
> copy-on-write scheme.
For reporting character tokens or attribute values you can't just
pass a single range to the higher layer even if you had the whole
source text decoded to a single buffer: there may be entity
references in between.
Moreover, *every* XML API for Java (SAX, XNI, DOM, JDOM, XOM, dom4j,
StAX) represents attribute values as java.lang.String, so in order to
work with any existing code, my tokenizer has to package attribute
values as java.lang.Strings. And with java.lang.String, you don't get
to manage the backing buffer.
Currently, when preparing to create the string, I pessimistically
assume that every attribute value is going to have an entity reference
in it or is going to fall over a read buffer boundary. I guess I could
later optimize, assume the better case, and lazily switch to bad-case
handling when the bad case is actually realized. (Same with comments.
Currently, I'm also pessimistic about tag names and attribute names
falling over the read buffer boundary.)
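A hypothetical sketch of what that lazy switch could look like. The
names are made up and read buffer refill bookkeeping is elided, so this
is not the tokenizer's actual code.

// Fast path: the value is one untouched range in the read buffer.
// Slow path: a StringBuilder, entered only when the bad case is realized.
final class LazyAttributeValue {
    private char[] readBuf;     // the tokenizer's read buffer
    private int start;          // start of the value within readBuf
    private int end;            // exclusive end of the plain range seen so far
    private StringBuilder slow; // created only when the bad case is realized

    void startValue(char[] readBuf, int offset) {
        this.readBuf = readBuf;
        this.start = offset;
        this.end = offset;
        this.slow = null;
    }

    // Plain value character at readBuf[bufPos]: the fast path just widens
    // the range; the slow path copies the character into the builder.
    void append(int bufPos) {
        if (slow == null) {
            end = bufPos + 1;
        } else {
            slow.append(readBuf[bufPos]);
        }
    }

    // Entity reference seen, or the value is about to cross a read buffer
    // refill: copy what has accumulated so far and stay on the slow path.
    void realizeBadCase() {
        if (slow == null) {
            slow = new StringBuilder();
            slow.append(readBuf, start, end - start);
        }
    }

    void appendExpansion(String expansion) {
        realizeBadCase();
        slow.append(expansion);
    }

    String finishValue() {
        return (slow == null) ? new String(readBuf, start, end - start)
                              : slow.toString();
    }
}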
>> Is there data on how large legitimate HTML documents appear on the
>> Web? The current limit of 2 MB is based on rounding the size of the
>> Web Apps spec up.
>
> I have seen infinitely long documents. Discounting those, I have seen
> documents of tens and hundreds of megabytes.
Yeah, but is it a good idea for a publicly accessible conformance
checker to try to ingest those?
--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/