[Imps] Reasonable limits on buffered values

Henri Sivonen hsivonen at iki.fi
Wed Jul 4 03:00:19 PDT 2007

On Jul 4, 2007, at 11:55, Ian Hickson wrote:

> On Thu, 28 Dec 2006, Henri Sivonen wrote:
>> My primary strategy against denial of service attacks that target the
>> conformance checking service is to limit the number of bytes accepted
>> as input. This indirectly throttles everything that is proportional
>> to the size of input, which is OK for most stuff that has linear
>> growth behavior. (It doesn't address things like the billion laughs
>> attack, though.)
>> I have additionally placed arbitrary hard limits on the size of
>> particular buffers.
>> I have additionally placed arbitrary hard limits on the size of
>> particular buffers.
> I recommend a simpler and broader strategy: limit the total CPU and
> memory usage of the process.

The case I have is threads running with a shared memory space. Of  
course, fork()ing an initialized process and individually protected  
process memory spaces would be nice for this kind of thing (even for  
performance if you never called free() and just let the process die  
after completing the HTTP response :-), but Java is what it is.
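The byte cap on input mentioned above can be sketched as a stream wrapper; this is a hypothetical illustration (the class name and limit are made up, not the checker's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch: rejects input past a fixed byte budget, so that all work
 * proportional to input size is bounded up front.
 */
class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.remaining = maxBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            throw new IOException("Input exceeds size limit");
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            throw new IOException("Input exceeds size limit");
        }
        // Never hand the underlying stream a request past the budget.
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

Wrapping the servlet's request stream this way makes the limit enforceable regardless of what the parser downstream does.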

> After a certain level of CPU or memory usage, possibly monitored by a
> separate, higher priority thread, simply terminate the algorithm and
> explain that the system cannot handle the given document.

The basic problem is documented at

One way to use the dangerous stuff less dangerously would be to  
convert all the actual conformance checking code to be lockless (i.e.  
refactor Jing to use Java 1.2-or-later collections instead of the  
Java 1.1 legacy ones), catch the ThreadDeath in the servlet, and make  
calls to read() into the outbound shared HTTP client raise a flag  
(until read() returns) that tells the watchdog thread not to kill the  
thread while it holds locks in the shared HTTP client.

Letting the JVM kill a thread with an OutOfMemoryError is not cool,  
because an innocent thread might get killed instead of the actual  
resource hog.
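Short of Thread.stop() and ThreadDeath, one safer approximation of the watchdog idea is cooperative cancellation under a time budget. A minimal sketch (the class name and budget are illustrative assumptions, not existing code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Sketch: run the checking work under a time budget and cancel it
 * cooperatively via interruption instead of killing the thread.
 */
class TimeBudget {
    static <T> T runWithBudget(Callable<T> work, long millis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = pool.submit(work);
            try {
                return future.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupts the worker thread
                throw new Exception("Cannot handle the given document");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The catch is that interruption only works if the checking code periodically polls Thread.interrupted() or blocks in interruptible calls, so it still requires cooperation from the library code; in exchange it avoids the locks-held-at-death problem entirely.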

>> I'm wondering if there's a best practice here. Is there data on how
>> long non-malicious attribute values legitimately appear on the Web?
> I have seen (and created) multimegabyte attribute values. (Typically,
> data: URIs of one kind or another, but not always.)

In my working copy, I have made the attribute value buffer grow on  
demand. (Allocate a larger buffer and copy memory.) I haven't yet  
figured out a shrinking policy. I discard the whole parser after the  
parse, but other users of the parsing library might want to release  
memory between multiple parses that reuse one parser instance, or  
release memory during the parse as soon as possible. I'd be  
interested in hearing how robust reusable Java libraries in general  
handle the tradeoff between execution speed and avoiding retaining  
too much memory in the form of grown buffers.
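One possible shape for that tradeoff is a doubling append buffer with an explicit shrink hook to call between parses; a sketch, with an arbitrary baseline size (all names here are invented for illustration):

```java
import java.util.Arrays;

/**
 * Sketch: an attribute value buffer that grows by doubling and can
 * be shrunk back to a baseline between parses, so a reused parser
 * instance does not retain one huge value's worth of memory forever.
 */
class ValueBuffer {
    private static final int INITIAL = 64;
    private char[] buf = new char[INITIAL];
    private int length = 0;

    void append(char c) {
        if (length == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2); // grow and copy
        }
        buf[length++] = c;
    }

    /** Package the accumulated value and reset for the next one. */
    String take() {
        String s = new String(buf, 0, length);
        length = 0;
        return s;
    }

    /** Release a grown buffer between parses; keep small ones. */
    void shrinkIfGrown() {
        if (buf.length > INITIAL) {
            buf = new char[INITIAL];
        }
        length = 0;
    }
}
```

Whether the library calls shrinkIfGrown() automatically after each parse, or leaves it to the caller, is exactly the speed-versus-retention policy question.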

>> At least there can be only one attribute buffer being filled at a
>> time. Buffering of the textContent of <progress> and friends is
>> potentially worse than an attribute buffer, because you could use
>> the leading 1 MB of bytes to establish <progress> start tags (each
>> creating a buffer for content) and then use the trailing 1 MB to
>> fill those buffers simultaneously. Perhaps I should worry about
>> those buffers instead. What might be a reasonable strategy for
>> securing those (short of writing the associated algorithms as
>> automata that don't need buffers)?
> In that kind of case, I would recommend having one buffer for all
> decoded "text", and then having all text nodes and text buffers refer
> to start and end points in that buffer.

For a single contiguous buffer this assumes that you know how much  
"text" you are going to get before you allocate the buffer.  
(Otherwise you keep reallocating and copying.)

The SAX API and my tokenizer use this idea to the extent that  
character tokens are reported as ranges pointing to the tokenizer's  
read buffer. However, the buffer has fixed max size and gets  
overwritten, so the ranges are valid only until control returns back  
to the tokenizer. (The interface makes no guarantees about how a run  
of character tokens gets split into ranges.)
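The range-reporting contract can be illustrated with a SAX-style callback; as with SAX's ContentHandler.characters(), the receiver must copy the characters it wants to keep before returning, because the buffer is reused. This is an illustrative sketch, not my tokenizer's actual interface:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Sketch of a SAX-style range contract: character data is passed as
 * a (buffer, start, length) range that is only valid during the call.
 */
interface CharacterHandler {
    void characters(char[] buf, int start, int length);
}

class RangeTokenizer {
    private final char[] readBuffer = new char[2048];

    void tokenize(Reader in, CharacterHandler handler)
            throws IOException {
        int n;
        while ((n = in.read(readBuffer)) != -1) {
            // Report the run as a range into the shared read buffer.
            // The buffer is overwritten by the next read, so the range
            // is invalid once control returns to the tokenizer.
            handler.characters(readBuffer, 0, n);
        }
    }
}
```

The same document text thus never needs more than one fixed-size buffer inside the tokenizer; only the consumer decides what to copy out.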

> This is also remarkably cheap in both CPU and memory; you only have
> to pay the cost of a single copy of the text content, regardless of
> the complexity of the data. It is also basically no overhead compared
> to having individual buffers, since you are still passing around
> strings. For mutable cases (e.g. to support scripting), you can use a
> copy-on-write scheme.

For reporting character tokens or attribute values you can't just  
pass a single range to the higher layer even if you had the whole  
source text decoded to a single buffer: there may be entity  
references in between.

Moreover, *every* XML API for Java (SAX, XNI, DOM, JDOM, XOM, dom4j,  
StAX) represents attribute values as java.lang.String, so in order to  
work with any existing code, my tokenizer has to package attribute  
values as java.lang.Strings. And with java.lang.String, you don't get  
to manage the backing buffer.

Currently, when preparing to create the string, I pessimistically  
assume that every attribute value is going to have an entity  
reference in it or is going to fall over a read buffer boundary. I  
guess I could later optimize for the better case and lazily switch  
to the bad-case handling when the bad case is actually realized.  
(Same with comments. Currently, I'm also pessimistic about tag names  
and attribute names falling over the read buffer boundary.)
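The lazy switch might look like this: treat the value as a contiguous range of the read buffer until an entity reference or buffer boundary forces a copy. A sketch under those assumptions (all names are invented for illustration):

```java
/**
 * Sketch: build an attribute value as a range into the read buffer,
 * falling back to an explicit copy buffer only when an entity
 * reference or a read-buffer boundary is actually hit.
 */
class AttributeValueBuilder {
    private char[] readBuffer;
    private int start;
    private int end;
    private StringBuilder copy; // null while the fast path holds

    void begin(char[] readBuffer, int start) {
        this.readBuffer = readBuffer;
        this.start = start;
        this.end = start;
        this.copy = null;
    }

    /** Value continues contiguously in the read buffer. */
    void advance() {
        end++;
    }

    /** Entity reference or buffer boundary: switch to the slow path. */
    void forceCopy() {
        if (copy == null) {
            copy = new StringBuilder();
            copy.append(readBuffer, start, end - start);
        }
    }

    /** Only used after forceCopy(). */
    void appendCopied(char c) {
        copy.append(c);
    }

    String finish() {
        // Either way, the java.lang.String API boundary demands a copy.
        return copy == null
                ? new String(readBuffer, start, end - start)
                : copy.toString();
    }
}
```

In the common case of a short, entity-free value inside one read buffer, this performs exactly one copy: the one String itself makes.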

>> Is there data on how large legitimate HTML documents appear on the
>> Web? The current limit of 2 MB is based on rounding the size of the
>> Web Apps spec up.
> I have seen infinitely long documents. Discounting those, I have seen
> documents of tens and hundreds of megabytes.

Yeah, but is it a good idea for a publicly accessible conformance  
checker to try to ingest those?

Henri Sivonen
hsivonen at iki.fi
