[whatwg] on bibtex-in-html5

Thu May 21 05:02:04 PDT 2009

Hi Henri,

On Thu, May 21, 2009 at 4:00 AM, Henri Sivonen <hsivonen at iki.fi> wrote:
> On May 20, 2009, at 19:24, Bruce D'Arcus wrote:
>
>> Re: the recent microdata work and the subsequent effort to include
>> BibTeX in the spec, I summarized my argument against this on my blog:
>>
>>
>> <http://community.muohio.edu/blogs/darcusb/archives/2009/05/20/on-the-inclusion-of-bibtex-in-html5>
>
> Quoting from the blog post:
>>
>> On the last use case, he has chosen BibTeX, on the basis that it is widely
>> used and simple to author and process.
>>
> Those are good criteria.

Except that the claim that BibTeX is widely used is overstated once you
get outside technology and the sciences.

>>        • BibTeX is designed for the sciences, that typically only cite
>> secondary academic literature. It is thus inadequate for, nor widely used,
>> in many fields outside of the sciences: the humanities and law being quite
>> obvious examples. For this reason, BibTeX cannot by default adequately
>> represent even the use cases Ian has identified. For example, there are many
>> citations on Wikipedia that can only be represented using effectively
>> useless types such as “misc” and which require new properties to be
>> invented.
>>
> This doesn't mean that BibTeX is a bad basis. The set of types and fields is
> limited, though.

It's limited, and it's flat.

> Since renderings of bibliography don't show the type of the reference
> usually, having to use 'misc' for almost everything isn't a practical
> problem although it is aesthetically displeasing.

But rendering is not the point of adding structured data to HTML; the
point is to allow it to be extracted, and subsequently processed, as data.

Citation and bibliographic formatting conventions do include cues that
suggest the type; they just rely on a human reader to decipher them.
Surely that should not limit how we address this going forward?

> The set of fields is more of an issue, but it can be fixed by inventing more
> fields--it doesn't mean the whole base solution needs to be discarded.
> Fortunately, having custom fields in .bib doesn't break existing pre-Web,
> pre-ISBN bibliography styles. I've used at least these custom fields:
>
> key: Show this citation pseudo-id in rendering instead of the actual id used
> for matching.
> url: The absolute URL of a resource that is on the Web.
> refdate: The date when the author made the reference to an ephemeral source
> such as a Web page.
> isbn: The ISBN of a publication.
> stdnumber: RFC or ISO number. e.g. "RFC 2397" or "ISO/IEC 10646:2003(E)"
>
> Particularly the 'url' and 'isbn' field names should be obvious and
> uncontroversial additions.

Trust me: this is not nearly as simple as you think. More below ...

>>        • Related, BibTeX cannot represent much of the data in widely used
>> bibliographic applications such as Endnote, RefWorks and Zotero except in
>> very general ways.
>
> Do you have an example? (I've never used the other formats.)

Here's the in-progress mapping of Zotero's types to RDF (BIBO, and a
few others; PO from the BBC, and SIOC):

<https://www.zotero.org/trac/wiki/BiboMapping>

Here's a post on Microsoft's bibliographic format for OOXML, which
should give you an idea:

<http://community.muohio.edu/blogs/darcusb/archives/2006/09/05/open-xml-draft-14>

Here's the type schema for CSL (though it needs work, and we
de-emphasize types for formatting in any case; CSL is really oriented
only towards output formatting):

<http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/csl-types.rnc?view=markup>

Here's the variable list:

<http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/csl-variables.rnc?revision=941&view=markup>

>>        • The BibTeX extensibility model puts a rather large burden on
>> inventing new properties to accommodate data not in the core model. For
>> example, the core model has no way to represent a DOI identifier (this is no
>> surprise, as BibTeX was created before DOIs existed). As a consequence,
>> people have gradually added this to their BibTeX records and styles in a
>> more ad hoc way. This ad hoc approach to extensibility has one of two
>> consequences: either the vocabulary terms are understood as completely
>> uncontrolled strings, or one needs to standardize them. If we assume the
>> first case, we introduce potential interoperability problems.
>
> In practice, those problems have already been introduced. For some reason I
> don't understand, there's an existing pattern of calling a field 'doi' but
> putting an absolute URI in the value. (As opposed to using a field name
> 'url' or a value that contains only the DOI-significant part.)
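
The 'doi' inconsistency described above is the kind of thing any consumer
of such records ends up normalizing by hand. A minimal Python sketch,
assuming the value is either a bare DOI or a DOI behind a common resolver
or "doi:" prefix; the prefix list and the function name bare_doi are
assumptions made for illustration, not part of BibTeX or any tool named in
this thread:

```python
import re

# Prefixes commonly seen in front of a DOI (an assumed, non-exhaustive list).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def bare_doi(value):
    """Return the DOI-significant part of a 'doi' field, or None.

    Hypothetical helper: strips a known prefix, then checks that what is
    left has the usual "10.<registrant>/<suffix>" shape.
    """
    candidate = _DOI_PREFIX.sub("", value.strip())
    return candidate if re.match(r"^10\.\d+/\S+$", candidate) else None
```

Every consumer has to agree on rules like these, which is the
interoperability cost of treating field names as uncontrolled strings.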

The point is, when you get beyond dealing with secondary literature
(the domain of BibTeX and the sciences), the range of possible data
expands significantly. Things can get really complicated.

Consider a case that's actually pretty simple by comparison: an English
translation of a "classic" work. You often need the original
publication information as well: title (in the original language),
publisher, issue date, etc.

With a flat model, you have to invent new properties to accommodate
every little exception like this.
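
To make the flat-versus-nested contrast concrete, here is a minimal Python
sketch. All record values are placeholders, and the property names
("orig_title", "original", and so on) are hypothetical, invented for this
illustration; they do not come from BibTeX, BIBO, or any other vocabulary
mentioned here.

```python
# Flat model: every fact about the original edition needs its own
# invented property name (all "orig_*" names are hypothetical).
flat = {
    "type": "book",
    "title": "An Example Classic",
    "publisher": "Example Press",
    "year": "2001",
    "translator": "A. Translator",
    "orig_title": "Ein Beispielklassiker",
    "orig_publisher": "Beispiel Verlag",
    "orig_year": "1905",
}

# Nested model: the original edition is just another item described with
# the same properties, linked by a single relation ("original" is a
# hypothetical name for that relation).
nested = {
    "type": "book",
    "title": "An Example Classic",
    "publisher": "Example Press",
    "year": "2001",
    "translator": "A. Translator",
    "original": {
        "type": "book",
        "title": "Ein Beispielklassiker",
        "publisher": "Beispiel Verlag",
        "year": "1905",
    },
}


def property_names(item):
    """Collect every distinct property name used in an item, recursively."""
    names = set()
    for key, value in item.items():
        names.add(key)
        if isinstance(value, dict):
            names |= property_names(value)
    return names


# Properties assumed to exist in the core vocabulary for this sketch.
core = {"type", "title", "publisher", "year", "translator"}
flat_extra = property_names(flat) - core      # three new properties
nested_extra = property_names(nested) - core  # one new relation
```

The flat record needs a new property for each fact about the original
edition, while the nested record needs one relation and reuses the
existing properties for everything else.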

>> If we assume the second, we have an organizational and process problem:
>> that the WHATWG and/or the W3C—neither of which have expertise in this
>> domain—become the gate-keepers for such extensions. In either case, we have
>> a rather brittle and anachronistic approach to extension.
>
> Problems of this nature haven't stopped the WHATWG in the past. :-)
>
>>        • The BibTeX model conflicts with Dublin Core and with vCard, both
>> of which are quite sensibly used elsewhere in the microdata spec to encode
>> information related to the document proper. There seems little justification
>> in having two different ways to represent a document depending on whether
>> it is THIS document or THAT document.
>
> When you are referring to THAT document, you generally want the names of the
> authors--not their full business cards. Therefore, vCard is an overkill, and
> conversion to .bib is more useful than conversion to vCard for this use
> case.

Well, vCard is just an example of a structured representation; in
BIBO, we prefer to recommend FOAF. The point is simply that authors
and other contributors are not strings; they're people (and sometimes
organizations).

>> My suggestion instead?
>>        • reuse Dublin Core and vCard for the generic data: titles,
>> creators/contributors, publisher, dates, part/version relations, etc., and
>> only add those properties (volume, issue, pages, editors, etc.) that they
>> omit
>
> This would make conversion to and from the dominant bibliography format
> (.bib) more complex.

BibTeX is NOT "the dominant bibliography format." That is exactly part
of my point.

> Furthermore, there's a risk of a GIGO effect where the
> conversion can't be done algorithmically. (IIRC, you can't algorithmically
> map a .bib author name to the vCard name structure without a huge dictionary
> of names.)

Both FOAF and vCard have unstructured personal name properties
(foaf:name and v:fn) that address this.
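
As a rough illustration of that route, the Python sketch below turns a
BibTeX author field into one record per contributor, keeping only an
unstructured display name (the role foaf:name or v:fn would play). The
function name and the "name" key are hypothetical, and the split is
deliberately naive: it relies on BibTeX's " and " separator and does not
protect braced names such as {Barnes and Noble}.

```python
import re


def authors_from_bibtex(author_field):
    """Split a BibTeX author field into one record per contributor.

    Only an unstructured name is kept; no attempt is made to guess
    family or given name parts, so no dictionary of names is needed.
    """
    parts = re.split(r"\s+and\s+", author_field.strip())
    # Collapse internal runs of whitespace and drop empty entries.
    return [{"name": " ".join(p.split())} for p in parts if p]
```

Each contributor becomes its own item rather than a fragment of one long
string, and richer properties can be added to it later when the data is
actually available.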

>>        • typing should NOT be handled by a bibtex-type property, but the same
>> way everything else is typed in the microdata proposal: a global identifier
>
> Why is typing even needed except for separating articles from compilations?

I agree it's not that important, but not everyone agrees with us ;-)

>>        • make it possible for people to interweave other, richer,
>> vocabularies such as bibo within such item descriptions. In other words,
>> extension properties should be URIs.
>>        • define the mapping to RDF of such an “item” description; can we
>> say, for example, that it constitutes a dct:references link from the
>> document to the described source?
>
> How are these useful for conversions to and from the incumbent format
> (BibTeX)? (Only BibTeX is supported by all of Google Scholar, the ACM
> Portal, Stanford Spires, NASA ADS at Harvard and Citebase.org. The three
> last ones being databases that arXiv seems to delegate to.)

All of these examples are either from the sciences (they certainly
don't represent the humanities or law) or deal exclusively with
secondary scholarly literature. So they represent THIS document
(exclusively journal articles, I'd guess), not THOSE documents that are
included in the reference list. So if we're talking about HTML5 and
the microdata proposal, the conversion would be from DC to BibTeX.

Bruce