[whatwg] on bibtex-in-html5

Henri Sivonen hsivonen at iki.fi
Thu May 21 01:00:11 PDT 2009


On May 20, 2009, at 19:24, Bruce D'Arcus wrote:

> Re: the recent microdata work and the subsequent effort to include
> BibTeX in the spec, I summarized my argument against this on my blog:
>
> <http://community.muohio.edu/blogs/darcusb/archives/2009/05/20/on-the-inclusion-of-bibtex-in-html5>

Quoting from the blog post:
> On the last use case, he has chosen BibTeX, on the basis that it is  
> widely used and simple to author and process.
>
Those are good criteria.
> 	• BibTeX is designed for the sciences, that typically only cite  
> secondary academic literature. It is thus inadequate for, nor widely  
> used, in many fields outside of the sciences: the humanities and law  
> being quite obvious examples. For this reason, BibTeX cannot by  
> default adequately represent even the use cases Ian has identified.  
> For example, there are many citations on Wikipedia that can only be  
> represented using effectively useless types such as “misc” and which  
> require new properties to be invented.
>
This doesn't mean that BibTeX is a bad basis. The set of types and  
fields is limited, though.

Since rendered bibliographies usually don't show the type of the
reference, having to use 'misc' for almost everything isn't a practical
problem, although it is aesthetically displeasing.

The set of fields is more of an issue, but it can be fixed by  
inventing more fields--it doesn't mean the whole base solution needs  
to be discarded. Fortunately, having custom fields in .bib doesn't  
break existing pre-Web, pre-ISBN bibliography styles. I've used at  
least these custom fields:

key: Show this citation pseudo-id in rendering instead of the actual  
id used for matching.
url: The absolute URL of a resource that is on the Web.
refdate: The date when the author made the reference to an ephemeral  
source such as a Web page.
isbn: The ISBN of a publication.
stdnumber: RFC or ISO number, e.g. "RFC 2397" or "ISO/IEC 10646:2003(E)".

Particularly the 'url' and 'isbn' field names should be obvious and  
uncontroversial additions.
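
As a rough illustration of how such custom fields sit next to the
standard ones, here is a minimal Python sketch that serializes a record
as a .bib entry. The helper name format_bib_entry, the citation id, and
the 'refdate' value and its date format are assumptions made up for
this example; the field names are the ones listed above.

```python
def format_bib_entry(entry_type, cite_id, fields):
    """Serialize one record as a BibTeX entry (hypothetical helper).

    Fields are written in dict order; values are wrapped in braces.
    No escaping is attempted, so this is only a sketch.
    """
    lines = ["@%s{%s," % (entry_type, cite_id)]
    for name, value in fields.items():
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


entry = format_bib_entry("misc", "rfc2397", {
    "key": "RFC2397",                # pseudo-id to show in the rendering
    "author": "L. Masinter",
    "title": 'The "data" URL scheme',
    "year": "1998",
    "stdnumber": "RFC 2397",
    "url": "http://www.ietf.org/rfc/rfc2397.txt",
    "refdate": "2009-05-21",         # illustrative value and format
})
print(entry)
```

A style that doesn't know 'url', 'refdate' or 'stdnumber' can simply
ignore them, which is why the extra fields don't break older styles.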

> 	• Related, BibTeX cannot represent much of the data in widely used  
> bibliographic applications such as Endnote, RefWorks and Zotero  
> except in very general ways.

Do you have an example? (I've never used the other formats.)

> 	• The BibTeX extensibility model puts a rather large burden on  
> inventing new properties to accommodate data not in the core model.  
> For example, the core model has no way to represent a DOI identifier  
> (this is no surprise, as BibTeX was created before DOIs existed). As  
> a  consequence, people have gradually added this to their BibTeX  
> records and styles in a more ad hoc way. This ad hoc approach to  
> extensibility has one of two consequences: either the vocabulary  
> terms are understood as completely uncontrolled strings, or one  
> needs to standardize them. If we assume the first case, we introduce  
> potential interoperability problems.

In practice, those problems have already been introduced. For some  
reason I don't understand, there's an existing pattern of calling a  
field 'doi' but putting an absolute URI in the value. (As opposed to  
using a field name 'url' or a value that contains only the
DOI-significant part.)
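
A consumer that wants to cope with both patterns could normalize the
value before use. The following Python sketch shows one way to do it;
the function name bare_doi and the set of prefixes it strips are
assumptions for illustration, not taken from any existing tool.

```python
import re

# Prefixes that sometimes end up in a 'doi' field (assumed list).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)",
                         re.IGNORECASE)


def bare_doi(value):
    """Return only the DOI-significant part of a 'doi' field value."""
    return _DOI_PREFIX.sub("", value.strip())


print(bare_doi("http://dx.doi.org/10.1000/182"))
print(bare_doi("doi:10.1000/182"))
print(bare_doi("10.1000/182"))
```

All three calls yield the same bare DOI, so downstream code only has to
deal with one form.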

> If we assume the second, we have an organizational and process  
> problem: that the WHATWG and/or the W3C—neither of which have  
> expertise in this domain—become the gate-keepers for such  
> extensions. In either case, we have a rather brittle and  
> anachronistic approach to extension.

Problems of this nature haven't stopped the WHATWG in the past. :-)

> 	• The BibTeX model conflicts with Dublin Core and with vCard, both  
> of which are quite sensibly used elsewhere in the microdata spec to  
> encode information related to the document proper. There seems  
> little justification in having two different ways to represent a  
> document depending on whether it is THIS document or THAT
> document.

When you are referring to THAT document, you generally want the names  
of the authors--not their full business cards. Therefore, vCard is
overkill, and conversion to .bib is more useful than conversion to
vCard for this use case.

> My suggestion instead?
> 	• reuse Dublin Core and vCard for the generic data: titles,  
> creators/contributors, publisher, dates, part/version relations,  
> etc.,  and only add those properties (volume, issue, pages, editors,  
> etc.) that they omit

This would make conversion to and from the dominant bibliography  
format (.bib) more complex. Furthermore, there's a risk of a GIGO  
effect where the conversion can't be done algorithmically. (IIRC, you  
can't algorithmically map a .bib author name to the vCard name  
structure without a huge dictionary of names.)
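
To make the name-mapping concern concrete, here is a Python sketch of a
simplified BibTeX-style name split. It is an assumption-laden
simplification: it handles only the "Family, Given" and
"Given [particle] Family" forms, lumps lowercase particles into the
family name, and ignores "Jr" parts and brace groups. The function name
split_bib_name is hypothetical.

```python
def split_bib_name(name):
    """Split a .bib author name into (given, family), simplified rules."""
    if "," in name:
        # "Family, Given": the author has marked the boundary explicitly.
        family, given = [part.strip() for part in name.split(",", 1)]
        return given, family
    words = name.split()
    # Without a comma, the family name is taken to be the last word plus
    # any lowercase particles ("van", "de", ...) directly before it.
    i = len(words) - 1
    while i > 0 and words[i - 1][0].islower():
        i -= 1
    return " ".join(words[:i]), " ".join(words[i:])


print(split_bib_name("Ludwig van Beethoven"))
print(split_bib_name("Gabriel García Márquez"))
print(split_bib_name("García Márquez, Gabriel"))
```

The second call treats "García" as a given name, although the family
name is "García Márquez". Only the comma form carries enough
information to fill a structured name, and a converter can't tell which
form was intended without knowing the names themselves.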

> 	• typing should NOT be handled as a bibtex-type property, but the same
> way everything else is typed in the microdata proposal: a global  
> identifier

Why is typing even needed except for separating articles from  
compilations?

> 	• make it possible for people to interweave other, richer,  
> vocabularies such as bibo within such item descriptions. In other  
> words, extension properties should be URIs.
> 	• define the mapping to RDF of such an “item” description; can we  
> say, for example, that it constitutes a dct:references link from the  
> document to the described source?

How are these useful for conversions to and from the incumbent format  
(BibTeX)? (Only BibTeX is supported by all of Google Scholar, the ACM
Portal, Stanford Spires, NASA ADS at Harvard and Citebase.org. The
last three are databases that arXiv seems to delegate to.)

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/




