[whatwg] on bibtex-in-html5

Thu May 21 06:51:35 PDT 2009

On May 21, 2009, at 15:02, Bruce D'Arcus wrote:

> Except the assumption that BibTeX is widely used is overdrawn once you
> get out of the technology and sciences sectors.

OK.

>> This doesn't mean that BibTeX is a bad basis. The set of types and fields is
>> limited, though.
>
> It's limited, and it's flat.

In order to not get completely ignored in the technology and sciences  
sectors, a bibliography microdata format needs to be able to plug into  
the network effects of BibTeX. Having a non-flat microdata format  
while BibTeX remains flat would seriously hinder conversions from  
microdata to BibTeX.
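
To illustrate why flatness matters here, a minimal sketch, assuming the extracted microdata item is already a flat mapping of field names to strings. The function name `to_bibtex`, the citation key and the sample values are made up for illustration, and no escaping of special characters is attempted:

```python
def to_bibtex(entry_type, key, fields):
    """Serialize a flat mapping of field names to strings as one BibTeX entry.

    Hypothetical helper: it assumes the values need no escaping.
    """
    body = ",\n".join("  %s = {%s}" % (name, value)
                      for name, value in fields.items())
    return "@%s{%s,\n%s\n}" % (entry_type, key, body)


# Made-up sample item, as it might come out of a flat microdata format.
item = {"author": "Doe, Jane", "title": "An Example", "year": "2009"}
print(to_bibtex("misc", "doe2009", item))
```

With a flat source the conversion is a one-to-one copy of fields. With a nested source, the converter would have to decide how to squash each level into field names first.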

How are non-flat bibliographies (beyond an article being in a book /  
journal / Web site) presented?

>> Since renderings of bibliography don't show the type of the reference
>> usually, having to use 'misc' for almost everything isn't a practical
>> problem although it is aesthetically displeasing.
>
> But this is not the point of adding structured data to HTML; it's to
> allow it to be extracted, and subsequently processed, as data.

More to the point, to allow it to be extracted and used as bibliography
source data for another publication to avoid repetitive data entry.

> Citation and bibliographic formatting conventions do include
> information that suggests type; it's not that it requires a human
> reader to decipher.

OK. Where the styles I've observed make a distinction that isn't
traceable to the availability of fields on an item, it has mainly been a
distinction between atomic publications and compilations.

>>>        • Related, BibTeX cannot represent much of the data in widely used
>>> bibliographic applications such as Endnote, RefWorks and Zotero except in
>>> very general ways.
>>
>> Do you have an example? (I've never used the other formats.)
>
> Here's the in-progress mapping of Zotero's types to RDF (BIBO, and a
> few others; PO from the BBC, and SIOC):
>
> <https://www.zotero.org/trac/wiki/BiboMapping>

On the surface, it seems that it would be possible to mint more field
types and publication types for BibTeX to support those cases, but what is
the publication type information used for? Are there as many different  
entry presentations as there are entry types? Or are the type tokens  
supposed to be mapped to localized human-readable label strings?

Also, the non-flatness I see is an item being part of a compilation  
which is already supported by BibTeX without allowing the whole model  
to generalize into a graph.
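
A minimal sketch of that one level of nesting being folded into flat fields. The input shape (a `container` key with `kind` and `title`) is an assumption made up for this example. `journal`, `booktitle`, `editor`, `publisher` and `volume` are standard BibTeX fields, but sending other container kinds to `howpublished` is an arbitrary choice:

```python
def flatten(item):
    """Collapse a single 'part of' level into flat BibTeX-style fields."""
    flat = {k: v for k, v in item.items() if k != "container"}
    container = item.get("container")
    if container:
        # Assumed mapping from container kind to the field holding its title.
        title_field = {"journal": "journal", "book": "booktitle"}.get(
            container.get("kind"), "howpublished")
        flat[title_field] = container["title"]
        # Copy a few container-level fields unless the item already has them.
        for name in ("editor", "publisher", "volume"):
            if name in container:
                flat.setdefault(name, container[name])
    return flat


# Made-up sample: a chapter inside an edited book.
chapter = {
    "title": "A Chapter",
    "author": "Doe, Jane",
    "container": {"kind": "book", "title": "A Book", "editor": "Roe, Richard"},
}
print(flatten(chapter))
```

This only handles an item inside one compilation. Anything deeper (a compilation inside a series inside an archive, say) has no obvious flat target, which is where the graph question starts.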

> Here's some info on Microsoft's bib format for OOXML, that will give
> you some info:
>
> <http://community.muohio.edu/blogs/darcusb/archives/2006/09/05/open-xml-draft-14>

It seems relatively straightforward technically to extend BibTeX with  
the field types from OOXML that BibTeX doesn't cover. The main issue  
seems to be the bikeshed of what names to use.

> Here's the type schema for CSL (though it needs work, and we
> de-emphasize this for formatting in any case; CSL is oriented towards
> output formatting only really):
>
> <http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/csl-types.rnc?view=markup>
>
> Here's the variable list:
>
> <http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/csl-variables.rnc?revision=941&view=markup>

I don't see a fundamental reason why the BibTeX vocabulary couldn't be  
extended with stuff from there.

>>>        • The BibTeX extensibility model puts a rather large burden on
>>> inventing new properties to accommodate data not in the core model. For
>>> example, the core model has no way to represent a DOI identifier (this is no
>>> surprise, as BibTeX was created before DOIs existed). As a consequence,
>>> people have gradually added this to their BibTeX records and styles in a
>>> more ad hoc way. This ad hoc approach to extensibility has one of two
>>> consequences: either the vocabulary terms are understood as completely
>>> uncontrolled strings, or one needs to standardize them. If we assume the
>>> first case, we introduce potential interoperability problems.
>>
>> In practice, those problems have already been introduced. For some reason I
>> don't understand, there's an existing pattern of calling a field 'doi' but
>> putting an absolute URI in the value. (As opposed to using a field name
>> 'url' or a value that contains only the DOI-significant part.)
>
> The point is, when you get beyond dealing with secondary literature
> (the domain of BibTeX and the sciences), the range of possible data
> expands significantly. Things can get really complicated.
>
> Consider what's actually pretty simple comparatively:
>
> An English translation of a "classic" work. You often need original
> publication information such as title (in the original language),
> publisher and issued date, etc.
>
> With a flat model, you have to invent new properties to accommodate
> every little exception like this.

What formats/software do people use for cases like that in practice?

>>> If we assume the second, we have an organizational and process problem:
>>> that the WHATWG and/or the W3C, neither of which has expertise in this
>>> domain, become the gate-keepers for such extensions. In either case, we have
>>> a rather brittle and anachronistic approach to extension.
>>
>> Problems of this nature haven't stopped the WHATWG in the past. :-)
>>
>>>        • The BibTeX model conflicts with Dublin Core and with vCard, both
>>> of which are quite sensibly used elsewhere in the microdata spec to encode
>>> information related to the document proper. There seems little justification
>>> in having two different ways to represent a document depending on whether
>>> it is THIS document or THAT document.
>>
>> When you are referring to THAT document, you generally want the names of the
>> authors--not their full business cards. Therefore, vCard is overkill, and
>> conversion to .bib is more useful than conversion to vCard for this use
>> case.
>
> Well, vCard is just an example of a structured representation; in
> BIBO, we prefer to recommend FOAF. The point is simply that authors
> and other contributors are not strings; they're people (and sometimes
> organizations).

What software currently supports FOAF in bibliographies?

>>> My suggestion instead?
>>>        • reuse Dublin Core and vCard for the generic data: titles,
>>> creators/contributors, publisher, dates, part/version relations, etc., and
>>> only add those properties (volume, issue, pages, editors, etc.) that they
>>> omit
>>
>> This would make conversion to and from the dominant bibliography format
>> (.bib) more complex.
>
> BibTeX is NOT "the dominant bibliography format." This is exactly part
> of my point in this.
>
>> Furthermore, there's a risk of a GIGO effect where the
>> conversion can't be done algorithmically. (IIRC, you can't algorithmically
>> map a .bib author name to the vCard name structure without a huge dictionary
>> of names.)
>
> Both FOAF and vCard have unstructured personal name properties
> (foaf:name and v:fn) that address this.

But vCard requires both N and FN, so if you only have FN, you can't  
get an N without a lot of dictionary-based domain knowledge and  
special rules. (Or you can make a GIGO N...)
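
A minimal sketch of what a "GIGO N" looks like, assuming a purely positional split of the FN string into the five vCard N components (family, given, additional, prefixes, suffixes). The function name `naive_n` and the splitting rule are made up for illustration and are not taken from any vCard tool:

```python
def naive_n(fn):
    """Guess the vCard N components from a formatted name by position only.

    Returns (family, given, additional, prefixes, suffixes). The rule is
    deliberately naive: first token is 'given', last token is 'family'.
    """
    parts = fn.split()
    if not parts:
        return ("", "", "", "", "")
    if len(parts) == 1:
        return (parts[0], "", "", "", "")
    return (parts[-1], parts[0], " ".join(parts[1:-1]), "", "")


# "van" ends up as an additional name rather than part of the family name.
print(naive_n("Ludwig van Beethoven"))
# The family name comes first in this name, so the guess is inverted.
print(naive_n("Mao Zedong"))
```

Both results are structurally valid N values and both are wrong, and nothing in the FN string alone says so. Getting them right takes exactly the kind of name dictionary and special rules mentioned above.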

>>>        • make it possible for people to interweave other, richer,
>>> vocabularies such as bibo within such item descriptions. In other words,
>>> extension properties should be URIs.
>>>        • define the mapping to RDF of such an “item” description; can we
>>> say, for example, that it constitutes a dct:references link from the
>>> document to the described source?
>>
>> How are these useful for conversions to and from the incumbent format
>> (BibTeX)? (Only BibTeX is supported by all of Google Scholar, the ACM
>> Portal, Stanford Spires, NASA ADS at Harvard and Citebase.org. The last
>> three are databases that arXiv seems to delegate to.)
>
> All of these examples are either from the sciences (they certainly
> don't represent the humanities or law), or deal exclusively with
> secondary scholarly literature.

Maybe there are different needs for humanities and law. I don't know,  
though I'm skeptical. Is there one dominant format for humanities and  
one dominant format for law? (I notice that ACM and Google have  
EndNote in common in addition to BibTeX.)

It doesn't make sense to adopt something less established in order to  
avoid favoring sciences. That is, it may turn out that some fields  
need a format that is less flat than BibTeX, but offering that kind of  
generality where the flatness of BibTeX works seems to be the kind of  
complication that only makes people stick to the simpler thing they  
already have, i.e. BibTeX.

> So if we're talking about HTML5 and
> the microdata proposal, the conversion would be from DC to BibTeX.

Is conversion from DC to BibTeX well-defined? Wouldn't it open all the  
same issues that extending BibTeX vocabulary involves? What  
bibliography generators support DC as source data?

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/