[whatwg] on bibtex-in-html5

Thu May 21 09:11:04 PDT 2009

On Thu, May 21, 2009 at 9:51 AM, Henri Sivonen <hsivonen at iki.fi> wrote:
> On May 21, 2009, at 15:02, Bruce D'Arcus wrote:
>
>> Except the assumption that BibTeX is widely used is overdrawn once you
>> get out of the technology and sciences sectors.
>
> OK.
>
>>> This doesn't mean that BibTeX is a bad basis. The set of types and fields
>>> is limited, though.
>>
>> It's limited, and it's flat.
>
> In order to not get completely ignored in the technology and sciences
> sectors, a bibliography microdata format needs to be able to plug into the
> network effects of BibTeX. Having a non-flat microdata format while BibTeX
> remains flat would seriously hinder conversions from microdata to BibTeX.

All that matters from a BibTeX perspective is that the data is a clean
superset. That is, so long as a book, chapter, article, etc. can be
reliably converted to and from BibTeX, there's no problem.

The same is true of all the other bib formats out there: RIS, NLM,
MODS, PRISM, OOXML, etc.

> How are non-flat bibliographies (beyond an article being in a book / journal
> / Web site) presented?

A journal article is always a good example. If you like, take a look
at the RDFa embedded in this example:

<http://bruce.darcus.name/publications/articles/outside-agitator>

Now, let's consider the most basic and important distinction: how you
represent the journal title.

In BibTeX, it's (typically) a flat "journal" key.

In the DC/BIBO representation here, you use a dc:isPartOf relation, so
that the triples look like:

<http://bruce.darcus.name/publications/articles/outside-agitator>
    a bibo:AcademicArticle ;
    dc:title "Dissent, Public Space and the Politics of Citizenship: Riots and the Outside Agitator"@en ;
    bibo:doi "10.1080/1356257042000309652" ;
    bibo:issue "3" ;
    bibo:pageEnd "370" ;
    bibo:pageStart "355" ;
    bibo:volume "8" ;
    dc:creator <http://bruce.darcus.name/about#me> ;
    dc:isPartOf [ dc:title "Space & Polity" ] .

So that same mechanism can be used to represent related titles of all
sorts: weblogs, magazines and newspapers, court reporters (which are
really just periodicals that publish legal decisions), etc.

The alternative in a totally flat model is having to invent new title
properties every time you come across new data (or using a more
generic key than "journal" to represent the containing title).
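
To make the conversion direction concrete, here is a minimal Python
sketch of flattening a nested record into BibTeX-style fields. The dict
layout, the CONTAINER_KEY table and the to_bibtex_fields name are all
hypothetical, made up for illustration; this is not how Zotero or any
other tool actually does it.

```python
# Hypothetical nested record mirroring the DC/BIBO example above.
article = {
    "type": "AcademicArticle",
    "title": "Dissent, Public Space and the Politics of Citizenship: "
             "Riots and the Outside Agitator",
    "volume": "8",
    "issue": "3",
    "pageStart": "355",
    "pageEnd": "370",
    "doi": "10.1080/1356257042000309652",
    "isPartOf": {"title": "Space & Polity"},
}

# Assumed mapping from item type to (BibTeX entry type, flat key that
# holds the containing title). Types not listed here fall back to "misc".
CONTAINER_KEY = {
    "AcademicArticle": ("article", "journal"),
    "Chapter": ("incollection", "booktitle"),
}


def to_bibtex_fields(item):
    """Flatten a nested item into (entry_type, fields) for BibTeX."""
    entry_type, container_key = CONTAINER_KEY.get(item["type"], ("misc", None))
    fields = {"title": item["title"]}
    # "doi" is not a core BibTeX field, but it is a common ad hoc addition.
    for key in ("volume", "doi"):
        if key in item:
            fields[key] = item[key]
    if "issue" in item:
        fields["number"] = item["issue"]
    if "pageStart" in item and "pageEnd" in item:
        fields["pages"] = f'{item["pageStart"]}--{item["pageEnd"]}'
    container = item.get("isPartOf")
    # The single generic relation becomes whichever flat key the type needs.
    # With no known key for the type, the container title is simply lost.
    if container and container_key:
        fields[container_key] = container["title"]
    return entry_type, fields


entry_type, fields = to_bibtex_fields(article)
```

Going from the nested form to the flat one is a table lookup. The
fallback branch shows the problem with the flat model: a type with no
dedicated title key has nowhere to put its container.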

I explain the basic thinking behind this using some actual examples
from citation styles here:

<http://www.users.muohio.edu/darcusb/misc/citations-spec.html>

They're really just design notes, but I think they communicate the point.

>>> Since renderings of bibliography don't show the type of the reference
>>> usually, having to use 'misc' for almost everything isn't a practical
>>> problem although it is aesthetically displeasing.
>>
>> But this is not the point of adding structured data to HTML; it's to
>> allow it to be extracted, and subsequently processed, as data.
>
> More to the point, allow to be extracted and used as bibliography source
> data for another publication to avoid repetitive data entry.

Yes.

>> Citation and bibliographic formatting conventions do include
>> information that suggests type; it's not that it requires a human
>> reader to decipher.
>
> OK. The styles that I've observed make a difference that isn't traceable to
> the availability of fields on an item have mainly made a distinction between
> atomic publications and compilations.

Yes. But you also have styles that have conventions like "if you have
a book, format title in italics, else ..." So there are little hints
like that which give a (human) reader information they can use to find
the source in question.

As the creator of CSL, I've always said my intention is to contribute
toward helping us move beyond some of these eccentric traditions,
though!

>>>>       • Related, BibTeX cannot represent much of the data in widely used
>>>> bibliographic applications such as Endnote, RefWorks and Zotero except in
>>>> very general ways.
>>>
>>> Do you have an example? (I've never used the other formats.)
>>
>> Here's the in-progress mapping of Zotero's types to RDF (BIBO, and a
>> few others; PO from the BBC, and SIOC):
>>
>> <https://www.zotero.org/trac/wiki/BiboMapping>
>
> On the surface, it seems that it would be possible to mint more field types and
> publications for BibTeX to support those cases, but what is the publication
> type information used for? Are there as many different entry presentations
> as there are entry types? Or are the type tokens supposed to be mapped to
> localized human-readable label strings?

It depends. For Zotero, a lot of it is about mapping to particular UI
configurations for data entry and editing.

But they can also be used for mapping to output styling as defined in
CSL (which is loosely inspired by BibTeX's BST language, but is XML).

> Also, the non-flatness I see is an item being part of a compilation which is
> already supported by BibTeX without allowing the whole model to generalize
> into a graph.

Where is the generic BibTeX key to denote a containing item? There's
no "publication-title" or "container-title." There's no
"collection-title" or "original-title."

I'd be willing to grant some of your points if these keys existed, but
they do not.

>> Here's some info on Microsoft's bib format for OOXML, that will give
>> you some info:
>>
>>
>> <http://community.muohio.edu/blogs/darcusb/archives/2006/09/05/open-xml-draft-14>
>
> It seems relatively straight-forward technically to extend BibTeX with the
> field types from OOXML that BibTeX doesn't cover. The main issue seems to be
> the bikeshed of what names to use.

And the social issue of who decides, and how.

>> Here's the type schema for CSL (though it needs work, and we
>> de-emphasize this for formatting in any case; CSL is oriented towards
>> output formatting only really):
>>
>>
>> <http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/csl-types.rnc?view=markup>
>>
>> Here's the variable list:
>>
>>
>> <http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/csl/schema/branches/split/csl-variables.rnc?revision=941&view=markup>
>
> I don't see a fundamental reason why the BibTeX vocabulary couldn't be
> extended with stuff from there.

Again, it could, but why force this?

>>>>       • The BibTeX extensibility model puts a rather large burden on
>>>> inventing new properties to accommodate data not in the core model. For
>>>> example, the core model has no way to represent a DOI identifier (this
>>>> is no surprise, as BibTeX was created before DOIs existed). As a
>>>> consequence, people have gradually added this to their BibTeX records
>>>> and styles in a more ad hoc way. This ad hoc approach to extensibility
>>>> has one of two consequences: either the vocabulary terms are understood
>>>> as completely uncontrolled strings, or one needs to standardize them. If
>>>> we assume the first case, we introduce potential interoperability
>>>> problems.
>>>
>>> In practice, those problems have already been introduced. For some reason
>>> I don't understand, there's an existing pattern of calling a field 'doi'
>>> but putting an absolute URI in the value. (As opposed to using a field
>>> name 'url' or a value that contains only the DOI-significant part.)
>>
>> The point is, when you get beyond dealing with secondary literature
>> (the domain of BibTeX and the sciences), the range of possible data
>> expands significantly. Things can get really complicated.
>>
>> Consider what's actually pretty simple comparatively:
>>
>> An English translation of a "classic" work. You often need original
>> publication information such as title (in the original language),
>> publisher and issued date, etc.
>>
>> With a flat model, you have to invent new properties to accommodate
>> every little exception like this.
>
> What formats/software do people use for cases like that in practice?

Well, just to be clear, as I said above, you can certainly invent flat
keys for this use case by doing what biblatex does: origpublisher,
origtitle, etc.

In the case of Zotero and BIBO, you just use a dc:isVersionOf relation
to the original.
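
As a rough illustration of the difference, here is a short Python sketch
that derives biblatex-style orig* keys from a nested isVersionOf
relation. The record values are placeholders, not a real work, and
flatten_original is a hypothetical helper named for this example only.

```python
# Hypothetical record for a translation; every value is a placeholder.
translation = {
    "title": "Example Title (English translation)",
    "publisher": "Example Press",
    "date": "2005",
    # One generic relation pointing at a full description of the original.
    "isVersionOf": {
        "title": "Example Original Title",
        "publisher": "Example Original Publisher",
        "date": "1905",
    },
}


def flatten_original(item, prefix="orig"):
    """Copy the item and turn its isVersionOf record into prefixed flat keys."""
    flat = {key: value for key, value in item.items() if key != "isVersionOf"}
    for key, value in item.get("isVersionOf", {}).items():
        # e.g. "title" -> "origtitle", "publisher" -> "origpublisher"
        flat[prefix + key] = value
    return flat


flat = flatten_original(translation)
```

The relational form needs no new vocabulary: the original is described
with the same properties as any other item. The flat form needs one
extra key per property per kind of relation.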

>>>> If we assume the second, we have an organizational and process problem:
>>>> that the WHATWG and/or the W3C, neither of which has expertise in this
>>>> domain, become the gate-keepers for such extensions. In either case, we
>>>> have a rather brittle and anachronistic approach to extension.
>>>
>>> Problems of this nature haven't stopped the WHATWG in the past. :-)
>>>
>>>>       • The BibTeX model conflicts with Dublin Core and with vCard, both
>>>> of which are quite sensibly used elsewhere in the microdata spec to
>>>> encode information related to the document proper. There seems little
>>>> justification in having two different ways to represent a document
>>>> depending on whether it is THIS document or THAT document.
>>>
>>> When you are referring to THAT document, you generally want the names of
>>> the authors--not their full business cards. Therefore, vCard is an
>>> overkill, and conversion to .bib is more useful than conversion to vCard
>>> for this use case.
>>
>> Well, vCard is just an example of a structured representation; in
>> BIBO, we prefer to recommend FOAF. The point is simply that authors
>> and other contributors are not strings; they're people (and sometimes
>> organizations).
>
> What software currently supports FOAF in bibliographies?

How is that relevant? No bib software supports microdata either.

>>>> My suggestion instead?
>>>>       • reuse Dublin Core and vCard for the generic data: titles,
>>>> creators/contributors, publisher, dates, part/version relations, etc.,
>>>> and only add those properties (volume, issue, pages, editors, etc.) that
>>>> they omit
>>>
>>> This would make conversion to and from the dominant bibliography format
>>> (.bib) more complex.
>>
>> BibTeX is NOT "the dominant bibliography format." This is exactly part
>> of my point in this.
>>
>>> Furthermore, there's a risk of a GIGO effect where the
>>> conversion can't be done algorithmically. (IIRC, you can't
>>> algorithmically map a .bib author name to the vCard name structure
>>> without a huge dictionary of names.)
>>
>> Both FOAF and vCard have unstructured personal name properties
>> (foaf:name and v:fn) that address this.
>
> But vCard required both N and FN, so if you only have FN, you can't get an N
> without a lot of dictionary-based domain knowledge and special rules. (Or
> you can make a GIGO N...)

Hmm ... that's not how it's implemented in hCard.

But as I said, I'm not attached to vCard, nor do I think this is the
most important point.

>>>>       • make it possible for people to interweave other, richer,
>>>> vocabularies such as bibo within such item descriptions. In other words,
>>>> extension properties should be URIs.
>>>>       • define the mapping to RDF of such an “item” description; can we
>>>> say, for example, that it constitutes a dct:references link from the
>>>> document to the described source?
>>>
>>> How are these useful for conversions to and from the incumbent format
>>> (BibTeX)? (Only BibTeX is supported by all of Google Scholar, the ACM
>>> Portal, Stanford Spires, NASA ADS at Harvard and Citebase.org. The three
>>> last ones being databases that arXiv seems to delegate to.)
>>
>> All of these examples are either from the sciences (they certainly
>> don't represent the humanities or law.), or deal exclusively with
>> secondary scholarly literature.
>
> Maybe there are different needs for humanities and law. I don't know, though
> I'm skeptical. Is there one dominant format for humanities and one dominant
> format for law? (I notice that ACM and Google have EndNote in common in
> addition to BibTeX.)

You're ducking my more important point here. You bring in evidence
about the data that these services export about the documents they
index. I am simply saying that this has nothing to do with the entries
in those documents' bibliographies. So apples and oranges.

If those services had structured representations of those
bibliographic entries (which you would expect to see over time with
RDFa and/or microdata), BibTeX alone would not be able to represent
them in huge swaths of work outside science and technology.

Consider this book of mine:

<http://bruce.darcus.name/publications/books/boundaries-of-dissent/>

As a book, BibTeX can represent it fine.

But if you were to look at the bibliography, you'd see it includes
legal cases, and films, and archival documents, none of which BibTeX
supports out of the box (with the default keys and types).

> It doesn't make sense to adopt something less established in order to avoid
> favoring sciences.

It's worse than this: adopting BibTeX is actively excluding other
fields. That's the kind of bias that I should think would be
unacceptable. It's exactly why I think RDFa is the better solution here.

Let's step back a bit and break this down.

Dublin Core is arguably more widely used and supported than BibTeX as
a generic metadata format. This is presumably why Ian includes
language in the spec that says the title of an HTML5 document is a
dc:title.

The core of what I'm suggesting is simply using DC consistently: for
both documents, and references to other documents.

> That is, it may turn out that some fields need a format
> that is less flat than BibTeX, but offering that kind of generality where
> the flatness of BibTeX works seems to be the kind of complication that only
> makes people stick to the simpler thing they already have, i.e. BibTeX.

Again, you and Ian are both generalizing in unwarranted ways. We're
not talking about "people" in general here: we're talking about a wide
array of potential users, from a wide array of different communities.

Your "people" here are simply BibTeX users, who are a small minority of
bibliographic users in the grand scheme of things. Surely this effort
should not be privileging those users alone (who just by coincidence
happen to be the people making the decisions here)?

In my field, anecdotally, I'd guess that 99% of my colleagues have
never even heard of BibTeX, much less used it. Instead, they use
Endnote or Zotero or RefWorks to manage their bibliographic data and
format their documents, or they do it themselves, by hand (more common
than most realize).

For them, the important formats to import are typically labeled
"Endnote" (which is usually really Refer or RIS).

Or consider the user or developer who can't figure out how to
represent their data in bibtex-in-html5 because its designers simply
didn't consider it. In that case, "people" go elsewhere, or invent
their own solutions.

>> So if we're talking about HTML5 and
>> the microdata proposal, the conversion would be from DC to BibTeX.
>
> Is conversion from DC to BibTeX well-defined? Wouldn't it open all the same
> issues that extending BibTeX vocabulary involves? What bibliography
> generators support DC as source data?

Zotero and similar services and applications ingest a whole lot of
web data, including DC. They also export that data to BibTeX, and use it to
produce formatted bibliographies.

In any case, I reject the idea that BibTeX should have some special
place in this conversation.

Bruce