[whatwg] on bibtex-in-html5

Wed Jun 10 04:57:38 PDT 2009

Am cc-ing the Zotero dev list just for posterity ...

On Wed, Jun 10, 2009 at 5:44 AM, Ian Hickson<ian at hixie.ch> wrote:
> On Wed, 20 May 2009, Bruce D'Arcus wrote:
>>
>> Re: the recent microdata work and the subsequent effort to include
>> BibTeX in the spec, I summarized my argument against this on my blog:
>>
>> <http://community.muohio.edu/blogs/darcusb/archives/2009/05/20/on-the-inclusion-of-bibtex-in-html5>
>
> | 1. BibTeX is designed for the sciences, that typically only cite
> |    secondary academic literature. It is thus inadequate for, and not
> |    widely used in, many fields outside of the sciences: the humanities and law
> |    being quite obvious examples. For this reason, BibTeX cannot by
> |    default adequately represent even the use cases Ian has identified.
> |    For example, there are many citations on Wikipedia that can only be
> |    represented using effectively useless types such as "misc" and which
> |    require new properties to be invented.
>
> We will probably have to increase the coverage in due course, yes.
> However, we should verify that the mechanism works in principle before
> investing the time to extend the vocabulary.

No; you should drop this proposal and move it to an experimental annex.

If you do insist, against all reason, on pushing forward with this
without modification, then I suggest you explain how this process of
extension will work. If, as I suspect, it'll be another case of a
centralized authority (you, who have admitted you really know nothing
about this space), then that's a deal-breaker from my perspective.

> | 2. Related, BibTeX cannot represent much of the data in widely used
> |    bibliographic applications such as Endnote, RefWorks and Zotero except
> |    in very general ways.
>
> If such data is important, we can always add support when this becomes
> clear.

Man this is frustrating.

> | 3. The BibTeX extensibility model puts a rather large burden on inventing
> |    new properties to accommodate data not in the core model. For example,
> |    the core model has no way to represent a DOI identifier (this is no
> |    surprise, as BibTeX was created before DOIs existed). As a
> |    consequence, people have gradually added this to their BibTeX records
> |    and styles in a more ad hoc way. This ad hoc approach to extensibility
> |    has one of two consequences: either the vocabulary terms are
> |    understood as completely uncontrolled strings, or one needs to
> |    standardize them. If we assume the first case, we introduce potential
> |    interoperability problems. If we assume the second, we have an
> |    organizational and process problem: that the WHATWG and/or the
> |    W3C -- neither of which have expertise in this domain -- become the
> |    gate-keepers for such extensions. In either case, we have a rather
> |    brittle and anachronistic approach to extension.
>
> I don't see any of this as a problem.

The problem, to repeat myself, is tied to the "we'll extend it as we
see fit" issue above.

The two biggest problems in BibTeX are two properties:

booktitle
journal

They're a problem because they're both horribly concrete/narrow, and
(arguably) redundant.

If those were instead replaced with something more generic like either:

1) publication-title

... or, better yet ...

2) a nested/related object (call it "publication" or "container" or "isPartOf")

... then extension becomes easier. If I need to encode a newspaper
article, then I just do:

title = Some Article
publication-title = Some Newspaper

... or (better, because I can attach other information to the container):

title = Some Article
publication = [ title = Some Newspaper ]

As is, you need to add stuff like this just to resolve the problems
I've repeatedly pointed out:

newspaper-title
magazine-title
court-reporter-title
television-program-title
radio-program-title

Aside: of course, some of the above could be collapsed into more
generic stuff like "broadcast-title", but I'm just following the same,
broken, approach as bibtex.
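
To make the contrast concrete, here is a minimal Python sketch of the two
shapes. The class and field names (Item, Container, kind) are made up for
illustration; they come from neither BibTeX nor the microdata draft.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    """Hypothetical 'publication' object: the thing an item is part of."""
    title: str
    kind: Optional[str] = None  # e.g. "newspaper", "journal" (illustrative values)


@dataclass
class Item:
    """Hypothetical cited item with one generic link to its container."""
    title: str
    publication: Optional[Container] = None


# Flat, BibTeX-style: each new kind of container needs its own property name.
flat = {"title": "Some Article", "newspaper-title": "Some Newspaper"}

# Nested: one generic relation, and the container can carry its own data.
nested = Item(
    title="Some Article",
    publication=Container(title="Some Newspaper", kind="newspaper"),
)


def container_title(item: Item) -> Optional[str]:
    """A consumer needs a single lookup, whatever the container kind is."""
    return item.publication.title if item.publication else None


print(container_title(nested))
```

With the flat shape, a consumer has to know every "*-title" key in advance;
with the nested shape, a new kind of container is just a new value.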

This stuff isn't theoretical, Ian. Just look through this Wikipedia
page, for example:

<http://en.wikipedia.org/wiki/Guantanamo_Bay_detention_camp>

The citations include references to legal cases and briefs, and news
articles (television, radio and print). Your proposal doesn't cover
this stuff.

OTOH, applications like Zotero can.

> | 4. The BibTeX model conflicts with Dublin Core and with vCard, both of
> |    which are quite sensibly used elsewhere in the microdata spec to
> |    encode information related to the document proper. There seems little
> |    justification in having two different ways to represent a document
> |    depending on whether it is THIS document or THAT document.
>
> I don't understand this point. Could you provide an example of this
> conflict?

Here's an academic article in an open access biology journal.

<http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000082>

By THIS document I mean the document proper and its own metadata:
here, the article titled "Accelerated Adaptive Evolution on a Newly
Formed X Chromosome."

The metadata about the documents referenced in text are included in
the bibliography. This is what I mean by THAT document.

My point, and this is an important one, is that one should be able to
use the same mechanism to describe both, while still being able to
distinguish them. I'd think this journal would insist on it.
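
As a rough illustration of "same mechanism, still distinguishable", here is
a Python sketch. The Work type and its references field are hypothetical
names, and the cited entry is a placeholder rather than a real item from
that article's bibliography.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """Hypothetical record: one shape for the page itself and for what it cites."""
    title: str
    references: List["Work"] = field(default_factory=list)


# THIS document: the article the page is about.
this_doc = Work(title="Accelerated Adaptive Evolution on a Newly Formed X Chromosome")

# THAT document: a cited source, described with the very same type.
that_doc = Work(title="(placeholder for one cited source)")
this_doc.references.append(that_doc)


def cited_titles(doc: Work) -> List[str]:
    """The two are told apart by the relation between them, not by vocabulary."""
    return [work.title for work in doc.references]


print(cited_titles(this_doc))
```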

> | 5. Aspects of BibTeX's core model are ambiguous/confusing. For example,
> |    what number does "number" refer to? Is it a document number, or an
> |    issue number?
>
> What's the difference? Why does it matter?

I can't find the example, but I've come across cases where one needed
both an issue and document number. Since I haven't cited it, though, I
guess you can leave it aside ;-).

> | My suggestion instead?
> | 1. reuse Dublin Core and vCard for the generic data: titles,
> |    creators/contributors, publisher, dates, part/version relations, etc.,
> |    and only add those properties (volume, issue, pages, editors, etc.)
> |    that they omit
>
> This seems unduly heavy duty (especially the use of vCard for author
> names) when all that is needed is brief bibliographic entries.

On what basis do you make this claim? "[A]ll that is needed" for whom?

I'll point out here that the article I link to above includes
affiliation information for the authors.

But this isn't the most critical point.

> | 2. typing should NOT be handled by a bibtex-type property, but the same way
> |    everything else is typed in the microdata proposal: a global
> |    identifier
>
> Why?

a) consistency; why introduce a new mechanism (from the standpoint of
microdata)?

b) flexibility (since I've made clear that bibtex is not adequate, and
I have no intention of relying on the WHATWG to determine what's
important)

> | 3. make it possible for people to interweave other, richer, vocabularies
> |    such as bibo within such item descriptions. In other words, extension
> |    properties should be URIs.
>
> This is already possible.

OK, possible; but hardly very easy. See above.

> | 4. define the mapping to RDF of such an "item" description; can we say,
> |    for example, that it constitutes a dct:references link from the
> |    document to the described source?
>
> The mapping to RDF is already defined; further mappings can be done using
> the "sameAs" mechanism.

How so? I'm asking: what's the relationship between the document and
the cited document?

> On Thu, 21 May 2009, Henri Sivonen wrote:
>>
>> The set of fields is more of an issue, but it can be fixed by inventing
>> more fields--it doesn't mean the whole base solution needs to be
>> discarded. Fortunately, having custom fields in .bib doesn't break
>> existing pre-Web, pre-ISBN bibliography styles. I've used at least these
>> custom fields:
>>
>> key: Show this citation pseudo-id in rendering instead of the actual id used
>> for matching.
>> url: The absolute URL of a resource that is on the Web.
>> refdate: The date when the author made the reference to an ephemeral source
>> such as a Web page.
>> isbn: The ISBN of a publication.
>> stdnumber: RFC or ISO number. e.g. "RFC 2397" or "ISO/IEC 10646:2003(E)"
>>
>> Particularly the 'url' and 'isbn' field names should be obvious and
>> uncontroversial additions.
>
> "url" seems widely supported and I included it. I haven't added any other
> fields yet; I imagine that once this feature gets traction, we'll have
> more direct data as to which fields would be most useful, and then we can
> see what common practices are in the bibtex world for those cases and use
> compatible mechanisms.
>
>
> On Thu, 21 May 2009, Bruce D'Arcus wrote:
>> Henri wrote:
>> > This doesn't mean that BibTeX is a bad basis. The set of types and
>> > fields is limited, though.
>>
>> It's limited, and it's flat.
>
> Right. That's a good thing. It makes the vocabulary more usable.

Again, where do you get this from? If you haven't considered my use
case, then it's NOT easy to use at all! How do Joe and Jane User
figure out how to encode a newspaper article in your proposal? This is
really basic.

Bruce