[whatwg] on bibtex-in-html5

Wed Jun 3 07:24:46 PDT 2009

On Tue, Jun 2, 2009 at 12:05 PM, James Graham <jgraham at opera.com> wrote:
> Bruce D'Arcus wrote:
>>
>> So exactly what is the process by which this gets resolved? Is there one?
>
> Hixie will respond to substantive emails sent to this list at some point.
> However there are some hundreds of outstanding emails (see [1]) so the
> responses can take a while. If you have a pressing deadline that would
> benefit from your issue being addressed sooner, I suggest you talk to Hixie
> about it.

No problem; I just wanted to know how things worked here. Thanks.

> FWIW I have a few general thoughts about the bibtex section which may or may
> not be interesting:
>
> 1) It seems like this and similar sections (bibtex, vCard, iCalendar) could
> be productively split out of the main spec into separate normative
> documents, since they are rather self-contained and have rather obvious
> interest for communities who are unlikely to find them at present or to be
> interested in the rest of the spec.

+1 to splitting them off.

I think there's still an open question, however, about whether any of
these—and particularly the bibliographic one (at least as it's
currently specified)—should be normative. I don't believe they should
be.

But, moving on ...

> Although the drag and drop stuff being
> dependent on them does mean that you'd need some circular references.
>
> 2) For the bibliographic data the most important issues that I see are ease
> of use and ease of export. Although I am not attached to the bibtex format
> per-se I would be extremely disappointed if a different, harder to author,
> format were used. Formats that are flexible but rarely used are less useful
> overall than more limited formats with ubiquitous deployment. In addition
> formats that are hard to use make it more likely that people will make
> accidental mistakes, so decreasing the reliability of the data and devaluing
> tools that consume the data.
>
> Although I don't think we have to use bibtex as the basis for the format, I
> do think a canonical mapping to bibtex is a requirement. Obviously this
> reflects my background in the physical sciences but, at least in that field
> LaTeX and, by association, bibtex are overwhelmingly popular. I am well
> aware that the situation in other fields is different but without clean,
> high fidelity, bibtex export (at least to the extent required to support
> common citation patterns within the physical sciences) the format will lose
> out on a large audience with a higher than average number of potential early
> adopters.

Fair enough; all I'm saying is the same deference should be paid to
other research fields. The sciences for too long have dominated these
discussions, to the detriment of other fields. So I would hope we
could avoid that here.

Let's move on to a use case or two to illustrate the issues here.

Zotero is likely to be an early adopter of microdata as well,
certainly as a consumer of these data, and perhaps also as a producer.

<http://www.zotero.org/>

Zotero is a Firefox extension that can import and export BibTeX, among
a variety of other formats (RIS, MODS, and the new BIBO/DC RDF work,
which is its primary format). It includes a number of components that
allow citation and document metadata to be extracted, and later
republished.

So, for example, a user is browsing the web, and they are reading this
article from the NY Times.

<http://www.nytimes.com/2009/06/03/world/asia/03military.html>

Zotero has a "translator" (basically, a dedicated screen-scraper) for
the NY Times, and so the user can simply click an icon in their
toolbar to extract the metadata into their database.  They can then
later cite it in their own documents, and Zotero will be responsible
for correctly formatting those citations and bibliographic entries.

So I have questions on this use case:

1) how do these data about the article get encoded in microdata in
such a way that Zotero (or any other similar tool) doesn't have to
continue to write and maintain dedicated translators for every site?
E.g. how should the newspaper article metadata be encoded?

The assumption that bibtex is only for bibliographies seems to leave
this case out. Instead, the current draft of the spec tells us the title
of the document corresponds to dc:title, and not much else.

My argument is to beef up the ability to describe documents in
general.* In strawman pseudo-code:

title = doc.title
type = doc.type
source = doc.isPartOf.title # or if not dc:isPartOf, something similarly generic
issued = doc.issued
creators = doc.creator
print creators[0].name **

I.e., don't pretend that document metadata is different from
bibliographic metadata. The latter is simply a reference to the former
(usually; there are some exceptions where people cite events).
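
As a rough illustration only, the strawman above could be sketched in
Python along these lines. The class and field names (Agent, Document,
is_part_of, summarize) and the sample values are hypothetical, chosen
to mirror the pseudo-code; they come from no spec or vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """A contributor, kept structured instead of a bare name string."""
    name: str                       # display form
    family: Optional[str] = None    # hypothetical: family name, if any
    given: Optional[str] = None     # hypothetical: given name, if any
    is_organization: bool = False   # organizations have no name parts


@dataclass
class Document:
    """Generic description usable for a page or for a cited reference."""
    title: str
    type: str                                   # e.g. "newspaper-article"
    issued: Optional[str] = None                # ISO date string (assumed)
    creator: List[Agent] = field(default_factory=list)
    is_part_of: Optional["Document"] = None     # the containing document


def summarize(doc: Document) -> dict:
    """Collect the same values the strawman pseudo-code reads."""
    return {
        "title": doc.title,
        "type": doc.type,
        "source": doc.is_part_of.title if doc.is_part_of else None,
        "issued": doc.issued,
        "first_creator": doc.creator[0].name if doc.creator else None,
    }


# Placeholder values, not taken from any real article.
paper = Document(title="Example Newspaper", type="newspaper")
article = Document(
    title="Example Headline",
    type="newspaper-article",
    issued="2009-06-03",
    creator=[Agent(name="Jane Doe", family="Doe", given="Jane")],
    is_part_of=paper,
)
print(summarize(article))
```

The same Document shape would serve both for the page being described
and for an entry in someone else's bibliography.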

2) If Zotero consumes these data and then the user cites it in their
document, and elects to "export to HTML5", how should that same
newspaper article data be encoded in the bibliography?

BibTeX isn't terribly helpful; example:

<http://www.mail-archive.com/lyx-users@lists.lyx.org/msg42082.html>

Newspaper articles are cited a LOT; they're all over the place on
Wikipedia. And this doesn't even get into patents, or hearing
transcripts, or legal opinions, or films. We need to be able to
represent all of these, and bibtex is of little help here.

My argument:

The newspaper metadata should be encoded in the bibliography the same
as they are in the document from which they were extracted; it's
artificial and arbitrary to treat them differently.

I don't believe such an approach is significantly harder to author,
nor to translate to bibtex (or RIS).
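
To give a feel for the translation step, here is a minimal sketch in
Python, assuming a generic record held as a plain dict. The record
keys, the type names in TYPE_MAP, and the choice to push a newspaper
article into @article with the newspaper in the journal field are all
assumptions for illustration. That mapping is lossy, and the sketch
does not escape special characters.

```python
# Hypothetical mapping from generic document types to BibTeX entry types.
TYPE_MAP = {
    "newspaper-article": "article",  # lossy: BibTeX has no newspaper type
    "journal-article": "article",
    "book": "book",
}


def format_name(creator: dict) -> str:
    """Render one contributor in BibTeX author syntax."""
    if creator.get("is_organization"):
        # Extra braces keep BibTeX from splitting an organization name.
        return "{" + creator["name"] + "}"
    if creator.get("family") and creator.get("given"):
        return f"{creator['family']}, {creator['given']}"
    return creator["name"]


def to_bibtex(record: dict, key: str) -> str:
    """Serialize a generic record as a BibTeX entry (no escaping done)."""
    entry_type = TYPE_MAP.get(record.get("type"), "misc")
    fields = {"title": record["title"]}
    creators = record.get("creators", [])
    if creators:
        fields["author"] = " and ".join(format_name(c) for c in creators)
    container = record.get("is_part_of")
    if container:
        slot = "journal" if entry_type == "article" else "howpublished"
        fields[slot] = container["title"]
    issued = record.get("issued")
    if issued:
        fields["year"] = issued[:4]  # assumes an ISO date string
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}"


# Placeholder values, not taken from any real article.
record = {
    "title": "Example Headline",
    "type": "newspaper-article",
    "creators": [{"name": "Jane Doe", "family": "Doe", "given": "Jane"}],
    "is_part_of": {"title": "Example Newspaper"},
    "issued": "2009-06-03",
}
print(to_bibtex(record, "doe2009"))
```

Going the other way, from bibtex back to the richer record, is where
information would be missing, which is the reason to start from the
generic description.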

3) BibTeX was invented before the web and HTML even existed.  So, for
example, you have very particular ways to identify resources in BibTeX
(a local id "key"). How does this relate to HTML and XHTML concepts
like @id, @cite, and @href, or the new microdata "about" property
(borrowed from rdf)?

Bruce

* This all assumes these sections will remain normative parts of the spec.
** Treating contributors as dumb name strings is fraught with
problems, the most obvious being non-Western (particularly Asian)
names, and organizational contributors.