Out of the Secret Garden: The RDA/DC Initiative

Submitted by Karen G. Schneider on June 21, 2007 - 9:30pm

(If you're at ALA Annual Conference while you're reading this, the RDA Update Forum is Saturday, June 23, 4:00-5:30 at WCC 206.)

"Libraries have lost their place as primary information providers, surpassed by more agile (and in many cases wealthier) purveyors of digital information delivery services. Although libraries still manage materials that are not available elsewhere, the library's approach to user service and the user interface is not competing successfully against services like Amazon or Google."

-- Karen Coyle and Diane Hillmann, "Resource Description and Access: Cataloging Rules for the 20th Century"

You may not think you care about AACR2 (Anglo-American Cataloguing Rules, second edition) or its successor, RDA (Resource Description and Access). That may seem like boring old-school stuff, not nearly as fun or glitzy as romping in Second Life or, as I am wont to do, posting the details of my afternoon snack on Twitter.

But the next time you complain about the limitations of library data—the gazillions of records we have created about the physical items in our libraries—and wonder why none of the cool new applications leverage the millions of library records shared worldwide, or why your expensive catalog can't integrate with a nifty new social software tool, or why there's no Google mashup to connect readers and books, consider this: to a large extent, it's because our data suck.

Not only that, it's our fault our data suck. Fixing this problem is not simply a matter of pointing at library vendors and saying, "Do better!" In many cases, vendors aren't doing too badly, considering what they have to deal with: our funky, inexplicable, old-fashioned, library-specific data that are the product of our cataloging rules.

We have built a mighty empire filled with standards and rules such as AACR2 and MARC that, long before the rest of the world was online, allowed us to do some amazing things within and across our institutions. If you've ever watched an interlibrary loan librarian buzz through hundreds of libraries with the flick of a wrist, hunting down a book for a patron, you have some idea of how important and powerful our standards have been for us.

Given the potency of library data, it's not surprising that there are many communities online that have expressed interest in our enormous data sets. But we can't share our data with them (let alone explain it to ourselves half the time), because our library data are plagued by an aging, conflicted, poorly elucidated witch's cauldron of practices that are written down on paper but are not embedded within the structure of our data.

Double, double, toil and trouble

[Image: Another day in Tech Services]

Because of this, cataloging is not so much a science as a dark art, driven by informal, implicit understandings rather than clear schema and vocabularies. MARC, despite its name, is only nominally machine-readable, and is not easily usable within the context of modern programming languages. People outside of library software programming have never seen anything like it. It's not all that human-readable, either, as this 045 field demonstrates:

045 2#$bd186405 $bd186408 

Did you catch that this means May – August, 1864?
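Just to make the point concrete, here is a minimal sketch in Python of what a program has to be told before it can make sense of that field. The decoding rule (an era code, 'c' for B.C. or 'd' for A.D., followed by year, month, day, and hour) lives in the MARC documentation and in catalogers' heads, not in the data itself.

def decode_045b(value):
    """Decode a MARC 045 $b value such as 'd186405' into readable form."""
    era = {"c": "B.C.", "d": "A.D."}.get(value[0], "?")
    year = value[1:5]
    month = value[5:7]
    months = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]
    if month:
        return f"{months[int(month) - 1]} {year} ({era})"
    return f"{year} ({era})"

# The range encoded in the field above:
print(decode_045b("d186405"), "-", decode_045b("d186408"))
# May 1864 (A.D.) - August 1864 (A.D.)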

Even worse, as Karen Coyle and Diane Hillmann warned us earlier this year in an article with the sotto voce humorous subtitle, "Cataloging Rules for the 20th Century," RDA, rather than pushing us toward cataloging rules compatible with 21st-century requirements, repeats many of the anachronisms found in earlier editions of AACR.

The most profound limitations of RDA to date have to do with its lack of support for machine-manipulable data--that is, data that can be read and processed by computers. RDA may be ponderous—the latest draft proposes 14 chapters and 4 appendices, with a couple of chapters weighing in at over 120 pages--but like the giant reptiles that died out millions of years ago, it does not make up for its girth with intelligence.

[Image: Squeeze this onto the Semantic Web!]

Coyle and Hillmann cite the mixed language for "number of units," pointing out that phrases such as "12 posters" are not easily machine-readable, and that many of the rules are still based on the "linear, card-based model" that, incredibly, continues to be the foundation of modern cataloging. One of the most telling anachronisms in RDA is its continuation of notions such as "primary" and "secondary," which, as Coyle and Hillmann point out, are concepts designed for effective use of space on a 3 x 5 card. What possible relevance do "primary" and "secondary" have in the online world, where all access points are created equal?
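To see what they mean, consider what a program has to do with a phrase like "12 posters." The sketch below is illustrative only (the field names are mine, not RDA's): it shows the guessing a machine is forced into when the count and the unit are fused into prose instead of recorded as separate, defined elements.

import re
from dataclasses import dataclass

@dataclass
class Extent:
    count: int   # how many units
    unit: str    # ideally a term drawn from a controlled carrier vocabulary

def parse_extent(statement):
    """Best-effort scraping of a human-readable phrase such as '12 posters'."""
    match = re.match(r"\s*(\d+)\s+(\w+)", statement)
    if not match:
        raise ValueError(f"Cannot interpret extent statement: {statement!r}")
    return Extent(count=int(match.group(1)), unit=match.group(2).rstrip("s"))

print(parse_extent("12 posters"))           # Extent(count=12, unit='poster')
print(parse_extent("1 stereograph wheel"))  # Extent(count=1, unit='stereograph'): the multi-word unit is already mangled

Record the count and the unit separately, against a defined vocabulary, and none of that guessing is necessary.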

In other words, RDA keeps library data in a walled garden, barely manipulable by our own complex tools and unusable outside the library community.

Though we can commend our profession for being out there early in the world of online sharing—MARC, in its heyday, was an amazing invention—we have to admit that developers worldwide are not flocking to our obscure, poorly articulated standards. Talk to libraries struggling to implement any "cool tool" from outside the library universe, whether Endeca, FAST, or Siderean, or something as simple as describing a library record with a URI, and you'll see we're the odd ones out, trying to fit our square pegs into the world's round holes.

Meanwhile, as software engineers worldwide build applications that acknowledge the typical web user's discovery workflow--which begins with a search engine--we in LibraryLand must plead with, lure, and "educate" people to cross the moat and go through the thick doors of our proprietary library databases--never mind enabling ourselves and others to do powerful and interesting things with our data.

Tunneling out of the walled garden

But on May 1, 2007, the moon was in the seventh house and Jupiter aligned with Mars. At least, that's how it appeared to catalogers and other metadata mavens when they learned that the Dublin Core and RDA communities had agreed to pull library data out of its silo and into the Semantic Web.

That's not exactly how the agreement was described at the meeting, but before I start unfolding the catabiblish (that is, librarian language specific to catalogers), some background information is in order.

The concept of the Semantic Web should come naturally to librarians. Wikipedia (so help me) says that "the semantic web is an evolving extension of the World Wide Web in which web content can be expressed not only in natural language, but also in a form that can be read and used by software agents, thus permitting them to find, share and integrate information more easily."

[Image: Harry Potter ponders RDA]

Web pages are designed to be read by people, not machines. Imagine a child seeking a copy of Harry Potter and the Deathly Hallows at her local library. Let's pretend our library data were unambiguous, explicit, and truly machine-readable. The information about that Potter book, rather than being hidden behind the walls of outdated library lingo, could be read by computers and presented on the screen. In other words, a child looking for a book would be able to search the Web and find that book within the larger context of "I am searching the Web for things that interest me," rather than interrupting her workflow, exiting the Web proper, and entering searches into the library-specific databases we call OPACs.
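What might "truly machine-readable" look like? Here is one hedged sketch, using Python's rdflib library and Dublin Core terms; the identifier URI is made up, and a real record might use an ISBN or a catalog permalink instead.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

# A hypothetical identifier for the book; any stable URI would do.
book = URIRef("http://example.org/catalog/harry-potter-deathly-hallows")

g = Graph()
g.add((book, DCTERMS.title, Literal("Harry Potter and the Deathly Hallows")))
g.add((book, DCTERMS.creator, Literal("Rowling, J. K.")))
g.add((book, DCTERMS.issued, Literal("2007")))

# Any agent that understands RDF and Dublin Core can read this description;
# no knowledge of MARC or AACR2 is required.
print(g.serialize(format="turtle"))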

You may be wondering why the Semantic Web is necessary. Why not just export our catalog to the Web, or make a Web page for every record? But this is where we librarians know something about the universe worth sharing with others. Simply exporting our data to the Web would be to turn our back on the very important work catalogers have contributed to librarianship (and really, the world) by thinking structurally about data in the first place. Where the world sees primordial soup, we see well-chiseled points of description. It's not that important that we thought up the "title proper"; it's really significant that we know why it's important to have that data in fields in the first place.

We know order matters. Re-expressing our data so they can be read by the Semantic Web is an avenue for retaining that which is good about our view of data--that metadata and structure are useful and meaningful, enrich the discovery process, and (theoretically) allow us to play well with others--while leaving behind the weak, antiquated, solipsistic characteristics of our encoding practices.

It could well be that positioning RDA so it is compatible with 21st-century standards won't just make our data more explicable and usable; it could be what saves us as a profession, by clarifying to the world that we contribute a body of thought to information science that truly matters.

Free Harry Potter

How do we get Harry Potter out of the garden and onto the Web? The two communities (RDA and Dublin Core) have agreed to work together to accomplish the following:

Make our data structure explicit and machine-readable. The RDA people call this "developing an RDA Element Vocabulary." Think of it as "putting our data structure in standard, consistent recipes that computers know how to cook." Right now, even when our standards are in writing, they are not easily used by computers. If you've ever worked in a library run by unwritten rules that new staff found hard to interpret, you know the problem with not having explicit data structures.

Cataloging is a demanding skill, but we make it even harder than it should be by not being fully explicit about our data sets. Try to find a URL leading to an explicit definition of "title proper." It's all buried in the heads of catalogers, who, brilliant mavens that they are, need to follow the advice of human-computer interaction expert Donald Norman and put their information "in the world"--and not just for human consumption, but also so our data can be understood more broadly, within the framework of the Semantic Web.
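As a rough illustration (not the official RDA registration, and with a placeholder namespace and wording), here is what one entry in an element vocabulary might look like once it is put "in the world": a property with its own URI, a label, and a definition that software can fetch.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace; the real vocabulary would live at a stable, agreed-upon URI.
RDAE = Namespace("http://example.org/rda/elements/")

g = Graph()
g.add((RDAE.titleProper, RDF.type, RDF.Property))
g.add((RDAE.titleProper, RDFS.label, Literal("title proper", lang="en")))
g.add((RDAE.titleProper, RDFS.comment,
       Literal("Placeholder wording: the chief name of a resource as presented on the resource itself.",
               lang="en")))

print(g.serialize(format="turtle"))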

Clarify our terms. The catabiblish for this is "expose RDA value vocabularies." We have a lot of very specific and yet undefined language in our cataloging framework.  Rather than explaining our language explicitly, we share this knowledge through education and practice, creating impossibly high hurdles for people outside our profession (or for any non-cataloging librarian) to fully understand our terms.

For example, Chapter 3 of the current draft of RDA lists "carriers" such as computer chip cartridge, microfilm slip, and stereograph wheel. But these terms aren't explained or defined, only listed; their meanings aren't self-evident. People have to be taught to implement these terms properly--the sign of a system that isn't explicit. (It doesn't make us "smart" that so much of our knowledge is implicit and is not formally explicated.) We need to explain what we mean by these and other terms so that others (including the next generation of catalogers—and the next generation of software) can understand them.
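One way to "expose" such a value vocabulary (again a sketch, with placeholder URIs and wording) is to publish each carrier term as a concept with its own URI, label, and definition, for example using SKOS:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

CARRIER = Namespace("http://example.org/rda/carriers/")   # hypothetical namespace

g = Graph()
g.add((CARRIER.microfilmSlip, RDF.type, SKOS.Concept))
g.add((CARRIER.microfilmSlip, SKOS.prefLabel, Literal("microfilm slip", lang="en")))
g.add((CARRIER.microfilmSlip, SKOS.definition,
       Literal("Placeholder definition: a short strip of microfilm.", lang="en")))

print(g.serialize(format="turtle"))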

Describe what we're trying to do. That's done through developing an Application Profile, or AP, which serves as a kind of letter of instruction for conveying intention, building documentation, and enabling interoperability. An AP declares "which metadata terms an organization, information resource, application, or user community uses in its metadata." The AP doesn't tell others how to use our data elements; it just makes them reusable, ensuring that when we go to exchange data we understand the basics behind each other's records.
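In spirit (and in a deliberately toy form, not the formal Dublin Core machinery for application profiles), an AP is little more than a declaration like this; the first term URI is the real Dublin Core "title" term, the second is hypothetical.

# Which terms we use, whether they are required, and where their values come from.
application_profile = {
    "http://purl.org/dc/terms/title": {
        "obligation": "mandatory",
        "repeatable": False,
        "values": "a literal string",
    },
    "http://example.org/rda/elements/carrierType": {   # hypothetical term URI
        "obligation": "mandatory",
        "repeatable": True,
        "values": "URIs from the carrier value vocabulary",
    },
}

for term, rules in application_profile.items():
    print(term, "->", rules["obligation"])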

I'm not going to go in depth about how the AP should be based on FRBR (Functional Requirements for Bibliographic Records) and FRAD (Functional Requirements for Authority Data), because if you're a cataloger you probably already "get it," and if you aren't, your head will explode. But part of the reason we're moving away from AACR2 in the first place is that our rules and practices stand in the way of doing some things that have become important since the late 1940s, such as making it easy for OPAC displays to group like items, so that a book will appear next to its CD, DVD, large print, and online versions.
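For what it's worth, the grouping itself is trivial once records carry an explicit, shared work identifier, which is exactly what they lack today. A toy sketch, with made-up records:

from collections import defaultdict

# Hypothetical records that each carry an explicit work identifier.
records = [
    {"work": "deathly-hallows", "carrier": "book"},
    {"work": "deathly-hallows", "carrier": "audio CD"},
    {"work": "deathly-hallows", "carrier": "DVD"},
    {"work": "deathly-hallows", "carrier": "large print"},
    {"work": "deathly-hallows", "carrier": "online"},
]

by_work = defaultdict(list)
for record in records:
    by_work[record["work"]].append(record["carrier"])

for work, carriers in by_work.items():
    print(work, "->", ", ".join(carriers))   # one display line per work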

Not everyone is wild about Harry

For those of us not acquainted with the cataloging world, moving RDA to a Semantic Web model doesn't sound threatening. Isn't this an improvement? Don't we want to play well with others? But the idea of change stirs fear in some hearts (some of them fairly highly placed in the ALA hierarchy, by the way), which explains why the May 1 RDA/DC agreement was historic.

One rumor is that the plan is to dumb down library data and put catalogers out of work. The Dublin Core Metadata Initiative (DCMI) is partly to blame for this misconception; people are more familiar with the famous "15 elements" used for Simple Dublin Core, and that has raised fears that the ulterior motive is to move us to a simple cataloging model based around this limited element set. 

But the reality is that Dublin Core can support very robust schema, and Dublin Core is in many ways incidental to this discussion anyway. It's simply a building-block model for getting our cataloging language modernized, structured, explicit, and usable by others. The significance of the RDA/DC agreement is that the Dublin Core Metadata Initiative has been very involved in attempting to think through what interoperability really means, including, but going beyond, the Semantic Web. It's simple but powerful: in sum, it pretty much boils down to formal expression to limit the ambiguity of language, and URIs for identification.

Hug a cataloger today

The key here is to understand that the RDA/DC agreement, if it leads to the actions above--and people such as Diane Hillmann and Gordon Dunsire are working at top speed on this initiative--will ultimately make it possible to get over that moat and get our data out onto the Web in new, interesting, findable, and user-friendly ways, without abandoning our classic commitment to the enrichment of information--and in fact, while demonstrating proof of concept for why we are committed to these practices in the first place. Whether we succeed or fail in this effort may well determine the future of our profession.


Comments (8)


If you're reading this far, note that in the editing of this piece, 'data structure' changed to 'data.' We have great data. We just don't have good data structure. Sigh.


As a cataloging non-cataloger (a high school librarian), I appreciate the comment in Karen's post about the dark science of cataloging: 'Because of this, cataloging is not so much a science as a dark art, driven by informal, implicit understandings rather than clear schema and vocabularies.' I really think the catalog, whatever form it takes, is the center of the library, at least in terms of information seeking.

However, I'm constantly running up against making decisions in cataloging about specific little pieces which I don't understand and which are not explained very well anywhere (that I'm aware of). I am hoping (and assume) that somehow or another AACR2's replacement and MARC (or other) formats will some day be indistinguishable. Right now it is not easy for a cataloging non-cataloger to make the right choices about formatting a record in too many cases, e.g. cataloging websites, podcasts, etc. We do the best we are able, and many are scared of doing anything for fear of getting it 'wrong.'

I've been in the profession since 1974 and was barely aware of MARC in library school. I've trained myself over the years and have trained others to use MARC as effectively as possible. I'm now installing AquaBrowser on my school's library server in the hope that it will help our patrons find what they are seeking in a more intuitive way. But I know we've got a long way to go, and the school library vendors don't really seem to be in the forefront of leadership in this area.

I try to keep up with the field because I believe change is necessary and I want to know what's happening now. I am doing a short presentation at IASL in Taiwan next month on OPAC and Web 2.0 integration, and I've found it's a moving target at this point. If we can just get school-level practitioners to pay attention, with all they've got on their plates, we'll be doing very well indeed.



Very interesting post. At Netflix, I'm spending a fair amount of time with issues like 'title proper'. I'm not going to hold my breath for some giant electronic brain to grok the semantic web for me, but I really would appreciate it if the info 'buried in the heads of catalogers' were exposed in human-readable form, with examples. I'll even go to the 025 section of the city library, but web pages would be handier. I've read bits of AACR2 (the part on people's names) and it is quite an accomplishment, light years beyond the first-name, last-name fields that programmers continue to write. Heck, even LDAP knows better than that. I'd like libraries to catch up with the rest of the world in data handling, and the rest of the world to catch up with libraries in data modeling and the care of data.

The practice is the thing, much more than the schema. In the Atom working group, we agreed on an 'updated' date for items, but then argued for months about what it meant. Finally, we decided that it meant exactly what the publisher thought it meant. Flagging something as a visible change is a publishing decision. Using Dublin Core elements wouldn't have made that discussion any shorter. RFC 4287 says it this way: 'The atom:updated element is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed atom:updated value.' That is the kind of context you need for effective data interchange and, I think, the kind of thing that is buried in catalogers' heads.


To truly improve interoperability, what we need are not new 'library' standards but a modification to something that already exists. Take, for instance, PRISM (http://xml.coverpages.org/prism.html), which incorporates the Dublin Core Initiative, or the ONIX XML format maintained by Editeur (http://www.editeur.org/). Do we have 'special concerns'? Of course we do, but those can be handled by an extension to a standard. The Article-level interest group is working on such an extension for ONIX, which, by the way, the TOCROSS Project (http://www.jisc.ac.uk/whatwedo/programmes/programme_pals2/project_tocros...) used to show the feasibility of using 'RSS to automate the population of OPACs with details of journal articles, without the need for manual cataloguing, classification or data entry.'


Karen, I wonder if the real key to the whole cataloging enterprise is not so much that 'we know why it's important to have that data in fields in the first place', but rather the concept of vocabulary *consistency*, which I explicitly use to avoid the dreaded words, 'authority control'! (It is also about a good liberal arts education, which at times is especially necessary when it comes to choosing helpful, detailed subject headings - but this is another topic.) You say that the Dublin Core Metadata Initiative 'boils down to formal expression to limit the ambiguity of language, and URIs for identification.' It seems like *everyone*, including scholars, who uses faceted-search catalogs like Endeca - which take greater advantage of current LCSH subject heading practice - seems to really like them (as long as there is a prominent option to browse pre-coordinated subject heading lists as well). But without formal authority control over personal names or subjects - but rather URIs - how will we be able to use things like LCSH data in things like Endeca? Without the aspect of authority control (and careful, helpful, well-thought-out subject headings), it seems to me that cataloging is a glorified data entry job. Am I missing something big here? By the way, are there any semantic web examples that are currently operable on the web that I could check out?


What an excellent post! I have 15 years of experience as a database specialist in fields such as health care, finance, real estate, and higher education, but when I started working in a library two years ago, I was mystified by the inaccessibility of the data. I applaud the effort to bring standards of interoperability to library information.


As long as one reads this entire post, Karen's comments are insightful, rational, and ultimately helpful. But I confess that it was very hard for me to get beyond the hyperbole at the beginning, and I wonder how many others will walk away with an incomplete idea of the nature of this post. The sound bite has a Web analog, and if the wrong portions of this piece are lifted, a completely wrong idea is conveyed. Karen's contributions to the dialog are legion, and important; but, in this case, the juxtaposition of hyperbole and thoughtful essay may not serve the community well. Let's not let the need for a provocative hook sabotage content.