Editor’s Note: This post is an excerpt from Improving the Visibility and Use of Digital Repositories Through SEO, by Kenning Arlitsch and Patrick S. O’Brien. The authors, along with Montana State colleagues Jason Clark and Scott Young, will be teaching the online course/workshop Search Engine Optimization (SEO) for Libraries, which starts July 17.
Metadata schemas are powerful frameworks for organizing content, and libraries
have long used them to describe their holdings (think MARC). Numerous schemas
exist for academic disciplines: CDWA is used for art, Darwin Core for biology,
EML for ecology, DDI for social sciences, and so on. Dublin Core is probably the
most heavily used schema in digital libraries, and it is perfectly adequate for many
applications. The problem with any metadata schema is that most website
developers don’t use one at all, and search engines can’t count on the metadata
being applied consistently by those who do. The result is that general-purpose
search engines like Google tend not to use the metadata even where it is applied.
Some specialty engines, like Google Scholar, do make extensive use of metadata.
Google Scholar, however, wants metadata schemas that can express bibliographic
citations specifically and accurately, which Dublin Core does not do very well.
Because search engines crawl the web pages that are generated from databases
(rather than crawling the databases themselves), your carefully applied metadata
inside the database will not even be seen by search engines unless you write scripts
to display the metadata tags and their values in HTML meta tags. It is crucial to
understand that any metadata offered to search engines must be recognizable as
part of a schema and must be machine-readable, which is to say that the search
engine must be able to parse the metadata accurately. For example, if you enter
a bibliographic citation into a single metadata field, the search engine probably
won’t know how to distinguish the article title from the journal title, or the volume
from the issue number. For the search engine to read those citations
effectively, each part of the citation must have its own field. Making sure metadata
is machine-readable requires patterns and consistency, which will also prepare it
for transformation to other schemas. This is far more important than picking any
single metadata schema.
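As a rough illustration of the kind of script described above, the following Python sketch turns one database record into HTML meta tags, with each part of the citation in its own field. The record, its field names, and the render_meta_tags function are hypothetical. The citation_* tag names follow the convention described in Google Scholar’s inclusion guidelines for webmasters, but check the current guidelines before relying on any particular tag.

```python
from html import escape

# Hypothetical repository record: one field per part of the citation.
record = {
    "title": "An Example Article Title",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "journal_title": "Journal of Examples",
    "volume": "12",
    "issue": "3",
    "publication_date": "2012/05/01",
}

# Assumed mapping from our (hypothetical) database fields to meta tag names.
FIELD_TO_TAG = {
    "title": "citation_title",
    "authors": "citation_author",
    "journal_title": "citation_journal_title",
    "volume": "citation_volume",
    "issue": "citation_issue",
    "publication_date": "citation_publication_date",
}


def render_meta_tags(rec):
    """Return one <meta> tag per citation part, skipping empty fields."""
    lines = []
    for field, tag in FIELD_TO_TAG.items():
        value = rec.get(field)
        if value is None or value == "":
            continue
        # Repeatable fields (such as authors) get one tag per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            content = escape(str(v), quote=True)
            lines.append('<meta name="{}" content="{}">'.format(tag, content))
    return "\n".join(lines)


print(render_meta_tags(record))
```

Because the title, journal, volume, and issue each live in their own field, the script can label them separately. A citation stored as a single string would leave nothing to map.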
We invest a great deal of time and money creating digital collections, and we
usually create web pages that describe the collection’s purpose, what it contains, its
contributors, and so on, to give visitors some context they can use to understand
the collection. We also take great pains in creating metadata that describe each
object in the collection to give it meaning and allow users to reference or discuss
the item. While humans can understand and associate the concepts they read,
search engines have a very limited capacity for interpreting the meaning of the
information we so painstakingly provide.
To help search engines understand the context and meaning of our digital
objects we must provide structure to our content using additional tags in our
HTML. These tags will say to search engines directly, for example, “this information
describes a specific digital object as a scholarly paper, written by an author who
works at an academic institution, published by an organization on a certain
date.” Sounds easy enough, but communicating with a machine requires an
up-front agreement on the specific language and precise vocabulary being used to
communicate. The word “bloody” has very different meanings to a person raised
in the United States and a person raised in the United Kingdom. Search engines
do not understand the regional variations, sarcasm, humor, hand gestures, facial
expressions, body language, tone of voice, inflection, and so on that humans rely
on heavily to communicate meaning.
Enter schema.org. In 2011 Google, Bing, Yandex (the largest Russian search
engine), and Yahoo! “joined forces to create a common set of schemas for structured
data markup on web pages” with the aim of helping search engines to better
understand websites. Originally, schema.org was planned to use only HTML
microdata as the mechanism, or language, for implementing schema.org structured
data vocabularies. But it has also recently added support for RDFa as an
alternative “language” that developers using “RDF-based tools and Linked Data”
can use to implement the schema.org vocabulary.
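To make the idea concrete, here is a minimal sketch of microdata markup for a scholarly paper, generated with Python. The types and properties (ScholarlyArticle, Person, Organization, name, author, affiliation, publisher, datePublished) come from the schema.org vocabulary. The function name, its parameters, and the sample values are invented for illustration, and a real repository page would likely describe more properties than this.

```python
from html import escape

# Template using HTML microdata attributes (itemscope, itemtype, itemprop).
TEMPLATE = """<div itemscope itemtype="https://schema.org/ScholarlyArticle">
  <h1 itemprop="name">{title}</h1>
  <div itemprop="author" itemscope itemtype="https://schema.org/Person">
    <span itemprop="name">{author}</span>
    <span itemprop="affiliation" itemscope itemtype="https://schema.org/Organization">
      <span itemprop="name">{affiliation}</span>
    </span>
  </div>
  <div itemprop="publisher" itemscope itemtype="https://schema.org/Organization">
    <span itemprop="name">{publisher}</span>
  </div>
  <time itemprop="datePublished" datetime="{date}">{date}</time>
</div>"""


def render_scholarly_article(title, author, affiliation, publisher, date):
    """Fill the microdata template, escaping every value for safe HTML output."""
    return TEMPLATE.format(
        title=escape(title, quote=True),
        author=escape(author, quote=True),
        affiliation=escape(affiliation, quote=True),
        publisher=escape(publisher, quote=True),
        date=escape(date, quote=True),
    )


# Sample values are placeholders, not a real publication.
html_snippet = render_scholarly_article(
    title="An Example Paper",
    author="Jane Doe",
    affiliation="Example University",
    publisher="Example Press",
    date="2012-05-01",
)
print(html_snippet)
```

The nesting carries the meaning. The author is a person whose affiliation is an organization, and the publisher is a separate organization, so a search engine does not have to guess which name plays which role.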
We think it’s important for repository managers (and
especially catalogers) to be aware of these developments because they hold
great promise for fulfilling the potential of the semantic web. Sites that already
offer microdata provide a great benefit to Google’s users through its “rich snippets,”
which display additional details about web pages in the search results.
Another example of Google’s use of microdata appears in its “recipe search,” where
metadata about recipes provide a faceted navigational search. If Google
can do this for recipes, imagine what it could do for library digital repositories that
already have rich metadata describing the objects. The bridge that will make that
rich metadata understandable to search engines is the set of techniques recommended
by schema.org, and putting those techniques into place in digital repositories is the
responsibility of librarians and archivists.