
Microsoft's Live Search Books

Submitted by Tom Peters on December 12, 2006 - 12:14pm

After playing around for an hour or so with the recently released public beta version of Microsoft's Live Search Books (LSB), I have to admit—against some vague sense that my better judgment is failing me—that I like it.

Sure, others have reported that LSB does not work well—or at all—when using browser software other than Internet Explorer, but if you stick to the straight-and-narrow Microsoft path, the service works and shows potential.

[Image: Spontaneous combustion of Krook in Bleak House]

On December 6th, when the beta version was released to the public, I conducted a couple of sample searches on "phrenology" and "spontaneous combustion," two of my favorite hot topics from the 19th century. Spontaneous combustion sometimes is qualified as spontaneous human combustion, to differentiate the phenomenon, one imagines, from the spontaneous combustion of grain dust in an elevator, or burning bushes, or some random rodent explosion.

The phrenology search returned an impressive 518 books, with Sylvester Graham's Lectures on Chastity showing some promise. The preface to that fine work was written by James Coates, Ph.D., who is described as a "medical magnetist."

[Image: A phrenological examination]

The search for books about spontaneous combustion returned 660 items, with Eliphalet Nott's 1857 Lectures on Temperance floating (or blasting) near the top of the heap. Evidently, the spontaneous combustion of habitual drunkards was a particularly vexing spectacle.

The relevance algorithm used by LSB seems to work in two stages. When a set of books is returned as the result of a search, the most relevant half dozen titles are displayed. Of course, the methods of determining relevance are shrouded in mystery. When I couldn't find a way to advance to the next set of six titles, I scrolled down to the bottom of the page, only to notice that the number of returned results kept increasing as I scrolled downward. Then, once you select a title, you get a batch of relevancy-ranked snippets. If you choose to go to one of the pages containing the returned snippets, the search term(s) you used are highlighted in the text.

Speaking of highlighting, several of the old books I examined contained lots of underlining, highlighting, and marginalia. Initially, this put me off, reminding me that these scanned books are not pristine, but taken right off the shelves of research libraries, with bar codes, property stamps, and doodling there for the entire world to see now. Then I became fascinated by the marginalia. Although much of it is banal, this huge mass of petrified marginalia, now scattered by the digital winds to the four corners of the globe, could be a boon for the study of marginalia.

Although my overall initial impression was positive, some aspects of this service confused or rankled me. The name of the service, for example, is confusing. The phrase "Live Search Books" is neither natural nor mellifluous. Who were the marketing geniuses who came up with that name? Why qualify a search as being "live"? What are the alternatives? A dead search? A batch search that runs at night?

And why can't Microsoft list the total number of scanned books in the collection? All the company can say in the Q&A is there are "tens of thousands" of English-language books in the collection. There seems to be a trait of the commercial mind that abhors the simple statement of facts, as if doing so would be, as a general rule learned in school, bad for business.

The full text of every book retrieved is available for viewing online and downloading. When I downloaded and saved a book, the only file format option was PDF. I could not figure out how to read these PDF e-books provided by Microsoft in Microsoft Reader, but they displayed just fine in Adobe Reader.

According to an article that appeared last week in CNN online, within six months Microsoft plans to integrate Live Search Books' results into results from other content categories, such as Web pages. This is a similar strategy to what Google seems to be doing with SearchMash. During this beta phase, only books no longer protected by U.S. copyright laws will be available for search and retrieval. The books have been scanned from the print collections of the University of California, the University of Toronto, and the British Library. According to the CNN article, additional scanned books from the New York Public Library, Cornell University, and the American Museum of Veterinary Medicine will be added soon. Once that last collection comes online, you can bet I'll be searching for "spontaneous rodent combustion."

Technorati tags: book search, digital books, digitization, E-Books, ebooks, Microsoft Live Search Books, search engines, SearchMash


Comments (8)

It's a good move from MS.

Public-domain works are a limitation, but they are a first step. Google has a wide range of books and I believe this project is one that is worth doing -- the library of Alexandria for a modern age. So many times I have used Google books to find important information on health practices, eating, spiritual knowledge, etc. Opening this sort of fast searching to the world really enables anyone to follow their own path of study and to see what else is out there. The search engines will eventually move to a more sophisticated understanding of text, but the keyword and synonym approach is highly effective for many types of searching, especially when you are using focused keywords, as you often are in academic research.

Hi, Phil. Your comment got me thinking and searching about the 'shroud of relevancy rankings' again. Here's a sentence I found on the FAQ pages for Live Search Books: 'Books and pages are listed by relevancy, which is determined by how often your search words appear in a book or on a page, where those words appear in the book, and occurrences of metadata specific to the book.' Although this statement is not completely mysterious, I would argue that it also does not represent anything approaching full disclosure of the relevancy ranking algorithm used. In contrast, if I conducted a full-text search looking for the three characters 'abc' within three words of the three characters 'xyz', where a word is defined as any string of one or more characters surrounded by spaces, there is very little mystery in that--perhaps a concise definition of what constitutes a character should be included. Does the LSB relevancy ranking algorithm weigh word usage that appears in chapter titles or first sentences of chapters more than other occurrences? The phrase 'where those words appear in the book' suggests that something like this is happening behind the scenes. Does it stem the search term and then right truncate? I don't know, and I cannot find a detailed explanation. I would go further and suggest that the mysteriousness (in this sense) of the relevancy ranking algorithms used by commercial information storage and retrieval services is a business asset. They seem to be trade secrets. Users of such services have to assume that the relevancy ranking algorithm is well designed, works flawlessly, and roughly approximates their own sense of what constitutes relevancy. And, as you note, computers are wonderful at counting, but the complexities and vagaries of the actual use of human language to communicate often don't mesh well with literal counting.
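To make concrete how little mystery there is in the proximity search described in this comment, here is a minimal Python sketch. The function name `within_n_words` is hypothetical, and two readings are assumptions rather than anything stated in the comment: a word "contains" the characters if they appear anywhere inside it, and "within three words" means the two word positions differ by at most three. Nothing here describes how Live Search Books actually works.

```python
def within_n_words(text: str, first: str, second: str, n: int = 3) -> bool:
    """Return True if some word containing `first` is at most `n` word
    positions away from some word containing `second`.

    A word is any run of one or more characters surrounded by spaces,
    following the definition given in the comment.
    """
    # Split on single spaces and drop the empty strings left by repeated spaces.
    words = [w for w in text.split(" ") if w]

    # Positions of the words that contain each character string.
    first_positions = [i for i, w in enumerate(words) if first in w]
    second_positions = [i for i, w in enumerate(words) if second in w]

    # Assumption: distance is measured in word positions, in either direction.
    return any(abs(i - j) <= n
               for i in first_positions
               for j in second_positions)


# 'abc' is word 1 and 'xyz' is word 4, so they are three positions apart.
print(within_n_words("the abc quick brown xyz fox", "abc", "xyz"))  # True
# Here they are four positions apart, so the search does not match.
print(within_n_words("abc one two three xyz", "abc", "xyz"))        # False
```

Every decision the search makes is visible in a dozen lines, which is the contrast with an undisclosed relevancy ranking.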

Tom Peters' sample searches highlight the typical strength of book digitization projects: keyword searching of historical subjects and material. But they also expose its serious limitations. Results are only from public domain works -- which is why his search of 19th-century topics returns reasonable results. However, his results will be missing any recent material written on his topics which is still under copyright. Peters also notes that relevancy ranking is 'shrouded in mystery'. Really it's not. The primary way for search engines to rank results is by the number of times the search term is found in the document. (Counting is one area in which computers excel.) This algorithm, however, fails the best writers -- writers who use pronouns instead of endlessly repeating the same keywords, or who use a variety of terms with shades of nuance to their meanings. It also gives no way of qualifying the relationship of multiple search terms -- the hierarchy of thought between terms. There is a brilliant solution to these problems, however! Visit your local library, and use its catalog and other tools that have been created with a CONTROLLED VOCABULARY. Through them you can find a wealth of material on exactly the topic you are researching. If you don't find enough, the reference librarians are truly anxious to help you find more. Digitized book search tools have some valuable uses. Just don't get discouraged by all their drawbacks. Use the right tools for the job -- many of which have been around for a long time. And remember, your librarians are ready to help meet all your information needs!
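The counting approach this comment describes can be sketched in a few lines of Python. This is only an illustration of ranking by raw term counts, not the algorithm Live Search Books or any other search engine is known to use; the function names `term_count_score` and `rank_by_term_count` and the two sample texts are hypothetical.

```python
from collections import Counter


def term_count_score(text: str, query_terms: list[str]) -> int:
    """Count how many times the query terms occur in the text."""
    # Lowercase each word and strip surrounding punctuation before counting.
    words = Counter(w.strip('.,;:!?"\'()').lower() for w in text.split())
    return sum(words[term.lower()] for term in query_terms)


def rank_by_term_count(documents: dict[str, str],
                       query_terms: list[str]) -> list[str]:
    """Return document names ordered from highest to lowest term count."""
    return sorted(documents,
                  key=lambda name: term_count_score(documents[name], query_terms),
                  reverse=True)


# Two hypothetical texts that say the same thing; one repeats the keyword,
# the other switches to pronouns after the first mention.
documents = {
    "repetitive": "Phrenology maps the skull. Phrenology reads bumps. "
                  "Phrenology was popular.",
    "fluent": "Phrenology maps the skull. It reads bumps, and it was popular.",
}

print(rank_by_term_count(documents, ["phrenology"]))
# ['repetitive', 'fluent']
```

The text that repeats the keyword scores 3 and the one that uses pronouns scores 1, which is the weakness the comment points out.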

We've posted an extended look at the project and other book digitization projects on ResourceShelf. It might be of interest. http://www.resourceshelf.com/2006/12/06/microsoft-book-search-goes-live-...

Hi, Walt. I also dream that search engines and mass digitization projects would be more open with facts and figures about their collections, costs, digitization processes, search algorithms, etc. Not all IT companies hoard and hide these facts. Second Life (www.secondlife.com) provides what appear to be accurate figures about activity in their virtual world, but, oddly, I can't find similar figures for Teen Second Life (teen.secondlife.com). Maybe it's a safety issue for the teen virtual world. In the realm of fast food, McDonald's used to proudly announce on their signs how many millions (then billions) of hamburgers they had served (and we can only assume that most of those were actually eaten), but then they gave up and resorted to the phrase 'billions and billions served.' Emily, I agree that the name sprang from a larger suite of services offered by Microsoft, but the phrase 'Live Search Books,' while fitting into Microsoft's overall branding and marketing strategy, is still clunky when it stands alone.

I suspect it has the unwieldy name Live Search Books because Microsoft's new search engine is known as Live Search. It's something akin to Google search with a customizable home page, etc. http://www.live.com/ They have a whole suite of 'Windows Live' services, like search, blogging, email, and instant messenger. http://get.live.com/

LSB works just fine on the dominant alternative browser, Firefox, and I believe the only problems reported turned out to be setup problems. Good commentary. As for the number of books: It would be great to know--but that's also true for GBS. And, of course, for any open web search engine, it would be great to be able to fully explore results. I can dream...