Monday, December 12, 2011

How SharePoint helped The Times Online put its entire 200-year digital archive on the web

World’s Most Famous Newspaper Uses FAST Search Technology to Present the Entire Archive Online.


Having seen the success of the Times Digital Archive product, used primarily in education and libraries, the Times Online team led by Anne Spackman, Editor-in-Chief for Times Online, embarked on discussions about the feasibility and commercial potential of presenting the entire Times archive online. With enthusiastic backing from senior management, the team needed to find an Enterprise Search solution with the right scalability, proven stability, and technical sophistication to deliver its vision. The team chose to implement the FAST Enterprise Search platform.


Times Online
Times Online Archive: history as it happened, online.

The Times Online team had experienced FAST’s ability to deliver both on its ground-breaking Travel Channel and for its award-winning news site. The FAST Enterprise Search platform was therefore the natural choice for the archive project. The media expertise within the FAST consultancy team had also proven valuable, offering industry best-practice suggestions on indexing techniques and search strategies that would be critical to the efficient delivery of the archive project.

After a soft launch in early 2008, the Times Online Archive was launched on 14 June 2008. The majority of the FAST implementation took six months to complete, but the project as a whole, including the construction of a new site, took nearly a year to realize. The archive project has achieved its initial objective of presenting online news from 1785 to 1985: 20 million articles, pictures, and advertisements, from the groundbreaking coverage of the Crimean War by William Howard Russell to letters to the Editor from figures such as Karl Marx and Benito Mussolini, all reproduced exactly as they appeared in the original newspaper edition.

For an introductory period the use of the archive is free, but users must register their personal details, which are valuable to the commercial team analyzing the demographics of the user base.

Behind the story: Digital scanning brings 200 years into view for Search
Optical Character Recognition (OCR) technology was used to scan in every page of the newspaper from 1785 to 1985. The archive team wanted to present images of the actual pages and not just plain text; however, this method of digitizing content has certain limitations. FAST engaged with the Times Online team to strategize ways to minimize the negative impact on search quality.

The OCR process converts each article into an XML format that can be automatically fed through the FAST file traverser into the FAST document processing pipeline. At this point, each word is checked against predefined dictionaries, and any word that matches an entry is tagged.

“The flexibility of the FAST product enabled us to complete the indexation process efficiently and respond to issues such as proper nouns spelt three different ways within an article, due to the OCR conversion,” says Drew Broomhall, Search Editor at Times Online.
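The dictionary check described above can be illustrated with a minimal sketch. It is not the FAST pipeline, whose configuration the article does not show. It assumes a small hypothetical list of proper nouns and uses approximate string matching from Python's standard library to map OCR misreadings back to a single canonical spelling.

```python
from difflib import get_close_matches

# Hypothetical dictionary of known proper nouns; the real dictionaries
# used in the project are not described in the article.
PROPER_NOUNS = ["Crimea", "Sevastopol", "Balaclava"]


def tag_tokens(tokens, dictionary=PROPER_NOUNS, cutoff=0.8):
    """Pair each token with its closest dictionary entry, or None.

    The cutoff is an assumed similarity threshold: higher values accept
    fewer OCR variants, lower values risk false matches.
    """
    # Compare in lower case but return the dictionary's own spelling.
    canonical = {entry.lower(): entry for entry in dictionary}
    tagged = []
    for token in tokens:
        matches = get_close_matches(
            token.lower(), list(canonical), n=1, cutoff=cutoff
        )
        tagged.append((token, canonical[matches[0]] if matches else None))
    return tagged
```

With this sketch, the three spellings "Sevastopol", "Sevastopo1" and "Scvastopol" would all be tagged with the same dictionary entry, so a search for the correct spelling could still reach an article in which the OCR misread the name.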

The OCR process also gives each word an ‘X’ and ‘Y’ coordinate as to where it appears on the page. When a user searches for a term, the system will return all articles that contain that term and will highlight its position within the article.
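As a rough illustration of how per-word coordinates make highlighting possible, the sketch below stores each OCR word with its position and returns the positions of every occurrence of a search term. The `OcrWord` type and the `highlight_positions` function are hypothetical names chosen for this example; the article does not describe the actual data format.

```python
from dataclasses import dataclass

# Punctuation that OCR output may leave attached to a word (an assumption).
PUNCTUATION = ".,;:!?\"'()"


@dataclass(frozen=True)
class OcrWord:
    """One recognised word and where it sits on the scanned page."""
    text: str
    x: int
    y: int


def highlight_positions(words, term):
    """Return the (x, y) position of every word that matches the term.

    Matching ignores case and surrounding punctuation, so "Crimea,"
    still matches a search for "crimea".
    """
    needle = term.casefold()
    return [
        (word.x, word.y)
        for word in words
        if word.text.strip(PUNCTUATION).casefold() == needle
    ]
```

A page viewer could then draw a highlight box over the page image at each returned position, which lets the archive show the original page rather than plain text while still marking the hits.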

Relevance models determine the order of search results. When considering this part of the project, the Archive team found that it needed a different ranking strategy from the one that would apply to contemporary newspapers.

“We found that in many archive articles, the nature of the content was not even mentioned in the headline and so used the FAST ranking module to build a fairly flat ranking profile where the headline and body of the text are roughly equivalent. We plan to evolve this over time using search intelligence to refine relevance,” says Drew Broomhall.
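The effect of a flat ranking profile can be shown with a toy scoring function. This is only a sketch under simple assumptions (term counts multiplied by a per-field weight) and does not reproduce the FAST ranking module; the profile names and weights are invented for the example.

```python
# Two illustrative weightings; the actual values used on the project
# are not given in the article.
FLAT_PROFILE = {"headline": 1.0, "body": 1.0}
HEADLINE_HEAVY_PROFILE = {"headline": 5.0, "body": 1.0}


def score(doc, query_terms, weights):
    """Sum weighted counts of the query terms in each field of the document."""
    total = 0.0
    for field, weight in weights.items():
        tokens = doc.get(field, "").lower().split()
        hits = sum(tokens.count(term.lower()) for term in query_terms)
        total += weight * hits
    return total


def rank(docs, query_terms, weights):
    """Return the documents ordered from highest to lowest score."""
    return sorted(
        docs, key=lambda doc: score(doc, query_terms, weights), reverse=True
    )
```

Under the headline-heavy profile, an article whose headline never mentions the subject loses to a short notice that happens to name it in the headline, even if the article's body discusses it at length. With the flat profile the body counts for as much as the headline, so the longer article ranks first, which is the behavior the quote describes for archive content.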

Benefits: A robust, scalable search platform
FAST has not only provided a search platform that indexes and searches The Times archive of 20 million articles, but also consistently delivers a high-quality query response within a two-second performance target.


To read the rest of the article, please click here: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000004317
