CENDI PRINCIPALS AND ALTERNATES MEETING

Department of Energy/Office of Scientific and Technical Information

Germantown, MD

June 10, 2004



MINUTES

Architecture Components
Providing Better Access to DOD S&T Data with an Information Delivery Architecture Using Content-Centric Software/Tools
Introduction and Overview of Linking and OpenURL
Beyond ROI: Seeking Reverence of Knowledge

Welcome

Walter Warnick, Chair of CENDI, opened the meeting at 9:15 am and welcomed everyone to DOE Germantown. A special award was given to Kent Smith for his years of service as chair of CENDI. George Strawn and Pat Bryant were welcomed as new members from the National Science Foundation.

ARCHITECTURE COMPONENTS

“Providing Better Access to DOD S&T Data with an Information Delivery Architecture
Using Content-Centric Software/Tools”

Kurt Molholm, DTIC Administrator, and Ricardo Thoroughgood, Chief, STINET Management Division, DTIC

DTIC’s current architecture follows the “old industry paradigm” of fragmented functions supported by separate software environments. For example, depending on the system, the database back end is supported by Oracle, DB2, or SQL Server. Middleware may include Cold Fusion or WebObjects. The front end is either a search engine or a series of static pages. The old architecture, which assumes either information containers or data elements, doesn’t work with individual digital components or objects in a document.

In the new architecture, DTIC is looking for an end-to-end solution that incorporates the following functions: Identity and Access Management to enforce access policies across servers with a single sign-on; Performance Management across all the processes, methodologies, metrics and systems in the enterprise; Categorization and Dynamic Classification that is fast, accurate and secure across structured and unstructured data (documents and other media) and that is XML-based.

DTIC’s current information products include databases of about 2 million technical reports of approximately 110 pages each. DTIC is closing in on having 250,000 full-text documents online. These documents are initially converted into TIFF and then into PDF documents. Also, most of the documents are archived in microform, since DTIC is still unsure about the long-term preservation quality of digital files. There are approximately 1.8 million citations online, with 30,000-35,000 documents added each year.

Ultimately, the Total Electronic Migration System (TEMS) being implemented for the Information Analysis Centers (IACs) will serve as the prototype for the DTIC model, which will make everything online, digital and searchable. This includes information (some of it proprietary) from the IACs as well as classified materials, which will be added to a classified STINET by the end of the fiscal year.

With these goals in mind, DTIC has been investigating new architectures, including third generation search engines. The objective is to convert DTIC collections to enable metadata and full-document search and retrieval, replacing the current TIFF images and unsearchable PDFs. The solution will also aid the staff in processing documents (updating or replacing the current EDMS and eliminating multiple input systems). Tools are needed for monitoring services 24/7 and for reporting statistics and metrics for applications. The system must support multiple taxonomies, categories, navigation, harvesting and the conversion of TIFF text images to XML.

DTIC reviewed a variety of products in each of these categories, including some products it is already using. Most of the products worked well for their “designed purposes”, and most worked with a variety of back-end and relational database management systems. The trend is toward integrating search with other products; most products incorporate some third generation search features.

However, only one product, Mark Logic Content Interaction Server (previously Cerisent), does essentially everything in an integrated way, using XML open standards. Mark Logic, partnering with Documentum, is capable of serving as a document management system, including the database back end, middleware and a search engine. It can also be used as the storage media formatting facility and it has excellent support for data conversion.

In addition, Mercury Interactive Service was selected as the application performance tool. It was considered to be a better match for DTIC than Keynote, even though it is more expensive, because it monitors performance of applications, not just web sites.

For the time being, DTIC has determined that third generation search features are not a good option. DTIC will focus first on the conversion of images to PDF and XML. Also, DTIC is waiting to see what happens to federal-wide PKI implementations before making a decision about identity and access software.

In the short term, DTIC will enhance its management information by purchasing Mercury Interactive Service and HeartBeat Monitoring software to monitor availability and performance. These will complement Keynote, which measures web site performance from the end-user perspective, and Web Trends, which is used for detailed web analytics. Other immediate actions include the purchase of a color scanner and improved facilities for CD and DVD production. DTIC also plans to expand its Multisearch federated search option, which is based on Explorit.

In the longer term, DTIC will conduct a pilot of the Cerisent/Documentum system, migrate from its mainframes by 2005, evaluate Old Dominion University’s research project to create XML from TIFF, and examine the future of microfiche.

Mr. Thoroughgood gave a brief demonstration of the Mark Logic system.

Discussion

The CENDI members expressed a great interest in DTIC’s requirements, evaluation and proposed pilot system. Mr. Thoroughgood agreed to document the findings of DTIC’s investigation and share the document with CENDI.

Action Item: Mr. Thoroughgood agreed to document DTIC’s investigation of software to support a content-centric architecture and to share the document with CENDI.

“Introduction and Overview of Linking and OpenURL”
Oliver Pesch, Chief Strategist for E-Resources, EBSCO Information Services

One of the foremost objectives of linking is to connect users to content. Item-level linking connects a reference or citation to the full text or to another service relevant to the item. Typically, the linking is between online information systems from different vendors; institutions have many resources from many vendors (direct from publishers, through aggregators, and through A&I services). In some cases, these resources overlap or provide multiple, different ways to access services and content. The problem can be addressed by establishing predefined links, but these may be inconsistent and are increasingly complex for the institution to manage. A link resolver helps to manage the links, providing the appropriate copy and giving the library control over its collections and the links between them. A link resolver can also eliminate the cost of paying for interlibrary loan or for a “pay as you go” copy when the library already has a subscription to the item.

The OpenURL is an accepted “standard” syntax for creating a link between an information source and a link resolver. It predefines a set of metadata elements to be used in describing an “item” based on its genre type. Metadata fields include author, title, journal name, ISSN, date, volume, issue and page. A genre field may be included in order to make the interpretation of the metadata more precise; for example, the metadata for a journal article differs from that for a book chapter.
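
To make the syntax concrete, here is a minimal sketch (in Python, since the presentation includes no code) of how a source might assemble such a link. The resolver address, source identifier, and citation values are hypothetical; the key names follow the commonly used OpenURL key/value convention for the fields listed above.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; real values are configured per institution
# and per OpenURL-enabled source.
RESOLVER_BASE = "http://resolver.example.edu/openurl"

citation = {
    "sid": "ExampleSource:db1",   # identifies the source that generated the link
    "genre": "article",           # tells the resolver how to interpret the metadata
    "aulast": "Smith",
    "atitle": "An Example Article",
    "title": "Journal of Examples",
    "issn": "1234-5678",
    "date": "2004",
    "volume": "12",
    "issue": "3",
    "spage": "45",
}

# The OpenURL is simply the resolver address plus the citation metadata,
# transported as an ordinary HTTP query string.
openurl = RESOLVER_BASE + "?" + urlencode(citation)
print(openurl)
```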

The OpenURL is also a mechanism for transporting this information between sources, using the HTTP protocol for transmission. The OpenURL is not a search mechanism; linking takes place after a search has been completed on the source system, so the OpenURL is not part of the search process.

As a result of a search, an OpenURL-enabled resource, such as an abstracting and indexing or a table of contents service, sends the appropriate metadata, along with an indication of where the citation is coming from, to the institution’s resolver when the user clicks the link. The resolver then decides which target resource should be used to retrieve the full text content or service, based on the library’s license agreements and information about the user. Each linking server has a list of potential target links set up for the specific institution’s environment, based on subscriptions, licensing agreements, etc. It also has information about who can use which services and how to determine the services a particular user can access. If the user is allowed to see the full text or have access to the service, the link will appear on the link server’s page (sometimes called a link menu) as an active, live link.
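
As a rough illustration of the resolution step, the sketch below checks an incoming citation against a small, hypothetical knowledge base of targets, subscription coverage, and user groups, and returns the entries a given user would see on the link menu. The target names, coverage dates, and group labels are invented for the example; a commercial resolver’s knowledge base and rules are far more elaborate.

```python
from urllib.parse import urlencode

# Hypothetical knowledge base: each target has a URL template, the ISSNs and
# coverage years the library is licensed for, and the user groups allowed in.
KNOWLEDGE_BASE = [
    {
        "target": "Full Text Collection A",
        "template": "http://fulltext-a.example.com/article?",
        "issns": {"1234-5678": (1995, 2004)},
        "allowed_groups": {"faculty", "student"},
    },
    {
        "target": "Publisher Package B",
        "template": "http://publisher-b.example.com/lookup?",
        "issns": {"1234-5678": (2000, 2004)},
        "allowed_groups": {"faculty"},
    },
]

def resolve(citation, user_group):
    """Return the link-menu entries this user is entitled to see."""
    links = []
    issn = citation.get("issn")
    year = int(citation.get("date", 0))
    for entry in KNOWLEDGE_BASE:
        coverage = entry["issns"].get(issn)
        if coverage is None:
            continue                                  # title not held at this target
        start, end = coverage
        if not (start <= year <= end):
            continue                                  # outside the licensed date range
        if user_group not in entry["allowed_groups"]:
            continue                                  # user not entitled to this target
        links.append({"target": entry["target"],
                      "url": entry["template"] + urlencode(citation)})
    return links

# Example: a faculty member following a link to a 2004 article.
menu = resolve({"genre": "article", "issn": "1234-5678", "date": "2004",
                "volume": "12", "spage": "45"}, "faculty")
for item in menu:
    print(item["target"], "->", item["url"])
```

The design point is the one made above: the resolver does not search for full text, it consults what the library is known to hold and what the user is entitled to see.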

A common question is how the OpenURL and the digital object identifier (DOI) are related. A DOI can be transported in an OpenURL, resulting in a much more direct and persistent link. Additional metadata can also be sent with the persistent identifier in the OpenURL structure.
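
A minimal, hypothetical illustration: the same key/value mechanism can carry a persistent identifier (shown here as an illustrative DOI) alongside a few metadata fields.

```python
from urllib.parse import urlencode

# Hypothetical example: a DOI carried inside an OpenURL alongside minimal metadata.
resolver = "http://resolver.example.edu/openurl"
fields = {
    "sid": "ExampleSource:db1",
    "genre": "article",
    "id": "doi:10.1234/example.2004.001",   # illustrative persistent identifier
    "atitle": "An Example Article",
    "date": "2004",
}
print(resolver + "?" + urlencode(fields))
```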

While some institutions have developed their own link resolvers, companies such as EBSCO have developed commercial linking servers to make link management easier. Linking servers do not search the Internet for full text, nor do they know where all the full text is or what rights the user has to that full text. Instead, the linking server has a knowledge base, which includes a list of potential target links, URL templates used to create a link, and a set of rules to decide whether a link should be shown. A server also tries to enhance the supplied data if it can. The linking server’s knowledge base is tailored for each library. When implemented, it reflects the library’s holdings, the rules for determining what a user can access and how to identify users, and information about the titles and subscription dates for that library’s licenses. The knowledge base is controlled by the library.

The key component of linking is a good knowledge base. A good knowledge base has an extensive list of targets, accurate and up-to-date title lists for databases, and the ability to manage library-specific holdings such as e-journal, print journal, and publisher package subscriptions. LinkSource, from EBSCO, is one such linking server. EBSCO also offers another article linking service called SmartLinks. SmartLinks, which contains a directory of over 16 million article links (including all EBSCOhost full text articles, EBSCO e-journals, and articles available through CrossRef), uses proprietary technologies to match citation metadata to a link and then verify whether the user has rights to see the link. This is just one part of the services needed for a link resolver, but with SmartLinks, the article links are validated and rights-checked. Most major link resolvers take advantage of SmartLinks to provide EBSCO customers with better article linking.

The term “OpenURL-enabled”, often used when describing an information source, has two meanings. It can mean that the online service is capable of creating an OpenURL that can be submitted to resolvers. It can also mean that the database has full text and can be linked to using the OpenURL syntax. The first case is an OpenURL-enabled source and the second is an OpenURL-enabled target. Link resolvers can link to targets that are not OpenURL-enabled as long as some form of link can be calculated; this normally requires a collaborative effort between the developer of the link resolver and the target online service.

There are continuing challenges such as adoption of the standard, data quality issues including work on standards for storing and presenting bibliographic data, and knowledge base maintenance and expansion. Since OpenURL is about packaging and shipping information, the incorporation of identifiers and authentication into the process is an important issue. There are two authentication projects underway. Athens is a European project to create and manage a central list of resources and users who can use them. Shibboleth is a similar project in the US in which a few of EBSCO’s clients are involved.

“Beyond ROI: Seeking Reverence of Knowledge”
Dr. Walter Warnick, Director, DOE Office of Scientific and Technical Information

For years, scientific and technical information programs have struggled with the concept of “Return on Investment”. Dr. Warnick suggested that the focus instead must be on “Reverence for Knowledge” (ROK). OSTI’s mission is to advance science and to sustain technological creativity by making R&D findings available and useful to DOE researchers and the American people. Text has been OSTI’s traditional focus, which it has addressed by building deep Web databases that include classified, unclassified and international input. The Information Bridge has 84,000 full text reports, all full-text searchable. OSTI also has bibliographic databases such as the Energy Citations Database. There are numerous restricted access databases in addition to those that are open access or public domain. In all, there are about 2 million records dating back to the Manhattan Project.

OSTI not only builds deep Web databases, but also enhances them by providing search tools. Over the last four years, deep Web database searching has been enhanced by exposing these resources to surface Web engines such as Yahoo and Google and by developing cross-database search tools. Science.gov is one of the recent manifestations of this cross-database search capability, achieved in good part by working through CENDI and the Science.gov Alliance.

Other deep Web enhancements include harvesting, in which the index of reports is stored at DOE while the original full text documents are retained at the laboratories. Reference resolution has been implemented on Information Bridge: working with CrossRef, OSTI implemented a reference resolver that allows authors to request that references be turned into active links. Alert services will be added to various products so that users can receive periodic e-mail updates on new and relevant information related to their queries.
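
As a minimal sketch of what reference resolution involves, the code below turns parsed references into active links when an identifier can be found. The lookup function is a hypothetical stand-in for a citation-matching service; the actual OSTI/CrossRef workflow is not detailed in these minutes.

```python
# Minimal sketch of reference resolution under the assumptions stated above.

def lookup_doi(reference):
    """Hypothetical registry lookup; returns a DOI string or None."""
    fake_registry = {
        ("Journal of Examples", "12", "45"): "10.1234/example.2004.001",
    }
    key = (reference.get("journal"), reference.get("volume"), reference.get("spage"))
    return fake_registry.get(key)

def link_references(references):
    """Return each reference as HTML, hyperlinked when an identifier was found."""
    resolved = []
    for ref in references:
        doi = lookup_doi(ref)
        if doi:
            resolved.append('<a href="https://doi.org/%s">%s</a>' % (doi, ref["text"]))
        else:
            resolved.append(ref["text"])    # leave unresolved references as plain text
    return resolved
```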

OSTI is also adding new content and creating a network around DOE to manage data in the same way that text has been managed. A new DOE Patents Database, modeled on the technical reports collection, is being created by federating the patent databases from the laboratories into a DOE-wide resource. Federated searching will be used to search across these distributed repositories. A new limited distribution database for internal customers is also under development.

These initiatives have both supported and taken advantage of the developments for Science.gov. Science.gov 1.0 was launched in December 2002 and introduced the concept of meta-search to interagency databases. Science.gov 2.0, launched in May 2004, introduced relevance ranking to government information through QuickRank, a simple algorithm that will soon be enhanced. Science.gov 3.0 will improve on this and other aspects of Science.gov 2.0; launch of these features is planned in phases during 2005.

A design for Science.gov 4.0 is already underway. It will involve a special architecture called Grid Technology, in which the DOE research program is very interested. At DOE, Grid Technology was conceived to enable computers distributed across the country to communicate simultaneously and collaboratively to perform complex numeric computations. OSTI has worked with the DOE sponsors of Grid Technology research and has piqued their interest in extending the technology to text. The research program has agreed to fund development at $750K over the next two years.

For Science.gov 4.0, the Grid architecture will work in the following way. The user will send a query to the server located at OSTI, which will send the query simultaneously to each agency. Rather than searching the databases directly, the server will pulse the agency Grid nodes, which are co-located with the agency databases. Each agency will have at least one node, consisting of software co-located with and designed to work with that agency’s database(s). Co-location is used because large quantities of data will move between the Grid node and the databases, and co-location makes this more efficient.

In the new architecture, the OSTI-generated query goes to the node, which pulses the database. The database responds by sending titles and snippets. The hits are sent to the agency node, which applies QuickRank to quickly home in on a manageable number of the most relevant hits. Once QuickRank is applied, the node pulses back to the database and retrieves the full text of the most relevant hits. When the full text is retrieved, a second, more sophisticated relevance ranking algorithm is applied. This DeepRank algorithm, which is being developed as part of this project, searches the full text of the report for words and groupings of words that are part of the user’s query. The result is the identity of the report plus a relevance ranking score based on the full text. This result is sent to the server at OSTI, where the returns from all the agencies are interleaved as if all the ranked documents were coming from the same database. The list of interagency reports in relevance-ranked order is returned to the user, providing a much more precise result set.
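
The sketch below illustrates this two-stage flow in miniature: a hypothetical agency node ranks snippet-level hits with a cheap score, retrieves full text only for the top candidates, rescores them, and the central server interleaves the agency result lists. The scoring functions and node interfaces are simplified stand-ins, not the actual QuickRank or DeepRank algorithms, which are not described in detail in these minutes.

```python
import heapq

# Simplified stand-ins for the two ranking stages described above; the scores
# below just count query terms, to illustrate the shape of the flow.

def quick_rank(query_terms, hit):
    """Cheap score using only the title and snippet returned by the database."""
    text = (hit["title"] + " " + hit["snippet"]).lower()
    return sum(text.count(term.lower()) for term in query_terms)

def deep_rank(query_terms, full_text):
    """More expensive score computed over the retrieved full text."""
    text = full_text.lower()
    return sum(text.count(term.lower()) for term in query_terms)

def agency_node(query_terms, search_db, fetch_full_text, keep=10):
    """Runs at an agency, co-located with its database (hypothetical interface)."""
    hits = search_db(query_terms)                       # titles and snippets only
    top = heapq.nlargest(keep, hits, key=lambda h: quick_rank(query_terms, h))
    ranked = []
    for hit in top:                                     # pulse back for full text
        score = deep_rank(query_terms, fetch_full_text(hit["id"]))
        ranked.append((score, hit["id"], hit["title"]))
    return ranked

def central_server(query_terms, nodes):
    """Runs at OSTI: fan the query out to every node, then interleave results."""
    merged = []
    for node in nodes:                                  # each node is a callable
        merged.extend(node(query_terms))
    merged.sort(key=lambda item: item[0], reverse=True) # one relevance-ranked list
    return merged
```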

Dr. Warnick then went on to discuss a series of propositions about science and knowledge sharing. Science advances when knowledge is shared; knowledge sharing is a diffusion process, and this diffusion is what advances science. Therefore, it is often hard to find a direct and clear connection between the sharing of a specific piece of information and a notable advance in science. The advance is often a cumulative effect, with the researcher integrating countless pieces of acquired knowledge into his or her own work.

This leads to the question of how we value science knowledge. It is hard to determine what investments, particularly in basic science such as high energy physics, are worth to each of us. Congress makes funding decisions based on a number of criteria, but in the end they are subjective.

However, Dr. Warnick’s third proposition is that the most important things in life do not lend themselves to quantification and ROI. The point is that the information management community needs to get beyond ROI because it is a dead end. Instead, we need to pursue ROK, Reverence for Knowledge. While there may be reverence for knowledge as it exists in grand institutions such as the Reading Room of the British Library or the Library of Congress, we need to establish reverence for the sharing of science information. Sharing the science is what the CENDI agencies are all about. Dr. Warnick proposed that this be a point of discussion at the planning meeting; the question of what comes next would be an excellent place for the CENDI membership to start.
