CENDI PRINCIPALS AND ALTERNATES MEETING
National Agricultural Library
Beltsville, Maryland
April 3, 2002
WELCOME
Kent Smith opened the meeting at 9:15 am. Mr. McCone welcomed the participants to NAL.
CENDI and several of its working groups have been following XML (Extensible Markup Language) for the last several years. XML was included on the list of technologies of interest to CENDI at the August planning meeting. Last year, CENDI members had a basic tutorial and discussed outside perspectives on XML. This meeting focused on what CENDI agencies are currently doing and what is going on within the federal government. Three agency initiatives were highlighted.
National Library of Medicine
Simon Liu
The term "XML" is shorthand for many things. XML is actually a family of 10 core standards plus domain-specific, XML-based standards.
XML is the foundation specification that defines the character set and the rules for constructing XML element names, attributes and structures. This basic organizational structure is supplemented by XLL, which identifies who will do what and provides some level of management. XLL includes two standards, XML Linking Language (XLink) and XPointer. XLink provides links and link management among the content components. The XML Pointer Language (XPointer) is used to reference content components. XML Namespaces allow specific elements and strings to be interpreted by associating them with referenced dictionaries, also known as namespaces.
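As a rough illustration of the namespace mechanism described above, the following Python sketch parses a small record in which two vocabularies both define a "title" element. The sample record and the http://example.org/hr namespace are invented for this example; only the Dublin Core namespace URI is a real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two vocabularies both use the local name "title".
SAMPLE = """\
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:hr="http://example.org/hr">
  <dc:title>Swine Nutrition Report</dc:title>
  <hr:title>Senior Librarian</hr:title>
</record>"""

# Prefix-to-URI map used when looking up elements.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "hr": "http://example.org/hr",  # made-up namespace for the example
}

def titles_by_namespace(xml_text):
    """Return the two 'title' values, told apart by their namespaces."""
    root = ET.fromstring(xml_text)
    return {
        "document_title": root.findtext("dc:title", namespaces=NS),
        "job_title": root.findtext("hr:title", namespaces=NS),
    }

if __name__ == "__main__":
    print(titles_by_namespace(SAMPLE))
```

Without the namespace association, a program could not tell which "title" it was reading.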
XSL allows the results to be presented to the outside world. It includes XML Path Language (XPath), which references both labeled and unlabeled content components in XML documents. XSL associates the layout of a document with the markup. XSL Transformation (XSLT) controls the views of XML documents and the ordering of the elements.
A series of standards, including Resource Description Framework (RDF), can be used within XML to exchange metadata among applications. XML Schema defines XML data structures by specifying content, data types, etc. This level of specification, which is important for validation, was not included in the older document type definitions (DTDs) for XML structures.
Document Object Model (DOM) is a standard set of programmatic calls for building, navigating, and reading/writing XML documents. These core standards can be applied to various disciplines or aspects of disciplines, such as finance, education, biology, etc. There are more than 1000 XML-based standards registered at XML.ORG.
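To make the idea of "programmatic calls" concrete, here is a minimal sketch using the DOM implementation in Python's standard library (xml.dom.minidom). The element names "citation", "title" and "journal" are assumptions chosen for illustration and are not taken from any NLM format.

```python
from xml.dom.minidom import Document, parseString

def build_citation(title, journal):
    """Build a small document with DOM calls and serialize it."""
    doc = Document()
    citation = doc.createElement("citation")
    doc.appendChild(citation)
    for tag, text in (("title", title), ("journal", journal)):
        element = doc.createElement(tag)
        element.appendChild(doc.createTextNode(text))
        citation.appendChild(element)
    return doc.toxml()

def read_title(xml_text):
    """Navigate a parsed document and read the first title's text."""
    doc = parseString(xml_text)
    node = doc.getElementsByTagName("title")[0]
    return node.firstChild.data

if __name__ == "__main__":
    xml_text = build_citation("A Sample Article", "Journal of Examples")
    print(xml_text)
    print(read_title(xml_text))
```

The same build, navigate and read pattern applies to any XML vocabulary, which is why DOM is counted among the core standards rather than the domain-specific ones.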
At NLM, the goal is an infrastructure for web-based information exchange, both with external partners and internal applications. XML is a key component of NLM's web-based computing environment. The value proposition for XML is that it offers proof against technology change, promotes interoperability, and allows for output to multiple channels.
In the NLM environment, journal publishers may send bibliographic citations and abstracts in XML format. However, information is also captured via OCR (optical character recognition) and, in some cases, keyboard entry. These other methods for capturing information for processing are converted to XML and loaded to both the database and to the Data Creation and Maintenance System. Because the various input streams are converted to the same XML structure, a consistent set of programs can be applied to the content.
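The benefit of converting every input stream to one structure can be sketched as follows. This is only an illustration under stated assumptions: the element names ("Citation", "Title", "Abstract"), the publisher tags ("article-title", "abs") and the rule that the first OCR line is the title are all hypothetical, not NLM's actual formats or processing rules.

```python
import xml.etree.ElementTree as ET

def to_common_xml(title, abstract, source):
    """Build the single shared structure used by all input streams."""
    record = ET.Element("Citation", {"source": source})
    ET.SubElement(record, "Title").text = title
    ET.SubElement(record, "Abstract").text = abstract
    return record

def from_publisher_xml(xml_text):
    """Map a (hypothetical) publisher format onto the common structure."""
    src = ET.fromstring(xml_text)
    return to_common_xml(
        src.findtext("article-title", ""), src.findtext("abs", ""), "publisher"
    )

def from_ocr_text(text):
    """Map OCR output; assumes the first non-empty line is the title."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    return to_common_xml(title, " ".join(lines[1:]), "ocr")

def index_entry(record):
    """One downstream routine that works for every input stream."""
    return (record.findtext("Title"), record.get("source"))

if __name__ == "__main__":
    a = from_publisher_xml(
        "<art><article-title>T1</article-title><abs>A1</abs></art>"
    )
    b = from_ocr_text("T2\n\nline one\nline two\n")
    print(index_entry(a), index_entry(b))
```

Only the two small adapter functions differ per input stream; everything after conversion is shared.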
Originally, NLM began with SGML. Publishers who provided it received a slight incentive. The goal was to get as much of the information from the publishers in usable machine-readable form as possible. In the early days, publisher-supplied data accounted for only about 15 to 20 percent of NLM's input. The amount received from publishers is now close to 70 percent, and XML has made it easier for publishers to comply.
On the output end, licensees of NLM products are provided with XML output. The entire MEDLINE database is in XML. This was a result of NLM's system reinvention activities in 1997. Because of XML, only one set of output programs is necessary. This reduces the cost, simplifies the system, and allows integration of internal and external processes, while continuing to provide flexible content to licensees. The general benefits are interoperability and the ability to output the same content to multiple channels, because XML is technology independent.
In some cases, NLM has led the use of XML within the biomedical publishing community and, in other cases, it has followed the lead of others. A consortium at Johns Hopkins University is looking at XML within the medical community including XML for research papers. NLM recently joined the W3C (World Wide Web Consortium).
NLM is experimenting with other uses of XML. For example, the generic presentation of the content via XML may help NLM with its Section 508 compliance; NLM is experimenting with getting the web site to talk. Similarly, there are uses in the security area. XML is the basis for the Security Assertion Markup Language.
NLM's approach to implementing XML is to form joint XML committees and working groups within the organization. These groups are provided with continuing education regarding XML and are encouraged to build an XML community through cooperation with partners and involvement with standards organizations. Beginning from the core XML and migrating toward domain-specific, XML-based standards, XML is applied to both operational and research projects.
Mr. Liu shared several key lessons that NLM has learned. It is important to take a broad and holistic approach to the use of XML in your organization, including input, internal processing, and output. XML is not a short-term commitment but a major organizational change. It is important to understand the core standards and to keep abreast of domain-specific standards developments. Cooperation and learning from others are important, as is the staging of XML projects over the course of time. Lastly, security needs to be included in the process.
NASA STI Program
Michael Nelson
The NASA STI Program's use of XML is focused on the Open Archives Initiative (OAI). The OAI began in response to the increased number of technical report and e-print services across scholarly disciplines that were emerging in 1999. Mr. Nelson, Herbert Van de Sompel, Rick Luce, and others wanted to provide for a cross-cutting digital library of these materials. A demonstration system of what was then called the Universal Preprint Service (UPS) was built for the Santa Fe Meeting in the fall of 1999. The UPS was soon renamed the Open Archives Initiative.
What began with an emphasis on e-print services has grown into a generic bulk metadata transport protocol that is independent of resource type. There is significant interest on the part of commercial publishers. Elsevier's Scirus product uses OAI.
The OAI focuses on data providers and service providers as two separate entities. (In reality, the same organization could be a provider of both data and services.) Service providers can harvest from multiple data providers and data providers can support multiple service providers.
The request to the archive uses the HTTP protocol. There are only six verbs in the protocol. The first three verbs request information about the archive based on its self-description, and the last three get metadata from the archive. Flow control is provided through resumption tokens, so that the repository can control the number of records in each chunk that is being harvested. This allows the repository to stop the harvesting of its entire database, or at least to manage the resources that a harvester is requesting from the system. The response from the archive is in XML, using unqualified Dublin Core as the content standard. However, community-specific extensions are encouraged and expected, since groups are likely to want richer formats. A major issue is achieving a balance between simplicity and functionality.
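The request and flow-control pattern described above can be sketched roughly as follows. The verb names come from the OAI protocol specification rather than from these minutes. The repository address, the canned responses and the helper names are hypothetical, and the responses are heavily simplified (no namespaces, no metadata payload), so this is a sketch of the harvesting loop, not a conforming harvester.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# The six protocol verbs: three describe the archive, three return metadata.
DESCRIBE_VERBS = ("Identify", "ListMetadataFormats", "ListSets")
HARVEST_VERBS = ("ListIdentifiers", "ListRecords", "GetRecord")

def build_request(base_url, verb, **params):
    """Build the HTTP GET URL for one protocol request."""
    if verb not in DESCRIBE_VERBS + HARVEST_VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **params})

def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def parse_chunk(xml_text):
    """Return (record identifiers, resumption token or None) for one chunk."""
    root = ET.fromstring(xml_text)
    ids, token = [], None
    for el in root.iter():
        name = local_name(el.tag)
        if name == "header":
            for child in el:
                if local_name(child.tag) == "identifier" and child.text:
                    ids.append(child.text.strip())
        elif name == "resumptionToken" and el.text and el.text.strip():
            token = el.text.strip()
    return ids, token

def harvest(fetch, base_url, metadata_prefix="oai_dc"):
    """Follow resumption tokens until the repository stops issuing them."""
    url = build_request(base_url, "ListRecords", metadataPrefix=metadata_prefix)
    harvested = []
    while url is not None:
        ids, token = parse_chunk(fetch(url))
        harvested.extend(ids)
        url = (build_request(base_url, "ListRecords", resumptionToken=token)
               if token else None)
    return harvested

# Canned two-chunk repository standing in for a real archive (hypothetical).
BASE = "http://archive.example.org/oai"
CANNED = {
    build_request(BASE, "ListRecords", metadataPrefix="oai_dc"):
        "<ListRecords><record><header><identifier>oai:example:1</identifier>"
        "</header></record><resumptionToken>chunk2</resumptionToken></ListRecords>",
    build_request(BASE, "ListRecords", resumptionToken="chunk2"):
        "<ListRecords><record><header><identifier>oai:example:2</identifier>"
        "</header></record></ListRecords>",
}

if __name__ == "__main__":
    print(harvest(CANNED.__getitem__, BASE))
```

Because the repository decides how many records go into each chunk and whether to issue another token, it stays in control of the load a harvester places on it.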
The centralized metadata architecture of the OAI addresses several problems related to distributed searching, including speed, unavailable servers, and inconsistent presentation of results. The distributed full-text repositories allow the data providers to manage their own content and continue to provide their own services and specialized metadata formats for internal or specialized external products. The only requirement is that they expose the OAI metadata for harvesting.
OAI is not a full digital library system but a front-end add-on. It is middleware,
and users should never be exposed to OAI. OAI does not address security or terms
and conditions of access. These aspects are handled by a transport layer. Multiple
interfaces can be used to control different access privileges for different
customers.
The strategy of OAI has been to increase the number of data providers, with the idea that if data is available, service providers will emerge in response. To date, there are more than 50 registered data providers and many internal, closed implementations of OAI that are not registered. Service providers are beginning to come online (http://www.openarchives.org/).
The NASA Technical Report Server (NTRS) currently uses a distributed search model, but it will move to the centralized metadata architecture of OAI in the near future. The baseline OAI system at NASA's Langley Research Center should be available very soon. The remaining problem is achieving OAI compliance at the other NASA centers. A phased approach is planned.
NASA is also working with the DOE Laboratories on technical report interchange
so that there can be a network of technical report servers across agencies.
Mr. Nelson suggested that science.gov could use this approach to harvest content
from NASA for its purposes or, more generally, to harvest from across the science.gov
agencies.
DOE/Office of Scientific and Technical Information
Vince Dattoria
Dr. Warnick introduced the agency approach: DOE is working with its laboratories to implement XML. The intent is to go easy on data providers and leave the value-added work to service providers. DOE is using an extended Dublin Core as its metadata standard and XML for exchange. These standards are embedded in OSTI's core business processes, including data import, harvesting from providers, and re-purposing of content.
OSTI harvests metadata from several sources including the National Renewable Energy Laboratory (NREL). OSTI pulls reports from NREL daily based on a query and then extracts the necessary metadata content. A nearly zero-defect rate has been achieved. The system is flexible and allows OSTI to select, via the search statement, the elements it wants returned for a specific date or date range. The XML tags include the unique ID and the edit or creation date, so OSTI can identify records for which corrections or modifications must be made in the OSTI databases. The source of the data is included in the pull so that OSTI can attribute the source in its own databases.
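The way a unique ID and an edit date let a harvester separate new records from corrected ones can be sketched as below. The element names, the sample records and the in-memory "database" are all assumptions made for illustration; the minutes do not describe OSTI's actual tag set or storage.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical harvest result; only the URL to the full text is carried.
SAMPLE = """<results source="NREL">
  <record><id>R-1</id><modified>2002-03-28</modified>
    <title>Report one</title><url>http://www.example.org/r1.pdf</url></record>
  <record><id>R-2</id><modified>2002-04-01</modified>
    <title>Report two (corrected)</title><url>http://www.example.org/r2.pdf</url></record>
  <record><id>R-3</id><modified>2002-04-02</modified>
    <title>Report three</title><url>http://www.example.org/r3.pdf</url></record>
</results>"""

def parse_records(xml_text):
    """Return {record id: fields}, attributing each record to its source."""
    root = ET.fromstring(xml_text)
    source = root.get("source")
    records = {}
    for rec in root.iter("record"):
        records[rec.findtext("id")] = {
            "modified": date.fromisoformat(rec.findtext("modified")),
            "title": rec.findtext("title"),
            "url": rec.findtext("url"),
            "source": source,
        }
    return records

def plan_updates(local_db, harvested):
    """Split harvested IDs into brand-new records and changed records."""
    new, changed = [], []
    for rec_id, rec in harvested.items():
        if rec_id not in local_db:
            new.append(rec_id)
        elif rec["modified"] > local_db[rec_id]["modified"]:
            changed.append(rec_id)
    return sorted(new), sorted(changed)

if __name__ == "__main__":
    local_db = {
        "R-1": {"modified": date(2002, 3, 28)},
        "R-2": {"modified": date(2002, 3, 15)},
    }
    print(plan_updates(local_db, parse_records(SAMPLE)))
```

Records whose ID is already known and whose date has not advanced are simply skipped, which keeps a daily pull inexpensive for both sides.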
Currently, OSTI is harvesting metadata for unlimited unclassified material, but it will begin harvesting limited unclassified material in the near future. OSTI is looking to exploit this technique with other providers. There has been minimal impact on the provider systems, because OSTI does not harvest the full text but only the URL that points to the full text on the provider's site.
XML allows OSTI to re-purpose the data for specific domains. For example, OSTI
can create current awareness products in such areas as renewable energy and
technology. The National Technical Information Service (NTIS) harvests 15 elements
from OSTI for re-purposing to its own database.
"Wild Expectations for Searching XML: A Content Expert's Perspective
on XQuery"
Pat Case, Legislative Information System, Congressional Research Service
The promise of XML has created many expectations within user communities. These include state-of-the-art tools for document composition, the ability to produce multiple displays from a single document, and easier interchange of data and documents. The Library of Congress (LOC) had its own expectations of XML, which focused on searching. It expected to be able to search within elements, their children and their parents on demand; to search structured data and text; and to have a query language that supported a complete set of text search operators and functions.
XML and a standard query language are of particular interest to the Congressional Research Service (CRS). More search functionality is available on CRS' traditional internal legislative search system than via Thomas on the Web; CRS would like to improve the capabilities for everyone. With XML, CRS can "chunk" the text of a record and ensure that each chunk is displayed with the pages, volume, etc. This information cannot currently be displayed if the record is "chunked". XML will also allow entities to take on specific roles. For example, the same person may appear as the sponsor of a bill or as a person who remarked about a particular bill. Using XML mark-up, the person's name could be tagged and qualified with the role the person is playing in that particular instance. Unanimous consent agreements and other types of legislative concepts could be marked up by the Government Printing Office (GPO) or by CRS. In general, XML would allow the user to perform more precise searching and ensure more complete recall.
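The role-qualified markup described above might look something like the following sketch. The element and attribute names, the bill number and the people are invented for illustration and do not reflect any actual CRS or GPO markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same person appears in two different roles.
SAMPLE = """<bill number="H.R. 0000">
  <sponsor><person id="p1" role="sponsor">Rep. Jane Doe</person></sponsor>
  <remarks>
    <remark><person id="p2" role="speaker">Rep. John Roe</person> spoke in support.</remark>
    <remark><person id="p1" role="speaker">Rep. Jane Doe</person> responded.</remark>
  </remarks>
</bill>"""

def people_in_role(xml_text, role):
    """Return the names tagged with the given role, in document order."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.findall(f".//person[@role='{role}']")]

if __name__ == "__main__":
    print(people_in_role(SAMPLE, "sponsor"))
    print(people_in_role(SAMPLE, "speaker"))
```

A plain text search for the name would return both occurrences; qualifying the tag with a role is what lets a search for sponsors exclude the remarks.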
There is currently no search engine for querying XML, and no standard for XML querying is expected until sometime next year. However, realizing the importance of such a standard to the library community, and, in particular, to the CRS, the LOC formed a study group about 18 months ago. They realized their expectations might not be met unless they became involved in the standards development process.
Ms. Case has been involved for several months in the W3C XQuery working group. This group includes high-level representatives from many systems vendors. She often finds herself the sole proponent for more advanced querying concepts. Her goal is to revive some of the sophisticated search functions that were left behind with more simplistic web-based searching. In the effort to show the XQuery group the problem with simple searching and the need for tokenized text processing, the LOC produced its own test search cases and presented them to the XQuery group.
The XQuery group has produced a number of drafts, but, as of 2001, XQuery has
not met the expectations of the LOC. Key search functions that are missing include
thesaurus support, character normalization, stemming (truncation), stop word
lists, and proximity searching. The LOC group is also advocating the development
of an ignore operator (a "mild not" that would not exclude from the
results cases where a required term exists in the same document with an unrequired
term). Ultimately, they would like to see the XQuery effort result in an end
user syntax that standardizes text searching across web search engines.
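Two of the functions mentioned above, proximity searching and the "mild not", depend on tokenized text, and a small sketch may make the difference from simple string matching clearer. This is one possible reading of the ignore operator, written in Python rather than in any XQuery syntax; the function names and sample sentences are invented for the example.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_positions(tokens, phrase):
    """Start indexes where the token sequence `phrase` occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within(text, a, b, distance):
    """Proximity: True if terms a and b occur within `distance` tokens."""
    toks = tokenize(text)
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def mild_not(text, wanted, unwanted_phrase):
    """True if `wanted` occurs at least once outside `unwanted_phrase`.

    Unlike a strict NOT, the document is kept even when the unwanted
    phrase is also present somewhere in it.
    """
    toks = tokenize(text)
    phrase = tokenize(unwanted_phrase)
    covered = set()
    for start in phrase_positions(toks, phrase):
        covered.update(range(start, start + len(phrase)))
    return any(t == wanted and i not in covered for i, t in enumerate(toks))

if __name__ == "__main__":
    print(mild_not("Trade with Mexico and New Mexico", "mexico", "New Mexico"))
    print(mild_not("Offices in New Mexico", "mexico", "New Mexico"))
    print(within("The bill was amended by the committee", "bill", "committee", 5))
```

A strict NOT would discard the first sample sentence because it contains the unwanted phrase; the mild form keeps it because the wanted term also appears on its own.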
"CIO Perspectives and Opportunities"
Lisa Carnahan, National Institute of Standards and Technology/ CIO XML Working Group, Registries/Repository Team
The CIO Council has established an XML Working Group (http://xml.gov/working_group.htm). This is not a policy, rule-making, or standards body. The goal is to accelerate, facilitate, and catalyze the effective implementation of XML technology in the federal government. They are trying to get people moving in the same direction. They are particularly interested in business activities for both the public and private sectors. The group's activities include developing best practices and recommended standards, establishing partnerships with industry and the public sector and vertically within the government, and conducting research-oriented education and outreach.
The Working Group is divided into several teams, one of which deals with registries and repositories. The Registry team's mission is to "facilitate the awareness and appropriate reuse of existing data element definitions, schemas and related documents." The Registry is a service for depositing XML DTDs and other related information, and it is not meant to mandate policies or use of the registered objects. On the team's web site (http://xmlregistry.nist.gov/xml-gov/), the Efforts section is particularly important, because it is the place to look for collaborative projects and lessons learned.
The team is currently developing a prototype registry as a proof of concept. The registry software provided by Data Networks Corporation is based on ebXML version 1.0. Questions to be addressed by the prototype include: "Do standardized registry services meet our needs? What are the interoperability requirements and the issues that need to be solved to achieve a distributed registry model?" They are developing policies and procedures about who can submit, what the vetting process should be, and what administrative functions and information are needed, etc. In about a month, they will be in a "friendly beta test stage" with the system going live later.
Ultimately, neither the Team nor the National Institute of Standards and Technology (NIST) is likely to house the operational registry. A home at the General Services Administration (GSA) or at several sites is more likely. It is unlikely that this will be a centralized registry but, rather, a series of distributed registries.
Ms. Carnahan encouraged CENDI to exchange metadata schemes, definitions, etc., among the CENDI members. This information could also be valuable to other S&T agencies and for authors. She also suggested that representatives from the agencies attend the XML Working Group meetings. There are no real members; instead, people just "show up". Access is limited to federal representatives and to contractors who are directly representing federal employees. The plan is to foster interagency cooperation and develop agency-specific XML groups. In general, she believes that end users should be better represented on standards groups.
Dr. Brand Niemann, who is also a member of the Working Group, discussed some of the broader issues the group is addressing, including multimedia sources such as voice. He distributed a series of handouts related to this effort.
"NAL Showcase"
Gary McCone, National Agricultural Library
Digital Desktop Library Initiative
Eleanor Frierson
NAL's budget has been essentially flat for the last six years, while the cost of journals has increased an average of 15 to 20 percent per year. In response to the need to control costs and to provide better access to the more than 106,000 USDA employees worldwide, NAL embarked on an initiative to bring electronic resources to the desktop through consortium purchasing.
An internal consortium was formed. NAL identified resources of interest and then negotiated free access from the vendors for a trial period. A trial was conducted in May/June 2001 that included products worth over $10M. The first two consortium licenses were signed in August-September for the Economist Intelligence Unit Country Reports and Country Profiles. The savings are estimated at over $200,000 for these products. Three agencies within the department are contributing. NAL is currently investigating how to provide more user support and training and is seeking to identify funds for additional joint purchases.
Agricultural Thesaurus
Lori Finch
The subjects of interest to the USDA are wide-ranging, including much of the biological sciences, social sciences, earth and environment, physics and chemistry, etc. In the Agricultural Thesaurus, these interests are conveyed in 17 high level categories. (http://agclass.nal.usda.gov/agt/intro.htm)
Ms. Finch demonstrated how the thesaurus has been integrated into searching. It can be used to automatically search for a preferred term when the user enters a non-preferred term (for example, swine instead of pigs). It can also be used to include narrower terms in a search (such as piglets, boars, etc.). Ambiguous terms can be resolved by asking the user to make a choice when the thesaurus determines that a term has multiple meanings.
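The three behaviors demonstrated (switching to a preferred term, adding narrower terms, and asking the user to resolve an ambiguous term) can be sketched with a toy thesaurus. Every entry below is illustrative only: the direction of the swine/pigs mapping follows the wording of the example above and may not match the actual Agricultural Thesaurus, and the ambiguous term is invented.

```python
# Tiny hand-made thesaurus; every entry below is illustrative only.
USE = {"swine": "pigs"}                    # non-preferred -> preferred (assumed)
NARROWER = {"pigs": ["piglets", "boars"]}  # preferred -> narrower terms
AMBIGUOUS = {"seal": ["seal (animal)", "seal (gasket)"]}  # invented example

def expand_query(term, include_narrower=False):
    """Turn a user's term into the list of terms actually searched."""
    key = term.strip().lower()
    if key in AMBIGUOUS:
        # More than one meaning: hand the choice back to the user.
        return {"status": "choose", "choices": list(AMBIGUOUS[key])}
    preferred = USE.get(key, key)
    terms = [preferred]
    if include_narrower:
        terms.extend(NARROWER.get(preferred, []))
    return {"status": "ok", "terms": terms}

if __name__ == "__main__":
    print(expand_query("Swine"))
    print(expand_query("swine", include_narrower=True))
    print(expand_query("seal"))
```

Terms that the thesaurus does not know pass through unchanged, so the lookup never blocks an ordinary keyword search.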
Extension terms, covering the areas of family studies, youth development, and consumer science, are being added to the Ag Thesaurus. Proposals for candidate terms are received via a comment form from USDA employees only. The comment form has been submitted to the Office of Management and Budget (OMB) for approval as a public input mechanism.
The thesaurus, funded by the Agricultural Research Service (ARS), has been available since 2000. They have encouraged scientists to use it, review it, and provide feedback. It was first used as a controlled vocabulary for the Agricultural Research Information System. AgNIC also uses it as a controlled vocabulary. The National Program Staff at ARS use the hierarchy and lead-in terms to enhance searching of their web site. Ultimately, NAL would like to use the new Ag Thesaurus as the controlled vocabulary for AGRICOLA.
The thesaurus is being investigated as a possible web service so that it could
be maintained only at NAL, but used in real-time from throughout the USDA and
beyond.
USDA Digital Publications Preservation
Evelyn Frangakis
An NAL-sponsored conference in 1997 identified the major requirements for preservation of USDA publications that are born digital. A digital publication is defined as "a data or information product prepared by the USDA in digital form intended to be disseminated to the public." A framework was developed, and a National Steering Committee was created to implement the framework, which involved policy and legislative activities and recommendations. The group included representatives of various mission areas as well as external stakeholders and experts.
Early in the process, the Steering Committee identified six major issues. These were managing technology over time, metadata, user access, organizational buy-in, operational considerations, and the cost of creating and sustaining such a system.
There have been major accomplishments to date. A survey instrument was developed to identify digital publications produced by the USDA. The survey was validated by a small pilot project with the Economic Research Service, which has served as a pilot agency and has funded NAL to conduct research on its behalf.
A metadata conference was held for USDA staff. The goal was primarily to raise awareness and to educate the USDA staff, including information producers, webmasters, librarians, etc. This was part of achieving buy-in.
The Steering Committee developed draft digital preservation guidelines for the department. Last month, NAL received informal word to turn them into USDA regulations. They are currently working on a roll-out and marketing plan for the guidelines. Much of this work will be educational in nature. The guidelines are very broad to enhance both access and preservation. More detailed guidelines will come out in the future.
It is important to continue to get organizational buy-in. Technical requirements continue to be developed, particularly with regard to the management of technology in the long-term. Funding for digital publication preservation continues to be a challenge.
As part of the effort, NAL has investigated the use of the Open Archival Information
System (OAIS) Reference Model. They have successfully mapped their NAL Digital
Preservation and Archiving Prototype to that of the OAIS, and have incorporated
other standards and processes, such as MARC, Dublin Core, and CORC.
NAL's Metadata Template
Sherry Davids
Members of three NAL divisions worked together to develop the metadata template, which will be used for documents created by NAL staff and for materials that are digitized from print for preservation purposes. The template includes elements for preservation, identifies content stability, and supports access. Resource availability is left up to the creator of the document. In a prototype system, NAL is using Handles. The catalogers are required to use the Ag Thesaurus. Some fields are required and others are not. There is a link to the GNIS (Geographic Names Information System/USGS) site to do geospatial descriptions. XML support is not yet available, but it will be included in the final system.
When the metadata is completed and submitted, the Handle is sent to the master Handle system at the Corporation for National Research Initiatives (CNRI). The system also notifies cataloging of the document and creates metadata that can be cut and pasted into an HTML document as metatags. Currently, the information must be cut and pasted into NAL's library system. (This will be more automated when NAL procures a new library system.) A MARC 530 field is added if the document was digitized and has a call number related to the print. The record is saved into the CORC system automatically as part of the normal workflow.
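The step of turning a completed template into metatags that can be pasted into an HTML document could be sketched as follows. The field names (written in the common "DC." style), the sample values and the placeholder Handle are assumptions for illustration and are not taken from NAL's template.

```python
from html import escape

def to_meta_tags(metadata):
    """Render a metadata dictionary as HTML <meta> tags, one per value."""
    lines = []
    for name, values in metadata.items():
        if isinstance(values, str):
            values = [values]  # allow single or repeated fields
        for value in values:
            lines.append(
                '<meta name="{}" content="{}">'.format(
                    escape(name, quote=True), escape(value, quote=True)
                )
            )
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical record; the Handle below is a placeholder, not a real one.
    record = {
        "DC.title": 'Pigs & "Swine": A Sample Report',
        "DC.subject": ["swine", "piglets"],
        "DC.identifier": "hdl:0000/example",
    }
    print(to_meta_tags(record))
```

Escaping the values matters because titles often contain quotation marks or ampersands that would otherwise break the pasted HTML.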