CENDI ANNUAL PLANNING MEETING
National Library of Medicine
Bethesda, MD
August 6, 2002
Following are the two keynote speeches (Dr. Clifford Lynch and Dr. Martin Cummings) presented at the planning meeting.
"Challenges and Opportunities for Agencies in Homeland Security and
Permanent Public Access to STI"
Dr. Clifford Lynch, Coalition for Networked Information
The Coalition for Networked Information (CNI) is a membership organization of academic organizations, publishers, government agencies, system suppliers, and other content providers, from the U.S. and a few foreign countries. CNI has a broad agenda around advanced digital services and networking focused on academia. Issues emerging in these communities are consistent with CENDI's planning themes, which raise the possibility of joint opportunities between CENDI and CNI.
A recent draft publication from a National Science Foundation (NSF) committee provides a vision of scientific research, describes the IT environment needed to support it, and addresses the necessary funding and organizational responses needed by NSF to develop the cyber infrastructure required for science. The group was chaired by Dr. Daniel Atkins, former Dean of the School of Information Studies, University of Michigan. The draft was released in May and the comment period has ended. The final version should be released soon, but the broad recommendations expressed in the draft are expected to remain unchanged.
The committee grew out of the Partnership for Advanced Computing, a coalition of the NSF-funded supercomputer centers. As the Partnership's activities related to the high-speed backbone and advanced networking come to an end, there is a need to think about what should come next. Within this context, the committee was asked to address the needs of science in the next decade.
The committee reaffirmed the need for high performance networking, but it also focused on the need for middleware and applications software. Given the way that science is changing, we need to think not only about computation but also about communication. Data issues must be included in NSF research, including data structures, integration with scholarly literature, collaboratory environments, digital archiving, and digital library management.
Although the Atkins Report does not address other federal agencies, it may support the case for treating scientific publications and data as critical components of scientific infrastructure. Dr. Lynch suggested that CENDI consider a presentation by, or discussions with, members of the Atkins Committee.
Permanent public access is another issue of joint interest. It is receiving increased lip service and attention within the academic community. Actual activities are beginning to happen, primarily around scientific journal literature. The Mellon Foundation has funded key activities through its seven planning grants. The Mellon projects have been effective in getting large publishers such as Elsevier involved in the key issues related to archival deposit. There is the beginning of convergence on an archival DTD (or OAIS Submission Information Package) for journals, originally developed by Dr. David Lipman at the National Center for Biotechnology Information (NCBI). A not-for-profit entity will be established to centralize archiving for academic environments. However, enormous problems remain for small publishers. Non-journal literature, such as technical reports, and the multiplicity of formats and submission regimes must also be addressed.
Digital preservation is important to homeland security. The loss of information that resulted from the 9/11 catastrophe, along with the evaporation of corporate knowledge and the dot-com bust, has emphasized the importance of preservation. We need to gain recognition of "content" as critical infrastructure. It must encompass key STI and data sets. Attacks on network-based systems are increasing. In the past, security has focused on denial of service or seizing control of a site, with little attention to the security of content. Content is part of a critical infrastructure. How do we prevent corruption of scientific and technical data sets and literature? While long-term migration deserves attention, the stewardship of this information, one day at a time, must be considered. We need to make sure that the bits get to the future accurately before they can be preserved.
Someone must take responsibility. Finding the model for stewardship of these materials is more difficult since libraries are shifting to electronic subscriptions for which they do not take physical possession. In the past, the physical transfer of published materials to multiple libraries ensured preservation through redundancy. What provision is being made for data integrity to prevent systematic, low-grade corruption?
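One common way to detect the kind of silent, low-grade corruption raised above is a periodic fixity audit: record a cryptographic digest for each object at ingest and compare it against the stored bits later. The following is a minimal sketch of that idea, not something described in the talk; the function names `digest` and `audit`, the in-memory `store`, and the sample file names are all hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an object's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit(manifest: dict, store: dict) -> list:
    """Return the names of objects that are missing or whose bytes
    no longer match the digest recorded in the manifest."""
    damaged = []
    for name, recorded in sorted(manifest.items()):
        current = store.get(name)
        if current is None or digest(current) != recorded:
            damaged.append(name)
    return damaged


# Hypothetical archive contents, keyed by object name.
store = {"report.txt": b"technical report", "data.csv": b"1,2,3\n"}

# Digests are recorded once, at ingest time.
manifest = {name: digest(content) for name, content in store.items()}

# Simulate silent corruption of one object after ingest.
store["data.csv"] = b"1,2,4\n"

print(audit(manifest, store))  # ['data.csv']
```

An audit of this kind only detects damage. Repair still depends on a second, independently held copy, which is why the loss of redundancy across libraries matters.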
The roles of libraries and commercial publishers have changed, with publishers now having more responsibility for archiving. Dr. Lynch questions whether the publishers are being adequately audited. Much the same pattern exists for government information. With government information, rights generally are not an issue, but there are issues of complexity with regard to databases and data sets.
The economics of digital preservation and long-term access remain unknown. TULIP, a preservation project from the early 1990s, gave the universities a taste of the economics involved in digital preservation. Unfortunately, it may have confused the cost of the system development with the cost of commercial development, and the universities backed off. The publishers concluded, therefore, that in order to make a market, they needed to store the digital versions of published material at the publishers' locations. Now, storage costs are no longer the problem. The key issue is providing new access systems, which are ultimately more expensive than storage. Dr. Lynch envisions a two-tier future where data is managed for the long term but access mechanisms come and go.
Publishers are not dealing with the preservation of ancillary data sets, primarily because of the size of such data sets. They do recognize some of the guidelines related to certain disciplines. Ultimately, this lack of attention to ancillary data may be a detriment, but institutional stewards could archive the data. Alternatively, NSF could establish Centers for Scientific Data Stewardship, modeled after the supercomputer centers, which would span sectors.
The Library of Congress has received Congressional funding and a mandate to develop a digital preservation infrastructure, but the project is more concerned with cultural history than with science and technology. Copyright extensions, along with the bizarre dance of legislation with entertainment conglomerates, are impeding the development of the infrastructure for both culture and science. In response, private repositories have been proposed that would meet certain criteria, which the government (i.e., the Library of Congress) would audit. This model fits the growing emphasis on repositories that grew from preprints to the deposit of full versions of published works. This model is being extended to additional disciplines; some cultures embrace this approach while others do not.
Institutional repositories are also growing in order to manage the assets of academic communities. Most faculty members have web sites with papers, software, courseware, etc. Unfortunately, many of these sites are not run professionally, and there is concern about the loss of these assets as faculty members retire. Institutional repositories may be an attractive alternative to individual sites by removing the maintenance burden from the faculty member, consolidating resources, providing more security, and capturing institutional events. Examples include MIT's Open Courseware Initiative.
The institutional repositories may serve as switching points, eventually turning content over to disciplinary repositories. These collaborations (cross propagation) among repositories are supported by the work of the Open Archives Initiative (OAI) and other standards for interoperability. There are two aspects to the OAI activities. The first, supported by CNI and the Digital Library Federation, is a purely technical effort that defines the metadata harvesting protocol and makes content available to services. This approach is agnostic about economics, licensing, access, and intellectual property. The second OAI activity is a medium in which technical efforts can be launched. The goal is to ensure that scholars put their papers on publicly accessible archives. Institutional archives/repositories are a path to the second aspect of OAI by providing a service rather than making compliance compulsory.
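To make the metadata harvesting protocol concrete, here is a minimal sketch of how a harvesting service might build an OAI-PMH request. The `ListRecords` verb, the `metadataPrefix` and `from` arguments, and the `oai_dc` (Dublin Core) prefix come from the OAI-PMH specification; the helper name `list_records_url` and the repository address are hypothetical, and the sketch only constructs the URL without contacting any server.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository.

    from_date, if given, asks only for records changed on or after
    that date, which is what makes incremental harvesting possible.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used for illustration only.
print(list_records_url("https://repository.example.edu/oai",
                       from_date="2002-08-06"))
```

The request carries nothing about licensing or access terms, which is consistent with the protocol being agnostic about those questions: it exposes metadata and leaves policy to the repository and the services built on top.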
Institutional repositories may be on the front line of validation. Public key infrastructure (PKI) and authentication are crucial. Dr. Lynch believes that identity management is more likely to be the purview of the government and institutions than of the commercial sector or learned societies because of the wider range of motivations in the former sectors. Broader uses of these technologies mean that you can amortize the cost across the various uses, while the information sector benefits. However, significant issues remain before such an authentication infrastructure can be implemented on a large scale. What happens when a person moves from one institution to another? What is the professional courtesy involved? How should identities be managed for those who are deceased?
Security and preservation have not been connected traditionally. The connection is more porous and complicated than policy makers have acknowledged. It is important to realize that replication of archives can also propagate problems. The LOCKSS (Lots of Copies Keep Stuff Safe) approach developed by Stanford/HighWire Press has possibilities, since it attempts to maintain multiple valid copies through large-scale repair and caching. LOCKSS has an ingest layer that is currently highly specific to journals. The LOCKSS funders have been told that they need to separate the layers to allow different ingest models and to make the underlying mechanism applicable to a variety of object types. They are looking for test sites.
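The general idea of keeping multiple valid copies through repair can be illustrated with a simple majority vote across replicas. This is a deliberately simplified sketch and is not the actual LOCKSS polling protocol; the function name `repair_by_majority` and the site names are hypothetical.

```python
import hashlib
from collections import Counter


def repair_by_majority(copies: dict) -> dict:
    """Given site -> bytes for one object, overwrite every copy with
    the version held by a strict majority of sites.

    Raises ValueError when no version has a strict majority, because
    repairing from a plurality could spread damage instead of fixing it.
    """
    digests = {site: hashlib.sha256(content).hexdigest()
               for site, content in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no strict majority; cannot repair safely")
    good = next(content for site, content in copies.items()
                if digests[site] == winner)
    return {site: good for site in copies}


# Three hypothetical sites; one holds a damaged copy.
copies = {"site_a": b"article v1", "site_b": b"article v1",
          "site_c": b"artic1e v1"}
print(repair_by_majority(copies)["site_c"])  # b'article v1'
```

The sketch also shows the caution about replication: if a corrupted version has already propagated to most sites, a majority vote will "repair" the remaining good copy to match the bad ones.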
Dr. Lynch believes that the commercialization of the Internet was not a mistake. However, we must pay for our lapses and inattention to security and the drive to implement systems that have not been thoroughly tested. We are unsophisticated in our thinking about these issues. Hard policy choices about anonymity on the Internet and what can and can't be done while remaining anonymous must be revisited. How we resolve these questions will be critical.
In this environment, partnerships will become increasingly important. Dr. Lynch suggested that CENDI think about the parts of the S&T critical infrastructure that are held privately and look for collaborations. Various partnerships are emerging between higher education and government for preservation that ensure stewardship, migration and redundancy of digital content.
Dr. Martin Cummings, Director Emeritus, NLM
Dr. Cummings provided a historical perspective on both the technologies that support information management and the public/private issues. Throughout history, the public and private sectors have collaborated on key information technology initiatives. Dr. John Billings identified the need for a more effective way for the government to conduct the 1890 Census. He presented the requirements, and Herman Hollerith, from the private sector, built the punchcard-operated tabulator. This could be thought of as the first public/private STI development. Similarly, William Shockley's transistor was married to John von Neumann's concept of the computer. Once this marriage occurred, computers developed quickly; Moore's law (Gordon Moore) continues to be confirmed with a doubling of transistors on a chip every 18 months. Thomas Watson at IBM determined that perhaps five computers would be needed worldwide. However, today, we have the equivalent of one of those computers in our briefcase.
The Internet was another marriage of public and private sector development. The Internet population was estimated at 400M people worldwide in 2000. The number is expected to double to 800M by 2006. The U.S. now has 175M people with Internet access, which is expected to increase to 240M by 2006.
Dr. Cummings believes that one of the best histories of government policy and public/private sector relations was presented by Kent Smith, the current CENDI chair, in his Miles Conrad Lecture. He recalled the copyright battle between NLM and a large medical publisher in Williams & Wilkins v. United States. NLM refused to pay the publisher for photocopies for library users. It took seven years to litigate, but the result was a significant decision regarding fair use for the public. The appeal showed that no economic damage had been done to the publisher. The significance of the decision and the details of the court proceedings are documented in The Road to Copyright by Paul Goldstein.
Similarly, when NLM determined marginal costs for performing searches (a fee for service), Elsevier challenged the rate because it was considered too low. NLM said that public access should be free or minimal. Congress agreed that such a service by NLM was a public good.
A newer issue related to the public/private sector is that of contracting government services to the private sector. Dr. Cummings established a clear distinction between contracting out and privatization. He believes that if a contractor can do the work better or cheaper than the government, e.g., translation of foreign materials, system development, etc., then contracting is appropriate. However, privatization gives up government responsibility and oversight.