CENDI - 2004-3
Sponsored by ICSTI and CENDI
Prepared by Evelyn Frangakis
February 2004
The authors wish to thank the following for their contributions to this report: William Anderson (US CODATA); the CENDI Digital Preservation Task Group.
ICSTI is a unique forum for interaction amongst organizations that create, disseminate and use scientific and technical information. ICSTI's mission cuts across scientific and technical disciplines as well as international borders to give Member organizations the benefit of a truly global community. ICSTI seeks to reduce or eliminate barriers to effective transfer of information.
Web: http://www.icsti.org
CENDI is an interagency cooperative organization composed of the scientific and technical information (STI) managers from the Departments of Agriculture, Commerce, Energy, Education, Defense, the Environmental Protection Agency, Health and Human Services, Interior, Government Printing Office, National Archives and Records Administration, and the National Aeronautics and Space Administration. CENDI's mission is to help improve the productivity of federal science-based programs through the development and management of effective scientific and technical information support systems. In fulfilling its mission, CENDI member agencies play an important role in helping to strengthen US competitiveness and address science- and technology-based national priorities. Web: http://www.cendi.gov
1.0 Introduction
2.0 Scope and Methodology
3.0 Highlighted Systems
4.0 Setting the Stage
4.1 Archiving Concepts and Definitions
4.2 The Scientific Environment
4.3 The Technological Environment
4.4 Scientific Publishing and Communications
4.4.1 Open Access
4.4.2 Institutional Repositories
4.5 Legal Deposit and Copyright
5.0 Stakeholder Roles
5.1 Publishers
5.2 National Libraries
5.3 Institutions
5.4 Museums
5.5 National, State and Regional Archives
5.6 Trusted Third Parties
5.7 The Role of Government
5.8 Foundations and Other Private Funding Sources
6.0 Preservation by Document Type
6.1 Electronic Journals
6.2 Theses and Dissertations
6.3 Scientific Data Sets
6.4 Technical Reports
6.5 Conferences, Meetings and Lectures
6.6 E-Records
7.0 Standards by Format Type
7.1 Text
7.2 Images
7.3 Numeric Data
7.4 Video and Audio
7.5 Output from Design, Modeling and Visualization Tools
8.0 The Workflow
8.1 Selection Criteria
8.2 Metadata Creation
8.3 Archiving and Transformation
8.3.1 Transformation to a Preservation Format
8.3.2 Migration
8.3.3 Migration On-Request
8.4 Storage
8.5 Dissemination
9.0 The Introduction of "Off-The-Shelf" Systems
9.1 DSpace Institutional Digital Repository System
9.2 Digital Information Archive System
9.3 OCLC Digital Archive
9.4 PANDORA Digital Archiving System (PANDAS)
9.5 Lots of Copies Keep Stuff Safe (LOCKSS)
9.6 Fedora™ (Flexible Extensible Digital Object Repository Architecture)
10.0 Standards Activities
10.1 Metadata
10.1.1 Descriptive Metadata
10.1.2 Preservation Metadata
10.1.3 Technical Metadata
10.1.4 Structural Metadata
10.2 Permanence Ratings
10.3 Open Archival Information System Reference Model (OAIS RM)
10.4 Producer-Archive Interface Methodology
10.5 Persistent Identifiers
11.0 New Issues and the Research Agenda
11.1 Authenticity
11.2 Rendering Objects for Permanent Access
11.3 Saving the Dynamic Web
11.4 Appraising and Retaining Scientific Data
11.5 Preserving Government Information
11.6 Archiving the Archive
11.7 Interoperable Archives
11.8 Partnerships
11.9 Costs and Sustainability
12.0 Findings and Trends
13.0 Recommended Next Steps
14.0 References
15.0 Appendix I: Follow Up Discussion Questions
Executive Summary
In 1999, the International Council for Scientific and Technical Information (ICSTI) and CENDI jointly sponsored a report on Digital Electronic Archiving: The State of the Art and Practice (Carroll & Hodge 1999). ICSTI and CENDI remain interested in digital preservation as they represent large repositories, publishers, and libraries of scientific and technical information. This report is an update to that 1999 report.
This report focuses on operational digital preservation systems specifically in science and technology (S&T). It considers the wide range of digital objects of interest to S&T, including e-journals, technical reports, e-records, project documents, scientific data, etc. The report also discusses archiving based on format types -- text, data, audio, video, etc. It is, of course, international in scope, and as much as possible crosses organizational sectors (academic, government, commercial, etc.).
This report does not attempt to provide a comprehensive survey of systems. Rather, it highlights selected systems and projects that can help to identify trends, remaining issues, and activities that ICSTI, CENDI, and other organizations interested in the preservation of and permanent access to the record of science can consider when developing their own systems and policies. More than 50 projects and systems were identified from the surveys, from experts, or from the literature. From these, 21 were selected for highlighting in this report, although references are made to other projects throughout the report as appropriate.
The major findings are as follows:
Systems solutions are being developed by a variety of stakeholders and partnerships.
The advent of off-the-shelf solutions shows advancing maturity in the area of digital preservation. The library model with shared cataloging tools and service providers is apparent. The six key systems, the OCLC Digital Archive, DSpace, LOCKSS (Lots of Copies Keep Stuff Safe), Fedora™, PANDAS, and the Digital Information Archive System (DIAS) from IBM, come from different types of organizations -- a library service provider, a university repository, a large academic research library paired with a provider of publishing services, a university repository teamed with another university's digital library research group, a national library system, and a national library working with a commercial company, showing the need for partnerships and the interactions among a variety of stakeholders.
The Open Archival Information System (OAIS) Reference Model has been widely adopted.
The OAIS Reference Model, which became an International Organization for Standardization (ISO) standard in June 2003, has been adopted widely. All types of archives use the OAIS terminology and conceptual model. However, it is not as prevalent in the scientific data community for which it was initiated, partly because these organizations already had legacy systems, customers, producers, and processes. Efforts are underway among some data archives to minimally ingest Submission Information Packages (SIPs) and to produce Dissemination Information Packages (DIPs) in order to respond to the spirit of the standard. As systems are redesigned and the need for interoperability increases, it is likely that the OAIS Model will become more prevalent as the conceptual basis for scientific archives.
Organizations are focused on capturing and acquiring digital information, rather than preservation or permanent access.
Even if they use the term archive or have preservation in their mission, the initial goal is to get a critical mass of material, to promote a culture of deposit/submission/harvesting and sharing, and to provide access to the currently collected materials. While many of the institutional repository activities are committed to long term preservation and access, the technical and metadata aspects required are not yet well incorporated into their systems.
Efforts for digital depository legislation are gaining momentum.
There are significant activities on the part of national libraries and other stakeholder groups with regard to changing existing laws or adding new laws that would require deposit of digital materials. This has gained significant momentum over the last several years, and most recently, the United Kingdom and New Zealand have passed such legislation. Digital deposit legislation may be more accepted, now that there have been major pilot projects involving national libraries and large commercial publishers. In addition, voluntary arrangements are already in place, so the legislation more closely reflects current practice rather than leading it.
Migration remains the preservation strategy of choice; it is still too soon for most archives to have undergone a significant technological change.
Other than the large data archives, which have existed for many years, archives have not yet faced large-scale technological changes. This means that migration remains the strategy for most of the materials of interest to libraries, archives, and publishers. The prevalence of migration, particularly from one version of software to another, also indicates the prevalence of commercially available products, such as Microsoft Office and Adobe products, in the scientific environment. While concerns were expressed about outdated software, hardware, and media, these issues are not the current focus as the institutions grapple with collecting and ingesting the flood of current archival content.
There are increased standards-related activities.
There are standards-related activities underway in the areas of producer-archive interaction, permanence ratings, persistent identifiers as critical components of digital preservation systems, preservation metadata, and preservation formats (e.g., PDF/A for text). These activities are likely to produce significant results, because they are codifying many of the best practices that have been identified over the last several years of pilot projects.
Open standards developed for interoperability hold promise as the basis for preservation formats.
While the main rationale for development of open standards is interoperability among software environments, these standards also may be applicable for long term archiving. Open formats such as those for geographic information systems (OpenGIS), product design and manufacturing (STEP), open office documents (OpenOffice), and chemical structures (Molfiles and SMILES) are working toward hardware and software independence. The potential for using these formats for preservation should be investigated further.
Key technical issues remain.
There are several key technical areas requiring future research that have been identified in recent studies funded by the National Science Foundation. Additional research is needed into the automatic generation of metadata, through self-describing objects or the provision of archiving mechanisms in authoring tools. Registries, perhaps of a global nature, are needed to maintain authoritative, computer-actionable information about metadata tag sets, reference information for formats and hardware/software behaviors. Research into the archiving and preservation of dynamic, non-HTML and database-driven Web content is a major research activity for several groups. Other technical issues include creating interoperable archives and best practices for archiving and preserving the archive itself.
Partnerships are increasingly important.
Over the last several years there has been an increasing realization that partnerships are the only way to ensure that digital information will be preserved. In addition to ensuring some measure of comprehensiveness over the wide spectrum of scientific information in digital form, partnerships have the benefit of providing some measure of redundancy, sustainability and sharing of the cost for preservation which is likely to exceed the revenues that can be made on the reuse of any particular object. A workable infrastructure will result from a multi-pronged approach involving publishers, libraries, archives, institutions, and trusted third parties, with appropriate support from governments, foundations and other funding sources, users and creators during the life cycle of the material to be preserved.
Key social, political, and economic issues remain, including the need to develop a "will to preserve and provide permanent access" within the scientific and technical community and society in general.
There are several outstanding social and political issues that require further discussions by the various stakeholder groups involved in preserving and providing permanent access to scientific and technical information. For example, the social, political and legal aspects of creating federated archives and working partnerships that cross stakeholder groups and object types (data, publications, multimedia, etc.) must be resolved. The archiving and preservation of and long term access to government information pose special challenges in this regard. Sustainable business models that will survive for the long term also remain elusive. Collecting information about the cost of digital archiving and preservation proved to be as difficult as in the first report, with most of the respondents unable or unwilling to provide cost information. However, several major organizations (OCLC, DSpace, National Library of Australia) are trying value-added services and licensing of software to other organizations as ways of offsetting the cost of preservation activities. Overriding these social, political and economic issues is the need to develop within the scientific and technical community and society in general a culture that encourages the "will to preserve and provide permanent access".
The work on digital preservation is continuing apace with significant developments in off-the-shelf, generalized digital preservation systems; legal deposit legislation; partnerships and federations; and standards activities. However, much remains to be done. There are activities in which both ICSTI and CENDI can take a lead or become involved that will move preservation practice forward.
1.0 Introduction
ICSTI and CENDI have been interested in electronic archiving and other issues related to the management of digital information since 1996. In 1998, the two organizations determined that a synopsis of relevant projects would benefit their members and the digital preservation community as a whole, provided that it focused on science, was international in scope, addressed various types of digital objects, and included projects at all stages of development. Therefore, the International Council for Scientific and Technical Information (ICSTI) and CENDI jointly sponsored a 1999 report on Digital Electronic Archiving: The State of the Art and Practice (Carroll & Hodge 1999).
Since 1999, both organizations have remained active in issues and discussions related to digital preservation. Based on the findings of the 1999 report, ICSTI and CENDI sponsored a variety of workshops, presentations and articles on digital preservation (Hodge 2000, Hodge 2002, Mahon & Siegel 2002). ICSTI's President, David Russon, made recommendations concerning the importance of preservation in the sciences to the World Science Congress (Russon 1999). CENDI and the Federal Library and Information Center Committee (FLICC) sponsored a workshop on the Open Archival Information System Reference Model in relation to the management of US government information (CENDI & FLICC 2001). Most recently, the need for preservation was included in ICSTI's input to the World Summit on the Information Society (ICSTI 2003). On an ongoing basis, CENDI's Digital Preservation Task Group monitors and reviews best practices and standards as they relate to preservation of the results of science and technology research in the science mission agencies of the US federal government.
Once again, both organizations are joining together to produce this report on the state of digital preservation. The purpose of this report is to determine the new advances and issues in the preservation of scientific and technical information by focusing on operational systems specifically in the sciences. The goal is to advance the thinking and practice of ICSTI and CENDI members and to provide a basis for further work by others, particularly in the scientific community.
The report begins with a statement of the scope and methodology and an overview of the highlighted systems. The subsequent sections use the highlighted systems and information gathered from experts and from the literature to discuss stakeholder roles, archiving and preservation practices by document type and format type, the workflow established by operational systems, standards activities, and the availability of "off the shelf" systems. The report concludes with a discussion of trends and issues and possible next steps for CENDI and ICSTI.
2.0 Scope and Methodology
The scope of this report is purposefully quite narrow: it focuses on operational digital preservation systems in science and technology.
The call for participation was sent to several listservs, and the members of the CENDI and ICSTI communities were asked to contribute suggestions for operational systems. In addition, the investigators identified key people involved in digital preservation, attended several meetings on the topic, and performed literature searches. Over 50 systems or projects were identified from these various sources. After initial information was collected, follow up discussion questions (see Appendix I) were used to gather more detailed information.
The survey of operational systems is not intended to be comprehensive. Inclusion or exclusion from the report should not be taken as an endorsement or lack thereof on the part of the investigators, CENDI, ICSTI or any of their member organizations. The goal is to see what these representative systems can tell us about the state of the practice of digital preservation in science and technology and the outstanding issues, lessons learned and next steps.
3.0 Highlighted Systems
From the more than 50 systems or projects identified, 21 systems were selected to highlight because of the operational nature of their systems and the potential interest to the scientific community. The highlighted systems represent several countries and international organizations. They are from the government, academic and private sectors. Commercial, learned society, and gray literature publishers are represented. The highlighted systems manage a wide range of scientific resources including e-journals, e-theses, scientific data sets, technical drawings and photographs.
The following table provides key information about the highlighted systems. The information from these more detailed interviews is used throughout the report, along with selected information from other non-highlighted sources.
| Highlighted Project | Brief Description | Special Archive Characteristics |
| --- | --- | --- |
| American Institute of Physics | Learned society publisher. | Well-established policy and procedures for archiving e-journals. Policy used as model by others. |
| Aerospace Industries Association/Boeing Co. | Preserving engineering drawings (CAD/CAM) in the aerospace industry. | Developing standards for interoperability of engineering drawings. Working within the consortium to develop standards for preserving the STEP files as the basis for a preservation format. |
| Digital Information Archiving System (DIAS), Dutch National Library | System developed by IBM for the Dutch National Library for deposit of e-journals from multiple publishers. | Dutch National Library set requirements and sponsored development of IBM's DIAS System. Implemented at KB in December 2002. Established as the official archive by Kluwer and Elsevier Science. |
| DiVA | An electronic publishing system that treats the digital version as the master and creates an archive for long term preservation. Creates institutional archives for theses and dissertations, working reports and other types of born digital documents. Currently used by universities of Uppsala, Umeå, Stockholm, Örebro and Södertörn in Sweden and Statsbibliotek, Århus in Denmark. All full text publications are available through a common interface, known as the DiVA portal. | Archiving is an outgrowth of DiVA's electronic publishing system. Local repositories transmit archival packages directly to the Royal Library (the Swedish National Library), in particular to the National Bibliography, to satisfy requirements for e-deposit of theses and dissertations. Data originally entered by the author is the basis for the metadata. Metadata is stored in the DiVA document format, a locally developed schema. Transformations of this schema provide metadata in a variety of other formats and support various services including OAI-PMH. |
| DSpace at MIT | Institutional archives; MIT's implementation is primarily science and technology and is being developed to share the university's intellectual assets. | MIT's implementation of a generalized system for institutional repository development. Heavy use of existing lessons learned and standards such as Open Archives Initiative, Open Archival Information System Reference Model, and Dublin Core. Looking to establish an Alliance that will work on federating the DSpace repositories across institutions. Software is open source. |
| Elsevier Science Direct -- also part of the Dutch National Library (URL not available) | E-journals from a single publisher | First publisher to establish an agreement with the KB to permanently archive the Science Direct journals. |
| | Data center with the mission to preserve the remotely sensed, cartographic and topographic records entrusted to the US Geological Survey. | Developed check lists, procedures and an Advisory Committee to help in selection and appraisal of data sets. Currently completing an online decision support tool. |
| Fedora™ (Flexible Extensible Digital Object Repository Architecture) | Can handle a variety of objects; all MIME types. Current applications focus on E-books, XML objects, images at multiple resolutions. | Used as the underlying architecture by several systems including DSpace and the University of Virginia Library's Centralized Digital Repository. Flexible container architecture with option to default or customize various aspects of the system. Newest version includes content versioning critical for preservation activities. |
| International Union of Crystallography | Learned society publisher with online journals and supplementary data. | Policy for archiving online journals, available with frequently asked questions. Also working with members of their community to ensure better archiving of the data component. |
| JSTOR | Journals and conference proceedings (digitized from paper, new initiative on e-journals) | New Electronic Archiving Initiative is bringing together the organizational elements necessary to ensure the long term preservation of and access to e-journals. Initial work is focused on business model and technical infrastructure development. |
| | Archive of data from NASA's work in the life sciences. | Developing the archive to conform to the Open Archival Information System Reference Model. |
| LOCKSS (Lots of Copies Keep Stuff Safe) | Software system to create preservation copies of journals at various library sites. | New release of software to support synchronization of redundant archives. Generally works within the publishers' current business models. LOCKSS-DOCS project would extend the technology into the US Government Printing Office’s Federal Depository Library Program. |
| NASA Goddard Space Flight Center Library (URL not available) | As part of the Library's support for knowledge management activities, a series of digital preservation projects have resulted in a partially operational system for internal information at the NASA Goddard Space Flight Center. | Capture and storage of a variety of project-oriented materials including web sites, images, project documents and videos. Operational video system that includes webcasts, digital storage, video indexing and segment retrieval. Development of a Goddard Core set of descriptive metadata and a single metadata repository across document/format types. |
| National Motor Museum | Operational system to scan and preserve photos for a technology museum. | Digitizing photos in the collection; following same approach as Profiles in Science |
| OCLC Digital Archive | Text and images submitted by the subscriber | Subscription-based system for making available and preserving a variety of materials via the OCLC’s Connexion system and WorldCat. |
| PANDORA - National Library of Australia | Long running project to capture web-based publications of Australia. | Development of the PANDAS system, selection criteria and other infrastructure components to support the capture and preservation of Australian publications online. PANDAS system will soon be available to others with trial access. Revised collection guidelines. New efforts in agreements with Australian publishers and government. |
| Profiles in Science, National Library of Medicine | Digital library of papers, photos, audio and video clips and memoirs for noteworthy scientists, particularly Nobel Laureates. | Various digital object types, including audio, video, manuscripts, letters, e-mails, etc. Organizes these into collections for each scientist and, in some cases, links across collections. |
| PubMed Central, National Library of Medicine | Systems hosted by the NLM to archive journals in the life sciences. Currently archiving over 120 journals in the biomedicine and life sciences. | National Center for Biotechnology Information developed a Journal Archiving & Interchange DTD and a Journal Publishing DTD. Various terms and conditions with publishers – agreement that the current issue can be free either immediately or after a certain period of time. |
| The Internet Archive/Alexa | Not-for-profit organization that takes periodic snapshots of the Internet. About 10% might be scientific and technology related depending on the definition | Snapshots of the Internet as well as focused crawls based on institutional criteria. Working on issues related to the dynamic web and copyright. |
| US Government Printing Office | Government agency responsible for the printing, preservation and distribution of government publications. Includes responsibility for the Federal Depository Library Program that includes regional depositories and a network of various types of libraries to ensure access by the public. Now includes requirements for moving toward a more electronic depository library program. | System to harvest metadata and capture content for government publications from agency web sites. Also involved in helping set requirements for the OCLC Digital Archive. Working with LOCKSS-DOC. Implementing digital signatures to support authenticity. |
| Victorian Electronic Records Strategy (VERS) – Australia | Responsible for setting the strategy for electronic records systems in the state of Victoria, Australia. | Well-established system for ingesting and managing e-records. Development of standards. |
4.0 Setting the Stage
Before reviewing and analyzing the findings of this research, it is helpful to look at the world in which digital preservation of science occurs. Several aspects of the environment are highlighted below, including current archiving concepts, the scientific environment, technology trends, scientific communications, and the legal deposit and copyright regimes.
4.1 Archiving Concepts and Definitions
A significant shift in the terminology of archiving has taken place since the first report in 1999. The term “electronic” has been replaced with the word “digital”, perhaps indicating a shift from concern about electronic journals to the full range of material represented in bits and bytes. While major efforts toward digitizing paper materials continue, there is a clear emphasis on objects that are “born digital.” The technical issues of long term preservation are similar once the analog materials have been digitized, but the fact that there is no analog original to preserve makes the “born digital” information all the more fragile.
Another significant shift in terminology is the move away from the word “archiving”. This term was problematic from the outset. Those involved with digital information were concerned that “archiving” was too closely identified with records management storage. In addition, the term “archive” had taken on new meanings from e-print and preprint archives, which are primarily repositories with no inherent responsibility for or commitment to long term preservation. The more common term now is preservation, which links this activity to the long history of preservation in paper.
The phrase “permanent access” is usually paired with the term “digital preservation,” indicating that preservation is only half the battle. The more difficult issue in the digital environment is how to provide for permanent access and adequate rendering of the object given the technological changes that have occurred and will continue to occur.
4.2 The Scientific Environment
The goal of e-science is to take advantage of high speed computing and networking to provide virtual laboratories, collaboratoria and informatics methods to enable scientific discovery. E-science activities by their very nature require digital input and result in digital output that must be managed. Instead of physical laboratory experiments, the investigation is conducted via modeling and simulation approaches that are only available in digital environments and that require systems and networks capable of massive, distributed computer processing. These large network systems are generally referred to as the Grid (National Science Foundation 2003). E-science initiatives are often government sponsored; major initiatives are underway in Japan, the US and the UK.
A global network of e-science centers would result in massive amounts of information. However, the Grid may also provide the basis for a distributed system for archiving and preserving the data, perhaps resulting in more comprehensive data curation (Pothen 2002). The National Science Foundation’s Advanced Cyberinfrastructure Program (ACP) emphasizes the connection between e-science (or digital science) and the need to preserve data and other outcomes from the R&D process. Discussions are underway as to how the various stakeholder groups and new communication mechanisms such as institutional repositories might provide the backbone for supporting data preservation and curation (Messerschmitt 2003). In September 2002, the Library of Congress and the San Diego Supercomputer Center announced a project to evaluate the Storage Resource Broker Data Grid for preservation of LC’s digital holdings (Tooby & Lamolinara 2002; Mayfield 2002; Shread 2002).
4.3 The Technological Environment
Since the publication of the 1999 report, there have been continuing advances in both hardware and software technologies. New processors and operating systems are on the market. Microsoft Office Suite has undergone several upgrades. Windows has seen Windows 2000, Millennium, and XP. Oracle has introduced several versions, including 10g. Even in a time of global economic slowdown, technology continues to advance, raising concerns about the future of digital information in a time of limited resources.
Meanwhile, the Internet becomes ever more pervasive. While the rate of growth of available content on the public Web has slowed and recent research suggests that the Web may have decreased in size (OCLC Office of Research 2003), the Web still includes a vast amount of information. One might speculate that some of the scientific information has gone underground: i.e., into the deep Web. More scientific information may be Web-enabled, but hidden in databases, behind firewalls, or on institutional intranets. Concerns about national security, cyberterrorism, and the frequency of cyber attacks have made the archiving process more difficult (Kahle 2003).
4.4 Scientific Publishing and Communications
There are many factors of scientific communication and publishing that impact the digital preservation environment. Open access and institutional repositories are highlighted here.
4.4.1 Open Access
One of the major changes impacting the future landscape of scientific communication and publishing is the advent of open access initiatives. Open access asserts that scholarly materials, particularly those in the sciences, should be freely available to users and institutions, which in turn requires new business models on the part of publishers.
“By 'open access' to this literature, we mean its free availability on the public internet, permitting any user to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited. Open access eliminates two kinds of access barriers: (1) price barriers, and (2) permission barriers associated with restrictive use of copyright, licensing terms, or DRM [digital rights management].” (Budapest Open Access Initiative 2002)
The early efforts in open access to biomedical literature (eBioMed), led by Dr. Harold Varmus of the US National Institutes of Health and others, spurred a number of open access statements and initiatives including the Budapest Open Access Initiative (2002) quoted above, the Public Library of Science (2003), and the Bethesda Statement on Open Access (2003). Open access initiatives in the sciences are particularly strong in developing countries (CODATA 2003).
US legislative actions, such as H.R. 2613 - “The Public Access to Science Act” (also called the Sabo Bill), may have a major impact on open access to science in the US. The Sabo Bill would require authors who receive federal funds to deposit their work in an open depository and make the information generated from these efforts free of copyright. Under a related bill passed in 2001, researchers who receive federal grants must make their data sets publicly available within certain time limits.
The impact of open access can already be seen in the sciences. The Directory of Open Access Journals, maintained by Lund University Libraries and sponsored by the Information Program of the Open Society Institute and SPARC (Scholarly Publishing and Academic Resources Coalition), includes over 350 open access journals in 15 subject categories. The scientific categories include Agriculture & Food Sciences, Biology & Life Sciences, Chemistry, Health Sciences, Earth & Environmental Sciences, Mathematics & Statistics, Physics & Astronomy and Technology & Engineering (DOAJ 2003). Some of these journals, developed by open access publishers such as BioMed Central, SPARC partners, and some institutional repositories, are alternatives to the more expensive commercial journals in various disciplines. These organizations may also act as trusted third parties for other publishers who are willing to deposit their materials in an open access arrangement with terms and conditions.
Open access may appear to be a boon for digital preservation in the sciences. However, many open access initiatives are based only on the immediate desire for access. “The major open-access initiatives differ on whether open access includes measures to assure long term preservation. For example, the definitions used by BMC [BioMed Central] and the Bethesda Statement include this element, but the BOAI [Budapest Open Access Initiative] and PLoS [Public Library of Science] definitions do not. Taking steps to preserve open-access literature directly answers an objection often raised against open access. This makes it both desirable and important for open-access initiatives to take steps to preserve their literature and to say so prominently. The need for prominent mention often brings the mention right into the definition of "open access". But none of this means that preservation is part of open access, merely that it is desirable. Is preservation an essential part of openness or a separate essential?” (Suber 2003)
Suber (2003) argues that long term preservation is only one of several desirable requirements for open access, along with deposit in an archive or repository, but that preservation and openness are not inherently linked. “…By bundling them [preservation and openness] all under the concept of openness, we risk blurring or over-burdening our simple concept and we risk delaying progress by multiplying the conditions that our initiatives must meet.” So while open access, free of the restrictions of copyright, can act as a catalyst for addressing long term preservation, it may also focus on immediate dissemination rather than long term preservation goals.
Generalizing from Lynch (2003), an “institutional repository” is a set of services that the institution offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members. The primary impact of the institutional repository movement has been in academia, spearheaded, in part, by university management. After collecting theses and dissertations, many academic institutions have begun to broaden the types of materials included in their repositories to include virtually all materials of long term value that are produced by faculty, staff, or employees. The Association of Research Libraries produced a position paper on the growth of institutional repositories and the types of infrastructure, which gives six examples of institutional repositories (Crowe 2002).
In the government sector, there have been institutional (or one might call them enterprise) repositories for decades. In the US, science mission agencies with institutional repositories include the Defense Technical Information Center, the Department of Energy’s Office of Scientific and Technical Information, and the NASA Center for AeroSpace Information. On a government-wide scale, the National Technical Information Service and the Government Printing Office also have such responsibilities. In France, the Institut de l’Information Scientifique et Technique (INIST-CNRS) is responsible for similar activities. Other countries have similar organizations with varying authorizations to collect, preserve and disseminate scientific and technical information for their respective enterprises.
In the 1990s, these organizations began to collect technical reports, reprints, and other text materials in electronic form, and to add certain types of non-print materials to their collections. In the last several years, dissemination of the full text has shifted from print and microfiche to e-mail or FTP downloads and Web access. Many of these materials are now received and stored electronically, and a wider range of materials is being collected, resulting in large repositories of digital information that must be preserved.
While many of these institutional and enterprise repositories have a history of preserving paper, they are increasingly conscious of the responsibility of being a repository in the digital environment. “[An institutional repository] is most essentially an organizational commitment to the stewardship of these digital materials, including long term preservation where appropriate, as well as organization and access or distribution.” (Lynch 2003)
The arguments and issues related to the long term preservation of e-print and other institutional archives are outlined in Penfield & James (2003). They conclude that, depending on the circumstances, both filling the repositories (i.e., focusing on content) and long term preservation should be part of building an open archive. The issues, feasibility and requirements for e-print preservation have been identified by the Arts and Humanities Data Service (AHDS) SHERPA (Securing a Hybrid Environment for Research Preservation and Access) Project, sponsored by the Joint Information Systems Committee (JISC) and CURL (Consortium of University Research Libraries) in the UK (James 2003).
4.5 Legal Deposit and Copyright
The goal of legal deposit is to ensure that access to a nation’s published works is preserved in libraries and archives. Lariviere (2000) defines legal deposit as “a statutory obligation which requires that any organization, commercial or public, and any individual producing any type of documentation in multiple copies, be obliged to deposit one or more copies with a recognized national institution.” The principle is established in international convention and in the national legislation of many countries.
Digital information requires active management to ensure that a complete record of a nation’s published material exists for the future. If legal deposit is applied to digital information, the protection of publishers' rights and investments needs to be considered, since the potential for multiple accesses to a single digital information object is an issue. Another issue is the differing nature of digital information from that of its traditional physical counterpart (PADI 2003).
“Legal deposit legislation in many countries predates the current information age and requires a new legal framework in order to encompass digital publications. The complications associated with the collection and control of electronic materials, together with the lack of a comprehensive legal model, have made drafting appropriate legislation problematic and slow. Major issues to be considered include copyright, preservation requirements, public access, scope of coverage, method of collection, protection of publishers' rights, penalties, and implementation of revised legislation.” (PADI 2003)
Digital legal deposit has undergone significant change over the last several years. The major initiatives are outlined below. Information and quotes are from PADI (2003) unless otherwise noted.
Countries that have enacted legislation that covers physical format and online forms of digital publications or that have a legislative process in place include: Canada, Denmark (static online publications), New Zealand, Norway (static online publications), South Africa, and the United Kingdom.
•• Canadian legal deposit legislation has been extended to include electronic publications issued in physical formats. The National Library of Canada has continued to collect electronic publications on a voluntary deposit basis, with emphasis on publications not available in any other format.
•• Denmark's deposit legislation states that “all published material is subject to legal deposit, regardless of the production technique or type of carrier.” Emphasis has shifted from printers of documents to publishers of documentary materials in the broadest sense, including physical format digital and static Internet publications. “The Royal Library of Denmark acts as the deposit institution for Danish maps, electronic products and Internet publications. A legal deposit registration system for downloading deposit documents has been created in collaboration with UNI-C, a government data research institute.”
•• New Zealand’s 2003 legislation applies to public documents issued in print or in electronic physical or online form. It specifically provides for the copying of Internet documents. “Until the Requirement relating to electronic documents comes into force, electronic documents in physical format continue to be purchased or obtained by voluntary deposit through standard acquisition processes. Currently the [National] Library is developing its processes for the selection, acquisition, harvesting, description, storage and provision of access to physical format and online electronic documents.”
•• Norway’s Legal Deposit Act has cultural preservation as its primary intent. Physical format electronic documents and static Internet documents are included, but dynamic electronic resources are not. However, the legislation includes any works which can be read, heard, broadcast or transmitted and is written in a way to be applicable to future electronic formats. (Van Nuys 2003; PADI 2003)
Norway is one of five countries involved in Web archiving that can base its work on legal deposit legislation. Countries that have started some type of Web archiving activity include Denmark and Australia (selective collection strategy) and Sweden, Iceland, and Finland (harvesting of entire national web spaces). The National Library of the Netherlands has made an agreement with the Dutch Publishers' Association (NUV) for deposit of electronic publications offline and online (Van Nuys 2003).
The National Library of Norway is investigating ways to fulfill the intent of the act as applied to digital documents and is considering using a combination of different collection approaches. The Paradigma Project, which began in August 2001 and will end in December 2004, will “develop and establish routines for the selection, collection, description, identification, and storage of all types of digital documents and to give users access to these publications in compliance with the Legal Deposit Act.” (Van Nuys 2003)
•• In South Africa's Legal Deposit Act of 1997, the definition of 'document' and interpretation of the term 'medium' enables the Act to apply to electronic publications available in both physical format and online. Due to the technical and administrative challenges associated with the deposit of dynamic electronic publications, online electronic materials are presently only subject to deposit when specifically requested by the State Library of Pretoria.
•• In the UK, the Code of Practice for the Voluntary Deposit of Non-Print Publications came into effect in 2000, endorsed by various UK publisher trade bodies and legal deposit libraries. The arrangement provided for the deposit of microfilms, physical format digital publications and other offline electronic media, but the challenges for the deposit of static and dynamic online publications were also recognized in the guidelines. Subsequently, the Legal Deposit Libraries Act 2003 became law in October 2003 and will ensure that works published in non-print format will be collected. Categories of non-print materials that will be collected and saved include electronic journals and other materials accessed over the Internet; a limited range of research-level web sites; microforms such as film and fiche; as well as CDs, DVDs and other “hand-held” electronic media. The Act will be implemented through a series of regulations, and it is anticipated that the first set of regulations will deal with offline publications such as CDs and microform material (British Library Press & Public Relations 2003).
Countries that have legislation in place that currently applies to physical formats but not to online digital publications include: Austria, France, Germany, and Sweden. Physical format digital material refers to information that is digital and stored on transportable media such as floppy disks, magnetic tape, CDs, and DVDs. Further detail follows:
•• Austria’s response to a legislative gap for the deposit of online and networked digital material is the AOLA (Austrian Online Archive) project, established to investigate the challenges associated with the collection and archiving of online publications.
•• The revised French legal deposit legislation requires legal deposit of documents regardless of the technical means of production, as soon as they are made accessible to the public by the publication of a physical carrier. Legal deposit of CD-ROMs has been enforced since 1994, but to date, “deposit provisions do not cover online electronic publications, and no incentives exist for the voluntary deposit of non-physical format digital materials.”
•• In Germany, publishers are required to deposit copies of their publications, including physical format digital materials.
•• Sweden’s legislation requires legal deposit of electronic documents available in physical format, such as optical disks. Online electronic documents, like those found on the Internet, are not covered by this legislation. The Royal Library of Sweden’s Kulturarw3 (Cultural Heritage Cubed) project is investigating preservation of published electronic documents; it collects electronic information through harvesting.
Where legislation is not in place, national libraries and publishers are negotiating voluntary deposit schemes as a means of collecting digital publications. “Current trends suggest that in some instances these voluntary codes will become permanent, especially where governments prove reluctant to change laws and if legal deposit is afforded a low priority for amendment.”
The Netherlands does not have legal deposit legislation and relies on voluntary deposit based on bilateral agreements with publishers. These deposit arrangements have also been negotiated for digital information. Other voluntary deposit efforts either currently operate or are under development in Canada, Germany, the United Kingdom and Australia. “In addition, a model code has been developed by the Conference of European National Librarians and the Federation of European Publishers to facilitate the drafting of locally-endorsed voluntary deposit arrangements.”
•• In Australia, the Copyright Amendment (Digital Agenda) Act 2000 made no changes to the existing provisions. To cover the gap in federal legal deposit law, the National Library of Australia (NLA) has implemented an interim Voluntary Deposit Scheme for Electronic Publications, together with a Policy on the Use of Australian CD-ROMs and Other Electronic Materials Acquired by Deposit. While Commonwealth statutes do not include electronic publications, some states, such as Tasmania, have legislation that includes some digital components.
A recent study by Charlesworth (2003), sponsored by The Wellcome Trust and the JISC, addresses the legal issues related to archiving the Web. Charlesworth notes that the most obvious “legal stumbling block” is copyright law, but cautions that there are also hazards regarding defamation law, content liability and data protection, depending on the country’s regime in these areas. However, he believes that the issues are not insurmountable, given careful selection of the sites to be archived, effective rights management policies, and good access rights mechanisms.
Previous investigations of digital preservation have identified numerous stakeholder groups involved in digital preservation. Flecker (2002) identified discipline-based models, commercial services, government agencies, research libraries, and passionate individuals. Lavoie (2003) reduces the stakeholder roles to rights holders, archives, and beneficiaries. The previous ICSTI report identified creators/producers, publishers, libraries and library consortia, funding agencies and users (Carroll and Hodge 1999).
The following section describes the preservation activities by publishers, national libraries, institutions and their libraries and museums, archives, and trusted third parties. It also discusses the role of governments.
A study sponsored by the Association of Learned and Professional Society Publishers showed that 52 percent of commercial and 45 percent of not-for-profit publishers interviewed have formally addressed long term preservation of their publications with most taking on the responsibility themselves. Third party archives such as JSTOR, OCLC, and HighWire LOCKSS are used. Discipline-specific depositories such as PubMed Central were found to play only a minor role at present (Cox 2003).
Commercial publisher initiatives are driven by two major factors. First, these materials have intellectual property value that benefits the publisher if the materials remain under the control of the publisher. Second, publishers have begun to realize the economic benefits of the reuse of the content. This is especially true as mark-up languages and XML schemas are used that allow material to be extracted, merged, integrated and even provided to users on-demand through Web-based content models. Many publishers have SGML/XML-based systems that provide preservation-oriented formats as a natural outcome of their publishing processes.
Wiley’s DART (Digital Assets Repository Technology), for example, has three major priorities (Morgan 2000). These are digital printing (including distributed and on-demand), creation of electronic versions of existing paper products so that they can be more easily provided on the Web or to online retailers, and creation of new products, such as coursepacks, based on the re-use of previously published material. The specific goal of the metadata designed into the DART system is to support Wiley’s commercial priorities.
Many learned society publishers consider preservation to be an extension of their mission to preserve the knowledge of their discipline, justifying the resources committed to these activities. Many of these society publishers have been at the forefront of preservation activities for both text and data and have been instrumental in raising awareness among the researchers in their respective disciplines.
The American Institute of Physics (AIP) advertises archiving services as one of the Composition Services in its Electronic Journals Platform (OJPS) (American Institute of Physics 2003). AIP performs rich mark-up in SGML or to the customer’s specific DTD. In the near term, AIP can supply files in a variety of formats including PostScript, PDF, and SGML. The files include graphics and RGB files for color work. Dissemination is available via FTP, CD-ROM, or 8mm tape. In addition, authors can request their articles in a variety of formats appropriate for inclusion in conference proceedings, books or other reprint vehicles. In the long term, AIP’s ASCII-based format will be reliable for future preservation and reuse.
Based on the AIP model, the International Union of Crystallography (IUCr) has published a policy on long term preservation and access. It also utilizes the concepts and terminology of the OAIS Reference Model (International Union of Crystallography, 2001). The policy specifically covers IUCr’s online journals, but the intent is to extend it to other types of materials available from the union’s web site. The policy is only partially applied; the IUCr has taken steps to create local offline copies of the journals in SGML as well as in HTML and PDF. However, this is primarily aimed at short-term disaster recovery. IUCr intends to pursue partnerships with major public crystallographic databases for preservation of the data, since there is a close relationship between the text publishing and the data activities. This involves working with CODATA to raise awareness of the need for these databases to develop their own preservation strategies.
In 2001, the International Union of Pure and Applied Physics (IUPAP) held a conference that brought together publishers, researchers and librarians to discuss the long term preservation of digital documents in physics. Two recommendations resulted from the meeting: the development of a registry of physics archives that would include information about hardware and software so that it could serve as an early warning about possible need for migration or data at risk, and the creation of a subgroup to investigate the use of XML and other format standards as applied to physics documents (Smith 2001). In addition, IUPAP has encouraged its member societies to develop XML schemas and standards appropriate to their disciplines (Butterworth 2003).
National libraries were given a major role in the 2002 joint statement between the International Publishers Association (IPA) and the International Federation of Library Associations and Institutions (IFLA) (IFLA 2002). This statement sets out several key points, including the importance of digital information and the fact that it is severely at risk under the current circumstances. Successful long term archiving and preservation will require a partnership, since neither the libraries nor the producers of the information can adequately archive alone. Ultimately the most appropriate stakeholder to manage the long term preservation of digital materials is the national library infrastructure. National libraries are already trusted third parties and digital preservation is an extension of the mandate of legal deposit in the analog environment. IFLA and the IPA have also agreed to continue joint activities including technology research and searching for funding opportunities.
The new relationships between publishers and national libraries may be the result of publishers, particularly commercial publishers, determining that their missions are better served by focusing on the initial publication and dissemination of the material than on long term preservation. The initial wariness on the part of publishers may have subsided, particularly among those publishers who participated in pilot projects over the last several years. The majority of these pilot projects have proven successful and seem to have produced a symbiosis of the needs of these publishers and the needs of libraries. Also, long term preservation became such an issue with the publishers’ constituents, primarily the libraries, that preservation arrangements were necessary.
Many major publishers have signed agreements with national libraries as trusted third parties. After developing its own electronic warehouse, Elsevier determined that it needed to partner with others (Hunter 2003). Elsevier identified KB (The National Library of the Netherlands) as its official archive based on KB’s technical competence. The formal arrangement addresses permanent retention and international access. The archive holds those articles that are withdrawn as well as those that are active. The archive will be available for KB walk-in users only. Elsevier emphasizes that this archive is not a hot backup for the company’s data recovery, but it could be used to support recovery from a truly catastrophic event. The intent is to use the KB agreement as a model for two to three negotiations with other national libraries.
In May 2003, Kluwer Academic announced an agreement with KB to serve as the archive for the journals featured on Kluwer Online. Kluwer Online contains over 235,000 articles from over 670 journals. In September 2003, an agreement was signed with BioMed Central to archive its 100 open access journals and the other deposited materials (BioMed Central 2003). Unlike agreements with other publishers, KB’s remote users as well as walk-in users will have access in accord with BioMed Central’s open access philosophy. The KB is seeking to enter into agreements with other major scientific publishers.
The National Library of Australia was an early investigator of digital preservation methodologies and support tools. PANDORA (now officially known as PANDORA: Australia's Web Archive) is national in scope with all the mainland State libraries, ScreenSound Australia and the Australian War Memorial as partners. (The State Library of Tasmania continued to develop its own archive, Our Digital Island.) PANDORA now contains over 4,000 titles and over 8,000 instances (Phillips 2003). (An 'instance' is a single gathering of a title that has been added to the archive. Many titles are re-gathered on a regular basis to capture changing content, for example, when serial titles add new issues.) The Archive consists of approximately 16 million files and the display copies alone occupy almost 500 gigabytes of storage space. (There are two additional copies for preservation purposes, as well as backup copies.) The Archive covers the full range of material published online in Australia, including science and technology (266 titles), agriculture (210), health (214), and computers and the internet (157).
A major area of preservation for some national libraries is electronic theses and dissertations. Major operational systems are in place in Denmark, Sweden, and India. (The activities in Denmark and Sweden are described in Section 6.2.) Since July 1998, Die Deutsche Bibliothek (the National Library of Germany) has collected online dissertations and theses. The university libraries report electronic dissertations to Die Deutsche Bibliothek, and the dissertations are then stored on the library’s archive server DEPOSIT.DDB.DE. Since February 2001, Die Deutsche Bibliothek has hosted the "Co-ordination Agency DissOnline". Die Deutsche Bibliothek is planning an e-deposit system that will eventually hold and preserve not only dissertations but electronic journals, web pages, and other materials considered to be of preservation value (Steinke 2003).
Institutions, particularly major research universities and their management, are becoming major players in preservation activities (Lynch 2003), perhaps as an outgrowth of the development of institutional repositories and the availability of open source software such as DSpace and of protocols such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). While not every institutional repository is committed to a long term archive, there are key relationships between producers and the repository that are similar to those identified in the Producer-Archive Interface Methodology Abstract Standard draft (CCSDS 2002) that can create a natural pathway between short term and long term commitment. Lynch (2003) posits that “Only an institutionally based approach to managing these data resources, which operates in alignment with what the faculty at each individual institution are actually doing, can provide a comprehensive dissemination and preservation mechanism for the data that supports the new scholarship for the digital world. Journals will move too slowly and too unevenly to manage these resources, and disciplinary data repositories cannot be comprehensive. Institutional repositories can maintain data in addition to authored scholarly works. In this sense, the institutional repository is a complement and a supplement, rather than a substitute, for traditional scholarly publication venues.”
The DSpace implementation at the Massachusetts Institute of Technology (MIT) includes submissions from a number of MIT departments, including Ocean Engineering and the Laboratory for Information and Decision Systems. Each department is treated as a community, and programs can then cluster under each community. It is possible to search across the communities or to select a community for searching or browsing by author or title. DSpace expects to add other communities over the next year (Tansley 2003). The resources in DSpace at MIT include preprints, technical reports, working papers, conference papers, learning objects and e-theses, which may include audio, video, text and data sets. In addition, the Out of Print Books of MIT Press are available to MIT staff and students via this site.
As part of the NASA Goddard Space Flight Center Library’s mission to preserve and provide ongoing access to information of value to Goddard project managers and researchers, the Library conducted several pilot projects in digital preservation. The focus for the Library is on internal project-related materials; the objects include videos of colloquia, seminars and internal mini-courses, intranet web sites with scientific and technical content, project documents, and images, including photographs and animations. The web site, image and video repositories have been demonstrated separately. The metadata have also been combined into a central repository so that users can search across object types.
5.4 Museums
Museums are taking an increasingly active role in digital preservation. Most museums are interested in digitization as a way to make artifacts more accessible, particularly those artifacts that are rare and fragile. In addition, digitization provides support for curation and restoration activities, and for insurance and disaster recovery. While the majority of museums do not deal with born digital objects, they provide significant digital resources for scientific investigation, valuable access points to materials that are physical and which, therefore, can “reside in only one place,” and “benchmarks” for various scientific investigations and analyses, as in the case of taxonomic voucher specimens.
Museums also provide significant insight into the development of non-text digital repositories. For example, the digitization project at the National Motor Museum in the UK is part of a funded project to retrospectively document the photographic collection of the museum. The goal is to digitize the entire collection, but the current emphasis is on the 250,000 images in the working collection. The photographs are digitized as their physical storage is being re-allocated. During this re-allocation process, the ‘original’ or ‘first generation’ prints that have copies are removed from the working collection, digitized, documented and stored in a secure, environmentally controlled environment. While the current effort does not include the development of a dedicated web site, the digitization methodology was designed to ensure that images created during the process can be made accessible via the web.
As an outgrowth of individual and collective work with digital objects, museums are using the digital environment to create online exhibits. These activities combine multiple media, including images, text, video and sound to support a museum’s outreach and educational missions. The complexity of many of these online exhibits provides particular challenges to digital preservationists, including the need to link the digital item to its physical artifact.
5.5 National, State and Regional Archives
While archives approach preservation through different practices and approaches, they may also provide significant repositories of records related to scientific and technical endeavors. National, state and regional archives have been very active in the area of preservation technologies and practices. Their work is particularly important because it must deal with massive quantities of information, in a wide variety of formats. Key activities are underway at the National Archives and Records Administration (US), the National Archives (UK), and the Public Record Office of Victoria (AU). While the material is generally managed by collection or class of item and an emphasis is placed on the “original order” of the e-records, the distinctions between collection and individual item are becoming increasingly blurred since access to individual items is more easily provided when the material is in digital form.
There may be particular similarities in the preservation issues of archives and the preservation issues of data centers and other scientific and technical enterprises that create massive amounts of data. Both communities must establish practices related to selection or appraisal and retention, since to keep everything may not be feasible. A recent analysis performed for the development of appraisal guidance from the National Archives and Records Administration (NARA) for US government agencies identified special issues related to scientific data. NARA is holding meetings with the scientific data community to determine the needs of this community for archival level appraisal and retention for other types of data and scientific information. Guidance for the retention of observation data from the physical sciences is provided as a special case scenario in recent appraisal guidance related to NARA’s strategic initiatives (NARA 2003c).
Preservation may also be performed for content holders by trusted third parties. Trusted-third-parties are organizations that provide preservation services without being publishers, owners or subscribers to the materials preserved. Activities such as those of the Research Libraries Group/Commission on Preservation and Access Task Force on Archiving of Digital Information (RLG 1996) and the RLG/OCLC Working Group on Digital Archive Attributes (RLG 2002a) have helped to lay the foundation for current and evolving work on third party archiving activities. RLG and the US National Archives and Records Administration are co-creators of a task force on digital repository certification, whose resulting work is intended to go into the international standardization process through the ISO Archiving Series (RLG 2003). The trusted-third-parties highlighted below include a national library and two not-for-profit organizations.
The PubMed Central Journal Archive is an archive for life science journal literature established by the National Library of Medicine (US). It is available as a trusted third party for any qualified journal publisher (not just from the US) to deposit the electronic versions of journal articles. As of October 2003, the archive contained approximately 135 journal titles with others waiting to be included. One of the major contributions of PubMed Central has been the establishment of best practices for formats, mark-up, and e-journal selection.
The JSTOR operational archive of journal backruns and digitisation of paper journals consists of six topical collections, including General Science and Ecology and Botany. As of July 2003, these collections included over 400,000 articles. As of August 2003, the Ecology and Botany collection included 29 titles and General Science included seven. As an extension to its digitisation services, JSTOR’s Electronic-Archiving Initiative is charged with developing the organizational and technical infrastructure necessary to ensure the long-term preservation of and access to electronic journals (JSTOR 2003). Areas of consideration include business models, governance, technical infrastructure, metadata formats, and management of supplemental information. Key decisions will be needed concerning the development of an approach that balances the needs of scholars, publishers and libraries. A pilot project is currently underway with a start-up grant from the Mellon Foundation. It involves ten publishers, including several major science publishers. Contributing publishers will submit samples during the summer and fall of 2003, and the goal is to have a prototype when the grant period concludes in March 2004.
The Internet Archive (2003) is a non-profit organization that takes periodic snapshots of the Web and makes them available to the public. In addition, there are several large institutional customers that use the Archive as a service bureau to create snapshots of the web for them. Broad crawls of the web are done approximately every two months. Focused crawls are performed more frequently. The rules for selecting sites to archive depend on the client and are more precise for partners such as the British Government, the Israeli Government and the Library of Congress. Currently there are discussions underway with the National Archives and Records Administration in the US and the UK National Archives (previously the Public Record Office). Agreements with other national libraries and archives are likely. The Internet Archive provides the data from its crawls as a corpus for special projects (e.g., the investigation of web surfing patterns by Xerox PARC, the 1997 snapshot of the Web at the Library of Congress, and the 1996 US Elections pages displayed by the Smithsonian).
The role of government, while it varies from country to country, has focused on direct funding through national libraries, national archives, and government institutional repositories and on indirect funding of non-government initiatives and public-private partnerships. Governments have also been instrumental in funding research and establishing appropriate policies that encourage or contribute to an infrastructure for digital preservation. In many cases, e-government legislation includes establishment of archiving and preservation initiatives. Many of these activities directly involve scientific and technical information.
Early preservation research was funded through the European Union’s Information Society Directorate, whose focus areas are electronic publishing, digital culture and library telematics. The system for archiving the Elsevier Science journals is funded by the Dutch Ministry of Education. The Congress of the United States has appropriated $25 million in funding for the development of a strategic plan for an infrastructure for preservation of digital objects through the Library of Congress’ National Digital Information Infrastructure and Preservation Program (NDIIPP 2003). Five million dollars are to be spent during the initial phase for planning and also for acquiring and preserving digital information that would otherwise vanish in the interim. The full amount of the funding is $99.8 million, with $75 million becoming available as it is matched by nonfederal donations, including in-kind contributions. The first call for proposals was announced in late 2003. In addition, programs like the Library of Congress’ MINERVA Project have been critical in helping to determine the nature of and potential solutions to problems in web capture.
Governments also establish supportive environments through legislation and directives that require collection of digital materials or remove barriers to collecting. Many data centers, including the US Earth Resources Observation Systems (EROS), are authorized through legislation.
E-government legislation in various countries has included digital preservation components. The E-Government Act of 2002 in the US addresses issues of long term preservation (though this was significantly reduced in the final version of the bill). The creation of the E-Envoy position in the UK is indicative of the degree to which e-government is embraced in that country. There is a significant effort to move publications, transactions and communications of all types, whether between citizens and government or between government bodies, to an electronic environment. In Australia, the e-government policies established an infrastructure that specifies critical components for a digital preservation environment, including metadata standards (Dublin Core) and persistent identifiers.
5.8 Foundations and Other Private Funding Sources
Foundations and other private funding sources have been instrumental in providing the funds needed to “jump start” activities in the area of digital preservation. Digital preservation and long term access is a public good and, therefore, the heavy investment required is hard for industry, academia and even the government to justify. Foundations have been part of many innovative partnerships in this area.
The Andrew W. Mellon Foundation over the last several years has supported a wide range of research and pilot projects through its Scholarly Communication and Research in Information Technologies Programs (Andrew W. Mellon Foundation 2003). Early projects included the development of the initial JSTOR pilot, which resulted in an operational and actively used system for the digitization of backfile journal issues, including a large number in the sciences. Mellon continues to support the effort through its funding of JSTOR’s analysis of the impact of e-journals on JSTOR’s activities. Following initial funding in the area of digitization of paper journals, Mellon became heavily involved in funding major projects related to the archiving and preservation of electronic journals, including projects at Harvard and Yale. While many of Mellon’s activities have been irrespective of discipline, there has been significant involvement in Mellon Projects on the part of major scientific publishers, such as Elsevier Science, and major scientific research libraries such as MIT, Harvard and Yale. Mellon’s more recent activities include funding an investigation into the preservation of government documents by the California Digital Library and supporting the continued development of Fedora, DSpace and LOCKSS. All these projects are discussed in more detail in subsequent sections of this report.
The Wellcome Trust, an independent research institute focused on human and animal health, has funded similar initiatives (Wellcome Trust 2004). The Wellcome Trust and the Joint Information Systems Committee (JISC) co-sponsored an investigation of web archiving by UKOLN (Day 2003). While the report focuses on the needs of the Wellcome Trust Library and JISC, it has applicability to all organizations interested in the issues and complexities of archiving the web.
A critical point for digital preservation projects is the point at which research and pilot activities move into an operational phase. Generally, support from a foundation is reduced or eliminated when the project reaches this phase. The Mellon Foundation has been particularly aware of this problem and required sustainability planning and an analysis of ongoing costs as part of its research projects. Not only has this approach recognized sustainability and cost as key issues for digital preservation, but this practical focus from the outset has resulted in better planning, appropriate expectations on the part of the stakeholder groups, and proven, long term outcomes from the investment of foundation monies.
Similarly, the Wellcome Trust funded research into another very practical issue in digital preservation: copyright, particularly when archiving and preserving web-based resources. The study, co-sponsored by JISC and conducted as a companion to the more technical report, discusses copyright in the UK, EU, Australia, and the US (Charlesworth 2003).
Despite this significant support by these key foundations, Neil Beagrie of JISC (Beagrie 2003) noted that it is difficult to identify funding sources for digital preservation activities in science. He indicated that a list of foundations and remits would be a valuable tool for those trying to identify funding sources.
6.0 Preservation by Document Type
There are many document types or genres that are important in scientific communication. These include journal articles, books, theses and dissertations, conference presentations and papers, and project documentation. These document types may be presented as Web sites and they may also qualify as electronic records. These genres may include multiple format types. For example, electronic journals may require supplemental files such as spreadsheets, videos, or software.
This section discusses preservation practices by document types. More information about specific format types is included in Section 7.0, Standards by Format Type.
Electronic journals have been at the forefront of preservation discussions because of their critical role in scientific communication and the commercial interests involved. The practices for preserving electronic journals show an increased maturity as evidenced by more formalized procedures such as a DTD for journals.
In 2001 the Mellon Foundation funded a study at Harvard University to investigate whether a common DTD could be developed for journals (Inera 2001). The study indicated that a common DTD could be developed but that there would be some loss in specificity, particularly in certain areas such as math and chemistry. It also suggested the extension of previous work at the National Library of Medicine’s National Center for Biotechnology Information on an XML format for archiving material deposited in PubMed Central (PMC).
This previous work at PubMed Central began in 2000 with an attempt to create a common DTD across two publishers. It soon became apparent that updating this DTD every time a new publisher was added was not the optimal situation. PubMed Central decided to create a more generalized DTD for journal articles.
The Archiving and Interchange DTD Suite is based on an analysis of all the major DTDs that were being used for journal literature, regardless of the discipline. The suite is a set of XML building blocks or modules from which any number of DTDs can be created for a variety of purposes including archiving. Using the Suite, NLM created a Journal Archiving and Interchange DTD, which will replace the current PMC DTD as the foundation for the PubMed Central archive. In addition, a more restrictive Journal Publishing DTD has been released which can be used by a journal to mark up its content in XML for submission to PubMed Central. Several publishers and projects, such as JSTOR, the Public Library of Science, HighWire Press and CSIRO, are analyzing or planning to use the Journal Publishing DTD (Beck 2003).
In addition, an XML Interchange Structure Working Group was created to recommend changes and additions to the tagset. On November 1, 2003, Version 1.1 of the DTD was released. Work is beginning on other special DTDs for online books and documentation based on the suite modules (Beck 2003).
Many of the institutional and national library preservation efforts involve
theses and dissertations, since these institutions often have responsibility
for providing this genre to their respective national library for incorporation
into the national bibliography.
One of the most advanced preservation projects is DiVA at the Electronic
Publishing Centre of Uppsala University in Sweden. The DiVA system treats
the electronic copy as the “digital master” for both electronic and print
versions of the document. Local repositories at five universities create archival
copies as part of the publishing process. These archival copies are provided
to the Royal Library, the National Library of Sweden, as archival packages,
in a system that uses a federation of remote libraries to provide full text
and metadata to the national library for long term preservation purposes
via e-deposit.
Institutions such as The Royal Technology Library (KTH) in Sweden are developing local repositories with the expectation of participating in the DiVA workflow when it is finalized. The goal of the effort at KTH is to create a campus archive of KTH publications, particularly dissertations, that promotes access and re-use. KTH began with abstracts for the dissertations in 1997. They receive approximately 250 dissertations per year. Preservation is an ultimate goal, so, along with the DiVA Project, they will be working to contribute the electronic publications to the National Library in Stockholm.
The National Library of Germany has developed DissOnline to provide access to the theses and dissertations of that nation. Eventually, the DissOnline collection will become part of its deposit system where long term preservation will be addressed (Steinke 2003).
Because of the nature of the authoring environment, most theses and dissertations are received in PDF, HTML, Word or TeX/LaTeX format. Many national libraries are still retaining the native format rather than transforming the original into a preservation format. In addition, many have hybrid systems where they preserve both the paper and the electronic because they are mandated to do so. The availability of these theses and dissertations to the public via the web depends on the copyright regime of the individual country.
Data was the earliest digital output of science to be archived. Through large data centers such as the NASA Distributed Active Archive Centers (DAACs), the data centres of the UK, and the World Data Centers, a variety of important, often non-reproducible datasets have been collected, stored, managed and made available for future reuse. These data sets range from numeric data streams of simple structure but large size to large collections of still and moving images.
The Earth Resources Observation Systems (EROS) Data Center collects and preserves satellite imagery and aerial photography, cartographic and topographic data created by or for the US government and under the custody of the U.S. Geological Survey. Currently, it has approximately 12 million objects in several general collections, including each of the Landsat missions.
Several major efforts are currently underway to improve data centers through more consistent and interoperable procedures. NASA’s National Space Science Data Center (NSSDC) and the NASA Life Sciences Data Center (LSDC) are moving forward in the use of the OAIS Reference Model as the conceptual basis for their systems.
Data is also increasingly being stored as a result of submission as supplementary information with journal articles in digital form. PubMed Central, BioMed Central, the American Institute of Physics, Elsevier, the American Chemical Society, the Astrophysics Data System, the International Union of Pure and Applied Physics, and the International Union of Crystallography, to name a few, are routinely accepting the submission of supplementary data. However, it is not clear how much data will be lost because the author does not submit it or because no formal publication was created as the end result of the research.
For this reason, CODATA (The Committee on Data for Science and Technology), an international organization for the interoperability and standardization of data in the sciences for the purposes of communication, has been raising awareness of the issue of data preservation. In June 2002, the South African CODATA Committee hosted a meeting in Pretoria, South Africa, geared toward the needs of developing countries and the African Continent in particular. CODATA and ERPANET jointly sponsored a workshop on “Selection, Appraisal, and Retention of Scientific Data” in December 2003 (ERPANET 2003). Another workshop will be held in China in 2004. In addition, CODATA and ICSTI are creating a portal to provide information resources about archiving and preserving data with an emphasis on best practices and linking people to experts. In particular, this effort is aimed at supporting developing countries by providing a network of experts and highlighting practices that can be implemented in these countries (Anderson 2003).
Similarly, the National Science Board recently convened a meeting of experts for the National Science Foundation in the US. The goal of the meeting was to discuss the role that NSF should or could play as a funding agency in the preservation of and access to data of long term value that is created as the result of grant funding.
Technical reports and other gray literature are key mechanisms for the dissemination of research and development results especially in industry and government. Many government and institutional archives are focused on technical reports, since libraries may not routinely collect them.
The ANSI/NISO Standard for Technical Reports (Z39.18) is currently undergoing its five-year review. As part of this activity, the review group is considering how the standard should change to reflect the digital nature of technical report creation and publication. In addition, the group is considering as an appendix to the standard a DTD for technical reports that has been developed by Old Dominion University (Maly and Zubair 2003). While this standard and DTD do not directly address preservation and long term access, the mark-up recommended in the DTD will support automatic metadata generation and additional semantic mark-up that would disaggregate the content from the presentation of the document. These are key factors in the development of a sustainable preservation system.
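The link between structural mark-up and automatic metadata generation can be illustrated with a minimal sketch. The element names below (`report`, `front`, `author`, `report-number`) are hypothetical and are not taken from the Old Dominion University DTD or from Z39.18; the output keys borrow Dublin Core element names only as a convenient target vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up report; the tag names are assumptions for illustration.
SAMPLE_REPORT = """<report>
  <front>
    <title>Sample Technical Report</title>
    <author>A. Author</author>
    <author>B. Author</author>
    <date>2003-06-01</date>
    <report-number>TR-0000</report-number>
  </front>
  <body>
    <section><heading>Introduction</heading><p>Body text.</p></section>
  </body>
</report>"""


def extract_metadata(xml_text: str) -> dict:
    """Derive a small metadata record from the structural mark-up alone."""
    root = ET.fromstring(xml_text)
    front = root.find("front")
    if front is None:
        raise ValueError("report has no <front> element")
    return {
        "title": front.findtext("title", default=""),
        "creator": [a.text or "" for a in front.findall("author")],
        "date": front.findtext("date", default=""),
        "identifier": front.findtext("report-number", default=""),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_REPORT))
```

Because the title, authors and report number are tagged by meaning rather than by appearance, the record can be produced without inspecting fonts or page layout, which is the sense in which such mark-up separates content from presentation.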
6.5 Conferences, Meetings and Lectures
Significant scientific information is first presented at conferences, meetings, lectures, colloquia, etc. Many disciplines, such as biotechnology, rely heavily on this method of communication rather than the formal publications. Therefore, the ability to preserve and access this type of information into the future is important.
As part of its knowledge management activity, the NASA Goddard Space Flight Center captures the content of colloquia, lectures and courses (Hodge et al. 2003). These events are routinely webcast and then saved digitally. Older videos are also collected and digitised. The encoded files are then indexed using a video indexing program, which allows users to query the videos by keyword and find precise retrieval intervals within the video stream. The software uses advanced voice recognition techniques and a dictionary that has been enhanced by adding the NASA Thesaurus to expand the queries and locate related intervals that do not specifically include the requested term. For example, a search on “planet” will also search on the names of the individual planets because the thesaurus has these terms as narrower terms to the term “planet”. Recent work has included the linking of presentation slides to the appropriate parts of the video stream.
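The thesaurus-based query expansion described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Goddard software: the thesaurus fragment, the transcript intervals and the function names are all hypothetical, and matching is done on whole words of already-recognized text.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical thesaurus fragment (broader term -> narrower terms).
# These entries are illustrative and are not taken from the NASA Thesaurus.
NARROWER: Dict[str, List[str]] = {
    "planet": ["mars", "jupiter", "saturn"],
}

# Hypothetical indexed transcript: (start_seconds, end_seconds, recognized text).
TRANSCRIPT: List[Tuple[int, int, str]] = [
    (0, 30, "welcome to the colloquium"),
    (30, 75, "the rover landed on mars last year"),
    (75, 120, "each planet has a different atmosphere"),
]


def expand_query(term: str, narrower: Dict[str, List[str]]) -> List[str]:
    """Return the term followed by all of its narrower terms, breadth first."""
    start = term.lower()
    expanded, seen, queue = [start], {start}, deque([start])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                expanded.append(child)
                queue.append(child)
    return expanded


def find_intervals(term: str,
                   transcript: List[Tuple[int, int, str]],
                   narrower: Dict[str, List[str]]) -> List[Tuple[int, int]]:
    """Return the (start, end) intervals whose text contains any expanded term."""
    terms = set(expand_query(term, narrower))
    return [(start, end) for start, end, text in transcript
            if terms & set(text.lower().split())]


if __name__ == "__main__":
    print(find_intervals("planet", TRANSCRIPT, NARROWER))
```

With this sample data, a query on "planet" also retrieves the interval that mentions only "mars", which mirrors the behaviour described in the text.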
The extent to which government-produced scientific and technical information is treated as an electronic record depends on the practice of the particular government or institution. Of course, e-records can include any or all of the above document types. However, there are significant e-records efforts underway within the governments that will have an impact on the overall digital preservation landscape.
The Victorian Electronic Records Strategy (VERS) of Australia is one of the most explicit suites of tools, standards and best practices with regard to e-records. The system has been operational since 1999 following a proof of concept/demonstrator project. (The latest version, released in July 2003, is mandatory for Victorian Government agencies.) The standard details functional requirements, the metadata set for long term preservation, and the long term format for records, which includes XML, PDF or TIFF, and digital signatures. Documents are converted into PDF and the context metadata is stored as XML. The converted records are encapsulated, i.e., bundled together into self-describing objects. VERS addresses many office-type documents, including e-mail. However, it does not specifically address databases or non-document type records such as sound and movie files, though they can be accommodated within VERS objects. VERS is currently working on a compliance program for vendors of records management systems. In addition, the Public Record Office of Victoria is in the process of obtaining a digital repository. The contract is expected to be awarded by the end of 2003, and the repository should be completed by the end of 2004. The repository is broadly based on the OAIS Reference Model and on VERS. It will be integrated into the existing records management, repository and access mechanisms for paper records (Quenault 2003).
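The idea of encapsulation, in which a converted document and its context metadata travel together in one self-describing object, can be sketched as follows. This is not the VERS object format: the element names are hypothetical, and a SHA-256 digest stands in for the digital signature as a simplified integrity check.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def encapsulate(content: bytes, fmt: str, metadata: Dict[str, str]) -> bytes:
    """Bundle a document and its metadata into a single XML object."""
    root = ET.Element("RecordObject")  # hypothetical element names throughout
    meta = ET.SubElement(root, "Metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, "Element", name=name).text = value
    doc = ET.SubElement(root, "Document", format=fmt, encoding="base64")
    doc.text = base64.b64encode(content).decode("ascii")
    # Digest of the original bytes; a real system would use a signature instead.
    digest = ET.SubElement(root, "Digest", algorithm="SHA-256")
    digest.text = hashlib.sha256(content).hexdigest()
    return ET.tostring(root, encoding="utf-8")


def unpack(obj: bytes) -> Tuple[bytes, Dict[str, str]]:
    """Recover the document and metadata, verifying the stored digest."""
    root = ET.fromstring(obj)
    content = base64.b64decode(root.findtext("Document", default=""))
    if hashlib.sha256(content).hexdigest() != root.findtext("Digest"):
        raise ValueError("document content does not match stored digest")
    metadata = {e.get("name", ""): e.text or ""
                for e in root.findall("Metadata/Element")}
    return content, metadata


if __name__ == "__main__":
    obj = encapsulate(b"%PDF-1.4 sample bytes", "PDF", {"title": "Sample record"})
    print(unpack(obj))
```

Everything needed to interpret the record (format, encoding, metadata and integrity value) is carried inside the object itself, so it can be understood without reference to the system that created it.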
While the Electronic Records Archive for the US National Archives and Records Administration (NARA) is not yet operational, there has been significant progress in that direction. The ERA began several years ago with a series of pilot projects, many of which involved the San Diego Supercomputer Center and its work on the Storage Resource Broker. These pilot projects were aimed at conducting the research necessary to create specifications for the architecture needed by an operational system to manage large-scale e-records systems, including the ability to deal with collections and different layers of metadata.
In addition, a draft Requirements Document (NARA 2003a), released as part of the draft Request for Proposal issued in August 2003, describes the system. The final Concept of Operations, also released in August 2003, describes the various user scenarios for such an Archive (NARA 2003b). “… the ERA system will ingest, preserve, and provide access to electronic records of all three Branches of the US Government. ERA is envisioned as a comprehensive, systematic and dynamic means for preserving any kind of electronic record, free from dependence on specific hardware and/or software.”
Meanwhile, under its Electronic Records Management initiative, NARA is working to extend the types of electronic formats that it can accept. NARA has already extended its acceptable formats to include scanned images of textual records and PDF. Three additional formats are expected in the near future; these may include web records, digital photographs or geographic information systems (Bellardo 2003). NARA is working with Adobe in developing the PDF/A format.
Guidance is being developed for future records so that NARA will be able to accept almost any format. They are working with partner agencies on archival metadata and relevant XML schema to provide more control through mark-up, including Dublin Core elements. Transfer may take place via FTP or Digital Linear Tape, which may become a long term preservation medium (Bellardo 2003).
A NARA Appraisal Guidance document issued in October 2003 includes an appendix on the appraisal of special types of information, including environmental, health and scientific observation data in the physical sciences (NARA 2003c). Recently, NARA has created a board of scientists and publishers to discuss the specific issues related to scientific e-records.
The UK Public Record Office (recently renamed the National Archives) has also been active in the area of digital preservation (Public Record Office 2003). The Digital Archive receives selected electronic records from government departments under the management of the Records Management Department. The Digital Archive is available to onsite users from designated PCs. Advice and guidance is provided to government agencies with regard to file formats, storage media, the care and handling of removable media, graphic file formats and image compression. Future topics on which guidance will be issued include digital signatures, encryption, and checksums. In April 2003, the National Archives’ Digital Preservation Department hosted an international conference on “Practical Experiences in Digital Preservation,” where issues of technology, organization, and cost were discussed by a variety of national archives, including those from the US, the Netherlands, Iceland, and the UK (National Archives 2003).
These organizations and others are part of the InterPARES effort, which in 2002 began the follow-on project, InterPARES II. InterPARES II broadens the number and types of archives that are included in the group, and it addresses an extended scope of e-record problems. It currently includes over 100 researchers (Eastwood 2003). It will address issues of reliability and accuracy in addition to issues of authenticity, and it will address them throughout the records' lifecycle (from creation to permanent preservation). InterPARES I was concerned primarily with authenticity and with non-current records destined for permanent preservation (InterPARES 2003).
7.0 Standards by Format Type
The best format for long term preservation remains elusive, perhaps because there is no single answer to the question. Instead it depends on the format type of the original object, the characteristics of the original that the preserving organization considers to be most important to preserve, and the expected use/re-use of the object in the future (distance education versus legal evidence). Most experts agree that the best format for preservation is that which is least proprietary while conveying significant aspects of the original.
This section outlines the status of format standards for text, images, videos, data and other products of scientific research and communication with the realization that the practices represent a range of institutions with varying needs and decision criteria.
The most common formats for storing text were XML (ASCII, with or without Unicode), PDF and TIFF. Each of these formats has its place in the preservation strategy.
For scientific and technical text, as well as other objects, ASCII is the most open format, accommodating virtually all software or browsers now and into the future. However, for some digital objects, ASCII is problematic when paired with the requirement to provide permanent access and to render the look and feel of the original. Therefore, PubMed Central, DiVA and the Humboldt University cite XML as the preferred format for preservation because it is based on ASCII, non-proprietary and well-adapted for re-purposing and interoperability. The PubMed Central Guidelines require separate SGML or XML files for the full text of each article. DiVA creates XML for all available full text and Unicode is used to preserve the extended character sets from the original.
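How an ASCII-based XML file can still carry extended characters is easy to show in a short sketch. It uses Python's standard `xml.etree.ElementTree` module, which writes non-ASCII characters as numeric character references when asked for ASCII output; the helper name and the sample string are assumptions for illustration and do not represent the PubMed Central or DiVA tool chains.

```python
import xml.etree.ElementTree as ET

SAMPLE_TITLE = "Ångström-scale measurements at 25 °C"


def to_ascii_xml(tag: str, text: str) -> bytes:
    """Serialize one element as pure-ASCII XML bytes.

    Characters outside ASCII are written as numeric character
    references (for example, U+00C5 becomes &#197;).
    """
    elem = ET.Element(tag)
    elem.text = text
    return ET.tostring(elem, encoding="us-ascii")


if __name__ == "__main__":
    encoded = to_ascii_xml("title", SAMPLE_TITLE)
    print(encoded)
    # Parsing the ASCII bytes restores the original Unicode text.
    print(ET.fromstring(encoded).text == SAMPLE_TITLE)
```

The stored file contains only 7-bit characters, yet any conforming XML parser recovers the original extended character set on reading.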
TIFF, an image format, is used to preserve the look and feel of original text objects. The use of TIFF in text environments began with the advent of scanning and Optical Character Recognition technologies, which used the TIFF images. TIFF can be employed at various resolutions dep