CENDI PRINCIPALS AND ALTERNATES MEETING

National Library of Medicine
Bethesda, Maryland
February 4, 2003

Minutes

Creating and Preserving Organizational Memory: Government and Academic Repositories
Remarks from the U.S. Public Printer and Superintendent of Documents
Revolutionizing E-Records: Agency-NARA Partnerships
Institutional Repositories: The Academic Perspective
Spotlight on NLM: Next Generation Internet and NLM’s Embryo Project

WELCOME

Kent Smith, Chair of CENDI, opened the meeting at 9:15 am. He introduced CENDI to those attending for the first time, emphasizing the importance of melding content and technology. CENDI’s priorities include the development of Science.gov; digital archiving, preservation and permanent access; STI policy; and web metrics and usability testing. Through its working groups, CENDI looks at new technologies such as XML, digital object identifiers, the semantic web and cyberinfrastructures. IT Security was addressed by looking at intrusion detection methods. CENDI’s goal is to address these issues as a group to achieve better, more effective results than the agencies can do alone.

Dr. Donald Lindberg, Director of the NLM, welcomed the CENDI members to the Library. He noted that CENDI successfully operates “below the firing line.” The networking within CENDI allows the STI managers to “feel good” in a society which doesn’t always understand their missions. NLM shares many interests with other agencies including preservation of digital information. For example, Dr. Lindberg, on behalf of NLM, is anxious to work in partnership with the Library of Congress (LOC or LC) and the National Archives and Records Administration (NARA) in order to avoid duplication of effort. There is potential for a joint initiative with the National Science Foundation (NSF) on the research agenda for preservation. The agencies also share the need to provide information to multiple user groups, particularly to the public. He noted that we often misinterpret the needs of our users. Products developed for one audience are found to be valuable to other audiences as well. For example, health practitioners use NLM’s MedlinePlus, even though it was developed for health consumers. Over the years, what we are doing, how we are doing it, and who we are serving are always changing.

Creating and Preserving Organizational Memory: Government and Academic Repositories

Remarks from the U.S. Public Printer and Superintendent of Documents
Bruce James and Judy Russell

Mr. James has a long history in the printing business. When he started, they were still setting wood type. In the process, he has built 13 companies and learned a lot about organizational change. Mr. James joined the Government Printing Office (GPO) with the challenge of taking it into the 21st Century. He is reviewing where GPO is and determining the opportunities and challenges. In particular, he is looking at the needs of GPO’s customers, many of which are CENDI members.

GPO was created in 1813 when Congress began to think about information needed for the democracy. The Federal Depository Library Program (FDLP) and the GPO sales program were early initiatives. In 1854, Congress started its own printing business. It quickly became the largest printing and information dissemination organization in the world. The headquarters building, if laid flat, would cover 24 city blocks. In addition, there are 30 locations across the country. GPO currently employs 3,000 people who collect, manage and distribute government information. A digital printing system was introduced early in order to produce the Congressional Record in Washington, DC. GPO has high security facilities located in Colorado.

Currently, less than 50 percent of GPO’s output ever sees paper. More and more, it is moving to a digital world. Mr. James expects to see printing go away in his lifetime, with conversion to paper taking place only at the user’s desktop.

Judy Russell was introduced as the 22nd Superintendent of Documents. She emphasized the need for collaboration among agencies and the fact that GPO has many resources to bring to the table. She spoke of the existing opportunity of agencies to provide improved access to federal information. GPO is anxious to have more partnerships with CENDI to forward items of common interest.

Of particular importance to GPO is the need to review the FDLP program. There is a Council meeting scheduled for the first week in April. They need to look at the program and to begin to answer the question of why an institution would want to be an FDLP member when it can get the information from the Internet. What is the value for the FDLP members? The answer to this question will be different for academic, public, or law libraries. It has been noted that the reference librarians at the institutions provide subject expertise that GPO cannot provide, and this is a real strength of the FDLP program.

Confusion remains over electronic information and Title 44, since these issues have not been resolved. The problem of what you want to preserve and save continues because the volume has increased significantly. GPO is seeking to get Congress more involved in these discussions. GPO is responsible for permanent availability while NARA is the place of last resort. GPO needs to talk to partners to decide what this means.

Revolutionizing E-Records: Agency-NARA Partnerships
Dr. Lewis Bellardo, Deputy Archivist, NARA

NARA’s vision is to ensure ready access to essential evidence that documents the rights of American citizens to know what the government is doing; assures accountability on the part of federal officials; and documents the national experience. In the digital environment, NARA believes that it needs new partnerships with stakeholders to achieve this vision.

In meeting this vision, NARA has several goals. These include helping federal agencies economically and effectively manage records to meet agency business needs. Records should be kept long enough to protect the rights, assure accountability, and document the national experience, but records should be destroyed when they are no longer needed and it is practical to do so. With these goals in mind, NARA has several initiatives: the Records Management Initiative (RMI), the Electronic Records Management (ERM), and the Electronic Records Archives (ERA).

In the RMI, NARA took a look at the current environment by talking to records managers, CIOs, general counsels, and other stakeholders. They reviewed agency work processes, looking at strategies for improvement and change. The underlying principle of the redesign is to create mutually supportive relationships with the agencies so that NARA’s program adds value and helps to ensure that records are managed more effectively. They want to add value to business processes to support the work and needs of the agencies.

The old model was fairly effective: help was provided and money was saved because the agencies could use free records centers, while NARA got a larger percentage of the material they were supposed to get. However, the downsizing of records management, end user decisions at the desktop, the rise of IT shops, and the increase in the number of disparate formats in which agency materials were created caused a misalignment of the agencies and NARA. Currently, RMI is focusing on allocating resources to identify and manage records that are most at risk. This includes flexible scheduling, custody and appraisal policies, guidance and training, direct assistance and advocacy. For example, a guidance document related to affiliated archives is to be released in about two weeks.

NARA is also looking at the concept of “targeted assistance”, using national priorities that are set. Homeland Security and FBI special projects have been helpful in further developing this thinking. NARA has determined that it needs to be more selective in oversight programs, but is willing to make tough calls when agencies are not conforming, particularly after help has been offered. The approach is to dangle a carrot rather than swing a stick.

NARA has conducted prototypes that involve taking records in even before they are legally accessioned, thus beginning the archiving process much earlier in the information life cycle. They also aim to expand the acceptable transfer formats to better accommodate the current and future digital environment.

The ERM is one of the 24 E-government initiatives, with the goal of expanding citizen-centered e-government, working with partner agencies, and reducing duplicative systems. NARA was chosen by the Office of Management and Budget (OMB) to lead the e-records initiative. The initiative includes four parts: correspondence management, enterprise-wide ERM, electronic information management standards, and the transfer of permanent e-records to NARA.

NARA is also involved in the e-regulation initiative. E-regulations.gov is an interesting model. The request from a user comes into the GPO server. It is sent to the EPA server, where the user writes the comment on any regulation across multiple agencies. The comments are then forwarded to the particular agency and get on the docket.

ERM is focusing on electronic information management standards. They have endorsed the DoD second version (January 15, 2003) as the standard for all agencies. NARA will be partnering with DoD on the next version that will move them into an archival setting.

The focus on the transfer of permanent e-records involves the development of acceptable formats. Initially, there will be guidance on current or “as is” records. E-mail with attachments and scanned images of textual records are already supported. PDF should be supported by the end of March. Three more formats are expected in 2004. While it isn’t clear which formats, they may include web records, digital photographs, or GIS. NARA is involved with Adobe in developing the PDF/A format, which is being submitted through the International Organization for Standardization (ISO) standards process.

In addition, guidance is being developed for the “to-be” records. Formats of the future will impose more control through mark-up languages such as XML. They are working with partner agencies on archival metadata and relevant XML schema, including Dublin Core elements as the core. Transfer may take place via FTP or Digital Linear Tape, which may become a long-term preservation medium.
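As an illustration of the kind of XML-encoded archival metadata described above, the following is a minimal sketch that wraps a handful of Dublin Core elements in an XML record. The `record` wrapper element, the `build_dc_record` helper, and all field values are assumptions made for illustration; they do not represent a schema adopted by NARA or its partner agencies. Only the Dublin Core namespace URI is taken from the published standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return an XML string holding one Dublin Core element per field.

    The enclosing <record> element is a hypothetical wrapper, not part
    of any agency schema.
    """
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


# Illustrative values only.
example = build_dc_record({
    "title": "Quarterly Inspection Report",
    "creator": "Example Agency",
    "date": "2003-02-04",
    "format": "application/pdf",
    "identifier": "example-0001",
})
print(example)
```

Because the record is plain XML, it can be validated against a schema and moved by any transfer mechanism that carries files, including the FTP route mentioned above.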

The ERA provides the IT infrastructure and the technical solutions necessary to implement the results of RMI and ERM. The goal is to preserve and provide ongoing, economical access to archival e-records. The ERA system deals with the three basic functions of input, storage (online, near-line, and offline), and output. Tools will be developed that support automatic input based on records schedules, description in an archival catalog, and reference services that support researchers. The goal is to develop a system that is scalable from small to large archives. The system must be extensible to a whole range of formats and ultimately able to take any format.

The key to success of the ERA strategy is partnering with other agencies and with industry. NARA has projects underway with NSF, the Digital Library Federation, and the Library of Congress. They recognize that this is a high-risk project, but the Office of Management and Budget (OMB) has been very supportive. Dr. Bellardo presented the ERA timeline. The work for 2003-2007 emphasizes the official military personnel files. It involves a Request for Proposal (RFP) for a performance-based contractor to develop the system based on requirements. The RFP presents data requirements and use cases for what the system must do. NARA wants the vendors to tell them what will work. The goal is to base the system on COTS products.

NARA is involved with the Interagency Committee on Government Information under Section 347 of the E-Government Act (Note: Secretariat information on this is that the committee establishment is addressed in Title II, Section 207 of PL 107-347, E-Government Act of 2002). Within two years, it must make recommendations for standards for organizing and categorizing government information.

Dr. Bellardo ended with an emphasis on the interests that NARA and CENDI share. These include preservation and access to information over time, with an emphasis on scientific and technical data, research and development data, and government publications. Areas of possible collaboration that were identified at an initial meeting between NARA and CENDI representatives include development of metadata standards, arrangements for carrying out complementary functions between the agencies and NARA, harvesting of metadata and content, and initiatives surrounding the E-Government Act.

Discussion

There is a need and desire for the major information management organizations, including the CENDI agencies, GPO, and NARA, to speak with a common voice, where possible, so that no single agency is trying to address these issues alone.

The group also discussed some of the more difficult types of information; for example, web portals with private and government information. These difficult types of information emphasize the need for disposition and retention information in metadata that can be operated on by computer.
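To make concrete what disposition and retention information "operated on by computer" could look like, here is a minimal sketch. The `RecordMetadata` fields, the two disposition values, and the sample records are all hypothetical; actual records schedules are considerably richer than a single retention period.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class RecordMetadata:
    """Hypothetical machine-readable retention metadata for one record."""
    identifier: str
    created: date
    retention_years: Optional[int]  # None means the record is permanent
    disposition: str                # assumed values: "destroy" or "transfer"


def disposition_due(meta: RecordMetadata) -> Optional[date]:
    """Return the date on which the disposition action falls due, if any."""
    if meta.retention_years is None:
        return None
    target_year = meta.created.year + meta.retention_years
    try:
        return meta.created.replace(year=target_year)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return meta.created.replace(year=target_year, day=28)


def actions_due(records: List[RecordMetadata], today: date) -> List[Tuple[str, str]]:
    """List (identifier, disposition) for records whose due date has passed."""
    due = []
    for meta in records:
        when = disposition_due(meta)
        if when is not None and when <= today:
            due.append((meta.identifier, meta.disposition))
    return due


sample = [
    RecordMetadata("r1", date(1995, 1, 1), 7, "destroy"),
    RecordMetadata("r2", date(2000, 2, 29), 3, "transfer"),
    RecordMetadata("r3", date(1990, 1, 1), None, "transfer"),
]
print(actions_due(sample, date(2003, 2, 4)))
```

Carrying the retention period with the record itself lets a system flag items for action without a person consulting a separate schedule, which matters most for mixed collections such as the web portals mentioned above.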

Institutional Repositories: The Academic Perspective
Raym Crow, Scholarly Publishing and Academic Resources Coalition (SPARC)

The SPARC program is a coalition of 240 libraries worldwide. It was begun as a reaction to dysfunctions in the scholarly journal publishing market, including increases in journal prices that have far exceeded the inflation rate. From 1986 to 2001, the price of scholarly journals rose an average of 300 percent, even though the libraries procured five percent fewer titles. There is a substantial price disparity between commercial and society journal publishing. One study found that commercial journals are an average of three to nine times more expensive than society journals. The $8 billion scholarly journal publishing business is controlled by a few large commercial publishers. The bundling of paper journals with electronic has also led to the “big deal”, which means that the large publishing conglomerates that control a large number of titles bundle these titles together for a larger price. This competition harms the society and smaller publishers who do not have the ability to bundle large numbers of titles. By the time the libraries pay for the bundled titles, there is little money left to procure titles from smaller publishers and societies.

However, SPARC believes that the digital environment has unlocked opportunities. It expands opportunities for access, provides potential for new uses for research, and encourages better ways to handle the increased volume of research that is being generated. One way that the digital environment addresses the journal crisis is by allowing the disaggregation of the current publishing environment.

In the current scholarly journal publishing model, there are four components that are bundled together and controlled by the publishers – registration for intellectual property purposes, certification of the validity/quality of the research, awareness/announcement, and archiving. These functions could be disaggregated, allowing different organizations to provide services in these areas. For example, the peer review (certification) process could be undertaken by a learned society. A national library or third party could perform the archiving and long-term preservation. Separating the responsibility for these functions would increase competition and force those involved in the chain to add value to their services.

Institutional repositories can play a large part in this process. Institutional repositories are defined by the organization with contributions being limited to those from students, faculty, and staff of the institution. Contributions are scholarly in nature, but the material may be broader than published articles, including teaching materials of ongoing value, videos, theses and dissertations, manuscripts, preprints, etc. The goal is to have a cumulative and perpetual repository. The commitment is to open, free global access.

Institutional repositories are a natural extension of the fact that faculty are putting their work on the Web. However, individual web authoring results in independent, unmanaged materials that lack coherence and interoperability.

Standards are the key to having institutional repositories that can interoperate. Standards are also key to having the components of a disaggregated system work together. Most formalized institutional archives are Open Archives Initiative (OAI) compliant. This means there is a standard set of core metadata for discovery and a protocol for exposing metadata. Harvesting services can be built across institutional repositories.
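The harvesting model can be sketched in a few lines. The example below builds a `ListRecords` request as defined by the OAI Protocol for Metadata Harvesting (OAI-PMH) and extracts identifiers and titles from a response. The repository URL, the sample response, and the helper names are assumptions made for illustration; the verb, the `oai_dc` metadata prefix, and the namespace URIs come from the published protocol. No network call is made, and the response is a hand-written sample.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace URIs defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def extract_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        results.append((ident, title))
    return results


# Hand-written sample response; a real one carries more elements.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Preprint</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical repository endpoint.
print(list_records_url("https://repository.example.edu/oai"))
print(extract_titles(SAMPLE_RESPONSE))
```

A harvesting service would issue the same request against many repositories and merge the results, which is possible only because every compliant repository answers in the same metadata format.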

There are many benefits to institutional repositories. Free articles get more visibility and are cited more often. There is a “branding issue” that is especially valuable to state schools that have budget issues and need to prove the public good of their institution. Institutional repositories can complement each other as well as complement existing publishing models.

However, there are impediments to the development of institutional repositories. These include trends against formal publication if preprints are posted, and concerns on the part of authors about intellectual property rights, including first publication. There should be a critical mass of information in a repository in order to make the system worthwhile. There are also discipline-specific differences. For example, in physics and economics, which are disciplines where sharing preprints is part of the tradition and culture, the institutional repository paradigm has advanced much more quickly than in disciplines such as medicine.

There are also perceived quality issues for users. The commingling of working papers and peer-reviewed papers can be problematic. It is necessary for certification methods to be developed and readily apparent to the user. There is concern that repositories will undermine existing journals without being able to pick up the workload.

The Open Society Institute has just written a series of white papers on business models for open access. The analysis is that these institutional archives won’t have a significant impact on large publishers such as Elsevier. Societies and marginal publishers have little choice as it is.

Discussion

There is a large amount of material that is valuable but is not in digital form. Many institutional archives are not dealing with this material. Some publishers, particularly the learned societies, are committed to retrospective digitization, but too many are not. However, it is important to make a case for the need to digitize some of this material. Both DTIC and NTIS have statistics that show the amount of interest in older materials.

We need unanimity in the need for digitization before agencies can go to Congress to explain why the public would benefit or be interested. Dr. Bellardo indicated that he is seeing a growing interest in content and, at least in the Executive Branch, he believes that legacy issues can be raised if time is spent preparing the business cases.

Spotlight on NLM: Next Generation Internet and NLM’s Embryo Project
Dr. Michael Ackerman

Dr. Michael Ackerman introduced the Embryo Project, one of several projects funded under NLM’s Next Generation Internet (NGI) initiative. This is an outgrowth of NLM’s involvement in the HPCC program in the early 1990s. Since that time, NLM’s involvement in the NGI has been to develop applications for it. They believe that healthcare applications will test the system through requirements and usage.

NLM’s NGI uses the Abilene Network, developed by the University Corporation for Advanced Internet Development, which has about 200 members, including 20 government agencies and 10 industrial partners. The aim is to develop applications on the Internet-2 and then deploy them back to the Internet-1. This is a two- to five-year cycle of development and deployment, allowing R&D to be performed while improving the capabilities of the public internet in a systematic and reliable way. NLM has funded 16 projects in healthcare that won’t work on the regular Internet. For example, they are partnering with the Radiological Society of North America on the manipulation of large datasets for breast cancer diagnosis and treatment.

The Visible Embryo Project seeks to deploy a system of visualization/collaboration workstations using high performance networking. It digitizes and makes available via the system, embryo data from the Carnegie Collection. Lastly, it demonstrates the use of the system in medical collaboration environments.

Dr. Mark Pullen of George Mason University, the principal investigator for overall direction and collaboration technology, introduced members of the team, most of whom gave their presentations remotely via the networked collaboration workstation. The other members include Eolas, providing technical coordination and software; the San Diego Supercomputer Center, providing the data repository; the Lawrence Livermore National Laboratory, providing network facilitation; the Armed Forces Institute of Pathology, providing the data; and the Oregon Health Sciences University, the University of Illinois at Urbana-Champaign, and Johns Hopkins University, developing medical demonstrations using the system.

The images in the project are based on a series of embryos that were available from the Carnegie Collection at the National Museum of Health and Medicine at the Armed Forces Institute of Pathology. The Carnegie Collection is the largest repository of human embryology. Repositories of content, such as this one, are helping to get the content out into meaningful applications. It is a non-renewable and non-replaceable collection, so this digitization is also serving as a preservation effort.

Fifteen embryos were cross-sectioned and slides were produced at 10-15 microns at 20x magnification. The high-resolution images were sent to the San Diego Supercomputer Center. Records, photographs, and other information are also collected. There is the potential for a total of four terabytes of data. Work is underway, using volunteers, to annotate the embryos. Two embryos have been annotated to date, with a third planned.

The Annotation Collaboration System, which integrates a number of collaboration, database, and visualization tools, has been developed to allow contributors to work most efficiently. The system provides asynchronous access, so contributors can work at their own schedules. Other aspects of development include the metadata input and browser, the system to track annotators, and the development of semantic metadata records that are connected to every 3-D point in the database. This granularity of metadata allows for manipulation of a large amount of data in real time.

A major technology development is network-scope shared arrays. This technology allows the sharing of a data space across the network. Changes are instantly propagated across the network as if the network were the computer’s bus.

The interactive visualization of large volumes of information is led by the San Diego Supercomputer Center, using off-the-shelf hardware. The “storage resource broker” is being used to handle data sets that don’t fit into standard memory.

The projects under the Visible Embryo umbrella are looking at various clinical and educational uses of the Carnegie data. For example, fetal programming is based on the concept that each month of embryonic development has something specific to do with future health. There is a library of morphological maps that help to display more about the true health of the embryo. Eventually, the gene expression information could be mapped over the morphological pictures.

Teaching methods can be improved because embryonic growth can be shown more quickly through the visualization tools. The system supports active learning software that is also very conducive to distance education. While embryology has declined as a formal discipline in universities and medical schools, it is hoped that these types of teaching techniques will allow the few embryology professors to use them across the educational system. Master classes are being recorded in order to make them available for multiple medical schools.
