CENDI PRINCIPALS AND ALTERNATES MEETING
NATIONAL SCIENCE FOUNDATION, Ballston, VA
January 4, 2005
MINUTES
- FROM PRESERVATION TO SUSTAINABILITY: CONNECTING RESEARCH TO OPERATIONS
- Long-Lived Data Preservation: A National Science Board Investigation
- NSDL: Core Integration, Collections, and Targeted Research Projects
- Sustainability of a Public National Digital Library for Science, Technology, Engineering and Mathematics Education
- NSF SHOWCASE: RE-INVENTING THE NSF WEBSITE
- CENDI MERIT AWARD
Welcome
Dr. Walter Warnick, CENDI Chair, opened the meeting at 9:10 am. He thanked Dr. Strawn and Ms. Higgs for hosting the meeting. Dr. Strawn welcomed everyone to NSF.
FROM PRESERVATION TO SUSTAINABILITY: CONNECTING RESEARCH TO OPERATIONS
“Long-Lived Data Preservation: A National Science Board Investigation” Dr. Christopher Greer, National Science Board Task Group on Long-Lived Collections, Division of Biological Infrastructure/National Science Foundation
The National Science Board was established by the NSF Act of 1950. It has 24 members who provide oversight and policy making for NSF and advice to the President and Congress on matters of science and engineering policy. The NSB is interested in digital data because the role of data is increasing in research and education. Digital data is a powerful catalyst for new research and it provides opportunities to broaden participation by students and researchers at all levels.
The Long-Lived Data Collections Task Force was established in February 2004. The task force was charged to “delineate the policy issues relevant to the National Science Foundation and its style and culture of supporting the collection and curation of research data, and [to] make recommendations for the National Science Board and the community to consider.” A review of the data collections supported by NSF found that the majority of these data sets are external to NSF. They are numerous and very heterogeneous. Two workshops were held, one to address agency issues and the other to focus on the community and prior relevant reports, including the Atkins Report on Cyberinfrastructure for Science. The report from the task force, which is currently under NSF senior management review, is expected to be distributed in draft for public comment in March 2005.
Dr. Greer highlighted the major findings in the report. A key finding is that words and concepts in the data community are used differently by different people and there is a need to create a framework for commonalities to unify discussions on the issues. The policy for long-lived data collections should be informed by a clear vision of the needs of people and institutions in the data collections universe. Specific responsibilities for management must be enumerated, with a comprehensive list of these responsibilities as an outcome of further deliberations. Data users should provide clear attribution and robust metadata should be supplied by the authors and data managers so that attribution can be determined. Lastly, a metadata policy is needed.
The report identifies the following roles with regard to long-lived data collections – authors, data managers, data users, and supporting agencies. The author is responsible for meeting standards criteria and depositing the data. Robust standards need to be developed and there must be a system for developing standards within and across communities in an evolving way. Data managers make the data accessible. They should provide clear deposition standards along with access and intellectual property rules. Supporting agencies make the universe work efficiently by establishing strategic plans. It is the NSB’s responsibility to bring this about.
The report also distinguishes three types of data collections. At the most generic level, a data collection is a dynamic, heterogeneous community system. There is a continuum of collections from research to resources to reference collections.
Research Collections may have a limited timeline depending on the specific project. They may or may not be a budget line item in the funding of a project. These collections are generally used within a small group.
Resource Collections are part of a specific organization; for example, the genome sequencing collections. They serve a particular community and may have only a defined timeframe. Resource collections may evolve, and, therefore, the management is often dynamic, with the data or the collection being moved from one data manager to another.
A Reference Collection is a global resource that serves many communities and aids in the creation of other collections. Reference Collections usually have direct funding and the timeline is almost indefinite.
Data may be deposited in a research or a reference collection and then move to a resource collection. It is necessary for a community to describe its standards and to commit to maintaining the funding in order for a resource collection to be established. Formalizing such funding is a critical step because it affirms the community’s need for the particular collection. For example, the Protein Data Bank began at DOE’s Brookhaven Lab with external funding by NSF. As a major resource collection, it is now funded by eight national and three international organizations at $9 million per year.
Long-lived data collection policies must recognize the dynamic character of data collections. An approach similar to publishing of documents is needed to handle digital data.
Data collections can be managed centrally or in a distributed fashion, depending on the collection’s history, the needs of the community, and community expectations. The funding and management models that are selected have major policy implications. Indirect support to a distributed model is very complicated because it is difficult to ensure continuity of distributed collections. There should be a distinction between a long-term commitment to funding the data collection and a commitment to the data management institution. For example, the community can say that the data collection must be sustained but its data management organization may change.
The value of older data has policy implications as well. In some disciplines such as social and behavioral science, data becomes more valuable as it gets older.
A crucial question is whether a comprehensive data plan that addresses maintenance, standards, access, and ongoing funding should be part of every proposal. A data plan creates the opportunity for a community response.
It is important to recognize that data collection activities go beyond the collection and distribution of the data. They include curation, quality control, peer review, and standards development. Collection managers are key to standards setting but who will pay for them to be involved in these activities? The community must give certain individuals proxy responsibilities to make standards decisions. There are implicit and explicit assumptions that go along with these responsibilities. Should the implicit responsibilities be funded and made more formal than they are currently?
Broadly enabling research and education to take advantage of data requires proper training at all levels. There must be a pipeline for training new scientists, data managers, and users, and retraining for those in positions that involve data management and use. Career paths for “data scientists” must be addressed. Recognition for data management and collection work must be given in the reward structure. Supporting agencies can partner with NSF and others to address this problem.
It is also important to address the relationship between data and the scholarly publishing community. For certain types of data, like those of the Earth Observing System, it is difficult to make a direct connection between publications and outcomes. In other cases, the connections are more explicit.
In general, the report recommends the development of policies that embrace the structure of the “collections universe”. Other recommendations include the delineation of responsibility/authority and needs related to data, further evaluation of data management plan options and their costs, evaluation of community-proxy functions as they relate to standards development, and the provision of education and training in data management.
“NSDL: Core Integration, Collections, and Targeted Research Projects” Dr. William Arms, NSDL Principal Investigator; Professor, Cornell University
NSF is interested in digital libraries because it believes that they will have a positive impact on the quality of education. While higher education in the U.S. is among the best in the world, it is very expensive. K-12 education is often mediocre or worse. Technology-enhanced education can raise quality while reducing cost by making efficient use of existing teaching resources. Educational materials are expensive to create but the wheel is often reinvented. NSF has funded many projects over the years but the overall impact has been small. Digital libraries can help by allowing users to find and re-use resources that have been developed.
Digital libraries provide three necessary functions: 1) they relate materials to specific educational needs; 2) they allow users to find relevant materials through searching and browsing; and 3) they promote reuse and preservation of materials that would otherwise go unused. This connection of resources to needs is a library-oriented function.
There are several challenges in the educational environment. In terms of searching and browsing, the needs are very specific including distinctions by educational level, granularity of the materials and primary versus secondary materials. Many materials require prerequisites. Increasingly, materials are tied to state curriculum standards. In addition, users are often pressed for time.
The NSDL has funded more than 150 open-ended projects in collection development, service development, and related research since 2000. In the beginning, managed projects received a small portion of the funding. However, interoperability challenges have warranted an increase. In 2001, core integration was approximately 20 percent of the NSDL program budget. With the pathways project, the amount is projected at approximately 50 percent for 2005. The goal is to have the results of the projects last. Achieving the correct funding balance between managed and open projects is very important.
The philosophy of the NSDL’s Core Integration is that you can build a very large digital library with a small staff. However, every aspect must be planned with scalability in mind and compromises must be made.
Teaching materials are scattered across the Internet. In the initial repository design, the repository held metadata about every collection and item in the NSDL, based on Dublin Core metadata and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The NSDL is the modern equivalent of a union catalog. This NSDL approach was based on the assumption that the content will not be homogeneous. It acknowledges mixed content and mixed metadata. A single metadata standard for all items is impossible for economic, technical and cultural reasons. The system must accommodate a messy metadata environment, with metadata ranging from very poorly done and inconsistent to very rich, formal, standardized approaches such as MARC and AACR2.
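As a rough illustration of the union-catalog idea described above, the sketch below parses a small OAI-PMH ListRecords response and extracts whatever Dublin Core fields each record happens to carry. The sample response, its identifiers, and the function name parse_records are hypothetical and invented for illustration; only the XML namespace URIs come from the OAI-PMH and Dublin Core specifications. It is not a description of the NSDL's actual harvester.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response (invented sample data, not real NSDL content).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate Tectonics Lesson</dc:title>
          <dc:subject>Earth science</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:item-2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Untitled applet</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per harvested record; missing fields are simply absent."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        fields = {}
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is not None:
            for child in dc:
                # Strip the "{namespace}" prefix to get the bare element name.
                name = child.tag.split("}", 1)[-1]
                fields.setdefault(name, []).append((child.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records


records = parse_records(SAMPLE_RESPONSE)
for record in records:
    print(record["identifier"], sorted(record["fields"]))
```

The second sample record has no subject element, and the parser records it anyway, which reflects the point that a union catalog of this kind has to accept uneven metadata rather than reject it.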
There is a spectrum of interoperability. The levels of agreement vary. Federation, harvesting, and gathering approaches require different levels of agreement between the NSDL and the partners, and the metadata approaches encountered are likely to vary. In the least formalized environment, the NSDL resorts to harvesting and gathering. For example, an expert-guided crawl of selected web sites has been developed by Via, a group from Riverside, California.
The Phase 2 design for the repository will be a much richer data structure (including both content and metadata). The Fedora model is being used to allow for the storage of both metadata and content in a federated environment. The content archive is being built by the San Diego Supercomputer Center. The relationships between objects are handled by RDF. A metadata format registry allows multiple formats to be used with cross-walks and augmentation and enhancement of metadata over time. Web services, authoring, and publishing services are provided.
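A metadata format registry with cross-walks can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the registry contents, the format names ("dublin_core", "marc_like"), and the functions to_common and augment are all hypothetical, and the MARC-style keys are a simplified stand-in for real MARC records. It does not describe the NSDL's actual registry.

```python
# Hypothetical registry: format name -> cross-walk from native field to a
# common (Dublin Core style) element name.
CROSSWALKS = {
    "dublin_core": {"title": "title", "creator": "creator", "subject": "subject"},
    # Simplified MARC-like keys (tag + subfield); real MARC is far richer.
    "marc_like": {"245a": "title", "100a": "creator", "650a": "subject"},
}


def to_common(record, fmt):
    """Translate a native record into the common element set via its cross-walk."""
    if fmt not in CROSSWALKS:
        raise KeyError(f"no cross-walk registered for format: {fmt}")
    crosswalk = CROSSWALKS[fmt]
    common = {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            continue  # unmapped fields are dropped in this sketch
        common.setdefault(target, []).append(value)
    return common


def augment(common, additions):
    """Return a copy of a common record enriched with new values, no duplicates."""
    merged = {element: list(values) for element, values in common.items()}
    for element, values in additions.items():
        bucket = merged.setdefault(element, [])
        for value in values:
            if value not in bucket:
                bucket.append(value)
    return merged


native = {"245a": "Cell Biology Tutorial", "100a": "Doe, J.", "999z": "local note"}
common = to_common(native, "marc_like")
enriched = augment(common, {"subject": ["Biology"], "title": ["Cell Biology Tutorial"]})
print(enriched)
```

Keeping the cross-walks in a registry, rather than hard-coding one format, is what lets several metadata formats coexist and lets records be enhanced later without changing the stored originals.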
Identification becomes more important in this distributed environment. The system is being built using Handles. The NSDL is working with the SCORM community on the modifications needed to deal with distributed learning objects.
Administration and sustainability are issues. It is important to bridge the gap between the individual projects without stopping the spontaneity. Much of the time is spent in community building and outreach.
The Pathways Project is a large, multi-year grant for long term management of the NSDL collections. It will be one digital library but many portals. Subcontracts will be awarded to a number of organizations to focus on specialized areas and to move key components such as expert crawling forward more quickly.
The catalog of resources is now being built with a distributed selection mechanism. Registered contributors, primarily former NSF program officers, can add resources to the collection. The process is managed by trained science librarians led by John Saylor at the Cornell Library.
The NSDL also seeks out partnerships with others. For example, AskNSDL is a partnership with Syracuse University. The Eisenhower National Clearinghouse is developing a Middle School Portal because this is the stage in which many students are lost to science.
Unfortunately, NSDL has historically viewed itself at the center of the universe, a philosophy which others have taken as well for their systems. Ultimately, the users must be at the center. How does the NSDL put the user at the center knowing that users have changing needs? Yahoo and Google are very much into harvesting the deep web. NSDL has established an annotated Google page with the NSDL logo so that resources displayed from the NSDL as a result of a Google search are acknowledged.
We are still in the early stage of digital libraries and there is much work to be done. The most important aspects are the education content and its organization. It isn’t important to affix responsibility to individual organizations. However, a balance must be maintained between a managed program and an open-ended program. No digital library is an island. The question is how to make the islands fit together as an archipelago.
“Sustainability of a Public National Digital Library for Science, Technology, Engineering and Mathematics Education” Dr. Paul Berkman, EvREsearch LTD and University of California, Santa Barbara
Dr. Berkman is the Chair of the Sustainability Standing Committee, which is a volunteer activity within the National Science Digital Library (NSDL) that is tasked with developing practical strategies to sustain this education program beyond the period of federal funding. Since its inception in 2000, the NSDL has received $100 million from the National Science Foundation with projects in 33 states. The issues and elements of sustainability (see accompanying table) overlap with preservation and persistence, which ultimately affect outcomes for society in the long term.
Program sustainability involves strategies to facilitate long-term collaboration among projects, users, sponsors, agencies, and other stakeholders. How should the researchers, educators, and administrators build, apply, and sustain the NSDL over the long term, for the benefit and progress of society at both the global and local levels? What legal, accounting, and administrative procedures must be instituted in order to implement in a timely manner the types of partnerships that will be needed? What are the mutual benefits between the NSDL and these projects that can be leveraged? How can the federal funds be leveraged into the future?
Project sustainability addresses how revenues should be allocated to best facilitate the ongoing development, maintenance, and evolution of individual projects. Sustainability will likely require entrepreneurial activities, such as e-commerce for some products and services, as well as licensing agreements for protected intellectual properties. These issues will also involve revenue allocation and accounting within the NSDL program.
User-community sustainability deals with assessment, outreach, and engagement strategies that will best promote the evolution of the NSDL. Targeted applications that address specific needs of diverse users, such as the Antarctic Treaty Searchable Database (http://webhost.nvi.net/aspire) for Antarctic courses and the international community of Antarctic Treaty decision-makers, are seen as critical. Similarly, the NSDL must meet the needs and requirements of K-12 education audiences by providing meaningful materials that address education standards. A key will be for the NSDL to assess applications beyond those originally envisioned by building innovative collaborations with user-communities for their mutual benefit.

Technology sustainability addresses the strategies that will enable the NSDL to provide sustained leadership in developing and implementing the technologies needed for integrated access to information and user-defined knowledge discovery. The nature of information and knowledge discovery is different in the digital era than in other information eras (see Figure). Clay was a significant improvement over stone tablets, since it could be bound into protobooks. Detail was possible in papyrus that was not possible in clay. Paper was a significant invention over papyrus, particularly with the innovation of technologies for copying information. What is distinct about digital information compared to its hardcopy predecessors? The paradigm shift with digital information is that it is now possible to automatically and dynamically manage the inherent structure of information as well as its content (as in libraries) and context (as in archives).

Many believe that digital information is divided into structured (with relational schema), semi-structured, and unstructured formats. It also is commonly estimated that structured information accounts for less than 20 percent of digital information, and that share is shrinking. However, all information has structure, without which it would have no meaning. For example, when a message is encrypted (i.e., the structure is altered), it still has content and context but no meaning absent the key to decode the structure. Dr. Berkman contends that unstructured information is a misnomer for information that really is “unmanaged” with metadata, markup, or databases that create relational schema. Moreover, information management with these current methods does not scale to the volume and increasing rate at which digital information is being produced. In addition, the accelerating production of digital information will result in an exponential growth in the volume of metadata relative to the actual data, which will further engender unrealistic costs. Lastly, since there are 2^N - 1 possible combinations (non-empty subsets) of N objects, current technologies cannot achieve the full integration capacity of digital information by subjectively identifying relationships or associations on the front end. Consequently, it is necessary to think beyond metadata, mark-up, and database strategies to realize the full potential of digital information.
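The count quoted above, 2^N - 1, is the number of non-empty groupings (subsets) that can be drawn from N objects. The short sketch below enumerates them for small N to confirm the formula and then shows how quickly the count grows. The function name nonempty_groupings is hypothetical and chosen for illustration.

```python
from itertools import combinations


def nonempty_groupings(items):
    """List every non-empty subset of items, from single items up to the full set."""
    items = list(items)
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Enumerate directly for small N and compare against the closed form 2^N - 1.
for n in range(1, 8):
    counted = len(nonempty_groupings(range(n)))
    assert counted == 2**n - 1

# For larger N only the closed form is practical.
for n in (10, 20, 30):
    print(n, 2**n - 1)
```

Three objects already yield seven possible groupings, and thirty objects yield more than a billion, which is the scaling argument behind the claim that relationships cannot all be identified by hand in advance.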
With effectively instantaneous and infinite access to digital information in our Internet age, the challenge is now being able to objectively integrate information, independent of scale, based on user-defined criteria to discover knowledge. Information has three indivisible units that all are required to derive meaning – content, context, and structure – which Dr. Berkman calls the Physics of Information. Dr. Berkman further contends that the only scalable strategy to address the information overload problem is to objectively use the inherent structure of the digital information.
For the benefit of the NSDL and other federal science-technology projects, Dr. Berkman presented a proposal to create an Interagency Sustainability Task Force under the auspices of CENDI. This task force would continue to address sustainability from the program, project, user-community and technology perspectives using the educational scope of the NSDL and the interagency science and technology scope of CENDI. Moreover, this task force would directly support interagency research and development of priorities that have been jointly outlined by the Office of Science and Technology Policy and the Office of Management and Budget.
“NSF Showcase ~ Extreme Makeover: Reinventing the NSF Web Site” Mary Lou Higgs and Curt Suplee, National Science Foundation
NSF’s current effort to redesign its web site aims to achieve a consistent look and feel and to communicate directly to the taxpayers through the web site. In the current environment, there is a great deal of diversity among the web sites for the directorates and projects. Also, the current site is intended primarily to point scientists to grant opportunities. As a member of the public, you wouldn’t learn anything about NSF if you didn’t already know what NSF does. It is difficult to see what NSF does with regard to a particular discipline, because the information for a particular discipline is often scattered across parts of the NSF organization, and therefore, across the web structure.
The goal in the new design was to visually and textually describe immediately what NSF does. The real estate spent for researchers on the homepage has decreased dramatically with the majority of that information provided through an extensive linked portal for researchers. Each panel will eventually be a news item. Much of the content is generated from a database via the content management system.
The Research Overview is a category in the taxonomy. The content will be filled as needed and refreshed once a day. This highlights the major areas of science and NSF’s role. The TV, films, and museum exhibitions supported by NSF are highlighted through a special event and calendar link. Each discipline has an interactive feature. Twelve overviews have been created using a public approach based on the big questions that NSF is exploring, rather than the traditional organizational breakdowns.
Another category is Discoveries, which are derived from the GPRA nuggets that are submitted to OMB. Discoveries are no longer news but they have long-term value. They are subcategories of a genuine, measurable scientific or technical advance that resulted from research that originated at NSF. These discoveries go all the way back to 1950 and the beginning of NSF. There are approximately 36 discoveries so far, with 150 expected by the end of 2005.
Special Reports sum up several news items. Explanatory animations are used. They are not just window dressing but informative and entertaining.
Achieving this redesign required significant effort on the business end, including addressing NSF cultural issues. Two groups were created. The Web Implementation Group (WIG), made up of the NSF web managers, focused on content for the research community. The Web Advisory Group (WAG), a group of senior managers across the agency, focused on policy issues and the public orientation. The groups met every week for a year. Consensus was generally needed, especially within the WAG. The process promoted buy-in across NSF. The webmasters have some flexibility. “e-Publish”, a custom-built Java application to manage content, was developed. This eliminated hundreds of static pages. Quality issues quickly came to the fore.
The biggest complaint on the old site was the lack of a good search engine. Verity K2 provides fielded search capabilities and the Google Search Appliance is also used. The search is tabbed so the user can repeat the same search on various pages.
Usability testing has already been done and the results are much improved. Three rounds of focus group testing were conducted. The first was a high level test with NSF panels, followed by a scripted search using an “incomplete proto-site”. This testing used NSF staff, external principal investigators, and members of the general public. A second round was conducted after changes were made. The American Customer Satisfaction Index survey was conducted on the old site. The results will serve as a baseline for conducting a new survey after the site has been fully launched, rather than providing explicit design directives.
The site-wide template structure supports development of pages that are 508-compliant. UsableNet (a LIFT Transcoder for text-only viewing) is also used.
A Help Center was developed for principal investigators during the transition. This includes frequently asked questions and a list of the changes that have been made.
The new design is currently in beta testing. They have received several hundred comments including some corrections to information that has been on the site for a long time. NSF expects to move the site to production in late January, and to formally unveil the site at the February 2005 AAAS annual conference.
CENDI Meritorious Service Award
Dr. Warnick presented the Meritorious Service Award to Bonnie Klein for her work with the Copyright Working Group and in particular with the development of the Copyright Frequently Asked Questions. The FAQ has brought significant visibility to CENDI. It is used as a reference both inside and outside of the government, and is a valuable tool for lawyers and operations staff. Ms. Klein thanked the CENDI members for the award and for their support of the Working Group's activities. She noted that rights management is an important component of government information management, especially in the digital environment.
The Meritorious Service Award will be presented to Dr. Simon Liu at the next meeting since he was unable to attend this meeting.