CENDI PRINCIPALS AND ALTERNATES MEETING
National Science Foundation
Ballston, VA
March 10, 2009

Final Minutes
“DataNet: A Sustainable Digital Data Preservation Network” Dr. Sylvia Spengler, IIS Program Manager, National Science Foundation
“The Importance of Cyberinfrastructure for Research and Education” Dr. Edward Seidel, Director, Office of Cyberinfrastructure, National Science Foundation
“National Science Foundation Showcase” Dr. George Strawn, NSF

Attendees


Members
Cindy Etkin (GPO)
Eleanor Frierson (NAL)
Glenn Gardner (LOC)
Tina Gheen (NSF)
Donald Hagen (NTIS)
Ellen Herbst (NTIS)
Sharon Jordan (DOE/OSTI)
Carla Patterson (NARA)
Roberta Shaffer (LOC)
Dr. Elliot Siegel (NLM)
Dr. George Strawn (NSF)
Dr. Walter Warnick (DOE/OSTI)
Lisa Weber (NARA)

Speakers
James Graham (NSF)
Stacy Roy (NSF)
Dr. Edward Seidel (NSF)
Dr. Sylvia Spengler (NSF)
Dr. Henry (Hank) Warchall (NSF)

Secretariat
Bonnie C. Carroll
Gail Hodge

Observers
Prue Adler (ARL)
Ron Bluestone (LOC)
Kelly Callison (DOE/OSTI)
Todd Carpenter (NISO)
Theodore Defosse (GPO)
Rachel Frick (IMLS)
Larry Lannom (CNRI)
Joan Lippincott (CNI)
Brand Niemann (EPA)
Deborah Ozga (NIH)
Helen Sherman (DTIC)
Paul Uhlir (NAS)
Kathleen Williams (NARA)
Dr. Fred Wood (NLM)

Task/Working Group Chairs
Michael Pendleton (EPA)
Vakare Valaitis (DTIC)

Via Teleconference
Lynne Petterson (EPA)
Gail Rayburn (Johns Hopkins Applied Physics Lab)
John Sykes (EPA)

 

Welcome

Ms. Herbst, CENDI Chair, opened the meeting at 9:20 am.  She thanked the National Science Foundation for hosting the meeting and providing the speakers. Due to scheduling issues, the business meeting was held in the morning and the technical session in the afternoon.

DEVELOPING THE CYBERINFRASTRUCTURE

DataNet: A Sustainable Digital Data Preservation Network (link to presentation, .pdf)
Dr. Sylvia Spengler, IIS Program Manager, National Science Foundation

Science is driven by Grand Challenges from within the sciences. Challenges differ by discipline, but a few underlie them all, and digital data is one of these. As the Long-Lived Digital Data Report concluded, technology has created a fundamental change in science, and digital data is at the heart of this change. The President’s Council of Advisors on Science and Technology (PCAST) Recommendations on Digital Data and the NSF-supported Experts Study called “To Stand the Test of Time” are key documents that have informed the NSF data activities. The DataNet effort spans a number of different NSF offices, but the lead is shared by the Office of the Director and the Computer and Information Science & Engineering Directorate.

Preserving and sharing is at the heart of the DataNet Program. It is what makes science run. There are several key reasons why data should be preserved and shared. Some data is irreplaceable. Data is needed to replicate results, provide input for longitudinal studies, support training and testing of models, and to address the grand challenges of interdisciplinary science. Data also broadens the participation in the conduct of science, including its use for educational purposes.

The three primary goals of DataNet are to preserve data and provide long-term access, create systems and services that are economically and technologically sustainable, and empower science-driven information integration. 

The Program will make five to six awards, including two this year and two to four next year. Each award will be approximately $20 million over five years. Proposals must cross more than one domain, involve diverse content, and address issues of sustainability. They should integrate library and archival science, computer science, and other scientific domains. Proposals should push the frontiers. Key issues include who the users are now and who they might be in the future. The call specifically excluded digitization. Projects must provide for full life-cycle management, including data deposition and acquisition; data curation; metadata management; privacy and security; data discovery, access and dissemination; data interoperability and integration; data evaluation and visualization; related research; and education and training activities. Community and user input must be collected and assessed. International involvement is sought, but the non-US partners must obtain funding from their own national sources.

While there are science questions in the DataNet Project, the evaluation will include how well the partners are drawn in to address the challenge. Science must be the driver for the project, the project must address all requirements, and it must be of interest to at least two NSF directorates. The FY09 DataNet competition received 23 preliminary proposals in November. Seven of those proposals were invited to submit full proposals by May 15, 2009.

Two award recommendations were approved by the National Science Board in December 2008. The Johns Hopkins University’s (JHU) Data Conservancy Project is led by the head of the JHU Libraries. It seeks to build on the library’s success in managing the Sloan Digital Sky Survey and the National Virtual Observatory. Its vision is that what works in astronomy can be transferred to different areas of science. The project will also look at the role of university libraries with regard to data, changes needed in science information culture by working with publishers and others regarding incentives for the deposition of data, and extensions to the curriculum in library and information science to address the needs for data scientists and their training.

DataNetOne, led by the University of New Mexico, is focused on earth observing networks. It seeks to enable new science through universal access to data about life on earth and the environment that sustains it. DataNetOne will integrate existing and future earth observing networks with a strong emphasis on user involvement. This project is driven by research challenges in climate change and biodiversity. Other partners include UC Santa Barbara and University of Tennessee/Oak Ridge National Laboratory. This effort is using Service Oriented Architecture (SOA) as a starting point.

Ultimately, it is hoped that the two project teams will work together since they are both dealing with observational data. The goal is a network of network partners, both national and international. The DataNet Project is also seeking to engage federal agency partners to think about cultural changes and interoperability across collections. DataNet will integrate with other Cyberinfrastructure activities including sustainability plans during Phase 2 after Year 5.

The Importance of Cyberinfrastructure for Research and Education (link to presentation, .pdf)
Dr. Edward Seidel, Director, Office of Cyberinfrastructure
National Science Foundation

The Transformation of Science can be seen in both the size and complexity of research teams and their output. Back in 1972, Stephen Hawking’s theory about black holes was developed by one person creating a relatively small set of data. Two decades later, this data could be computed quantitatively on a supercomputer with a team of ten people. In 1998, the follow-up involving 3-D collisions, visualization, and the development of parallel algorithms was done with a 15-person team that created about 50 gigabytes of data. The investigation of gamma ray bursts will generate petabytes of data and involve theoreticians using data from a variety of sources and disciplines. The LHC (Large Hadron Collider) experiment, which seeks to determine the nature of mass, involves more than 10,000 scientists in more than 33 countries generating more than 25 petabytes of distributed data, even after some of the data has been thrown away. A key question becomes how to handle the prodigious amounts of data and how to support the collaborations needed to address complex problems in science.

Grand challenge communities combine data-driven collaborations to address complex problems. Cyberinfrastructure is driven by every field of science and in combination. Science and society are being transformed by data. For example, the response to real-life problems such as Hurricane Katrina could benefit from forecast models. One would like to automatically stream in data to each of the forecast models involved. Storm surge, atmospheric, and other models create predictions of storm surges and waves. Traffic patterns are also based on very complex models. These are all complex problems, and the models all need to be coupled to address the real-world problems facing science and society.

Cyberinfrastructure plays a key role. Science is no longer done only by individuals, groups, or teams, but by communities, since no one group can attack these challenges on its own. One might view this as the End of Science, as Wired Magazine proclaimed, or at least science as we know it. Cyberinfrastructure is not just high-end computing. There are social issues involved. Incentives, tenure, and promotion place requirements on Cyberinfrastructure. What Cyberinfrastructure is needed to support these communities? The changes place increased requirements on hardware, software, networks, and tools.

NSF’s Vision is a “national-level, integrated system to enable new paradigms in science.” This vision is based on a number of documents issued since the Atkins report in 2003. Achieving the vision involves making strides in the areas of Virtual Organizations for Distributed Communities, a Computing Roadmap, Data and Visualization Integration, Learning and Workforce, and High-end Computing. NSF has made progress in some areas, and others still need more work. For example, high performance computing (HPC) software and tools still require scalable applications. High-end computing is supported at both the national and the campus level, but it needs to be end to end. Modeling frameworks are making some progress.

The TeraGrid capacity is expanding from 180 teraflops less than two years ago, to more than 1500 teraflops today, to multiple petaflops by 2011. One machine would be the equivalent of what the whole network is now. Processors are becoming the new unit for computing, but applications have lagged behind. In order to address the phenomenal growth of data, it will be important to identify where the data is and to know how to categorize and store it. Data-driven science is coming so rapidly that we don’t know how to do it yet. We do not know how to make the data accessible or how to overlay data from different levels. NSF is trying to take the leadership role in solving these issues. Dr. Seidel believes that the deluge of data is both a crisis and an opportunity for the U.S. to take a lead in science and technology if we address it properly.

Virtual Organizations need to be developed, including social and technical aspects, such as open standards. Is there a way that you can put the information together in order to address these social issues?

Many more programs are needed to exploit Cyberinfrastructure for science. Advanced software and algorithm environments, remote instruments, and the creation of virtual organizations must be brought together. Cyber-learning is the use of remote data connected to secondary education.

The National CI Blueprint is seeking to bring all these efforts together. Networks are a gap in the plan because they are generally funded separately by others. There are 11 resource providers in the TeraGrid, with advanced services. What are they doing beyond this? TeraGrid Phase 3 involves explicit visualization services. This is currently being competed. It is possible to stitch together a lot of spare power through the Open Science Grid, which is one step in integrating the campus and national cyberinfrastructure.

Computational Science is the third pillar of science and engineering. The Office of Cyberinfrastructure (OCI) has a very good high-level document. Some issues have started to be addressed to greater or lesser degrees. Organizational structures both in government and outside are outdated. Dr. Seidel wants to put more emphasis on the people side. It will be important to help universities align with these new goals. OCI needs to do a better job of keeping and preserving the software investments. There are a lot of commonalities among NIH, NSF, DOE and other cyberinfrastructures that would benefit from joint program funding. There is also work in Europe and Asia, where they are looking for coordination.

Cyberinfrastructure fundamentally changes not only the way science is done, but how it is shared and disseminated. Open access to data must be provided in a sustainable way. There are issues about when the data becomes available. There is a group within NSF working on data and publication openness across different traditions. A general policy is needed, and then communities can choose what they want to do within that framework.

“National Science Foundation Showcase” Dr. George Strawn, NSF  

 “Research.gov” – James Graham and Stacy Roy

Research.gov is a consortium for research-oriented proposal management across federal agencies. NSF modernized FastLane and made it available to other agencies. Current users include NASA, DoD, and USDA. NSF is in discussions with NIH as well.

Research.gov is a menu of services that is intended to be collaborative across agencies, principal investigators (PIs), and institutional partners. The Research.gov portal is running BEA/Oracle WebLogic and Interwoven/Autonomy. The system is designed for high availability and performance. Portal technology provides modular services, both basic and advanced, to various audiences. It also facilitates the streaming of content and information from agencies into a single site. It is easy to add agencies. Publishing content is in the hands of the agency. NSF provides a cohesive interface and host-level review. Research.gov covers not only the award application but also the post-award information. The option remains to submit to Grants.gov as well. The content is updated every night.

The focus in December 2007, when Research.gov was originally launched, was only on the information services. The Federal Transparency Act resulted in several grants-related web sites, but they were rather shallow. Research.gov gives more information about the awards. Searches can be done on Congressional District, State, Institution, etc. Awards are blended for all agencies and the user can drill down into the details. This approach goes beyond the Act’s requirements. There are plans to link up to publication citations and patents that result from the grants. This could be done through other services.

The base services include a policy library, news, and events. Partner agencies can participate in this service on an optional basis. It can automatically personalize the policies to the user’s interests. Alerts can be sent when relevant policies change. Program Officers at NSF like the events part of the site. It brings together news and highlights from across the agencies.

Another area, Science and Innovation, highlights in plain English what the research means to the public. They are working on high-level buckets to classify information for the public. Information may be assigned to one or more main buckets. The information is also classified by region, state, and institution.

Every award requires the PI to write a report. Each report has a highlight. They are trying to determine whether to continue to call these highlights. Currently, the news section is at a higher level of abstraction.

They are not prepared to add other agencies to FastLane right now but would entertain it in the future. Development will be prioritized based on user needs. Eventually all the FastLane functions will move to Research.gov, but this will take several years.

“Find Reviewers Project at the National Science Foundation” – Dr. Henry (Hank) Warchall

NSF Program Directors manage the review of approximately 42,000 research proposals annually. This is an average of about 75-100 proposals per Program Director per year. Proposal review includes both ad hoc and formal panel reviews. Approximately 50,000 panelists and ad hoc reviewers are needed each year. It is up to the Program Directors to find the reviewers. The challenge to meet the criteria for selection is quite a tall order. The Program Directors use a number of sources to identify potential qualified reviewers.

The Find Reviewer Service was developed for Program Directors. It is in the fledgling stage of design and has not yet been implemented. The CIO conducted a series of focus group meetings around this topic. The meetings identified five priorities, two of which are addressed by this system. The system would search for Program Director-specified keywords in NSF proposals from prior years and the Web of Science commercial publication database. It provides a single interface for these internal and external sources. There are plans to extend the system to include other external publication databases in the future.

The use scenario is that once a likely reviewer is found, the Program Director can automatically search for his/her NSF PI and/or Reviewer record. The prospective reviewer can be added to the “shopping cart”. Navigation is based on the needs of the Program Director. The system allows unlimited mining of NSF PI and Reviewer records.

The Proposal Search results can be inspected for more details. There are filters that reflect the taxonomy developed from the hits. It is possible to mine proposal data for information about people. At any point you can be led to a display of likely matches. When searching a journal, possible matches of author names to NSF internal records are provided with a confidence ranking since people’s names are ambiguous and people move from institution to institution. There is no globally unique ID for the authors in external publications, but NSF is watching the use of this concept among the commercial publishers.
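The idea of ranking possible author-to-record matches by confidence can be illustrated with a minimal sketch. This is not the NSF system, whose design was not described in detail; it only assumes that each author and each internal record carries a name and an institution, and it uses Python's standard difflib module to score name similarity. The function names, the record fields, the 0.8/0.2 weighting, and the threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and drop periods and commas so 'Doe, J.' and 'doe j' compare equal."""
    return " ".join(text.lower().replace(".", " ").replace(",", " ").split())


def confidence(author: dict, record: dict) -> float:
    """Hypothetical score in [0, 1]: name similarity plus a small institution bonus."""
    name_score = SequenceMatcher(
        None, normalize(author["name"]), normalize(record["name"])
    ).ratio()
    same_institution = normalize(author["institution"]) == normalize(record["institution"])
    # The weights are arbitrary; people change institutions, so a mismatch is not disqualifying.
    return round(0.8 * name_score + (0.2 if same_institution else 0.0), 3)


def rank_matches(author: dict, records: list, threshold: float = 0.5) -> list:
    """Return (confidence, record) pairs at or above the threshold, best match first."""
    scored = [(confidence(author, r), r) for r in records]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Made-up example data, not real NSF records.
author = {"name": "Jane Q. Doe", "institution": "Example University"}
records = [
    {"name": "Jane Doe", "institution": "Other College"},
    {"name": "Jane Q Doe", "institution": "Example University"},
    {"name": "Robert Roe", "institution": "Example University"},
]
matches = rank_matches(author, records)
for score, record in matches:
    print(score, record["name"], "/", record["institution"])
```

Because the institution contributes only a small bonus, an author who has moved still surfaces as a candidate, just with lower confidence, which leaves the final judgment to the Program Director.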

Handouts

“Harnessing the Power of Digital Data for Science and Society.” Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. January 2009.

“DoD TECHipedia: Our Troops Need Your Brainpower.” Factsheet. (Sherman)

“DefenseSolutions.gov: Great Ideas are Out There.” Factsheet. (Sherman)

“DTIC 2009 Conference Announcement.” (Sherman)

“FLICC Awards: Call for Nominations.” (Shaffer)

“LC Science Tracer Bullet: Bibliography on e-Science.” Science Reference Section, LoC. (Gardner)

“Federal R&D Project Summaries.” Brochure. (Jordan)

“Institutional Repositories.” (Carroll)

“Letter for ICSTI Regarding Formal Relationship,” February 8, 2009. (Siegel/Carroll)

“2010 Joint CENDI/FLICC/NFAIS Meeting: Potential Topics.” (Carroll)

“Research.gov.” Brochure. (Graham/Roy)

“Simple Knowledge Organization Systems File Comparisons.” CENDI Terminology Resources Task Group. (Hodge)

“Image Metadata Mapping Project.” Image Metadata Group. (Hodge)

“Current CENDI Working Group Activities.” March 10, 2009. (Hodge)