March 8, 2012
ENVIRONMENTAL PROTECTION AGENCY
Federal Building Ariel Rios North (Meeting Room B-305) 1200 Pennsylvania Avenue, NW, Washington, DC 20460
9:00 am - Opening and Introductions
Lisa Weber, Director, Information Technology Policy and Administration, NARA, and CENDI Chair
Jack Puzak, Director, Office of Science Information Management, ORD/EPA
9:15 am - 11:00 am
“The Future of Taxpayer-Funded Research: Who Will Control Access to the Results?” [presentation]
11:00 am - Host Showcase – Environmental Protection Agency [presentation]
» Lynne Petterson
- Scientific Data Management: Progress Since the 2010 IWGDD/CENDI/EPA Conference
11:30 am - Group Lunch
Task/Working Group Chairs
Ms. Lisa Weber, CENDI Chair, opened the meeting at approximately 9:10 am. She thanked Lynne Petterson and Jack Puzak for hosting the meeting. Dr. Puzak, Director of the Office of Science Information Management (OSIM) within EPA’s Office of Research and Development (ORD), and CENDI principal, welcomed the CENDI members.
OSIM is responsible for providing information technology, information management, and collaboration tools for EPA's 1,800 researchers in 13 locations across the country. The researchers demand cutting-edge tools and equipment. He described several examples of programs, including HERO, a set of tools being developed to mine the literature and gather in one place the research publications used and cited in regulations and risk assessments. It will have a public-facing element that lists all the citations used, giving the public a better idea of the science behind EPA's decisions. ToxCast (the Toxicity Forecaster) provides rapid and relatively inexpensive assessment of chemicals and other pollutants, helping to predict adverse effects without going through the laboratory science. A public-facing component allows a level of collaboration.
ORD Director, Paul Anastas, who is returning to Yale, led the restructuring of the research program. The Path Forward Initiative has led to a program that is more multi-disciplinary, working across teams, programs and laboratories. This level of collaboration is a challenge for OSIM and for the researchers. It is forcing them into innovative ways of doing things and more involvement of individual research scientists in tool development. OSIM is finding different ways to make information more readily available to wide sets of researchers inside and outside EPA.
Stewardship of the Results of Federal R&D
This report was interested in the impact of the National Institutes of Health (NIH) public access policy on the production and dissemination of high quality research, rather than the impact on publishers’ economic and business models. Access is the most important part of the title. The current NIH directive requires that a manuscript arising from NIH-funded research and accepted in a peer-reviewed publication must be available within 12 months after publication. NIH is granted a non-exclusive license. The mandatory requirement has increased compliance compared to the previous voluntary effort. It was noted that this study was not funded by NIH or NLM.
It costs about $4 million/year to operate PubMed Central. This is not a large part of the NIH budget. It covers approximately 90,000 NIH-funded articles per year, but this is only about 15 percent of the biomedical literature that goes into PubMed. There are also articles that are not in PubMed at all. Other taxpayer-funded extramural research grants are valued at about $30 billion.
The Research Works Act (RWA) did not address the NIH policy directly, but it attempted an end-run around it by creating a new publisher's right to refuse deposit in PubMed Central. The RWA has since been withdrawn, a gesture acknowledging that taxpayers have a right to the research. The two contending pieces of legislation have converged somewhat, with a bit more give toward public access. The Federal Research Public Access Act (FRPAA) would extend the NIH directive to other agencies and shorten the embargo period to six months.
While most of the discussion has been around journal articles, the discussion should not be article-centric. Articles are of greatest use to non-researchers and to those who are not hard-core researchers in a particular area. Researchers in a given area are more interested in the underlying data, tools, protocols, and pieces of the article than in the entire article. This has not been the focus of discussions so far. We need to think beyond articles and “unpack” them.
Dr. Maxwell reviewed some of the latest economic research on openness. One paper compared citations in follow-up research of the Biological Resource Centers versus closed archives. Open access got more citations. There are cases where closed access repository materials moved out into open access and the citation rate increased by 50-125 percent. The people citing the original article varied as well. When looked at through citation metrics, open access gives more bang for the taxpayer’s research buck.
This was citation-based research which has some issues, so the second paper looked at research on an open mouse strain versus one where the patent was strictly enforced. There was more research on the open mice than on the closed mice. Openness increased the overall flow and resulted in more diversity from the single idea. This is important because research is not linear and diversity is better. Citations were more likely to be found in applied as opposed to basic research journals. In other words, the openness got research results closer to commercialization than closed research.
Another article looked at the links between phenotype-genotype research and commercialization. There is a direct link between the sequencing result, the genotype, and commercialization. Intellectual property reduces the diversity of scientific experimentation; reductions are on the order of 30 percent and persist for a long period of time. Openness moves science faster.
Many unforeseen findings and commercial outcomes are the results of unforeseen contributors. For example, InnoCentive started as a part of Eli Lilly and Company (pharmaceuticals). When InnoCentive opened the data to include unforeseen contributors, they had over 80,000 solvers who signed up as independent researchers. Most of the winning solvers came from outside the field, emphasizing the significance of “out of the box” thinking. Openness goes beyond the “local search” phenomenon; the results from people in your immediate circle may create an adequate solution but not necessarily the best one.
In terms of the dissemination of journals, unforeseen contributors and those outside the original field of research often don’t have good access. What if I’m not in the field or if my institution doesn’t subscribe to journals in that area? We need to think about serving people who are beyond the field and the value of doing so. Access to those in the private sector is also very limited.
The clear take away from the literature review is that increased openness has significant and demonstrable benefits. There is a problem with access in the current environment and it has real world consequences.
Currently, there are 7,300 open access journals. These journals have created new jobs. The more people use the results of government research, the better the return on public investment. In addition, accessibility makes it possible to see what happens to government research, making our enterprises much more useful and adding to our ability to assign value to government scientific and technical research and information management.
Those who speak against open public access forecasted the demise of publishing and of the journal as we know it. None of these forecasts have come true. There has been an increase in the number of journals since 2008. The number of subscription journals has gone up, which is normally not a sign of weakened demand or supply. There is no evidence of journals going out of business because of these policies. Surveys do not show that people are canceling because of this policy, but rather because of the general economic downturn. He won’t say that nothing negative will happen to traditional publishing in the future, but it is likely to have more to do with the Internet, which impacts all media companies. The question is how to organize to get the most out of the production and dissemination of scientific knowledge. How can taxpayers get the best return on their investment?
There are other issues around the particular models that work best for a given field. Dr. Maxwell asserts that it does make a difference. There are also questions about what can be done with what is accessible, the length of the embargo, etc.
Dr. Maxwell proposed that we start with the presumption that openness is the best and then look at some of the intellectual property issues that might modify this approach. It was suggested that how to move an open policy forward within an agency and who needs to be convinced of its benefits might be an interesting point for future CENDI discussions.
Dr. Lynch started a term as one of the two chairs for BRDI. Many of the issues on the BRDI agenda are in common with the CNI agenda. There are also many commonalities between BRDI and CENDI in the areas of scientific data. While the Board coordinates on the national side, it also channels into the World Data Center systems and other international activities. Data takes on diplomatic implications in the international environment as the problems become global.
One set of issues focuses on making the case for funding the management and stewardship of scientific and scholarly data. This is complicated with a lot of political, scholarly, and social ramifications. The integrated nature and variety of research artifacts creates management issues. If we separate them, we are going against the holistic view that is needed. BRDI has tried to get at this issue through several symposia.
BRDI is also interested in the research process itself. Can looking at and using data in interesting ways help in multidisciplinary enterprises? Can it help to speed the progress of science? How does scientific discovery turn into jobs, products, and a better economy? The answers to these questions are hard to document other than by anecdote, but we need to continue to work on this.
A closely related issue is the funding of stewardship. This is challenging because we aren’t very good at predicting the long term cost of managing data. We are getting better at it, and have some base of knowledge now through some significantly long lived data archives. However, there are issues of scale and semantic migration over time. The issue of keeping the data comprehensible is even more difficult when the basic scientific understanding shifts over time. How often do we need to make adjustments to fit data into a different ontology from the one in which it was born?
Related to the question of how to fund stewardship is the question of prioritization. We have more data than we have stewardship resources. There is a need for triage approaches, but our bases for making these decisions are extremely shaky. Sometimes we can recreate the data but it is expensive. Sometimes there are ethical and political issues that stand in the way of recreating the data. There is some notion that observational data should be privileged because it is intrinsically irreplaceable. However, we don’t know how to make these decisions consistently, especially across disciplines.
Some incentives are being worked on, including the ability to cite data. BRDI and CODATA have been very involved in looking at the issues related to data citation. DataCite is trying to build specific solutions. This is a very important underpinning. A whole conversation with journal editors is needed on citations and norms for explanations of the data and where it comes from when it is referenced in an article. Discussions are also needed with funding agencies with regard to the strings that should be attached to their funding. These activities are continuing to evolve. It will also be important to understand and measure progress. How do you measure compliance with and the effectiveness of data management plans?
Embargo of data for the use of a particular investigator seems to be getting less commonplace; the use of long or indeterminate embargo periods is diminishing. Post-publication embargo of underlying data is also increasingly unacceptable. It is noteworthy that some foundations that fund research on specific diseases are requiring the sharing of data immediately, subordinating the publishing interests of researchers to the need to advance cures for the disease.
A final set of complex issues focuses on intellectual property, especially in the international environment. What is the situation in the US and then, what is the situation when the data are used or shared across borders? Intellectual property is an area with interesting mythology especially in the data area. A number of faculty members will assert that it is their data and they have the intellectual property rights. However, it isn’t clear whether rights can be asserted in data, especially via copyright. What does the contract between the faculty and institution say? What about the contract with the funding institution? It would be beneficial if we could make some steps in this area to try to clarify a lot of the vagueness that currently exists.
On the international side, it is even murkier because databases have a different intellectual property regime in Europe and in other countries than in the US. The second international issue is data sharing reciprocity. There are some countries that use our data but want to treat their own as proprietary. BRDI will be working with CODATA to begin forums on these issues.
Another set of issues, which BRDI has largely stayed away from, deals with human subjects and personally identifiable information. This will be a major issue in the US as sequencing becomes cheaper and we move toward e-medical records, which are in private hands and not in national hands (unlike the UK). Big databases hold the promise of data mining, discovery, healthcare economics, etc., but may be at odds with the kind of culture around human subjects that is much more restrictive, protective of privacy, and limiting of re-use scenarios. He expects this to turn into a major public policy issue.
Dr. Lynch believes that the National Academies have an incredible convening capability on these really difficult issues. He hopes that BRDI will be able to help. He gave several examples of what BRDI is doing, including a study on workforce issues in data management and curation. Those CENDI members who are interested can sign up to find out about meetings, many of which are public.
Several of the agencies indicated that it would be interesting to share information about the policy issues that were raised by Dr. Lynch’s presentation. Ms. Carroll reminded the group of the matrix that informs CENDI members about their agency representatives to BRDI. The chairman of CENDI has made presentations at BRDI, but the question is how to make this communication more regular. Dr. Uhlir agreed that agenda and studies could be shared with CENDI. It was also suggested that the common interest in the value proposition would be a good one for CENDI and BRDI to focus on jointly. What do we need to do in the data world to make the content more accessible to people? How do you make it possible to speak the language of the data? How much does the re-user need to learn to speak the language of the source discipline? We don’t yet understand what is feasible. There is a lot of research and experimentation that needs to be done.
Host Showcase – Environmental Protection Agency (EPA)
At the time of the 2010 IWGDD/CENDI/EPA-sponsored conference, OSIM was only two years old. In the intervening time, they have spent a lot of effort identifying the current state of data management. ORD performs the lion’s share of research that backs up the regulations and policies of EPA. They perform different kinds of science and the volumes of data are growing.
The emphasis on data management falls in line with several outside influences such as the America COMPETES Act, the recommendations from the 2010 conference, and the ORD IM/IT Strategic Action Plan. There is also a draft EPA Data Policy. When the researchers were asked about data sharing, the number one thing they wanted was open access. However, this seems to be a very generational thing. ORD is becoming increasingly multi-disciplinary and multi-geolocational based on the Path Forward Initiative. Areas such as human health and ecosystem research will continue to generate a lot of data.
OSIM staff talked with researchers, quality managers, and science managers, and many expressed a need for help with data management. A lot of inefficiencies and inconsistencies were identified. Much of the data is written once and read never. There is a lot of data on the network, but there is also a lot on thumb drives or external hard drives that are never backed up. Expensive data was being acquired redundantly because researchers did not know what they had. This creates tremendous barriers to collaboration. Sometimes it is easier to redo the study than to try to find the existing data.
In formulating their data management policy, OSIM involved the people who would be most impacted. The work group included 45-50 people from across ORD. They began with two underlying principles: do no harm and one size does not fit all. There were heated debates throughout the six scheduled meetings and two ad hoc meetings. Information was disseminated before and after meetings, there was homework, and the notes were maintained on the collaborative site, the EPA Science Connector. This will provide documentation for how they came up with the policy; and they found that managers wanted to go through all the meeting notes.
From the discussions, they achieved consensus on a number of key principles: data are critical agency assets, with value and costs, and data management should be done with a consistent organizational approach.
The procedures start with creating a scientific data management (SDM) plan. You must document what you do. If the data will have an impact on regulations or involve human health, it will likely have a more detailed plan. It is likely that throughout the life cycle of the data, the plan will change; therefore, the plan is intended to be a living document.
Data must be identified with metadata. In a complementary project, OSIM is developing a managed vocabulary that is very loose but has definitions that work across disciplines. This will be used by systems and search engines to support data discovery.
For computationally intensive data or large datasets, it is important to identify the storage needs, access control, and file naming conventions. Data is being retained according to a records management framework. Ultimately, reuse of the data is enabled if there is a package of the data, the metadata, the data dictionaries, documentation, etc.
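The reuse package described above can be sketched in code. This is a minimal illustration only; the manifest field names, file names, and validation rule are assumptions for the example, not an EPA/OSIM specification:

```python
from pathlib import Path

def build_data_package(data_file, metadata, data_dictionary, docs=None):
    """Bundle a dataset with the artifacts needed for reuse.

    Returns a manifest dict; all field names here are illustrative.
    """
    manifest = {
        "data": str(data_file),
        "metadata": metadata,                # descriptive metadata record
        "data_dictionary": data_dictionary,  # column name -> definition/units
        "documentation": docs or [],         # README, methods notes, etc.
    }
    # A package is only reusable if the core pieces are all present.
    missing = [k for k in ("data", "metadata", "data_dictionary") if not manifest[k]]
    if missing:
        raise ValueError(f"package incomplete, missing: {missing}")
    return manifest

# Hypothetical dataset used purely to exercise the function.
package = build_data_package(
    Path("gulf_breeze_salinity_2011.csv"),
    metadata={"title": "Gulf Breeze salinity survey", "year": 2011},
    data_dictionary={"salinity_psu": "Salinity in practical salinity units"},
)
```

The point of the sketch is that the data file alone is not the unit of reuse; the manifest travels with the metadata and data dictionary so a later reader can interpret the columns.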
When they started, the guidance had 22 pages. They are now compiling illustrative examples and best practices. The guide is now 113 pages.
About 60 people were interviewed throughout the organization and they found a lot of variability and inconsistency in the data management that is going on. People are confused about records management and are asking for guidance. There has been a lot of follow-up with people and buy-in has been achieved because they cycled back to people they had talked with before.
The goal is to start small. In the beginning, OSIM may do the heavy lifting for the data managers and researchers in order to make a very low barrier for entry. They want to achieve results quickly. Early adopters have been identified. They are working with them to show that it really isn’t as much work as it might seem and that collaborative teams can save time because they can find data and have decisions such as naming already worked out.
Now OSIM is beginning to look at concurrent tasks such as grant and cooperative agreement language, procedures during employee departures, and possibly creating BPAs for contractor support in data management. A key question is what will be the ongoing role of OSIM?
Questions and issues continue to arise as they work with the early adopters. For example, on a recent trip to the Gulf Breeze Ecology Laboratory, they found different tools and applications, issues around iterations of “found data” (i.e., data that are used but not owned by EPA), and questions about who should review the SDM plans especially in a multidisciplinary environment. SDM is being set in the broader information realm. The goal is to ensure accessibility, usability, and awareness. SDM is a very collaborative activity, so the researchers will own the process. OSIM now has a Director of Communications.
Data.gov was launched on May 21, 2009. The goal was to focus on data. While data in the government is broader than scientific data, they did spend a lot of time talking with scientific data managers. The initial goal was to make as much data available as possible, but now they are coming back to issues of how to work with the agencies more effectively and managing the process from approval to publishing and inclusion in Data.gov.
A key step has been building communities of interest or subject areas. Many cross-agency domains such as health, energy, public safety, etc. are represented. Drupal is used and a discussion area and blog are provided, along with a list of the datasets that apply to that community and links to other resources. Data.gov is reimbursed for these community spaces. Agencies share the costs of $30K for a static community site and $100K if they want all the features. Applications, including mobile, can be added with some of them provided by the agencies themselves.
At the UN General Assembly in September, an international partnership was formed. Data.gov and another portal from India will work together to jointly develop the Open Government Platform, open source code that will help other governments to quickly bring up their own data.gov sites without having to develop them individually. The concept of data.gov is growing across the globe with 28 countries building their own sites. When doing this, you are actually spreading democracy across the globe as well.
Engagement with stakeholders to make the data more discoverable and useful is continuing. Data.gov has a metadata working group which is developing a core set of metadata and a framework for country and domain extensions. Data.gov is focused on XML. They also have an API for the XML-oriented data. The challenge is that some agencies have done a good job already of API key management, so Data.gov wants to take advantage of this by centralizing the management of existing keys across the government. Only one privacy statement would be needed instead of one for each agency. Metrics could be fed back to the agencies based on the use of the managed APIs by users/systems.
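The core-metadata-plus-extensions framework described above can be illustrated with a small sketch. The field names, the namespacing convention, and the validation are all assumptions made for the example; they do not reflect the actual Data.gov metadata schema:

```python
# Hypothetical core fields shared by every record; extensions add to these.
CORE_FIELDS = {"title", "description", "agency", "keywords", "access_url"}

def extend_record(core_record, extension, namespace):
    """Merge a country- or domain-specific extension into a core record.

    Extension fields are namespaced (e.g. 'health:icd_codes') so they
    cannot collide with core fields.
    """
    unknown = set(core_record) - CORE_FIELDS
    if unknown:
        raise ValueError(f"not core metadata fields: {sorted(unknown)}")
    merged = dict(core_record)
    for key, value in extension.items():
        merged[f"{namespace}:{key}"] = value
    return merged

# An environmental-domain extension layered over a core record.
record = extend_record(
    {"title": "Air Quality Index, 2011", "agency": "EPA"},
    {"pollutants": ["ozone", "PM2.5"]},
    namespace="env",
)
```

The design choice the sketch highlights is that a fixed core keeps cross-agency search consistent, while namespaced extensions let each domain or country add fields without breaking the shared schema.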
Data.gov is investigating the semantic web. They have a triple store of 65 billion triples and SPARQL queries are available against their data. They are trying to figure out how to position Data.gov with regard to linked open data. Work has been done to harmonize a set of data elements, but only limited success can be achieved because members of the data community do not use a lot of shared data standards. Mr. Royal is hoping that Hadoop for data searching will provide some features in this area. The current triple store uses a licensed version of Virtuoso.
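A SPARQL query against such a triple store might be issued as sketched below. The endpoint URL and the choice of Dublin Core terms for the catalog vocabulary are assumptions for illustration; the actual Data.gov store and its graph model may differ:

```python
from urllib.parse import urlencode

# Hypothetical endpoint URL, for illustration only.
SPARQL_ENDPOINT = "https://example.data.gov/sparql"

def dataset_query(keyword, limit=10):
    """Build a SPARQL SELECT over a catalog modeled with Dublin Core terms."""
    return f"""
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT ?dataset ?title WHERE {{
            ?dataset dct:title ?title .
            FILTER(CONTAINS(LCASE(?title), "{keyword.lower()}"))
        }} LIMIT {limit}
    """

def request_url(keyword):
    # A GET request against the endpoint carries the query as a URL parameter.
    return SPARQL_ENDPOINT + "?" + urlencode({"query": dataset_query(keyword)})

url = request_url("water quality")
```

The sketch shows why shared data standards matter for linked open data: the query only works if publishers agree on a common predicate (here `dct:title`) for the same concept.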
At this point the architecture is still based on a lot of “data at the agency,” but they have begun to experiment with Socrata, which requires that the data be held by Data.gov. Socrata provides online visualization, charts and graphs, browsing within the native datasets, etc. This is called the “interactive catalog.” Ultimately, they would like to have a brokered search so that datasets behind an agency web site could be pulled into data.gov on-the-fly and then used by Socrata. Other potential changes to the architecture include the federation of catalogs, so that if the user doesn’t find a dataset in a federal catalog, the user would go to the catalogs of state and local governments.
The Digital Strategy is not just about datasets but data as an asset and what this actually means to agencies. The paper will be issued in draft soon, and there are expected to be assignments for the Office of Management and Budget (OMB), General Services Administration (GSA), etc., based on the recommendations. Mr. Royal expects this to help frame what Data.gov 2.0 should be. It would be useful if CENDI could provide concepts for Data.gov 2.0 in the area of science. Ms. Weber pointed out that the CENDI Grand Challenge paper might be useful in this regard.
The second international conference on Open Government Initiatives will be held in Washington, DC, at the World Bank on July 10-12, 2012. Mr. Royal will send CENDI members an invitation through Ms. Carroll. They are still looking for appropriate speakers.
Action Item: Ms. Carroll will provide to Mr. Royal a copy of the CENDI Grand Challenge paper.
Action Item: Mr. Royal will provide Ms. Carroll with an invitation for CENDI members to attend the Open Government Initiatives conference. She will forward it to the members. The invitation may include a session on the program regarding the CENDI Grand Challenge.
The open technical portion of the program adjourned.