January 12, 2012
LIBRARY OF CONGRESS
James Madison Building, 6th Floor, Dining Room A, 101 Independence Avenue, SE, Washington, DC 20540
9:00 am - Welcome and Introductions
Lisa Weber, Director, Information Technology Policy and Administration, NARA, and CENDI Chair
9:15 am - 11:00 am
11:00 am - NLM, DOE Synergy in NLM Prize for NLMplus – An Award-winning Demonstration of Semantic Search [presentation]
» Tamas Doszkocs, CEO, WebLib
11:30 am - Host Showcase, Library of Congress
» Blane Dessy and Glenn Gardner
- Linked Open Data Service [id.loc.gov] (Kevin Ford, Network Development and MARC Standards Office) [presentation]
- Inventory and Management of Data Sets (Lillian Gassie and Rod Atkinson, Congressional Research Service) [presentation]
- World Law Library (OWLL) (Bob Gee and Tina Gheen, Law Library of Congress) [no presentation available]
Task/Working Group Chairs
Ms. Lisa Weber, CENDI Chair, opened the meeting at approximately 9:05 am. She thanked Glenn Gardner and the Library of Congress for hosting the meeting. Dr. Roberta Shaffer, previous LoC CENDI Principal and newly appointed Associate Librarian of Congress for Library Services, welcomed CENDI to the Library. She introduced Tina Gheen as the new principal for the Library. Glenn Gardner will continue as the alternate. Dr. Shaffer reflected on the challenges we face in information management. In the new economic environment, we have a need and an opportunity to really decide what is important. We will need to make hard decisions about what to include and what not to include in our collections, what we can do and what we can’t do in terms of services. In all cases, collaboration and cooperation, as exemplified by CENDI, will be important.
Policy Issues and Directions
The NSB Data Policy Task Force was established in February 2010 as a result of a retreat in 2009 that established data as the number one NSB priority. The task force was given a charge and a long list of issues related to meeting the NSF mission, and was specifically asked to look at areas where policy could be applied in NSF but also more generally in other federal agencies.
The Task Force had a lengthy discussion about publishing and dissemination and decided to treat data separately from publishing and dissemination. The public access policy at NLM was being implemented at that time, and the Task Force decided that data was different enough from literature that addressing both together would take too long.
The implementation of NSF’s data management plan requirement for grants raised interesting issues as to what is a good plan. They are continuing to monitor what is going on and to receive comments from review panels and others.
They considered open data and other issues and decided to develop a statement of principles that could lead to guidance and actionable policies. These seven principles are:
- openness and transparency are important;
- open data sharing is closely linked to open or public access publishing and they should work in concert;
- many different stakeholders and interests are involved and they should all participate in the adoption of policies;
- one size probably won’t fit all;
- policies and guidelines are needed along with active management and long-term curation;
- data and data management policies need identification of roles, responsibilities and resourcing; and
- the rights and responsibilities of investigators need to be recognized and reinforced and these vary in different communities.
An Expert Panel was held in March 2011. A quick review of some of the comments was presented, many of which reinforced the principles. The comments addressed were: citation and attribution; interoperability standards including those across disciplines; the need for persistent identifiers; data sharing as a priority but with the need to balance other issues; training, recognition and reward structure as requirements; the need for new funding and economic models for the active management of data; concerns about the level of data that should be shared (raw, processed, analyzed, etc.); and concerns about cost and long term curation.
Storage, preservation and curation, funding sources, and strategic partnerships are critical to data sharing and management. Cyberinfrastructure is necessary to support data-intensive science. This includes geographic distribution of resources, shared applications, standardization, and funding, including an appropriate ratio of infrastructure versus research funding.
The group looked at the broad data policy themes and developed a set of challenges which include a commitment to sharing and a need to change the way data are perceived; reproducibility; and new jobs and areas of expertise which are emerging but are not there yet. The cyberinfrastructure advances need to be deployed quickly. Data stewardship is critical but it is unclear where the responsibilities should live and what the economic/business model might be. There are also economic and legal challenges.
Dr. Griffiths believes that this analysis is only the first step. This group and others will need to come back and review time and time again, and there is much to do. The recommendations from the task group have been narrowed to five:
- Leadership must be provided to federal agencies and other national and international stakeholders.
- Comprehensive archiving must be available for both the data and the methods and techniques described in published, peer-reviewed publications; this is important for reproducing and verifying research results.
- Computational professionals must be supported and computational and data-enabled science must be recognized as a profession. There are people whose careers will be focused on collecting data from others and they need to be recognized as well.
- A panel of stakeholders should be convened to explore and develop a range of viable business models and to address the issues related to maintaining digital data. This involves behavioral models and a realization that business models must be tied to stakeholder communities. We understand that two to four percent of the cost of a research project is needed for traditional publishing. However, we don’t have this level of understanding of the cost of data management.
- The models and infrastructure must be developed and implemented to actually expand sustainable data management. The Task Group was conscious that they did not want this to come across as another unfunded mandate.
NSF is addressing how these recommendations might be implemented. There is a need for agencies to participate from the beginning of this process. She hopes that as agencies look at the results of the task group, they can identify areas where they can participate and help to tackle the issues. This is the only way to make headway more quickly.
The report was made available on December 30, 2011. The comment period was extended and CENDI members were encouraged to comment to her directly or through the task group’s web site.
During discussion, it was noted that the old institutional boundaries for agencies and institutions such as NARA are changing. Dr. Griffiths believes that the institutions continue to focus on the institution because the boundaries are clearer. Unfortunately, this approach results in a 2-tiered system of those that can afford to follow the policy and those that can’t. We need to consider what resources we have that can be brought to bear in a collaborative environment, because it isn’t clear what ownership really means anyway. We never quite get over the sustainability barrier because of these boundaries.
It would be helpful if people in the library, information and academic communities considered data management a pressing national issue. The dilemma for NSF was that it didn’t have the leverage provided by a culture of sharing, such as NLM had with its regional medical libraries when implementing its public access policy. Science indicators show that Asian countries are moving forward quickly in the data management area. This poses a concern over intellectual property issues.
NSF has an internal working group looking at implementation activities. In large-scale and international agreements, more attention is being paid to data accessibility and management. Large awards require accessibility even for null results. Practices are so different across disciplines that community-specific approaches will need to emerge. In the meantime, it is important for CENDI and the agencies to keep track of best practices and do some prototyping. Upcoming meetings will begin to present some of the specifics of implementation from the various NSF directorates.
The policy challenges and opportunities for the ARL members focus on open access and digital repositories, e-science and e-research, 21st Century workforce development, and the challenges and opportunities posed by legislation and Executive Branch initiatives.
ARL uses the Budapest Open Access Initiative definition which focuses on free availability and permits users to perform a variety of activities. These rights are increasingly important for data and text mining because of the amount of information that is available electronically and the fact that scientists can no longer do this kind of data analysis manually. ARL talks about open science, open data and open educational resources. It is now broader than open access journals.
Open access journals are, however, still a major part of the initiative. There are more than 7,000 open access journals, and the number of papers published in them has increased as well. PLOS One is now the largest peer reviewed journal in the world. In five years or less, it is estimated that more than half of the peer-reviewed papers will be in open access journals. Similarly, there is a growth in open access funder policies, with over 200 participating. Open access is becoming the default from academic institutions, agencies and funding institutions. The Coalition of OA Policy Institutions (COAPI) has over 22 institutional members.
The critical infrastructure for digital repositories continues to grow with over 700 repositories internationally. Increasingly, digital repositories have become a part of institutions’ branding to their stakeholders. There is an emphasis on accountability metrics and expanding the infrastructure for data, moving beyond journals and traditional materials.
ARL has established an E-Science Institute to determine how to support the e-science agendas of its members. In November 2010, they put out a call asking for institutions that would be interested in a 6-month self-funded institute around e-research with the outcome of a strategic plan for the institution. They hoped to get 15-20 institutions. The program is in its final stages, and they have 70 institutions participating and more waiting in the wings.
The goals are focused on capacity building and transforming research library services. ARL helps each campus map its services and explore how to move into e-research. The 2½ day capstone session uses a Strategic Agenda Template where the teams come together to identify options that they can provide to their institutions. It is important that research libraries not go off on their own; there is a need for a shared infrastructure and a community driven e-research environment into which they fit. ARL will conduct an assessment of the 70 members to determine the value of the institute. The assessment will focus on local success and on how ARL can help at a strategic level to transform institutions. The E-Science (e-Research) Working Group of ARL will survey the 70 participants to see where they are in 6 months, including an assessment of their investments in digital repositories.
Workforce issues are the next building block in transforming the role of the research library. ARL has an initiative to recruit a diverse workforce and to attract students from racial and ethnic minority groups to research library careers with a focus on science and technology. The Institute for Museum and Library Services (IMLS) is funding 13 to 20 scholars in a class.
Finally, access is a national policy issue, and ARL has been following many legislative and administrative actions. These include SOPA (Stop Online Piracy Act), PIPA (Protect Intellectual Property Act), the Research Works Act (RWA), and the RFIs on Federally Funded Peer-Reviewed Publications and Digital Data. Ms. Adler distributed a table from netCoalition.com showing the key aspects of the various acts and their status.
New legislation has polarized the debate and presents both opportunities and significant challenges to building more infrastructure and open policies. There are a number of opportunities if we build out the repositories, build out on data, participate in new collaborations and advance interoperability. Challenges such as researcher awareness and uncertainty over sustainability still exist. The incentive and reward structures in academic research need to be aligned with the goal of openness.
Ms. Adler believes there are potential collaborations between e-science/e-research in the academic research library community and the interests of the CENDI agencies. She suggested additional conversations about research, workforce development and data issues. This might fit with Chris Greer’s proposal related to repositories and standards (see the CENDI-NFAIS Repository Workshop discussion below).
Mr. Sheehan and Dr. Warnick introduced WebLib and its CEO, Tamas Doszkocs. WebLib received SBIR funding from DOE to focus on semantic search. OSTI provided one of its databases, GreenEnergy, to serve as an example; the GreenEnergy database online now deploys the WebLib product. NLM initiated an “apps” competition, and WebLib was one of the winners. The current status demonstrates not only the use of challenges and prizes but also the benefits of synergistic funding.
Semantic search is a search, question, or action that produces meaningful results, even when the retrieved items contain none of the query terms or the search involves no query text at all. NLMplus is a semantic search and knowledge discovery application designed to tap into the rich content of the NLM in the areas of biomedicine and health. One challenge was the legacy databases, which use different hardware, software, and other technology platforms. The goal is not only to connect them but to bring more out of the combined content than the databases provide individually, with better precision and more relevance.
NLM has a number of different systems that try to bring together information from different systems, but they have been fragmented. For the first time, WebLib brought together access to 60 of the most important databases across these legacy collections. The application makes it possible to discover what exists and in which databases. If you click on any one of the results, you have the full power of that particular system.
PubMed records are actually indexed with MeSH headings. The MeSH vocabulary is only 20,000 terms, but these terms are connected to more than 4 million concepts in the Unified Medical Language System (UMLS). PubMed maps behind the scenes to the UMLS but sometimes the results don’t make sense. To resolve these issues, WebLib used MeSH, the full UMLS and information from other sites like the Mayo and Cleveland Clinics to create a biomedical knowledge base. Normalized keywords from titles, including individual phrases, are mapped to the biomedical knowledge base. All this information is then used in the semantic search approach. This knowledge base of biomedical and non-medical concepts continues to be enhanced.
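The mapping step described above can be sketched in miniature. This is an illustrative toy, not WebLib's implementation: the phrase table, concept IDs, and function names below are invented stand-ins for a knowledge base that, in reality, spans millions of UMLS concepts.

```python
import re

# Toy stand-in for the biomedical knowledge base (entries are invented);
# variant phrases map to the same concept identifier.
KNOWLEDGE_BASE = {
    "myocardial infarction": "C-EX-1",
    "heart attack": "C-EX-1",   # variant phrase, same concept
    "aspirin": "C-EX-2",
}

def normalize(phrase: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", phrase.lower()).split())

def map_title_phrases(title: str, max_len: int = 3) -> dict:
    """Map every 1..max_len word window in a title to a concept ID."""
    words = normalize(title).split()
    found = {}
    for n in range(max_len, 0, -1):          # prefer longer phrases first
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in KNOWLEDGE_BASE:
                found[phrase] = KNOWLEDGE_BASE[phrase]
    return found

print(map_title_phrases("Aspirin after Heart Attack: a review"))
# → {'heart attack': 'C-EX-1', 'aspirin': 'C-EX-2'}
```

The same normalized-phrase lookup can then feed the semantic search stage, since both queries and titles resolve to the same concept identifiers.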
Two approaches were demonstrated in the project. The first is taking a database and indexing it semantically, then making it searchable semantically. This produced the best results as demonstrated by the indexing of 1.6 million PubMed Reviews. The second is to generate a semantically enhanced query that can be used by traditional search systems such as Science.gov and ScienceDirect. This approach is less successful and automatic weights cannot be assigned.
For DOE, WebLib took the ETDE/INIS Energy Thesaurus to augment the queries. The problem is that thesauri are always behind the actual language used in documents. One of the challenges is how to automatically enhance an existing thesaurus so that it reflects the current concepts in science.
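The second approach, thesaurus-based query augmentation, can be sketched as follows. This is a minimal illustration, not WebLib's code: the thesaurus entries are invented stand-ins for ETDE/INIS terms, and a real system would also need to weight the added terms, which, as noted above, cannot be done automatically.

```python
# Hypothetical thesaurus fragment (invented entries standing in for
# ETDE/INIS Energy Thesaurus relationships).
THESAURUS = {
    "solar energy": ["photovoltaics", "solar cells", "solar power"],
    "wind energy": ["wind turbines", "wind power"],
}

def expand_query(query: str) -> str:
    """Build an OR-group of related terms for a query found in the thesaurus."""
    q = query.lower().strip()
    related = THESAURUS.get(q)
    if not related:
        return q                      # unknown term: pass through unchanged
    terms = [q] + related
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

print(expand_query("Solar Energy"))
# → ("solar energy" OR "photovoltaics" OR "solar cells" OR "solar power")
```

The expanded string can be handed to any traditional keyword engine, which is what makes this approach portable across systems such as Science.gov, at the cost of the relevance weighting a semantically indexed database can provide.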
Host Showcase – Library of Congress
Vocabulary and Authority Linked Data Service (http://id.loc.gov) makes LoC-owned or -maintained vocabularies and authorities available as linked data. These include the LC Subject Headings (LCSH), genre forms, the Thesaurus of Geographic Names, the MARC Code List for geographic areas, the ISO language code lists, and the name authority file, which is coming this summer. The service follows the Linked Data principles: identify “things” with URIs, use HTTP URIs so they can be dereferenced, make the data available not just as a bulk download but as individual resources over the web, and link the data to other data (external and internal links).
For every concept or name, the service provides identifiers (HTTP URIs), types, variant concepts or names, and relationships to other concepts or names. Some concepts are linked to the AgThesaurus or to the French and German national library authority files. In addition to links to the LoC files from Rameau, the French subject heading list, the National Library of Sweden has also integrated id.loc.gov.
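The Linked Data principles just described can be illustrated with a toy in-memory triple set. This is not how id.loc.gov is implemented, and the identifiers and labels below are invented: the point is only that each “thing” is named by an HTTP URI, dereferencing the URI yields its description, and the description links onward to other URIs.

```python
# Invented example triples; real id.loc.gov records use SKOS/MADS vocabularies.
TRIPLES = [
    ("http://id.loc.gov/authorities/subjects/shEXAMPLE1", "prefLabel", "Semantic Web"),
    ("http://id.loc.gov/authorities/subjects/shEXAMPLE1", "altLabel", "Web, Semantic"),
    ("http://id.loc.gov/authorities/subjects/shEXAMPLE1", "broader",
     "http://id.loc.gov/authorities/subjects/shEXAMPLE2"),
    ("http://id.loc.gov/authorities/subjects/shEXAMPLE2", "prefLabel", "World Wide Web"),
]

def dereference(uri: str) -> list:
    """Return every (predicate, object) pair describing one URI."""
    return [(p, o) for s, p, o in TRIPLES if s == uri]

# Following the "broader" link from one concept leads to another concept's
# own description -- the essence of linking data to data.
for pred, obj in dereference("http://id.loc.gov/authorities/subjects/shEXAMPLE1"):
    print(pred, "->", obj)
```

In the live service the same pattern plays out over HTTP: a client requests the concept URI and receives the identifier, types, variants, and relationships described above.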
The data has been used by Ethan Gruber at the University of Virginia to build a cataloging application in which the cataloger gets a populated list of subject headings using Solr. John Ockerbloom at the University of Pennsylvania has an online book application that integrates a manipulated version of the bulk download to extract relationships from the LCSH hierarchy and integrate them with his search and discovery tool.
Previous versions of the id.loc authorities were provided in SKOS (Simple Knowledge Organization System) but, this past summer, they began using MADS/RDF (Metadata Authority Description Schema in the Resource Description Framework), which can identify the actual components of pre-coordinated subject headings; this is important for the unique identification of concepts. Bulk downloads are available in RDF/XML or N-Triples. Visualization and “type-ahead” features are also available for all vocabularies.
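The N-Triples bulk format mentioned above is a simple line-oriented serialization, one triple per line. A hedged sketch of reading it follows; a real pipeline would use an RDF library such as rdflib, and the sample line uses an invented identifier rather than an actual LCSH record.

```python
import re

# One N-Triples line: <subject> <predicate> object .
# The subject identifier here is invented for illustration.
LINE = ('<http://id.loc.gov/authorities/subjects/shEXAMPLE> '
        '<http://www.w3.org/2004/02/skos/core#prefLabel> '
        '"Example heading"@en .')

PATTERN = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')

def parse_ntriple(line: str):
    """Split one N-Triples line into (subject, predicate, object) or None."""
    m = PATTERN.match(line)
    if not m:
        return None
    subj, pred, obj = m.groups()
    # Strip quotes and any language tag from literal objects.
    lit = re.match(r'"(.*)"(?:@\w+)?$', obj)
    return (subj, pred, lit.group(1) if lit else obj.strip("<>"))

print(parse_ntriple(LINE))
```

Because the format is one self-contained statement per line, a bulk download can be streamed and processed without loading the whole file, which suits the multi-million-record authority files.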
The MARC Organization and LC Classification schemes are also scheduled to be converted to Linked Data. New functionality will include better searching, faceted searching and the ordering of search results. Another future step is associating concepts in id.loc with resources in the Library of Congress catalog.
The CRS was established in 1914 to serve Congress. They have about 600 staff who use primary sources to develop confidential reports for Congress. Their work is not subject to the Freedom of Information Act (FOIA). Increasingly, they are called on to use data in their work. They need to understand the data and what is authoritative for what issue. They must perform hundreds of analyses and work “in the moment”. Combining sources is a challenge.
One of the objectives of the LoC Strategic Plan was to focus on data. They are looking at hosting, capturing, use, and staffing. The cost of acquisition and the policies of use are important. They currently do not have an inventory of the data that has been collected and it is distributed across the organization.
Similarly, it is important for them to understand how the data were collected, and with what intent, as they look to combine and repurpose the data to address relevant issues. The Geographic Information System (GIS) perspective is especially important, but many of the data do not provide sufficient scale to allow CRS to answer questions at the Congressional district level. GIS is valuable because the map interfaces convey the analysis in a very clean manner, even for complex topics. Mr. Atkinson gave three “made up” examples: the first extracting an area of interest, the second overlaying a map of states, and the third “geo-rectifying” down to the map of the districts. In these cases, it is important to know the coordinate systems that were used.
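The overlay step in examples like these reduces, at its core, to asking which polygon (state or district) a data point falls in. Below is a hypothetical sketch using the standard ray-casting test; the district polygon is made up, and real CRS analyses would use GIS software and reconcile the coordinate systems of the inputs first.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's y can be crossed by the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Made-up square "district" with corners (0,0) and (10,10).
DISTRICT = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(3, 4, DISTRICT))   # → True (point is inside)
```

Assigning each data point to a district this way only gives valid answers when both datasets use the same coordinate system, which is why the minutes stress knowing the coordinate systems used.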
There are many opportunities here to facilitate reuse by collaborating with data providers and to present background information. Better metadata and national indexes to trusted repositories are important in order to avoid recreating data they have used in the past. Funding for this kind of infrastructure, and the requirements for how agencies collect data, need to be reconciled with how Congress reuses the data.
Dr. Roberta Shaffer’s vision when she was the Law Librarian of Congress was to build a repository and portal to make it easier to locate legal information. The scope of this project is broad and deep, from water districts to state, county, national, multinational, and international levels. One group they are looking at now is tribal nations and indigenous groups. The portal, OWL-law.gov, will integrate LoC’s information with that of others.
Law is inherently trans-disciplinary, so the information will be organized by topic or type, and cross-jurisdictional views are allowed. Three proofs of concept have been identified: global, using the Global Legal Information Network (GLIN) or its successor; national, represented by Thomas; and sovereign within a sovereign, represented by the tribal laws. Global and tribal are multi-lingual, and there are tribes that pass laws only through oral tradition.
The goals are Accessible, Authentic, Authoritative, and Archival. The governance structures will also be a challenge. OWLL is looking at Science.gov and WorldWideScience.org as models for the collaboration needed to establish and sustain the World Law Library. LoC plans to build it but make it usable by everyone. They are testing a variety of software platforms, including MetaSearch based on Solr/Lucene, the Smithsonian’s EDEN project, and Semantic Web approaches based on Class K. The project timeline runs to 2020; they are open to recommendations from CENDI.
2011 Meritorious Award Presentation
The 2011 award was given to Sharon Jordan for her long-term support of interagency cooperation, her tireless work on Science.gov, and her service as the CENDI co-chair. The group thanked Sharon and congratulated her on her retirement.