May 3, 2012
NATIONAL AGRICULTURAL LIBRARY
10301 Baltimore Avenue, Beltsville, MD 20705-2351, Main Reading Room
9:00 am - Opening and Introductions
Lisa Weber, Director, Information Technology Policy and Administration, NARA, and CENDI Chair
Simon Liu, Director, NAL
9:15 am - 11:00 am
“Big Data is a Big Deal” [presentation]
“Department of Homeland Security Science and Technology Directorate: Mission, Operations, and Future Plans” [presentation]
Rolf Dietrich, Chief Knowledge Officer, Science and Technology Directorate, Department of Homeland Security
11:30 am - Host Showcase – National Agricultural Library
» Christopher Cole
- DigiTop Navigator: Current Awareness Service on Literature (Tanya Tanner and Wayne Thompson) [presentation]
- USDA Implementation of VIVO (Vern Chapman) [presentation]
12:30 pm - Group Lunch
Ms. Lisa Weber, CENDI Chair, opened the meeting at approximately 9:10 am. She thanked the National Agricultural Library, and, in particular, Chris Cole and his staff for hosting the meeting. She also welcomed Dr. Simon Liu, Director of NAL, who welcomed the CENDI members.
Simon Liu Welcome: CENDI is the home not only for federal STI, but also for information management professionals. Dr. Liu first became aware of CENDI in 2000 when he worked for NLM. He enjoyed the programs sponsored by CENDI and thanked the members, the Chairs, and the Secretariat for making CENDI such a great organization.
NAL is in a period of transformation including organizational structure, programs, and culture. New programs and projects are being developed, two of which will be showcased during the meeting.
Big Data is also important to NAL. MarkLogic is used to handle big data -- structured, semi-structured, and unstructured. DigiTop Navigator has 43 million citation records in agriculture and related fields. The project was initiated 18 months ago and implemented less than a year ago. NAL plans to continue moving forward with a goal of 120 million records. The system currently covers 43,000 journal titles. The second project is VIVO, which is based on work developed under an NSF/NIH grant. The platform is used to facilitate collaboration among scientists. To date, they have incorporated people, projects, and publications.
Libraries are places where history comes alive. NAL has collections dating back to the 15th century. He invited CENDI members to return at some point to tour the special collections.
Libraries are also home for imagination. They are the delivery rooms for the birth of new ideas. What we discuss and present today will help to stimulate change in the CENDI organizations tomorrow.
Data, Data and Metrics Data
“Big Data” is a term applied to a data set whose size or complexity is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. This definition, which they struggled with for quite some time, incorporates three important concepts that equate to the major issues -- size, complexity, and time. Eighty percent of the data is unstructured. Time can be defined as time to process, time to analyze, time to visualize, and time to support the development of meaning by researchers and, ultimately, the public. The hardware can handle massive amounts of data but the software and visualization techniques can’t keep up with it. We have an “analysis” or “filtering” problem, rather than a data problem. This is also a big opportunity to have technology make a giant step forward based on inexpensive storage, sensors, etc. We have all this information, but we need to harness it.
The Big Data Group was asked by the Office of Science and Technology Policy (OSTP) to focus on where we want to go. What are the agencies doing? In what areas can they coordinate and where are the gaps? What does the federal government need to focus on to get us where we want to go with limited funding?
The goal is to promote new science by harnessing big data, to exploit big data to address national and agency needs, and to support the stewardship of federal data and develop the necessary workforce and infrastructure to advance data science. After the first report was delivered to OSTP, the Group was asked to focus on four main areas: Core Technologies, Domain Research Projects, Challenges/Competitions and Workforce Development.
A solicitation around Core Technologies was released on March 29, 2012, as the cornerstone for the White House Event on Big Data. The goal was to draw multiple agencies together to find tools that cut across multiple domains or to generate tools that could be generalized and used in varying environments. Nine NSF directorates and seven NIH institutes pledged funding. DOE was interested but wasn’t able to process the paperwork and commit in time for the release. Ms. Wigen is hopeful that DOE and NASA will sign up for future rounds. The solicitation included a section where any agency can ask for customized software for its specific need. This drew a lot of attention from agencies.
The White House Event on Big Data included speakers from NSF, HHS/NIH, DoD, DARPA (Defense Advanced Research Projects Agency), DOE and USGS. The aim was to not only highlight the solicitation of the Big Data Initiative but to draw in other agencies which have initiatives of their own.
NITRD prepared a fact sheet that went out to various agencies asking about their current programs involved with Big Data. This was not an exhaustive list but they were able to identify 87 projects from 16 agencies. There is much going on crossing all domains. Ms. Wigen distributed some copies of the fact sheet. Principal Investigators from some of these projects have been invited to come and talk to the group about their projects.
Challenges and awards are tools that are being used inside and outside the government. The Big Data Group is working with the NASA Center of Excellence for Collaborative Innovation to establish a process by which the Center will help agencies set up and run these challenges. This will start with an Ideation Challenge to be released soon. The Steering Committee has approved the plan and the first challenge funding will be released soon. Currently, the Group is going back and making sure that the data sets are the most relevant before releasing the challenge.
The Domain Research Projects aim at identifying current projects that could benefit from cross-agency collaboration. In the area of Workforce Development, 20 projects were identified across seven agencies that may be adaptable to address the new field of data science and the development of data scientists. These include funding methods such as grants, fellowships, and internships. They are looking at all levels of college and beyond. They are also looking at building a Data Science Community through annual conferences, professional associations, etc. An update for the Workforce Development group is set for a May 17th meeting. Dr. Spengler will be speaking about developing a community.
It was mentioned that other groups are working on some similar aspects of Big Data. Mr. Sheehan mentioned that the Interagency Working Group on Digital Data (IWGDD) might intersect with the Big Data Group in the area of the data life cycle; specifically, how to make the data available for use. He did note that the IWGDD is very policy focused.
The issue of shared terminologies and ontologies across domains was raised. These areas are critical to the metadata issues. This may be one of the gaps where a research plan is needed.
The Science and Engineering Indicators are done every two years for the National Science Board. Data is analyzed in a wide range of science and technology (S&T) topics. The goal is to support policy input. The assessments span K-12 to higher education, research and development (R&D) and innovation, and employment. Other products are also available on the web site from their surveys.
Some highlights include:
- The US position in specific S&T areas has eroded gradually over the last decade.
- Many developing countries are building up their S&T infrastructure to generate new knowledge and translate it into economic and social benefits.
- While the Asian ascent in S&T has been focused on China, it applies to other Asian economies, both developing and developed. Other countries with heightened focus on S&T include South Africa and Brazil.
- These countries, especially China, are increasing the number of highly trained people in both science and engineering, and doing so very rapidly.
- While there has been large growth in R&D in the US and the European Union (EU), the largest growth is among the Asia 8. Combined, Asian R&D nearly equals that of the US and surpassed that of the EU in 2009.
- Even though it is difficult sometimes to compare data across countries, the increases can be seen in a variety of indicators, including the number of researchers, scientific publishing, and high technology manufacturing.
- The percentage of R&D performed intramurally and by Federally Funded Research and Development Centers (FFRDCs) fell because of increases in other sectors.
- US industry has the largest share of R&D of all sectors. Industry is also the largest employer of science and engineering degree holders across all education levels.
- Federal intramural science is focused on the biological/agricultural and environmental sciences.
- Scientific publishing is dominated by the academic sector. NSF uses the ISI Science and Social Science Citation Indexes and looks at authorship by institution. The shares for government and industry declined.
- About a decade ago, NSF did some interesting research about the flattening of US publication output. No single explanation was found. It could be attributed to a number of factors, such as increased output by other countries, greater international collaboration by US organizations, and increased R&D spending by US universities being invested in infrastructure in advance of the performance of the research. Based on interviews, Mr. Hill also noted that the very top academic institutions in the US were less interested in the quantity of their faculty's publications for tenure than in the quality of the journals in which they published.
- There has been a sizeable increase in collaboration between US and foreign institutions and, therefore, in co-authorship. This is true of federal authors as well.
- There is rapid growth in patenting by both foreign and US inventors. Patents doubled in the US, but more than doubled in foreign countries. In the late 2000s, the number of foreign patents surpassed those of the US. Patents by federal government intramural researchers are at a very low level. The private and non-profit sectors have the lion's share of patent assignments.
As the final step in applying for CENDI membership, Mr. Dietrich gave a special presentation about the mission and activities of DHS Science & Technology Directorate. He also introduced Christopher Lee, Chief Privacy Officer, who would serve as the CENDI Alternate.
The mission of the DHS S&T Directorate is to strengthen America’s security and resiliency by providing knowledge products and innovative technology solutions. Technology solutions can be developed internally or procured from others. The Homeland Security Enterprise includes not only the seven components of DHS, but also first responders and others, including the public. S&T has projects that reach all of these levels, including mobile phone applications.
The components of the Enterprise were brought together 10 years ago. S&T is headed by an Under Secretary under DHS HQ. S&T is focused on adding value to the Enterprise with strategic and focused technology options and operational process enhancements. They seek innovative, system-based solutions to complex homeland security problems. S&T has the technical depth and reach to discover, adapt, and leverage technology solutions developed by agencies, laboratories, and Subject Matter Experts. For example, S&T has an FFRDC specifically for system engineering.
Mr. Dietrich described the S&T organization and how each group is aligned with the mission. The responsibilities range from support for First Responders and Advanced Research Projects in areas such as borders and maritime security, human factors and behavioral sciences, and chemical and biological defense, to support for acquisitions and partnerships with universities, the private sector, and international organizations. There are five laboratories within the S&T organization. The Knowledge Management and Process Improvement Office is new as of a year ago and provides support in these areas.
Mr. Dietrich is currently working on how to capture and manage the publications and other outcomes of the research. At this time, he is focused on capturing the information where it resides or having it submitted to a central source. Once they have some of the materials, he can take the next steps of cataloging and organizing them. The first objective is to share within S&T, then within DHS, and, finally, beyond. Most of their research is unclassified. In some cases, the information in the studies is classified but the methodologies would be of more general interest. This means talking to the Principal Investigators (PIs). He is focused on promoting a culture of innovation and learning. As the Chief Knowledge Officer for the S&T Directorate, most of his job is culture and relationship building.
Showcase – National Agricultural Library
DigiTop Navigator provides access to AGRICOLA and eight additional databases that NAL has licensed for USDA use. The external databases are Medline, BIOSIS, CAB Abstracts, Food Science and Technology Abstracts, Zoological Record, Fish & Fisheries, AGRIS, and Wildlife & Ecology. Previously, these nine databases had to be searched separately and through different interfaces. The goal was to improve customer service while moving away from mediated services to increased patron autonomy and reduced costs. The system provides access to approximately 43 million unique records.
The development was done by NAL staff and MarkLogic with considerable input from their patrons. Focus groups from throughout USDA were used to develop the requirements and interface designs.
This approach is based on a normalized record structure that is stored at NAL, rather than a metasearch engine. Data is added from the databases weekly. The data is scrubbed one day and then loaded overnight. The team priced other options and determined that this was the least expensive approach. The various styles of author names were particularly difficult. It took a year to reconcile all the data. OpenURL is used for the full text. The full text can be easily absorbed, but the question is what you do with it.
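The author-name reconciliation mentioned above can be illustrated with a minimal sketch. This is not NAL's actual logic; the styles handled ("Surname, Given M.", "G. M. Surname", "SURNAME GM") and the key format are assumptions for illustration only.

```python
import re

def author_key(name: str) -> str:
    """Reduce an author name to a rough canonical key: surname + initials.

    Handles a few common citation styles, e.g. "Smith, John A.",
    "J. A. Smith", and "SMITH JA". Illustrative sketch only; a real
    reconciliation pipeline needs far more rules and manual review.
    """
    name = name.strip().rstrip(".")
    if "," in name:  # "Surname, Given M." style
        surname, given = (p.strip() for p in name.split(",", 1))
        initials = "".join(w[0] for w in given.replace(".", " ").split())
    else:
        parts = name.replace(".", " ").split()
        if parts[0].isupper() and all(p.isupper() and len(p) <= 3 for p in parts[1:]):
            # "SMITH JA" style: tokens after the surname are bare initials
            surname = parts[0]
            initials = "".join(parts[1:])
        else:
            # "Given M. Surname" style: last token is the surname
            surname = parts[-1]
            initials = "".join(w[0] for w in parts[:-1])
    return f"{surname.lower()} {initials.lower()}"
```

With this key, "Smith, John A.", "J. A. Smith", and "SMITH JA" all collapse to the same record group, which is the kind of normalization a year of data reconciliation entails at scale.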
Features include browsing, managed search results, saved searches and results, sending e-mails, and alerts. It is available through the USDA network or through a VPN for employees working remotely.
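The OpenURL linking mentioned above follows the ANSI/NISO Z39.88-2004 key/value format. A minimal sketch of building such a link is shown below; the resolver base URL and the helper function are hypothetical, but the `url_ver`, `rft_val_fmt`, and `rft.*` keys are standard OpenURL 1.0 journal-article fields.

```python
from urllib.parse import urlencode

# Hypothetical resolver base URL; a real deployment points at the
# institution's own link resolver.
RESOLVER = "https://resolver.example.org/openurl"

def openurl_for_article(title, journal, issn, date, volume, spage):
    """Build an OpenURL 1.0 (Z39.88-2004) key/value link for a journal article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.atitle": title,
        "rft.jtitle": journal,
        "rft.issn": issn,
        "rft.date": date,
        "rft.volume": volume,
        "rft.spage": spage,
    }
    return RESOLVER + "?" + urlencode(params)
```

The resolver inspects these keys and redirects the user to the full text their institution licenses, which is how a citation record in the system connects to the article itself.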
In the future, they plan to parse out the licensed databases that are accessible to USDA only and make the USDA databases publicly available through the same system. They also plan to leverage citation records, full text and datasets.
It was suggested that this system might be applicable across agencies. Since the system could not be demonstrated at the meeting, Mr. Cole will arrange with the Secretariat for a time when a webinar can be scheduled.
Action Item: Mr. Cole and the Secretariat will set up an event, perhaps a webinar, to demonstrate the DigiTop Navigator.
VIVO is an open source semantic web application for science collaboration that was originally created at Cornell. It is based on describing relationships between entities. Conclusions can be drawn from explicitly or implicitly stated information. The interface provides browsing, searching, and facets.
The VIVO ontology is based on the academic model. It took time to actually understand how to extend the ontology to accommodate the entities and relationships within USDA and not break VIVO. It required the creation of a local ontology across five different agencies that represented disparate organization structures and positions. The granularity of the data (first name, last name, etc.) was also an issue.
The NAL VIVO development was more difficult because there is no central HR system to supply current and consistent data. In addition, there was no consistent tracking of research, except at ARS. The data was cobbled together from a number of external data files in RDF, CSV, or Excel formats. Tabular data had to be mapped into the ontology format.
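The tabular-to-ontology mapping described above can be sketched as generating subject-predicate-object triples from CSV rows. The base URI, the column names, and the choice of predicates below are illustrative assumptions, not USDA's actual mapping; the real VIVO ontology defines its own richer set of classes and properties.

```python
import csv
import io

# FOAF is a real vocabulary; the base URI and column layout are hypothetical.
VIVO = "http://vivoweb.org/ontology/core#"
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://example.org/usda/"  # placeholder namespace for illustration

def rows_to_triples(csv_text: str):
    """Map tabular researcher data into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        person = BASE + "person/" + row["id"]
        triples.append((person, FOAF + "firstName", row["first_name"]))
        triples.append((person, FOAF + "lastName", row["last_name"]))
        # Link the researcher to an agency entity so relationships can be
        # browsed and inferred, as VIVO does.
        triples.append((person, VIVO + "relates", BASE + "org/" + row["agency"]))
    return triples
```

Once rows become triples like these, the granularity problem the speakers mentioned (first name versus last name, etc.) becomes explicit: each field must map to exactly one property in the ontology.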
VIVO is available only internally to USDA. Based on a search by author or organization entity, the user can access the full text of the publication from within this interface. They are working on visualizing relationships between publications. For example, a user would perform a search and then be able to see the co-author network.
The raw data for a researcher’s publications can be exported and then manipulated with Excel. There was a discussion about the possibility of extracting data from VIVO and importing it to STARMetrics.
NAL is currently working with 60 land grant universities to facilitate collaboration among the researchers using VIVO. NAL sees this as a valuable tool to be able to judge the outcome and output of research.
The morning technical program adjourned at noon.