CENDI PRINCIPALS AND ALTERNATES MEETING

US Geological Survey
Reston, VA
October 3, 1997

Minutes

Digital Library Update
Developments in the Management of Federal Spatial Data
Automating Cataloging Using Government Technologies and Standards

WELCOME

Tom Pedtke began the meeting at 9:05 am. Introductions were made and the meeting proceeded into the keynote address.

Digital Library Update
Dr. Ron Larsen, Program Manager, DARPA

Since Dr. Larsen last addressed CENDI in August 1996, the first round of digital library projects funded under the Digital Library Initiative (DLI) has neared its close, and the Defense Advanced Research Projects Agency (DARPA) is preparing to call for new projects. Dr. Larsen provided a status report and described the lessons learned from each of the first-round digital library projects (DLI1). He then discussed the status of planning for the second phase of funding (DLI2).

DARPA, NSF, and NASA co-funded the DLI1 projects. All funded projects had a common set of requirements: 1) develop a collection of electronic materials, 2) build a network of national testbed sites, and 3) conduct research. The sponsors were interested in consortia, that is, partnerships among the academic, commercial, and government sectors. The awards were in the range of $1-1.5 million per year for three to four years. Cost sharing of 50 percent was required.

A DLI workshop was held in Santa Fe, NM, in March. It included the research community and the DLI1 program managers. It was intended not only to describe the programmatic content of DLI1 but also to raise questions and get people thinking "out of the box". The report of the workshop, authored by Dan Atkins, is available from the Digital Library Web site (http://dli.grainger.uiuc.edu/), as are descriptions of the digital library projects.

Dr. Larsen reviewed the six initial projects from DLI1. He reminded the audience that the call for projects went out in a pre-Web environment.

Carnegie Mellon University (CMU)

This is the most technologically advanced of the projects. CMU's project, called Informedia News on Demand, digitized CNN news feeds and made them retrievable via spoken language. Dr. Howard Wactlar is the principal investigator.

One week ago, a prototype system was placed on Dr. Larsen's desk. It contains two years of CNN news feeds. He recently asked it (verbally) for information about recent military aircraft accidents and got a window with thumbnails of CNN stories, each of which could be opened to run the video of the story. It took several months to install the machine at DARPA because it requires a dedicated 10-megabit fiber optic line connected to an OC3 line dedicated directly to CMU. Scalability issues remain, but this project has come a long way.

The innovative technology is "video paragraphing", i.e., determining when a story ends. "Video skimming" was also designed to aid with long stories. There is a 10-fold decrease in the size of the stories based on video "gisting". A relevance thermometer was also developed. It is an indicator at the bottom of each thumbnail that shows where in the story the highest frequency of search terms occurs. This allows the user to quickly browse through the video. The indexes are created automatically based on speech recognition or the text of the closed captioning. There is no human intervention.
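The relevance thermometer idea can be illustrated with a minimal sketch. This is not CMU's implementation; it assumes the transcript is already available as a list of words (for example, from closed captioning), and the function names and the fixed number of bins are hypothetical choices made for illustration.

```python
def relevance_profile(transcript_words, query_terms, n_bins=10):
    """Count query-term hits in each of n_bins equal slices of a transcript.

    Hypothetical helper: the slice with the largest count is where a
    "thermometer" would show the most heat.
    """
    terms = {t.lower() for t in query_terms}
    bins = [0] * n_bins
    total = len(transcript_words)
    for i, word in enumerate(transcript_words):
        # Normalize case and strip simple punctuation before matching.
        if word.lower().strip(".,;:!?\"'") in terms:
            bins[i * n_bins // total] += 1
    return bins


def hottest_bin(bins):
    """Return the index of the slice with the most query-term hits."""
    return max(range(len(bins)), key=bins.__getitem__)
```

Given such a profile, a viewer could jump directly to the slice returned by `hottest_bin` instead of playing the story from the start.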

The current project deals with real-time video, speech, video paragraphing, how to build indexes, and a rich multi-year archive usable by laypeople. CMU is also experimenting with an algorithm to analyze the image of a person's face, allowing for a secondary query of the video archive by image.

Discussion

Dr. Aung asked how the CNN material was obtained. Dr. Larsen said that he did not know the details of the agreement, but that Turner gave the rights to a limited amount of information for research purposes. It is assumed that Turner is interested in the commercial aspects of the CMU project, perhaps for Distance Learning.

Mr. Rumble asked whether these technologies have been tried on higher-level intellectual content than network news. This raises the question of usefulness for professionals, which Dr. Larsen said is not clear yet. However, CMU is working on Experience on Demand. This assumes a less professional set of tapes without commercial-quality video (e.g., in a military setting). The Global Positioning System (GPS) and video feeds from hand-held cameras are fused with unmanned aerial vehicle (UAV) or satellite data. Military personnel in the field would get a three-dimensional view of the area and what might be "around the corner".

University of Michigan

This project focuses on the future of information access and delivery as redefined by millions of software agents "scurrying around the Net". It addresses the problem of how to handle agent-to-agent interaction if people reach the Net through agents. If this picture of the future is correct, how would such a global network environment function, and how would it be controlled? The team is looking at economics-based issues. The domain of the project is science education for grades K-12. So, despite the theoretical basis of the project, the team is actually exploring a more traditional question of how to teach good science in grades K-12. The "Key Question" concept allows the student to develop his or her own statement of the problem. It provides an infrastructure for independent research.

Discussion

Mr. Smith asked what happens if the teacher doesn't totally agree with the learning approach of the software. Dr. Larsen did not know of any analysis of issues related to transfer and acceptance of this approach by educators. Those teachers currently involved are committed to the project and, therefore, are unlikely to react negatively to the approach. He suggested that interested people could contact Dr. Elliot Soloway or Dr. Daniel Atkins, who is the principal investigator.

Stanford University

Stanford is raising questions of how to develop long-term interoperability among vastly heterogeneous and distributed environments. There are models for information that raise questions about the desktop metaphor with which we are now computing. The team is investigating how to manipulate and manage objects within the screen space without the desktop metaphor. This is a "no-glitz" project that is tackling the issue of what goes on behind the screen as icons take on more meaning or functionality (e.g., user operabilities). The prototype focuses on how researchers gain improved access to published literature. The domain consists of full-text journals from Knight Ridder. The principal investigator is Prof. Hector Garcia-Molina.

University of Illinois at Urbana-Champaign (UIUC)

UIUC (Dr. Bruce Schatz is the principal investigator) is pursuing the most traditional digital library initiative, built on a vast corpus of digital objects. It is developing indexes for retrieval using a variety of statistical techniques. The project involves agreements with ten professional associations that publish in science and technology. These organizations provide SGML-tagged data to UIUC, which runs statistical analyses to create the indexes. However, this is easier said than done. Early in the project, the team learned that there are various interpretations of SGML. The "devil was in the details": when trying to routinely collect information from the publishers, they ended up with an "ugly mess." The solution was to map each SGML variation to a common canonical representation.
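A minimal sketch of the canonical-mapping idea follows. It is not UIUC's code: it assumes each publisher's record has already been parsed into a tag-to-text dictionary (real SGML would need a parser and the publisher's DTD), and the publisher names, tag names, and `to_canonical` function are all hypothetical.

```python
# Hypothetical per-publisher tag vocabularies, each mapped onto one
# shared canonical set of element names.
PUBLISHER_TAG_MAPS = {
    "publisher_a": {"ti": "title", "au": "author", "abs": "abstract"},
    "publisher_b": {"atl": "title", "aut": "author", "abstr": "abstract"},
}


def to_canonical(publisher, record):
    """Rename a publisher-specific record's tags to the canonical names."""
    mapping = PUBLISHER_TAG_MAPS[publisher]
    canonical = {}
    for tag, value in record.items():
        canonical_tag = mapping.get(tag.lower())
        if canonical_tag is None:
            # Surface unknown tags instead of silently dropping content.
            raise KeyError(f"{publisher}: no canonical mapping for tag {tag!r}")
        canonical[canonical_tag] = value
    return canonical
```

With this arrangement, indexing code only ever sees the canonical names, and supporting a new publisher means adding one mapping table.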

At a workshop last year, Dr. Larsen spoke with some of the publisher representatives. They were very laudatory of the project, indicating that there needed to be someone in the middle to negotiate the various formats. From a technical point of view, an N^2 problem was reduced to a linear one.
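The arithmetic behind that reduction can be sketched as follows, under the assumption (not stated in the minutes) that each of the ten publishers uses its own SGML variant and that only one-way conversion into the shared form is needed for indexing. The function names are hypothetical.

```python
def pairwise_converters(n_formats):
    """Converters needed if every format is translated directly into every other."""
    return n_formats * (n_formats - 1)


def hub_converters(n_formats):
    """Converters needed if each format maps only into one canonical form."""
    return n_formats


# For ten formats: 90 direct converters versus 10 into the canonical form.
print(pairwise_converters(10), hub_converters(10))
```

If round-trip conversion back to each publisher's format were also required, the hub approach would need 2N converters, which is still linear.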

Within DARPA, there are two camps with regard to automated support for retrieval. The most rapid progress has been through AI (artificial intelligence). However, there are also statistical approaches. Dr. Larsen believes that the best approach is a combination of the two. Statistical approaches are good for a broad-brush query. AI is most helpful when there is a specific domain involved. Dr. Larsen has been trying to push the envelope forward on the statistical approach because it has historically received less attention than the more flamboyant AI approaches.

UIUC does not believe that the future is in supercomputers, even though the current statistical algorithms are run at the High Performance Computing Center at the National Center for Supercomputing Applications (NCSA). In fact, this is the largest single effort at statistical processing of text. Even on a supercomputer, it takes 10 hours to process the whole corpus. However, at some point, that power will scale down to people's desktops, where the algorithms will actually be used. This follows Moore's law, under which computational power doubles roughly every 18 to 24 months; so, in about 10 years, the power will be on the desktop.
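As a rough check on the 10-year estimate, the compounding can be worked out directly. The sketch below assumes a doubling period of 18 months, a commonly quoted figure for Moore's law rather than one given in the talk; the function name is hypothetical.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Multiplicative growth in computing power after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)


# Ten years at one doubling every 18 months is about a 100-fold increase.
print(round(growth_factor(10), 1))
```

With a slower 24-month doubling period, the same ten years give a factor of 32, so the estimate is sensitive to the assumed period.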

UIUC has received another award to carry this effort forward.

UC Santa Barbara (UCSB)

This project involves digitizing UCSB's world-class map collection and making it accessible and available for queries. Under principal investigator Dr. Terrence Smith, the UCSB team has built a Web browser and developed gazetteer access. There is significant interest in this project from both the National Imagery and Mapping Agency (NIMA) and the DoD.

UC Berkeley

Because extensive paper legacy collections will continue to exist, this project tries to deal with them. Prof. Robert Wilensky and his team took a large paper collection from the California Department of Water Resources (DWR), scanned it, and then came up with the concept of multivalent documents. A variety of techniques are used to take a bit-mapped image and give it the functionality of a true electronic document. For example, a statistical table that is simply bit-mapped can be given the functionality to support arithmetic on its contents. This also allows objects, or images within objects, that are stored as bit-mapped images to be retrieved by text queries.

When the recent flooding occurred in California, the California DWR asked Berkeley for access to the system for daily crisis management. The DWR realized that the result of the Berkeley work was far superior to dealing with the paper. The images were used as daily feeds to the press for news releases during this period. Berkeley is doing world-class work in image processing. The DoD is interested in the crisis management techniques that can evolve from this type of project.

Discussion

Dr. Wood asked if this system is being geared up for use with the coming El Niño phenomenon. Dr. Larsen indicated that there were no known plans to do this. However, Mr. Molholm indicated that the Federal Emergency Management Agency (FEMA) is doing work in this area and that the Applications Council will have a crisis management workshop next spring.

What Have We Learned from the DLI1 Projects?

Certainly there have been technology advances, some beyond the hopes of the originators of the projects. The importance of the infrastructure has been identified. They also found that the $1-1.5 million award was an awkward size for a university. It is not big enough to get the advantage of the infrastructure development but is big enough to require a lot of reporting.

DLI2 awards will range from $50,000 to several million dollars. The larger awards will still require 50 percent cost sharing; the smaller ones will not. There are four core sponsors for DLI2: DARPA, NLM, LoC (through "in kind" funding), and NSF. The National Cancer Institute will be a peripheral participant. There are no guarantees that the DLI1 projects will be funded through DLI2. This is a new competition.

Developments in the Management of Federal Spatial Data
Barbara Poore, Federal Geographic Data Committee [FGDC]

The FGDC (http://www.fgdc.gov/index.html) was created through Circular A-16 (1990), which identified the need to develop a National Digital Geospatial Information Resource. Almost every executive agency produces geographic data of some sort. High-quality data from other sectors would make federal data much more valuable if they were integrated with it. In 1993, the National Performance Review (NPR) identified the National Spatial Data Infrastructure (NSDI) (http://fgdc.er.usgs.gov/NSDI/Nsdi.html) as a key initiative. The NSDI Executive Order of April 11, 1994, assigned federal leadership for the NSDI to the FGDC.

Ms. Poore discussed the FGDC's outreach activities. They are seeking to incorporate partners from all sectors. The real impact of FGDC is to effect social change and to get different groups that speak different languages to come together.

FGDC deals with the technology, policies and standards, and human resources necessary to acquire, process, store, and distribute geospatial data. Working with geospatial data is like putting a puzzle together. There are maps, field measurements, remotely sensed images, and other types of data from various disciplines. Almost anything can be spatially referenced.

The metadata standard (http://www.fgdc.gov/Metadata/Metadata.html) is being revised for submission to the International Organization for Standardization (ISO) for approval this fall. The standard uses a subset of SGML, an electronic format that is technically readable by both humans and machines. In practice, however, SGML is not particularly readable by humans, an issue with which they are struggling. Important to the standard are indicators that identify the originator and describe the quality, organization, and spatial reference of the data. This data about the data (metadata) is key to the successful use of the data by others.
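To make the idea of "data about the data" concrete, the sketch below models a record with the four kinds of indicators named above and reports which are still empty. The class, field names, and sample values are hypothetical simplifications and do not reproduce the actual element names of the FGDC standard.

```python
from dataclasses import dataclass, fields


@dataclass
class GeospatialMetadata:
    """Hypothetical, simplified record; not the real FGDC element set."""
    originator: str = ""
    data_quality: str = ""
    data_organization: str = ""
    spatial_reference: str = ""


def missing_elements(record):
    """List the descriptive elements that have not been filled in."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Illustrative values only.
record = GeospatialMetadata(originator="USGS", spatial_reference="NAD83")
print(missing_elements(record))
```

A check of this kind is one way a metadata tool could prompt a data producer for the quality and organization descriptions that later users depend on.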

The FGDC hopes that all federal agencies with geospatial information will use the standard. It is extremely helpful as a means of conducting an inventory. It also serves an advertising purpose by making others aware of the data. Metadata may also ward off liability issues by informing the potential user about data quality and appropriate uses of the data.

The FGDC works with various implementers to refine metadata tools. They also provide free training both in the tools and in the principles of metadata.

The FGDC is involved in a number of standards initiatives, including harmonizing projects and standards internationally. There is harmonization with the geospatial efforts in Australia and Canada. South Africa has accepted the FGDC standard. The current ISO TC211 is fundamentally the FGDC standard. It is already an ASTM (American Society for Testing and Materials) standard. The FGDC supports research on the Z39.50 distributed search protocol. They are also investigating common classifications and collection criteria.

The FGDC Clearinghouse is a client/server architecture that provides access to distributed data and documents. It is based on the Z39.50 distributed search protocol. About 40 servers throughout the U.S. can currently be searched from the FGDC Web page. There is no central collection; each site must do its own collection development. This is a critical issue. Gateways are being operated in Alaska; Washington, DC; and North Carolina. The servers under them have registered the 40 sites as well as some global databases of interest.
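The general pattern of searching distributed collections with no central copy can be sketched as a fan-out query. This is not the Z39.50 protocol and not the Clearinghouse software; the in-memory `CatalogServer` class, the site names, and the substring matching are hypothetical stand-ins for remote servers.

```python
from concurrent.futures import ThreadPoolExecutor


class CatalogServer:
    """Hypothetical stand-in for one site's locally maintained collection."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def search(self, term):
        term = term.lower()
        return [r for r in self.records if term in r.lower()]


def search_all(servers, term):
    """Send the same query to every server and keep the sites that had hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        results = pool.map(lambda s: (s.name, s.search(term)), servers)
        return {name: hits for name, hits in results if hits}


servers = [
    CatalogServer("site_a", ["Glacier extent map", "Road network"]),
    CatalogServer("site_b", ["Coastal wetlands map"]),
]
print(search_all(servers, "map"))
```

Because each site answers only from its own holdings, the quality of the combined result depends on every site keeping its collection current, which is the collection development issue noted above.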

Work is underway to create digital geographic data around seven themes. These themes can be commonly used and distributed at high resolution to the whole community. The user can then pull down one of these framework data sets and add local data and information to it for local analysis. Pointers to all the framework information will be available from the Clearinghouse. The Framework Handbook is available digitally.

Key to the success of the FGDC is its ability to develop partnerships with relevant sectors: commercial, not-for-profit, academic, and government. The FGDC has found that the most effective approach to partnerships is money. FGDC funds metadata creation, teaches about metadata, provides the metadata clearinghouse, and funds educational projects to promote the use of the standard. There is a project to do a mini clearinghouse for high school science classes. There are also several framework-related projects.

Ms. Poore, as the benefits program manager, goes to the various communities and asks about the benefits of data sharing. The community must share the cost. The average is about $40,000-$50,000. The FGDC wants to give seed money only, not to continually pay people to do this. In the last four years of the partnership program, 1,700 organizations have been involved in the program.

The NSDI has developed a strategic plan. The new emphasis this year is on the 35 members of the UCGIS (University Consortium for GIS).

It is also significant that the National Science Foundation has become a member of the FGDC. Even though research scientists are creating much of the data, they are the least likely to create metadata. Perhaps the creation of metadata will become part of the NSF grant cycle. FGDC can also leverage NSF's expertise in social science research, computer science, and geography.

Discussion

Several members asked how the budget for GIS is looking. Ms. Poore indicated that in the early 1990s, agencies set up centralized GIS shops but, in the past few years, the disciplines have started to take on GIS in their distributed systems. The growth of GIS on the Web, along with analysis tools, has promoted this. GIS is no longer treated as a separate line item but seems to be included in the funding for the discipline area.

Automating Cataloging Using Government Technologies and Standards
Dr. Mark Fornwall, Kate Kase, and Anne Frondorf, USGS/Biological Resources Division

Dr. Fornwall, Director of the Center for Biological Informatics in Denver, CO (http://biology.usgs.gov/cbi/index.html), introduced the topic of automatic cataloging and the speakers for this segment.

Kate Kase began with a quote, "Information is the currency of democracy." This emphasizes the importance of exchanging information to make it worthwhile. What, then, do we need to do to make sure we can appropriately exchange information? Ms. Kase identified two components: consistent descriptions of content and consistent descriptions of structure. Rules are needed so that we are all describing things the same way. Compliant tools are needed to ease the input and verification processes.

AACR2 and MARC are the most prevalent tools at this time. MARC is general enough to handle multiple disciplines. The rules and tools are written by and for the information community, not for the subject disciplines. However, the sheer volume of information within a discipline and the benefit of having the practitioners describe the work, since they are closest to it, have caused BRD to rethink the previous paradigm. It is necessary for information professionals to work with the discipline professionals to distribute the workload and to build tools in the language of the discipline, not in the language of information science.

Customized tools are important. Within the USGS/BRD, staff must deal with the geospatial community (NSDI) and the biological community (National Biological Information Infrastructure (NBII)).

At this point, Anne Frondorf briefly described the NBII (http://www.nbii.gov), begun three years ago. In order to make this distributed resource work, an agreement was needed on a metadata standard for the community. As part of the USGS, compliance with the FGDC standard was a given. They enhanced the FGDC standard to handle information particularly relevant to biologists (above and beyond the geospatial aspects of their work). This includes a field for organisms including the taxonomic and the nomenclatural references. These are not dealt with in the FGDC standard, so an extension was required. The NBII Metadata Standard (http://www.nbii.gov/infrastructure/meta.html) has been drafted and is now going back through the FGDC standards process to be approved as an extension.

It was noted that pure laboratory research is also important but is not generally connected to geospatial information (other than the location of the laboratory, which generally does not impact the data included in the report). Non-data publications, such as technical reports, needed to be included. There was a need to bridge the gap between the geospatial standard, the biological extension, and the extensions needed for data that are not inherently spatial, such as reports. The FGDC standard was extended to accommodate both the biological information and the non-spatial bibliographic data.

Ms. Cotter indicated a number of international efforts that are ongoing. These include the Inter-Americas Biological Information Network (IABIN) (http://www.nbs.gov/nbii/iabin) and the North American Biodiversity Information Network (NABIN) (http://www.nbs.gov/nbii/). The Biological Resources Division (BRD) is also involved, as are some other CENDI members, in the International Council for Scientific and Technical Information (ICSTI) project on Biological Classification.

Ms. Kase continued with a description of the tools used for creation of the information. There are two tools, Metamaker (http://www.nbii.gov/metamaker/metamaker.html) and PUBS (http://biology.usgs.gov/pubs/). Both tools are Web-based. Metamaker is used by resource owners/creators to create and store records. PUBS, the Publishing Utilities for the Biological Sciences, is a suite of tools for authors and editors that accommodates the entire publication life cycle. It will include a decision support system to help the author determine the audience for the paper and the best format for publication.

BRD has recently begun to populate the publications metadata database. The search engine will be overlaid within the next month.

The PUBS system creates metadata that is sent directly to the NBII Clearinghouse (http://www.emtc.usgs.gov/http_data/meta_isite/nbiigateway.html), or metadata which is stored locally for later upload. Metamaker can accommodate the 200+ elements required by the NBII metadata standard. PUBS, on the other hand, reduces the element set to those that are needed for bibliographic data and related information.
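The relationship between the full element set and the reduced bibliographic set can be pictured as a simple projection. The sketch below is not the PUBS or Metamaker code; the element names in `BIBLIOGRAPHIC_ELEMENTS` and the sample record are hypothetical and far smaller than the 200+ elements of the real standard.

```python
# Hypothetical subset of element names relevant to a publication record.
BIBLIOGRAPHIC_ELEMENTS = ("title", "originator", "publication_date", "abstract")


def bibliographic_view(full_record, elements=BIBLIOGRAPHIC_ELEMENTS):
    """Keep only the elements needed for bibliographic description."""
    return {name: full_record[name] for name in elements if name in full_record}


# Illustrative record mixing bibliographic and spatial elements.
full_record = {
    "title": "Woodpecker survey",
    "originator": "BRD",
    "bounding_box": "example extent",
    "taxonomic_reference": "example checklist",
}
print(bibliographic_view(full_record))
```

Keeping the reduced set a strict subset of the full one means a record created in the simpler tool remains valid input for the fuller one.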

The system can handle both full metadata information and access to the full text. This is handled by buttons at the bottom of the Biblio Query Screen. Not all material will be available in full text because of copyright.

XML is being researched as a way to do one tagging pass for the metadata, the publication, and the Web publishing of the same document. It is also possible to embed color photos in place of the black-and-white ones that are published in the BRD publications. In addition, sound clips can be embedded. Ms. Kase demonstrated an entry for a woodpecker, showing both the text and the sound of the woodpecker's call.

Ms. Kase indicated that the BRD is working on an extraction tool to pull out the elements from the metadata record for an SF298 (report documentation page) and to provide facilities for printing it when needed for the back of a hardcopy report. BRD is working with NTIS to determine the best way to send the SF298 electronically for NTIS's files. BRD is also working on controlled vocabularies that are linked at the subdiscipline level to other vocabularies.

Discussion

Ms. Franco asked about the issue of quality control of author-created data. Ms. Kase indicated that the material is being filtered through a stop point where verification routines try to filter bad records and where records are reviewed between submission and addition to the actual clearinghouse.
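The two-stage control described in that answer (automated verification, then review before a record reaches the clearinghouse) can be sketched as follows. The required element names, the rules, and the function names are hypothetical; the minutes do not describe BRD's actual verification routines.

```python
# Hypothetical minimum set of elements a submission must carry.
REQUIRED_ELEMENTS = ("title", "originator")


def validate(record):
    """Return a list of problems found by the automated checks."""
    problems = []
    for name in REQUIRED_ELEMENTS:
        if not str(record.get(name, "")).strip():
            problems.append(f"missing {name}")
    return problems


def triage(submissions):
    """Split submissions into those awaiting human review and those rejected."""
    review_queue, rejected = [], []
    for record in submissions:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            review_queue.append(record)
    return review_queue, rejected
```

Records that pass the automated checks are only queued for review here, not published, which mirrors the review step between submission and addition to the clearinghouse.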