CENDI ANNUAL PLANNING MEETING

Bavarian Inn, Shepherdstown, WV

August 27-28, 1996

MINUTES

Changing STI Management in a Networked Environment

Keynote: DARPA Digital Library Initiative - Ron Larsen

CENDI Planning for 1997/98

CENDI Operations

Announcements

Keynote Session: Information Technology, Public Policy, and STI Management

DARPA Digital Library Initiative - Ron Larsen

Ron Larsen, Program Manager of the Defense Advanced Research Projects Agency (DARPA) Digital Library Initiative (previously, Associate Director of Libraries at the University of Maryland), provided an overview of the current and future technologies as viewed through the Digital Library Initiative (DLI). (The IITA Workshop definition of digital library includes digital pointers to physical collections such as hardcopy books. This is a broader definition than used by most services. This is important to the government since much of the information is not and probably never will be in electronic form.) Dr. Larsen described the evolution of network technologies and the Internet, the specific technologies being addressed by the various DLI projects, and the plans for moving the technologies into real-world environments.

Dr. Larsen described the evolution of network technologies and the Internet in three stages: access, organization, and correlation and analysis. In the 1980s, the focus was on access. The technologies have matured from simple file server technologies, through Gopher and WAIS, to more sophisticated WWW search engines such as Lycos, Alta Vista and Yahoo.

Throughout the 1990s, the Internet will take on the characteristics of an Information Repository. SGML will provide structure within documents and allow multimedia integration. In DLI projects, particularly at the University of Illinois at Urbana-Champaign (UIUC), SGML and document structure are being evaluated. While UIUC discovered that SGML was more of a guideline than a standard and, therefore, required individual mapping for each source into a common format, the publishers were optimistic because UIUC could map SGML from various sources. This was proof that SGML could be used and that the path to use was a linear problem that could be solved.

In the 21st Century, the move will be toward an analysis paradigm that deals with content and categories. The future focus on analysis will involve semantic and broadband interoperability. DLI is investigating systems and tools that raise the users' capabilities from the lower levels to the concept-based level. The automatic indexing concepts of 20-30 years ago, such as computer-assisted (semantic) indexing and vocabulary-switched retrieval, are now having an impact because of supercomputers. It is the view of the UIUC project that the supercomputers of today will be the common desktop devices in 5-10 years. Therefore, techniques that rely on the brute-force computing power of supercomputers, such as statistical clustering to get semantic spaces and switching between the vocabularies of related disciplines, are pragmatic technologies for the DLI. (See Bruce Schatz's article in the April 1996 issue of "Science".) Creating concept spaces from 400K Inspec records took one day on a supercomputer. Merging 600 spaces for Compendex (4M records) took 3.5 days.

Semantic interoperability is the grand challenge of the Digital Library concept. This includes semantic interoperability across subject domains and vocabulary switching to suggest terms across domains. UIUC has developed a prototype where users can search the vocabulary in one domain and be guided to related terms in another domain.
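The vocabulary switching described above can be illustrated with a small co-occurrence model. This is only a minimal sketch under stated assumptions, not the UIUC prototype: the sample records, the term lists, and the function names `build_concept_space` and `suggest_terms` are all hypothetical, and simple co-occurrence counting stands in for the statistical clustering used in the actual project.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(records):
    """Count how often each pair of index terms is assigned to the same record."""
    cooccur = defaultdict(Counter)
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur


def suggest_terms(cooccur, term, target_vocabulary, limit=3):
    """Suggest terms from another domain's vocabulary that co-occur most with `term`."""
    candidates = [
        (count, other)
        for other, count in cooccur.get(term, {}).items()
        if other in target_vocabulary
    ]
    # Highest count first; ties are broken alphabetically for stable output.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in candidates[:limit]]


# Hypothetical records, each indexed with terms from two domains.
records = [
    {"fatigue", "crack growth", "bone fracture"},
    {"fatigue", "stress corrosion", "bone fracture"},
    {"fatigue", "osteoporosis"},
    {"crack growth", "welding"},
]
medical_vocabulary = {"bone fracture", "osteoporosis"}

space = build_concept_space(records)
print(suggest_terms(space, "fatigue", medical_vocabulary))
# prints ['bone fracture', 'osteoporosis']
```

A user searching the engineering term "fatigue" is guided to the medical terms that most often share a record with it, which is the kind of cross-domain suggestion the paragraph describes.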

DLI has also raised the notion of non-professionals building their own indexes. It is believed that amateur indexers will build an amateur index, but the quality may improve if they are given a tool that allows them to become more professional.

DARPA's information management direction is to develop interoperable and scalable middleware. They are trying to apply library concepts (the absence of a priori knowledge of relevant information) to the new networked environment. Image understanding, high performance knowledge bases, etc. attempt to vertically integrate for decision making. However, horizontal integration is missing. There is a need to broaden the base, but in a way that would be useful.

The information management focus is critical to DARPA's major defense focus, which is Situation Awareness. Situation Awareness requires real-time tactical and reference data from a variety of sources. Acquiring the information in real time is only part of the challenge. The other challenge is to correlate and manipulate the information through context and value filtering and provide information space visualization. To address this need, Carnegie Mellon is working on real-time ingest and categorization technologies. The video feeds are converted to text through speech recognition, and the text is then indexed. This occurs almost in real time.

Translingual Interaction is another information management concept of importance to the DoD community. This includes machine translation of queries, so that foreign databases can be searched and responses returned in the user's native language.

DARPA is concerned about taking the lessons learned and the technologies from the open environment of the DLI to the closed environment of DoD. They are working with the Corporation for National Research Initiatives (CNRI) to develop a secure repository; funding from DTIC is being used to test this in a real environment.

There are questions about how to move from where we are now (relatively simple network services) to where we want to be (a global digital library). Traditional library concepts, such as authority, holdings, repository and circulation, must be translated to the new environment. The traditional library concept of the authority first included names and identifiers, such as ISBN's and ISSN's. However, the concept of authority now has new meaning in the world of URL's, PURL's, and URI's. The library model suggests that holdings are more than just URL's. A unique name or "handle" is necessary for each digital object, as well as the identification of properties, some of which relate to cost, archiving, and readability.

In the new environment, the repository is made up of both content and metadata. The repository is accessed via a "handle" and includes reference information, security information, and information relevant to intellectual property protection. In Dr. Larsen's view, intellectual property and electronic commerce are two sides of the same coin. Some token must be presented that gets you something. While the initial impact of the Internet may have been to promote the perception of information as free, Dr. Larsen suggested that the implementation of electronic commerce may bring about a change in this perception. The library model continues with the concept of circulation. Digital objects have different manifestations (as they are created, stored and disseminated). A transaction log of what has been happening to a document is the equivalent of circulation in the DLI environment.
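The library model outlined above (a unique handle for each digital object, a repository holding both content and metadata, and a transaction log standing in for circulation) might be sketched as follows. This is an illustration only, not the DLI or CNRI design; the class names, the `hdl:example/1` handle string, and the metadata keys are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DigitalObject:
    handle: str      # unique name for the object (hypothetical format)
    content: bytes   # the stored manifestation
    metadata: dict   # e.g. reference, security, and rights information


class Repository:
    """Stores objects by handle and logs every transaction against them."""

    def __init__(self):
        self._objects = {}
        self.log = []  # (timestamp, handle, action) tuples

    def _record(self, handle, action):
        self.log.append((datetime.now(timezone.utc), handle, action))

    def deposit(self, obj):
        if obj.handle in self._objects:
            raise ValueError(f"handle already in use: {obj.handle}")
        self._objects[obj.handle] = obj
        self._record(obj.handle, "deposit")

    def retrieve(self, handle):
        obj = self._objects[handle]
        self._record(handle, "retrieve")
        return obj

    def history(self, handle):
        """Return the actions logged for one handle, the analogue of circulation."""
        return [action for _, h, action in self.log if h == handle]


repo = Repository()
repo.deposit(DigitalObject("hdl:example/1", b"report text", {"rights": "public"}))
repo.retrieve("hdl:example/1")
print(repo.history("hdl:example/1"))
# prints ['deposit', 'retrieve']
```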

There are six digital library research projects under the DLI focusing on a wide range of collection, storage, organization and retrieval issues. They are:

The University of California/Berkeley is working on automatic indexing of environmental information. They have suggested the concept of "multivalent documents". A legacy document in printed form is scanned to get an archival copy as the basis for a bit-mapped document. There are virtual layers on top of the archival copy that you can control. For example, a user might annotate a version, creating a new valence for that document. There is a problem of not knowing what to call the fragments made from these multivalent documents.

The University of Michigan is establishing a multimedia testbed in the K-12 Earth and Space Sciences curricula. They are focusing on user interfaces, mediation, and collection development.

The University of California/Santa Barbara is focusing on cartographic information including maps and other spatially-indexed materials.

Stanford University modified its original proposal to focus on interoperability architectures.

The University of Illinois at Urbana-Champaign is focusing on access and display of complete contents, including text, figures, graphics, etc. They are also concerned with semantic retrieval and are involved in vocabulary control issues, including thesaurus development.

Carnegie Mellon's project is a digital video library using speech recognition, machine vision, and natural language-understanding technologies.

DARPA is concerned about the interoperability of the DLI projects, not just each individual project. Therefore, UC Berkeley and UC Santa Barbara are doing interoperability experiments. Carnegie Mellon and MIT are looking at the interoperability of video objects. Michigan and Stanford are working on widely usable middleware.

The DLI is moving toward larger scale federated repositories, and from custom application software and architecture to generic approaches. Document sizes should increase over the next five years from an average of 1 megabyte per document to 100 megabytes. Response time should decrease from 10 seconds to 100 milliseconds. In addition, the DLI will move from a Multilingual (multiple language) to a Translingual (automatic translations for documents and queries) environment.

Context filtering will replace or supplement bibliographic filtering. Bibliographic filtering works well in a library or database where relevant documents are pre-selected for inclusion in the collection. However, it works poorly on the WWW because there is no pre-selection. Contextual filtering involves capturing some of the value information related to the material. The user would develop a profile of his/her interests and background. Related information such as reading level or point of view would be developed as value-based characteristics of the document, allowing the filter to respond to the user's profile.
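A minimal sketch of such a contextual filter follows. It assumes each document carries two value-based characteristics (a reading level and a point of view) and that the user's profile states which values are acceptable. The field names, the 1-to-5 reading scale, and the sample documents are hypothetical and are not drawn from any DLI project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    reading_level: int   # hypothetical scale: 1 (general) to 5 (specialist)
    point_of_view: str   # e.g. "neutral", "vendor"


@dataclass(frozen=True)
class Profile:
    max_reading_level: int
    accepted_views: frozenset


def context_filter(documents, profile):
    """Keep only documents whose value characteristics fit the user's profile."""
    return [
        doc
        for doc in documents
        if doc.reading_level <= profile.max_reading_level
        and doc.point_of_view in profile.accepted_views
    ]


documents = [
    Document("Intro to SGML", 2, "neutral"),
    Document("SGML internals", 5, "neutral"),
    Document("Vendor white paper", 2, "vendor"),
]
profile = Profile(max_reading_level=3, accepted_views=frozenset({"neutral"}))

print([doc.title for doc in context_filter(documents, profile)])
# prints ['Intro to SGML']
```

Unlike bibliographic filtering, nothing here depends on the documents having been pre-selected into a collection; the filter acts only on the value metadata attached to each document.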

The DLI is also monitoring other digital library projects. The most advanced digital library network concept is NCSTRL, a network of 40 computer science departments. NCSTRL is organizing a set of technical reports in a single discipline within a fully distributed environment. Dr. Larsen recommended the online D-Lib Magazine as a good forum for researchers and developers of advanced digital libraries.

In the future, interactive services could be cataloged as well as documents. Other candidates for cataloging are digital artifacts and active objects. Intelligent collaboration and visualization projects are developing techniques for human collaboration and the creation of electronic collaborative spaces. This might include ways that a civil engineer in CAD (X) can communicate with a physicist in environment (Y). They are looking at metaphors that make metadata smarter instead of making tools smarter.

Another interesting related program is the Intelligent Analyst Associate Program at Rome AFB. Here, they are developing verb instead of noun queries and are developing related concept domains.

Discussion:

NTIS asked what is being done to improve search engine results. The current environment uses inverted indexes and allows "gaming" to slant the search engines' results. Dr. Larsen indicated that the DLI will be putting money into dealing more rigorously with semantics and context in the documents (including some of the UIUC work with statistical techniques). Also, research at UC Berkeley indicates that value-based filtering tools may be valuable in addition to the classic approach of document or query similarity. Dr. Larsen has asked the question in the past, "If you had unlimited bandwidth, how could you improve querying?" An image or piece of an image could be sent. Another possibility is to map the information space and have the user indicate what is missing from his/her understanding. The system would then go looking for the "missing pieces".
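The "gaming" concern raised by NTIS can be seen in a toy inverted index that ranks documents by raw term counts. This is only a sketch: the document identifiers and texts are invented, and the scoring is deliberately naive so that the effect of keyword stuffing is visible. It does not represent how any particular search engine works.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to the documents containing it, with occurrence counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Rank documents by the total number of query-word occurrences."""
    scores = Counter()
    for word in query.lower().split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked]


# Hypothetical documents: one genuine report and one page stuffed with a keyword.
docs = {
    "report": "digital library research report",
    "stuffed": "library library library library cheap offers",
}
index = build_inverted_index(docs)

print(search(index, "digital library"))
# prints ['stuffed', 'report']
```

Because the score depends only on word counts, the stuffed page outranks the relevant report, which is why the discussion turns to semantics, context, and value-based filtering.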

DARPA is still "shopping for ideas" for improving search engines. Dr. Larsen also suggested that the CENDI agencies become involved in the TREC and MUC work.

The future also includes making the document "smarter" through the incorporation of metadata.

DIA and NAIC are using up to three levels of metadata to make the document "smarter". A working model is available on Intelink. IATA metadata workshops (OCLC/NCSA Metadata Workshop) and the Intelligence Analyst Association from the Rome Air Development Center were suggested as important sources on metadata.

It was noted that the marketing profession has experience in developing and utilizing customer profiling. NTIS customers are suggesting interest profiles for "Fedworld". The issue of privacy related to profiling was discussed. Librarians have always included privacy as a professional ethic, even though they have developed "profiles" on customers as they have interacted.

NAIC described its Systran machine translation system and suggested that DARPA and NAIC might benefit by working together on future Systran developments.

Dr. Larsen asked the CENDI members what they thought about the concept of authors providing indexing. Many attendees responded that this approach would depend on the type of vocabulary being applied (controlled versus uncontrolled) and the domain being indexed. It was also mentioned that keywords are often provided by authors as part of the original document, but these are often uncontrolled phrases and may not be used in the database.

DTIC mentioned that it has done work in the area of collaboration technologies.

Dr. Larsen indicated that there are many areas in which the research interests of the DLI and those of the CENDI organizations intersect. It was suggested that an update on the DLI as part of a AAAS Communications Section session at the 1998 meeting would be valuable. Dr. Larsen agreed. Planning must begin by February 1997.



CENDI PLANNING SESSION

Elizabeth Buffum, CENDI Chair, began the second day of the meeting by calling for more emphasis on technical collaboration and education. The agencies have common problems and we need to capitalize on the investments of others. CENDI must focus on key issues, and develop partnerships in the broadest sense.

The Secretariat reviewed some of the key discussion points from the prior day's sessions. A brainstorming session added further interest areas to the key points outlined by the Secretariat.

A vote was then taken on those areas of interest to most agencies. The topics of most interest were:

The Secretariat will consider these discussions when drafting the CENDI Objectives and Activities and the Annual Plan for 1997.

The Proposals from the Secretariat and the Working Groups were also reviewed. The proposals of interest were:

Workshop on Managing Federal Agency Intellectual Capital in a Distributed, Networked Environment

Many groups are struggling with what and who should save electronic, networked information, particularly related to WWW sites.

American Association for the Advancement of Science (AAAS) Initiative on STI

AAAS is the peer group for many high-level policy makers in STI. The proposal from the previous day's session regarding an AAAS session on the Digital Library Initiative would come under this proposal.

Metrics and Promoting the Understanding and Value of STI Management

This was considered to be very important. The discussions might take the form of a symposium, a CENDI focus group, or a regular meeting topic. It should begin with the collection of the metrics currently used by the agencies. These metrics might include production metrics as well as metrics concerning the value of the program to its various customer groups.

Support for the Applications Council of the NSTC Committee on Information Computing, and Communications (CICC)

The CENDI presentation to the NSTC CICC originally scheduled for July has been rescheduled for September. It was also suggested that Mel Ciment be invited to address CENDI.

Executive Order 13011 and the Chief Information Officers' (CIO) Council

This was an area of interest. The group felt that the first step was better briefing on the role and make-up of the Council. How will they relate and communicate with the agencies? What topics will they be dealing with? What is their process? How will affinity groups be formed? CENDI involvement might result in an STI affinity group.

Assessing and Reevaluating Cataloging for Bibliographic Databases in a Networked Information Environment

The CENDI members approved this in principle, but asked the Cataloging Working Group to provide a more complete action plan.

Impact of the Internet on Product Development and Customer Service

This proposal was approved by the Principals. They recommended that the agenda focus on what has been made easier and what are the new challenges. The User Education Working Group was asked to provide a more complete action plan.



CENDI OPERATIONS

The proposal for procedures by which the CENDI members direct the efforts of the Working Groups through specific defined tasks supported by proposals was approved. The Secretariat will finalize the proposal process taking into account the discussion. The procedures should be included as an addendum to the CENDI Handbook.

Due to time constraints, the specific accomplishments and current efforts of the Working Groups and focus groups were not discussed. However, it was noted that the Information Exchange Working Group did not submit any proposals in the Planning Book, because the Principals have already approved three projects that the WG will be addressing between now and June 1997.

OTHER ITEMS

Meetings

The group determined that the frequency of meetings is appropriate. The next meeting will be at DTIC between the end of September and the end of October.

Communications

The Secretariat indicated that electronic means of distribution are being used consistently and effectively. Early problems involving the listservs have been overcome.

Other Products

The CENDI Brochure is undergoing final revisions. The CENDI Database is up-to-date.


Announcements

National Biological Service

The merger activities with the USGS are underway. Staff are meeting with user groups at USGS that are interested in biology and the environment. The Denver Group has been reestablished as the Center for Biological Informatics, which is set up as a center without walls. The USGS currently operates with three strong independent groups: water, mapping, and geologic. Biological will be the fourth.

National Technical Information Service

NTIS reported on its experiences with the Performance-based Organization (PBO) process. OMB Administration is a champion for the PBO initiative, but the PBO requests are reviewed through the conventional OMB process.

Progress has been made on Title 44. The Justice Department ruled that the requirement that printing go through GPO is not in line with the separation of powers. OMB responded by requesting that the agencies not act in accordance with the Justice Department decision until next April.

A new technology of concern is the e-mail encryption built into Netscape 3.0. Netscape 3.0 is expected to become widely used throughout the government over the next few months. If the digital ID isn't registered, there is no way for an agency to access the e-mail of an employee who leaves an organization. There is a real question as to who is going to manage the digital ID's for the government. NTIS will be a trusted authority with full recovery built into the system. NTIS expects that there may be several certifying authorities for handling passwords and ID's in the future.

DTIC

A working group has been established within DTIC to report directly to the Director concerning the reengineering and replacement of the DROLS system.

The emerging issue of Information Warfare Technology is being looked at by DISA and others. It has become an important area for analysis, and DTIC has set up an Information Analysis Center (IAC) to help coordinate information.

DTIC's Guidelines for WWW, which are available from the DTIC homepage (http://www.dtic.dla.mil/), might be of value to others. The National Library for Education recently established guidelines that reference DTIC's WWW Guidelines.

Department of Energy/Office of Scientific and Technical Information (DOE/OSTI)

Regarding the alignment of the Office of Scientific and Technical Information within the Department of Energy, the Secretary did not approve a recommendation to place OSTI within the Office of the Assistant Secretary for Human Resources and Administration. Rather, OSTI's temporary assignment to the Office of Energy Research (ER) is expected to be made permanent. To that end, ER is currently studying where to place OSTI within its organization.

There is support for DOE STI within Congress and there is excellent international support. The Inspector General and the General Accounting Office (GAO) are studying DOE's STI management, because there is concern that more needs to be done in terms of R&D STI management and dissemination.

NASA

Four civil servants have been transferred from NASA HQ to Langley Research Center as part of the transition of operations to the lead center. The STI program is working with the NASA Centers on the Technical Report Servers (TRS's) (including full text, bibliographic records, and images). The plan is to transfer the data to CASI as the primary holder. Electronic copyright issues are surfacing, along with questions of the proper reviews and signatures.

The issue of references within homepage documents to documents that are not publicly available was raised. The CIO's at the Centers are concerned and some are establishing their own policies in advance of HQ. There is a need to make authors and system administrators aware of the liability issues if the data are wrong or misused.

NLM

FY97 is NLM's International Year. An international focus will be emphasized in its content as well as dissemination. An International Planning Council is being headed by Don Fredrickson and will include Vint Cerf, Floyd Bloom, and Gene Wong. It is expected that domestic as well as international changes will result from this effort.

NLM recently completed a survey on Internet access. It was a well-developed mail questionnaire. The survey was administered to 2,500 randomly selected MEDLINE users. The response rate with follow-up was 82-83%. The purpose was to assess the customers' readiness to move to Internet access. They found a considerable degree of readiness. Seventy-five percent have access, but only 25% were using it to access MEDLINE. There is still a substantial amount of dial-up usage. With the upgrade projections, about 90% of the users will have access to the Internet within the next 12 months. Three-quarters of the respondents have fairly substantial modems and platforms. Only 20% of the user base are information professionals, but the usage of MEDLINE among this 20% is very heavy. Rural usage, however, is substantially lower, especially in hospitals. There was a 90-95% satisfaction rating. A technical report is being prepared for distribution.

NAIC

The Open Source Information System (OSIS) has over 20 major nodes. Embassies, Defense R&D, and the management councils of the Services will be included. The backbone T1 service is available, including access to the WWW and Internet.

Of the 10 million CIRC records, 1.5 million have been moved from the IBM mainframe to the client/server environment. The DCARS visualization tool has been integrated with RetrievalWare and will be available on the WWW. NAIC is offering 11 online machine translation systems (9 for the WWW). The user provides the text to be translated by entering a URL or by pasting or keying the text into an editor provided with the system. Windows and UNIX versions are available free of charge to all government organizations.

The Systran machine translation (MT) system and the Cuneiform OCR engine are being deployed by the U.S. Army in Bosnia. Systran is currently working on the Serbo-Croatian dictionary. The OCR software for Chinese from ECI is being deployed to FBIS and to Army and embassy groups in the Pacific Rim.

NAIC is using RetrievalWare (formerly ConQuest) from Excalibur as the text retrieval engine. RetrievalWare is forming a federal users group to better address the needs of this community. Anyone interested in attending the meetings should contact Major Tom Bazzoli at NAIC.
