CENDI ANNUAL PLANNING MEETING
Bavarian Inn, Shepherdstown, WV
August 27-28, 1996
MINUTES
Changing STI Management in a Networked Environment
Keynote Session: Information Technology, Public Policy, and
STI Management
DARPA Digital Library Initiative - Ron Larsen
Ron Larsen, Program Manager of the Defense Advanced Research Projects
Agency (DARPA) Digital Library Initiative (previously, Associate
Director of Libraries at the University of Maryland), provided
an overview of the current and future technologies as viewed
through the Digital Library Initiative (DLI). (The IITA Workshop definition of digital library
includes digital pointers to physical collections such as hardcopy
books. This is a broader definition than used by most services.
This is important to the government since much of the information
is not and probably never will be in electronic form.) Dr.
Larsen described the evolution of network technologies and the
Internet, the specific technologies being addressed by the various
DLI projects, and the plans for moving the technologies into real-world
environments.
Dr. Larsen described the evolution of network technologies and
the Internet in three stages: access, organization, and correlation
and analysis. In the 1980's, the
focus was on access. The technologies have matured from simple
file server technologies, through Gopher and WAIS, to more sophisticated
WWW search engines such as Lycos, Alta Vista and Yahoo.
Throughout the 1990's, the Internet will take on the characteristics
of an Information Repository. SGML will provide structure within
documents and allow multimedia integration. In DLI projects,
particularly at the University of Illinois at Urbana-Champaign
(UIUC), SGML and document structure are being evaluated. While
UIUC discovered that SGML was more of a guideline than a standard
and, therefore, required individual mapping for each source into
a common format, the publishers were optimistic because UIUC could
map SGML from various sources. This was proof that SGML could
be used and that the path to use was a linear problem that could
be solved.
In the 21st Century, the move will be toward an analysis paradigm
that deals with content and categories. The future focus on analysis
will involve semantic and broadband interoperability. DLI is
investigating systems and tools that raise the users' capabilities
from the lower levels to concept-based. The automatic indexing
concepts of 20-30 years ago, such as computer-assisted (semantic)
indexing and vocabulary switched retrieval, are now having an
impact because of supercomputers. It is the view of the UIUC
project that the supercomputers of today will be the common desktop
devices in 5-10 years. Therefore, techniques that can be created
using the brute force computing power of supercomputers, such
as statistical clustering to get semantic spaces and switching
between the vocabularies of related disciplines, are pragmatic technologies
for the DLI. (See Bruce Schatz' article in the April, 1996 issue
of "Science".) Creating concept spaces from 400K
Inspec records took one day on a supercomputer. Merging 600 spaces
for Compendex (4M records) took 3.5 days.
Semantic interoperability is the grand challenge of the Digital
Library concept. This includes semantic interoperability across
subject domains and vocabulary switching to suggest terms across
domains. UIUC has developed a prototype where users can search
the vocabulary in one domain and be guided to related terms in
another domain.
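As an illustration only, the vocabulary-switching idea can be sketched with simple term co-occurrence counts. The minutes do not describe the UIUC algorithm in detail, so the counting scheme, the function names (build_concept_space, suggest_terms), and the toy records below are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(records):
    """Count how often each pair of index terms appears on the same record."""
    cooccur = defaultdict(Counter)
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur


def suggest_terms(cooccur, term, target_vocabulary, limit=3):
    """Suggest terms from another domain's vocabulary, ranked by co-occurrence."""
    candidates = [
        (count, other)
        for other, count in cooccur[term].items()
        if other in target_vocabulary
    ]
    # Highest count first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in candidates[:limit]]


# Hypothetical records: each is the list of index terms assigned to one item.
records = [
    ["fluid flow", "turbulence", "pipeline corrosion"],
    ["fluid flow", "turbulence", "blood circulation"],
    ["fluid flow", "blood circulation", "hemodynamics"],
    ["turbulence", "pipeline corrosion"],
]
medical_vocabulary = {"blood circulation", "hemodynamics"}

space = build_concept_space(records)
suggestions = suggest_terms(space, "fluid flow", medical_vocabulary)
print(suggestions)
```

Here a searcher who knows the engineering term "fluid flow" is guided to the medical terms that most often share a record with it. Statistical clustering at the scale described in the minutes would need far more than raw pair counts.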
DLI has also raised the notion of non-professionals building their
own indexes. It is believed that amateur indexers will build
an amateur index but that, if they are given a tool that helps them
work more professionally, the results may improve.
DARPA's information management direction is to develop interoperable
and scalable middleware. They are trying to apply library concepts
(the absence of a priori knowledge of relevant information) to
the new networked environment. Image understanding, high performance
knowledge bases, etc. attempt to vertically integrate for decision
making. However, the need to horizontally integrate is missing.
There's a need to broaden the base but in a way that would be
useful.
The information management focus is critical to DARPA's major
defense focus which is Situation Awareness. Situation Awareness
requires real-time tactical and reference data from a variety
of sources. Acquiring the information real-time is only part
of the challenge. The other challenge is to correlate and manipulate
the information through context and value filtering and provide
information space visualization. To address this need, Carnegie
Mellon is working on real-time ingest and categorization technologies.
The video feeds are converted to text through speech recognition,
and the text is then indexed. This occurs almost in real-time.
Translingual Interaction is another information management concept
of importance to the DoD community. This includes machine translation
of queries, so that foreign databases can be searched and responses
returned in the user's native language.
DARPA is concerned about taking the lessons learned and the technologies
from the open environment of the DLI to the closed environment
of DoD. They are working with the Corporation for National Research
Initiatives (CNRI) to develop a secure repository; funding from
DTIC is being used to test this in a real environment.
There are questions about how to move from where we are now (relatively
simple network services) to where we want to be (a global digital
library). Traditional library concepts, such as authority, holdings,
repository and circulation, must be translated to the new environment.
The traditional library concept of the authority first included
names and identifiers, such as ISBN's and ISSN's. However, the
concept of authority now has new meaning in the world of URL's,
PURL's, and URI's. The library model suggests that holdings are
more than just URL's. A unique name or "handle" is
necessary for each digital object, as well as the identification
of properties, some of which relate to cost, archiving, and readability.
In the new environment, the repository is made up of both content
and metadata. The repository is accessed via a "handle"
and includes reference information, security information, and
information relevant to intellectual property protection. In
Dr. Larsen's view, intellectual property and electronic commerce
are two sides of the same coin. Some token must be presented
that gets you something. While the initial impact of the Internet
may have been to promote the perception of information as free,
Dr. Larsen suggested that the implementation of electronic commerce
may bring about a change in this perception. The library model
continues with the concept of circulation. Digital objects have
different manifestations (as they are created, stored and disseminated).
A transaction log of what has been happening to a document is
the equivalent of circulation in the DLI environment.
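The translated library model described above (a handle for each object, content stored with metadata, and a transaction log standing in for circulation) can be sketched roughly as follows. This is not the CNRI repository design; the class names, the metadata keys, and the handle string are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DigitalObject:
    handle: str                     # unique name for the object (hypothetical format)
    content: bytes
    metadata: dict = field(default_factory=dict)


class Repository:
    """Toy repository: objects are reached by handle, and every action is logged."""

    def __init__(self):
        self._objects = {}
        self._log = []              # transaction log, the analogue of circulation

    def _record(self, handle, action, user):
        self._log.append((datetime.now(timezone.utc), handle, action, user))

    def deposit(self, obj, user):
        if obj.handle in self._objects:
            raise ValueError(f"handle already in use: {obj.handle}")
        self._objects[obj.handle] = obj
        self._record(obj.handle, "deposit", user)

    def retrieve(self, handle, user):
        obj = self._objects[handle]
        self._record(handle, "retrieve", user)
        return obj

    def history(self, handle):
        """Return (action, user) pairs for one object, oldest first."""
        return [(action, user) for _, h, action, user in self._log if h == handle]


repo = Repository()
report = DigitalObject(
    handle="example/report-001",
    content=b"full text of the report",
    metadata={"rights": "public", "format": "text"},
)
repo.deposit(report, user="alice")
fetched = repo.retrieve("example/report-001", user="bob")
```

Calling repo.history("example/report-001") then lists the deposit and the retrieval in order, which is the kind of record the minutes equate with circulation.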
There are six digital library research projects under the DLI
focusing on a wide range of collection, storage, organization
and retrieval issues. They are:
The University of California/Berkeley is working on automatic indexing
of environmental information. They have suggested the concept of
"multivalent documents". A legacy document in printed form is scanned
to get an archival copy as the basis for a bit-mapped document. There
are virtual layers on top of the archival copy that the user can control.
For example, a user might annotate a version, creating a new valence
for that document. There is a problem of not knowing what to call the
fragments made from these multivalent documents.
The University of Michigan is establishing a multimedia testbed in the
K-12 Earth and Space Sciences curricula. They are focusing on user
interfaces, mediation, and collection development.
The University of California/Santa Barbara is focusing on cartographic
information including maps and other spatially-indexed materials.
Stanford University modified its original proposal to focus on
interoperability architectures.
The University of Illinois at Urbana-Champaign is focusing on access
and display of complete contents, including text, figures, graphics,
etc. They are also concerned with semantic retrieval and are involved
in vocabulary control issues, including thesaurus development.
Carnegie Mellon's project is a digital video library using speech
recognition, machine vision, and natural language-understanding
technologies.
DARPA is concerned about the interoperability of the DLI projects,
not just each individual project. Therefore, UC Berkeley and
UC Santa Barbara are doing interoperability experiments. Carnegie
Mellon and MIT are looking at the interoperability of video objects.
Michigan and Stanford are working on widely usable middleware.
The DLI is moving toward larger scale federated repositories,
and from custom application software and architecture to generic
approaches. Document sizes should increase over the next five
years from an average of 1 megabyte per document to 100 megabytes.
Response time should decrease from 10 seconds to 100 milliseconds.
In addition, the DLI will move from a Multilingual (multiple
language) to a Translingual (automatic translations for documents
and queries) environment.
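Taken together, the size and response-time targets above imply a large jump in effective delivery rate. The short calculation below works this out under the assumption, not stated in the minutes, that the response time covers delivery of the whole document; the function name is hypothetical.

```python
def effective_rate_mb_per_s(size_mb, response_ms):
    """Megabytes delivered per second if the whole document arrives within the response time."""
    return size_mb * 1000 / response_ms


current_rate = effective_rate_mb_per_s(size_mb=1, response_ms=10_000)   # 1 MB in 10 s
target_rate = effective_rate_mb_per_s(size_mb=100, response_ms=100)     # 100 MB in 100 ms

# Size grows 100-fold while response time shrinks 100-fold.
improvement_factor = (100 / 1) * (10_000 / 100)

print(current_rate, target_rate, improvement_factor)
```

Under that assumption the targets correspond to moving from about 0.1 megabytes per second to about 1,000 megabytes per second, a 10,000-fold increase.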
Context filtering will replace or supplement bibliographic filtering.
Bibliographic filtering works well in a library or database where
relevant documents are pre-selected for inclusion in the collection.
However, it works poorly on the WWW because there is no pre-selection.
Contextual filtering involves capturing some of the value information
related to the material. The user would develop a profile of his/her
interests and background. Related information such as reading
level or point of view would be developed as value-based characteristics
of the document, allowing the filter to respond to the user's profile.
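A minimal sketch of such a profile-driven filter follows. The fields (topics, reading_level), the 1-to-5 reading scale, and the matching rule are assumptions made for illustration; the minutes do not specify how value characteristics would be represented.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    topics: set
    reading_level: int      # hypothetical scale: 1 (general reader) to 5 (specialist)


@dataclass
class UserProfile:
    interests: set
    max_reading_level: int


def context_filter(documents, profile):
    """Keep documents that share a topic with the profile and are not too advanced."""
    matches = []
    for doc in documents:
        shared = doc.topics & profile.interests
        if shared and doc.reading_level <= profile.max_reading_level:
            matches.append((len(shared), doc.title))
    # Most shared topics first; ties broken by title for stable output.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [title for _, title in matches]


documents = [
    Document("Wetlands Primer", {"wetlands", "ecology"}, reading_level=2),
    Document("Hydrologic Modeling Methods", {"wetlands", "hydrology"}, reading_level=5),
    Document("Soil Survey Notes", {"soils"}, reading_level=2),
]
profile = UserProfile(interests={"wetlands", "ecology"}, max_reading_level=3)

selected = context_filter(documents, profile)
print(selected)
```

The second document matches on topic but is filtered out by reading level, which is the kind of value-based decision that a purely bibliographic filter would not make.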
The DLI is also monitoring other digital library projects. The
most advanced digital library network concept is NCSTRL, a network of 40 computer science departments.
NCSTRL is organizing a set of technical reports in a single discipline
within a fully distributed environment. Dr. Larsen recommended
the online D-Lib Magazine as a good forum for researchers and developers of advanced digital
libraries.
In the future, interactive services could be cataloged as well
as documents. Other candidates are digital artifacts and active objects.
Intelligent collaboration and visualization projects are developing
techniques for human collaboration and the creation of electronic
collaborative spaces. This might include ways that a civil engineer
in CAD(X) can communicate with a physicist in environment (Y).
They are looking at metaphors that make metadata smarter instead
of making the tools smarter.
Another interesting related program is the Intelligent Analyst
Associate Program at Rome AFB. Here, they are developing verb
instead of noun queries and are developing related concept domains.
Discussion:
NTIS asked what is being done to improve search engine results.
The current environment uses inverted indexes and allows "gaming"
to slant the search engines' results. Dr. Larsen indicated that
the DLI will be putting money into dealing more rigorously with
semantics and context in the documents (including some of the
UIUC work with statistical techniques). Also, research at UC
Berkeley indicates that value-based filtering tools may be valuable
in addition to the classic approach of document or query similarity.
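The "gaming" concern can be illustrated with a toy inverted index. The ranking rule used here (raw term count) is an assumption chosen to make the weakness visible; the minutes do not say how any particular search engine ranks results, and the page names and text are invented.

```python
from collections import Counter, defaultdict


def build_inverted_index(pages):
    """Map each word to a count of its occurrences on each page."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word][page_id] += 1
    return index


def search(index, word):
    """Return page ids ranked by how often the word occurs (naive ranking)."""
    postings = index.get(word.lower(), Counter())
    return sorted(postings, key=lambda page_id: (-postings[page_id], page_id))


# Hypothetical pages: the second repeats a keyword to push itself up the ranking.
pages = {
    "report": "corrosion testing of steel pipelines",
    "stuffed": "corrosion corrosion corrosion buy our products",
}

index = build_inverted_index(pages)
ranking = search(index, "corrosion")
print(ranking)
```

Because the index records only which words occur and how often, the keyword-stuffed page outranks the substantive report. Semantic and value-based filtering are aimed at exactly this gap.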
Dr. Larsen has asked the question in the past, "If you had
unlimited bandwidth, how could you improve querying?" An
image or piece of an image could be sent. Another possibility
is to map the information space and have the user indicate what
is missing from his/her understanding. The system would then go
looking for the "missing pieces".
DARPA is still "shopping for ideas" for improving search
engines. Dr. Larsen also suggested that the CENDI agencies become
involved in the TREC and MUC work.
The future also includes making the document "smarter" through the incorporation of metadata.
DIA and NAIC are using up to three levels of metadata to make
the document "smarter". A working model is available
on Intelink. IATA metadata workshops (OCLC/NCSA Metadata Workshop)
and the Intelligence Analyst Association from the Rome Air Development
Center were suggested as important sources on metadata.
It was noted that the marketing profession has experience in developing
and utilizing customer profiling. NTIS customers are suggesting
interest profiles for "Fedworld". The issue of privacy
related to profiling was discussed. Librarians have always included
privacy as a professional ethic, even though they have developed
"profiles" on customers as they have interacted.
NAIC described its Systran machine translation system and suggested that DARPA and NAIC might benefit by working together on future Systran developments.
Dr. Larsen asked the CENDI members what they thought about the
concept of authors providing indexing. Many attendees responded
that this approach would depend on the type of vocabulary being
applied (controlled versus uncontrolled) and the domain being
indexed. It was also mentioned that keywords are often provided
by authors as part of the original document, but these are often uncontrolled
phrases and may not be used in the database.
DTIC mentioned that it has done work in the area of collaboration
technologies.
Dr. Larsen indicated that there are many areas in which the research
interests of the DLI and those of the CENDI organizations intersect.
It was suggested that an update on the DLI as part of an AAAS
Communications Section session at the 1998 meeting would be valuable.
Dr. Larsen agreed. Planning must begin by February 1997.
Elizabeth Buffum, CENDI Chair, began the second day of the meeting
by calling for more emphasis on technical collaboration and education.
The agencies have common problems and we need to capitalize on
the investments of others. CENDI must focus on key issues, and
develop partnerships in the broadest sense.
The Secretariat reviewed some of the key discussion points from
the prior day's sessions. A brainstorming session added further
interest areas to the key points outlined by the Secretariat.
A vote was then taken on those areas of interest to most agencies.
The topics of most interest were:
The Secretariat will consider these discussions when drafting
the CENDI Objectives and Activities and the Annual Plan for 1997.
The Proposals from the Secretariat and the Working Groups were
also reviewed. The proposals of interest were:
Workshop on Managing Federal Agency Intellectual Capital in a Distributed, Networked Environment
Many groups are struggling with the question of which electronic,
networked information should be saved and by whom, particularly
for WWW sites.
American Association for the Advancement of Science (AAAS) Initiative on STI
AAAS is the peer group for many high-level policy makers in STI.
The proposal from the previous day's session regarding an AAAS
session on the Digital Library Initiative would come under this
proposal.
Metrics and Promoting the Understanding and Value of STI Management
This was considered to be very important. The discussions might
take the form of a symposium, a CENDI focus group, or a regular
meeting topic. It should begin with the collection of the metrics
currently used by the agencies. These metrics might include production
metrics as well as metrics concerning the value of the program
to its various customer groups.
Support for the Applications Council of the NSTC Committee on Information, Computing, and Communications (CICC)
The CENDI presentation to the NSTC CICC originally scheduled for
July has been rescheduled for September. It was also suggested
that Mel Ciment be invited to address CENDI.
Executive Order 13011 and the Chief Information Officers' (CIO) Council
This was an area of interest. The group felt that the first step
was a better briefing on the role and make-up of the Council. How
will they relate to and communicate with the agencies? What topics
will they be dealing with? What is their process? How will affinity
groups be formed? CENDI involvement might result in an STI affinity
group.
Assessing and Reevaluating Cataloging for Bibliographic Databases in a Networked Information Environment
The CENDI members approved this in principle, but asked the Cataloging Working Group to provide a more complete action plan.
Impact of the Internet on Product Development and Customer Service
This proposal was approved by the Principals. They recommended
that the agenda focus on what has been made easier and what are
the new challenges. The User Education Working Group was asked
to provide a more complete action plan.
The proposal for procedures by which the CENDI members direct
the efforts of the Working Groups through specific defined tasks
supported by proposals was approved. The Secretariat will finalize
the proposal process taking into account the discussion. The
procedures should be included as an addendum to the CENDI Handbook.
Due to time constraints, the specific accomplishments and current
efforts of the Working Groups and focus groups were not discussed.
However, it was noted that the Information Exchange Working Group
did not submit any proposals in the Planning Book, because the
Principals have already approved three projects that the WG will
be addressing between now and June 1997.
OTHER ITEMS
Meetings
The group determined that the frequency of meetings is appropriate.
The next meeting will be at DTIC between the end of September
and the end of October.
Communications
The Secretariat indicated that electronic means of distribution
are being used consistently and effectively. Early problems involving
the listservs have been overcome.
Other Products
The CENDI Brochure is undergoing final revisions. The CENDI Database
is up-to-date.
National Biological Service
The merger activities with the USGS are underway. Staff are meeting
with user groups at USGS that are interested in biology and the
environment. The Denver Group has been reestablished as the Center
for Biological Informatics which is set up as a center without
walls. The USGS currently operates with three strong independent
groups: water, mapping, and
geologic. Biological will be the fourth.
National Technical Information Service
NTIS reported on its experiences with the Performance-based Organization
(PBO) process. OMB Administration is a champion for the PBO initiative,
but the PBO requests are reviewed through the conventional OMB
process.
Progress has been made on Title 44. The Justice Department ruled
that the requirement that printing go through GPO is not in line with
the separation of powers. OMB responded by requesting that the agencies
not act in accordance with the Justice Department decision until next
April.
A new technology of concern is the e-mail encryption built into
Netscape 3.0. Netscape 3.0 is expected to become widely used
throughout the government over the next few months. If the digital
ID isn't registered, there is no way for an agency to access the
e-mail of an employee who leaves an organization. There is a
real question as to who is going to manage the digital ID's for
the government. NTIS will be a trusted authority with full recovery
built into the system. NTIS expects that there may be several
certifying authorities for handling passwords and ID's in the
future.
DTIC
A working group has been established within DTIC to report directly
to the Director concerning the reengineering and replacement of
the DROLS system.
The emerging issue of Information Warfare Technology is being
looked at by DISSA and others. It has become an important area
for analysis and DTIC has set up an Information Analysis Center
(IAC) to help coordinate information.
DTIC's Guidelines for WWW, which are available from the DTIC homepage
(http://www.dtic.dla.mil/), might be of value to others.
The National Library for Education recently established guidelines
that reference DTIC's WWW Guidelines.
Department of Energy/Office of Scientific and Technical Information
(DOE/OSTI)
Regarding the alignment of the Office of Scientific and Technical
Information within the Department of Energy, the Secretary did
not approve a recommendation to place OSTI within the Office of
the Assistant Secretary for Human Resources and Administration.
Rather, OSTI's temporary assignment to the Office of Energy Research
(ER) is expected to be made permanent. To that end, ER is currently
studying where to place OSTI within its organization.
There is support for DOE STI within Congress and there is excellent
international support. The Inspector General and the General
Accounting Office (GAO) are studying DOE's STI management, because
there is concern that more needs to be done in terms of R&D
STI management and dissemination.
NASA
Four civil servants have been transferred from NASA HQ to Langley
Research Center as part of the transition of operations to the
lead center. The STI program is working with the NASA Centers
on the Technical Report Servers (TRS's) (including full text,
bibliographic records, and images). The plan is to transfer the
data to CASI as the primary holder. Electronic copyright issues
are surfacing, along with questions of the proper reviews and
signatures.
The issue of references within homepage documents to documents
that are not publicly available was raised. The CIO's at the
Centers are concerned and some are establishing their own policies
in advance of HQ. There is a need to make authors and system
administrators aware of the liability issues if the data are wrong
or misused.
NLM
FY97 is NLM's International Year. An international focus will
be emphasized in its content as well as dissemination. An International
Planning Council is being headed by Don Frederickson and will
include Vint Cerf, Floyd Bloom, and Gene Wong. It is expected
that domestic as well as international changes will result from
this effort.
NLM recently completed a survey on Internet access. It was a
well-developed mail questionnaire. The survey was administered
to 2,500 randomly selected MEDLINE users. The response rate with
follow-up was 82-83%. The purpose was to assess the customers'
readiness to move to Internet access. They found a considerable
degree of readiness. Seventy-five percent have access, but only
25% were using it to access MEDLINE. There is still a substantial
amount of dial-up usage. With the upgrade projections, about
90% of the users will have access to the Internet within the next
12 months. Three-quarters of the respondents have fairly substantial
modems and platforms. Only 20% of the user base are information
professionals, but the usage of MEDLINE among this 20% is very
heavy. Rural usage, however, is substantially lower, especially
in hospitals. There was a 90-95% satisfaction rating. A technical
report is being prepared for distribution.
NAIC
The Open Source Information System (OSIS) has over 20 major nodes.
Embassies, Defense R&D, and the management councils of the
Services will be included. The backbone T1 service is available,
including access to the WWW and Internet.
Of the 10 million CIRC records, 1.5 million have been moved from
the IBM mainframe to the client/server environment. The DCARS
visualization tool has been integrated with RetrievalWare and
will be available on the WWW. NAIC is offering 11 online machine
translation systems (9 for the WWW). The user provides the text
to be translated by entering a URL or by pasting or keying the
text into an editor provided with the system. Windows and UNIX
versions are available free of charge to all government organizations.
The Systran machine translation (MT) system and the Cuneiform
OCR engine are being deployed by the U.S. Army in Bosnia. Systran
is currently working on the Serbo-Croatian dictionary. The OCR
software for Chinese from ECI is being deployed to FBIS and to
Army and embassy groups in the Pacific Rim.
NAIC is using RetrievalWare (formerly ConQuest) from Excalibur
as the text retrieval engine. RetrievalWare is forming a federal
users group to better address the needs of this community. Anyone
interested in attending the meetings should contact Major Tom
Bazzoli at NAIC.