CENDI PRINCIPALS AND ALTERNATES MEETING
NASA Center for Aerospace Information
Linthicum Heights, MD
February 9, 1998
WELCOME
Kurt Molholm, Deputy Chair, began the meeting at 9:10 a.m. He thanked NASA CASI for hosting the meeting. Introductions were made.
Ms. Carroll introduced the program. One of the major topics identified for the 1998 CENDI year is Digital Libraries. CENDI has had presentations from both DARPA (http://www.darpa.mil/), on the DARPA/NSF Digital Library 1 & 2 initiatives, and the National Agricultural Library (http://www.nalusda.gov/), on its AgNIC project. The presentations for this meeting set both the broader context (issues of digital information in general, from CNI) and specific architectures and approaches (from the Library of Congress).
Coalition for Networked Information (CNI) and Issues in Digital Information
Clifford Lynch, President, CNI
Dr. Lynch introduced CNI (http://www.cni.org) and its program. The goal is to advance scholarship and intellectual creativity through networking and technology. CNI was formed in 1990 by three main sponsors: the Association of Research Libraries, CAUSE, and EDUCOM. There are now over 200 institutional members, predominantly institutions of higher education. Others include information providers, state libraries, publishers, and library associations. All are concerned about the impact of networking on education. Through its broad membership, CNI has been an effective incubator for projects, particularly with publishers and information providers.
No one worried about policies and content when networking was first developing. Now there are four major areas in which CNI is involved: 1) general advocacy about networked information, 2) content and organization on the Internet, 3) organization and professional issues, including strategies and best practices, and 4) standards and infrastructure.
In the area of content, CNI is working with other groups on arts and humanities and cultural heritage content on the Internet. These groups include the National Humanities Alliance and the Coalition for the Interchange of Museum Information (CIMI). CNI has been a forum for organizations working on licensing issues. It has also been involved in the organization of content, including the development of metadata schemes.
Metadata schemas are an area of interest for CNI. There has been an explosion of interest and activity in this area. Work is relatively far along on the Dublin Core. Currently, there are questions about the infrastructure needed to support such metadata. The Resource Description Framework (RDF) being developed by the World Wide Web Consortium (W3C) may solve many of the issues. Many of the other metadata efforts are domain specific; there is much overlap, and interoperability is very awkward at this time. There is a need to look at the broad problems, rather than just at specific schema problems. There will be a focus on this for the next year or two.
To date, the focus on metadata has been on descriptive metadata, but this is a very small part of what must be known about an electronic resource. Authenticity and provenance are important as well. With relation to data sets, you need to know something about the collection techniques. With regard to images, you need to know about compression ratios and color corrections. There are issues of rights management. Administrative elements might have implications for digital preservation. All this kind of data is needed to support the digital object.
CNI is also involved in discussions related to the preservation of digital material. It is surprisingly difficult to get traction here. In 1995, there was a very important report from the Research Libraries Group (RLG) on this issue. However, since that time little has been done. The Council on Library and Information Resources and the Association of Research Libraries (ARL) are trying to get this issue moving, with several funding proposals pending. Collective conversation is needed to establish priorities related to the digitization and preservation of content. What is critical, and what are the requirements for such archiving? Some organizations are taking snapshots of the Web, but this is not necessarily an approach that will be very responsive in the future. More detailed tracking is needed. The newspaper industry is a good example. Newspapers are very important for establishing trends in many fields, so archiving them for later reference is extremely important. Newspaper publishers are going online in complicated ways. Some online newspapers duplicate their print versions, and others are only partially the same. This affects what is archived and what isn't.
CNI is involved in several organizational projects:
IWIS (Institution Wide Information Strategy) that will determine best practices at policy and operating levels for information management.
Assessing the Networked Information Environment involves approximately 20 years' worth of information technology assessments. Entities such as university regents are asking for data on the degree to which information technology has improved education. They are looking for valuable metrics. The group is working with Chuck McClure and his assessment methodology. About twelve institutions are involved in this project to date.
In the areas of Infrastructure and Standard Practices there are several key issues. The first one is authentication. There has been work done in the framework of electronic commerce. However, the issue CNI is interested in is the licensing of an information product for a user wherever that user is physically located at any given time, whether at home, on campus, or in the office. We don't have a viable infrastructure to support this, particularly from a Web browser. The policy side of this issue involves the fact that libraries gave anonymous access in the print environment. If the framework is implemented in the wrong way, all the information about usage will go to the publisher at the individual user level, and this raises issues of privacy. Access may be limited by license but management procedures should have an impact on authentication that will allow balance between indirection and the privacy issues. One technology under discussion is the issuance of certificates that can be used regardless of where the user is physically located. This highlights the users' responsibilities regarding the copyrighted information.
The Internet Engineering Task Force (IETF) has almost completed its work on Uniform Resource Names (URN). The structure for identifiers and their schemes remains to be decided. Questions include: What kinds of identifiers are needed (it's likely that there will be more than one)? Who should manage and operate the identifier? What will they identify in the long term (issues of long term reference and preservation)?
There have been many discussions between the URN and the Digital Object Identifier (DOI) groups. The URN is a framework that doesn't identify a specific identifier to be used within it, but puts some constraints on the identifier. The DOI is described within the URN framework. Some problems remain on the DOI to URN mapping, but they are basically compatible. The DOI is very much in a commerce context and identifies the content needed for use on the order screen. CNI has been involved in a dialogue on these issues. The issues come up when you try to use the identifier in a context for which it wasn't designed.
Interesting high level architecture issues related to Z39.50, the standard protocol for searching distributed heterogeneous databases, are starting to be addressed.
The Internet II is another area of interest. This project is driven by higher education and its need for a high speed, guaranteed quality of service system. The multimedia content is needed for distance learning. EDUCOM's Networking and Telecommunications Access project addresses this well.
Distance instruction technologies are going to be necessary. The National Learning Infrastructure Initiative (NLII) under EDUCOM is a partnership among providers and higher education institutions. They have developed the Instructional Management System (IMS) as the basis for a digital library for instructional content. Content may include demos, simulations, prototypes, etc. Outstanding questions include how to link to online catalogs, how metadata is used, and how instructional material is integrated with a broader information universe. This program theme is expected to continue for the next few years.
The quality of service issue is of real concern. Right now the telecommunications system underlying the Internet is unreliable because it offers only best-effort delivery. The Internet is turned into a reliable service by TCP at the hosts, which accomplishes this through timeouts, retransmissions, acknowledgment of data, etc., all of which slow the system down. This type of architecture doesn't work well for multimedia, since you need to massively overcommit the service in order to provide the speed, bandwidth and duration needed for multimedia. Right now pilots of Internet telephony and streaming video don't work well.
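The cost of building reliability on top of best-effort delivery can be illustrated with a small sketch. This is a simplified stop-and-wait scheme, not TCP itself; the names send_reliably and drop_first_attempt and the simulated channel are assumptions made for illustration only.

```python
def drop_first_attempt():
    """Hypothetical lossy channel: loses the first transmission of each packet."""
    seen = set()

    def channel(seq, data):
        # Return True when an acknowledgment arrives before the timeout.
        if seq in seen:
            return True
        seen.add(seq)
        return False

    return channel


def send_reliably(packets, channel, max_tries=5):
    """Send each packet, retransmitting after a (simulated) timeout until acked."""
    delivered = []
    attempts = 0
    for seq, data in enumerate(packets):
        for _ in range(max_tries):
            attempts += 1
            if channel(seq, data):
                delivered.append(data)
                break
        else:
            raise TimeoutError(f"packet {seq} was never acknowledged")
    return delivered, attempts


delivered, attempts = send_reliably(["a", "b", "c"], drop_first_attempt())
```

In this sketch every packet arrives, but the sender makes twice as many transmissions as there are packets, which is the kind of slowdown described above.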
The new approach is called RSVP (Resource Reservation Protocol). This protocol requests a certain bandwidth for a certain time with reliability. How do you design negotiating applications in this environment? How do you put up with router and trunk problems? If you overcommit, by how much? New business arrangements will be needed in a commercial environment built on RSVP. A constrained Internet II testbed in this area would be valuable.
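The overcommitment question can be made concrete with a minimal admission-control sketch. This does not implement RSVP messages; the Link class, its overcommit factor, and the figures used are hypothetical and only show how the choice of factor decides which requests are accepted.

```python
class Link:
    """Hypothetical link that admits bandwidth reservations up to a limit."""

    def __init__(self, capacity_mbps, overcommit=1.0):
        self.capacity_mbps = capacity_mbps
        self.overcommit = overcommit  # 1.0 means no overcommitment
        self.reserved = []

    def request(self, mbps):
        """Accept the reservation only if the overcommitted limit is not exceeded."""
        limit = self.capacity_mbps * self.overcommit
        if sum(self.reserved) + mbps <= limit:
            self.reserved.append(mbps)
            return True
        return False


link = Link(capacity_mbps=100, overcommit=1.5)
answers = [link.request(60), link.request(60), link.request(60)]
```

With a factor of 1.5 the second request is admitted even though the two reservations together exceed the physical capacity; the third is refused. A factor of 1.0 would refuse the second request as well.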
Authentication and reservation protocols must be developed. This is particularly important to the interactive control of remote equipment. Who is allowed to ask for what bandwidth, and when? Unfortunately, these concepts may resurrect structures resembling the time-share computer systems that preceded the network environment.
Digital signature technologies are developing and are being implemented at the state government level. Unfortunately, there have been many different schemes developed. The federal agencies are just starting to implement digital signatures and it is a different technology than that used by the states. There have been patent positions on the algorithms used for creating these digital signatures. There are also different algorithms that work at different levels of the problem. Some of them may actually help the cause of interoperability.
Article 2B of the Uniform Commercial Code (UCC 2B), covering information products, is also being reviewed. It is being reviewed first at the national level and then may be adopted at the state level. There are pieces of the Internet content arena (such as photographs and likenesses) that have historically been regulated by state laws and regulations, which are relatively non-uniform. There is also the issue of which court has jurisdiction.
Discussion
Mary Levering of the Copyright Office indicated that the Electronic Copyright Deposit System run by the Copyright Office is now using the RSA algorithm for digital signatures. They are also using handle management technology. Extensibility and interoperability are important.
Several CENDI members indicated that they are particularly interested in the rights of the federal government to use its own information. Both Ms. Levering and Dr. Lynch indicated that they have not heard specific discussions in any of these areas. The laws, codes, and procedures leave the final decision to be developed through case-by-case analysis. However, pending database legislation could change many things and needs to be closely watched.
Architecting Digital Libraries at the Library of Congress
Jim Stevens, Special Projects, Information Technology Systems, Library of Congress
Mr. Stevens described the general philosophy and the components of the CapNet digital library being developed to support the legislative process in Congress. The plan for implementation of the Congressional System is to begin with bills, which are easier to handle than the Congressional Record. Other products can then be extracted from this system. It will be deployed via the Senate/House Intranet first. Public access will eventually come, but it is not their primary thrust right now. He indicated that developing the pieces of the digital library is an ongoing process at LoC.
The basis for the design of CapNet is an open systems, object-oriented architecture. It is standards-based to support interoperability and eliminate a single vendor situation. Text/image/graphics/audio and video will be stored as data fragments. Some of the text will be of value in relational database structures, while others could be held as long texts.
In response to a question about the degree to which relational versus hierarchical or sequential data structures are used, Mr. Stevens emphasized that data analysis needs to be done in a different way. The Senate, House, and Library of Congress are coming to grips with this. The important thing is to analyze the different types of text and data. It isn't important to normalize but to identify the characteristics of, and relationships between, the various objects. This is a necessary cultural change for the programming and development staff.
Also critical to the success of the digital library initiatives is the integration of commercial system components. It is important to look at the application programming interfaces (APIs) available from the vendors. In the object-oriented world, CORBA (Common Object Request Broker Architecture) is a new standard that allows program A and program B to talk in a standard way through an intermediary called the Object Request Broker.
There are five components to the digital library: workflow, the network, object management information, the object content repository, and text indexing.
The workflow applications integration allows functions to be put together in a non-intrusive way. The plan is to capture information about the object as it is created. This was begun back when scanning was first started. The identification of where, which, the circumstances under which it was created, rights management, and how to associate it with the digital library are important.
There is a need to track and control the item (particularly version control). For the legislation, the Congressional Research Service needs to identify the committee, the sponsors, the status of the legislation, who introduced it, etc. All this is administrative information. There is also the need to connect the bills to their amendments. Authority control of digital signatures is key.
Routing to logical users is important. The need for push technology requires that there be information in the metadata that will support this push. The administrative structure then also requires authentication for the push to be received. In the CapNet system (an Intranet) security is an issue. It is important to be as secure as Congress wants it to be and, thus, the degree of security can vary.
The concept of a "data blade" was described. This involves expanding the data types by developing things that explain how to do functions with a particular object. ODBC and JDBC are important standards for switching software between database management systems.
There are four different kinds of metadata involved in the Congressional system. The MARC record supports the regular library cataloging and archiving of the physical item. EAD (Encoded Archival Description) documents the Finding Aid, and Government Information Locator Service (GILS) documents the resource. The fourth level is the status metadata described earlier. At all levels, there is not only descriptive metadata, but also structural and administrative metadata to support the identity of a document and the workflow sequences, access control, and required security.
Structural metadata tells you about the object you have stored. The identification of whole versus part and the "granularity" with which something is identified can affect copyright. The Washington Post, for example, is using a commercially available document management system with structural metadata elements to move newswire feeds and documents around.
A persistent identifier such as the Uniform Resource Identifier (URI) or URL is necessary. There is no particular group of people who will be able to control this. The need is for interoperability because we will have different standards.
The configuration of compound "virtual documents" is a necessary part of the digital library. It is important for future authors/scientists/researchers to be taught to write this new kind of document in which the paragraphs can stand on their own. It is then the responsibility of the system to dynamically recombine the document fragments into virtual documents as required by the user.
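As a minimal sketch of the recombination idea, the fragments can be pictured as independently stored paragraphs and a virtual document as nothing more than an ordered list of fragment identifiers. The identifiers, sample text, and the function name assemble are hypothetical and do not describe the CapNet design.

```python
# Hypothetical fragment store: each paragraph stands on its own.
fragments = {
    "bill-summary": "Summary of the bill.",
    "bill-status": "Current status of the bill.",
    "amendment-1": "Text of the first amendment.",
}


def assemble(fragment_ids, store):
    """Build a virtual document by joining the requested fragments in order."""
    return "\n\n".join(store[fid] for fid in fragment_ids)


# Two different virtual documents built from the same stored fragments.
status_view = assemble(["bill-summary", "bill-status"], fragments)
amendment_view = assemble(["bill-summary", "amendment-1"], fragments)
```

The same stored paragraph appears in both views without being copied, which is only possible if each fragment was written to stand on its own.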
The object content repository stores the various types of content in the storage technologies most suited. They have identified as many as 175 different file types and data types.
The legislative system uses conventional types of text indexing. Structured Query Language (SQL) and Boolean are used to search for something that is known usually by field. Pattern recognition is also used along with relevancy ranking. There is also an attempt to do something with the query itself. Products such as Verity, Fulcrum, OpenText, Inquery and SQL relational technologies are integrated.
Mr. Stevens believes that concept- or knowledge-based or natural language searches are the ways in which searching will be done in the future. "Disambiguation" is key. They are using TextWise's Dr. Link, Sovereign Hill/KnowledgeMine, IBM/MediaMiner, and Oracle's Context to develop the future environment for parsing, tagging, phrase recognition, and building a "thesaurus" to recognize phrases. This can be done in small domains.
The document management system is currently being procured. He wasn't able to discuss the systems specifically because of this procurement process. However, Mr. Stevens indicated that there are about 10 systems that can be used in a Web environment. He believes that large companies in industries such as aerospace, pharmaceuticals, publishing, and insurance will drive the most progress in commercial document management systems. Commercial firms will take the requirements from these groups. Some products are farther ahead than others, but they are very similar.
Mr. Stevens believes that it is more important to worry about the ability to copy from media than to worry about preservation. He doesn't know how you avoid some level of redundancy. It is important to look at the deficiencies in the current processes.
Discussion
When asked about archiving, Mr. Stevens indicated that he believes the agency creating the record should maintain the record forever. It isn't necessary to have a single indexing site or engine.
Unicode is the way to go for character sets, once we get applications to handle it. Web browsers are available that can use Unicode, but the problem is that creating the characters via keyboard is difficult. MARC/MARBI is getting closer to deciding on Unicode as opposed to the current MARC character set. Unicode is also being discussed for XML.
The LoC is in the process of procuring a new integrated library system. He had no particular advice on this other than to pick a good company that is moving in the same direction as one's future requirements.
NASA Image eXchange (NIX) System
Roland Ridgeway, Lynn Heimerle, Bill von Ofenheim, NASA STI Program
Roland Ridgeway introduced the NIX Project (http://www.nix.nasa.gov). It involves a distributed digital image library from several of the NASA centers. Ms. Heimerle and Mr. von Ofenheim described the project and demonstrated it. Several models for the system were examined before one was selected. NASA decided on a totally distributed environment because of cost and minimal change to existing systems. The central server holds only CGI and Perl scripts that indicate how a user's query, entered via a Web interface, is to be translated into the search system for each particular center-level system. Each center's image server holds the metadata and the image files.
The searching is done in parallel through a variety of database structures and search engines (MACSQL, FoxPro), though most of them are searched against WAIS databases. The system supports both WAIS and HTTP protocols, though WAIS is preferred because it is simple and efficient. To overcome the delay problems that could occur if a center's image server is unavailable or extremely slow, there is a child timer spawned with each search that includes an automatic time-out function. This allows the system to reply even though one of the centers does not answer. The search goes on without that center and the results indicate that one of the centers did not respond. The information retrieved includes the title, author, a brief description, and thumbnail of the image.
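The parallel search with a time-out can be sketched as follows. NIX itself uses Perl scripts and spawned child timers; this Python sketch uses a thread pool instead, and the center names, the stand-in search functions, and search_all are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(centers, query, timeout_s):
    """Query every center in parallel and report the ones that did not answer.

    centers maps a center name to a callable search(query) -> list of hits.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(centers)))
    futures = {pool.submit(fn, query): name for name, fn in centers.items()}
    done, _ = wait(futures, timeout=timeout_s)
    results, no_response = {}, []
    for fut, name in futures.items():
        if fut in done and fut.exception() is None:
            results[name] = fut.result()
        else:
            # Timed out or failed: carry on without this center.
            no_response.append(name)
    pool.shutdown(wait=False)  # do not block on slow centers
    return results, sorted(no_response)


def fast_center(query):
    """Hypothetical center that answers immediately."""
    return [f"{query} (CenterA)"]


def slow_center(query):
    """Hypothetical center that answers after the time-out has expired."""
    time.sleep(0.5)
    return [f"{query} (CenterB)"]


results, no_response = search_all(
    {"CenterA": fast_center, "CenterB": slow_center}, "shuttle", timeout_s=0.1
)
```

Here results holds only the hits from the center that answered in time, and no_response names the center that did not, so the reply to the user can say which center was skipped.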
The results from the multiple searches are then merged and sorted. The order of the items in the result set created an unexpected problem: relevance ranking often emphasized one center over another. To overcome this, they came up with a "Fairness Doctrine" that rotates through the centers alphabetically when presenting the results. The search engine's scoring mechanism has no impact on the final display. A multitude of image capabilities are available, including thumbnails and zooming. The text metadata record can also be viewed.
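One possible reading of the rotation is a round-robin merge: take one hit from each center in alphabetical order, then a second hit from each, and so on, ignoring relevance scores. The sketch below follows that reading; the function name fair_merge and the sample data are hypothetical, and the actual NIX rotation may differ in detail.

```python
from itertools import zip_longest

_MISSING = object()  # marks centers that have run out of hits


def fair_merge(results_by_center):
    """Interleave hits one per center, visiting centers in alphabetical order."""
    ordered = [results_by_center[name] for name in sorted(results_by_center)]
    merged = []
    for round_of_hits in zip_longest(*ordered, fillvalue=_MISSING):
        merged.extend(hit for hit in round_of_hits if hit is not _MISSING)
    return merged


merged = fair_merge({
    "CenterB": ["b1", "b2"],
    "CenterA": ["a1"],
    "CenterC": ["c1", "c2", "c3"],
})
```

With this scheme a center that returns many hits cannot crowd the others out of the first page of results.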
Many aspects of the searching can be customized by the user. The system can be tailored to deselect a center. The user can also specify the maximum number of hits, the number of images per page and the child time-out time. A text-only search is also an option in order to avoid the problem of downloading image files on slow systems. The display is narrow enough to fit within a Web TV environment.
The plan is to continue to add more content at the centers that are already contributing. In addition, they hope to add information from Kennedy, Marshall, the Jet Propulsion Laboratory (JPL) and the Hubble Space Telescope project. VRML (virtual reality) and audio will also be added.
This digital image library project has been very successful. Many positive comments have been received from users.