CENDI PRINCIPALS AND ALTERNATES MEETING

National Library of Medicine
Bethesda, Maryland
February 5, 2002

Minutes

Policy and Technology for the New Year!
OSTP Priorities and Scientific and Technical Information
Searching the Deep Web
Public Domain Information: Status of Activities and Initiatives
NLM Computer Security Architecture
Profiles in Science: Overview and Demonstration

WELCOME

Kent Smith opened the meeting at 9:15 am. He welcomed all in attendance to the National Library of Medicine.

POLICY AND TECHNOLOGY FOR THE NEW YEAR!

"OSTP Priorities and Scientific and Technical Information"
W. Russell Neuman, Senior Policy Analyst, Technology Division, White House Office of Science and Technology Policy

Dr. Neuman recently arrived at the Office of Science and Technology Policy (OSTP) on an IPA from the University of Michigan, where he is the Evans Professor of Media Technology. He expressed greetings from CENDI's friends and colleagues at OSTP. He described the new administration at OSTP, the modest restructuring that is taking place, what has been happening and the work they see in the future.

Dr. Neuman described the members of the Executive Office. Jack Marburger is the new director of OSTP. He was formerly Director of Brookhaven National Laboratory, President of Stony Brook University, and Dean and Professor of Physics at the University of Southern California. Richard Russell has been nominated and is awaiting confirmation as the Associate Director of Technology. He was involved previously with the Energy and Environment subcommittee. The other Associate Director will be Kathie Olsen, who was Chief Scientist at NASA and is a brain researcher. Shana Dale is the Chief of Staff and General Counsel. Approximately two-thirds of the OSTP staff positions have been filled.

OSTP is undergoing a modest restructuring in order to integrate functions for better efficiency. Previously there were separate associate directors for International and Security Affairs and for the Environment. The elimination of these positions does not mean that these areas are unimportant; it reflects Dr. Marburger's sense that an integrated organization is preferable to a smokestack one.

OSTP IT has two advisory committees: the President's Information Technology Advisory Committee (PITAC) and the President's Council of Advisors on Science and Technology (PCAST). PCAST is an industry advisory committee. PCAST met recently and created four working groups: Counterterrorism, S&T Budget, Energy Efficiency and Economy, and 21st Century Infrastructure. The latter, chaired by Marye Anne Fox, will focus on broadband to the home and ubiquitous computing. Dr. Neuman will staff that group. While previous PCAST reports often involved lengthy white papers, the new PCAST plans to spend the majority of its time deliberating and making concise recommendations in bullet form with well-developed rationales.

PITAC was renewed last October. They are in the process of selecting a new membership; it could be four months before the first meeting occurs due to the Federal Advisory Committee Act (FACA).

Current issues of concern to OSTP focus on homeland security. Dr. Neuman discussed the article by William Broad in the New York Times related to the dissemination of germ warfare information. He also noted that Dr. Dangerfield pointed out the risks in some declassified reports long before 9/11. The recent events have simply increased urgency and awareness. He indicated that a memo would be forthcoming from the Executive Office of the President (EOP) that is intended to be an interim policy and will suggest a common sense approach to routine public release post 9/11. It will specifically mention weapons of mass destruction and other exemptions to routine public release. A more "nuanced" memo will follow later and will provide more specific guidance on the treatment of other sensitive issues. OSTP understands that there must be a reasonable balance between the need for the dissemination of legitimate scientific information and national security concerns.

There have already been discussions between OSTP and agencies such as DoD and NTIS regarding document classification/declassification. Four main categories have been identified that need to be addressed: those items that were created as classified and are still classified, those that were created as classified and have been downgraded, those that are classified and are scheduled for declassification, and those that originated as unclassified. The specifics of the classification system need to be reviewed. DTIC, for example, has redesigned its registration process so that it uses LDAP.

It was also noted that content and control of dissemination differ by agency. There are agencies, such as the Department of the Interior, that do not have the ability to classify materials. However, these agencies do have some material that is sensitive post 9/11.

The movement of information across borders needs to be addressed. Richard Clarke, who is in charge of Cyberterrorism, has suggested free-standing Virtual Private Networks to support government and other organizations that need to know. This has been addressed in the Gov-Net RFP. Responses (approximately 140) to the RFP are being reviewed by Mark Forman's office in OMB.

The agency stovepipe structure has clearly been challenged by 9/11, which highlights the need for more cross-agency work.

The National Science & Technology Council (NSTC) structure will continue under Gary Ellis, with weekly e-mails and meetings.

Broadband is a significant area of effort for Dr. Neuman. There are currently 12 pieces of legislation on his desk that include broadband in some way, including one version of the economic stimulus package. Economic incentives for broadband were included in this. The Judiciary Committee is still drafting database protection legislation. The issue of broadband deployment may also be included here.

The Homeland Security Presidential Directives (HSPD) 1 and 2 explicitly direct OSTP and others to address data sharing and new techniques for data mining. Data sharing from disparate databases is already supporting border security while facilitating legitimate commerce and tourism. An interagency group has identified barriers to data sharing; these barriers include technical, turf, and data integration issues. Dr. Neuman indicated that a meeting on new techniques for data mining would be held under the auspices of the White House. However, most of the data mining technology on the market appears to be related to commercial marketing applications. He is interested in data mining technologies that might be more applicable to the needs of the government.

On the border issue, they are also looking at integrating biometric information with other database information.

In the area of Information Technology, Dr. Neuman mentioned the Information and Technology Budget CrossCut. There has been a steady state in IT R&D. Chapter 22 of the President's 2002 budget has five major themes: simplify and integrate across the government and reduce duplication, improve management, increase the security of government information systems, eliminate redundant and unneeded IT, and establish successful e-business practices, including best practices from the private sector. Chapter 22 analytics specifically discuss Cybersecurity and E-government. A major issue is the lack of performance measures and goals listed for FY02. There is a need to develop and use appropriate metrics. This is a challenge for STI R&D, but it is not impossible and is worth the attempt.

Discussion

Members of CENDI expressed an interest in the memo and Dr. Neuman suggested that further discussions between interested CENDI members and OSTP would help to inform the "nuanced" memo. The CENDI Secretariat will work with Dr. Neuman's office to set up a meeting after the initial OMB memo has been released. The CENDI IT Security and Privacy and the STI Policy Working Groups may also be involved.

There was also some discussion on how the whole structure of homeland security operates, especially in conjunction with science and OSTP. Dr. Neuman mentioned there is a complex structure, including five executives, 11 policy coordinating committees, etc. The Deputy Assistant Secretary level is supported by an interagency working group (IAWG). Dr. Jim Griffin at OSTP and Dr. Norman Bradburn of the National Science Foundation support the IAWG.

"Searching the Deep Web"
Abe Lederman, Innovative Web Applications

The "deep web" (sometimes called the "invisible web") is information that is made available via the web, but that cannot be retrieved by web crawlers because it is in databases, behind firewalls, or available only for a fee or with some access restrictions. Until recently this also included document formats such as PDF, but Google has recently announced searchable access to PDF documents. In March 2000, BrightPlanet estimated the deep web at 200,000 sites, many of high quality. By the same estimate, it is 400-550 times larger than the surface web and is growing faster than the surface web.

The deep web search technology developed by Innovative Web Applications (IWAPPS) is called Distributed Explorit. IWAPPS was founded in 1996 and the first application was the Environmental Science Network for the Department of Energy's Environmental Management Science Program. It has also been implemented on the Energy Portal at DTIC and at a number of other web sites. Explorit is being used with science.gov to link 18 bibliographic and full-text databases from the agencies and the index of selected Web resources.

Explorit is driven by configuration files. It can access databases or other deep web content that is web accessible and that produces a hit list with an access ID such as a URL. This technology is independent of the target search engine; the target database does not need to make any changes or conform to any standards. However, because of the heterogeneity of the target search engines, database structures, and interfaces, some functions of the target system may not be implemented in Explorit. The configuration files can take advantage of as much of the native search engine technology as the customer chooses, but on occasion this requires specific software development.

Distributed Explorit also standardizes the results that are presented back from the target databases. The results are parsed to extract key information such as author and title. The results are currently presented in the order in which the target systems respond. If more databases are selected, the speed of searching is generally slower. Error messages are presented if the target database is unavailable or if a "timeout" occurs.
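
The behavior described above can be illustrated with a minimal sketch. It is not the Explorit implementation; the four sources, their field names, the example URLs, and the half-second timeout are all hypothetical stand-ins. The sketch sends one query to every configured source at once, standardizes each hit into a common record, collects results in the order in which the sources respond, and reports an error for a source that is unavailable or that times out.

```python
import concurrent.futures
import time


# Hypothetical stand-ins for remote databases; each uses its own field names.
def search_a(query):
    return [{"ti": f"{query} in groundwater", "au": "Author A", "url": "http://a.example/1"}]


def search_b(query):
    time.sleep(0.1)  # responds a little later than source A
    return [{"Title": f"Review of {query}", "Creator": "Author B", "link": "http://b.example/9"}]


def search_c(query):
    raise ConnectionError("database offline")


def search_d(query):
    time.sleep(1.0)  # slower than the timeout used below
    return []


# Configuration drives everything: how to reach each source and how its
# native field names map onto one common record layout.
SOURCES = {
    "A": {"search": search_a, "fields": {"title": "ti", "author": "au", "id": "url"}},
    "B": {"search": search_b, "fields": {"title": "Title", "author": "Creator", "id": "link"}},
    "C": {"search": search_c, "fields": {}},
    "D": {"search": search_d, "fields": {}},
}


def normalize(source, hit, fields):
    """Extract key information (title, author, access ID) from a native hit."""
    record = {"source": source}
    for common_name, native_name in fields.items():
        record[common_name] = hit.get(native_name, "")
    return record


def federated_search(query, sources=SOURCES, timeout=0.5):
    """Query every source at once; return (results, errors)."""
    results, errors = [], {}
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(cfg["search"], query): name for name, cfg in sources.items()}
    try:
        # as_completed yields each future as it finishes, so results are
        # collected in the order in which the sources respond.
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                hits = future.result()
            except Exception as exc:
                errors[name] = f"unavailable: {exc}"
                continue
            results.extend(normalize(name, hit, sources[name]["fields"]) for hit in hits)
    except concurrent.futures.TimeoutError:
        # Anything still running when the deadline passes is reported as a timeout.
        for future, name in futures.items():
            if not future.done():
                errors[name] = "timeout"
    finally:
        pool.shutdown(wait=False)  # do not block on sources that are still running
    return results, errors


if __name__ == "__main__":
    hits, problems = federated_search("tritium")
    for hit in hits:
        print(hit["source"], hit["title"], hit["id"])
    for name, message in problems.items():
        print(name, message)
```

Because the overall wait is bounded by the timeout rather than by the slowest source, adding sources mainly increases network traffic and the chance that at least one source reports an error.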

"Deep web" searching does not require space for indexes or caching on the integrator's server because all of the searching is done in real time. The contents being searched are always as current as the target database. Minimal CPU resources are needed. However, the demand on network resources is greater than for a server-based system because search requests have to be sent out to multiple databases, and the results have to come back from these databases, be integrated, and be sent back to users over the network.

Explorit also provides some enhancements that generally are not found in Web search engines or other applications that search the deep web. These include navigation capabilities, the ability to mark and download results, field searching, including date-range searching, access to log-in restricted Web sites, and access to sites that use cookies or other session identifiers.

Explorit has been enhanced to develop personal libraries. Alerts, based on user profiles, can be run on a schedule, and the results sent to the user's e-mail.

Several enhancements are currently in development; these are based on the needs of users and sponsors. Of particular importance is the development of a tool to actively monitor the availability of databases, so that Explorit can alert the integrator to changes in the availability of or the access to a database. This involves a series of "canned" searches. If the system gets no response or an unexpected response, the integrator can be alerted, allowing the integrator to make changes before the user encounters access problems.
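
A monitoring tool of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the tool under development: the three sources, the canned query, and the `looks_right` checks are hypothetical. Each check runs one canned search and classifies the outcome as ok, no response, or unexpected response; anything other than ok is passed to an alert callback for the integrator.

```python
def check_source(name, run_search, canned_query, looks_right):
    """Run one canned search against a source and classify the outcome."""
    try:
        hits = run_search(canned_query)
    except Exception as exc:
        return {"source": name, "status": "no response", "detail": str(exc)}
    if not looks_right(hits):
        return {"source": name, "status": "unexpected response", "detail": repr(hits)[:80]}
    return {"source": name, "status": "ok", "detail": ""}


def monitor(checks, alert):
    """Run every canned search; call alert() for each source with a problem."""
    problems = []
    for check in checks:
        outcome = check_source(**check)
        if outcome["status"] != "ok":
            alert(outcome)
            problems.append(outcome)
    return problems


# Hypothetical sources: one healthy, one whose interface changed so that the
# canned search now returns nothing, and one that is unreachable.
def healthy(query):
    return ["doc-1", "doc-2"]


def changed(query):
    return []


def down(query):
    raise ConnectionError("connection refused")


def has_hits(hits):
    # The canned query is chosen so that a working source always returns hits.
    return len(hits) > 0


CHECKS = [
    {"name": "healthy-db", "run_search": healthy, "canned_query": "water", "looks_right": has_hits},
    {"name": "changed-db", "run_search": changed, "canned_query": "water", "looks_right": has_hits},
    {"name": "down-db", "run_search": down, "canned_query": "water", "looks_right": has_hits},
]

if __name__ == "__main__":
    monitor(CHECKS, alert=lambda outcome: print("ALERT:", outcome))
```

In practice the alert callback would notify the integrator (for example by e-mail), and the checks would run on a schedule so that problems surface before users encounter them.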

Future enhancements may include the clustering of results, indexing/analysis of results, helping users identify the most relevant sources for their searches, and connecting the product with collaborative work tools.

Discussion

It was suggested that CENDI coordinate information about the availability of and changes to the databases and target search engines that are included in science.gov in order to ensure continuity of service.

"Public Domain Information: Status of Activities and Initiatives"
Paul Uhlir, U.S. CODATA/National Research Council

Mr. Uhlir described the activities related to public domain information since his last presentation to CENDI in June 2001. A symposium is planned for September 5-6, 2002, at the National Academies of Science to discuss the U.S. environment. The program is currently being organized; suggestions for issues and speakers are being gathered.

After this symposium, he would like to address the international aspects. Some related sessions are planned for CODATA 2002 in Montreal. Depending on funding, another symposium might be planned for 2003 in Europe with international participation and sponsorship.

Discussion

Dr. Siegel mentioned that INSERM has expressed concern about the status of public domain information in France. He suggested that the follow-on session in Europe could be held in conjunction with INSERM and tied to the ICSTI Winter Meeting in January 2003.

Mr. Uhlir also raised the issue of the OMB A-130 review. The formal review will occur later this year. Preliminary discussions will be held at the U.S. National CODATA meeting on May 9, 2002. This will be a two-hour session with Brooke Dickson and Dan Chenok to discuss the proposed review and to express STI perspectives. Mr. Uhlir and Ms. Bortnick Griffith/Ms. Carroll will discuss the details of this meeting. The discussions will be captured on the CODATA Web site. The Policy Working Group will alert the members to the details of this meeting.

ACTION ITEM: Mr. Uhlir and Ms. Griffith will discuss the details of the CODATA meeting with OMB. The Policy Working Group will alert the members to the details of the meeting with OMB to discuss the review of A-130.

"NLM Computer Security Architecture"
Dr. Simon Liu, Director, Office of Computer and Communications

The model for computer security architecture at NLM is that of law enforcement: looking at motives, means, and opportunity. Cyberattacks are inevitable and will continue to grow. Any connection to the Internet is a potential avenue for attack. According to recent Carnegie Mellon statistics, the number of incidents and the reported vulnerabilities doubled between 2000 and 2001. For NLM, the biggest threats are from information gathering and viruses.

NLM's architecture must address a variety of threats including viruses/worms; information gathering (some of which is legitimate); "script kiddie" attempts in which hackers use readily available tools; active, systematic scanning often for free software or to sneak into the network; and denial of service attempts. The challenge is to maintain performance and reliability in the face of these threats.

NLM's current security architecture includes three zones that address both Internet I and II connections. Zone 1 is the external IDS (Intrusion Detection System). Zone 2 is the internal scanning for vulnerabilities and viruses. Zone 3 includes the internal IDS, which monitors who is trying to access both public and private servers, along with incident monitoring and reporting.

In 2002, they will be moving to a five-zone architecture. This will include looking at gigabit IDS. The new architecture separates the internal IDS previously in Zone 3 from the incident monitoring. It further separates the internal IDS for public servers from the private servers. Web/URL scanning is added to Zone 2. Zone 5, incident monitoring and response, has increased auditing and reporting. The approach is multi-layer, addressing prevention, detection and response.

Dr. Liu used the analogy of a castle to describe the multidimensional nature of the new architecture. The secure servers, authentication, and virus scanning, for example, are analogous to the front gate of a castle. The fortified walls and moat are the firewalls and the zoning/DMZs. The guards are the internal IDS and system scanning. The incident monitoring and response are analogous to a castle's alarm system. The watchtower includes security policies, management, and awareness training.

NLM is currently in the process of implementing this new architecture. They expect the new architecture to be in place by August and then will be dealing with maintenance of the system.

Redundancy is part of the security as well. They have an internal mirror server but do not yet have a mirror server off-site. The problem with mirror sites is keeping them synchronized. The laptop network in the reading room has a firewall around it.

In the process of developing the new architecture, NLM has learned several important lessons. Security is an investment, and not an expense. IT security is a journey, because the attacks never end and are increasingly sophisticated. Good security is better than perfect security that can never be implemented. Security and usability unfortunately are often at odds. Finally, it is better to concentrate on the known and most probable threats.

"Profiles in Science: Overview and Demonstration"
Alexa McCray, Director, Lister Hill Center for Biomedical Communications

Profiles in Science (http://profiles.nlm.nih.gov/) captures the papers, photos, letters and other memorabilia of Nobel Laureates in order to preserve the history of science. The goal is to make science real and interesting to students and other members of the public. Profiles looks behind the scenes at how science is done and by whom.

There are seven collections. The collection of geneticist and microbiologist Joshua Lederberg is unusual in that Dr. Lederberg is still alive. He has been very supportive of this effort and intends to annotate the electronic documents rather than write his memoirs. The collection of geneticist Barbara McClintock is owned by the American Philosophical Society. The materials were brought to NLM, scanned, and then returned. The APS references the NLM collection from its web site. In some cases, there are links between the collections.

This project requires the Library to capture and manage a variety of object types and to determine how to link them in meaningful ways. The model for Profiles is the "electronic exhibit". For example, the Joshua Lederberg collection alone would require approximately 300 linear feet of physical space.

Much of the effort has involved converting paper to digital form. The work began in the early 1990s when the technology for conversion was not as mature. In addition, some of the materials, such as photocopies or mimeographs, are extremely difficult to capture because of the poor quality of the original.

For paper documents, TIFF images were originally created. The Center then moved to GIF because of its smaller file size compared to TIFF. PDF is now used, but the TIFF images have been retained. If a better format comes along, the Center will return to the TIFF images to create the new format, since this would result in the least amount of "loss".

In some cases, the digitized versions of material are better quality than the originals. This is the case with digitizing to HDV (high definition video). However, for archiving purposes, the original material is always kept.

Metadata plays a critical role in delivery, management, resource description, and discovery. The project developed its own metadata. The Center has clear semantics for its data elements that allow other metadata schemes, such as Dublin Core, to be used as needed for importing or exporting information. The Art and Architecture Thesaurus was used for the document types. In addition to metadata, templates and index terms have compensated for the poor OCR quality.
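
The import/export idea can be sketched as a simple crosswalk between local data elements and Dublin Core. This is only an illustration: the local element names and the sample record below are hypothetical and are not the Profiles schema. Only the Dublin Core element names (title, creator, date, type, format, identifier) come from the published element set.

```python
# Hypothetical local element names mapped to Dublin Core elements.
LOCAL_TO_DC = {
    "document_title": "title",
    "author_name": "creator",
    "date_created": "date",
    "document_type": "type",
    "file_format": "format",
    "unique_id": "identifier",
}
DC_TO_LOCAL = {dc: local for local, dc in LOCAL_TO_DC.items()}


def to_dublin_core(record):
    """Export a local record; elements with no Dublin Core mapping are dropped."""
    exported = {}
    for local_name, value in record.items():
        dc_element = LOCAL_TO_DC.get(local_name)
        if dc_element and value:
            # Dublin Core elements are repeatable, so values are kept in lists.
            exported.setdefault(f"dc:{dc_element}", []).append(value)
    return exported


def from_dublin_core(dc_record):
    """Import a Dublin Core record, keeping the first value of each element."""
    imported = {}
    for key, values in dc_record.items():
        local_name = DC_TO_LOCAL.get(key.removeprefix("dc:"))
        if local_name and values:
            imported[local_name] = values[0]
    return imported


if __name__ == "__main__":
    sample = {
        "document_title": "Example letter",
        "author_name": "Example Author",
        "date_created": "1950-01-01",
        "document_type": "example document type",
        "ocr_quality": "poor",  # local-only element with no Dublin Core equivalent
    }
    print(to_dublin_core(sample))
```

Export is lossy by design in this sketch: a local-only element is left out of the Dublin Core record, which is why clear semantics for each local element matter when deciding what can be exchanged.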

A single system is used to manage the entire life cycle of Profiles material. This system is documented in an article for the Communications of the ACM. Tools are used for metadata entry, but the remainder requires human effort.
