CENDI PRINCIPALS AND ALTERNATES MEETING

Environmental Protection Agency
Washington, D.C.
December 2, 2002

Minutes

The Future Cyberinfrastructure: Government and Industry Perspectives
Revolutionizing Science and Engineering Through Cyberinfrastructure
Industry Perspectives on the Future of the Cyberinfrastructure
 
Agency Update: EPA's Scientific and Technical Information Systems

WELCOME

Kent Smith, Chair of CENDI, opened the meeting at 9:15 am. Odelia Funke and Tom Tracy welcomed the CENDI members to EPA.

The Future Cyberinfrastructure: Government and Industry Perspectives

"Revolutionizing Science and Engineering Through Cyberinfrastructure"
Dr. Peter Freeman, Director
Computing, Information Science & Engineering (CISE) Directorate, National Science Foundation

The focus of CISE's analysis of the cyberinfrastructure is on the needs for the research activity and revolutionizing what the scientist does. The National Science Foundation (NSF) formed a panel chaired by Daniel Atkins of the University of Michigan to consider the cyberinfrastructure for scientific research and engineering. The report is in final draft.

Dr. Freeman described the cyberinfrastructure using the metaphor of a table or a platform on which applications can be built. These applications are what sit on the table. The applications are tied together by high bandwidth networks. The infrastructure is made up of several components, including mass storage, networks, digital libraries, databases and content, sensors/effectors from telescopes to standalone field sensors, software, and services. Services will ultimately be as important as the hardware. Services include training and deeper, broader S&T education for scientists.

Dr. Freeman put the cyberinfrastructure development in the context of NSF's long history of involvement in computing beginning with the funding in the 1960s that spurred campus computing. These original awards for computing in the 1960s were very spotty. This led to supercomputing initiatives and the research to support more advanced computing environments, such as the PACI in 2000, and the Middleware and Terascale initiatives that followed. The more recent efforts have been coordinated through networking. Now, the three supercomputer centers -- San Diego, Pittsburgh and Illinois -- are working on the TeraGrid, a grid for high-end computation that would make the resources of all the centers available to scientists accessing the grid.

There are several trends that are likely to converge over the next few years to dramatically change the infrastructure for science between 2000 and 2010. The first is the power and capacity of technology. This has been termed the "information tsunami". Estimates of the output of various devices and projects, such as the new particle physics board being built at CERN, are now in the "Yottabyte" range. (The board is expected to produce a petabyte of data per second.) Not only do we need mass storage devices capable of handling this amount of data, but we want to be able to make this data available to research physicists anywhere in the world. The issue of moving petabytes around the world puts new requirements on networks. While processing speeds may be an issue, communications capabilities are also improving. Research into end-to-end optical switching is expected to solve most of these problems.
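
The scale of the networking problem can be illustrated with a back-of-the-envelope calculation. The sketch below estimates how long it would take to move one petabyte over a single link; the link speeds are assumed values chosen for illustration and were not part of the presentation, and protocol overhead is ignored.

```python
# Rough transfer-time estimate for moving a petabyte across a network link.
# Link speeds are illustrative assumptions, not figures from the talk.

SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes, decimal (SI) definition


def transfer_days(num_bytes: float, link_bits_per_second: float) -> float:
    """Return the ideal transfer time in days, ignoring protocol overhead."""
    seconds = num_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, bps in [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10), ("1 Tbit/s", 1e12)]:
        print(f"{label}: {transfer_days(PETABYTE, bps):.2f} days")
```

Even at an assumed 10 Gbit/s, a single petabyte takes more than nine days to move, which is why the data volumes described above put new requirements on networks.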

New modes of computing have been created by the reduced cost and the increased speed and capacity of networks. The environment is more data intensive, high speed/real time, and collaborative. New modes of interacting with resources have emerged. For example, in certain fields such as astronomy, scientists work from home rather than in the field, using digital images and data collected by remote instruments. With digital resources you can put images together from the Northern hemisphere and the Southern hemisphere at the same time. Remote particle accelerators are available in a virtual environment. The proposed National Ecological Observatory Network (NEON) would use devices that operate on batteries and communicate via wireless. These NEON devices could be put throughout a region or a country to model and trace pollutants and other aspects of ecosystems.

The second major trend is the transformative power of computational resources for science and engineering. This includes space weather modeling, storm tracking, and the Encyclopedia of Life (comparing genomes of multiple organisms). This has resulted in collaboratories and the development of national and international efforts to build research infrastructure "grids" for sharing computing resources.

The third trend is an increased understanding of computation among policy makers within agencies, the Administration, and Congress. The key difference between the HPCC (High Performance Computing and Communications) initiatives of the 1980s and the current initiatives is the level of networking. In the HPCC initiatives, you had to fight to get communications included in the architecture; now, networking is assumed.

The objective of the cyberinfrastructure is to provide an integrated environment for all scientists to work on advanced problems, not just those in certain countries or from selected research universities. The key challenge is how to build the various components. How do we shape the technical architecture, and how do we go about using it? In the past several years, answers to these questions have begun to emerge, based on leading researchers who have experimented with advanced resources in multiple fields. We now have a way to move what we've learned in one field of study to another. Different areas of NSF are conducting workshops on the cyberinfrastructure for their own areas. This is starting to break down the historic NSF silos. The ubiquitous nature of computing and the breadth of technologies permit us to do something significant.

Dr. Freeman closed with some observations on what the cyberinfrastructure might mean for the government. The development of such a cyberinfrastructure should speed application development, enabling new business practices to be adopted. However, innovation must come first and best practices must be spread. Requirements analysis is still part of the problem and a source of delay in application development. The infrastructure can support requirements analysis by being more bulletproof and by freeing time from the routine of programming.

Discussion

The issue of older content was raised. Dr. Freeman noted that Tree of Life, Virtual Observatory, and several other NSF-funded projects include older material as well as current resources. However, this is a major challenge. Connections are made less often between the published literature and these projects than between the projects and data collections.

When asked about the status of the National Coordination Office (NCO), he said that it still exists. He chairs the interagency working group, but, to date, it has not focused on cyberinfrastructure. Dr. Freeman said he intends to get the NCO engaged in these discussions. There will be a new director of the NCO. OSTP has asked for updates since the development of the Japanese Earth Simulator System, and the Atkins report has been briefed to Dr. Marburger.

The difficulties in trying to "talk across the agencies" were discussed. Dr. Freeman indicated that Dr. Rita Colwell, head of NSF, is very interested in promoting interoperability across the agencies. The National Science Digital Library (NSDL) is under CISE and it will ultimately take content from a variety of sources, both government and non-government, but standards are needed for government content just as in industry. Dr. Freeman is coordinating NSF's homeland security effort, and he noted that there are problems with modeling the spread of infectious diseases, because every agency involved has a different definition of a "disease related event".

"Industry Perspectives on the Future of the Cyberinfrastructure"
Dr. Michael Nelson, Director
Internet Technology and Strategy, IBM

Bumper Sticker #1: Always have a good bumper sticker.

Dr. Nelson noted that what is needed to make a point in Washington is a good bumper sticker, a buzzword, two good factoids, a good diagram, and two personal anecdotes.

Bumper Sticker #2: "Nowadays, it's too hard to predict the future. So, I settle for predicting the present." - noted futurist John Perry Barlow

IBM is building the Next Generation Internet (NGi). Dr. Nelson has been at IBM for over four years, where his job is to talk about the future, to strategize, and to work on relevant standards.

While he acknowledged that it is difficult and often precarious to predict the future, when trying to speculate, it is necessary to look at technologies, applications, and impacts. He presented trends in computing and computer usage. In four years, we will have 10 times as much computing power as today, per dollar spent, and, in 12 years, that will increase to 1000 times more through both hardware and software. In three years, a system will be built for DoD that has the same raw processing power as the human brain. In a few months, a four-gigabyte microdrive will be available. Communications will also improve by 100 times per dollar. In 12-15 years, we may run into a "dead end" with Moore's law because of the size of devices; this speaks to the need to develop different architectures.
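
The two computing-power projections above are mutually consistent, as a short calculation shows. The sketch below assumes steady compound growth, which is a simplification not stated in the talk; the function names are illustrative.

```python
import math

# Check that "10 times in 4 years" and "1000 times in 12 years" describe
# the same compound growth rate. Steady exponential growth is an assumption.


def annual_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier implied by growing `total_factor` times over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor: float, years: float) -> float:
    """Time for capacity to double at the implied compound rate."""
    return years * math.log(2) / math.log(total_factor)


if __name__ == "__main__":
    rate = annual_factor(10, 4)
    print(f"Implied yearly multiplier: {rate:.3f}")
    print(f"Growth over 12 years: about {round(rate ** 12)} times")
    print(f"Implied doubling time: {doubling_time_years(10, 4) * 12:.1f} months")
```

Under that assumption, a tenfold gain every four years compounds to a thousandfold gain in twelve, and corresponds to a doubling roughly every 14 months.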

Bumper Sticker #3: The Internet revolution is less than five percent complete.

Dr. Nelson estimates that the Internet Revolution is less than five percent complete based on the number of users, the number of devices, the speed/bandwidth and the amount of content and number of applications. Between now and 2010, the number of Internet users will increase by one million, with more people spending more time using more data-rich applications. The vast majority of the usage of the Internet will be for business rather than for science.

There was an estimated one petabyte of data on the Internet at the end of 2001. By 2006, this will increase to an exabyte, and by 2010 the number will be a zettabyte. The increase is due to the addition of material and to the replication of data across the net. Also adding to this increase is the digitization of information that would be on paper today.
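
The growth implied by these three estimates can be made explicit. The sketch below treats the figures as decimal (SI) byte counts and interpolates geometrically between them; both choices are assumptions made for illustration, and the names are hypothetical.

```python
# Milestone estimates from the minutes, in bytes (decimal SI units assumed).
MILESTONES = {2001: 1e15, 2006: 1e18, 2010: 1e21}


def implied_annual_growth(start_year: int, end_year: int) -> float:
    """Yearly multiplier needed to get from one milestone to the next."""
    years = end_year - start_year
    return (MILESTONES[end_year] / MILESTONES[start_year]) ** (1.0 / years)


def interpolate(year: int) -> float:
    """Estimate bytes online in `year` by geometric interpolation."""
    points = sorted(MILESTONES)
    for lo, hi in zip(points, points[1:]):
        if lo <= year <= hi:
            growth = implied_annual_growth(lo, hi)
            return MILESTONES[lo] * growth ** (year - lo)
    raise ValueError(f"{year} is outside the range of the estimates")


if __name__ == "__main__":
    print(f"2001-2006: x{implied_annual_growth(2001, 2006):.2f} per year")
    print(f"2006-2010: x{implied_annual_growth(2006, 2010):.2f} per year")
    print(f"Interpolated 2004 volume: {interpolate(2004):.2e} bytes")
```

Read this way, the estimates imply roughly a fourfold increase each year through 2006 and an even faster rate afterward.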

The infrastructure of the future will have several characteristics: it will be fast, everywhere, always on, natural, easy, intelligent, and trusted. Speed and pervasiveness have been the focus so far. IBM is working in the other areas. There is a good white paper at www.ibm.com/NGi. He indicated that CENDI is in the middle of the effort to make the content intelligent. In terms of trust, we have the technology but there are problems with the lack of standards and sociological issues.

Computing will soon be pervasive. There will be more than 2 million cars on the Internet. Smartcards will be used for many purposes, and homes will be networked. There are many exciting new applications being developed. IBM's e-meeting application, "SameTime" meeting technology via the Web, is now being used. This is particularly important for IBM where 30 percent of the workforce does not work in an office. There are solutions to the security issues.

This pervasiveness and incorporation into business activities means that computing systems must be easier to manage because there will not be sufficient human resources to "tend" them the way they require today. IBM is working on autonomic computing, meaning that the system will be self-configuring, self-optimizing, self-healing, and self-protecting. IBM is looking to improve the ease of management by 98-99 percent. This means that hacking attacks must be caught, upgrades must be done, and other maintenance activities undertaken by the system as needed with little human intervention. There is no silver bullet; rather, it is small enhancements at all levels, including middleware, that will make this possible.

Open standards are key. Open source software promotes open standards and allows everyone to see the code and to modify it. Leading in this area are products such as Apache and Linux. Many engineers are building Linux simultaneously, and IBM has donated a large amount of code. There are more than 4600 sites using Linux; Linux on mainframes has been a real driver. Many of these are government sites in both the US and Europe, since the governments are pushing for open source solutions. Security concerns of open source have been addressed through the National Security Agency-funded development of Secure Linux.

The development of the infrastructure has occurred in three phases. Phase 1 was one-to-one connectivity. It began with a remote user connecting to a computer. E-mail extended this to a one-to-one computer connection. Phase 2 was one-to-many via the Web. Phase 3 is many-to-many (peer to peer), as exemplified by Napster.

The grid is an extension of this many-to-many architecture. The user connects to the grid, which is actually a series of small systems. The grid is a virtual supercomputer with distributed storage, applications and services, providing a more efficient use of IT resources because the network can take care of surges in need. The grid provides an industrial strength system with back up. Security is easier since the grid is more centrally managed than the current Internet. Data can be pulled as you need it rather than rehosted. For example, the University of Pennsylvania is running breast cancer research on the grid. The digital images are available across locations and various applications can be run across all the data, avoiding some of the patient privacy issues that would arise were the data to be rehosted.

E-business will benefit from the grid through different ways of working and communicating. Data and communications can be shared. An example of this is the online gaming environment like Butterfly.net, which supports multiplayer simulations. This network will soon support up to 1 million online gamers at a time.

NSF's TeraGrid project will provide a speed of 13.6 teraflops, resulting in the second fastest supercomputer, linking four sites. This project will be completed next year. It provides cheaper use of the supercomputers, and will serve as a model for understanding what the grid is all about.

The first phase of grid development will be intragrids within companies and inside firewalls. Phase 2 is research networks where you have a common purpose across organizations. The third phase is third-party grids. IBM describes this as "on demand computing" in which users will have the computing power they need when they need it, making the computing environment a utility that is global, integrated, and virtualized (www.ibm.com/ondemand.ex). The grid as a utility is likely to occur within the next five years, once accounting and security issues are addressed. A major challenge is to get CEOs to accept utilities running their IT systems, but more companies are outsourcing their IT.

Bumper Sticker #4: The Internet is entering adolescence.

Dr. Nelson referenced a National Research Council report entitled "The Internet's Coming of Age" (www.cstb.org) which describes the Internet as entering adolescence. This stage is characterized by rapid growth and change, a time of critical choices, and an unruly nature. There are many problems with the current Internet, such as incomplete wireless coverage, privacy issues, spam and cyberfraud, the information overload, and the digital divide.

The question is, can we survive the current Internet? Dr. Nelson believes that we can if standards can be developed and if the private sector and the market are allowed to develop. IBM also is involved in a Computer Systems Policy Project (www.cspp.org). The CEOs of the top 10 U.S. computer companies discussed the impact of the Next Generation Internet and addressed issues such as what the U.S. government can do to promote the digital economy.

Technology areas that need further development include: wireless, filtering technologies, digital rights management, instant messaging, web services, and authentication and directories. Cyberpolicy drivers such as privacy, payments and taxation, protectionism (Europe's database protection legislation), pornography and spam, and psychology also need to be addressed. In some cases, the technologies are the answers to the cyberpolicy drivers.

It is important to remember that the Net is being shaped outside the regular standards bodies in groups such as OASIS, the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF). The challenge in all these innovations is to get competitors to work together on standards.

Dr. Nelson concluded with three additional bumper stickers: The future will be here before we know it. We can't use old models in the new medium. The NGi will be as disruptive as the printing press and it will be faster and more global in its impact. It is estimated that the NGi will reduce the cost of information dissemination by 99 percent, just as the printing press and the Internet itself dramatically reduced that cost.

IBM has estimated that the cyberinfrastructure may need a 1 billion dollar infusion of money. This will be an issue, but he believes that the Atkins Report may serve as a focus for the importance of such a cyberinfrastructure and the need for government to spur the development similar to the High Performance Computing Report in the 1980s.

Discussion

Dr. Nelson was asked about the organization of the web. He said that there is still a debate within IBM as to whether the Semantic Web concept is the answer because of the requirement to retrofit the content that is already on the web. They see this as unlikely to occur.

IBM is also working with Internet2, which is testing authentication between colleges and other distributed trust issues.

Dr. Nelson sees the line between computer science and information science as increasingly blurry, especially at the middleware level where so much development is occurring.

"Agency Update: EPA's Scientific and Technical Information Systems"
Odelia Funke, EPA/Office of Environmental Information
Tom Tracy, EPA/Office of Research and Development


Odelia Funke (EPA/Office of Environmental Information) and Tom Tracy (EPA/Office of Research and Development) introduced a series of presentations to update the CENDI members on research and products underway at EPA.

Environmental Information Management System

John Sykes of the Office of Research and Development gave a status update on the Environmental Information Management System. This system contains metadata on projects and products. It will become a "one stop shop" for ORD. EIMS has grown from 3000 records to over 31,000 records. The Science Inventory, which is all research in the agency, should soon be available via EIMS. The EPA Enterprise Architecture will be based on EIMS. EIMS also supports catalogs for several partners, including the Enviro-Science E-print Service developed by DOE's Environmental Management Science Program and EPA.

EPA is integrating financial information and products with EIMS in order to provide better accountability of what a project costs. EPA will be able to tag back from a project to the financial system. This will allow the determination of the FTEs, answers to GPRA requirements and requests for "how much did this report cost". Peer review tracking is also in place, which will be used by an increasing number of projects.

Window to My Environment

Bill Grabsch and David Wolf of the EPA Office of Environmental Information described "Window to My Environment", which is a public access portal that has addressed practical problems of GIS data integration and deployment. The project uses XML to access data on remote servers, most of which are located at EPA's partners such as states, local governments and non-government organizations. Ninety-five percent of the data comes from elsewhere. Even regulatory and site monitoring information comes from the states and other organizations. It is necessary to integrate EPA information with information from these partners in order to tell if the regulations are making a difference to the environment. XML is being used, since there is no agreement on the Open GIS Standard.

EnviroMapper is how the map applications are deployed. There are about a dozen such applications both internal and external to EPA. ESRI and Oracle Spatial are used. The ESRI geospatial layers have been moved to an Oracle database for improved access. It is an object-oriented relational database, and the database services are embedded in the database. For example, a bounding box is passed to the database. The map is passed through the GeoServices Registry and the URLs that answer the query are identified. This provides the "Your Environment" portion of the portal. This information can be served from the state, county or local entity.
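
The bounding-box lookup described above can be sketched in a few lines. This is only an illustration of the general idea, not EPA's implementation: the registry entries, service names, coverage boxes, and example.org URLs are all hypothetical, and a production system would run the spatial test inside the database rather than in application code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A bounding box as (min_lon, min_lat, max_lon, max_lat) in decimal degrees.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class ServiceEntry:
    """One hypothetical registry record: a partner data service and its coverage."""
    name: str
    url: str
    coverage: BBox


def intersects(a: BBox, b: BBox) -> bool:
    """True when two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_services(registry: List[ServiceEntry], query: BBox) -> List[str]:
    """Return the URLs of services whose coverage intersects the query box."""
    return [entry.url for entry in registry if intersects(entry.coverage, query)]


# Hypothetical registry contents, for illustration only.
REGISTRY = [
    ServiceEntry("state-water-quality", "https://example.org/state/water",
                 (-80.0, 37.0, -75.0, 40.0)),
    ServiceEntry("county-air-monitors", "https://example.org/county/air",
                 (-77.5, 38.5, -76.5, 39.5)),
    ServiceEntry("western-sites", "https://example.org/west/sites",
                 (-125.0, 32.0, -114.0, 42.0)),
]

if __name__ == "__main__":
    map_window = (-77.2, 38.8, -76.9, 39.0)  # assumed map extent
    for url in find_services(REGISTRY, map_window):
        print(url)
```

Only the services whose coverage overlaps the requested map window are returned, so the portal contacts just the state, county, or local servers that can answer the query.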

One of the most important issues is how to gear the content so that people in various communities, especially the public, can understand it. OEI hopes to finish this nationwide implementation by January. Implementations can differ from EPA region to EPA region and between the EPA national implementation and the regional implementations.

"Your Environment" was done by consensus and was based on comments from an advisory group and from questions submitted by program managers. The questions will be revisited after the system has been available. They are engaged in some internal applications for homeland security where the audience and the questions differ.

Environmental Data Registry

Larry Fitzwater of OEI described the Environmental Data Registry. The EDR supports EPA's efforts in data standardization by cumulating definitions of data elements from a variety of EPA databases, products, and projects. It provides a single source for their definition using the ISO 11179 standard for data element description, providing metadata about each data element and serving as a resource for reuse and mapping. While the data registry does not change the quality of the data, it does provide a key aspect of quality, which is "what does it mean". Mr. Fitzwater described several scenarios where definitions within EPA or between EPA and its partners were dissimilar and caused problems in using and sharing the data.
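
The kind of mismatch Mr. Fitzwater described can be illustrated with a toy registry. The sketch below is a simplification and does not follow the full ISO 11179 metamodel; the class names, source systems, and sample definitions are hypothetical and are not taken from the EDR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataElement:
    """Simplified metadata for one data element as used by one source system."""
    name: str
    definition: str
    unit: str
    source_system: str


class Registry:
    """Collects data element descriptions and reports disagreements."""

    def __init__(self) -> None:
        self._by_name: Dict[str, List[DataElement]] = defaultdict(list)

    def register(self, element: DataElement) -> None:
        self._by_name[element.name.lower()].append(element)

    def conflicts(self) -> Dict[str, List[DataElement]]:
        """Return element names whose definitions or units differ across systems."""
        return {
            name: elements
            for name, elements in self._by_name.items()
            if len({(e.definition, e.unit) for e in elements}) > 1
        }


def build_sample_registry() -> Registry:
    """Populate a registry with made-up entries from two imaginary systems."""
    registry = Registry()
    registry.register(DataElement("Latitude", "North-south position of a site",
                                  "decimal degrees", "system-a"))
    registry.register(DataElement("Latitude", "North-south position of a site",
                                  "degrees-minutes-seconds", "system-b"))
    registry.register(DataElement("Facility Name", "Public name of the facility",
                                  "text", "system-a"))
    registry.register(DataElement("Facility Name", "Public name of the facility",
                                  "text", "system-b"))
    return registry


if __name__ == "__main__":
    for name, elements in build_sample_registry().conflicts().items():
        systems = ", ".join(e.source_system for e in elements)
        print(f"{name}: inconsistent across {systems}")
```

Recording each system's definition in one place does not change the data, but it makes disagreements such as differing units visible before the data are combined.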

There is currently no commercial tool or CASE tool that supports the ISO 11179 registry elements. Therefore, EPA created the Environmental Data Registry, which has grown into a system of registries. There are 7000-8000 data elements defined in the EDR. Mr. Fitzwater said that EPA welcomes comments from the group on the interface to the EDR.

Another area in which EPA needed standardization was the area of chemical names. This prompted development of a chemical data registry, with approximately 80,000 chemicals. Standard names have been provided for about half of them and the remainder will be completed by March 2003.

While XML schemas have proliferated, they do not help with the meaning. XML registries are still needed. They should be tied to Universal Description, Discovery and Integration (UDDI) so that they can be found via the web.

Mr. Fitzwater announced the Open Forum on Metadata Registries (www.metadata-stds.org/openforum2003), which will deal extensively with the interoperability of registries. The conference will be held January 20-24, 2003, in Santa Fe, NM. The conference is being sponsored by EPA, the European Environment Agency, ISO and several other groups, with session planning support from the USGS/BRD.
