CENDI PRINCIPALS AND ALTERNATES MEETING

National Technical Information Service
Springfield, VA
April 25, 2005


ABBREVIATED MINUTES

GRIDS
Grids: Concepts, Technologies and Applications
ETF Overview: Promise and Challenges
Using the Grid for Precision Searching of STI
NASA Grids and Earth Observation: Goddard Applications

NTIS Showcase: New Activities and Plans
Clearinghouse Products
Wage Determination Online
ALTUM Grants Reporting Software

IRS Tax Product CD
Atlas Pro and ACSEL
Social Security Death Master File

Welcome

Dr. Walter Warnick, CENDI Chair, opened the meeting at 9:10 am. He thanked NTIS for hosting the meeting. Ms. Janice Coe introduced Jo Gann, Acting Director of NTIS. Ms. Gann welcomed the CENDI members to NTIS.

GRIDS

“Grids: Concepts, Technologies and Applications” Dr. Geoffrey Fox, The Community Grids Lab, Indiana University

Grids have numerous definitions. They range from the Grid as a series of four or more large computers (a definition from 1960) to virtual organizations; large-scale, resource-intensive distributed systems; to worldwide networks of interconnected computers that behave as a single entity. Grids are Internet-scale distributed services. They use Internet technology and are distinguished by managing or organizing sets of network-connected resources. Unlike the classic Web, which allows independent one-to-one access to individual resources, the Grid integrates and manages multiple people, sensors, computers, and data systems. The organization can be explicit (e.g., the TeraGRID), or implicit (e.g., Internet resources that “harmonize a community”).

There are different visions of the Grid. Some use the term to refer to technologies (i.e., the full systems and applications). The computer industry envisions the Grid as utility computing, or data or computing on demand. e-Science and cyberinfrastructure efforts are virtual organization Grids. Skype Voice-Over-Internet can be viewed as a peer-to-peer Grid. Commercial third-generation cell phones are forming mobile Grids. At this time, anything that is “e-” is defined as a Grid.

There are several broad classes of Grid applications. These include the Enterprise Grid which supports an information system for an entire organization. Outsourcing Grids link different parts of an enterprise together, such as manufacturing plants with designers. A Customer Grid links businesses and their customers. Other types are e-Multimedia and Distance Education Grids. An important business application of a Grid is utility computing. A pool of computers is assigned to applications when resources are needed.

There are several styles of Grids. Computational Grids link computers across the globe. Knowledge and Information Grids link sensors and information repositories, such as the Virtual Observatories. Education Grids link teachers, learners and others with learning tools, distance lectures, etc. e-Science Grids link multidisciplinary researchers across laboratories and universities. Community Grids focus on Grids involving large numbers of peers rather than focusing on linking resources. Semantic Grids link Grid and artificial intelligence communities with the semantic web and agent concepts.

Grid is the software infrastructure that sits on top of the physical network and exploits the network. Grids support distributed collaboratories or virtual organizations integrating concepts from the Web, agents, distributed objects, peer-to-peer networks, and a variety of open source software environments. The basic paradigm from the Web view is metadata-rich services communicating via messages. Metadata is key.

Typical Grid architecture involves the raw resources that are made accessible via middleware. At the highest level, the middleware applications are system services such as security and search. These applications are accessed by users via Web-based portal services.

Grid technologies include the Web, agents, distributed objects, and Web services. Web services are computer programs with an interface in XML. They can be used to build loosely coupled applications based on the principles of Service Oriented Architecture (SOA). Web services interact by exchanging messages in SOAP format. WSDL (Web Services Description Language) interfaces define the contracts for the message exchanges that implement the interactions between otherwise independent Web services. The service implementation is completely separate from the interface. There is no explicit or implicit common implementation model, thereby creating a much more loosely coupled structure.
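As a rough illustration of the message exchange described above, the following Python sketch builds and then parses a minimal SOAP 1.1 envelope using only the standard library. The service namespace, the operation name (Search), and the parameter (query) are hypothetical and were not part of the presentation; a real service would publish them in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:search"  # hypothetical service namespace


def build_request(operation: str, params: dict) -> bytes:
    """Wrap an operation call and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_request(message: bytes):
    """Recover the operation name and parameters from a SOAP envelope."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # the first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("Search", {"query": "grid computing"})
operation, params = parse_request(message)
```

The sender and receiver share only the message format, not any code, which is the loose coupling the paragraph refers to.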

Dr. Fox presented several diverse Grid applications. E-Defense and E-Crisis are natural places where Grids can be used. These are essentially the next generation of Command and Control Systems. The NASA Aerospace Engineering Grid distributes the design and manufacture of complex aircraft. The Virtual Observatory Astronomy Grid integrates experiments. The e-Chemistry Laboratory performs experiments on demand. The DAME project between Rolls-Royce and the United Kingdom (UK) e-Science Program captures a gigabit of data when a plane’s engine crosses the Atlantic. Dr. Fox is working on a Grid connecting Indiana coaches to aspiring players in China.

The work on Grids is coming from a variety of sources with many system services under development, particularly in industry. A variety of standards and frameworks are being developed by standards bodies and companies, resulting in standards battles. For example, there are about 60 standards related to Web services alone. Many fields are setting domain-specific standards and building domain-specific services. Microsoft is a key player and stands to benefit as Web services open up.

Ultimately, the result is likely to be a Grid of Grids. Individual disciplines, geographic areas, and organizations will create Grids, which can then interact. An example would be a research Grid that interacts with an educational Grid.

The US appears to be lagging behind Europe and Asia in Grid development because of the lack of funds (roughly zero). The largest service application, which gives resources when needed, is at CERN. Analysis of data from CERN’s Particle Accelerator will involve 30,000 computers being shared by 1000-2000 physicists at multiple organizations world-wide. Development of grids for this application is particularly focused in the European Union where there has been significant funding, although many key technologies come from the US.

The Grid has vast implications for education and society in general. It is characteristically democratic. It will change the landscape in mature areas like enterprise software, encourage trends like outsourcing and globalization, and support new distributed applications in government, education, business, and science. The organization and presentation of data to the world will change the public view of science. While the Web Service/Grid standards and infrastructure are still in their infancy, some broad principles are reasonably clear. Many large-scale software development activities are inconsistent with this modern architecture. This leads to a question about what organizations should do now to prepare for the future. Dr. Fox suggested that development of application specific (XML-based) standards is an important “safe” area for current development.

“ETF Overview: Promise and Challenges” Dr. Guy Almes, National Science Foundation (NSF)

The Extensible Terascale Facility (ETF, also called the TeraGRID) is a particular kind of Grid for the NSF environment. This is part of NSF’s concept of cyberinfrastructure, an enduring element in the NSF of the future. Due to the very nature of the research, university and research colleagues are scattered across the nation and around the world. Enabling this community and its collaborative work is key to NSF. Traditionally, there were two approaches to doing science: the theoretical or analytical and the experimental or observational. Now, the use of computational resources has led to the third approach: simulation or modeling.

The development of the ETF has its roots in the Supercomputer Center program that began in the early to mid 1980s. At that time, the National Center for Supercomputing Applications (NCSA), San Diego Supercomputer Center (SDSC), and Pittsburgh Supercomputing Center (PSC) were among the leading centers. Their success led to a steady increase in funding and created a need to access these centers from the 200 research universities. This need led to the NSFnet Program from 1985-1995. The goal was to connect users to the supercomputer centers and to enable collaborations through these centers. This involved network-mediated collaboration and there was rapid growth in the network speed. The program had a very narrow mission, but its impact was broad: it moved the Internet from the ARPAnet to the commercial Internet.

Sensors were not typically connected to the Internet at that time. This was primarily because the locations of these sensors were historically away from telecommunication hubs and, therefore, isolated from adequate connectivity.

Since the 1990s, middleware has been growing in importance. Explicit elements in the ETF include advanced computing, advanced instruments, advanced networks to connect researchers, instruments and computers in real time, and advanced middleware to enable potential sharing and collaboration. There are strong synergies among these elements. For example, the University of Oklahoma’s CRAFT Project predicts storms with a three-hour lead time. It is based on real-time Doppler radar data; moving the bits and crunching the data results in increased reliability. It involves NCSA and Pittsburgh, along with the Internet2 network, the Unidata Project from UCAR (University Corporation for Atmospheric Research), and the National Weather Service.

The TeraGRID is one component of the ETF. It began with distributed systems of unprecedented scale, and a unified user environment across the resources. New partners were added to introduce new capabilities in computing, visualization, instruments, and data collections. This has resulted in a strong, extensible team. The initial community numbered over 500 users and 80 principal investigators.

The Supercomputer Center program evolved into the PACI Program (Partnerships for Advanced Computational Infrastructure) in 1997, and then into the Terascale Computer Systems program. This program began with the LeMieux low-latency cluster at Pittsburgh in 2000. In 2001, NCSA and SDSC with Argonne and Caltech developed the Distributed Terascale System of clusters and network. In 2002, Pittsburgh was added. The system continued to grow, with four new sites added in 2003.

In 2005, the ETF Operations was created with responsibility for coordinating activities. The Grid Infrastructure Group is an integrative and coordinating activity. Coordination is needed so users see that the TeraGRID is homogeneous enough to get support and some sense of community. The multiple resource providers meet in a Resource Providers Forum. There is a five-year plan for developing this coordination.

In reality, the computation that the ETF provides varies from very tightly coupled clusters, like the LeMieux and Red Storm systems at Pittsburgh, to tightly coupled clusters, like Itanium2 and Xeon at several sites. There are also data intensive systems that are important for crunching massive amounts of data. Memory-intensive systems are available if there is a need to access a large central store of RAM. Other requirements may include online and archival storage (more than a petabyte is available online from San Diego), numerous data collections and instruments such as the Spallation Neutron Source at Oak Ridge National Laboratory, and the Purdue Terrestrial Observatory.

The TeraGRID still serves as the “backplane” network, a role it has had since its inception. Grid software is provided by the TeraGRID, including the Globus toolkit, and Inca, which monitors computers on the network to make sure they are all running the same version of software.

There is now a need for a variety of different clusters in data centers. If the problem requires frequent interactions, then the large clusters break down. As the size of the datasets increases, the ability to write programs that touch these large datasets grows more problematic. For example, the memory-intensive cluster is ideal for remote visualization. A diversity of architectures is very important.

The promise of the TeraGRID is that capability systems will be available for scientists and engineers, and that these resources will be uniformly accessible. The resources need to be more widely available to more scientists. Balancing user and application ‘pull’ with the technology ‘push’ is important. Science gateways with backend supercomputers using web services could bring a 10-fold increase in the user base. There are eight initial gateways under development.

There are several technology drivers. Every three years, the ratio of gigabytes per mips (million instructions per second) is doubling. This allows increases to the data intensiveness of the activities. Integrating instruments is an undeveloped area. There is a distinct difference between capability at the national level and capacity computing. The care and curation of data collections is now the biggest impediment. Curation, management, and preservation become increasingly important in this environment. A global file system is being piloted.

Constructive relations are being developed with other Grids. The Open Science GRID is being reflected at particular university levels in preference to having supercomputers or individual department clusters. They are learning how to integrate these resources. The TeraGRID provides capability resources while the campus provides capacity resources.

Security is a major challenge. The TeraGRID has powerful shared resources that are implemented via Unix timesharing systems. The technology of secure shared systems has been neglected since the late 1970s. Security of shared systems now needs to catch up.

Performance is also an issue. Moore’s law is the enemy. It is difficult to maintain “supercomputerness” when ordinary computers double in performance every 18 months.

The role of the network is another challenge. Network performance among TeraGRID resources and between the TeraGRID and users is an issue. Bandwidth is improving but the speed of light latency is not.
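The latency point can be made concrete with a lower-bound calculation. The sketch below assumes that light in optical fiber travels at roughly two-thirds of its vacuum speed and uses a hypothetical 4,000 km path; neither figure comes from the talk, and real round-trip times are higher because of routing and queuing.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3               # assumed: light in fiber travels at about 2/3 c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and queuing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


def transfer_time_s(size_bytes: float, bandwidth_bps: float,
                    distance_km: float, round_trips: int = 1) -> float:
    """Serialization time plus the latency cost of each round trip."""
    serialization_s = size_bytes * 8 / bandwidth_bps
    latency_s = round_trips * min_round_trip_ms(distance_km) / 1000
    return serialization_s + latency_s


rtt = min_round_trip_ms(4000)               # hypothetical 4,000 km path
slow = transfer_time_s(1000, 1e9, 4000)     # 1 KB message at 1 Gbit/s
fast = transfer_time_s(1000, 1e10, 4000)    # same message at 10 Gbit/s
```

For a small message, a tenfold increase in bandwidth barely changes the total time, because the round trip dominates. This is why applications that need frequent interactions between distant resources gain little from added bandwidth alone.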

The ETF is reaching out to new users. There is a slow but steady growth in the number of power users, but many other communities that could take advantage of the resources do not have the skills to do so. There is a definite generation gap in the use of computational resources. The Science Gateway is seen as a way to relate to larger and more diverse communities of users.

“Using the Grid for Precision Searching of STI” Abe Lederman, DeepWeb Technologies

DeepWeb Technologies was founded in 2002. Customers include OSTI, DTIC, and Science.gov. While the current version of Science.gov provides a single interface across the 30 different government databases, it is often hard to determine the most relevant results from the multiple collections. There are different native search engines and ranking algorithms.

The DOE Small Business Innovation Research (SBIR) project to address this challenge began in August 2003. The Phase 1 SBIR, a proof of feasibility, was completed in April 2004. Phase 2 will continue through 2006. Funding is also provided by some Science.gov Alliance members. Dr. Fox is acting as a consultant on the Phase 2 development. Science.gov will be the first implementation of this precision searching using Grid technologies.

The project goals include downloading and indexing full text documents in order to improve the ranking of results. However, the standard approach involving massive downloads is resource intense for content providers. It is important to customize search algorithms to balance precision and recall. The goal is to support the mining and analysis of the search results through clustering, summarization, geographic location information, etc.

To deal with the need for precision ranking, a three-pronged approach is used. QuickRank works well for broad queries but not so well if the query is very narrow. MetaRank involves the customizing of algorithms based on metadata. This involves downloading the metadata, including the abstracts, prior to doing the ranking. QuickRank will be used first and then, if the results are not good enough, MetaRank will be invoked. DeepRank will be used only as needed since it utilizes more resources.
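The escalation from QuickRank to MetaRank to DeepRank can be sketched as a tiered fallback. Everything below is a hypothetical stand-in: the minutes do not describe the actual scoring algorithms, the record fields, or how "good enough" is judged. The sketch assumes a toy term-overlap score, assumes that each tier examines more of the document (title, then abstract, then full text), and assumes that quality is measured by the top score.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Ranker = Callable[[str, List[Record]], List[Tuple[float, Record]]]


def term_overlap(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text (toy relevance score)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def make_ranker(fields: List[str]) -> Ranker:
    """Build a ranker that scores each record on the given fields."""
    def rank(query: str, records: List[Record]) -> List[Tuple[float, Record]]:
        scored = [
            (term_overlap(query, " ".join(r.get(f, "") for f in fields)), r)
            for r in records
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)
    return rank


# Assumed ordering: each tier looks at more of the document and costs more to run.
TIERS = [
    ("QuickRank", make_ranker(["title"])),
    ("MetaRank", make_ranker(["title", "abstract"])),
    ("DeepRank", make_ranker(["title", "abstract", "full_text"])),
]


def tiered_rank(query: str, records: List[Record], good_enough: float = 0.99):
    """Try the cheapest ranker first; escalate only if the top score is too low."""
    for name, ranker in TIERS:
        ranked = ranker(query, records)
        if ranked and ranked[0][0] >= good_enough:
            break
    return name, ranked


records = [
    {"title": "Storm prediction", "abstract": "Doppler radar data",
     "full_text": "grid middleware for radar"},
    {"title": "Grid computing", "abstract": "distributed services",
     "full_text": "web services"},
]
tier_used, ranked = tiered_rank("radar middleware", records)
```

A broad query that matches a title stops at the first tier, while a narrow query such as the one above falls through to the most expensive tier.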

The architecture requires a new internal structure for Science.gov. The submit engine distributes the queries across the nodes and there are web services across the nodes. All the pieces are distributed geographically. As soon as the results come back from the first several searches, they will be displayed immediately. Refreshing the results will show more results if needed. The search status will show how many searches have been completed. The application will be loaded near the database and it will be up to the application to find servers and communicate quickly with the databases.
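The behavior described above, where results appear as soon as the first searches return and a status count shows how many have completed, can be sketched with a thread pool. The `search_source` function, the source names, and the delays are hypothetical placeholders for remote database queries and do not reflect the Science.gov implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_source(name: str, delay_s: float, query: str) -> list:
    """Stand-in for a remote database search (hypothetical; sleeps to mimic latency)."""
    time.sleep(delay_s)
    return [f"{name}: result for '{query}'"]


def federated_search(query: str, sources: dict):
    """Yield (completed, total, results_so_far) each time a source responds."""
    results = []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [
            pool.submit(search_source, name, delay, query)
            for name, delay in sources.items()
        ]
        # as_completed hands back futures in the order they finish,
        # so fast sources are shown without waiting for slow ones.
        for done, future in enumerate(as_completed(futures), start=1):
            results.extend(future.result())
            yield done, len(futures), list(results)


sources = {"fast_db": 0.01, "medium_db": 0.05, "slow_db": 0.1}  # placeholder names and delays
snapshots = list(federated_search("grids", sources))
```

Each yielded snapshot corresponds to one refresh of the results page: the first value is the search status count and the last is the set of results available so far.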

The distributed nature of Science.gov 3.0 will result in Grid nodes being located close to the content at some agencies. While this isn’t a requirement, it will result in improved performance. It will also be important to use approaches that minimize the bandwidth needed.

DeepRank will use a Grid-based solution using Web services, WSDL, SOAP, and XML. The application, which is platform-independent using Java, runs on distributed nodes. Different web services can be invoked for different kinds of analysis. DeepWeb Technologies may not write all the Web services, but the framework will allow others to write them and still ensure integration.

Twelve different functional web services have been developed. Some Web services that DeepWeb plans to develop include the “ranking conductor”, which manages the users’ search; and the “search selection optimizer”, which uses thesauri and information about the resource so lower-relevance databases aren’t searched. Some parts of the current Science.gov code will be replaced with Web services and Z39.50. Filtering services will include MetaRank, DeepRank, and others. An aggregation service will put results into a database. Results will be parsed to make them uniform; presentation services will develop consistency in the presentation.

The more standards that are adopted, the better the solution will be. DeepWeb Technologies is currently reviewing SRW, a Z39.50-like interface to content that will make the configuration easier. This architecture will be implemented as part of Science.gov 3.0, if only on one machine. A four-node GRID has been implemented as version alpha 1.1. The nodes include Los Alamos, an eight-GRID node in Indiana, and one in Oak Ridge.

Looking into the future, DeepWeb Technologies would like to pursue the idea of developing a Grid-based, government-wide portal. There may be a Grid of Grids, with some organization-based while others are thematic.

“NASA Grids and Earth Observation: Goddard Applications” Jeffrey Lubelczyk

Several NASA Earth science projects are sponsoring or managing prototypes that use Grid technologies, including AIST (Advanced Information Systems Technology), ESDIS (Earth Science Data and Information System), and the LDCM (Landsat Data Continuity Mission) Grid Prototype. In addition, NASA is leading a Grid Task Team of the Committee on Earth Observation Satellites (CEOS) Working Group on Information Systems and Services.

AIST involves the integration of OGC (Open GIS Consortium) and Grid technologies for earth science modeling and applications. It is being led by George Mason University. The hope is to make Grid-managed data accessible through the OGC servers and allow users to focus on science, rather than on issues with data receipt, format and manipulation.

The Remote Data Storage Pilot System under ESDIS seeks to demonstrate high volume remote data backup and grid recovery over a wide area network for portions of the Goddard Earth Sciences Distributed Active Archive Center (DAAC). This DAAC has petabytes of data.

The LDCM GRID Prototype is an infrastructure that allows scientists at resource-poor sites to have access to remote resource-rich sites. This will enable greater scientific research, maximize the use of existing resources, and limit the expense of building new facilities. The objective of the LDCM Grid is to assess the applicability and effectiveness of a data grid to serve as the infrastructure for research scientists to generate virtual Landsat-like data products. The team consists of government and contractor employees from NASA, the USGS, and the University of Maryland.

They are also working to produce these products from non-Landsat satellites such as MODIS (Moderate Resolution Imaging Spectroradiometer). This is especially important in cases where the Landsat satellites have left gaps in the data that can be filled in. Algorithms work on MODIS and Landsat surface reflectance scenes, and then data is distributed at remote facilities. This prototype solves a realistic scientific scenario using grid-enabled resources.

The LDCM uses the Globus Toolkit 2.4.3, which provides security infrastructure, storage resource management in the form of GridFTP, and job scheduling and resource allocation.

Grid flexibility is expected to maximize science resources. Four scenarios have been developed to illustrate the flexibility of the Grid approach. The first involves moving the application to the data: transferring the components to remote hosts, processing the data remotely, and then sending the results back to the science facility. The second, batch execution, involves parallel computing in a batch environment. The third is local processing, where the selected data sets are transferred to the science user’s site for processing. The last scenario involves no use of local resources, but transfer of the data to a third party for remote processing.

The Grid Prototype provides a generic software system architecture based on Globus services. It also utilizes the Java Commodity Grid Kits which simplify the programming interfaces. The first workflow allows the submission of jobs only to a specified resource. The next steps will provide the ability to submit a job to the “Grid”. Condor is used to manage Grid workflow functions such as managing sets of subtasks, getting the tasks done reliably and efficiently, managing computational resources, and error recovery.

To date, NASA has learned several lessons. First, the open source environment of many of the Grid tools is very beneficial. They are enhanced by collaboration and the software is very robust. The reuse of Grid tools such as Condor is efficient but some limitations have been identified. A surprising amount of time has been spent on basic network administration and security, including network performance and firewall issues. It is difficult to maintain configuration management across independent agencies and centers. The technical side of the Grid is moving along but there are significant governance and organization resource issues. The Globus Toolkit keeps changing. CEOS has created a working group on systems and services to share lessons learned. It includes international, private and interagency collaborators.

“NTIS Showcase: New Activities and Plans” Janice Coe and Staff, National Technical Information Service

Janice Coe, Director of Business Development, introduced members of the NTIS staff who, in turn, presented overviews of various NTIS products.

Clearinghouse Products

Wayne Strickland, Product Manager for Clearinghouse Products, gave a general background on NTIS. It began with the initiation of “The Publication Board” by President Truman in 1945. In 1950, the responsibility for the Board was turned over to the Department of Commerce. The Clearinghouse was established in 1964. NTIS came into being in 1970. The American Technology Preeminence Act of 1991 requires federal agencies that produce scientific, technical and engineering information to provide it to NTIS as a permanent archive of STI.

However, NTIS must recover its costs. Therefore, adding value by providing products in various formats and archiving STI is of primary importance.

NTIS houses the nation’s largest collection of US Government scientific, technical, engineering, and business-related information, with approximately 3 million records. As they find legacy reports, they incorporate them into the system. They also do some international STI. Acquisitions occur in electronic and paper formats, and agency web sites are harvested for relevant documents. There is still a strong demand for microfiche from libraries and the international community, and some call for paper. NTIS’ customers include US business and industry, libraries, universities and international entities.

Wage Determination Online

Bill Clark described the Wage Determination Online (WDOL) product. NTIS has been involved with the paper product and the database for many years. They moved the print product online. Originally, it was a subscription product which was moved from their bulletin board system to the Web.

As part of an E-government working group, they determined what stakeholders wanted. NTIS was unique in that they were able through their joint venture authority to bring government and private sector labor interests together. The WDOL is increasingly important in meeting provisions of the Davis-Bacon Act and also for the current A-76 environment.

The product development went through several evolutions and prototypes. The complexity of the Service Contract Act required a wizard (like TurboTax) to guide users to the right wage determination. A collective bargaining agreement module is being added. The product was launched on time and within budget.

The development of the WDOL allowed for elimination of redundant sites at various agencies. However, this also required migration of previous subscription sites held at the agencies into this single service. The service is funded by GSA at no charge to the users. Having the site available will allow elimination of the SF98 in paper.

ALTUM Grants Reporting Software

Betty Lagundo presented the ALTUM Grants Reporting Software (GRS). This is a joint partnership with Altum, Inc. to provide AltumGRS to federal agencies via interagency agreements. The software is provided via subscription and can be integrated with current grants management systems for reporting to Congress, the media, and the general public. There are two levels of products. The Central Grants Reports System (CGRS) is a Commercial Off-the-Shelf (COTS) product that is available to the government under license. The Grants Reporting Software (GRS) is a GOTS (Government off-the-shelf) product, developed for the National Institutes of Health (NIH) and available to other federal agencies through NTIS.

IRS Tax Product CD

Patricia Gresham described the Internal Revenue Service (IRS) Tax Product CD that has been produced in partnership with the IRS since 1998. The IRS Tax Product CD is used heavily by tax accountants and CPAs, and continues to be one of NTIS’ best-selling products. NTIS provides pre-mastering, mastering and replication, marketing/sales, and helpdesk support for the IRS.

The Tax Product CD is issued twice per year. The first version is issued in late December, to provide practitioners an early alert about changes in the tax laws. The second issue, which contains the Congress-approved forms, is issued in February of the following year. NTIS produces other IRS products such as the Small Business, Individual Retirement Account, and the Tax Exempt/Government Entity CDs. This year, the IRS has asked NTIS to accompany them to their Forums to assist in distributing and marketing the products to the public.

In addition, NTIS supports the IRS TaxFax Program for people who want to receive fax versions of forms. NTIS’ successful partnership with the IRS continues. They recently initiated another long-term agreement for program support.

Atlas Pro and ACSEL

Bill Jackson described Atlas Pro and ACSEL (Agency Consortium for Secure E-Learning). NTIS is an Office of Personnel Management-approved e-Learning Services Provider. However, NTIS differs from the other two providers in that they offer GOTS software, secure federal hosting, and flexible migration planning. NTIS works with ACSEL, a member-funded consortium that develops course modules.

Atlas Pro trains, tests, and certifies. The system is government-owned, open source, and compliant with SCORM (Sharable Content Object Reference Model). It can be easily integrated with other course systems. The need for training both in the military and civilian areas is expected to increase. They have trained more than 150,000 students per year for the past eight years. NTIS also runs the help desk for the system. The consortium is preparing to release a Request for Information (RFI) looking for additional partners.

Social Security Death Master File

David Thomas described the Social Security Death Master File. The full file has more than 70 million records extracted from the official file of the Social Security Administration. It includes all those reported to SSA as deceased and it is estimated to be about 95 percent complete. NTIS distributes the information through a secure web site and on CD-ROM. Updates with new deaths and corrections are issued weekly and monthly.

Previously, users could receive the data in raw format only. Subscriptions were provided for weekly or monthly updates. In 2001, Congress wanted the information to be released more quickly and more often because the information is used by the government and the private sector to prevent and detect fraud and comply with the USA Patriot Act. The information is also used by pension funds, medical researchers, and others. Now the SSA sends the file electronically to NTIS every Saturday.

It became apparent that it was difficult for small organizations to manage the raw data and that many organizations wanted to be able to search as few as one record at a time for suspects or a watch list. The file is now available as search applications on a Web site hosted by NTIS and a private sector joint venture partner. The system has seen a 700 percent increase in usage since it was made available via the web.
