CENDI PRINCIPALS AND ALTERNATES MEETING
National Library of Education
Washington DC
February 9, 2006
Minutes
FEDERAL PORTALS, ARCHITECTURES AND CENTERS:
CENDI CONNECTIONS
FedStats: The Gateway Portal to the Federal Statistical System
ERA Architecture: Status and Plans
The DNI Open Source Center: Information to Intelligence
The New ERIC Architecture and Content: National Library of Education Showcase
Welcome
Dr. Walter Warnick, CENDI Chair, opened the meeting at 9:10 am. He thanked NLE for hosting the meeting.
Dr. Warnick introduced Chuck Romine from the Office of Science and Technology Policy (OSTP), who is working with the new interagency working group on scientific data under the National Science and Technology Council’s Committee on Science. OSTP involvement was proposed in the National Science Board’s report on long-lived data. The co-chairs will be from NIST and NSF. Dr. Romine will be the staff liaison.
Dr. Romine is in the process of finalizing the charter for the group. While he wants to ensure that scientific data sets are the motivating factor behind the group, the charter also ensures that textual digital data will be part of the working group’s deliberations. Close ties with CENDI are expected. CENDI might be designated as a Technical Advisory Group (TAG).
Dr. Warnick responded that CENDI is eager to participate once the group is formed. Ms. Carroll noted that there are several AAAS meeting sessions that relate to the interests of this group. The symposium co-sponsored by CENDI is a follow-on to the National Science Board Report. The Secretariat will ask the symposium speakers to make their presentations available through the CENDI web site.
“FedStats: The Gateway Portal to the Federal Statistical System”
Marshall DeBerry, FedStats Program Manager, Department of Justice
The Federal Statistical System comprises agencies that collect, process, or tabulate statistical data; plan statistical surveys or studies; conduct methodological research; or manage or coordinate statistical operations within the federal government. In fiscal year 2006, an estimated $5.3 billion was budgeted for these activities. Approximately 70 agencies have direct or estimated funding for statistical activities valued at $500,000 or more per year. Approximately 40 percent of the total budget for statistical activities goes to ten agencies, including the Bureau of Labor Statistics, the Bureau of Justice Statistics and the Census Bureau, which have these activities as their principal missions.
FedStats is the portal for publicly available statistical information from the decentralized Federal Statistical System. (The US is unique in its decentralization; statistical systems of many other countries including Canada and the UK are centralized.)
FedStats is developed and maintained by a task force of the Interagency Council on Statistical Policy (ICSP). The ICSP comprises the heads of ten statistical agencies and four statistical units. Charlotte Cottrill, CENDI Acting Alternate representing EPA, is the EPA representative for FedStats. The ICSP, which meets monthly, is chaired by the Chief Statistician of the US, whose office is located within the Office of Management and Budget (OMB). The FedStats task force also meets monthly.
The mission of FedStats is to provide effective, efficient, and timely access to and use of the federal statistical information needed for informed decision-making. The vision is that informed decision-making starts with the information and knowledge available through the FedStats portal.
In the early 1990s, the ICSP realized that the Internet could provide a one-stop shop for federal statistics. FedStats went online in May 1997 as one of the earliest e-government initiatives. Since then, the portal has won several awards.
The ICSP group charter pays for the web site. There is a five-year, interagency agreement among the 14 agencies and units that make up the Council. The staff includes Mr. DeBerry and two staff members from the Census Bureau. The budget is under $500,000 per year. The membership is two-tiered. The top five agencies pay higher dues than the others. The dues for the Census Bureau are waived because of the infrastructure support that is provided. Some in-kind support is also provided.
The statistics that are made available from FedStats come from a number of federal statistical agencies and federal partners, such as the Department of Housing and Urban Development. Data are generally not exchanged among the agencies because of concerns regarding privacy and confidentiality. Mr. DeBerry noted that under the E-Government Act of 2002, in the section regarding the Confidential Information Protection and Statistical Efficiency Act (CIPSEA), some very limited sharing of economic information is allowed among BLS, the Census Bureau, and the Bureau of Economic Analysis.
FedStats does a crawl with a search engine to ensure that new items from the various statistical agencies are available to the public.
There is a key partnership with the IMF’s Dissemination Standards Bulletin Board (DSBB) group. As the US is an IMF member nation, FedStats is the US repository for the display of United States statistics as part of the DSBB’s member nation standard. Approximately 100 statistical measures are updated and published twice a day from five major agencies. FedStats is also examining how to better standardize the way the information shown on this page is assembled and displayed through the use of an XML standard named the Statistical Data and Metadata Exchange (SDMX); this standard would replace the Perl scripts that are currently used to produce the page.
Mr. DeBerry then showed the overlap and differences between CENDI and FedStats agency membership. This was based on charts that Ms. Carroll had presented at the CENDI briefing to FedStats in January. In addition to overlap in membership, there are areas of mutual interest including metadata, web site usability, and XML technology to facilitate updating of web sites. Mr. DeBerry discussed the taxonomy issues. The topics A-Z browse will be collapsed into four large buckets. It may be possible for the science and technology part of the taxonomy to be aligned with the Science.gov taxonomy.
Information seeking and usability are areas of concern. The web site seeks to strike a balance between ease of use and the amount of content. There are always concerns about quality and consistency. The American Customer Satisfaction Index (ACSI) has not been used on FedStats, because FedStats has access to the Bureau of Labor Statistics Usability Lab. Some usability tests were conducted when the site was first developed. Testing is now done as needed when changes are made to the site. Agencies may, of course, do their own testing on the sites to which FedStats links.
Over the years, the FedStats web site has changed. The web site is now simpler and more “Google-like”. A MapStats capability has been added. This allows data to be viewed at various levels of geography. In cooperation with HUD, a city capability (pop. 25,000 or more) is available in addition to access by states and counties. However, there is no real use of GIS at this point. FedStats wants to create a geographic profiling functionality by which users can create their own information sets.
The current system is built on open source software. Very fast retrieval and display are achieved with MySQL. Section 508 compliance was challenging to achieve but, ultimately, it was beneficial because it provided the basis for delivering the information more easily on Personal Digital Assistants (PDAs) and other mobile devices.
Mr. DeBerry believes that the way statistics are provided will change over the next few years, moving away from data in tables. The NSF’s Digital Government research program has supported some of the research into the future environment and technologies. The partnership with the NSF Digital Government Program has saved FedStats significant money. Through NSF, they have been able to work with academics from University of Maryland, University of North Carolina, and Carnegie Mellon on issues such as usability and statistical search technologies.
FedStats is considering a new search engine and has been working with Jamie Callan at Carnegie Mellon on research regarding searching of statistical databases. Vivisimo, which is used by FirstGov, has also been investigated. The tools are evolving, and it would be beneficial to search across statistics, text and geospatial information.
“ERA Architecture: Status and Plans”
L. Reynolds Cahoon, Senior Advisor on Electronic Records, NARA/ERA Team
The goal of the ERA is to provide ready access to central records in a way that provides an educational experience for the user. The challenge NARA faces is to preserve any type of electronic record created using any type of application, on any platform, from any entity of the Federal Government and from any donor. Discovery and delivery must be available to anyone with an interest and legal right to access, now and for many future generations, “to the end of the Republic.”
In addition, NARA faces the same challenges as others who are trying to preserve electronic records. These challenges include retaining authenticity, addressing and overcoming obsolescence of technology, dealing with increasingly complex formats and demanding behaviors, handling an increasing variety of format types (in excess of 16,000 MIME types), and managing a scope and timeframe that create enormous numbers of records. Complicating the process are more and more complex multi-valent documents, user expectations that continue to evolve, and difficulties in creating independence of content from technology.
The sheer size of the effort is a major challenge. For example, the 2000 census includes 600-800 million images. The military will be transferring close to a billion images over the next 10-15 years. Scalability is a major driver as the project aims to manage the records life cycle.
NARA’s approach is to attack the critical preservation problem, defining the requirements in terms of the lifecycle management of records and aligning with the overall direction of Information Technology in the government. The latter means that the ERA is focused on finding solutions that are commercially viable, mainstream technologies that are being developed in other sectors. This approach will allow the ERA to be available to other agencies and sectors through commercial channels. NSF has been a partner in the ERA from the beginning. They have helped to identify commercially viable technologies developed for e-commerce, e-government, and the grid.
ERA will support workflow within NARA and between NARA and the agencies. Electronic records will be ingested, preserved, managed, and accessed through the National Archives, Federal Records Centers, and the Presidential Libraries.
Four system design drivers were identified. Technology is a double-edged sword: obsolescence is a challenge, but improvements and cheaper storage are benefits of technology change. The ERA requirements were the first that required evolvability at the core. The system must scale up and down, from PCs to the Grid, in order to address the whole life cycle. The system must be extensible to include new data types of increasing complexity. Persistent preservation must ensure authenticity and accessibility. The system must deal with sophisticated classified materials. Trust is extremely important.
The OAIS (Open Archival Information System) Reference Model is at the core of the ERA architecture. A Service Oriented Architecture will be used to support the ability to move components in and out. Interfaces will be open and standard to the degree that they can be while balancing performance issues.
The system will not be a single instance in one place. There will be separate classified and unclassified systems and also many instances of both. Very little human intervention can be involved because of the volume. A lot of upfront work with the agencies will be needed to help agencies design records systems that help with management and disposition.
Metadata is important. The initial preservation formats will be the native object format with a big metadata wrapper around the accessions. This approach makes the data independent of the ERA itself, providing a version of the content that is hardware and software independent and allowing the system itself to improve and evolve.
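The wrapper idea can be illustrated with a minimal sketch. This is not the ERA design; it only assumes that the native bytes are kept untouched and surrounded by descriptive metadata that does not depend on the system that stores them. The function names and metadata fields (mime_type, source_agency, sha256) are hypothetical.

```python
import base64
import hashlib
import json


def wrap_accession(payload: bytes, mime_type: str, source_agency: str) -> str:
    """Return a JSON envelope: the unmodified native bytes plus metadata.

    All field names here are illustrative assumptions, not ERA fields.
    """
    envelope = {
        "metadata": {
            "mime_type": mime_type,
            "source_agency": source_agency,
            "size_bytes": len(payload),
            # Checksum recorded at ingest so authenticity can be re-verified later.
            "sha256": hashlib.sha256(payload).hexdigest(),
        },
        # Base64 keeps the native object byte-for-byte intact inside plain text.
        "content_base64": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(envelope, sort_keys=True)


def unwrap_accession(envelope_json: str) -> bytes:
    """Recover the native bytes and fail if they no longer match the checksum."""
    envelope = json.loads(envelope_json)
    payload = base64.b64decode(envelope["content_base64"])
    if hashlib.sha256(payload).hexdigest() != envelope["metadata"]["sha256"]:
        raise ValueError("fixity check failed: content does not match metadata")
    return payload
```

Because the envelope is plain text with a recorded checksum, any later system that can read JSON can recover and verify the original object.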
The system must also be able to support redaction and deal with privacy issues. Access must be managed and content must be rendered through extant technology as required for viewing and use by consumers.
Preservation Planning deals with the tradeoffs by format and content type. Preservation Adapters are being developed for various format types and behaviors. Of course, in reality this is more complex because a record can have multiple formats linked together that require adapters for each format. The contractor has prototyped the adapter approach but not on a large scale. Adapters can be replaced as the technologies improve.
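A rough sketch of the adapter approach, under the assumption that each adapter is simply a function that turns one format's bytes into something viewable, is shown below. The class and method names are hypothetical and do not come from the contractor's prototype. It illustrates the two points made above: a record with several linked formats needs an adapter for each one, and an adapter can be replaced without touching the stored content.

```python
from typing import Callable, Dict, List, Tuple

# An adapter turns the native bytes of one format into a viewable string.
Adapter = Callable[[bytes], str]


class AdapterRegistry:
    """Hypothetical registry mapping a format name to its current adapter."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, fmt: str, adapter: Adapter) -> None:
        # Registering a format again replaces the older adapter.
        self._adapters[fmt] = adapter

    def render(self, components: List[Tuple[str, bytes]]) -> List[str]:
        """Render every (format, bytes) component of a multi-format record."""
        missing = sorted({fmt for fmt, _ in components if fmt not in self._adapters})
        if missing:
            raise LookupError("no adapter for: " + ", ".join(missing))
        return [self._adapters[fmt](data) for fmt, data in components]


registry = AdapterRegistry()
registry.register("text/plain", lambda data: data.decode("utf-8"))
registry.register("text/csv", lambda data: data.decode("utf-8").replace(",", " | "))

# One record made of two linked components in different formats.
record = [("text/plain", b"Cover memo"), ("text/csv", b"a,b,c")]
rendered = registry.render(record)
```

The stored bytes never change; only the registry entry does when a better adapter becomes available.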
In addition, it is important to keep the archival and records context. How the pieces fit together and are ordered to tell a story is at the heart of records management. The maintenance of context over time is a key challenge. A web of connections is needed to maintain this context; the data structure and the catalog are key to a persistent archive.
A preliminary design review is scheduled in the near future. At that point, the ERA/NARA Team will see how well the engineers can implement a design consistent with the architecture and determine where the tradeoffs are.
The ERA Development Plan extends through 2011. Increment 1 is scheduled to be done before the end of FY07. This will include shared services and storage. Increment 2 will add additional instances and the rest of the system, including redaction. There will be a whole architecture around the rights management and access functions, which will be added in Increment 2. When and how this will be developed will be better known after the preliminary review.
Some agencies will have particular responsibility for records through NARA’s affiliated archives program. Affiliated archives will need some version of the ERA. Since the ERA is built on COTS products that are stitched together, some licensing of software/hardware will be needed to create these instances. However, the design is open and readily available. Federal Records Centers will have instances of the ERA along with the Presidential Libraries.
Agencies can support the ERA activities now by moving toward more standard or preferred formats. The more technology independent the content can be made, the more likely it is that the materials can be ingested by the ERA.
“The DNI Open Source Center: Information to Intelligence”
Judith Bylicki, Greg Klipfel and Bill Hannas; Office of the Director of National Intelligence, Open Source Center
The Creation of the Center. In response to language in the Intelligence Reform and Terrorism Prevention Act of 2004 and recommendations in the Silberman-Robb Commission calling for more effective use of open sources to support intelligence, the newly established Director of National Intelligence created the DNI Open Source Center at CIA on 1 November 2005. The DNI assigned the Director/CIA as executive agent for this center and directed that the center be built on the capabilities and expertise of CIA’s FBIS and report directly to the Director/CIA in implementing DNI strategy and guidance.
The Mission. The Open Source Center (OSC) executes DNI strategy and guidance and supports DNI nurturing of distributed open source architecture across the Community. The Center manages that distributed enterprise through a new, inclusive approach to networking expertise and capabilities that exist not only within the Intelligence Community, but also across the government and throughout the private sector and academia.
The OSC maintains a worldwide network of multilingual regional experts who answer intelligence requirements using open sources including radio, television, newspapers, news agencies, databases and the World Wide Web. The OSC provides customers with comprehensive and accessible reporting on foreign political, military, economic, and technical developments, as well as other topics responsive to their requirements.
Products and Services. The Center will provide centralized services, including training, tools, and acquisition, to enable individual USG components to integrate and use openly available information effectively and efficiently in carrying out their unique missions.
The monitoring of open source in over 80 languages and more than 160 countries yields a wide range of products including text, multimedia, photographic and geospatial elements. Services include media analysis, in-depth research, source assessment and language support.
The CENDI members provided the Center with several ideas for collaboration and partnerships. In particular, the Center was pointed to the 11 Information Analysis Centers (IACs) operated under contract to DTIC. The IAC on ChemBio should be an important resource for the work the Open Source Center is doing on biological issues. The Center is already connected to the Federal Library & Information Center Committee (FLICC) and FEDLINK.
There was some discussion on how to continue and build on this connection with the open source community, including possible CENDI membership for the Center.
“The New ERIC Architecture and Content: National Library of Education Showcase”
Luna Levinson, Institute of Education Sciences, and Larry Henry, Computer Sciences Corporation
ERIC has long been the flagship product of the Department of Education. New legislation envisioned the ERIC topics as part of the totality of the Institute of Education Sciences (IES) effort. The requirements were for efficient, Internet-based, full text access to education materials.
The statement of work was developed in 2002 and the contract was awarded to the Computer Sciences Corporation (CSC) in 2004. In developing the ERIC business case, the IES determined that the new approach costs 30 percent less than the old clearinghouse model for creating ERIC.
The entire business has been outsourced to CSC. CSC engages the publishers, establishes agreements, and ingests the digital content. Major issues have included reducing production time to 30 days and designing the web site for a wide variety of users. A lot of time was spent in the early part of the project determining the metadata. Arrangements were made to meet with stakeholders at the American Library Association (ALA) to collaborate on ERIC’s evolution.
CSC is conducting eight rounds of usability testing. The American Customer Satisfaction Index (ACSI) was implemented beginning in September. A satisfaction rating of 70 was achieved, with the highest scores in site navigation and content.
Mr. Henry then described the ERIC architecture, which is based on web services. The architecture operates on COTS products, including BEA WebLogic as the application server and NStein for ERIC Thesaurus management and computer-aided characterization. Documentum acts as the workflow management system that talks to Oracle. Portal architecture is used to personalize ERIC, and Handles were implemented for persistent identification.
Content is ingested through a variety of means, including direct submissions, harvesting of web sites, and electronic input from publishers. The electronic input from publishers is transformed through a series of scripts. Authors can submit one or more documents. Submission requires that ERIC be granted the right to distribute the content. Some subset of the metadata records can be completed by the author, along with instructions to the processing team. The submitter can track the processing status.
NStein is used to manage both small and large taxonomies. The small taxonomy categorizes the material for routing to subject experts. The large taxonomy (the ERIC Thesaurus) provides candidate controlled descriptors. The small taxonomy is approximately 90 percent accurate. More tuning is needed on the large taxonomy, which is approximately 60 percent accurate.
The source management system manages sources, content providers, and agreements against the sources. The journal list presented on the web site is automatically generated from the source management system.
A single search box and more advanced search interfaces are available. A search can be performed within results and a search can be refined. Some of the thesaurus descriptors are exposed in order to support the refinement of searches. The advanced search provides filtering by publication type. ERIC provides a link to MyLibrary, in which the user can select up to three institutions.
The Google site map has approximately 1.1 million pages that Google and others can index. Clicking on the result under Google takes the user to the ERIC detail page. Monthly updates are provided to several online vendors.
Future plans include publishing the thesaurus in XML format and moving from Dialog Format B to an XML format. The internal metadata model will be mapped to the OAI to allow ERIC to participate in open metadata activities. ERIC needs to bring its stakeholders along while meeting the demands of the Internet.
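As an illustration of what a mapping from an internal metadata model to OAI could look like, the sketch below converts a record into the unqualified Dublin Core format (oai_dc) used by the OAI Protocol for Metadata Harvesting. The minutes do not say which OAI format ERIC will target, so the choice of oai_dc is an assumption, and the internal field names and sample values are hypothetical rather than ERIC's actual model.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH for unqualified Dublin Core records.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical internal field name -> Dublin Core element name.
FIELD_MAP = {
    "title": "title",
    "authors": "creator",
    "descriptors": "subject",
    "publication_date": "date",
    "accession_number": "identifier",
}


def to_oai_dc(record: dict) -> ET.Element:
    """Build an oai_dc element from an internal record (a plain dict here)."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_MAP.items():
        value = record.get(field)
        if value is None:
            continue
        # Repeatable fields (authors, descriptors) become repeated elements.
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = str(item)
    return root


# Made-up sample record, for illustration only.
sample = {
    "title": "Sample Report",
    "authors": ["A. Author"],
    "descriptors": ["Sample Descriptor One", "Sample Descriptor Two"],
    "publication_date": "2006",
}
xml_text = ET.tostring(to_oai_dc(sample), encoding="unicode")
```

Fields missing from the internal record are simply omitted, since all Dublin Core elements are optional and repeatable.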