CENDI PRINCIPALS AND ALTERNATES MEETING
Government Printing Office
Washington, D.C.
June 3, 2003
WELCOME
Kent Smith, Chair of CENDI, opened the meeting at 9:10 am. He thanked GPO for hosting the meeting and welcomed it as the newest CENDI member. Bill Turri, newly appointed Deputy Public Printer, welcomed CENDI and acknowledged that as GPO tries to reinvent itself, alliances with organizations such as CENDI are extremely important. GPO is excited about becoming a member of CENDI and believes that best practice initiatives are of common interest to GPO and the other CENDI members.
GPO in the Spotlight: Current Activities, Directions
At the CENDI meeting in December, Bruce James, the Public Printer, outlined GPO’s plans for the next 18-24 months, as it changes from being a printer to being an information management and dissemination organization. In recognition of the complex environment (thousands of employees, 23 unions, and the current status of Title 44), the goal is to look critically at current operations and to plan for the future. They are also looking to see what other organizations are doing and what their customers need. They believe that involvement with CENDI and the individual agencies is critical to this evaluation.
GPO is in a fact-gathering phase in which they are using demonstrable pilot projects to produce verifiable facts. One example is a pilot project with DOE/OSTI. Based on library profiles, specific DOE materials are being sent to certain Federal Depository Libraries. GPO is interested in any ideas from CENDI in this regard.
Ms. Russell then spoke specifically about changes to the Federal Depository Library Program (FDLP). The goal is to “get out of the box”, to re-look at the mission, and to achieve a new vision of what the FDLP should be.
A major question is the role of the FDLP in the electronic environment. What is the benefit to being a depository library if all the information is available on the Internet? Currently, 60 percent of GPO’s titles are available electronically on agency sites, online at libraries, or at the GPO. In five years, 95 percent of the material could be available electronically. The remaining 5 percent may still have to be printed for one reason or another, even though it originated in electronic form.
The changes to the FDLP are being worked on “in the field” through meetings with FDLP members. The agenda for the recent meeting of the FDLP Council was primarily about plans for the future. GPO is also talking to the Association of Research Libraries and conducting conference calls with libraries of various types to identify their specific concerns and ideas. The environment of GPO and its depository libraries is not out of step with what is happening widely to libraries. The tone has been very open and there is excitement about what the future might hold.
Many of the depository libraries have critical space issues. Sharing facilities is being considered to reduce the physical needs at the regional levels. They are also looking at issues of digitization and retrospective cataloging of historical content.
"Changes in IT Infrastructure"
Judy Russell, Superintendent of Documents, for Reynold Schweickhardt, IRM Policy Manager, who was unable to attend
GPO is creating a centralized IT infrastructure using an enterprise architecture model. They are also working with Congress on storage and backup for disaster recovery. Congress is planning a remote facility, which GPO may use as well. Live mirror sites are planned for GPO Access. Benchmarking best practices of other organizations is being done.
GPO’s Office of Innovation and New Technology has been looking at best practices in three areas -- authentication, version control and preservation. Authentication involves the question of how downstream users know that the item is from GPO and is official. Version control manages the problem of which version should be maintained. Preservation is both immediate and in the future. In the near-term, GPO will migrate everything since 1994 that is in GPO Access to a new system. This first major initiative will include digital signatures and the use of XML. GPO believes that the preservation aspect is best handled in a centralized fashion, migrating only once but providing distributed access.
There is a significant, impending preservation issue with CD-ROMs. Many of the older CDs in GPO’s collection are rapidly aging, with obsolescent operating systems and search engines. A systematic audit of the problem is needed. GPO must work to extract and preserve the content. Ms. Russell suggested that this is an area that CENDI could investigate jointly.
Metadata will be critical in order to let people know the work exists. Many libraries put the GPO metadata into their own integrated library system catalogs. GPO is purchasing an ILS, which will allow it to run a portal for searching across its collections. They haven’t presented plans or a budget to Congress, but they must present a consensus across the various stakeholder groups in order to succeed. Congress wants to see a unified constituency. Key elements of the plan will be reflected in the FY05 budget. However, it will be important to make an investment in FY04 for the GPO Access migration.
GPO also must work with the publishing agencies as well as others in the government
to define services now and into the future. There can be positive impact on
the agencies depending on what happens with GPO and Congress. The NCLIS Study
has been referenced often, particularly the need to set R&D money aside
for information dissemination. It is probably easier to sell this concept with
STI. STI could serve as a model for dissemination and preservation initiatives.
"Authentication of Electronic Federal Government
Information"
Richard Davis, Director, Office of Electronic Information Dissemination Services
GPO’s work on public key infrastructure (PKI) and digital signatures centers around several key questions that arise in the electronic environment. Is what I have authentic? Has it been modified? What type of non-repudiation exists to prove that you are truly the originator?
PKI is based on asymmetric key pairs: public and private keys. A certificate authority issues the certificates that bind each public key to its owner. Digital signatures use the PKI for various security functions. The result is a verifiable record, and the user is assured of the authenticity of both the document and the sender.
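As a minimal illustration of the signing and verification steps described above, the following Python sketch uses a toy RSA key pair built from two small primes. It is an assumption-laden teaching example, not GPO's or Entrust's implementation: the primes, the sample document text, and the function names are all hypothetical, and keys this small offer no real security. It also omits the certificate authority, which in a real PKI vouches for who owns the public key.

```python
import hashlib

# Toy RSA key pair from two small primes (illustration only, NOT secure).
P, Q = 61, 53
N = P * Q                    # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                       # public exponent, shared with everyone
D = pow(E, -1, PHI)          # private exponent, kept secret (Python 3.8+)


def digest(document: bytes) -> int:
    """Hash the document and reduce the hash into the toy modulus range."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N


def sign(document: bytes) -> int:
    """Sign with the private key; only the key holder can compute this."""
    return pow(digest(document), D, N)


def verify(document: bytes, signature: int) -> bool:
    """Check a signature using only the public key (E, N)."""
    return pow(signature, E, N) == digest(document)


# Hypothetical sample document.
bill = b"Sample bill text, 108th Congress"
signature = sign(bill)
print(verify(bill, signature))  # True: the document matches its signature
```

A document altered after signing will normally fail `verify`, because its hash no longer matches the signed value. With a modulus this small, accidental matches are possible, which is one reason production systems use keys of thousands of bits.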
A free plug-in reader from Entrust, Inc., allows users to verify PDF files. All users can view the digital signatures, but they cannot verify them without the plug-in. GPO is currently working with the vendor on compliance with Section 508 of the Rehabilitation Act.
GPO anticipates initial implementation in three to six months. One of the first uses of the technology will be with the bills of the 108th Congress. In the future, GPO plans to become a certificate authority under the US Federal Bridge Authority. Customers will include government-to-government and government-to-citizen. Certain management controls are needed in this process, and GPO is currently working on certificate policy and practices documents. GPO is also planning pilot projects and exploring business models that would include providing authentication services for others.
A key question is whether the digitally signed version will be considered “official”. Ms. Russell sees the Federal Register going that way. The final decision about this issue will depend on the publishing agency. The technology provides for a chain of provenance through multiple PKIs that could be reflected in Westlaw or in Thomas. Authentication can be transferred back to the originating agency and the software will still work. GPO is interested in discussing these workflow issues with CENDI members and other stakeholders.
Federal Information Policy Interaction: Discussion on E-Government and Related Policies
Brooke Dickson, Policy Analyst, Information Policy and Technology, Office of Management and Budget
On April 17, 2003, the Office of Management and Budget (OMB) officially opened
the Office of E-Government mandated by the E-Government Act signed into law
December 2002. Mark Forman heads the Office of E-Government and Information
Technology. Directly under Mr. Forman is the Chief Technology Officer, Norman
Lorentz, the portfolio management office for the 24 e-government initiatives,
and the Federal Enterprise Architecture Program Management Office with Robert
Haycock as Director. The Information Policy and Technology Group headed by
Dan Chenok with a staff of 12 to 14 now also reports directly to Mr. Forman.
This office is responsible for the Paperwork Reduction Act (PRA) policies,
IT Security, the Privacy Act, PKI/Authentication, and E-Government policies.
The Office of Information and Regulatory Affairs (OIRA) under John Graham has
a staff of approximately 52. It is at the same level as the E-Government Office
since both are statutory organizations. OIRA is now emphasizing its regulatory
responsibilities. The Information Quality Guidelines, which went into effect
last year, have been taken over by the Statistical Policy Office under OIRA.
The Guidelines are still considered part of PRA but the monitoring responsibility
resides with OIRA.
General agency implementation guidance for the E-Government Act was released on May 30, 2003, for agency comment. Comments were due on June 2, 2003. Signed guidance is expected to be released by June 6, 2003. The Guidance gives the Chief Information Officers (CIOs) responsibility for overseeing the implementation of the E-Government Act, but they need to work with librarians, content managers, records managers, webmasters, etc. The Guidance is at a very high level and uses the statutory framework. The Office of Management and Budget (OMB) realizes that specific guidance is needed for certain sections; for example, with regard to the IT workforce from the Office of Personnel Management (OPM). Similarly, further guidance on Section 207 is expected to come from the Interagency Committee.
Section 207 of the E-Government Act calls for formation of an Interagency Committee on Government Information. Ms. Dickson has been approaching various groups, including CENDI and FLICC, for nominations of officials and staffers. Invitations are expected in a few weeks in order to get the committee formed by the end of June.
The goal is to keep the Interagency Committee small with no more than 50 people. The task group model may be used for specific, detailed activities or for cross-cutting issues such as metadata or taxonomies. OMB does not want to impose a structure but to let the committee decide what structure is best to accomplish its work.
Some activities of the Interagency Committee are specified in the Act while others will come out of the discussions. The Interagency Committee has five main areas of work: categorization and indexing, recommendations on long-term preservation, standards for web site presentation, development of a public domain directory, and a federal R&D tracking system. It is expected that the activities will get some high-level attention; there are work products due in the first year. These will dictate initial priorities.
OMB is still working on the Content Management model for agency web sites. This includes presentation standards for agency web sites to give a common look and feel, required links, security and privacy statements, the required link to FirstGov, and search capabilities. They are still looking for the specific connections between this model and the Federal Enterprise Architecture’s (FEA) Data Reference Model. Metatagging will probably be primary and will develop from the work of the XML Working Group as well as from what the agencies are already doing. The Data Model is trying to get to semantics rather than just syntax.
Intellectual property concerns need to be addressed. A-130, particularly section
8A, will likely be revisited to incorporate the Federal Enterprise Architecture
and the E-Government Act. There is no working timeline for this review. Several
CENDI members noted that Section 8A seems to work well. Ms. Dickson noted that
an initial survey of A-130 showed that this section did not require major changes
but only a few updates, including appendices and mention of technologies. A-130
should take into consideration the increasing importance of the electronic
environment, metadata, and life cycle management in the digital environment.
Google: History, Plans, and Opportunities for Partnerships
Celeste Chung, Senior Manager, Google Business Development
Dr. Walter Warnick, DOE OSTI, set the stage for the Google presentation. He identified the surface and deep web, with the intersection of these two areas at the front pages of the underlying databases. It is estimated that there are 3 billion pages indexed by Google, and perhaps 100 times that many pages in the deep web that are inaccessible to Web search engines such as Google. OSTI is working to make DOE’s information more visible. Google will soon be indexing OSTI’s R&D literature from the full text of the documents.
Celeste Chung then gave a brief history of Google and discussed the kinds of partnerships in which they are involved.
Google began as a research project at Stanford in 1995. It was incorporated in mid-1998. Google now employs more than 800 people worldwide. About half the workforce are engineers; several are senior research scientists who work with the algorithms on which Google is based.
Google has grown from 3 million queries per day on 2100 computers in September of 1999 to more than 200 million queries on more than 10,000 computers today. Google, a private company, has a market share of about 40 percent.
The mission is to organize the world’s information and make it universally accessible
and useful.
Google considers itself to be a technology company. It refers more search traffic
than any other provider. It does not consider itself a portal, and there is
very little advertising.
Ms. Chung outlined the search differentiators for Google – Relevance, Integrity, and Performance. Google is well known for its relevance ranking. Most people are familiar with Google’s PageRank technology and link analysis based on popularity. However, there are over 100 factors (link text, font size, proximity, etc.) that Google uses in its relevance ranking.
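The link-analysis idea behind PageRank can be sketched with a short power iteration. The Python example below follows the simplified, published form of the algorithm only. It does not represent Google's production ranking, which per the presentation combines over 100 factors. The site names, the damping value of 0.85, and the iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to c.example.
web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this small graph the most linked-to page, `c.example`, ends up with the highest score, and the page it links to inherits much of that weight.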
Secondly, Google is known for its Integrity. A Google search is an objective search; there is no way to pay to be in the basic search results. The automated crawl makes it objective, and there is no human editing.
Google’s Performance is good even on slow modems. The key is that Google crawls the content and then performs the heavy processing on the Google computers.
Currently, Google has indexed over 4 billion web documents, including 3 billion web pages and 425 million images. This includes more than 35 million non-HTML pages in formats such as PDF and Word. It has also purchased the Usenet messages, which are available as Google Groups.
Google is beginning to grow into vertical markets and it is changing its algorithms to meet these market needs. Eventually, Google wants to combine these approaches to perform a single search.
Google has several business models. These include advertising, through a product called Sponsored Links, and licensing of its search services. AdWords is a product geared toward small business. More than 130 different partners license the search engine, including AOL and Yahoo. The Google Search Appliance (Google Box) is hardware and software that can be installed behind firewalls. It will crawl an organization’s domain and integrate with external searching.
In terms of federal information, a Google crawl can be restricted to government sites. This would significantly increase outreach and dissemination. A search can be directed just against an agency’s web site or part of a web site. Google estimates that there are 98 million government documents in the current index. It averages about 50,000 queries per month on the Air Force sites and about 90,000 on NASA sites.
Crawling of deep web content is being done with several partners. The arXiv archive (headed by Paul Ginsparg) saw a 50 percent increase in usage after allowing a crawl without any additional promotion. Approximately 60 PubScience publishers have expressed an interest in being crawled and discussions are underway with others. A project with Stanford’s HighWire Press will include more than 300 journals. Google has projects underway to crawl DOE OSTI, GPO and PubMed records.
Selective crawling requires creation of a site map, which allows homepage JavaScripts that control access to be bypassed. Some of this information is fully available to the public while in other cases it is restricted. Access restrictions can still be imposed at the partner’s web site. The crawling simply feeds the index and then, depending on the access restrictions and the authentication of the user, the full text or just an abstract could be displayed. In some cases, adjustments have been made to the crawler or to the algorithms. Google will consider new and different ideas as part of joint R&D projects.
Information is available about how to set up a site for the Google crawler. Google cannot crawl the site if it is produced from dynamic content, if access depends on boxes, forms, or JavaScript, or if the site excludes crawlers through a robots.txt file or a robots (no crawl) tag in the metatags.
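To show how a robots.txt exclusion affects a crawler, the following Python sketch uses the standard library's `urllib.robotparser` to test whether a given user agent may fetch a URL. The robots.txt rules, the host name www.example.gov, and the paths are hypothetical and were not part of the presentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule for all crawlers, one for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow this agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_crawl("Googlebot", "https://www.example.gov/reports/a.pdf"))
print(may_crawl("Googlebot", "https://www.example.gov/drafts/b.pdf"))
print(may_crawl("OtherBot", "https://www.example.gov/internal/c.html"))
```

The first check is allowed and the other two are refused: a crawler follows the group that names it when one exists and otherwise falls back to the `*` group. A site that wants its content indexed therefore needs to confirm that no rule excludes the crawler from the relevant paths.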
There is currently no specific access to science information on Google, and this might be an interesting product from a partnership of Google and the CENDI agencies. Such a product could eventually accommodate features such as citation linking and peer review.
Governments are developing metatag sets, but these are either ignored by Google or are treated just as any other word. Ms. Chung indicated that Google is focused on the crawling of full text and believes that this, along with its algorithms, provides results more economically than the creation of expensive metadata. Date searching is not part of the .com product but it may be added, as it has been for the news search.