CENDI PRINCIPALS AND ALTERNATES MEETING
National Technical Information Service
Department of Commerce, Washington, D.C.
April 6, 2001
WELCOME
Kurt Molholm, CENDI Chair, and Deputy Administrator of the Defense Technical Information Center (DTIC), opened the meeting at 9:00am. He thanked NTIS and the Department of Commerce for hosting the meeting. Introductions were made. John Rumble of NIST introduced Charles Sturrock, who will be heading the Standard Reference Data Program at NIST while John is on detail.
TECHNOLOGIES FOR INFORMATION EXCHANGE: FROM INTERCHANGE TO INTEROPERABILITY
"XMLXMLXML: An Executive Overview of What It’s All About"
Mark Gross, President, Data Conversion Laboratory (DCL)
DCL is primarily involved in converting legacy information for access by new technologies. Their clients include libraries, publishers, technical societies, Defense agencies, e-commerce organizations, and industry requiring technical documentation. Recent work has involved the conversion of technical documentation, including military manuals.
The current industry environment is a virtual "alphabet soup" of alternative technologies and data standards. These include HTML, native formats, TIFF, PDF, and SGML. XML and SGML are closely related (XML is a simplified subset of SGML); XML is often referred to as a "starter kit" for SGML.
When addressing conversion and storage format issues, it is important to ask the question, "What are the data use issues?" The first is whether there is a need to distribute page image representations. Sometimes this is critical for page integrity and authentication purposes.
Repurposing of information makes it available for derivative uses in new versions. (Even taking the content from the Web to the browser display is a type of repurposing.) This is especially important as information is moved to small hand-held devices.
Searching is the ability to find information through text and other advanced searches that depend on the context. Library science’s understanding of search techniques is far ahead of the Web’s use of them; the Web is just now catching up. For example, back-of-the-book indexing provides context for the use of individual terms.
Component reuse is the creation of boilerplate and the assembly of portions of documents into another product. This approach provides efficiency when creating a variety of specific technical manuals that have some pieces in common and others that diverge.
Another data use issue is the enforcement of data standards: the ability to ensure that information is produced consistently and meets corporate standards.
Interchange with vendors, customers, and the world is the ability for others to use your information for communication and to incorporate it into their own products. In a commercial world, this allows people to license information for other purposes. Some content providers are now making more money from licensing their information than they are from directly selling their own products.
Mr. Gross then mapped these use issues against the various formats: native, TIFF, PDF, HTML, SGML and XML. Over the past several years, efforts have moved from TIFF images (as the electronic equivalent of microfiche) to PDF. HTML was a half-way solution. SGML and XML are also not completely satisfactory if the goal is to reproduce an image of the document.
HTML and XML both include tagged text, but they are extremely different. HTML simply describes how the information should appear to an HTML browser. XML tells you what the information means. XML is destined to become the mainstream technology in Web applications where there is a high degree of reuse, interchange and automation required. The tag is separate from the formatting, providing more flexibility. XML also allows for more sophisticated information and for contextual information that bridges the world of content and databases. XML and SGML provide more flexibility, but they are not the best for distributing page images.
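The difference between presentation tags and descriptive tags can be sketched in a few lines of Python. This is only an illustration using the standard library; the two fragments and the tag names (part, number, torque) are invented for the example and do not come from the presentation.

```python
import xml.etree.ElementTree as ET

# Presentation markup: the tags only say how the text should look.
html_fragment = "<p><b>Part 4471</b> <i>Torque: 35 Nm</i></p>"

# Descriptive markup: the tags say what each piece of text is.
# Tag and attribute names are hypothetical.
xml_fragment = """
<part>
  <number>4471</number>
  <torque unit="Nm">35</torque>
</part>
"""

def read_part(xml_text):
    """Pull typed values out of the descriptive markup by tag name."""
    part = ET.fromstring(xml_text)
    torque = part.find("torque")
    return {
        "number": part.findtext("number"),
        "torque": float(torque.text),
        "unit": torque.get("unit"),
    }

def bold_text(html_text):
    """The most the presentation markup can offer is 'whatever is bold'."""
    root = ET.fromstring(html_text)
    return [b.text for b in root.iter("b")]

record = read_part(xml_fragment)
bold_runs = bold_text(html_fragment)
print(record)
print(bold_runs)
```

A program reading the second fragment can ask for the torque by name and get a number with a unit; a program reading the first can only guess from the formatting.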
How does the computer know what to do with the tags and how do other people know what they mean? This is the purpose of the Extensible Stylesheet Language (XSL). The style sheet language is used to interpret the content. Therefore, it is possible to remove the tags, reorder them or move them around for presentation from the same raw file. Context sensitivity can be handled by XSL, so the style can vary based on all the ancestors, descendants and siblings of an element. This provides formatting flexibility based on the context or position of an element within a document. XSL supports both printing and online display from the same tagged format.
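The idea of one tagged source with several presentations can be imitated in Python. This is not XSL: the Python standard library has no XSLT processor, so two plain functions stand in for two style sheets. The element names and sample text are invented. The sketch also shows a simple form of context sensitivity, where a title is styled differently depending on its parent element.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged source; the element names are invented.
SOURCE = """
<manual>
  <title>Pump Maintenance</title>
  <section>
    <title>Safety</title>
    <para>Disconnect power first.</para>
  </section>
</manual>
"""

def render_text(elem, parent=None):
    """'Print' style: a title is formatted according to its parent."""
    lines = []
    if elem.tag == "title":
        text = elem.text.strip()
        # Context sensitivity: same tag, different style per parent.
        if parent is not None and parent.tag == "manual":
            lines.append(text.upper())
        else:
            lines.append("* " + text)
    elif elem.tag == "para":
        lines.append("    " + elem.text.strip())
    for child in elem:
        lines.extend(render_text(child, elem))
    return lines

def render_html(elem, parent=None):
    """'Online' style: the same tags mapped to HTML headings."""
    out = []
    if elem.tag == "title":
        level = "h1" if parent is not None and parent.tag == "manual" else "h2"
        out.append(f"<{level}>{elem.text.strip()}</{level}>")
    elif elem.tag == "para":
        out.append(f"<p>{elem.text.strip()}</p>")
    for child in elem:
        out.extend(render_html(child, elem))
    return out

root = ET.fromstring(SOURCE)
text_view = render_text(root)
html_view = render_html(root)
print("\n".join(text_view))
print("\n".join(html_view))
```

Neither function changes the source; each one only decides how the same tags are presented.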
The Document Type Definition (DTD) is a formal set of rules for describing the structure of a document. It includes element definitions, attributes for the elements, entity definitions that allow you to associate a name with some fragment of a document, and notation definitions that may link to external data. One of the great strengths of XML is that it allows you to create your own tag names. A DTD is similar to a database schema in that it tells you the names for the elements and key information about what content will be found within them. It helps to determine what the document can look like, the order of its elements, what names have been defined, etc.
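A small example may make the parts of a DTD concrete. The document below is invented: it carries element, attribute and entity declarations in its internal subset (notation declarations are left out for brevity). The Python standard library parser used here reads the declarations and expands the entity, but it does not validate the document against them.

```python
import xml.etree.ElementTree as ET

# A hypothetical document with its DTD in the internal subset.
DOCUMENT = """<!DOCTYPE report [
  <!ELEMENT report (title, author+, body)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT body   (#PCDATA)>
  <!ATTLIST report id CDATA #REQUIRED>
  <!ENTITY agency "Example Agency">
]>
<report id="R-001">
  <title>Annual Summary</title>
  <author>A. Writer</author>
  <body>Prepared for &agency;.</body>
</report>"""

root = ET.fromstring(DOCUMENT)

# The entity reference &agency; is replaced by the text it names.
body_text = root.findtext("body")

# The order of child elements matches the declared content model
# (title, then one or more authors, then body).
child_order = [child.tag for child in root]

print(body_text)
print(child_order)
```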
For XML (unlike SGML), the DTD is optional because an XML parser can process any well-formed document without one. However, it is important to remember that if you do not develop a specific DTD, there are limitations as to what you can do in XML. A DTD is particularly important if you are developing an authoring template or want to ensure good validation.
Use of a DTD allows you to use an XML parser to validate a document based on the rules that are provided. There are some standard DTDs for different document types and different subject communities that can be used. Industry groups are putting together standard DTDs. Unfortunately, organizations often think that they have their own unique needs and therefore do not take advantage of DTDs that already exist. Standard DTDs can be used as a base for further development, allowing for unique elements to be added as well.
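Validation against declared rules can be sketched as follows. This is a hand-written stand-in, not a DTD validator: the Python standard library does not include a validating XML parser, so a small rule table plays the part of the DTD. The rule format, element names and sample documents are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a DTD: for each element, the
# exact sequence of child tags it must contain.
RULES = {
    "article": ["title", "abstract", "body"],
    "title": [],
    "abstract": [],
    "body": [],
}

def validate(xml_text, rules=RULES):
    """Return a list of rule violations; an empty list means the document conforms."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # A document that is not well-formed cannot be checked further.
        return [f"not well-formed: {exc}"]
    errors = []
    for elem in root.iter():
        if elem.tag not in rules:
            errors.append(f"undeclared element <{elem.tag}>")
            continue
        children = [child.tag for child in elem]
        if children != rules[elem.tag]:
            errors.append(
                f"<{elem.tag}> has children {children}, expected {rules[elem.tag]}"
            )
    return errors

good = "<article><title>T</title><abstract>A</abstract><body>B</body></article>"
bad = "<article><title>T</title><body>B</body><note>N</note></article>"

good_errors = validate(good)
bad_errors = validate(bad)
print(good_errors)
print(bad_errors)
```

A shared rule set of this kind is what lets a community check that documents from different producers have the same structure. A real DTD expresses richer content models (optional and repeated elements, attributes) than this table does.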
Real world applications are particularly prevalent in electronic learning and testing procedures. For example, DCL has produced 3 million pages in each of the last five years. DCL produced a large number of pages for the Library of Congress product to make 18th and 19th Century rare documents available. It was impossible to have a very specific DTD. Some hand coding was done, including the Congressional Committee and the dates. The image has been preserved and there is SGML behind it, which can be searched as full text with tagged SGML content.
In another example, on an on-going basis DCL converts over 400 medical journals to allow their publisher to make them available online. They have developed a core medical journal DTD, which they use for putting the journals on the Web. He believes that this type of consensus around disciplines and document types is extremely worthwhile.
It is important in any of these projects to determine if the benefits are worth the cost. It is an order of magnitude greater cost to create fully tagged XML or SGML when compared to the cost of creating PDF images. If the organization is interested only in dissemination without the other added benefits then it is probably not worth the added cost. Also, XML and SGML are not the silver bullets for everything. They are not print formats and they are not suitable for unstructured information.
Any XML project requires planning. It is important to remember that not all XML is created equal. There will always be new uses so there will be the need to modify the schema over time. Authors need to be retrained to support the use of XML. Leveraging the initial investment in DTDs, infrastructure and standards should be considered. A properly executed SGML conversion will be XML compatible.
Mr. Gross provided an extensive glossary of XML-related terms and several web sites. These are available in the handouts. He can be contacted at Data Conversion Laboratory, Fresh Meadows, NY 11365, phone: 718/357-8700 x 211, fax: 718/357-8776, or e-mail: markgross@dclab.com.
"XML and LabBook: A Case of Interoperability"
Keith Montgomery, President, LabBook, Inc.
LabBook is a specific case of the use of XML as an enabling technology. Mr. Montgomery noted that the product was driven by the fact that scientists want things now and they have a limited amount of time for training. If your users must deal directly with XML, you will fail. LabBook turns XML into a visual presentation. Using XML as the bridge, LabBook is able to combine diverse applications, heterogeneous databases, Web sites and other information.
Mr. Montgomery sees XML as a new language for science. They have developed the BioSequence Mark-up Language (BSML) that sits on top of XML and allows the integration of the various genomic related resources. This allows information that has been encoded using BSML to be integrated and analyzed by the scientist for new associations. The visualization is dynamic and based on the marked up content. They have provided BSML to the community as a non-proprietary format; 40 percent of the XML users in the pharmaceutical industry are now using BSML.
Because standard web browsers are not domain aware, LabBook has developed a "biology smart" browser. This provides a communication device that allows scientists to document and edit for collaborative purposes. This is important since the lab book is, to the scientist, actually a legal document. Only about 20 percent of genomic information has any documentation, such as journal literature, associated with it.
Mr. Montgomery then showed a live demonstration of an XML document. He showed how he could convert from a nucleotide view to a protein view of the same information, based on the XML mark-up. Later, this will be enhanced to move from a protein view to a chemical view. There is also a connection to NCBI and to MEDLINE.
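The switch between views of the same marked-up data can be illustrated with a short Python sketch. The markup below is invented and is not the BSML schema, and the codon table is a four-entry excerpt of the standard genetic code, just enough for the sample sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; these element and attribute names are NOT BSML.
RECORD = '<sequence id="demo" molecule="dna">ATGGCTTGGTAA</sequence>'

# A deliberately tiny excerpt of the standard genetic code.
CODONS = {"ATG": "M", "GCT": "A", "TGG": "W", "TAA": "*"}

def nucleotide_view(elem):
    """Show the sequence exactly as stored."""
    return elem.text

def protein_view(elem):
    """Translate the stored DNA codon by codon until a stop codon."""
    dna = elem.text
    residues = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino = CODONS.get(dna[i:i + 3], "?")  # "?" marks a codon not in the excerpt
        if amino == "*":
            break
        residues.append(amino)
    return "".join(residues)

seq = ET.fromstring(RECORD)
as_nucleotides = nucleotide_view(seq)
as_protein = protein_view(seq)
print(as_nucleotides)
print(as_protein)
```

Because the markup states that the content is a DNA sequence, both views can be derived from one stored record rather than kept as separate files.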
Applications of the LabBook go beyond the bench scientist. They have linked this product to patents for genomes. Other legal and technical applications are being developed. BSML-type mark-up can be developed for other applications and disciplines.
XML is a major paradigm shift. You no longer have information; you have data. This allows for repurposing of data and for integration of various data formats that would otherwise be incompatible.
"The CIO’s XML Working Group"
Owen Ambur, Co-Chair, CIO XML Task Group / U.S. Fish and Wildlife Service
XML will happen with or without the involvement of the CIOs because vendors are racing to implement it in their products in the marketplace. Thus, the issue for the CIOs is how they can make it happen faster, better and cheaper than might otherwise occur in government agencies. Two specific potentials include: 1) the use of XML metatags to classify and manage records Government-wide, and 2) the use of XML to render all government forms on the Internet as well as to gather the data from them for record-keeping and capture into databases.
Based upon the identification of those two potentials, the CIO Council’s Enterprise Interoperability and Emerging IT Committee (EIEITC) authorized an ad hoc study group for 60 days. Subsequently, as recommended by the ad hoc study group, the EIEITC chartered the XML Working Group to perform four activities:
For the latter, they have established the xml.gov web site. They are seeking comments on the web site and what could be done to enhance it.
The XML.gov site has a section called "efforts" at which agencies can describe their XML projects and implementations. The group is trying to determine the need for a registry of XML elements, DTDs, and schemas that are "inherently governmental" in nature. Under OMB Circulars A-76 and A-119, agencies are encouraged to work with commercial organizations to develop and implement "voluntary consensus standards" and, whenever possible, to use such standards that have already been developed. However, there are government-unique information needs that must be addressed. One such area is the need to identify the metadata related to records management and retention under the Federal Records Act.
The Working Group meets on the Wednesday before the third Thursday of each month. Teleconferencing is available.
NTIS Futures: From Strategic Initiatives to Infrastructure Modernization
Alan Neuschatz, Deputy Director and NTIS Associate Director
Alan Neuschatz, Deputy Director of NTIS, spoke about NTIS’s strategic plan. NTIS is an old and experience-rich organization, which, for many years, has been caught in a time warp. Increasingly, they are evolving because the technology is changing the marketplace and the marketplace changes are eroding their current financial position. For a while, they were in a very tenuous position. However, they have now turned the corner. They have substantially changed their business model, posted profits for the last two years, and recognize that a strategic plan is necessary.
The core of NTIS’s difficulty is that, because they are a full-cost recovery agency, they are forced to charge prices that are out-of-step with the demands from users. The senior staff has developed a plan that calls for several activities to address this challenge.
The first is to change the business model to one of partial appropriations and partial cost recovery. Congress will be asked to fund the core government functions with a modest appropriation. There would be free and widespread public access to the NTIS database. Linkages would be provided to free text on the host/originating agency’s site. NTIS would continue to provide products in multiple formats (CD-ROM, fax, e-mail, print, etc.) at a cost. Full text access would be provided to the 3 million documents in their archive when they are not available at the originating agency site. This would be done at an incremental cost for electronic delivery. There would also be a cost for print and other media. The 2003 budget will be impacted by these recommendations. They are optimistic that these recommendations will be received favorably by the Administration.
These changes will also have a major impact on their main business units. Mr. Neuschatz then asked each of the business unit managers to give their perspective on the impact of this strategic plan on their business.
Wally Finch–Business Development
Outreach will be an increased focus for NTIS. They are already becoming more active through memberships in NFAIS and ICSTI. They are also revisiting the American Technology Preeminence Act (ATPA), which gives the legislative collection mandate for NTIS and requires agencies to send their documents. The wording is currently very paper-based. NTIS is planning to suggest wording that would work more effectively in the digital environment. GAO recently visited NTIS. They are doing another study to look at ATPA and agency compliance.
Mr. Finch also announced that NTIS is planning to become a naming authority for digital identifiers (Handles). They are already working with DTIC to determine how the two organizations might interact and where they can support each other in the development of such an infrastructure. The goal is to use Handles if they have already been assigned by an organization like DTIC; otherwise, NTIS will assign a Handle. Both will be stored in the Handle Data Store. However, multiple resolution will be available, so that the user can get the information from the originating organization or from other sites where it is stored (e.g., NTIS). Alternatively, NTIS will provide the electronic version for those agencies who do not want to manage their own archives and a link to document delivery from NTIS in a variety of formats. PURLs will also be supported.
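Multiple resolution can be pictured as one identifier mapped to several locations, with the user or system choosing among them. The Python sketch below is only a data-structure illustration; the handle strings, site labels and URLs are hypothetical, and it does not use the actual Handle System software or protocol.

```python
# Hypothetical handle records: one identifier, several locations.
HANDLE_STORE = {
    "demo.prefix/report-001": [
        {"site": "originator", "url": "https://originator.example/report-001.pdf"},
        {"site": "archive", "url": "https://archive.example/report-001.pdf"},
    ],
    "demo.prefix/report-002": [
        {"site": "archive", "url": "https://archive.example/report-002.pdf"},
    ],
}

def resolve(handle, prefer=None, store=HANDLE_STORE):
    """Return every known location, with the preferred site (if any) first."""
    locations = store.get(handle, [])
    # sorted() is stable, so non-preferred sites keep their stored order.
    return sorted(locations, key=lambda loc: loc["site"] != prefer)

def first_url(handle, prefer=None):
    """Return the best available URL, or None if the handle is unknown."""
    locations = resolve(handle, prefer)
    return locations[0]["url"] if locations else None

print(first_url("demo.prefix/report-001"))
print(first_url("demo.prefix/report-001", prefer="archive"))
print(first_url("demo.prefix/report-002", prefer="originator"))
```

The third call shows the fallback case: when the originating organization holds no copy, the identifier still resolves to the archived one.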
Janice Coe–Office of Media Services
Ms. Coe reported on the Office of Media Services at NTIS. She spoke about the media outreach campaigns for the Social Security Administration and continuing Department of the Treasury/IRS online projects. She also explained that OMS has a joint venture partnership with Navy Media to perform satellite broadcasting for agencies. This partnership uses satellite to disseminate information, such as women’s health issues for HCFA and history classes/reenactments for the Department of Education.
Ms. Coe also elaborated on a pending joint venture agreement to provide media asset management (MAM) for agencies. This partnership will allow NTIS to develop a pilot program using the assets of the National Audiovisual Center at NTIS. The MAM services include digitization, cataloging, text and scene indexing and searching for both the print and video assets of the Center. The pilot program will be developed with selected clips on the web. The pilot is being developed to allow new services to be offered to federal agencies.
Doug Campion–Production Services
This group deals with all the systems for acquisition, cataloging, inventory control, and output production. The staff was recently reorganized into a single group with three divisions: data capture and conversion, indexing, and storage. The size of the staff has decreased by approximately 50 percent since 1995. However, they continued to disseminate approximately 2 million information products for the agencies in FY2000.
In order to increase efficiency and to keep up with agencies that have put their information on the Web, but do not send it to NTIS, NTIS has increased the investment in electronic harvesting. A harvester is used to gather the reports from the Web, to download them in whatever format is available and provide them to the production process. Approximately 40-50 documents a day are handled in this way.
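One step of such a harvester, finding the report files linked from an agency web page, can be sketched in Python with the standard library. This is not a description of the NTIS harvester; the file extensions, page content and URLs are assumptions made for the example, and the download step itself is left out so that the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

REPORT_EXTENSIONS = (".pdf", ".doc", ".ps", ".txt")  # assumed formats

class ReportLinkParser(HTMLParser):
    """Collect links whose targets look like report files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(REPORT_EXTENSIONS):
            # Resolve relative links against the page address.
            self.links.append(urljoin(self.base_url, href))

def find_reports(html_text, base_url):
    parser = ReportLinkParser(base_url)
    parser.feed(html_text)
    return parser.links

# Hypothetical agency page.
PAGE = """
<html><body>
  <a href="reports/tr-2001-01.pdf">Technical Report 1</a>
  <a href="/about.html">About</a>
  <a href="https://agency.example/pubs/summary.TXT">Summary</a>
</body></html>
"""

found = find_reports(PAGE, "https://agency.example/library/")
print(found)
```

Each address in the resulting list would then be downloaded in whatever format it is offered and handed to the production process.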
A major issue with downloading from the Web is that the downloaded documents contain more color content than their print counterparts. Imaging and scanning in color will be available by mid-April.
The Storage Division is responsible for electronic and physical storage. The physical storage requirements are down from 25,000 square feet to 2,000 square feet.
There are approximately 250,000 titles in the ADSTAR system, which is the manufacturing hub of NTIS. Old documents are digitized into the ADSTAR system on demand. When undigitized material (pre-1997) is requested, the item is scanned and then printed from the scanned image. Others that are never ordered remain only in paper. The NCLIS study recommended a further study to identify how much paper is available at NTIS and what the value would be if it were converted to electronic form. Mr. Campion estimates that ½ billion pages are available in paper only.
Jon Birdsall–Acting Director of Customer Service
NTIS has noted that as the technology advances, so do the customer expectations and requirements. CD-ROM may be preferable for multiple documents. E-mail and fax can be performed directly from the ADSTAR system.
The drop call rate is monitored using an automatic call distribution system. Each representative can monitor what is happening from his/her desktop. A Customer Advocate provides the viewpoint of the customer in all decisions.
CISPUB is an off-the-shelf system purchased in 1995. They are working to replace it. COIN is the online ordering system, which may be replaced. The Bookstore in Springfield is available for walk-in purchases.
There are several popular specialty services provided by NTIS. The rush 24-hour service is extremely popular. NTIS also provides document certification for court cases. Tailored CD-ROMs can be provided, along with e-mail and fax.
All sales representatives are cross-trained in general searching and location of items via the STAR system. All have access to the Internet for reference questions. Everyone also has ADSTAR access so that changes can be made after the customer has entered the order. Individual training programs are being encouraged for staff.
Keith Sinner–CIO
From the IT standpoint, Mr. Sinner’s group is trying to use a business approach to IT to avoid "buying toys". They will use advanced but proven tools and technologies. There are several major areas of investigation at the present time: 1) fixes to the ADSTAR system to support up-front conversion from TIFF to PDF; 2) online web ordering and searching using the Cuadra STAR database directly, rather than an intermediate file; 3) a new order processing system that is based on a COTS product (this may require changes to their internal business practices, since the current system was tailored to NTIS’s business practices); 4) implementing Handles with multiple resolution; and 5) streaming video-audio, etc. All systems purchased and developed will be XML-compliant.
Discussion
The group asked about the current budget outlook for NTIS. Mr. Neuschatz reiterated that they are not looking for subsidies, but, instead, a change in the pricing structure. If the appropriations for inherent governmental activities are not forthcoming, it will be more difficult to change the pricing structure.
There has been no nomination for the Commerce Undersecretary for Technology, to whom both NTIS and NIST report. However, Secretary Evans mentioned NTIS at a recent senior staff meeting. Mr. Neuschatz reported that NTIS has a strong constituency and is statutorily based, which are both positives in their favor.
In terms of staffing, NTIS reduced its staff from 335 to approximately 195 without a Reduction in Force (RIF) process. They are currently under a department-wide hiring freeze until the high level official is appointed. The major challenge now is the mix of skills.