CENDI PRINCIPALS PLANNING MEETING

University of Maryland Conference Center
Adelphi, MD
September 7-8, 2005


MINUTES

Welcome
CENDI Roundtable on Agency Priorities and Opportunities for Cooperation

Government Printing Office
Defense Technical Information Center
Environmental Protection Agency
National Technical Information Service
NASA STI Program
National Agricultural Library
Department of Energy/ Office of Scientific and Technical Information
National Archives and Records Administration
National Library of Education
National Library of Medicine
US Geological Survey
National Science Foundation
Public/Private Relationships: New Challenges and New Ways of Doing Business
PubChem
Public Access Policy
Portable PMC
KEYNOTE: “Federal Priorities for Scientific Collections Management and What Data Are Next?”
Dr. Teresa Fryberger, Assistant Director for Environment, Office of Science and Technology Policy
User Expectations: CENDI Member Panel on Understanding and Reaching the User
Information Literacy: Google is not THE Answer
Show and Tell: What We Learn from Exhibiting
Expert Panels: Balancing Vision with Experience
GIS: The Visualization We Know and Love
User Expectations: Increasing the Value of Sci-Tech Intelligence through Enhanced Visualization


KNOWLEDGE DISCOVERY & MEETING NEW EXPECTATIONS

Chairman’s Welcome:  Knowledge Discovery and Scientific Advancement
Dr. Walter Warnick

Dr. Warnick welcomed everyone at 9:10 am. He gave a special welcome to Principals who have joined the group since the last planning meeting: George Strawn (NSF), John Sykes (EPA), Ellen Herbst (NTIS), and Paul Ryan (DTIC).

Dr. Warnick went on to set the stage for the planning meeting discussions. The world is entering a new era of knowledge diffusion and global discovery. Tremendous strides have been made since the early days of the Internet. At OSTI alone, the number of transactions is many hundreds of times greater than in the pre-Internet days.  It has been only 11 years since DOE posted its first homepage, but, today, those pages look very primitive.

Dr. Warnick compared the Internet to another transformational technology – the automobile. There are similarities in form and function between the Model A of 1908 and the Model T, which was introduced 11 years later. The future is unpredictable but we can make some predictions and assumptions. There is likely to be an Internet in 2015, but we don’t know its features. Similar to the way that the Model T revolutionized transportation and changed human behavior, the behavior of research scientists and the searching public will be transformed as the current implementations of search are improved.  

Our job at CENDI and within our agencies is to make use of the evolving Internet to diffuse knowledge from our agencies. There are several near-term opportunities to speed knowledge diffusion, including the launch of Science.gov 3.0, the AAAS Symposium and the NFAIS Annual conference. OSTI now has additional authority to aid in this knowledge diffusion. OSTI was created in 1947 but, for the first time, in 2005, it will become part of law through the Energy Policy Act of 2005. This is a major milestone with implications for broadening OSTI’s collection and dissemination policies. OSTI has been adding to existing databases and new databases are being added, including conference proceedings and patents.

Efforts are underway to transform precision search by deploying new technologies in Science.gov 3.0. Science.gov 4.0 will use Grid technology. OSTI and other agencies are making significant inroads with non-text materials such as numeric data, audio, and video that go beyond the traditional text environment. Dr. Warnick called for CENDI to leave the planning meeting with a roadmap and tangible activities for improving knowledge discovery and diffusion over the coming year.

 

CENDI Roundtable on Agency Priorities and Opportunities for Cooperation

The agencies were polled in advance of the meeting for one primary topic area representing challenges or opportunities for that agency.  An additional one or two topics of interest were also requested as optional. Each agency participated in a “round robin”. The topics are presented by agency below.  Themes and action items that emerged are incorporated under the Discussion of Proposed FY06 Activities and the table summarizing those discussions.

Defense Technical Information Center – Paul Ryan

DTIC is investigating visualization tools under the leadership of Jim Erwin. DTIC’s interest in visualization grew out of the three to four presentations on these technologies at the NFAIS conference last February.  DTIC is interested in identifying patterns and making connections, rather than in deep data mining.

DTIC became an independent field activity in June 2004. Within the last month, the charter outlining its authorities was signed by the Deputy Secretary of Defense.

DTIC’s major activities include the Research & Engineering Portal. The portal will bring content and tools together in one place. DTIC started with a database of interest to the Chief Technology Officer, bringing together the S&T Warfighting Plan, Congressional budget data, and employee information in a single structure, and added some analytical tools. The portal will also include a workspace. They are completing the third phase of a four-phase design. The third phase is to fully integrate the databases. Phase 4 is single sign-on. DTIC is working to provide integrated access to unclassified and unclassified-but-limited information. It is difficult to have the systems “talk” to one another. Marketing plans to spend $200,000 for dissemination and promotion of the portal. They are about three to four months into this already.

Last week, they went live with the Iraq Virtual S&T Library that provides access to journal literature to a select group of scientists and engineers from ten Iraqi universities. A briefing will be given to Karen Hughes at the Department of State to make connections between Arab and Western scientists. 

Discussion

Other agencies are also working on portal projects. EPA is developing a Scientist’s Workspace (see EPA briefing below). NASA is working on both intranet and Internet portals (see NASA briefing below).

NAL will be applying Explorit to its internal web sites.  Ms. Frierson asked if Explorit is used for any of the DTIC development, since it might be worth considering how the systems could be linked to provide a “private” or more focused Science.gov based on a common software platform. Presently, Convera and Verity are the search engines for the main systems at DTIC.

Government Printing Office – T.C. Evans

GPO is implementing its Strategic Vision. The agency is reorganizing into six business units called for in the vision and is nearing the procurement phase on the new system to support the vision, which includes non-text resources. The procurement will be coordinated by the CTO and the CIO. Mr. Evans will be moving to the Chief of Staff’s Office with responsibility for implementing the vision.  

In the interim, before the future digital system is in place, there will be a transition system based on Akamai.  Originally, they chose the Akamai system for better disaster recovery for GPO Access, but GPO found that Akamai provides better search results and improved speed. They will also take advantage of Akamai’s content delivery mechanism. Akamai puts the data much closer to the user through its large world-wide network of service. Using the Akamai service helps to smooth the bandwidth spikes. There will be firm, fixed addresses for some of the GPO information; this persistence will be important in the new system.

GPO’s Authentication Initiative is moving forward. A bulk signing tool is the next step. As material goes through GPO, it will automatically be signed and the material will be made available through GPO’s collection. As a result, these marked materials can be given back to the agency. If the material is handed off to a third party, it can easily be checked to make sure it hasn’t been altered. There is a mark on the document and this is part of the national bibliography. It can be viewed by free reader software that will be part of the next generation of Acrobat. The chain of custody (provenance) can also be checked through a special system. This authentication will be an integral part of the Congressional workflow, and other agencies, such as NARA, are interested. From the CIO side, they authenticate people. The workflow must marry these two authentications.
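The tamper check described above can be sketched as follows. This is an illustration only, not GPO's system: the minutes do not say which signing technology is used, and document signing of this kind is normally done with public-key certificates. The sketch swaps in a keyed hash (HMAC) from the Python standard library as a simple stand-in, and the key and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical demo key. A production system would use a private signing key
# and a public certificate rather than a shared secret.
SIGNING_KEY = b"demo-key-not-a-real-secret"


def sign(document: bytes) -> str:
    """Return a hex signature computed over the document's exact bytes."""
    return hmac.new(SIGNING_KEY, document, hashlib.sha256).hexdigest()


def is_unaltered(document: bytes, signature: str) -> bool:
    """Recompute the signature and compare it with the one issued earlier."""
    return hmac.compare_digest(sign(document), signature)


if __name__ == "__main__":
    original = b"Text of an official publication."
    mark = sign(original)
    print(is_unaltered(original, mark))
    print(is_unaltered(original + b" Extra sentence.", mark))
```

Any change to the bytes, however small, produces a different signature, which is what lets a third party confirm that a copy matches what was signed.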

Environmental Protection Agency – John Sykes

EPA is developing a Scientist Workspace with administrative, research and science, and environmental health areas. There will be access to 700 scientific libraries and links to Science.gov and Alerts.  A special security zone for researchers will facilitate external collaboration and the creation of groups. EPA is trying to open up ports so they can get out to the Internet. EPA has a science ftp which is used by over 100 researchers to work with 300 external collaborators. This is a major achievement after EPA’s internet was shut down several years ago because of security concerns. Collaborations can be set up in about 24 hours. In the future, there will be a web proxy server for the scientist workstation, providing secure, transparent access to all web sites without prior permission. Sixty EPA scientists will get high end workstations to work on the Grid.  Dr. Sykes will provide specifications and costs for the workstation.

EPA is also interested in Voice over IP (VoIP) and is about to implement it at its field station in Corvallis, Oregon.

Discussion

The Department of Commerce uses Cisco for Voice over IP. The DOE CIO’s office is also interested. It was suggested that this could be a topic for a meeting in the near future. NTIS could report on the lessons learned at the Department of Commerce.

National Technical Information Service – Ellen Herbst

NTIS has embarked on a strategic update which will result in a two- to three-year plan. NTIS spent the last few years achieving financial stability and focusing on business and dissemination activities. However, the acquisition, indexing, searching and other core mission activities have suffered. In October, a team will be formed to review the acquisition and associated workflows, looking at the IT infrastructure and the skills of its workforce. NTIS is also grappling with the challenge of making the entire collection perpetually available in electronic and physical media. As they are doing this activity, NTIS continues to seek collaboration opportunities with the federal STI community. NTIS will be looking for help from other agencies with their core mission, while they can help other agencies establish public/private partnerships.

For the most part, the “support of others” work that NTIS has taken on in the last few years is appropriate to their mission of information dissemination. NTIS will continue to provide support to others, but they will refocus on investing in the collection activities.

NTIS is in the process of reactivating its Advisory Board. The Board was announced in the Federal Register, and recommendations should be provided to Wally Finch by October 26, 2005.

The Department of Commerce is supportive of NTIS’s strategic changes and has stated this to OMB. However, they are still in a self-sustaining situation without appropriations. NTIS has managed to break even or show small retained earnings in the last five to six years. 

NASA STI Program – George Roncaglia

NASA is working on its Enterprise Architecture Review (the NASA version of a Strategic Plan). Key areas include digital preservation, disaster recovery and technology readiness. The goal is to focus resources on the most imminent readiness levels. NASA is also working with DTIC on deploying NASA’s version of Handles. A longer term aspect of the plan is a shared services center for disaster recovery purposes which will consolidate business and data warehousing.

NASA is also implementing a portal. While the NASA portal was originally a very diffuse effort, it is proving to be directly beneficial to the STI Program since it promotes its database. The STI Program is investigating how to make the NASA Database friendlier to harvesting or external searching such as Google. Database usage is increasing in direct proportion to the degree to which the database is exposed. NASA is interested in making its information more public. The boilerplate contract language has been changed to better protect NASA’s government rights.

Organizationally, the STI Program is part of the CIO’s Office and metrics are important. Through the budget defense process, Mr. Roncaglia has found ways to determine return-on-investment from STI.

Discussion

The group briefly discussed the Program Assessment Rating Tool (PART) Process.  Several agencies are in the middle of the process. NAL is about to start their process. NARA has completed the analysis in some areas. USGS has just completed the process.  It was suggested that an interest group might be formed or the PART process might be a topic for a future CENDI meeting.

National Agricultural Library – Peter Young

NAL is developing an electronic repository of Agricultural Literature, following NLM’s lead in the type of literature that would be deposited by USDA to improve access by researchers and the public. The Agricultural Research Service (ARS) publishes in journals. NAL is hoping to include intramural research. Policy, legal and technology issues must still be addressed. The services would be integrated into a seamless type of portal.

The Digital Desktop, which provides the full text of published literature to USDA researchers at the desktop, has been operational for more than two years. Use increased 80 percent between 2003 and 2004, and the same increase is expected between 2004 and 2005.

The objective next year is to expand the holdings as well as the contributions from the USDA researchers. Discussions are underway with the USDA General Counsel on intellectual property issues, including joint authorship. Extramural research funds will also be addressed. USDA provides $1-2 billion in grants and they estimate that they are getting about 7,000 of the 15,000 to 20,000 articles per year that appear in the commercial press as a result of this investment.

The ARS has a FACA advisory committee that has a task force looking specifically at NAL. NAL views this task force as a channel into the budget process.

Dr. Young asked about a coordinated federal STI response to the disaster in the Gulf Coast. USDA has a number of federal centers as well as funded projects in this area, as do many other agencies.

Department of Energy/ Office of Scientific and Technical Information – Sharon Jordan

DOE’s Data Management Initiative is seeking to clarify the relationship between the DOE STI Policy, which is generally seen as addressing published literature, and data. The managers of data centers, most of which manage information in a particular program or science area, met for the first time last year. A lot of data is being generated, but no one is really responsible for its preservation or access.

The data center managers group recommended that the STI policy be broadened to include numeric data.  The report from the meeting was issued to the Scientific and Technical Information Advisory Board (STIAB) which met in March. Dr. Ray Orbach is the chair and Dr. Warnick is the co-chair. The Advisory Board is vetting the policy revisions. Once the policy is approved OSTI will create further guidance on aspects to consider in the management of data.  OSTI might house the metadata which would describe the data, but not all the data. OSTI is starting to work with the designated data centers that are topically focused. Standards will be needed for describing the data, but the intention is not to be too prescriptive but to emphasize the need to consider the life of the numeric data as part of the planning activities when designing a project. The access decisions will be made by the individual project managers.

Another meeting is scheduled with the data center managers in October to get an update from Dr. Chris Greer on the National Science Board’s long-lived data report. Dr. Orbach will also discuss the importance of data, particularly with regard to simulation activities. 

DOE wants to disseminate its information more broadly, so OSTI has an initiative with OCLC to make its content accessible via Open WorldCat. OCLC recently harvested OSTI’s metadata for full-text R&D reports from 1994 to the present. The content will increase as the full-text collection is expanded. OCLC has designed the system on a new IT platform, and it can accept many types of data formats beyond the traditional MARC format. DOE made its database OAI harvestable, which made the harvest easy. As long as the metadata is well structured, there should be no problem. No money changed hands.
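For readers unfamiliar with OAI harvesting, the sketch below shows roughly what a harvester does under the OAI Protocol for Metadata Harvesting (OAI-PMH): it issues a ListRecords request and reads Dublin Core metadata from the XML response. The endpoint address and the sample record are invented for illustration; they are not OSTI's or OCLC's actual addresses or data, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only to show the request format.
BASE_URL = "https://repository.example.gov/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample of what a ListRecords response looks like.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample R&amp;D Report</dc:title>
          <dc:identifier>https://repository.example.gov/fulltext/0001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, from_date=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, title, and full-text link from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "link": rec.findtext(".//dc:identifier", namespaces=NS),
        })
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2005-01-01"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

The two things a harvester needs from each record, structured metadata and a link back to the full text, correspond to the title and identifier fields read here.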

Discussion

Ms. Frierson, who is on the OCLC Advisory Board, indicated that OCLC members are pushing OCLC to turn WorldCat into a one-stop shop beyond just an online catalog. WorldCat is on the open web through the Open WorldCat initiative. It can guide the user to the local system so the organization can charge for document delivery, if desired. This has impacted governance and pricing structures. If you are a member of OCLC, there is a small fee for Open WorldCat. Federal institutions generally have not kept up their representation on OCLC because it is done by contribution.

Other CENDI members are involved with WorldCat. DTIC’s unlimited, unclassified citations are available through Open WorldCat because CISTI puts them into Open WorldCat. However, the records point to the full text at CISTI rather than at DTIC. ERIC has encountered some issues when trying to open ERIC to Open WorldCat. There must be harvestable metadata and a hyperlink. 

CENDI members expressed an interest in finding out more about Open WorldCat.  It was suggested that Lynn McDonald at Fedlink be invited to speak at a future CENDI meeting.

National Archives and Records Administration – Nancy Allard

NARA is embarking on strategic planning to submit with next year’s budget. The new Archivist of the U.S. sees this as a way to engage stakeholders in determining where NARA should go in the next ten years. NARA is in the early stages of data collection. A draft plan will be available for comment next year.

NARA selected the final contractor for its Electronic Records Archive (ERA) System.  Lockheed Martin was awarded the contract after a prototype development phase which pitted it against Harris.

NARA has been working on a Record Management Profile of the Federal Enterprise Architecture (FEA) since January. The Interagency Committee on Government Information’s report to the Archivist recommended that this profile be developed in order to integrate records management policies as agencies are designing their FEA implementations. The draft was reviewed by the Federal Records Council and some IT community groups. The final recommendation will be sent to the CIO Council and OMB by the end of September. NARA is moving the current version of the profile forward as a partial solution since it may be helpful even in its current form.  

NARA is embarking on an online museum store for the National Archives downtown and for the Presidential Libraries. Ms. Allard is interested in experience related to online selling via credit cards. GPO, NTIS and DOE’s Science Technology Software Center sell via credit card.

National Library of Education – Luna Levinson

NLE is a year and a half into the new vision and mission of ERIC.  ERIC went through a massive transformation prompted by Congressional legislation and mandate to be more focused on evidence-based research.  The new ERIC is focused on the database; ERIC will not produce publications as it did in the past. Nearly 7,000 new records were released in August, including journal and grey literature. The latter will require a more concerted effort in the future.

Quality control will be a larger focus over the next few years, since machine aided indexing will be used by the contractor. A submission system for scholars to deposit the full text documents and abstracts has been developed and this is another area where quality will be monitored.

NLE is interested in structured abstracts. The National Center for Education Statistics already requires structured abstracts from its contractors. In other areas, it isn’t as easy to impose a structured abstract. The ERIC Advisory Group has created a subgroup to determine the structure for certain types of educational materials. NLE is interested in what other agencies are doing with structured abstracts.

National Library of Medicine – Dr. Elliot Siegel

Dr. Siegel described NLM’s long-range planning process. This is part of an ongoing process that NLM has been doing for about 20 years. The current plan was primarily staff generated five years ago. NLM is now going back to the stakeholders to develop the next plan. 

NLM takes the stance that they do not necessarily have the answers and so must rely on the outside. (For example, Newt Gingrich is on the Advisory Board and is also co-chair of a small working group.) A recent meeting brought leaders from business, health care, health policy, major electronic medical records system vendors, the library community and the information community together. Bruce James, the Government Printer, was also involved.

The vision for the future that resulted from the meeting is quite different from a federal library.  NLM would be the center of a series of national databases that go beyond journal and published literature, collecting information about genotypes and their impacts on health care. The future holds a way to harness this information and, when combined with the patient’s medical records, health care providers will be able to work in a more targeted way to provide better care.

This vision is an extension of NLM’s current mission. The current uneasiness surrounding this vision, including constraints about privacy and NLM’s status as a federal agency, will be part of a natural evolution.  NLM tends to move forward from a strategic plan in “chunks” rather than incrementally, while continuing to focus on the larger goal.

US Geological Survey

Biodiversity informatics is an emerging field.  Information on the natural world comes in a wide variety of formats, representing diverse disciplines and from a variety of sources. For example, migratory bird information is quite different from aquatic or fisheries information. Some of the challenges relate to the kind of data. Sometimes data are missing. Spatial data isn’t necessarily easy to merge; the recording location, dimension, or scale may differ. There are temporal data challenges -- the data needed for long-term studies may be unavailable. USGS/BIO is pushing these issues and how best to use current and future technologies to address them. They are participating in standards development with international bodies such as CODATA, the Global Biodiversity Information Facility and the Inter-American Biodiversity Information Network (IABIN).  Many of the issues are the same between the information and the data communities.  Data and metadata policies are in place, but implementing them is still difficult. 

Discussion

The group discussed the apparent lack of standards for citing data sets. This might be an interesting area to investigate.

National Science Foundation – George Strawn

NSF is using scientific information to more efficiently process the approximately 40,000-50,000 proposals received each year.  Almost all proposals are now received electronically. With the help of one of the division directors who is a specialist in textual information processing, they are piloting a system that has proven useful especially for large proposals, for dividing proposals into panels and for identifying panelists for the panels. The system creates a term vector for each proposal and creates connections among the proposals based on the vector analysis.  The term vector helps to do the initial binning and to identify orphan proposals pointing in directions that don’t match the average. These outliers can be read by the program officers to decide what “bin” they should go in. The system also helps to identify possible panel members. They apply the analysis to previous proposals to identify people who work in the same “vector space” who could serve as possible panel members. It also helps with policy questions as they seek diversity, including women and minority panel members.
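The term-vector approach can be sketched as follows. The minutes do not describe NSF's actual weighting or clustering method, so this is only a minimal illustration under stated assumptions: raw word counts stand in for the term vector, cosine similarity measures how closely two proposals point in the same direction, and a simple greedy pass does the binning. The function names, the similarity threshold, and the sample proposals are all hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count word occurrences; a stand-in for a real term-weighting scheme."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bin_proposals(proposals, threshold=0.3):
    """Greedy binning: join the first bin whose seed proposal is similar enough,
    otherwise start a new bin. Bins left with one member are treated as orphans."""
    vectors = {pid: term_vector(text) for pid, text in proposals.items()}
    bins = []  # each bin is a list of proposal ids; the first id is the seed
    for pid, vec in vectors.items():
        for members in bins:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(pid)
                break
        else:
            bins.append([pid])
    panels = [members for members in bins if len(members) > 1]
    orphans = [members[0] for members in bins if len(members) == 1]
    return panels, orphans


if __name__ == "__main__":
    sample = {
        "P1": "carbon nanotube synthesis and nanotube growth",
        "P2": "nanotube growth kinetics and carbon catalysts",
        "P3": "migratory bird population survey",
        "P4": "bird population dynamics and migratory routes",
        "P5": "medieval manuscript paleography",
    }
    panels, orphans = bin_proposals(sample)
    print("panels:", panels)
    print("orphans:", orphans)
```

The same similarity measure could be applied to the text of earlier proposals to suggest people working in the same "vector space" as candidate panelists. Note that a greedy pass like this one does nothing to equalize bin sizes.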

This text analysis approach has been in operation for over a year and the program officers say that they could not survive without it any more. NSF has been able to double the proposal workload without any increase in staff. They would like to use the binning services to automatically create the panels, but right now the automatic binning techniques do not make panels of the same size.  It can only be used for the initial quick binning by the program officers. It is possible that the identification of orphan proposals may actually make for more efficient review of these proposals, though it isn’t likely to completely solve problems related to their review.  Over the next year, the goal is to add a user friendly interface and to get the system institutionalized as a cross-agency system. 

The system has also been used to perform post-selection analysis of submissions in order to determine how closely the actual activities of the project matched what was outlined in the proposal. Fifteen years’ worth of nanoscience and technology reports have been processed. The text vectors really help because the vocabulary has changed and the keywords are not the same. The system also helps to identify sub-areas and to chart how the science has changed. They produced a time visualization for NSF’s nanoscience support.

 

Public/Private Relationships: New Challenges and New Ways of Doing Business

Ms. Humphreys introduced the session with a general overview. NLM and NIH have been under the spotlight for their public access policy and for the PubChem database.

PubChem (Betsy Humphreys, NLM)

The primary purpose for PubChem was to provide access to the data from the screening centers at universities and research institutes that NLM announced for the grants. The list of sources in PubChem continues to grow with three new sources added in the last two weeks. The issue with PubChem is the desire that NLM not duplicate what is going on in the private sector but collaborate with private chemical information producers.

Through the Federal Register, NLM has asked people in the chemical information space to step forward and nominate themselves or others to the PubChem Chemical Information Working Group of the Board of Scientific Counselors at NCBI. ACS is the largest organization in the chemical information space, but there are many others. Ms. Humphreys also distributed a list of the current sources depositing in PubChem and what they are linking back to from NLM. A wide range of organizations is represented, including several federal agencies.

Ms. Humphreys distributed the language in the House Appropriations Committee Report about both PubChem and public access. The language in the Senate version is similar. The bill was reported out but not actually passed and it may never be. While the public access language in the House version is preferable, the PubChem language in the Senate report is preferable. In all cases, it is “livable” compared to some of the earlier draft language. 

Public Access Policy (Betsy Humphreys, NLM)

Both the Senate and House versions of the language call for NLM to produce reports on what is happening in regard to public access, though there are slight differences in what the reports should contain and when they should be produced. The House language for public access is very favorable.

Ms. Humphreys distributed a list of the members of the Public Access Working Group of the NLM Board of Regents. They include strong advocates as well as publishers who are opposed to the policy. The first meeting was July 11, 2005, and the group will meet again in November. 

The implementation of the policy has gone extremely well. The system has been extended to include a third-party submission process with review and signoff by the principal investigator. The costs are in line with NLM’s estimates. There has been considerable effort expended on outreach to grantees in a variety of ways. The rate of submission to the repository has been low, but NLM will monitor the rate of participation over time and seek specific information as to why authors aren’t participating.

The publishers have expressed an interest in bulk submission, perhaps because they can have more control. This isn’t reasonable because NIH’s agreement is with the principal investigator, not with the publisher. NLM has separate agreements with publishers for PubMed Central.    

Portable PMC (Dr. Jim Ostell, NLM/NCBI)

The NCBI, which is part of NLM, was created to work in factual databases in biotechnology, including databases of DNA sequences and taxonomy. However, the goal of integrating published material with MEDLINE has led NCBI to become involved in bibliographic activities in order to solve particular problems. The bibliographic resources developed or under development by NCBI include PubMed, PubMed Central (PMC), pPMC (a portable mirror of the PMC content), NIHMS (NIH Manuscript Submission System for Public Access policy), the NLM DTD for bibliographic material, the pNIHMS (portable NIHMS), an XML Authoring System, Bookshelf (books and monographs in XML), and the Entrez version of NLM’s database. After a brief introduction, Dr. Ostell described these resources in more detail.

As of September 2005, PMC has about 600,000 articles including those waiting to be released. There are over 1 million unique users. If NLM could provide access to more of the full text, usage would increase even more. PMC requires that electronic journals have “issue organization”, which can be problematic for some. The system supports linking to supplementary information and allows browsing by the Table of Contents.

In the current version of PMC, data is received from the publisher in whatever form of XML or SGML is available. NLM turns it into a single DTD format which is then returned to them. Publishers typically do the print version first and then pay a contractor to make the SGML. If they find errors, they correct the SGML and not the XML. The XML is only a by-product. NLM must perform quality assurance on an ongoing basis because the publishers often are not checking the content. It could be valid to their DTD but it really isn’t correct. Additional quality assurance is performed by the million people who look at the site. Error reports are sent back to the publisher. There have been a number of arguments about a dark archive versus a live archive. They believe that if an archive isn’t used all the time (a dark archive), it may as well be dead. 
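The gap between "valid" and "correct" can be illustrated with a small sketch. It is not NLM's quality assurance process, which the minutes do not detail. The element names and the content rules are hypothetical, and the Python standard library parser used here checks only that the XML is well formed rather than validating it against a DTD. The point is the same: a file can pass the structural check and still carry content errors that only a second, content-level check will catch.

```python
import xml.etree.ElementTree as ET


def content_problems(xml_text):
    """Return content-level problems in an article that parses cleanly.

    ET.fromstring raises ParseError for malformed XML, so anything that
    reaches the checks below has already passed the structural test.
    """
    article = ET.fromstring(xml_text)
    problems = []

    title = (article.findtext("title") or "").strip()
    if not title:
        problems.append("empty title")

    authors = [a.text.strip() for a in article.findall("author")
               if a.text and a.text.strip()]
    if not authors:
        problems.append("no author names")

    for number, para in enumerate(article.findall("body/p"), start=1):
        if not "".join(para.itertext()).strip():
            problems.append(f"paragraph {number} is empty")

    return problems


# Both samples are well-formed XML; only the first is usable content.
GOOD = ("<article><title>A Study</title><author>J. Smith</author>"
        "<body><p>Text.</p></body></article>")
BAD = ("<article><title> </title><author/>"
       "<body><p>Text.</p><p></p></body></article>")

if __name__ == "__main__":
    print(content_problems(GOOD))
    print(content_problems(BAD))
```

A list of problems like the one returned for the second sample is the kind of material that could go into an error report to the publisher.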

PMC is stored in a standard XML DTD. XML preserves the structure of the article, which is important for indexing or for data mining. XML lends itself to intelligent processing, yet it is also human readable and not dependent on a particular technology. XML is based on SGML, which is a publishing industry standard. It is possible to link to resources and products from within the article. XML is portable and able to be migrated. An issue for the publishers is that they want the PMC online versions to look like their journals. This creates problems for NLM in terms of standardization. PMC web pages are actually rendered to HTML on the fly.
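Rendering stored XML to HTML on request can be sketched as follows. This is an assumption-laden illustration, not PMC's rendering code: the sample article uses a heavily simplified subset of element names in the style of the NLM journal DTD (article-title, sec, p), and the output layout is invented.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified, invented sample; a real article carries far more markup.
SAMPLE_ARTICLE = (
    "<article><front><article-meta><title-group>"
    "<article-title>Genes &amp; Health</article-title>"
    "</title-group></article-meta></front>"
    "<body><sec><title>Introduction</title>"
    "<p>First <italic>point</italic>.</p></sec></body></article>"
)


def render_article(xml_text):
    """Turn the stored XML into a simple HTML fragment at request time."""
    article = ET.fromstring(xml_text)
    parts = []

    title = article.findtext(
        "front/article-meta/title-group/article-title", default="")
    parts.append(f"<h1>{escape(title)}</h1>")

    for section in article.iterfind("body/sec"):
        heading = section.findtext("title", default="")
        parts.append(f"<h2>{escape(heading)}</h2>")
        for para in section.iterfind("p"):
            # itertext() flattens inline markup such as <italic> to plain text.
            text = "".join(para.itertext())
            parts.append(f"<p>{escape(text)}</p>")

    return "\n".join(parts)


if __name__ == "__main__":
    print(render_article(SAMPLE_ARTICLE))
```

Because the HTML is generated from the XML each time, the presentation can change without touching the archived article, which is one reason a single standard storage format and publisher-specific appearance pull in different directions.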

NLM participated in the Harvard E-journal Archiving Project. The investigators found that a single DTD for journal archiving was feasible and that a modified PMC DTD would accommodate journals from all disciplines. Based on the recommendations from this group, NLM developed the Journal Article DTD as a set of XML modules in a library. The Archiving and Interchange DTD is all these modules and the Journal Article DTD is a subset. It is written for authoring article content (new journals), initial tagging of non-XML content and creating consistent structures. The DTD is being adopted by many including Highwire Press, JSTOR, PLOS and other conversion vendors and journal service providers.  The key is that this was not a standard done in the absence of implementation. The suite contains documentation, validators, tools and a working group. NLM is currently working with the working group to include more mark-up for factual data.

NCBI believes that you do not have an electronic archive until it is distributed, so there is redundancy. pPMC is a portable mirror of the PMC which is updated daily from NCBI. This is the first step toward true collaborative archiving. The local implementer can link to local web pages and logo. There is no search engine for pPMC, but Entrez can be used and then the results sent to the mirror site. Entrez Utilities uses Microsoft SQL Server, which will include native full-text searching for the database in the next version.  Ultimately, the goal is to have pPMC become a completely separate archive in another country, organization or agency. The repositories would no longer need to be mirrors of one another.

The technology of the pPMC has been paired with the NIHMS public access policy which takes manuscripts, connects them to the grants, creates XML and imports the manuscript to PubMed Central. As part of this effort, NCBI created the NLM DTD, a modular DTD for bibliographic material expressed in XML (see above). At this time, there is no good automatic conversion to XML, so they send the materials to two vendors for mark-up. The cost is $30 per article. The vendor performs some quality checks.  The author then reviews the proof, which is XML rendered into a web page. After the review, it either goes for update or into a queue. The submission of an article takes the author about five minutes. The proofing takes 5-10 minutes. Once the embargo date expires, it can be released. If it exists, the manuscript points to the publisher’s version.

Math, tables, foreign alphabets, and other types of characters are difficult despite the use of Unicode (UTF-8). There are many different ways to refer to outside figures and images. In some cases, differences in the content rules impact the presentation of the pPMC version versus the published version.

pPMC requires two servers at about $15K each. An SQL Server license is several thousand dollars. The labor required is approximately one FTE. Putting in the content is the costly part. It also takes a while even with an experienced vendor to get good quality mark-up. If the organization is coupling the repository to what has been published, the organization needs an infrastructure like PubMed to act as a trigger.  This could be adapted for other triggers and it can be customized to a particular process.  In the NLM situation, there is a check against the grantees and grants database that checks to make sure it is an authenticated person using the NIHMS. Several people at NLM also have responsibility for working out problems with authors when they occur. While the NIH policy includes contractors, there isn’t a system at this time to authenticate contractors.

A number of those organizations putting material into pPMC would like to have the equivalent of the NIHMS as well. The pNIHMS is being planned for the future. Other enhancements include changes to allow pPMC to include content that is not in PMC. Some additional quality control tools are also needed. Willing partnerships are needed as well as additional vendors who have a track record doing the DTD tagging.

In addition to journals, NLM is interested in books and monographs in XML. They currently have a Bookshelf as a quid pro quo. Public access is granted to the XML version if NLM created the XML. There are several dozen books in this environment. Bookshelf started with previously published books, but they have actually added new ones. Publishers find that providing the books via Bookshelf boosts sales through free advertising. This works particularly well with books since most readers would rather have the book to read cover to cover. Authors are now requesting that their books be made available on Bookshelf and they are pressuring publishers. With the 2nd edition of Microbiology, the authors will produce the book first in XML, so that they can publish the chapters one at a time, do collaborative writing, and then publish it on paper later. Bookshelf is seeing the return of the monograph. A chapter in a book could be viewed as a monograph and the web is a popular and cost effective publishing medium. Domain experts are writing individual chapters and collaborating with NLM to point to the appropriate factual databases. They can author under their names and even publish after the fact.

Gene Review is an interesting experiment in that the monograph about genes for clinicians is being written as part of doing the database. The authoring is done in MS Word. Simple mark-up is added based on Word styles. A few rules are needed for following a template and plug-ins to MS Word are required. WordML (via Word 2003) is converted to XML. This XML is automatically converted to the NLM’s Book DTD. NLM is now in the process of merging the Book DTD and the Journal Article DTD. A few high level metadata elements, such as part of a series, are being added.
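
The style-based mark-up step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NLM conversion: the style names, the element names (chapter, title, p) and the function styled_paragraphs_to_xml are hypothetical, and the input is a plain list of (style, text) pairs rather than real WordML.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to XML element names.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Normal": "p",
}


def styled_paragraphs_to_xml(paragraphs):
    """Convert (style, text) pairs into a simple chapter element."""
    chapter = ET.Element("chapter")
    for style, text in paragraphs:
        tag = STYLE_TO_ELEMENT.get(style)
        if tag is None:
            # A style outside the template cannot be mapped automatically.
            raise ValueError("style not in template: %r" % style)
        ET.SubElement(chapter, tag).text = text
    return chapter


doc = [("Heading 1", "Overview"), ("Normal", "Summary of the gene.")]
chapter_xml = ET.tostring(styled_paragraphs_to_xml(doc), encoding="unicode")
print(chapter_xml)
```

Rejecting unknown styles reflects the point that authors must follow a template: the conversion is only automatic when every paragraph uses a style the mapping knows.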

Dr. Ostell described the XML Authoring System as a collaboration with Microsoft to create the XML DTD right out of the authoring environment. NLM is also working with Microsoft to be able to roundtrip from the DTD to Word. Right now, it is unidirectional, from Word to the DTD.

Discussion

NAL, NLE and GPO expressed an interest in additional follow-up with NLM on the pPMC system. In a follow-up e-mail the following day (September 8), Dr. Ostell and Jeff Beck recommended several follow-on activities with those who are interested. They do not see the need for a formal task group. Interested agencies could send examples of the documents that they would put into an electronic archive, along with any specific comments about their needs to NLM. This could be done informally, but should happen as soon as possible. A cursory look would be made to determine if there are any issues. Mr. Beck and Dr. Ostell are willing to give a more detailed presentation on the NLM DTD targeted to policy and technical people. After the first two activities are completed, they would offer follow-on work sessions for any agency that would like to go through its materials element by element, producing mappings and possibly a prototype to assess the scope of the task to convert the material.

The group discussed preservation formats. NLM does not consider PDF to be an archival format because it is proprietary. In addition, PDF doesn’t necessarily allow for linkages between the text and other databases such as GenBank, and the text cannot easily be re-purposed on PDAs. In the near future, ISO PDF-A will be announced, but it is primarily a specification for software vendors to write software to assure that a PDF file will meet certain criteria in the future. This means that certain bells and whistles will not be included, which may make it difficult to convert more complex documents without loss. In Dr. Ostell’s opinion, PDF-A is a specification that can lead to different implementations.

Many CENDI agencies are making decisions about the use of PDF. The members requested a short summary about the issues and implications surrounding PDF versus other formats for archiving. Susan Sullivan from NARA could brief the group on PDF-A, because she has been involved in the ISO standards group. Jeff Beck or Jim Ostell could present NLM’s assessment. 

 

KEYNOTE: “Federal Priorities for Scientific Collections Management and What Data Are Next?” Dr. Teresa Fryberger, Assistant Director for Environment, Office of Science and Technology Policy

The Office of Science and Technology Policy (OSTP) was established to advise the Office of the President. It leads interagency efforts to develop S&T policies and budgets for all areas of science. Its goal is to build strong partnerships among other government entities. OSTP has only 40 people and is a very flat organization. OSTP is in transition right now with several vacancies, including the Director.

The work at OSTP is done in two ways. One is through the President’s Council of Advisors on Science and Technology (PCAST), which involves the private sector, universities and industry. The other is through the National Science & Technology Council (NSTC), the federal component of OSTP. NSTC is a cabinet level council of advisors to the President on S&T. It coordinates science and technology through a series of committees. There are four high-level committees chaired by heads of agencies or department secretaries. Dr. Fryberger works on the Ecosystems and Earth Observations Committee. The Committee on Science includes the Interagency Working Group on Scientific Collections.

OSTP works with the Office of Management and Budget (OMB) to issue guidance memos that direct the agencies to develop science priorities and budgets. The President then proposes the budgets to Congress. Bottom up and top down approaches are used in the priority process. For example, the scientific collection issue was really a grass roots effort which resulted in the creation of an interagency meeting and then creation of a group.

For FY07, there is a new emphasis on collections and an understanding of their R&D investment and impact. There is also an emphasis on Networking and Information Technology. Each agency is required to request a budget that sustains the research important for its mission but that also reflects the focal areas in the guidance memo that is issued jointly each year by the heads of OMB and OSTP. OMB also expects this when reviewing the budget requests.

Dr. Fryberger went on to describe the new Interagency Working Group on Scientific Collections. It will be chaired by David Evans (Smithsonian) and Phyllis Johnson (USDA/ARS). The first meeting was held on September 27, 2005.  The draft charter calls for the evaluation of scientific collections, coordination regarding their development and management, and activities to increase awareness of collections. 

The group is talking about collections that are actively used for scientific purposes. The Marburger/Bolten memo says the priority and stewardship of scientific collections should be assessed and agencies should coordinate strategic plans to identify, maintain, and use these collections. These activities require the help of expert information management groups like CENDI.

The Interagency Working Group will likely focus on specimen-based collection activities; the agencies will decide this at the first meeting. Specimen collections are generally housed in museums, at universities, and private institutions in the U.S. and around the world. However, collections of all kinds need attention, and their importance is not widely recognized. There are questions about data across disciplines but there are also many common themes. The issues include how to prioritize, digitize, preserve, store and fund these collections in a sustainable fashion.

The US Group on Earth Observations (USGEO) Subcommittee of the NSTC (http://iwgeo.ssc.nasa.gov) is another area where additional information management expertise is needed.  USGEO is chaired by NOAA, NASA and OSTP. The US group has a plan on its web site. USGEO is the US contribution to GEOSS, a distributed system of systems for earth observation across 60 countries (http://earthobservation.org). A smaller international plan has also been developed that looks toward a global framework. However, most of these systems cannot interoperate. Improved science for decision-making and regulation would be possible if there were better integration. The focus is on standardization for sharing and on identifying the gaps where more needs to be done.

Data management is central to achieving these integrated observations. The data management needs are huge, spanning both the program and international scales. New systems will mean a 100-fold increase in data, but the current systems are already challenged. There will be gaps in data.

The USGEO is working on near-term opportunities for integration frameworks in various observation areas -- observations for disaster warnings, global land observing systems, sea level observing systems, air quality assessment and forecasting.  The USGEO decided to single out data management and try to pull out the related issues from each of the areas listed above.  They are looking at what can be pieced together to make a bigger impact.  They hope to pull together a data management package that will provide for a broader data management plan. It may also carry over to the international arena.

USGEO has an excellent group of people, but they are geoscientists and IT people, not information managers.  Dr. Fryberger is looking for involvement from CENDI, which has this kind of expertise.

Discussion

Chuck Romine at OSTP is interested in having a briefing from the Science.gov Alliance and CENDI. The NITRD is in charge of networking and information technology research based on the High Performance Computing Act of 1991. The NITRD has a number of subgroups and two interagency working groups. Dr. Fryberger mentioned that Mr. Romine is involved in the telecommunications area of OSTP.  A follow-up meeting with others at OSTP, including Mr. Romine, was discussed.

Dr. Marburger has a new initiative on the societal value of science and how to measure it. Publications and data play a role in those measurements.

The group discussed STI support for recovery from Hurricane Katrina. How do we channel our good intentions and resources?  Dr. Fryberger will be involved through Homeland Security.  The Disaster Reduction Subcommittee will probably be calling an interagency meeting, which will include a discussion about the research activities going on in that geographic area. Dr. Fryberger is going to suggest that a series of meetings begin with a roundtable.

 

User Expectations: CENDI Member Panel on Understanding and Reaching the User

Ms. Carroll pointed out the table in Tab 4 of the Planning Book that categorizes the information collected from the recent survey of CENDI activities in the area of user expectations. This shows the wide variety of initiatives underway.

Information Literacy: Google is not THE Answer - Paul Ryan (DTIC)

Google is an enabler. It is good for some things, but is not THE answer for the R&D community. However, DTIC has been challenged by the Department to be X+1 better than Google; otherwise, why does the department need DTIC?  DTIC realizes that Google is always going to be the first choice and that public documents from DTIC are going to be copied and made available through Google. To address this situation, DTIC is using OAI to make its public documents harvestable. This ensures that DTIC results will be displayed first and with the DTIC logo rather than that of a third party.

Secondly, DTIC has undertaken an educational initiative to highlight the differences between the DTIC system and Google.  DTIC selected a series of topics relevant to the Advanced Concept Technology Demonstrations program, which is involved in new technology weapons development. They searched these topics in a variety of DTIC systems and on Google (Google, Google Scholar and Google Print).  They looked at the first 10 hits and analyzed the information type, the database, etc. The relevancy was ranked by DTIC professional searchers (note:  not users).

One of the biggest advantages of DTIC is the non-public material. There were several areas where Google’s numbers were higher, such as with journals and academic research.  DTIC needs to encourage the Services to send more journal literature to DTIC.

While this analysis is a snapshot, DTIC believes that this provides some hard data that can be used to sell the scientists and engineers on the value of DTIC. The intent is to present the findings to the senior leadership and then to include the users in the relevancy ranking.  However, the registration process and the use of a user name and password continue to be a barrier to the use of DTIC content.

DTIC wants to ensure that its information is available for other agencies and other search engines. Through its OAI/Google-ization initiative, it is setting up the capability to make DTIC unclassified and limited information available so that authenticated DoD activities and other agencies can pick up the information more easily. All the metadata is available in the repository, but the system can respond to the request for metadata in any of a variety of formats including MARC, Dublin Core, and COSATI. DTIC will begin to build out an XML and HTML infrastructure.

The OAI Data Provider is an XML-based repository. It was modified to fit DTIC’s infrastructure  from a system at Old Dominion University.  A Java servlet is used to respond to OAI and HTTP requests. The OAI harvester thinks it is crawling a web site, but it is really generating OAI metadata on-the-fly.  Handles are used so that people can get to the document on a persistent basis.  Links are provided to DTIC’s homepage, to DTIC’s search page and also to NTIS’ shopping cart to purchase the material.
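
A harvester's side of this exchange can be sketched as a plain HTTP request. The example below only builds an OAI-PMH ListRecords URL; the base URL and the function list_records_url are hypothetical and do not describe DTIC's actual endpoint. The verb, metadataPrefix and from parameters and the oai_dc (Dublin Core) prefix come from the OAI-PMH protocol; prefixes for other formats such as MARC or COSATI are defined by each data provider and are not assumed here.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real data provider publishes its own.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return BASE_URL + "?" + urlencode(params)


dc_url = list_records_url("oai_dc", from_date="2005-09-01")
print(dc_url)
```

The same repository can answer the same request with a different metadataPrefix, which is how one store of metadata can be served in several formats.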

Discussion

The group discussed the opening of databases to Google and others. At the May OCLC Conference, the statistics showed that Google’s share of the market is falling. Therefore, OCLC is also working with Yahoo and MSN. The use of Open WorldCat makes it possible to open up content without having to deal individually with Yahoo, Google, etc. NAL has cut out Google crawling of Agricola because it was degrading the system’s performance.

Dr. Warnick will report to the group at a later meeting on the impact of the relationship with OCLC on OSTI’s usage.  Yahoo artificially bumped up DOE’s results ranking to compensate for the fact that the ranking algorithms didn’t work because they are based on the number of external links to the material.  GPO tested a positioning service. This was successful in a targeted way and it wasn’t very expensive.

Show and Tell: What We Learn from Exhibiting – Tom Lahr (USGS)

The NBII is a partnership that involves a number of nodes around the country, which are regionally or thematically based. The NBII participates in about six regional exhibits, two national exhibits and three international exhibits each year. The exhibit may be a booth or a table top display. The criteria used to determine exhibit attendance are conference size (the projected number of attendees), the specialization of the audience, exhibit hours without conflicts from other meeting events, and the symbiosis of the meeting’s purpose with the immediate needs of the NBII (highlighting existing project tools to existing users or promoting new project tools). 

The node managers suggest and select the exhibits, tailor each exhibit to the needs of the specific audience, and staff the exhibit.  The exhibit staff complete a feedback form when they return. Questions include a rating for the exhibit (highlights and low lights).

Exhibits provide an opportunity to find new users and to get feedback from existing users. They also provide opportunities for the node managers to find additional partners and identify new content resources. Success is hard to determine; Mr. Lahr estimates a 50/50 success rate.

Discussion

Many other agencies participate in exhibits. The group discussed the possibility of using shared space, particularly for the promotion of Science.gov. Ms. Jordan mentioned that opportunities for sharing are routinely discussed at the Promotions Task Group meetings.

Ms. Herbst indicated that work has been done on the Return on Marketing Investment. It is important, however, to begin by setting business goals to which the marketing activities are tied. Exhibits need to be considered along with the entire program. There are fairly inexpensive tools available to aid in this analysis. The members suggested that metrics and evaluation of marketing programs should be a topic for discussion at a future CENDI meeting.

 

Expert Panels: Balancing Vision with Experience – Elliot Siegel (NLM)

Dr. Siegel described the NLM use of expert panels as a way to engage a community and get feedback during the planning process. In NLM’s case, there will be about 100 people involved. While the panels cost about $500,000 without the staff labor costs, it is a way to systematically get input in a short period of time. The goal is to look out 20 years and then back it up to a 10-year perspective. 

Each panel meets twice. The report of outcomes from the first meeting sets the stage for the next meeting. The second meeting works to finalize the report. Each panel has approximately 20 people. To get things going, they might bring in outside experts, a list of questions, or staff presentations. They will agonize over the questions that will be discussed between now and the panels. However, it is important to balance structure against serendipity.

There will be four panels linked back to the original panel structure developed about 20 years ago, which matches the basic areas in which the programs must report. For example, Panel 1 addresses the databases and libraries, and issues of a new building.  Current programs and accomplishments are tied to future considerations, which allows NLM to identify the key activities that they are doing now that feed into something for the future.

Getting buy-in from the staff is very important.  Therefore, they have the staff involved early through retreats with outside experts. Staff members are also involved in writing background papers. At the conclusion of the planning process, management comes back to the staff in the program areas and asks them to think about how to implement the findings. Some staff will have recording/writing responsibilities during the course of the expert panels. There are lots of politics and the Board of Regents is a connection between the staff and the outside experts, even though the staff helps to develop the plan. 

The initial 20-year plan did not have a specific outreach component, which they have since added.  The international scene is a big question mark for NLM because they don’t really have the funding for this. They have held back in this area, but that may change in the future.

The promotion of public health systems of the 21st century is a big area. This includes mining patient data, and perhaps having patient data centralized at NLM.  NLM has supported investigators using data mining in areas such as homeland defense and first responders. Dr. Siegel predicts that there will be more emphasis on these projects after Hurricane Katrina. Support for genomic science, such as directing a person’s care with genetic information, will flow from the current work of the NCBI.

GIS: The Visualization We Know and Love – Dr. William Thomas (Alternative Farming Systems Information Center, USDA)

USDA is one of the largest government users of geographic information tools. They also create many digital maps and other products. There are over 3.5 million items in the NAL collection, many of which could be linked directly or indirectly to geography. Most of the 7,000 articles correspond to a geographic location.  His information center is digitizing older materials that pertain to organic agriculture prior to 1942 when synthetic chemicals began to be used. They want to link this to geographic information. The grower knows where he is located, and geographic location is extremely important to sort out the most relevant information from the results of a search.

Metadata facilitates the search process. Work is underway to develop and integrate the appropriate metadata into the various resources, including the library catalog. One characteristic can be geospatial. The current task is to develop a higher level of metadata to incorporate new geographic tools.  Dr. Thomas is working on an XML tool that will allow the geospatial information to be put into a form that could be harvested by FGDC.
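
As a rough illustration of what harvestable geospatial metadata might look like, the sketch below writes a bounding box into a small XML record. It is not Dr. Thomas's tool. The element short names (metadata, idinfo, spdom, bounding, westbc, eastbc, northbc, southbc) are taken from the FGDC Content Standard for Digital Geospatial Metadata as understood here and should be checked against the standard; a complete record needs many more elements. The function name and the sample coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET


def bounding_box_record(west, east, north, south):
    """Build a minimal metadata fragment holding one bounding box."""
    root = ET.Element("metadata")
    idinfo = ET.SubElement(root, "idinfo")
    bounding = ET.SubElement(ET.SubElement(idinfo, "spdom"), "bounding")
    # Coordinates are stored as decimal degrees in text form.
    for tag, value in (("westbc", west), ("eastbc", east),
                       ("northbc", north), ("southbc", south)):
        ET.SubElement(bounding, tag).text = str(value)
    return root


# Hypothetical coordinates for an item tied to one growing region.
record = bounding_box_record(-77.0, -76.5, 39.5, 39.0)
print(ET.tostring(record, encoding="unicode"))
```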

Discussion

EPA has digitization efforts in Las Vegas, and Dr. Sykes would be interested in discussing digitization efforts with USDA.

User Expectations: Increasing the Value of Sci-Tech Intelligence through Enhanced Visualization - Bonnie Carroll

There are two main types of visualization – data visualization and text visualization. We are accustomed to data visualization, including maps, and chemical visualization tools that represent chemical elements, molecules, and compounds graphically.  The Oak Ridge National Laboratory’s Visualization Facility can visualize mathematical equations and models, galactic supernova, proteins, nanostructures, and fusion energy devices.

Visualization helps solve numerous research problems. According to Dr. George Fann at ORNL, “we’re no longer dealing with reams of papers as we understand science. Visualization is helping scientists test hypotheses, better understand data, and make new discoveries.”

The next challenge appears to be text visualization. Text visualization tools allow the results of a search query to be displayed visually. The Intelligence community has considerable investments in this area, and it might be worth reengaging with them on these technologies.

There are several available tools including the Pacific Northwest National Laboratory’s IN-SPIRE, NIST’s NIRVE project, and STN’s AnaVist which is in beta test. (It was a presentation on AnaVist which provided the title for Ms. Carroll’s presentation.)

Ms. Carroll demonstrated or showed screen shots from several tools.  Grokker and Kartoo are two meta-search tools that are available on the public web now. They use size, color and graphics to show relevancy and connectedness. RefViz from ISI/Thomson, which is available as a trial download, searches multiple data sources including MedLine. RefViz is a more complex enterprise tool that can visualize results from both internal and external resources.

Ms. Carroll suggested that CENDI may want to share experiences or plan a joint effort in this area. Possibilities include a workshop or a jointly funded application for use with Science.gov.

Discussion

NSF has been experimenting with some of these ideas. They have developed a system for looking at projects by taking a taxonomy created by Vivisimo on the fly and presenting it as a picture that shows patterns and trends.

Dr. Wood recently attended a conference in Boston where nearly 100 companies were working in this area, primarily in support of the Intelligence community.