CENDI PRINCIPALS AND ALTERNATES MEETING
Department of Education
Washington, DC
July 8, 2008

Final Minutes

Flying High and Diving Deep


The Grid, the Cloud, and the Internet of Things – Potential and Policy Implications
Deep Indexing: Harnessing the Power of Data Discovery
National Library of Education Showcase

Welcome

Mr. Ryan, CENDI Chair, opened the meeting at 9:10 am.  He thanked the National Library of Education and the Department of Education for hosting this meeting.

Flying High and Diving Deep

“The Grid, the Cloud, and the Internet of Things – Potential and Policy Implications” (Dr. Michael Nelson, Visiting Professor, Georgetown University)  (presentation, .pdf format)

We are entering a fundamentally different phase of the Internet. The changes will be almost as profound as the founding of the Web. Standards and business practices are shaping the Net as much as, or more than, laws and regulations. The Internet is less regulated than other telecommunication infrastructures. However, we are less than 15 percent of the way through the transformation, measured by the number of potential users, the total bandwidth, the total amount of content, the number of devices, and the number of applications.

However, Dr. Nelson cautioned that it is important to always look beyond the headlines. There is so much hype that it is important to dig deeper to identify the real trends.

In terms of Cloud Computing, there are a lot of articles about Google installing large data centers and offering Google applications. The common applications to date are Web 2.0-type user applications such as Flickr, YouTube, and Salesforce.com. These are early examples. Small- and medium-sized business applications will be coming online where the entire team can use the application over the Net.

The real news is that we are really entering the third phase of computing where applications and data will be moving off our machines and into the “cloud”. This move will impact the way information is organized and managed, and we need to inform our management and employees of this change.

In the diffuse atmosphere of cyberspace, the network will truly be the computer. It is estimated that in 10 years, more than 80 percent of what we do will be done on the Web. Both the data and applications will be moved. Different pieces of applications will be loosely bound together. Therefore, we don’t need a lot of computing in our hands; smaller processors and sensors will be used instead of desktops or laptops.

In his book, The Big Switch, Nicholas Carr looks at the history of the utility industry. It looks very much like what we are seeing now with computing. We are on the verge of “utility computing.”  Gartner, Inc. says that Cloud Computing will be as influential as e-business. However, we don’t know what standards will play out or how companies might stand in the way to block this kind of technology advance.

Internet video is moving forward rapidly as television and movies are going online. This is the commercial side of the technology adaptation. Individual videos are the most exciting area. More than 100,000 videos are uploaded each week. There is also a lot of surveillance video. In his book, The Transparent Society, David Brin suggests that we are on video so much that we either make the video all public or it becomes controlled by the government, which jeopardizes privacy.

The GRID is a specific application of cloud computing. This involves tying together computers and instruments. Amazon and Akamai are building infrastructures for GRID computing. Amazon’s Elastic Compute Cloud sells extra computing power by the byte and by the cycle, primarily to start-up companies that don’t want to invest in their own infrastructures. Akamai currently delivers 15-20 percent of the Internet traffic through its servers. The global traffic was 1.5 million hits per second several months ago. Thousands of machines are used for hosting in order to save bandwidth.

PC-based grids are even more powerful. For example, radio signals can be downloaded from SETI@Home. The Berkeley Open Infrastructure for Network Computing lists different sites where you can donate your personal computing power for particular causes, such as fighting AIDS. There are, of course, malicious uses of this -- hacker attacks and mega spammers, for example.

Swanson and Gilder talk about an “exaflood,” an incredible surge of traffic because of the increase in content, especially video. It is estimated that the traffic and content will increase 10 times by 2011. The surge will come between 2011 and 2015, with an increase of another factor of 10. As we move from watching television to watching personalized video channels, we will need to figure out what to do with this increased content, similar to what happened with the Web in the mid-1990s.
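The two growth estimates above compound, so a short calculation may help make the scale concrete. The following is only an illustrative sketch: the factors of 10 and the years 2011 and 2015 come from the presentation as reported here, while the 2008 baseline (the meeting year) and the function and variable names are assumptions.

```python
# Illustrative arithmetic only. The 2008 baseline is an assumption
# (the meeting year), not a figure given in the presentation.
BASELINE_YEAR = 2008

# (end_year, growth factor over the period), as reported in the minutes.
PERIODS = [(2011, 10.0), (2015, 10.0)]


def growth_summary(baseline_year, periods):
    """Return (cumulative factor, implied annual multiplier per period)."""
    cumulative = 1.0
    annual = []
    start = baseline_year
    for end_year, factor in periods:
        cumulative *= factor
        years = end_year - start
        # Constant yearly multiplier that would produce `factor` over `years`.
        annual.append(factor ** (1.0 / years))
        start = end_year
    return cumulative, annual


total, annual_rates = growth_summary(BASELINE_YEAR, PERIODS)
print(f"Cumulative growth by 2015: {total:.0f}x")
for (end_year, _), rate in zip(PERIODS, annual_rates):
    print(f"Implied annual multiplier up to {end_year}: {rate:.2f}x")
```

Under these assumptions the two estimates together imply roughly a hundredfold increase by 2015, with traffic more than doubling each year in the first period.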

The combination of developed and developing countries will also have an impact. It is difficult to tell how fast the network will be built out, but it will come from full motion, high-definition video and gaming, and the increase in the use of virtual worlds. 

High-end video conferencing will turn into a virtual world. The gaming revolution and virtual worlds are producing GDPs (Gross Domestic Products) equivalent to that of Belgium. The Internet isn’t just a medium, but a Place. Early Virtual World Business Applications involve commerce, collaboration and events, education and training, and emerging applications such as emergency response and surgical training.

A major question is what the government’s role should be. Government can be an early adopter of new technologies such as virtual worlds and the GRID. Todd Ramsey from IBM discusses how to move to a fully collaborative, distributed government in his book The Government Innovation Journey. Most agencies are in the first two phases where information is being pushed out to citizens and other agencies. Phase 3 involves working across agencies and focusing on applications for a particular citizen function. Extended government means multiple levels working together. Collaborative government requires a different model. There are a lot of cultural issues, but technologies can help to move this along. More access in a distributed way will allow different players to have the power they need.

Key policy issues include promoting competition and investment in the telecommunications infrastructure. We are generally behind other developed countries. It is important to consider foundations for securing the Internet and “future proofing” government and agency policies. For example, if the data isn’t in the building, but out on the Network, what does search and seizure mean? How do the laws apply?

New policies will be needed for the Cloud. There will be critical technology choices to be made. Directories will be important to keep track of resources. The Open Document Format, an XML-based specification for describing the content and format of a document, will be advanced for interoperability. This specification was developed by a multi-vendor committee at OASIS and is an International Organization for Standardization (ISO) standard. The Open Document Format is a way to enable the Cloud, so that no matter which word processing application you use, you can open any document. It will be important for government to push for open standards.

A consistent way to authenticate individuals will be needed, using methods such as Passport and CardSpace. MIT’s Technology Review list of Top 10 Technologies included federated identity management. Open source, open standards identity management will become key. Different ways of doing authentication are being used now, requiring individuals to share personal information multiple times, putting their identities at greater risk. Instead of establishing a relationship with multiple web sites, people would belong to two or three identity services that will vouch for them. This approach is more stable, provides better security and privacy, and is much easier for people to use.

A critical question is where government agencies will get employees to deal with this new world. Archiving will become an incredible problem. The next generation of library/information scientists will need to deal with what to keep and what to throw away. What should the curriculum be? How do we as professionals help to train everyone in society about these issues?

The next phases of the Internet Revolution will be as disruptive as the printing press, but they will happen much faster, will be totally global, and will be more unpredictable. However, when in doubt, empower the user. Science and technology is about a year behind the “crazy kids” and two to three years ahead of general business.

The new world will hold challenges for us as data centers and repositories. How do we capture the information that comes out of the 3-D environments? How do we capture the modeling being done for global change? There aren’t standards and we don’t know what needs to be kept. 

The picture for the US is mixed compared to other developed countries. It ranks 15th to 20th in broadband implementation. The cost is also very high. We get some extra credit by being more experimental. There are a lot of eager customers. We could be farther ahead if we had some of the standards issues solved.

Privacy, security and intellectual property (IP) remain show-stopper issues. Dr. Nelson thinks we will see many new alternative models that go beyond the standard IP models. There are implications here. The 99.9 percent decrease in the cost of sharing information will eventually catch up with those who seek to enforce IP, and the interface between Copyright and Creative Commons will continue to be an issue.

“Deep Indexing: Harnessing the Power of Data Discovery” (Mark Hyer, Vice President, Higher Education Publishing, ProQuest)  
(presentation, .pdf format)

ProQuest is facing an explosion of content including abstracting and indexing, full text, dissertations, government documents, special collections, social network information, and the open Web. Research is a highly non-linear, visual environment where information is aggregated within articles in tables, figures and charts. It is hard to find and understand these resources when someone is looking to validate research.

ProQuest/CSA has been doing article level indexing for many years. However, many users are interested in the data and the statistics and not in the pieces of the text that are addressed by basic indexing. Dr. Craig Emerson of CSA came up with the idea of “deep indexing” about 15 years ago. CSA has actually been performing deep indexing for three years, but the product was just released 18 months ago.

If you can extract and make the content of tables available in context, they can be very helpful.  Carol Tenopir’s research shows that direct access to high quality components saves users’ time and enhances relevance judgments.

The tables and figures are identified within a document. Critical data and information within and surrounding each table or figure (including the full caption) are provided with the indexing.

There have been several similar projects including TableSeer, BioText Search Engine, ORE Aggregation, and DeLiver. BioText, for example, indexes all open access journals including 60,000 tables and 100,000 figures from 40,000 articles. Google Images is coming out with a labeling system.

However, CSA Illustrata is the first commercial deep indexing product. The indexing makes it possible to search a variety of tabular content. This is done by taking the object within the article and building an object record for it. There are over 80 object categories, object descriptors (including geospatial), taxonomic terms, etc. Sometimes they capture the image and make a grayscale version with publisher attributions. The object caption is captured, and they assign Digital Object Identifiers (DOIs) to these objects.
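The object record described above can be pictured as a small data structure that is searchable on its own, independent of the article text. The sketch below is purely illustrative and is not CSA's actual schema or patent-pending process: the class name, field names, search function, and sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    """Hypothetical record for one table or figure extracted from an article."""
    doi: str                 # identifier assigned to the object itself
    article_title: str       # the article the object was taken from
    category: str            # one object category, e.g. "table" or "figure"
    caption: str             # full caption captured with the object
    descriptors: list = field(default_factory=list)      # e.g. geospatial terms
    taxonomic_terms: list = field(default_factory=list)  # organism names, etc.


def search(records, term, category=None):
    """Return records whose caption, descriptors, or taxonomic terms mention term."""
    needle = term.lower()
    hits = []
    for rec in records:
        if category is not None and rec.category != category:
            continue
        haystack = [rec.caption, *rec.descriptors, *rec.taxonomic_terms]
        if any(needle in text.lower() for text in haystack):
            hits.append(rec)
    return hits


# Made-up sample data for illustration only.
records = [
    ObjectRecord("10.0000/example.t1", "Example article A", "table",
                 "Mean salinity by sampling station", ["Chesapeake Bay"], []),
    ObjectRecord("10.0000/example.f2", "Example article B", "figure",
                 "Growth curve of cultured cells", [], ["Escherichia coli"]),
]

table_hits = search(records, "salinity", category="table")
print([rec.doi for rec in table_hits])
```

Searching at the object level in this way is what lets a user go straight to a relevant table or figure rather than first locating and reading the whole article.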

Life Sciences collections are growing dramatically from under a million to 3.5 million. They are anticipating about 12 million in the next few years. This is a total scale of production that is different from what they have seen before.

CSA just released a technology module with deep indexing. They also have plans for doing deep indexing for archives of selected journals. A year’s worth of Elsevier Science Direct results in over 2 million objects. They are also envisioning deep indexing for dissertations, as well as looking at data sets and end user material.

CSA would be interested in having an open standard for the indexing mark-up. The process itself has a patent pending.

For more information, the ProQuest site has a webinar and a white paper (http://info.csa.com/csaillustrata) that further describe deep indexing.

National Library of Education Showcase

ERIC (Education Resources Information Center) is the most visible public program of the NLE. As of May, the database has over 98,500 records from 2004-2008. The database also includes records from 1966-2004, the time when ERIC was produced by 16 clearinghouses in academia and educational associations.

Over 14,400 records, or about 3000 records a month, were added during 2008. There are 221 journal providers under agreement, providing 703 journal titles. ERIC is looking at adding agreements for almost 450 new journals and 195 new non-journal sources.

The journals are added to the list online. The number of journals will be approximately the number that was covered under the legacy ERIC. The difference is that the new coverage will be comprehensive.

ERIC has identified several goals for the ERIC processing including accurate representations for every document, a maximum 30-day processing time, acquiring and processing all possible current content, and acquiring and processing appropriate 2002-2003 content from the gap that was created during the post-clearinghouse era.

At the January CENDI meeting, the ERIC contractor reported on the digitization project in partnership with the National Archive Publishing Company. The project is now almost complete. The most extensive part of this 2 ½ year project was the copyright clearance process where authors and institutions were contacted. The copyright clearance process may be 85-90 percent of the cost of digitizing an article. All the permissions for the 340,000 documents have been documented. Of these documents, approximately 338,000 permission requests were distributed. Approximately 1/3 of the permission requests have been returned. ERIC expects to eventually get about a 50 percent return.
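The return figures above translate into approximate counts as follows. This is a rough sketch using only the numbers reported in these minutes; the fractions are approximations, and the variable names are illustrative.

```python
REQUESTS_SENT = 338_000      # permission requests distributed (from the minutes)
RETURNED_FRACTION = 1 / 3    # approximate share returned so far
EXPECTED_FRACTION = 0.50     # eventual return rate ERIC expects

# Round to whole requests, since the inputs are themselves approximate.
returned_so_far = round(REQUESTS_SENT * RETURNED_FRACTION)
expected_total = round(REQUESTS_SENT * EXPECTED_FRACTION)
still_expected = expected_total - returned_so_far

print(f"Returned so far:   about {returned_so_far:,}")
print(f"Expected in total: about {expected_total:,}")
print(f"Still expected:    about {still_expected:,}")
```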

ERIC has added several new features, including an enhanced web site section for the digitization project. This section provides a list of the journal records that now have full text; the list helps libraries weed their microfiche collections. WorldCat and OpenURL links are available through “Find in a Library.” An updated “related items” feature was made available after the new search engine was implemented. ERIC recently added posting counts for thesaurus terms. Dedicated searches are also available for Regional Educational Laboratory and What Works Clearinghouse documents. The Advanced Search feature allows the user to check the education level for material contained in the document. This field is limited to those documents that identify an education level in the abstract because ERIC doesn’t have indexers who are making that judgment. The total records are updated dynamically on the journal list record.

The web site will be revised over the next few months and new features will be added. Enhanced help text and new training material sections will be added, including voice-over tutorials. Search will get faster, since ERIC is slimming the interface down and eliminating some of the graphics. The Thesaurus will be updated. A quick link will be available to featured publications and journal collections. A “What’s New for Publishers” section will provide information on publisher agreements and how authors can upload their papers. A Librarian area will provide more documentation including an introduction to the thesaurus and to the metadata scheme. A particular field code is needed to specifically identify peer reviewed metadata. This field is available only from 2004 forward.

The meeting adjourned shortly before noon.