Million Books to the Web: An Example of Indo-US Collaboration, Lessons Learnt & The Road Ahead



Date: 27.10.2017


  • Prof N. Balakrishnan


Lessons from the past

  • fires of Alexandria

    • irrevocably severed our access to any of the works of the ancients.
  • introduction of printing technology

    • much Indian and Chinese knowledge, disseminated by word of mouth and on palm leaves, virtually disappeared or became inaccessible
  • New cultural revolutions

    • edifices built by destroying the past irrevocably
    • later revolutions seek solace in attempting to preserve what was destroyed
    • we need to preserve our heritage independent of the political and social ups and downs


Lessons from Reality

  • In a thousand years:

  • only a few of the paper documents we have today will survive the ravages of deterioration, loss, and outright destruction. 

  • Many other works still in existence today, held in existing paper archives, are rare and accessible only to a small population of scholars and collectors at specific geographic locations

  • Contrary to popular belief, libraries, museums, and publishers do not routinely maintain broadly comprehensive archives of the considered works of man

  • No one can afford to do this, unless the archive is digital



The Approach

  • Technology Driven Vision

  • Decide on the stakeholders

    • Never make it exclusive
  • Pilot Projects to perfect technology

  • Bring in advanced management concepts

    • like People Maturity Models
    • Quality assurance
    • automate wherever possible


The Approach

  • Lessons from the past

    • Too many Digital Library projects
    • with a half-life of less than 2 years from the date of “launch”, or a long incubation time
    • Follow Nike – JUST DO IT
  • Digital Library must have two ingredients

    • A knowledge Amplifier
    • Free access, giving everyone avenues to benefit economically
      • while still contributing to the multiplication of knowledge by circulation
  • In India, it should be a test bed for our Language Technology Research

    • a showcase for our heritage


Elements of Technology

  • Microprocessors

  • Memory

  • Connectivity

  • Software

  • All these technologies are growing exponentially



Communication Revolution





The World of Computers & Communication

  • Small fish eat the Big Fish



Processor of Tomorrow

  • Carbon Nano Tubes

    • 5 to 10 atoms wide
    • promise to replace silicon soon
  • Flexible Transistors

    • made from plastic, organic materials
  • Silicon will live for 15 years

  • Moore’s law will live longer

  • 1000 times growth in 10 years



Processor of Tomorrow

  • A billion Transistors at 10 to 20 GHz Clock rates by 2010

  • 128 G Bytes of Main Memory

  • Terabyte of disk storage, possibly holographic

  • Speech input/output (ASR)

  • Multilingual

  • Terabit connectivity at the PC

  • The DL plans of today must be sensitive to this



The Road Ahead



The future trends:

  • Browser will be the only medium of communication.

  • It will be active- with voice and video, language independent.

  • Mobility will be the key.

  • Small form factor devices such as Palms, PDAs and Tablets would be the future.

  • We would soon see TVPCT at the cost of a TV

  • We will witness major convergence between ICT, Nano Technologies and Biological Sciences



Electronic Resources and the Library of the Future

  • E-mags; E-books; E-music;

  • E-Movies



Dedicated E-book Readers

  • Dedicated readers – about 20,000

  • Palm devices – 6,000,000

  • PC’s – hundreds of millions

  • “For people accustomed to reading text on a computer for hours at a time, e-book screen clarity is a non-issue.”

  • A low-cost e-book reader design is under way in India



http://www.eink.com/technology/index.htm

  • E Ink is made up of millions of microcapsules

    • each the diameter of a human hair
  • Each microcapsule contains

    • positively charged white particles &
    • negatively charged black particles
      • that float in a clear fluid
  • A film of transistors supplies the voltage to the capsules

  • A negative charge makes the white particles move to the top of the microcapsule

    • an opposite electric field pulls the black particles to the bottom of the microcapsules, mimicking the effect of print.
  • Electronic ink is a real power miser



E-ink/e-paper (Lucent)

  • The technology has been identified and development is well under way

  • By the year 2003, we envision electronic books

  • that can display volumes of information as easily as flipping a page,

  • permanent newspapers that update themselves daily via wireless broadcast

  • Just as today's books give people easy access to everyday information, tomorrow's books will provide the same easy access to the dynamic data of the information age



Indian Institute of Science’s Simputer

  • A hand held Linux Box at around US$ 200

  • Has the state of the art browser

  • Color screen

  • very good speech synthesizer

    • In English and many Indian Languages
  • A very powerful tool for access with wireless

  • Soon to be modified as an E-book

  • www.simputer.org

  • www.picopeta.com

  • www.ncoretech.com



The Challenges in Computing

  • Tomorrow’s computing needs are not in mflops and Gflops

  • Computers need to process information, perform recognition and DM like a human

  • Small and inexpensive

  • Robots and swarms will be a reality



Ray Kurzweil: The Age of Spiritual Machines

  • “A $1,000 PC (in 1999 dollars)…”

    • 2009 = a trillion calculations/second
    • 2019 = 20 million billion calculations/second (the human brain)
    • 2029 = 2 × 10^19 calculations/second (1,000 human brains)




Ray Kurzweil: The Age of Spiritual Machines

  • 2009: “Computer displays have all the display qualities of paper- high resolution, high contrast, large viewing angle, and no flicker. Books, magazines, and newspapers are now routinely read on displays that are the size of small books.”

  • 2009: “At least half of all (business) transactions are conducted online.”



  • 2009: “There is effective convergence of all media, which exist as digital objects (that is, files) distributed by the ever-present high-bandwidth, wireless information web. Users can instantly download books, magazines, newspapers, television, radio, movies, and other forms of software to their highly portable personal communication devices.”



2009

  • A $1,000 PC delivers Terahertz speeds

  • PCs with high resolution visual displays come in a range of sizes

    • from those small enough to be embedded in clothing and jewelry
    • to the size of a thin book
  • Cables are disappearing

    • Communication between components uses wireless technology, as does access to the Web
  • The majority of text is created using continuous speech recognition

    • Also ubiquitous are language user interfaces.
  • Most routine business transactions (purchases, travel, etc.) take place between a human and a virtual personality

    • Often the virtual personality includes an animated visual presence that looks like a human face


2019

  • “Reading books, magazines, newspapers, and other Web documents; listening to music; watching three-dimensional moving images (for example, television, movies); engaging in three-dimensional visual phone calls; entering virtual environments (by yourself, or with others who may be geographically remote); and various combinations of these activities are all done through the ever-present communications Web and do not require any equipment, devices, or objects that are not worn or implanted.”



2029: “The ever learning Society”

  • Learning now constitutes the primary focus of the human species.

  • Human learning is accomplished using virtual teachers (and virtual libraries?).

  • Learning is enhanced by widely available neural implants, which improve memory and perception but cannot yet download knowledge directly.

  • Automated agents are learning, on their own without human assistance. Machines can now create significant new knowledge with little or no human intervention; unlike humans, machines easily share knowledge structures with one another.



And Then There Was Music

  • RealJukeBox

  • Win Amp

  • MP3

  • Napster



The Growth rates

  • The processor performance doubles every 18 Months

  • The Network bandwidth doubles every year

  • The storage capacity doubles every nine months

  • Soon you will have a processor bottleneck

  • 1000 times growth in storage in 10 years; I already have 250 GB on a single disk
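
The doubling periods quoted above translate directly into ten-year growth factors. The short sketch below works that arithmetic out; the function name and the dictionary are illustrative only, and the doubling periods are the ones on the slide. Under these figures a thousandfold increase in ten years corresponds to a doubling about once a year, while a nine-month doubling would compound faster than that.

```python
def growth_factor(doubling_months: float, years: float) -> float:
    """Growth multiple after `years` for a quantity that doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


# Doubling periods as quoted on the slide (in months).
doubling_periods = {
    "processor performance": 18,
    "network bandwidth": 12,
    "storage capacity": 9,
}

for name, months in doubling_periods.items():
    factor = growth_factor(months, years=10)
    print(f"{name}: about {factor:,.0f}x in 10 years")
```

The gap between the processor and storage curves is what the slide means by a coming processor bottleneck.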



Recognition versus Recall

  • Recognition is like seeing your friend’s face in a sea of faces

    • even if he has changed since you last saw him
    • storage intensive and fast
  • Recall is like figuring out how to repair your car’s carburetor using a manual when you have never done it before: applying knowledge to a new situation; processor intensive, with less storage

  • The brain works on recognition

  • Present day computers prefer recall – remember Y2K

  • Future computers would work like the brain- recognition



Recognition versus Recall: what it does to our DL

  • We will move away from quantitative search (key word match) to “aboutness” and content based retrieval

  • In future, documents will be read more by computers than by humans. Will it change the way we write? Would we think in HTML or in XML?

  • From mere text data to 3D objects, voice and video

  • Multilingual

  • Every conceivable form of knowledge expression



Technology Driven vision for The Digital Library

  • We can store everything

    • all the knowledge of the human race
    • in all forms
    • that is the Universal Digital Library
  • Cost of Selection is stationary but storage cost is plummeting



Education



Universal Library Vision

  • All recorded information online

  • instantly available

    • To Anyone
    • Anywhere in the world
    • In any language
    • searchable, browsable, navigable by humans and machines


Digital Library Contents

  • Books

  • Periodicals (journals, newspapers)

  • Art, photographs

  • Databases, software

  • Movies, video

  • Music, opera, dance

  • Suppose all of this were on the Web



Digital Library of the future

  • Digital library

  • Digital museum

  • Digital tour guide

  • Research assistant

  • Knowledge amplifier



Can we store all the human knowledge in a Digital form?

  • There are about 100 million books written by the human race

  • Multiply by 10 for all other forms of knowledge

  • 1 book = 500 pp. = 1 MB uncompressed

    • 10^9 books = 10^15 bytes = 1 petabyte
  • 140 million computers on the Internet

    • At 20 GB free space each: about 2.8 exabytes now
  • 1 GB of disk costs ~$1

    • 1 petabyte < $1 million
    • Our Peta Byte server Initiative
    • Storage is not the limitation but creation and coordination are
    • Avoiding Duplication and connectivity are
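
The storage estimate on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and uses only the figures quoted above; the variable names are illustrative.

```python
# Figures taken from the slide; decimal units are an assumption.
BOOKS = 100_000_000            # books written by the human race
OTHER_FORMS_FACTOR = 10        # "multiply by 10" for all other forms of knowledge
BYTES_PER_BOOK = 10**6         # 1 book = 500 pp. = 1 MB uncompressed
COMPUTERS = 140_000_000        # computers on the Internet
FREE_BYTES_EACH = 20 * 10**9   # 20 GB of free space per computer
DOLLARS_PER_GB = 1.0           # 1 GB of disk costs about $1

total_bytes = BOOKS * OTHER_FORMS_FACTOR * BYTES_PER_BOOK
free_bytes = COMPUTERS * FREE_BYTES_EACH
disk_cost = total_bytes / 10**9 * DOLLARS_PER_GB

print(f"All recorded knowledge: {total_bytes / 10**15:.1f} petabytes")
print(f"Free space on the net:  {free_bytes / 10**18:.1f} exabytes")
print(f"Disk cost at $1/GB:     ${disk_cost:,.0f}")
print(f"Free space / need:      {free_bytes // total_bytes}x")
```

On these assumptions the free disk space already on the network exceeds the requirement by three orders of magnitude, which supports the slide's point that storage is not the limiting factor.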


Universal Digital Library

  • More than 120 million PCs on the net

  • Each having at least 20 GB of free space

  • Peer to peer Communication

  • Can we store all the Human Knowledge in the computers?



Technology Driven Vision for the Universal Digital Library



The Strategy for Scanning of books

  • A planetary scanner like the Minolta PS 7000

  • It takes about two hours to scan a 500-page book, crop, OCR and convert it to TIFF, HTML and XML files

  • About 10,000 pages to the web in a day

  • Storage per book is around 60 MB

  • 100 terabytes is not an issue

  • Our partner, the Internet Archive, has 370 TB, adding 30 TB a day

  • Distributed databases
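
The per-book figures above imply a scanner workload and a total storage need that are easy to work out. The sketch below uses only the numbers on the slide plus the project's one-million-book target; decimal units (1 TB = 10^6 MB) are an assumption, and the variable names are illustrative.

```python
PAGES_PER_BOOK = 500       # from the slide
HOURS_PER_BOOK = 2         # scan, crop, OCR and convert one book
PAGES_PER_DAY = 10_000     # pages reaching the web in a day
MB_PER_BOOK = 60           # storage per digitized book
TARGET_BOOKS = 1_000_000   # the "Million Books" target

pages_per_scanner_hour = PAGES_PER_BOOK / HOURS_PER_BOOK
books_per_day = PAGES_PER_DAY / PAGES_PER_BOOK
scanner_hours_per_day = books_per_day * HOURS_PER_BOOK
total_terabytes = TARGET_BOOKS * MB_PER_BOOK / 10**6
days_at_this_rate = TARGET_BOOKS / books_per_day

print(f"One scanner handles {pages_per_scanner_hour:.0f} pages per hour")
print(f"{PAGES_PER_DAY:,} pages/day needs {scanner_hours_per_day:.0f} scanner-hours per day")
print(f"A million books need about {total_terabytes:.0f} TB")
print(f"At {PAGES_PER_DAY:,} pages/day a million books take {days_at_this_rate:,.0f} days")
```

The storage total sits comfortably below the 100 terabytes mentioned above; the last line shows why the project depends on many scanners working in parallel rather than on a single daily rate of this size.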



Process





Post scanning operations

  • Skew Correction

  • Document Registration

  • Dot Shading and Speck Removal

  • Image centering

  • Image Cropping

  • Smoothing and Completion
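
Two of the operations listed above, speck removal and image cropping, can be illustrated on a tiny binary image. This is a simplified sketch and not the project's actual pipeline: the list-of-lists image representation, the 8-neighbour rule for what counts as a speck, and the function names are all assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # 1 = ink, 0 = background (hypothetical representation)


def remove_specks(img: Image) -> Image:
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


def crop_to_content(img: Image) -> Image:
    """Crop to the bounding box of the ink; return the input unchanged if it is blank."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows:
        return img
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # an isolated speck
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],  # a 2x2 block standing in for printed text
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = crop_to_content(remove_specks(page))
print(cleaned)
```

Running speck removal before cropping matters: a stray speck near the page edge would otherwise enlarge the bounding box and defeat the crop.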



Image comparison

  • Original Image



Processed Image

  • SW 1



OCR CONVERSION



Performance evaluation for various fonts in Kannada language OCR





  • Average book size ~ 500 pages

  • Size of page as image ~ 50-150 KB

  • Size of page as text file (rtf/htm) ~ 8-15 KB

  • Average size of digitized book ~ 60 MB
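
The per-page sizes above can be scaled up to a whole book to see where the 60 MB figure sits. The sketch assumes decimal units (1 MB = 1000 KB) and treats the quoted ranges as lower and upper bounds; the helper name is illustrative.

```python
PAGES = 500          # average book size, from the slide
IMAGE_KB = (50, 150)  # size of a page as an image
TEXT_KB = (8, 15)     # size of a page as a text file (rtf/htm)


def book_mb(kb_range, pages=PAGES):
    """Scale a per-page size range in KB to a per-book range in MB."""
    low, high = kb_range
    return (low * pages / 1000, high * pages / 1000)


image_mb = book_mb(IMAGE_KB)
text_mb = book_mb(TEXT_KB)

print(f"Page images per book: {image_mb[0]:.0f} to {image_mb[1]:.0f} MB")
print(f"Text files per book:  {text_mb[0]:.1f} to {text_mb[1]:.1f} MB")
```

The quoted average of about 60 MB per digitized book falls inside the image range, which suggests that page images, not text, dominate the storage budget.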



Brightness: Dark (1 on the scale); contrast: 9 (on the scale)



Million Books to the Web: Stakeholders as Partners

  • Academia- CS, IS and users

  • Researchers and Language Technologists

  • Cultural and Religious Organizations

  • Public Libraries

  • Government Agencies

  • None too exclusive



Background and Status

  • Collaborative Project between India and US

  • Lead roles by CMU and IISc

  • Initiated by CMU sending scanners free of cost to India. NSF supported

  • Initiated by the Office of the Principal Scientific Advisor to GOI by a Seed funding to IISc

  • Fuelled by MCIT’s wholehearted support

  • More than 16 centres in academic, religious and government institutions spread across the country

  • 69 scanners in place

  • China, Egypt (Alexandria Library), Sri Lanka and Australia joining in

  • There is light on the other side of the tunnel



Hubs of DL Activities in India

  • Anna University, Chennai, Tamil Nadu

  • Arulmigu Kalasalingam College of Engineering, Srivilliputhur, Madurai, Tamil Nadu

  • Goa University, Goa

  • Indian Institute of Information Technology, Allahabad, Uttar Pradesh

  • International Institute of Information Technology, Hyderabad, Andhra Pradesh

  • City and State Central Library, Andhra Pradesh

  • Shanmugha Arts, Science, Technology & Research Academy, Thanjavur, Tamil Nadu

  • Sringeri Mutt, Sringeri, Karnataka

  • Tirumala Tirupathi Devasthanams, Tirupathi, Andhra Pradesh

  • Maharashtra Industrial Development Corporation, Maharashtra

  • University of Pune, Pune

  • Kanchi University, Kanchi, Tamil Nadu

  • Indian Institute of Astrophysics, Karnataka



Scanner Operation at Hubs



Progress of Various Centres in Scanning



Number of Pages Scanned



Category of Books



Cumulative Status



More Centres and Initiatives: already 61 scanners in operation + 39 in the pipeline

  • Rashtrapati Bhavan

  • Punjab Technical University

  • IIIT Hyderabad and University of Hyderabad



MCIT’s Initiatives

  • Mobile Van with VSAT for the Book Mobile

  • ERNET providing connectivity to all centres

  • Many Centres supported with funds for computers and for scanning operations

  • Total spending from Government support and from Scanning Centre’s resources is ten times more than the Scanning equipment cost and effectively 100 times more

  • Support from all quarters of the government, religious leaders, academia and private agencies

  • Universal Digital Library of India to be launched



Some Observations and the Road ahead

  • More than 5 million pages have been scanned

  • The highest average rate of sustained scanning was about 4,000 pages per day at Hyderabad during February.

  • Our goal is to establish best practices to reach 6000 pages a day

  • 3 years – 1 M Books

  • By 2020 – 20 Million Books, 2 Million Songs, 200,000 Movies

  • The most enviable content creation



Road Ahead

  • Establishing the Digital Library of India on the same lines as the E-Governance Initiative

  • Under the MCIT

  • Headquartered in AP

  • A think tank for content selection, delivery, technology and policy directions for the country

  • Creation of special funds for 4C



Criteria for Selecting Mega Centres- 5 of them planned

  • Geographical Distribution

  • Availability of contents of interest to larger user base

  • Local enthusiasm to support and sustain this activity

  • Budget of US$ 200,000 Initially and around 0.5 cent per page of output

  • A single scanner can produce 2 million pages a year

  • We will have 300 scanners – a Million books a year
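
The fleet figures above can be checked against the million-books-a-year claim. The sketch uses the slide's numbers together with the 500-page average book size quoted earlier in the talk; it reads "0.5 cent per page" as US$ 0.005, which is an interpretation, and the variable names are illustrative.

```python
PAGES_PER_SCANNER_YEAR = 2_000_000  # one scanner, pages per year
SCANNERS = 300                      # planned fleet
PAGES_PER_BOOK = 500                # average book size quoted earlier in the talk
CENTS_PER_PAGE = 0.5                # running cost, read here as US$ 0.005 per page

pages_per_year = PAGES_PER_SCANNER_YEAR * SCANNERS
books_per_year = pages_per_year / PAGES_PER_BOOK
running_cost_dollars = pages_per_year * CENTS_PER_PAGE / 100

print(f"Pages per year: {pages_per_year:,}")
print(f"Books per year: {books_per_year:,.0f}")
print(f"Running cost:   ${running_cost_dollars:,.0f} per year")
```

With these inputs the fleet produces somewhat more than a million books a year, so the slide's target leaves some headroom for downtime and rescans.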



Road Ahead

  • Mega Content Creation Centres

  • New Delhi, Varanasi, Allahabad, Hyderabad, Far East (Tawang or Guwahati), Kolkata and Chennai

  • Each centre having around 40 scanners and 5 mobile scanners

  • Content Creation Centres with up to 5 scanners in Gujarat, Rajasthan, so as to cover the entire country

  • Spearheading Language Technology Initiatives

  • Adding voice and video of our heritage



Universal Digital Library

  • Goal — To have all public knowledge online, available for free to all, everywhere

  • An achievable goal

    • There are only some 100,000,000 books in the world
    • A few billion dollars could bring these online
  • Limitations

    • Copyright and licensing issues
    • Different language books and character recognition technologies
      • We must ensure that English is not necessarily the de facto language
  • Universal Library



TECHNOLOGICAL CHALLENGES

  • Input (scanning, digitizing, OCR)

  • Data representation

    • text, notations, images, web pages
  • Navigation and Search

  • Multilingual Issues

  • Output (voice, pictures, virtual reality)

  • Synthetic Documents



SEARCH ENGINE of UDL

  • Very powerful light weight and scalable CMU search engine

  • Greenstone

  • Both are working and are being evaluated for the choice

  • Both have been modified for use as Indian Language search engines- language independent search

  • Future- Semantic web and content based retrieval – Speech input and speech output





Choice of Collection

  • Use books from libraries that are beyond copyright

  • Administrative metadata from OCLC, ISBN, and other sources

  • Dublin Core for Indian Books

  • Copyright metadata: aggressive attempts to obtain copyright; free copyright from many agencies, including GoI

  • Source Library Metadata

  • Converge towards focussed collection



Funding – Road Ahead

  • Funding effort must be an organized activity

  • Commercial funding unlikely for “public good” activity

    • Must go to governments, NGOs
  • World Bank

  • Qatar (if CMU deal succeeds)

  • Benefits of UDL:

    • Digital Opportunity
    • Use in distance education
    • International involvement – cultural diversity
    • Technology dissemination
    • Low cost v. conventional libraries
  • Funding is tied to Outreach (next slide)



Outreach

  • The UDL message must be disseminated

  • Present at World Summit (WSIS) in Geneva (12/03)

  • Pre-WSIS meeting at CERN (12/03)

  • Establish liaison with UN Decade of Literacy (2003-2013)

  • Points:

    • Terabyte servers
    • “Free to read” policy
    • Universal Dictionary (applicability to other domains)


Access by Public

  • All content free to read, print one page at a time

  • Restrictions imposed by donors will be respected

  • Categories of use will be recognized, e.g. cannot print entire document

  • Buttons, links to fulfillment houses and publishers are allowed- to take in “born Digital” copyrighted material



Partner Relations- Future

  • All material scanned or input as part of the UDL will be shared by all partners

  • Preference for national umbrella organizations to simplify international partner relations

  • Relationships between partners and their national DLs encouraged

  • Online communication and collaboration tools needed to facilitate partner questions and interchanges

  • Written partnership agreement will be made



Standards

  • Published standards within the UDL

  • Quality control and testing standard

  • Funding to be sought to support standards development

  • Logo to be developed (graphic device without words). Must appear on all sites, all pages

  • Logo should have a hot link to a gateway site that links all UDL sites

  • Local variability in look and feel of sites is permitted so long as the logo is displayed



Scanning/OCR Policy

  • We scan what gives greatest impetus to continued funding

  • Language: majority of content in English; otherwise no restriction

  • Scans will be previewed for minimum quality; OCR will not be corrected unless local site desires



Metadata

  • All entries MUST have metadata according to MARC or Dublin Core
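
A metadata requirement like this one is usually enforced by a check at intake time. The sketch below validates a record against the fifteen elements of the Dublin Core element set; the `REQUIRED` subset, the function name, and the sample record are hypothetical, since the slides do not say which elements the UDL treats as mandatory.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical house rule: the slides do not specify a mandatory subset.
REQUIRED = {"title", "creator", "language", "rights"}


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key in sorted(set(record) - DUBLIN_CORE_ELEMENTS):
        problems.append(f"unknown element: {key}")
    for key in sorted(REQUIRED - set(record)):
        problems.append(f"missing element: {key}")
    return problems


# A made-up record, for illustration only.
sample = {
    "title": "Example Title",
    "creator": "Example Author",
    "language": "kn",
    "rights": "public domain",
    "source": "Example Source Library",
}
print(check_record(sample))
print(check_record({"title": "Example Title", "pages": 500}))
```

Returning a list of problems rather than raising on the first one lets a scanning centre fix a whole batch of records in one pass.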



Copyright

  • Public domain materials: no restrictions, tools for printing entire document provided

  • Works of uncertain copyright status:

    • Good faith effort to determine status, locate owner
    • Scan and index work
    • After a waiting period (at least one month), make work viewable
  • Archival material (old but unique)

    • Allow resolution restriction to avoid devaluation of original
  • Out-of-print in-copyright (OPIC)

    • Seek blanket permissions from publishers
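
The four categories above amount to a small decision table, which can be sketched as a function. Everything beyond the slide's wording is an assumption made for illustration: the status labels, the 30-day reading of "at least one month", the returned fields, and the rule that only public domain works can be printed whole.

```python
from datetime import date, timedelta

# "At least one month" on the slide; 30 days is an assumed reading.
WAITING_PERIOD = timedelta(days=30)


def access_policy(status: str, indexed_on: date, today: date,
                  has_permission: bool = False) -> dict:
    """Map a work's copyright category to hypothetical access settings."""
    if status == "public_domain":
        # No restrictions; tools for printing the entire document are provided.
        return {"viewable": True, "print_whole": True, "full_resolution": True}
    if status == "uncertain":
        # Scanned and indexed at once, viewable only after the waiting period.
        waited = today - indexed_on >= WAITING_PERIOD
        return {"viewable": waited, "print_whole": False, "full_resolution": True}
    if status == "archival":
        # Resolution may be restricted to avoid devaluing the original.
        return {"viewable": True, "print_whole": False, "full_resolution": False}
    if status == "opic":
        # Out-of-print, in-copyright: depends on publisher permission.
        return {"viewable": has_permission, "print_whole": False, "full_resolution": True}
    raise ValueError(f"unknown status: {status}")


print(access_policy("uncertain", date(2004, 1, 1), date(2004, 1, 15)))
print(access_policy("uncertain", date(2004, 1, 1), date(2004, 2, 15)))
```

Keeping the rules in one function makes the policy easy to audit when a donor or publisher asks how a particular work is being served.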


Possible Intake Model



The Digital Library a Test Bed for language research

  • Rich data in many languages from the Million Books to the Web project: at least 10,000 books in any language

  • Translations in many languages (Gita, NBT, NCERT etc.): an excellent tool for language translation

  • Training data for the OCR

  • The case insensitive ITRANS standard



The Digital Library a Test Bed for language research

  • Rich data makes the creation of OCRs in Indian languages easy: rapid prototyping in Tamil, Kannada and Malayalam

  • Speech synthesis and recognition

  • Indian Language Search Engines

  • Example Based Machine Translation

  • Universal Dictionary



The Universal Dictionary



Aboutness Hierarchy- Dr Shamos



Legal and Business Challenges

  • Use of copyrighted material

  • Economics (Who pays? Who gets?)

  • Privacy

  • Reliability of information

  • Change in the nature of teaching

  • Change in the nature of Information creation and use



Philosophy of Copyright Laws

  • Protect the Inventor so that private investments in R & D would flow

  • Disseminate the information so that society grows

  • Protect fair use

  • Ensure you get what you paid for



What can be copyrighted ?

  • Must be tangible, e.g. a lecture can’t be copyrighted, a transcript of it can

  • Work must be original

  • Work must be creative - even minimal efforts usually count as creative



Fair use doctrine

  • Authorizes any person to make fair use of a published or unpublished copyrighted work (including the making of unauthorized copies) in these contexts:

  • In connection with criticism of or comment on the work

  • In the course of news reporting

  • For teaching purposes or

  • As part of scholarship or research activity



Four basic Factors:

  • The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

  • The nature of the copyrighted work

  • The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

  • The effect of the use upon the potential market for or value of the copyrighted work



www.library.org principles

  • Scholarly and government information and knowledge is a public good

    • that should be available, maintaining the balance of the rights of the individual creator vs. the needs of the public
  • The Library is the intellectual crossroads of the community.

  • Librarians will conceptualize and ensure

    • implementation of innovative new systems
      • for the creation and dissemination of information for succeeding generations.


“This rule provides that the first sale of a copy of a work to a member of the public ‘exhausts’ the rights holder’s ability to control further distribution of that copy. A library is thus free to lend, or even rent or sell, its copies of books to patrons”

  • How does this work in the Digital World ?



Music, Movie and Entertainment Industry

  • Much larger part of most of the economies

  • Large production costs

  • Need to protect business interest

  • Need for technology to protect

  • NAPSTER – peer to peer communication

  • DeCSS

  • NAPSTER for video ??

  • Consumer is different from the creator



New paradigms in the Digital Library

  • Should the laws used to protect commercially attractive enterprises such as patents, music and entertainment be applied to DLs?

  • The dissemination of information creates multiplication unlike in music etc

  • Shorter life cycles for the information



Copyright Conflicting requirements

  • Need to protect the financial interests of creators in order to encourage private investments to the economy

  • Need to create a framework for every human being to create

  • The 2nd principle should dominate in DL

  • The 1st principle should dominate the others



The Concept of FourC

  • The scientific community is the only one that is creator and consumer of information

  • It pays for both

  • The SW Industry had shown the way for freeware

  • Can we do it in Scholarly communication, text books etc.



The Concept of FourC

  • In the 20th Century, in the interest of public good the Governments created BBC, PBS, AIR and also the Public Library System- provided compensation for artists and writers while providing free access to public

  • Total global expenditure on public broadcasting and public libraries exceeds US$ 100 billion

  • Look at our kings who supported all the poets and scholars

  • We need to find the 21st Century equivalent of BBC, AIR and PBS.



The Concept of FourC

  • Learn from NAPSTER- will we have a video equivalent of NAPSTER

  • It is impossible to police and protect IP Rights at gigabit rate connections

  • Some countries and WIPO, under pressure from lobbying groups, frame draconian copyright laws

  • Remember the FAIR USE Doctrine- and what the creators want- recognition and compensation



The Solution -FourC

  • Consortium for Compensation of Creative Contents- FourC

  • Set aside 25% of the current national expenditure on public broadcasting and PLs

  • Authors are encouraged to put their work on the web after a few years of commercial exploitation; many models are possible, e.g. tax exemptions in return

  • India showing the way: IASc and INSA

  • Books out of print

  • Titanic effect

  • Authors can take back the copyright



The Solution -FourC

  • Authors compensation based on the hits

  • Future versions of text books may be FAQs and XMLised

  • Many economic models

  • Can work for Courseware as well



The Solution -FourC

  • The changing trend in publications: we want documents to be readable by machines as well as humans

  • Born digital documents

  • Can we compensate those for creating contents for the web

  • Can we compensate those who create music and movies for the web- really small form factor – small screens



Conclusion

  • Knowledge multiplies whenever bits are circulated on the web

  • Technology has a habit of creating a problem (by knowledge explosion) and spending the rest of its time in trying to solve it- through Digital Library

  • The Universal Digital Library with 20 million books by 2020, the year by which our President dreams of India becoming a developed nation

  • A FourC policy and a Digital Library Act are on the anvil in India to meet this mission

  • If a billion people sneeze- together we can create a Hurricane

  • With the technology of the two nations we will convert this hurricane into useful energy and light up the world of knowledge



If you are creating a digital library, it should be for access by anyone, anytime and from any place

  • If Your Digital Library Is For Exclusive Use, Let Us Talk About the Weather

  • There Is Nothing Called, Your DL, My DL

    • It Is Our DL
    • The Universal Digital Library



