Presentation of the CLIA project by Pushpak Bhattacharyya





Presentation of the CLIA Project

  • by Pushpak Bhattacharyya, IIT Bombay

  • On behalf of the CLIA Consortium

  • 12 Dec 2008


Motivation



CLIA is a real need

  • Great language diversity in India

  • Low comfort level with English

    • Less than 5% of the total population of about 700 million can use English effectively
  • Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and other sectors

  • CLIA project started in 2006; domains: tourism and health



Geographically speaking



CLIA: basic information



Defining Diagram



CLIA Consortium Members

  • Name of Institute: Assigned Language(s)

  • IIT Bombay (Consortium Leader): Marathi, Hindi

  • IIT-Kharagpur (Consortium Co-leader): Bengali

  • IIIT Hyderabad: Telugu, Hindi

  • Anna University-KBC: Tamil

  • Anna University-College of Engg: Tamil

  • ISI Kol: Bengali

  • Jadavpur University Kolkata: Bengali

  • CDAC-Pune: Marathi, Hindi, Tamil

  • CDAC-Noida: Punjabi

  • Utkal University: --



Principal Investigators

  • Name of Institute: Names

  • IITB: Prof. Pushpak Bhattacharyya

  • IIT-Kgp: Prof. Sudeshna Sarkar

  • IIITH: Prof. Vasudeva Varma

  • AU-KBC: Prof. Sobha L.

  • AU-CEG: Prof. Ranjani Parthasarathi

  • ISI Kol: Prof. Mandar Mitra

  • JU Kol: Prof. Sivaji Bandyopadhyay

  • CDAC-P: Dr. Ajai Kumar

  • CDAC-N: Dr. Karunesh Arora

  • Utkal University: Prof. Sanghamitra Mohanty



Some prominent research members

  • Name of Institute: Names

  • IITB: Manoj, Vishal, Vishaal, Ashish

  • IIT-Kgp: Nimesh, Dr. Rajendra

  • IIITH: Bhupal, Praneet

  • AU-KBC: Pattavi, Vijay, Vijay

  • AU-CEG: Kaviha, Subha Lalitha

  • ISI Kol: Prasenjit, Deepashri, Ayan

  • JU Kol: Asif, Pinaki

  • CDAC-P: Swati, Abhishek

  • CDAC-N: Gaur Mohan, Ankur

  • Utkal University: Balbant Rai



Prior expertise brought to the project (Horizontal, i.e., language independent)

  • Name of Institute: Areas of prior expertise/experience

  • IITB: NLP (LR, WSD, MT), Semantic Search

  • IIT-Kgp: Search and Ranking, Shallow Parsing

  • IIITH: Commercial-level search engine building, query processing

  • AU-KBC: NER, Information Extraction, Summarization, Anaphora

  • AU-CEG: Morphology, Interlingua

  • ISI Kol: IR Evaluation, large-scale IR system building (SMART)

  • JU Kol: Example-based MT, Summarization, NER

  • CDAC-P: Converters, File format processors, MT

  • CDAC-N: Parallel corpora, Query processing

  • Utkal University: Machine Translation, Lexical Resources



Prior expertise brought to the project (vertical, i.e., language specific)

  • Name of Institute: Areas of prior expertise/experience

  • IITB: Hindi and Marathi wordnet building, Hindi and Marathi shallow parsing

  • IIT-Kgp: Bengali shallow parsing including MA

  • IIITH: Telugu-Eng CLIR, Telugu query processing

  • AU-KBC: Tamil NER, Tamil IE, Tamil Morph

  • AU-CEG: Tamil Morph, Eng-Tamil MT

  • ISI Kol: Bengali statistical stemming, large-scale corpora for Bengali

  • JU Kol: Bengali NER, EBMT involving Bengali

  • CDAC-P: Various Indian language converters

  • CDAC-N: Aligned parallel corpora for Indian languages

  • Utkal University: --



Horizontal tasks of CLIA and the organizations responsible

  • Input Query processing

    • IIIT Hyderabad
  • Crawling, Indexing

    • IIT KGP, IIITH, IITB
  • Searching, Ranking

    • IIT KGP, IIITH, IITB
  • User Interface

    • CDAC Noida
  • File format processing

    • CDAC Pune


Horizontal tasks of CLIA and the organizations responsible (contd)

  • Document Processing (index time NER, IE)

    • AU KBC
  • Document Processing (Post Retrieval: Snippet, Summary)

    • Jadavpur University
  • Distributed Search

    • IIT KGP, Utkal, CDACP
  • Evaluation, Relevance Judgement

    • ISI Kolkata
  • UNL based semantic search (for Tamil)

    • AU CEG


Languages and the organizations responsible

  • Language: Organization(s)

  • Bengali: IIT KGP (c), JU, ISI

  • Hindi: IIITH (c), IITB, CDAC Noida

  • Marathi: IITB (c), CDAC Pune

  • Punjabi: CDAC Noida

  • Tamil: AUKBC (c), AUCEG

  • Telugu: IIITH



CLIA Important Dates

  • Project Start Date: 29th Aug 06 (effectively Jan 2007)

  • First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007

  • Second PRSG: 30th Aug 2007

  • Third PRSG: 08th March 2008

  • Fourth PRSG: 15th July 2008

  • Alpha version released: 15th July, 2008

  • Beta version to be released (along with the 5th PRSG): January, 2009



Related consortium: E-IL MT project

  • English to Indian Language MT

  • Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil

  • Approaches: Statistical MT, Example Based MT

  • Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA



Related consortium: IL-IL MT project

  • Indian Language to Indian Language MT

  • Indian Languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada

  • Approach: Transfer Based

  • Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC



All three projects are time bound and result oriented

  • 2 years time frame (extension granted for 1 year)

  • Strict deliverables

  • For each project the budget outlay is about Rs 80 million (USD 2 million)



CLIA: Top level technological information



Process Flow





CLIA: achievements in 2 years (Jan 2007 to Dec 2008)

  • Tools and resources

  • (Copyrightable code and data)



Steps towards overall evaluation

  • Yet to be completed

    • Precision, Recall, MAP, F-score etc.
  • Large Relevance judgment base under construction

    • 50 queries per language (6 languages)
    • About 5000 documents per language (6 languages)
  • Crawled and indexed document base of English: approx 600,000 pages
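As a rough illustration of the measures listed above, the following Python sketch computes set-based precision, recall and F-score for one query, plus average precision, whose mean over queries gives MAP. The function names and the toy document IDs are assumptions made for illustration; they are not taken from the CLIA evaluation engine.

```python
from typing import Dict, Hashable, Sequence, Set, Tuple


def precision_recall_f1(
    retrieved: Sequence[Hashable], relevant: Set[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall and balanced F-score for one query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(
    runs: Dict[str, Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of average precision over all topics; runs maps topic -> (ranking, relevant set)."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]   # hypothetical system output
    relevant = {"d1", "d3", "d5"}       # hypothetical relevance judgements
    print(precision_recall_f1(ranked, relevant))
    print(average_precision(ranked, relevant))
```

Precision, recall and F-score ignore the order of the results, whereas average precision rewards systems that place relevant documents near the top of the list.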



Copyright for CLIA (code)



Copyright for CLIA (data)



Conclusion

  • Large scale national level activity

  • Large number of tools and resources developed under the consortium

  • Alpha release done in July, 2008

  • Beta release to take place in Jan, 2009

  • Look forward to more detailed interactions and suggestions from the international audience



Introducing people…



Principal Investigators

  • Name of Institute: Names

  • IITB: Prof. Pushpak Bhattacharyya

  • IIT-Kgp: Prof. Sudeshna Sarkar

  • IIITH: Prof. Vasudeva Varma

  • AU-KBC: Prof. Sobha Nair

  • AU-CEG: Prof. Ranjani Parthasarathi

  • ISI Kol: Prof. Mandar Mitra

  • JU Kol: Prof. Sivaji Bandyopadhyay

  • CDAC-P: Dr. Ajai Kumar

  • CDAC-N: Dr. Karunesh Arora

  • Utkal University: Prof. Sanghamitra Mohanty




Overview

  • Technical Status of the Project

  • Technical Documentation

  • Shared resources

  • Testing methodology

  • Software Documentation

  • Alpha and Beta versions



Technical Summary



Work Flow



Project Status



Status - Input Processing

  • Stemmer

  • MWE

    • Guidelines are under discussion (IITB)
    • Marathi ~2000 MWE; Bangla ~600 MWE; Tamil ~600 MWE; Punjabi ~4000 MWE


Status – Input Processing : NER



Status - Input Processing

  • WSD (IITB)

    • 2nd version WSD
    • Interface for Sense-marking of corpus developed by IITB
  • Dictionary

    • IITB working on E-Hin linkage
    • All LVs working on IL-IL linking and E-IL linking
    • ~10,000 synsets generated from Tourism corpora


Status: Dictionary

  • Eng-Hin Linkage

    • ~ 2500 synsets linked (IITB)


Sample Input screen

  • Input Screen



Project Status



Status – Search

  • Size of Indexed corpus



Status – Search

  • cML-Text Converter (IIT-Kgp)

    • First version of the engine is ready
    • Software extracts the fields and body, but does not identify paragraphs and blocks in this version
    • Has been tested for Bengali
    • Ready to be integrated with Nutch


Project Status



Status – Document Processing

  • Basic IE Engine and eleven IE Templates are ready (AUKBC)

  • Has been tested with sample documents (EILMT corpus)

  • First template “How to reach the place” is being translated into Tamil and Telugu

  • For other languages, the inflectional markers are being provided



Project Status



Sample Output Screen



Status – Output Generation



Project Status



Status - Evaluation

  • Corpora

    • Tourism and Health Corpora being collected for all languages
    • News corpora also being collected.
    • Period of news corpora ranges from 2002 to 2007
    • For news corpora, ISI Kol is in talks with TOI and Hindustan Times for permission to use their multilingual corpora


Details of Corpora (crawled)

  • Assumption in SRS:

    • Each language corpus has at least 50,000 documents from General / News + all available documents in Tourism and Health


Evaluation : Topics

  • Topics (ISI Kol)

    • A set of 95 topics is ready for evaluation
    • 30 topics for training, 50 topics for testing and 15 topics as stand-by
    • Each topic = Title + Narration + Description
    • Translation of these 95 topics has been completed by all six language verticals
    • Sample Topic
      • Euro Inflation
      • Find documents about rises in prices after the introduction of the Euro
      • Any document is relevant that provides information on the rise of prices in any country that introduced the common European currency.
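The topic structure and the 30/50/15 split described above could be represented as in the following Python sketch. The `Topic` class, the `split_topics` helper and the use of a seeded random shuffle are assumptions for illustration; the slides do not say how topics were assigned to the training, testing and stand-by sets.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Topic:
    """One evaluation topic: a short title, a description and a narration."""
    title: str
    description: str
    narration: str


def split_topics(
    topics: Sequence[Topic],
    n_train: int = 30,
    n_test: int = 50,
    n_standby: int = 15,
    seed: int = 0,
) -> Tuple[List[Topic], List[Topic], List[Topic]]:
    """Shuffle topics reproducibly and cut them into train / test / stand-by sets."""
    if len(topics) != n_train + n_test + n_standby:
        raise ValueError("topic count does not match the requested split sizes")
    shuffled = list(topics)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    standby = shuffled[n_train + n_test:]
    return train, test, standby


if __name__ == "__main__":
    sample = Topic(
        title="Euro Inflation",
        description="Find documents about rises in prices after the introduction of the Euro",
        narration="Any document is relevant that provides information on the rise of "
                  "prices in any country that introduced the common European currency.",
    )
    # Placeholder topics stand in for the other 94 real topics.
    topics = [sample] + [Topic(f"Topic {i}", "", "") for i in range(2, 96)]
    train, test, standby = split_topics(topics)
    print(len(train), len(test), len(standby))
```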


Evaluation Methodology

  • Benchmark data creation

      • Sample documents (corpus)
      • Sample Queries / Topics (95)
      • Relevance judgement
        • Number of relevance-judged Bangla documents: ~4,500
        • Independently judged against 23 topics by each of two judges
      • Pooling
        • Pooling strategies adopted by TREC
        • The top ~100 documents of each ranked list are taken
        • Pool = union of these


Evaluation methodology

  • Evaluation engine



UNL

  • Monolingual retrieval is working for Tamil documents

  • 6500 words in UNL Dictionary

  • Words + MWE indexed

  • Documents indexed

      • Number of documents processed in Tourism: 564
      • Number of Concept-Relation-Concept entries indexed: 11,754
      • Number of Concept-Relation entries indexed: 11,754
      • Number of Concepts indexed: 17,650
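To make the three index levels above concrete, here is a minimal Python sketch of an index keyed by Concept-Relation-Concept triples, Concept-Relation pairs and single Concepts. The class `UNLIndex`, its back-off search order and the sample concepts are assumptions for illustration; the slides do not describe how the AU-CEG system stores or queries these indices.

```python
from collections import defaultdict
from typing import DefaultDict, Optional, Set, Tuple


class UNLIndex:
    """Toy inverted index over UNL-style (concept, relation, concept) triples."""

    def __init__(self) -> None:
        self.crc: DefaultDict[Tuple[str, str, str], Set[str]] = defaultdict(set)
        self.cr: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)
        self.c: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, source: str, relation: str, target: str) -> None:
        """Index one triple from a document at all three granularities."""
        self.crc[(source, relation, target)].add(doc_id)
        self.cr[(source, relation)].add(doc_id)
        self.c[source].add(doc_id)
        self.c[target].add(doc_id)

    def search(
        self, source: str, relation: Optional[str] = None, target: Optional[str] = None
    ) -> Set[str]:
        """Try the most specific index first, then back off (an assumed strategy)."""
        if relation is not None and target is not None:
            hits = self.crc.get((source, relation, target))
            if hits:
                return set(hits)
        if relation is not None:
            hits = self.cr.get((source, relation))
            if hits:
                return set(hits)
        return set(self.c.get(source, set()))


if __name__ == "__main__":
    index = UNLIndex()
    # "plc" (place) is used here only as an example relation label.
    index.add("doc1", "temple", "plc", "Madurai")
    index.add("doc2", "temple", "plc", "Thanjavur")
    print(sorted(index.search("temple", "plc", "Madurai")))
    print(sorted(index.search("temple", "plc")))
```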


Testing Methodology

  • Testing methodology

    • Black box testing based on SRS and design documents
    • Unit testing by each sub-system
    • Test cases (format) and test reports
  • Integration testing

    • Top down / Bottom-up based on dependencies
    • Stubs and drivers
    • Sub-system wise testing (module-wise)
      • Input processing
      • Search and Retrieval
      • Document processing
      • Output Generation
      • Evaluation
      • UNL
  • System Testing

    • Performance testing


Integration

  • Use of controlled corpora for Integration

  • Use of EILMT English and Hindi parallel corpus

  • ISI generates the queries for corpus

  • Translation of queries by all LVs

  • English and Hindi synsets identified for building multilingual dictionary by each LV

  • Each language vertical will be tested for its respective cross-lingual retrieval

  • Information Extraction and output generation will be done on the same corpora

  • Integration of each LV into Nutch at IITKgp



Test and Integration (contd.)

  • Bug tracking system (Bugzilla) to be installed

  • Currently planned for installation at IITB on the same server as CVS

  • Bugzilla

    • Web-based general-purpose bug tracker tool
    • Tracks not only software bugs but also all other user-submitted tickets
    • Eases communication between team members
    • Can be integrated with CVS and WIKI


Bugzilla

  • Requirements

    • A compatible database management system, e.g. MySQL or PostgreSQL
    • A suitable release of Perl 5
    • A compatible web server
    • A suitable mail transfer agent, or any SMTP server
  • Bugzilla Demo

    • https://landfill.bugzilla.org/bugzilla-tip/index.cgi


Bugzilla - Design

  • Bugs can be submitted by anybody, and will be assigned to a particular developer



Deployment diagram



Hosting of Alpha and Beta versions

  • Alpha Version

    • ~10,000 documents in each language
    • Low complexity system
    • Hence simple hardware configuration sufficient
    • Does not include Summary generation and Output translation
    • Planned for Dec 2008
  • Beta Version

    • ~1,000,000 documents in each language
    • Hardware configuration being worked out - based on disk space requirements, throughput of system, response times, simultaneous users etc.
    • Following details are being worked out:
      • Connectivity
      • Where to host
      • Support for hosting
    • Planned for July 2008


Elitex08: Demo of Alpha Version

  • Plan to demonstrate the following:

    • Cross-lingual information retrieval for all languages
    • Information Extraction and translation of at least one template to Tamil / Telugu
    • Snippet Generation (monolingual)
    • Hardware integration – IITKgp
    • Publicity management / Poster design - JU
    • Funds: Participation fees to be shared
  • Demonstrate the same at IJCNLP08 exhibition (in Hyderabad - Jan 2008)



Gantt chart (as on Aug 30)




Software documentation

  • SRS (Based on IEEE)

  • Design document v2.0 (based on RUP)

  • User Requirements Document (Ver 5.0)

  • Java docs

  • Test cases template

  • File naming conventions

  • Testing and integration guidelines

  • Code review guidelines

  • Skip templates



Software documentation : SRS

  • SRS

    • Introduction
    • Overall description
    • External interface requirements
    • System features (module-wise)
    • Advanced Search system for Tamil using UNL


Software documentation: DD

  • Design document (v 2.0)

    • Has been simplified to suit project needs
      • Introduction
      • System Architecture
        • Solution Architecture (brief description of systems, subsystems)
        • Software Architecture ( block diagrams)
      • System Design
        • Logical Design (Class Diagrams )
        • Component Design (Component Diagrams )
      • Appendix - other details


Software documentation: URD

  • URD

    • Introduction
    • Objective
    • Scope of the project
    • Product perspective
    • Capabilities of the Product
    • User Characteristics
    • Assumptions and dependencies
    • Operational environment
    • Input / Output scenarios
    • Definitions, acronyms and abbreviations
    • References


Software documentation: Test

  • Test case template: for all tests



Software documentation: File naming

  • File naming convention captures the following:

    • Subject & domain of document
    • Content Type (ppt / doc / rpt / Tr / etc)
    • Name of Institute (IITB / ISI / IIITH etc.)
    • Date of creation of doc (dd-mon-yy)
    • Version no.
    • Format
      • <Subject>_<ContentType>_<Institute>_<Date>_<Version>.<Extension>
      • E.g. PRSG_Pres_IITB_08dec07_v1.ppt
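A file name following this convention can be checked and split into its parts with a short script. The Python sketch below is modelled on the example name PRSG_Pres_IITB_08dec07_v1.ppt; the regular expression, the `DocName` type and the reading of two-digit years as 20yy are assumptions, not part of the project's guidelines.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


class DocName(NamedTuple):
    subject: str
    content_type: str
    institute: str
    created: date
    version: int
    extension: str


# Subject_ContentType_Institute_ddmonyy_vN.ext, as in the example above.
_PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_(?P<ctype>[^_]+)_(?P<inst>[^_]+)_"
    r"(?P<day>\d{2})(?P<mon>[A-Za-z]{3})(?P<yy>\d{2})_v(?P<ver>\d+)\.(?P<ext>\w+)$"
)


def parse_doc_name(filename: str) -> Optional[DocName]:
    """Return the parsed parts of a file name, or None if it breaks the convention."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    month = MONTHS.get(match["mon"].lower())
    if month is None:
        return None
    try:
        # Assumption: two-digit years refer to 20yy.
        created = date(2000 + int(match["yy"]), month, int(match["day"]))
    except ValueError:
        return None
    return DocName(
        subject=match["subject"],
        content_type=match["ctype"],
        institute=match["inst"],
        created=created,
        version=int(match["ver"]),
        extension=match["ext"],
    )


if __name__ == "__main__":
    print(parse_doc_name("PRSG_Pres_IITB_08dec07_v1.ppt"))
```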


Shareable Resources and Tools

  • Shared Resources across projects

    • From ILILMT to CLIA:
      • Morph Analyzer
      • POS Tagger
      • Chunker
      • Dictionary Standardization
      • IL-IL Synsets
    • From EILMT to CLIA
      • Synsets E-IL
    • From CLIA to other projects:
      • NER engine
      • NE list
      • MWE


Collaborative tools used - CLIA



CLIA Wiki site

  • http://www.cfilt.iitb.ac.in/~consortia/dokuwiki

  • CLIA Wiki contents

    • Project Team Contact details
    • Project documentation (SRS, Design doc, URD..)
    • Meeting minutes and presentations
    • Project fund details
    • Progress reports and timelines
    • Project resources
    • Corpus
    • Collaborative platform for audio conferences



Wiki – Upload notification



Thank You


