Presentation of the CLIA Project

by
Pushpak Bhattacharyya,
IIT Bombay,
On behalf of
the CLIA Consortium
12 Dec 2008

Motivation

CLIA is a real need

Great language diversity in India
Low comfort level with English

less than 5% of the total population of about 700 million can use English effectively

Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and other sectors
CLIA project started in 2006; domains: tourism and health

Geographically speaking

CLIA: basic information

Defining Diagram

CLIA Consortium Members

Name of Institute                     Assigned Language(s)
IIT Bombay (Consortium Leader)        Marathi, Hindi
IIT-Kharagpur (consortium co-leader)  Bengali
IIIT Hyderabad                        Telugu, Hindi
Anna University-KBC                   Tamil
Anna University-College of Engg       Tamil
ISI Kol                               Bengali
Jadavpur University Kolkata           Bengali
CDAC-Pune                             Marathi, Hindi, Tamil
CDAC-Noida                            Punjabi
Utkal University                      --

Principal Investigators

Name of Institute   Names
IITB                Prof. Pushpak Bhattacharyya
IIT-Kgp             Prof. Sudeshna Sarkar
IIITH               Prof. Vasudev Verma
AU-KBC              Prof. Sobha L.
AU-CEG              Prof. Ranjani Parthasarthy
ISI Kol             Prof. Mandar Mitra
JU Kol              Prof. Sivaji Bandyopadhya
CDAC-P              Dr. Ajai Kumar
CDAC-N              Dr. Karunesh Arora
Utkal University    Prof. Sanghamitra Mohanty

Some prominent research members

Name of Institute   Names
IITB                Manoj, Vishal, Vishaal, Ashish
IIT-Kgp             Nimesh, Dr. Rajendra
IIITH               Bhupal, Praneet
AU-KBC              Pattavi, Vijay, Vijay
AU-CEG              Kaviha, Subha Lalitha
ISI Kol             Prasenjt, Deepashri, Ayan
JU Kol              Asif, Pinaki
CDAC-P              Swati, Abhishek
CDAC-N              Gaur Mohan, Ankur
Utkal University    Balbant Rai

Prior expertise brought to the project (Horizontal, i.e., language independent)

Name of Institute   Areas of prior expertise/experience
IITB                NLP (LR, WSD, MT), Semantic Search
IIT-Kgp             Search and Ranking, Shallow Parsing
IIITH               Commercial level search engine building, query processing
AU-KBC              NER, Information Extraction, Summarization, Anaphora
AU-CEG              Morphology, Interlingua
ISI Kol             IR Evaluation, large scale IR system building (SMART)
JU Kol              Example based MT, Summarization, NER
CDAC-P              Converters, File format processors, MT
CDAC-N              Parallel corpora, Query processing
Utkal University    Machine Translation, Lexical Resources

Prior expertise brought to the project (vertical, i.e., language specific)

Name of Institute   Areas of prior expertise/experience
IITB                Hindi Marathi wordnet building, Hindi Marathi shallow parsing
IIT-Kgp             Bengali shallow parsing including MA
IIITH               Telugu-Eng CLIR, Telugu query processing
AU-KBC              Tamil NER, Tamil IE, Tamil Morph
AU-CEG              Tamil Morph, Eng-Tamil MT
ISI Kol             Bengali statistical stemming, large scale corpora for Bengali
JU Kol              Bengali NER, EBMT involving Bengali
CDAC-P              Various Indian language converters
CDAC-N              Aligned parallel corpora for Indian languages
Utkal University    --

Horizontal tasks of CLIA and the organizations responsible

Input Query processing

IIIT Hyderabad

Crawling, Indexing

IIT KGP, IIITH, IITB

Searching, Ranking

IIT KGP, IIITH, IITB

User Interface

CDAC Noida

File format processing

CDAC Pune

Horizontal tasks of CLIA and the organizations responsible (contd)

Document Processing (index time NER, IE)

AU KBC

Document Processing (Post Retrieval: Snippet, Summary)

Jadavpur University

Distributed Search

IIT KGP, Utkal, CDACP

Evaluation, Relevance Judgement

ISI Kolkata

UNL based semantic search (for Tamil)

AU CEG

Languages and the organizations responsible

Language    Organization(s)
Bengali     IIT KGP (c), JU, ISI
Hindi       IIITH (c), IITB, CDAC Noida
Marathi     IITB (c), CDAC Pune
Punjabi     CDAC Noida
Tamil       AUKBC (c), AUCEG
Telugu      IIITH

CLIA Important Dates

Project Start Date: 29th Aug 06 (effectively Jan 2007)
First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007
Second PRSG: 30th Aug 2007
Third PRSG: 08th March 2008
Fourth PRSG: 15th July 2008
Alpha version released: 15th July, 2008
Beta version to be released (along with the 5th PRSG): January, 2009

Related consortium: E-IL MT project

English to Indian Language MT
Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
Approaches: Statistical MT, Example Based MT
Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA

Related consortium: IL-IL MT project

Indian Language to Indian Language MT
Indian Languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada
Approach: Transfer Based
Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC

All three projects are time bound and result oriented

2-year time frame (extension granted for 1 year)
Strict deliverables
For each project the budget outlay is about Rs 80 million (USD 2 million)

CLIA: Top level technological information

Process Flow

CLIA: achievements in 2 years (Jan 2007 to Dec 2008)

Tools and resources
(Copyrightable code and data)

Steps towards overall evaluation

Yet to be completed

Precision, Recall, MAP, F-score etc.

Large Relevance judgment base under construction

50 queries per language (6 languages)
About 5000 documents per language (6 languages)

Crawled and indexed document base of English: approx 600,000 pages
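
The evaluation measures named on this slide (precision, recall, F-score, MAP) can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the consortium's evaluation engine; the function names and the document ids in the usage note are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_f1(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and balanced F-score for one query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """MAP: mean of average precision over all judged queries."""
    scores = [average_precision(runs.get(query, []), rel) for query, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

For a ranked list `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, precision is 0.5, recall is 1.0 and average precision is (1/1 + 2/3) / 2.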

Copyright for CLIA (code)

Copyright for CLIA (code) contd.

Copyright for CLIA (code) contd.

Copyright for CLIA (data)

Copyright for CLIA (data) contd.

Copyright for CLIA contd.

Conclusion

Large scale national level activity
Large number of tools and resources developed under the consortium
Alpha release done in July, 2008
Beta release to take place in Jan, 2009
Look forward to more detailed interactions and suggestions from the international audience

Introducing people…

Principal Investigators

Name of Institute   Names
IITB                Prof. Pushpak Bhattacharyya
IIT-Kgp             Prof. Sudeshna Sarkar
IIITH               Prof. Vasudev Verma
AU-KBC              Prof. Sobha Nair
AU-CEG              Prof. Ranjani Parthasarthy
ISI Kol             Prof. Mandar Mitra
JU Kol              Prof. Sivaji Bandyopadhya
CDAC-P              Dr. Ajai Kumar
CDAC-N              Dr. Karunesh Arora
Utkal University    Prof. Sanghamitra Mohanty

Some prominent research members

Name of Institute   Names
IITB                Manoj, Vishal, Vishaal, Ashish
IIT-Kgp             Nimesh, Dr. Rajendra
IIITH               Bhupal, Praneet
AU-KBC              Pattavi, Vijay, Vijay
AU-CEG              Kaviha, Subha Lalitha
ISI Kol             Prasenjt, Deepashri, Ayan
JU Kol              Asif, Pinaki
CDAC-P              Swati, Abhishek
CDAC-N              Gaur Mohan, Ankur
Utkal University    Balbant Rai

Overview

Technical Status of the Project
Technical Documentation
Shared resources
Testing methodology
Software Documentation
Alpha and Beta versions

Technical Summary

Work Flow

Project Status

Status - Input Processing

Stemmer

All Language stemmers developed
Integrated with Nutch through plug-ins
Monolingual retrievals are working
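
The role of a stemmer in monolingual retrieval can be sketched as below. This is only a toy longest-suffix stripper with a hypothetical English suffix list; it does not reproduce the CLIA stemmers or the Nutch plug-in interface, neither of which is described on the slide.

```python
from typing import Iterable

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ations", "ation", "ings", "ing", "es", "s")


def stem(word: str, suffixes: Iterable[str] = SUFFIXES, min_stem_len: int = 3) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word
```

Applying the same function to query terms and to document terms at index time lets "bookings" in a query match "booking" in a document, since both reduce to "book".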

MWE (multi-word expressions)

Guidelines are under discussion (IITB)
Marathi: ~2000 MWE
Bangla: ~600 MWE
Tamil: ~600 MWE
Punjabi: ~4000 MWE

Status – Input Processing : NER

Status - Input Processing

WSD (IITB)

2nd version WSD
Interface for Sense-marking of corpus developed by IITB

Dictionary

IITB working on E-Hin linkage
All LVs working on IL-IL linking and E-IL linking
~10,000 synsets generated from Tourism corpora

Status: Dictionary

Eng-Hin Linkage

~ 2500 synsets linked (IITB)

Sample Input screen

Input Screen

Sample Input screen

Project Status

Status – Search

Size of Indexed corpus

Status – Search

cML-Text Converter (IIT-Kgp)

First version of the engine is ready
Software extracts the fields and body, but does not identify paragraphs and blocks in this version
Has been tested for Bengali
Ready to be integrated with Nutch

Project Status

Status – Document Processing

Basic IE Engine and eleven IE Templates are ready (AUKBC)
Has been tested with sample documents (EILMT corpus)
First template, “How to reach the place”, is being translated into Tamil and Telugu
For other languages, the inflectional markers are being provided

Project Status

Sample Output Screen

Sample Output screen

Sample Output Screen

Status – Output Generation

Snippet Generation (JU)

Working for monolingual retrieval
Integrated with Nutch
Has been tested for Bengali

Project Status

Status - Evaluation

Corpora

Tourism and Health Corpora being collected for all languages
News corpora also being collected.
Period of news corpora ranges from 2002 to 2007
For news corpora, ISI Kol is in discussion with TOI and Hindustan Times for permission to use their multilingual corpora

Details of Corpora (crawled)

Assumption in SRS:

Each language corpus has at least 50,000 documents from General / News + all available documents in Tourism and Health

Evaluation : Topics

Topics (ISI Kol)

A set of 95 topics is ready for evaluation
30 topics for training, 50 topics for testing and 15 topics as stand-by
Each topic = Title + Narration + Description
Translation of these 95 topics has been completed by all six language verticals
Sample Topic

Title: Euro Inflation
Description: Find documents about rises in prices after the introduction of the Euro
Narration: Any document that provides information on the rise of prices in any country that introduced the common European currency is relevant.
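
The topic structure and the 30/50/15 split described above could be represented as follows. This is a sketch under assumptions: the class name, the field names and the sequential split are illustrative, and the slide does not say how topics were actually assigned to the three sets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Topic:
    """One evaluation topic: Title + Description + Narration."""
    topic_id: int
    title: str
    description: str
    narration: str


def split_topics(
    topics: List[Topic], n_train: int = 30, n_test: int = 50, n_standby: int = 15
) -> Tuple[List[Topic], List[Topic], List[Topic]]:
    """Split topics sequentially into training, testing and stand-by sets (assumed order)."""
    if len(topics) != n_train + n_test + n_standby:
        raise ValueError("topic count does not match the requested split sizes")
    train = topics[:n_train]
    test = topics[n_train:n_train + n_test]
    standby = topics[n_train + n_test:]
    return train, test, standby
```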

Evaluation Methodology

Benchmark data creation

Evaluation Methodology

Benchmark data creation

Sample documents (corpus)
Sample Queries / Topics (95)
Relevance judgement

No. of relevance-judged Bangla documents: ~4,500
Each document judged independently against 23 topics by two judges

Pooling

Pooling strategies adopted by TREC
The top ~100 documents from each ranked list are taken
Pool = union of these
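
The pooling step above can be sketched in a few lines. This is a minimal illustration for a single topic; the function name, the run names and the document ids are assumptions, and the depth of 100 follows the "top ~100" figure on the slide.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents from each ranked list for one topic."""
    pool: Set[str] = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool
```

Only documents in the pool are shown to the judges, which keeps the relevance judgement effort manageable for large corpora.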

Evaluation methodology

Evaluation engine

UNL

Monolingual retrieval is working for Tamil documents
6500 words in UNL Dictionary
Words + MWE indexed
Documents indexed

No. of documents processed in Tourism: 564
No. of Concept-Relation-Concept entries indexed: 11,754
No. of Concept-Relation entries indexed: 11,754
No. of Concepts indexed: 17,650
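
The three index levels counted above (Concept-Relation-Concept, Concept-Relation, Concept) suggest an index that can be queried at decreasing levels of specificity. The sketch below is one possible reading, not the AU-CEG implementation: the class, its methods and the back-off search strategy are assumptions, and the relation label in the usage note is illustrative.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class ConceptIndex:
    """Toy three-level index over (concept, relation, concept) triples."""

    def __init__(self) -> None:
        self.crc: Dict[Tuple[str, str, str], Set[str]] = defaultdict(set)
        self.cr: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self.c: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, source: str, relation: str, target: str) -> None:
        """Index one triple from a document at all three levels."""
        self.crc[(source, relation, target)].add(doc_id)
        self.cr[(source, relation)].add(doc_id)
        self.c[source].add(doc_id)
        self.c[target].add(doc_id)

    def search(self, source: str, relation: Optional[str] = None,
               target: Optional[str] = None) -> Set[str]:
        """Try the most specific level first, then back off (assumed strategy)."""
        if relation is not None and target is not None:
            hits = self.crc.get((source, relation, target))
            if hits:
                return set(hits)
        if relation is not None:
            hits = self.cr.get((source, relation))
            if hits:
                return set(hits)
        return set(self.c.get(source, set()))
```

For example, after indexing `("doc1", "temple", "plc", "Madurai")` and `("doc2", "temple", "plc", "Chennai")`, a search for temple-plc-Madurai returns only doc1, while a search for temple-plc with an unseen target backs off to both documents.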

Testing Methodology

Testing methodology

Black box testing based on SRS and design documents
Unit testing by each sub-system
Test cases (format) and test reports

Integration testing

Top down / Bottom-up based on dependencies
Stubs and drivers
Sub-system wise testing (module-wise)

Input processing
Search and Retrieval
Document processing
Output Generation
Evaluation
UNL
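
Choosing a bottom-up or top-down integration order from sub-system dependencies can be sketched with a topological sort. The dependency map below is hypothetical (the slide lists the sub-systems but not their dependencies, and UNL is left out for that reason); the sketch uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each sub-system maps to the sub-systems it relies on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "Input processing": set(),
    "Search and Retrieval": {"Input processing"},
    "Document processing": {"Search and Retrieval"},
    "Output Generation": {"Document processing"},
    "Evaluation": {"Output Generation"},
}


def bottom_up_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Sub-systems with no unmet dependencies come first (tested with drivers)."""
    return list(TopologicalSorter(depends_on).static_order())


def top_down_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Reverse order: dependent sub-systems first, with stubs standing in for the rest."""
    return list(reversed(bottom_up_order(depends_on)))
```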

System Testing

Performance testing

Integration

Use of controlled corpora for Integration
Use of EILMT English and Hindi parallel corpus
ISI generates the queries for corpus
Translation of queries by all LVs
English and Hindi synsets identified for building multilingual dictionary by each LV
Each language vertical will be tested for their respective cross-lingual retrieval
Information Extraction and output generation will be done on the same corpora
Integration of each LV into Nutch at IITKgp

Test and Integration (contd.)

Bug tracking system (Bugzilla) to be installed
Currently planned for installation at IITB on the same server as CVS
Bugzilla

Web-based general-purpose bug tracker tool
Tracks not only software bugs but also all other user-submitted tracking tickets
Eases communication between team members
Can be integrated with CVS and WIKI

Bugzilla

Requirements

A compatible database management system – MySQL, PostgreSQL
A suitable release of Perl 5
A compatible web server
A suitable mail transfer agent, or any SMTP server

Bugzilla Demo

https://landfill.bugzilla.org/bugzilla-tip/index.cgi

Bugzilla - Design

Bugs can be submitted by anybody, and will be assigned to a particular developer

Deployment diagram

Hosting of Alpha and Beta versions

Alpha Version

~10,000 documents in each language
Low complexity system
Hence simple hardware configuration sufficient
Does not include Summary generation and Output translation
Planned for Dec 2008

Beta Version

~10,00,000 (1 million) documents in each language
Hardware configuration being worked out - based on disk space requirements, throughput of system, response times, simultaneous users etc.
Following details are being worked out:

Connectivity
Where to host
Support for hosting

Planned for July 2008

Elitex08: Demo of Alpha Version

Plan to demonstrate the following:

Cross-lingual information retrieval for all languages
Information Extraction and translation of at least one template to Tamil / Telugu
Snippet Generation (monolingual)
Hardware integration – IITKgp
Publicity management / Poster design - JU
Funds: Participation fees to be shared

Demonstrate the same at IJCNLP08 exhibition (in Hyderabad - Jan 2008)

Gantt chart (as on Aug 30)

Software documentation

SRS (Based on IEEE)
Design document v2.0 (based on RUP)
User Requirements Document (Ver 5.0)
Javadocs
Test cases template
File naming conventions
Testing and integration guidelines
Code review guidelines
Skip templates

Software documentation : SRS

Introduction
Overall description
External interface requirements
System features (module-wise)
Advanced Search system for Tamil using UNL

Software documentation : DD

Design document (v 2.0)

Has been simplified to suit project needs

Introduction
System Architecture

Solution Architecture (brief description of systems, subsystems)
Software Architecture ( block diagrams)

System Design

Logical Design (Class Diagrams )
Component Design (Component Diagrams )

Appendix - other details

Software documentation : URD

Introduction
Objective
Scope of the project
Product perspective
Capabilities of the Product
User Characteristics
Assumptions and dependencies
Operational environment
Input / Output scenarios
Definitions, acronyms and abbreviations
References

Software documentation : Test

Test case template: for all tests

Software documentation : File naming

File naming convention captures the following:

Subject & domain of document
Content Type (ppt / doc / rpt / Tr / etc)
Name of Institute (IITB / ISI / IIITH etc.)
Date of creation of doc (dd-mon-yy)
Version no.
Format

Subject_ContentType_Institute_Date_Version.Format
E.g. PRSG_Pres_IITB_08dec07_v1.ppt
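
A file name following this convention can be parsed and rebuilt as sketched below. The regular expression is inferred from the single example on the slide (date written as `08dec07`, version as `v1`), so the exact field shapes are assumptions, as are the class and function names.

```python
import re
from typing import NamedTuple


class DocName(NamedTuple):
    subject: str
    content_type: str
    institute: str
    date: str       # e.g. "08dec07"
    version: int
    fmt: str        # file extension, e.g. "ppt"


# Pattern inferred from the example name; underscores separate the fields.
PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_(?P<content_type>[^_]+)_(?P<institute>[^_]+)"
    r"_(?P<date>\d{2}[a-z]{3}\d{2})_v(?P<version>\d+)\.(?P<fmt>\w+)$"
)


def parse_name(filename: str) -> DocName:
    """Split a file name into its fields, or raise ValueError if it does not conform."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename}")
    return DocName(
        match["subject"], match["content_type"], match["institute"],
        match["date"], int(match["version"]), match["fmt"],
    )


def build_name(doc: DocName) -> str:
    """Assemble a file name from its fields."""
    return (f"{doc.subject}_{doc.content_type}_{doc.institute}"
            f"_{doc.date}_v{doc.version}.{doc.fmt}")
```

A check like `parse_name` could be run before a document is uploaded to the shared wiki, so that non-conforming names are caught early.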

Shareable Resources and Tools

Shared Resources across projects

From ILILMT to CLIA:

Morph Analyzer
POS Tagger
Chunker
Dictionary Standardization
IL-IL Synsets

From EILMT to CLIA

Synsets E-IL

From CLIA to other projects:

NER engine
NE list
MWE

Collaborative tools used - CLIA

CLIA Wiki site

http://www.cfilt.iitb.ac.in/~consortia/dokuwiki
CLIA Wiki contents

Project Team Contact details
Project documentation (SRS, Design doc, URD..)
Meeting minutes and presentations
Project fund details
Progress reports and timelines
Project resources
Corpus
Collaborative platform for audio conferences

CLIA Wiki site

Wiki – Upload notification

Thank You
