Presentation of the clia project by Pushpak Bhattacharyya
by Pushpak Bhattacharyya, IIT Bombay, On behalf of the CLIA Consortium 12 Dec 2008
Motivation
CLIA is a real need
- Great language diversity in India
- Low comfort level with English: less than 5% of the total population of about 700 million can use English effectively
- Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and other sectors
- CLIA project started in 2006; domains: tourism and health
Geographically speaking
CLIA: basic information
Defining Diagram
CLIA Consortium Members (institute: assigned languages)
- IIT Bombay (Consortium Leader): Marathi, Hindi
- IIT-Kharagpur (consortium co-leader): Bengali
- IIIT Hyderabad: Telugu, Hindi
- Anna University-KBC: Tamil
- Anna University-College of Engg: Tamil
- ISI Kol: Bengali
- Jadavpur University Kolkata: Bengali
- CDAC-Pune: Marathi, Hindi, Tamil
- CDAC-Noida: Punjabi
- Utkal University: --
Principal Investigators (institute: name)
- IITB: Prof. Pushpak Bhattacharyya
- IIT-Kgp: Prof. Sudeshna Sarkar
- IIITH: Prof. Vasudev Verma
- AU-KBC: Prof. Sobha L.
- AU-CEG: Prof. Ranjani Parthasarthy
- JU Kol: Prof. Sivaji Bandyopadhya
- CDAC-P: Dr. Ajai Kumar
- CDAC-N: Dr. Karunesh Arora
- Utkal University: Prof. Sanghamitra Mohanty
Some prominent research members (institute: names)
- IITB: Manoj, Vishal, Vishaal, Ashish
- IIT-Kgp: Nimesh, Dr. Rajendra
- IIITH: Bhupal, Praneet
- AU-KBC: Pattavi, Vijay, Vijay
- AU-CEG: Kaviha, Subha Lalitha
- ISI Kol: Prasenjt, Deepashri, Ayan
- JU Kol: Asif, Pinaki
- CDAC-P: Swati, Abhishek
- CDAC-N: Gaur Mohan, Ankur
- Utkal University: Balbant Rai
Prior expertise brought to the project (horizontal, i.e., language independent) (institute: areas of prior expertise/experience)
- IITB: NLP (LR, WSD, MT), Semantic Search
- IIT-Kgp: Search and Ranking, Shallow Parsing
- IIITH: Commercial level search engine building, query processing
- AU-KBC: NER, Information Extraction, Summarization, Anaphora
- AU-CEG: Morphology, Interlingua
- ISI Kol: IR Evaluation, large scale IR system building (SMART)
- JU Kol: Example based MT, Summarization, NER
- CDAC-N: Parallel corpora, Query processing
- Utkal University: Machine Translation, Lexical Resources
Prior expertise brought to the project (vertical, i.e., language specific) (institute: areas of prior expertise/experience)
- IITB: Hindi Marathi wordnet building, Hindi Marathi shallow parsing
- IIT-Kgp: Bengali shallow parsing including MA
- IIITH: Telugu-Eng CLIR, Telugu query processing
- AU-KBC: Tamil NER, Tamil IE, Tamil Morph
- AU-CEG: Tamil Morph, Eng-Tamil MT
- ISI Kol: Bengali statistical stemming, large scale corpora for Bengali
- JU Kol: Bengali NER, EBMT involving Bengali
- CDAC-P: Various Indian language converters
- CDAC-N: Aligned parallel corpora for Indian languages
- Utkal University: --
Horizontal tasks of CLIA and the organizations responsible
- Input Query processing
- Crawling, Indexing
- Searching, Ranking
- User Interface
- File format processing
Horizontal tasks of CLIA and the organizations responsible (contd)
- Document Processing (index time NER, IE)
- Document Processing (Post Retrieval: Snippet, Summary)
- Distributed Search
- Evaluation, Relevance Judgement
- UNL based semantic search (for Tamil)
Languages and the organizations responsible (language: organizations)
- Bengali: IIT KGP (c), JU, ISI
- Hindi: IIITH (c), IITB, CDAC Noida
- Marathi: IITB (c), CDAC Pune
- Punjabi: CDAC Noida
- Tamil: AUKBC (c), AUCEG
- Telugu: IIITH
CLIA Important Dates
- Project Start Date: 29th Aug 06 (effectively Jan 2007)
- First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007
- Second PRSG: 30th Aug 2007
- Third PRSG: 08th March 2008
- Fourth PRSG: 15th July 2008
- Alpha version released: 15th July, 2008
- Beta version to be released (along with the 5th PRSG): January, 2009
Related consortium: E-IL MT project
- English to Indian Language MT
- Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
- Approaches: Statistical MT, Example Based MT
- Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA
Related consortium: IL-IL MT project
- Indian Language to Indian Language MT
- Indian Languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada
- Approach: Transfer Based
- Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC
All three projects are time bound and result oriented
- 2-year time frame (extension granted for 1 year)
- Strict deliverables
- For each project the budget outlay is about Rs 80 million (USD 2 million)
CLIA: Top level technological information
Process Flow
CLIA: achievements in 2 years (Jan 2007 to Dec 2008)
Tools and resources (copyrightable code and data)
Steps towards overall evaluation
- Yet to be completed: Precision, Recall, MAP, F-score etc.
- Large relevance judgment base under construction
- 50 queries per language (6 languages)
- About 5000 documents per language (6 languages)
- Crawled and indexed document base of English: approx. 600,000 pages
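The evaluation measures named above (Precision, Recall, F-score, MAP) can be illustrated with a minimal sketch. This is not the consortium's evaluation code; the function names and the input layout (a ranked list of document ids per query, and a set of judged-relevant ids per query) are assumptions made for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query.

    retrieved: iterable of document ids returned by the system (assumed layout)
    relevant:  set of document ids judged relevant (assumed layout)
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Average precision for one query: mean of precision at each relevant hit,
    divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """MAP over all judged queries.

    runs:  {query_id: ranked list of doc ids}
    qrels: {query_id: set of relevant doc ids}
    """
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]
    relevant = {"d1", "d3", "d5"}
    print(precision_recall_f1(ranked, relevant))
    print(average_precision(ranked, relevant))
```

Averaging per-query scores, as MAP does, keeps a query with many relevant documents from dominating the result.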
Copyright for CLIA (code)
Copyright for CLIA (data)
Conclusion
- Large scale national level activity
- Large number of tools and resources developed under the consortium
- Alpha release done in July, 2008
- Beta release to take place in Jan, 2009
- Look forward to more detailed interactions and suggestions from the international audience
Introducing people…
Principal Investigators (institute: name)
- IITB: Prof. Pushpak Bhattacharyya
- IIT-Kgp: Prof. Sudeshna Sarkar
- IIITH: Prof. Vasudev Verma
- AU-KBC: Prof. Sobha Nair
- AU-CEG: Prof. Ranjani Parthasarthy
- ISI Kol: Prof. Mandar Mitra
- JU Kol: Prof. Sivaji Bandyopadhya
- CDAC-P: Dr. Ajai Kumar
- CDAC-N: Dr. Karunesh Arora
- Utkal University: Prof. Sanghamitra Mohanty
Overview
- Technical Status of the Project
- Technical Documentation
- Shared resources
- Testing methodology
- Software Documentation
- Alpha and Beta versions
Technical Summary
Work Flow
Project Status
Status - Input Processing
Stemmer
MWE (multiword expressions)
- Guidelines are under discussion (IITB)
- Marathi: ~2000 MWE
- Bangla: ~600 MWE
- Tamil: ~600 MWE
- Punjabi: ~4000 MWE
Status – Input Processing : NER
Status - Input Processing
WSD (IITB)
- 2nd version WSD
- Interface for sense-marking of corpus developed by IITB
Dictionary
- IITB working on E-Hin linkage
- All LVs working on IL-IL linking and E-IL linking
- ~10,000 synsets generated from Tourism corpora
Status: Dictionary
Eng-Hin Linkage
- ~2500 synsets linked (IITB)
Sample Input screen
Project Status
Status – Search
cML-Text Converter (IIT-Kgp)
- First version of the engine is ready
- Software extracts the fields and body, but does not identify paragraphs and blocks in this version
- Has been tested for Bengali
- Ready to be integrated with Nutch
Project Status
Status – Document Processing
- Basic IE Engine and eleven IE Templates are ready (AUKBC)
- Has been tested with sample documents (EILMT corpus)
- First template, “How to reach the place”, is being translated into Tamil and Telugu
- For other languages, the inflectional markers are being provided
Project Status
Sample Output Screen
Status – Output Generation
Project Status
Status - Evaluation
Corpora
- Tourism and Health corpora are being collected for all languages
- News corpora are also being collected
- The news corpora cover the period 2002 to 2007
- For news corpora, ISI Kol is in discussions with TOI and Hindustan Times for permission to use their multilingual corpora
Details of Corpora (crawled)
Assumption in SRS:
- Each language corpus has at least 50,000 documents from General / News, plus all available documents in Tourism and Health
Evaluation: Topics
Topics (ISI Kol)
- A set of 95 topics is ready for evaluation
- 30 topics for training, 50 topics for testing and 15 topics as stand-by
- Each topic = Title + Narration + Description
- Translation of these 95 topics has been completed by all six language verticals
Sample Topic
- Title: Euro Inflation
- Description: Find documents about rises in prices after the introduction of the Euro
- Narration: Any document is relevant that provides information on the rise of prices in any country that introduced the common European currency.
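The topic layout and the 30 / 50 / 15 split described above can be sketched as a small data structure. The class name, field names, topic id and the seeded random split are assumptions made for illustration; the slides do not say how the topics were assigned to the three groups, and the mapping of the sample's second and third lines to "description" and "narration" follows the usual TREC-style convention rather than anything stated in the source.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """One evaluation topic: Title + Description + Narration (field names assumed)."""
    topic_id: int
    title: str
    description: str
    narration: str


def split_topics(topic_ids, n_train=30, n_test=50, n_standby=15, seed=0):
    """Split topic ids into training, testing and stand-by groups.

    The seeded shuffle is an assumption; only the group sizes come from the slides.
    """
    ids = list(topic_ids)
    if len(ids) != n_train + n_test + n_standby:
        raise ValueError("number of topics does not match the requested split")
    random.Random(seed).shuffle(ids)
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        "standby": ids[n_train + n_test:],
    }


# The sample topic from the slides; the id 1 is hypothetical.
sample = Topic(
    topic_id=1,
    title="Euro Inflation",
    description="Find documents about rises in prices after the introduction of the Euro",
    narration=(
        "Any document is relevant that provides information on the rise of prices "
        "in any country that introduced the common European currency."
    ),
)

if __name__ == "__main__":
    groups = split_topics(range(1, 96))
    print({name: len(ids) for name, ids in groups.items()})
```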
Evaluation Methodology
Benchmark data creation
- Sample documents (corpus)
- Sample queries / topics (95)
- Relevance judgement
- No. of relevance-judged Bangla documents: ~4,500
- Independently judged against 23 topics by each of two judges
- Pooling
- Pooling strategies adopted by TREC
- The top ~100 documents of each ranked list are taken
- Pool = union of these
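The pooling step above (take the top ~100 documents of each ranked list, then form the union) can be sketched in a few lines. The function name and the input layout are assumptions; only the depth of about 100 and the union come from the slides.

```python
def build_pool(ranked_lists, depth=100):
    """Return the judging pool for one topic.

    ranked_lists: iterable of ranked document-id lists, one per contributing
                  system or run (assumed layout)
    depth:        number of top-ranked documents taken from each list
    """
    pool = set()
    for ranked in ranked_lists:
        pool.update(ranked[:depth])  # top-`depth` documents of this list
    return pool


if __name__ == "__main__":
    run_a = ["d3", "d1", "d7", "d9"]
    run_b = ["d1", "d4", "d3", "d8"]
    # With depth=2 only the first two documents of each run enter the pool.
    print(sorted(build_pool([run_a, run_b], depth=2)))
```

Only documents in the pool are shown to the judges; documents outside it are conventionally treated as not relevant when scores are computed.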
Evaluation methodology
UNL
- Monolingual retrieval is working for Tamil documents
- 6500 words in UNL Dictionary
- Words + MWE indexed
Documents indexed
- No. of documents processed in Tourism: 564
- No. of Concept-Relation-Concept indexed: 11,754
- No. of Concept-Relation indexed: 11,754
- No. of Concepts indexed: 17,650
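The three index levels listed above (Concept-Relation-Concept, Concept-Relation, Concept) suggest an inverted index keyed at three granularities. The sketch below is one possible layout, not the project's implementation: the class name, the back-off search order and the example triples are all assumptions.

```python
from collections import defaultdict


class UnlIndex:
    """Hypothetical three-level inverted index over UNL-style triples."""

    def __init__(self):
        self.crc = defaultdict(set)  # (concept, relation, concept) -> doc ids
        self.cr = defaultdict(set)   # (concept, relation) -> doc ids
        self.c = defaultdict(set)    # concept -> doc ids

    def add(self, doc_id, triples):
        """Index every (head, relation, tail) triple of a document at all three levels."""
        for head, relation, tail in triples:
            self.crc[(head, relation, tail)].add(doc_id)
            self.cr[(head, relation)].add(doc_id)
            self.c[head].add(doc_id)
            self.c[tail].add(doc_id)

    def search(self, head, relation, tail):
        """Assumed back-off: exact triple first, then concept-relation, then concept."""
        for hits in (
            self.crc.get((head, relation, tail)),
            self.cr.get((head, relation)),
            self.c.get(head),
        ):
            if hits:
                return set(hits)
        return set()


if __name__ == "__main__":
    index = UnlIndex()
    # Illustrative triples only; "plc" is used here as a place relation.
    index.add("doc1", [("temple", "plc", "Madurai")])
    index.add("doc2", [("temple", "plc", "Thanjavur")])
    print(index.search("temple", "plc", "Madurai"))
    print(index.search("temple", "plc", "Chennai"))
```

Backing off from the most specific key to the least specific one trades precision for recall when the exact triple is not found.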
Testing Methodology
- Black box testing based on SRS and design documents
- Unit testing by each sub-system
- Test cases (format) and test reports
Integration testing
- Top-down / bottom-up, based on dependencies
- Stubs and drivers
- Sub-system wise testing (module-wise)
- Input processing
- Search and Retrieval
- Document processing
- Output Generation
- Evaluation
- UNL
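The "stubs and drivers" item above can be made concrete with a small sketch. In top-down integration a higher-level module is tested while stubs stand in for modules not yet integrated; in bottom-up integration a driver calls a low-level module directly. All function names and the canned data below are hypothetical and are not taken from the CLIA code base.

```python
def process_query(query, translate, search):
    """Higher-level module under test: translate the query, then search with it."""
    return search(translate(query))


def stub_translate(query):
    """Stub for the input-processing module: returns a canned translation."""
    canned = {"paryatan": "tourism"}
    return canned.get(query, query)


def stub_search(query):
    """Stub for the search module: returns a fixed, predictable result list."""
    return [f"doc-about-{query}"]


def tokenize(query):
    """Low-level module to be tested bottom-up."""
    return query.lower().split()


def driver_tokenize():
    """Driver: feeds test inputs to the low-level module and checks the results."""
    cases = {"Goa Beaches": ["goa", "beaches"], "": []}
    return all(tokenize(text) == expected for text, expected in cases.items())


if __name__ == "__main__":
    # Top-down: the real process_query runs against stubbed dependencies.
    print(process_query("paryatan", stub_translate, stub_search))
    # Bottom-up: the driver exercises tokenize on its own.
    print(driver_tokenize())
```

Passing dependencies as parameters keeps the module under test unchanged when stubs are later replaced by the real sub-systems.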
System Testing
Integration
- Use of controlled corpora for integration
- Use of EILMT English and Hindi parallel corpus
- ISI generates the queries for the corpus
- Translation of queries by all LVs
- English and Hindi synsets identified for building multilingual dictionary by each LV
- Each language vertical will be tested for its respective cross-lingual retrieval
- Information Extraction and output generation will be done on the same corpora
- Integration of each LV into Nutch at IITKgp
Test and Integration (contd.)
- Bug tracking system (Bugzilla) to be installed
- Currently planned for installation at IITB on the same server as CVS
Bugzilla
- Web-based general-purpose bug tracker tool
- Tracks not only software bugs but also other user-submitted tickets
- Eases communication between team members
- Can be integrated with CVS and WIKI
Bugzilla Requirements
- A compatible database management system: MySQL, PostgreSQL
- A suitable release of Perl 5
- A compatible web server
- A suitable mail transfer agent, or any SMTP server
Bugzilla Demo
- https://landfill.bugzilla.org/bugzilla-tip/index.cgi
Bugzilla - Design
- Bugs can be submitted by anybody, and will be assigned to a particular developer
Deployment diagram
Alpha Version
- ~10,000 documents in each language
- Low complexity system, hence a simple hardware configuration is sufficient
- Does not include Summary generation and Output translation
- Planned for Dec 2008
Beta Version
- ~1,000,000 (10 lakh) documents in each language
- Hardware configuration being worked out, based on disk space requirements, throughput of system, response times, simultaneous users etc.
- Following details are being worked out:
- Connectivity
- Where to host
- Support for hosting
- Planned for July 2008
Elitex08: Demo of Alpha Version
Plan to demonstrate the following:
- Cross-lingual information retrieval for all languages
- Information Extraction and translation of at least one template to Tamil / Telugu
- Snippet Generation (monolingual)
- Hardware integration – IITKgp
- Publicity management / Poster design - JU
- Funds: Participation fees to be shared
Demonstrate the same at IJCNLP08 exhibition (in Hyderabad - Jan 2008)
Gantt chart (as on Aug 30)
Software documentation
- SRS (based on IEEE)
- Design document v2.0 (based on RUP)
- User Requirements Document (Ver 5.0)
- Java docs
- Test cases template
- File naming conventions
- Testing and integration guidelines
- Code review guidelines
- Skip templates
Software documentation: SRS
- Introduction
- Overall description
- External interface requirements
- System features (module-wise)
- Advanced Search system for Tamil using UNL
Software documentation: DD
Design document (v 2.0)
- Has been simplified to suit project needs
- Introduction
- System Architecture
- Solution Architecture (brief description of systems, subsystems)
- Software Architecture (block diagrams)
- System Design
- Logical Design (Class Diagrams)
- Component Design (Component Diagrams)
- Appendix: other details
Software documentation: URD
- Introduction
- Objective
- Scope of the project
- Product perspective
- Capabilities of the Product
- User Characteristics
- Assumptions and dependencies
- Operational environment
- Input / Output scenarios
- Definitions, acronyms and abbreviations
- References
Software documentation: Test
Test case template: for all tests
Software documentation: File naming
File naming convention captures the following:
- Subject & domain of document
- Content Type (ppt / doc / rpt / Tr / etc)
- Name of Institute (IITB / ISI / IIITH etc.)
- Date of creation of doc (dd-mon-yy)
- Version no.
- Format
- Pattern: Subject_ContentType_Institute_Date_Version.Format
- E.g. PRSG_Pres_IITB_08dec07_v1.ppt
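The naming convention above lends itself to a small parser that checks file names against it. The sketch below is built only from the listed fields and the single example; the regular expression, the `v<number>` version form, the ddmonyy date layout and the assumption that two-digit years fall in 2000-2099 are all illustrative guesses, not documented rules.

```python
import re
from datetime import date
from typing import NamedTuple

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Subject_ContentType_Institute_ddmonyy_vN.format (assumed from the one example)
NAME_PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_(?P<content_type>[^_]+)_(?P<institute>[^_]+)_"
    r"(?P<day>\d{2})(?P<mon>[A-Za-z]{3})(?P<year>\d{2})_v(?P<version>\d+)"
    r"\.(?P<fmt>\w+)$"
)


class DocName(NamedTuple):
    subject: str
    content_type: str
    institute: str
    created: date
    version: int
    fmt: str


def parse_doc_name(filename):
    """Split a file name into its convention fields, or raise ValueError."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"name does not follow the convention: {filename!r}")
    month = MONTHS.get(match["mon"].lower())
    if month is None:
        raise ValueError(f"unknown month abbreviation in {filename!r}")
    created = date(2000 + int(match["year"]), month, int(match["day"]))
    return DocName(
        subject=match["subject"],
        content_type=match["content_type"],
        institute=match["institute"],
        created=created,
        version=int(match["version"]),
        fmt=match["fmt"],
    )


if __name__ == "__main__":
    print(parse_doc_name("PRSG_Pres_IITB_08dec07_v1.ppt"))
```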
Shareable Resources and Tools
Shared resources across projects
- From ILILMT to CLIA:
- Morph Analyzer
- POS Tagger
- Chunker
- Dictionary Standardization
- IL-IL Synsets
- From EILMT to CLIA
- From CLIA to other projects:
Collaborative tools used - CLIA
CLIA Wiki site
http://www.cfilt.iitb.ac.in/~consortia/dokuwiki
CLIA Wiki contents
- Project Team Contact details
- Project documentation (SRS, Design doc, URD..)
- Meeting minutes and presentations
- Project fund details
- Progress reports and timelines
- Project resources
- Corpus
- Collaborative platform for audio conferences
Wiki – Upload notification
Thank You