To ensure effective planning, implementation, co-ordination and achievement of the project activities, including timely production of deliverables and successful completion of the tasks.
To provide project structure and support that assist decision making and internal and external communications, encourage greater accountability and control, minimize risks, and identify, address and exploit related opportunities.
To help the partnership achieve its project objectives.
Description of work (ii)
Project management encompasses contractual, technical, administrative and financial issues, communication and knowledge management inside the project, and external relationships between the project and the EC. The General Manager and Project Co-ordinator (XEROX SA) will lead the workpackage, assisted by the members of the Project Management Committee (PMC) in general management and by the ELQM (External Liaison Quality Manager) for communication and quality matters. Workpackage leaders will also play a substantial role.
The following tasks are foreseen:
T1.1: Project planning: Detailed project planning and scheduling, including tasks, responsibilities and timescales. Establishment of the PMC and ELQM units.
T1.2: General project management activities: Establishment of project communication protocols and of internal and external management communication strategies; adaptation and review of the work plan; monitoring of feedback; establishment of liaisons with WP leaders and the PMC; maintenance of up-to-date project-wide financial records.
T1.3: Reporting: Establish the reporting structure, collect and consolidate partners' reports, ensure that cost claims are duly justified and documented for each partner, provide a view of the progress and results of the project, and produce interim and final reports.
Technical coordination of the project will be assured by CELI, which by its very nature (a research-performing SME) is at the crossroads of the worlds of NLP research and industrial optimization.
See Appendix X for additional tasks, outputs and events.
Deliverables (iii)
D1.1 Consortium Agreement
D1.2. Periodic Activity Report including: activities carried out by the Consortium, progress towards the project objectives, milestones and deliverables, problems encountered and corrective action taken and updated plan for using and disseminating knowledge. (Every 6 months).
D1.3. Progress Report, including justification of resources, claimed costs and distribution of the Community financial contribution. (Every 6 months).
D1.4 Annual Report
D1.5 Final Report
Work package number: 2
Start date or starting event: 1
Work package title: Language Resources
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
9
9
12
Objectives (i)
This WP deals with linguistic resources to be used in several WPs. The idea behind the project is that software originally available under an open source license is developed under an open source model. However, some language processing software and resources will necessarily remain proprietary. They therefore need to be hidden behind a service layer, so that any vendor or user can replace the consortium’s linguistic software and resources with its own or with those of another party. The consortium will provide resources for Italian, French, English, German, Dutch, Arabic and Polish.
In the context of this WP GALATEAS will also tailor Named Entity Recognition algorithms to the task of NE extraction from queries, which calls for a different configuration than algorithms that work on well-formed, syntax-rich texts.
Description of work (ii)
T2.1: NLP Service wrapping: Several processors available to different partners will be wrapped behind UIMA web services, using a shared type system. The application server will be Axis2. The level of granularity should fit the purposes of language-based log analysis, log data mining and machine translation; it will therefore include, as a minimum, lemmatisation, morphological analysis, disambiguation and chunking, and word-by-word translation. For some languages (Italian, French and English), dependency analysis will also be made available.
T2.2: NE extraction from query: Our approach will combine mainly unsupervised techniques for volume with supervised techniques for quality. We will rely mainly on the unsupervised, language-independent algorithms developed by Pouliquen and Steinberger at JRC for the identification of references to persons and geographical locations, and on the unsupervised link-based disambiguation methods developed by Etzioni's group (Etzioni et al., 2008) and by Popescu and Magnini at FBK (2007); the latter method participated with good success in the Web People SEMEVAL-2007 task. This approach will be combined, in a voter paradigm, with heuristic-based methods developed by CELI. This component will also be hidden behind a UIMA/Axis2 web service layer.
T2.3: Language identification component: This component guesses the language of a query. A plain HMM-based open source package for character-based language identification will be used. For languages with similar graphemic structure, it will be complemented by a dictionary-based system.
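As an illustration of the idea behind T2.3, the following is a minimal sketch of character-based language identification with a dictionary back-off for close languages. It is not the component foreseen by the project: the HMM is replaced here by a simpler character-bigram model with add-one smoothing, and the training samples, word lists, the `margin` threshold and all function names are hypothetical toy values chosen for the example.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the character bigrams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class CharBigramModel:
    """Character-bigram model with add-one smoothing (a stand-in for the HMM)."""

    def __init__(self, sample):
        self.counts = Counter(char_bigrams(sample))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen bigrams

    def log_prob(self, text):
        denom = self.total + self.vocab
        return sum(math.log((self.counts[b] + 1) / denom) for b in char_bigrams(text))


def identify(query, models, dictionaries, margin=1.0):
    """Guess the language of a query.

    The character model decides when its top two scores differ by at least
    `margin`; otherwise the two best candidates are compared by dictionary hits.
    """
    scores = {lang: model.log_prob(query) for lang, model in models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) == 1 or scores[ranked[0]] - scores[ranked[1]] >= margin:
        return ranked[0]
    words = query.lower().split()
    finalists = ranked[:2]
    hits = {lang: sum(w in dictionaries.get(lang, ()) for w in words) for lang in finalists}
    # Ties on dictionary hits fall back to the character-model score.
    return max(finalists, key=lambda lang: (hits[lang], scores[lang]))


# Hypothetical toy data, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "it": "il gatto nero salta sopra il cane pigro e la volpe",
    "es": "el gato negro salta sobre el perro perezoso y el zorro",
}
DICTIONARIES = {"it": {"gatto", "nero", "cane"}, "es": {"gato", "negro", "perro"}}
MODELS = {lang: CharBigramModel(text) for lang, text in SAMPLES.items()}
```

In such a design the dictionary is consulted only when the character model cannot separate its two best candidates, which keeps the common case fast and reserves the lexical lookup for graphemically close languages.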
Deliverables (iii)
D2.1 Wrapped services of NLP (m. 6)
D2.2 NE extraction service (m.18)
D2.3 Language identification component (m.6)
Work package number: 3
Start date or starting event: 3
Work package title: Log Analysis
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
9
20
15
Objectives (i)
This WP will handle log files from two points of view: feeding the system with primary query logs, and integrating language-based log analysis with formal log analysis and data mining technologies. Different vendors of web servers and web applications produce log files in different, sometimes proprietary, formats. It is crucial for GALATEAS to be able to read as many of the available formats as possible and to capture “extended” information, i.e. the information regulated by the W3C draft “Extended Log File Format”. The first objective of this WP is therefore to produce or adapt technologies able to “understand” the largest possible number of log formats. The same technology must be able to transfer the information contained in log files into a relational DB for faster and easier access.
In the context of this WP we will also adapt business intelligence/data mining techniques to log analysis. These techniques are already widespread and accepted by the market. Different vendors do provide log analysis consoles, with different degrees of complexity. None of them, however, is able to capture and analyse the kind of semantic information produced by GALATEAS. The second goal of this WP is therefore to integrate some of those platforms with the information produced by the GALATEAS log analysis system.
Description of work (ii)
T3.1. Log management: This task has the goal of producing the log management component, which is in charge of 1) harvesting log files from customer services; 2) normalizing log files to a common format (W3C Common Log File Format); 3) loading the log files into a relational DB. None of these three goals is particularly “challenging”. However, great attention must be paid to the fact that the information must be reusable in several contexts. Moreover, given the size that log files can reach, the component must be highly optimized.
T3.2 Log information disclosure: This task will parameterize two mainstream log analysis systems (AWSTATS and WEBTRENDS) so that they also take into account information coming from GALATEAS. The consortium will study how best to achieve this integration, but the production of enriched virtual log files is likely to be the easiest way to integrate with proprietary environments. To enable the production of customized services, log analysis will also be integrated with a commercial data mining environment such as Business Objects and with an open source one such as Pentaho Mondrian. Details of this task are provided in section 3.2.a3.1.3.
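Steps 2) and 3) of T3.1 can be illustrated with a minimal sketch that parses lines in Common Log File Format and loads them into a relational DB. This is only an illustration under stated assumptions: SQLite stands in for the project's database, and the `requests` table, its columns and the function names are hypothetical.

```python
import re
import sqlite3

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Parse one Common Log Format line into a dict, or return None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means that no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def load_lines(connection, lines):
    """Insert every well-formed line into the (hypothetical) `requests` table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(host TEXT, time TEXT, request TEXT, status INTEGER, size INTEGER)"
    )
    rows = [
        (r["host"], r["time"], r["request"], r["status"], r["size"])
        for r in map(parse_clf_line, lines)
        if r is not None
    ]
    connection.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)
```

A call such as `load_lines(sqlite3.connect(":memory:"), open_log_file)` would skip malformed lines silently; a production component would more likely record them for later inspection.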
Deliverables (iii)
D3.1 Log Management (m. 9)
D3.2 Log Disclosure (m. 9)
Work package number: 4
Start date or starting event: 6
Work package title: Algorithms tuning
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
5
8
9
21
12
Objectives (i)
This WP will produce most of the algorithms for query analysis to be integrated either in other technological WPs or in business activities. The main algorithms to be extended/developed are the following:
Extended TLike: for identifying queries which are likely translation equivalents of each other.
Clustering: for grouping together all queries which represent the same information need.
Topic Computation: for identifying queries which match the category tree adopted by the content provider (e.g. the specific subject headings system or classification system of a certain library).
Description of work (ii)
Each task in this WP concerns the configuration of a specific algorithm, namely:
T4.1: Clustering: Two different algorithms will be customized, namely Topic Models and Spherical k-means. Details are provided in section 1.3.b.2.
T4.2: Topic Computation: The goal of this module is to compute the degree of matching between queries and the customer classification system. Two different algorithms will be used: a language-based algorithm relying on semantic matches between the classification system and the query, and a click-through-based algorithm. The latter relates queries to the digital objects the user clicked on after issuing the query to the search engine, and trains a machine learning algorithm accordingly. It will be tested mainly in the context of a Digital Library offering images. Details are provided in section 1.3.b.1.
T4.3: Extended TLike: The basic idea behind the TLike algorithm is to estimate the probability that two queries are translations of each other. It makes intensive use of bilingual dictionaries and Latent Semantic Analysis techniques. Details are provided in section 1.3.b.3.
T4.4: Algorithms Optimization: During this task all produced and evaluated algorithms will be optimized in terms of precision and performance.
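To make the Spherical k-means option of T4.1 more concrete, the sketch below clusters vectors by cosine similarity: all points and centroids are kept on the unit sphere, so that the dot product is the similarity measure. It is a minimal pure-Python illustration, not the configuration to be delivered; the deterministic initialisation with the first k vectors and the function names are assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity; return (labels, centroids).

    Initial centroids are simply the first k vectors (an assumption made
    here to keep the sketch deterministic).
    """
    points = [normalize(v) for v in vectors]
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: on the unit sphere, cosine similarity is the dot product.
        new_labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: sum the members of each cluster and project back onto the sphere.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids
```

For queries, the input vectors would typically be term-weight vectors over a shared vocabulary; because only the direction of a vector matters, long and short queries about the same subject can end up in the same cluster.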
The details of these algorithms are described in a separate section. Here we just mention that all of them will work with the same input and output components. Moreover, each of them will be isolated behind a specific interface with dynamic loading of the implementing object: in this way, different algorithms from external vendors and/or research institutions can be tested without any modification to the source code of the whole system.
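The principle of isolating each algorithm behind an interface with dynamic loading can be sketched as follows. This is an illustrative assumption about how such a mechanism might look, not the project's design: the `QueryAlgorithm` interface, its `process` method and the `"module.path:ClassName"` naming convention are all hypothetical.

```python
import importlib
from abc import ABC, abstractmethod


class QueryAlgorithm(ABC):
    """Hypothetical common interface: every algorithm maps queries to results."""

    @abstractmethod
    def process(self, queries):
        """Take a list of query strings and return one result per query."""


def load_algorithm(spec, **options):
    """Instantiate the class named by `spec`, written as "module.path:ClassName"."""
    module_name, _, class_name = spec.partition(":")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    # Reject anything that does not implement the shared interface.
    if not (isinstance(cls, type) and issubclass(cls, QueryAlgorithm)):
        raise TypeError(f"{spec} does not implement QueryAlgorithm")
    return cls(**options)
```

With this arrangement the name of the implementing class can live in a configuration file, so replacing a consortium algorithm with one from an external vendor amounts to changing one configuration entry.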
Deliverables (iii)
D4.1: Clustering (m.12)
D4.2: Topic Computation (m. 12)
D4.3: Extended TLike (m. 15)
D4.4: Final version of tuned algorithms (m. 36)
Work package number: 5
Start date or starting event: 0
Work package title: Machine Translation for Queries
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
15
15
29,5
0
Objectives (i)
The goal of this WP is to parameterize the open source Machine Translation engine Moses (http://www.statmt.org/moses/) in order to deal with search engine queries. The parameterization will start once the TLike algorithm has been optimally configured and a large number of bilingual queries has become available. The objective, however, is not to parameterize a single instance of Moses to obtain a general-purpose query translation engine.
Rather, the goal is to set up an infrastructure in which several instances of the MT system can be easily configured as soon as domain-specific log files become available. In this way, by accessing different log resources, we obtain an MT system which is customized for handling queries and adaptive with respect to the domain. As a minimum, during the project, the Moses-based MT will be configured for dealing with generic library catalogues (e.g. Europeana) and with art images (Bridgeman).
Description of work (ii)
T5.1 MT Adaptation: This task starts before the final version of the TLike algorithm is available. However, the consortium already has access to a CACAO-derived version of TLike, to huge log files derived from the Excite and Yahoo! search engines, and to log files from CACAO-managed libraries (e.g. TEL and NEEO) and Europeana. This will allow us to start configuring Moses for the kind of text under analysis (queries).
T5.2 Integration with LangLog and domain adaptation: In this task a general approach to domain adaptivity will be developed. More specifically, in-domain training data produced as a result of the LangLog service and aligned via the final version of TLike will be used to optimize translation performance, while requirements arising from the nature of query logs (such as short contexts and flexible word order) will lead to the identification of particular tuning parameters for the MT system (i.e. for the language and translation models). Particular attention will be paid to ease of configuration.
T5.3 MT infrastructure setup: Some of the operations of the MT system are very resource-intensive. Automated ways of dealing with large quantities of data will therefore be defined and implemented (based on segmentation of data, parallelization of processes, or a combination of both), bearing in mind the need for flexible domain adaptation.
Deliverables (iii)
D5.1. MOSES parameterization approach (m.12)
D5.2. Integration with LangLog and domain adaptation. (m.18)
D5.3. MT infrastructure (m. 36)
Work package number: 6
Start date or starting event: 0
Work package title: Optimisation
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
17
14
30
13,14
Objectives (i)
The main objective of this work package is to design a system architecture able to glue all GALATEAS ingredients into a coherent, low-maintenance workflow. Two versions of this architecture will be delivered. The first version will allow easy integration and testing of the available components at early stages of the project. This version will not integrate MT and will satisfy the needs of the first part of the BP, namely collecting logs and providing semantic analysis. The second version of the architecture will integrate machine translation and will be optimised for handling massive quantities of data.
The second version of the architecture will be able to provide services to several customers in parallel. This WP must therefore also provide a graphical interface through which the system can be appropriately parameterized. Note that this is not meant to be the interface for log analysis proper (which is a task described in WP 3) but rather the interface for configuration and administration. For example, the GUI should make it possible to set the periodicity of log harvesting, the granularity of certain algorithms (e.g. clustering), the languages to be dealt with, the training parameters of the MT systems, etc.
Description of work (ii)
T6.1 Design of the system: By adopting a classical UML methodology, the system will be designed according to user input, technological constraints and market requirements.
T6.2 First Version of the integrated architecture. It will essentially be built as a composition of remote web services and local algorithms. The remote web services consist mainly of the NLP services described in WP 2, whereas the integrated algorithms are the ones described in WP 4. This task should also deliver a definitive format for data disclosure, so that the system for log analysis and data mining can become operational as soon as this first version is available.
T6.3 Implementation of the LangLog service: The service will hook into the GALATEAS architecture to provide intelligent log analysis services. Initially it will be deployed on a Tomcat application container (powered by Axis2).
T6.4 Final Version of the integrated architecture. This will be the optimised version of the GALATEAS architecture. The partner responsible will evaluate two routes towards optimisation: i) a further segmentation of the system, so that the base algorithms also act as web services and can therefore be distributed over different machines; ii) the adoption of a parallel computing approach such as Hadoop (map/reduce), which would allow every task to be distributed over different computing devices. Non-functional considerations such as the size of the data to be transferred, the fragmentability of the task and the response-time tolerance will influence the final decision.
T6.5 Implementation of the QueryTrans service. The service will hook into the GALATEAS architecture (and especially the machine translation subsystem) to provide query translation services. Initially it will be deployed on a Tomcat application container (powered by Axis2).
T6.6 Configuration GUI. It will be built with standard web technologies, most likely JSP. The crucial point of the GUI is not visual appeal but robust integration with the underlying processing machinery and ease of use.
T6.7 Service maintenance and adaptation: This task will start after the deployment of the QueryTrans service. Besides keeping both services fully operational, this task will take care of any customization required by business opportunities that may arise in the last year of activity.
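The map/reduce approach mentioned as option ii) of T6.4 can be illustrated with a small sketch that counts query frequencies over shards of a log. It only mimics the programming model on a single machine and does not use Hadoop: a thread pool stands in for the distributed workers, and the shard layout and function names are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_shard(queries):
    """Map step: count normalized queries within one shard of the log."""
    return Counter(q.strip().lower() for q in queries if q.strip())


def merge_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def count_queries(shards, workers=4):
    """Run the map step on each shard independently, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce(merge_counts, partials, Counter())
```

The map step needs no information from other shards, which is the property referred to above as the fragmentability of the task; only the much smaller partial counts have to be transferred for the reduce step.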
Deliverables (iii)
D6.1 System design (m. 3)
D6.2 First Version of the integrated architecture + LangLog WS (m. 12)
D6.3 Final Version of the integrated architecture + QueryTrans WS (m. 24)
D6.4 Configuration GUI (m.24)
Work package number: 7
Start date or starting event: 12
Work package title: Evaluation
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
4
10
17,25
12,5
15
Objectives (i)
The objective of this WP is to provide an evaluation infrastructure for most of the components of the GALATEAS system. Evaluation will focus on the following components:
Topic Computation algorithms
TLike algorithm
Query MT system
These three components are indeed the ones which are deemed “more crucial” for an organic development of GALATEAS, in the sense that:
The topic computation algorithm is crucial for attracting customers willing to access an accurate analysis of their logs based on their catalogues.
The TLike algorithm is crucial in determining the quality of the output of the MT system.
The query-based MT system is one of the principal outcomes of the project and one of the pillars of its commercial exploitation.
For these reasons they will undergo both in-lab and real-data evaluation. The remaining algorithms will undergo only in-lab evaluation (to be performed in the relevant WP), provided that gold standards are available.
Users of GALATEAS are mainly institutions; we therefore do not plan any end-user field evaluation. The key players for evaluation are UBER, BAL and OD. OD will mainly evaluate non-functional constraints such as scalability and performance, and will assist a third-party library in the evaluation of TLike and QueryTrans.
Description of work (ii)
T7.1 Evaluation of “Topic Computation”. The goal of the topic computation algorithm is to associate queries with the specific categories in which the contents of the information provider are organized. This is a standard classification problem, and in-lab evaluation will be performed against community gold standards such as the Reuters corpus for English. Real-data evaluation will be performed by manually assigning categories to a set of 1000 queries and then verifying how the algorithm performs.
T7.2 Evaluation of TLike. In-lab evaluation of the TLike algorithm will be performed using CLEF topics. For real-data evaluation the gold standard will be built manually; it will be composed of 250 matching queries and 250 non-matching queries for at least three language pairs.
T7.3 Evaluation of QueryTrans. The query translation algorithm is ultimately used for cross-language information retrieval applications. Therefore in-lab evaluation will be performed against the CLEF infrastructure. Real-data evaluation will be performed by building a gold standard formally identical to the CLEF format, where topics are replaced by real user queries. The TEL@CLEF corpus will be used as the evaluation base. The gold standard will be donated to the CLEF organizers.
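For the real-data evaluation of TLike in T7.2, a manually built set of matching and non-matching query pairs lends itself to standard precision, recall and F1 measures. The sketch below shows one way these could be computed; the choice of metrics and the function name are assumptions made for illustration, since the evaluation protocol is detailed elsewhere in this Annex.

```python
def evaluate(gold, predicted):
    """Return precision, recall and F1 for the 'matching' class.

    `gold` and `predicted` are parallel lists of booleans, one per query pair.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly detected matches
    fp = sum(1 for g, p in pairs if not g and p)    # false alarms
    fn = sum(1 for g, p in pairs if g and not p)    # missed matches
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With 250 matching and 250 non-matching pairs per language pair, the two classes are balanced, so precision and recall on the matching class already give a fair picture without further weighting.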
All details about evaluation are provided in 3.2.a.5.
Deliverables (iii)
D7.1 First evaluation report of Topic Computation and TLIKE (m. 15)
D7.2 Final evaluation report of Topic Computation and TLIKE (m. 36)
D7.3 First Evaluation of Query Translation (m. 24)
D7.4 Final Evaluation of Query Translation (m. 36)
Work package number: 8
Start date or starting event: 0
Work package title: Business sustainability
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
6
15
5
6
6
6
3,1
Objectives (i)
The goal of this WP is to ensure the sustainability of GALATEAS after the end of the project and to start producing revenues during the project itself. The business sustainability of the project will be guaranteed by offering two services:
A service of language based log analysis
A service of Query translation
The quantitative goal of the business action is to secure at least 10 customers for each of these services before the end of the project. Quantitative objectives after project termination will be detailed in the business plan and are sketched in Section B2.2 of this Annex.
Description of work (ii)
The success of GALATEAS builds on the availability of logs. As a baseline the project can count on logs from the CACAO federations, TEL, EUROPEANA, Yahoo! and Excite, but this base needs to be dynamically expanded. In order to draw customers into the loop, we propose the following phases:
Phase 1: log analysis services (based on Topic Computation, Clustering, etc.) are offered for free until the end of the project to all companies/institutions requiring them. In exchange we get the authorization to exploit queries for the purpose of training the query MT.
Phase 2: Query translation services based on the query MT are offered for free until the end of the project to all companies/institutions requiring them. In exchange they will allow us to analyse their logs in order to gain information for further refinement of the system.
Phase 3: Acquisition of paying customers (annual fee) for the log analysis service;
Phase 4: Acquisition of paying customers (annual fee) for the query translation service;
These phases will be detailed during T8.1, i.e. the business plan writing phase. The business plan will also detail the revenue share among the providers of the NLP services and the GALATEAS providers of the query MT and log analysis services. It will also detail how further revenues can be generated by parameterisation of the system under the open source perspective. In order to implement the above four phases we also define T8.3, a customer acquisition task covering both log analysis and query MT, and both free and fee-based services.
This work package will also deliver a User Study (T8.2): the study, described in 3.2.a.6, is not conceived as a service or an artefact to sell, but as a piece of evidence to help win customers. In the context of this task it will also be evaluated whether the methodology adopted for the study is mature enough to be delivered as a service. The study will be conducted mainly by UNITN and UBER, with support from BAL.
Deliverables (iii)
D8.1: Business Plan (m. 6)
D8.2: User study on search episodes (m. 18)
D8.3: Intermediate report on customer acquisition activities (m.30)
D8.4: Final report on customer acquisition activities (m. 36)
Work package number: 9
Start date or starting event: 0
Work package title: Dissemination
Participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Participant short name | XRCE | CELI | UVA | UNITN | OD | GONET | BAL | UBER
Person-months per participant:
9
5,5
6
5,5
3
6
2
Objectives (i)
In GALATEAS dissemination will have two main objectives:
make the scientific community aware of the results obtained with the project
make potential customers aware of the scientific and technical excellence of the proposed services, thus supporting direct marketing actions developed under WP8.
We believe that, for the kind of business that GALATEAS proposes, a strong emphasis must be put on innovation aspects: it is important that customers first become aware of the potential of the services (e.g. via press and other media) and are only then contacted by GALATEAS representatives.
Detailed objectives of dissemination are described in Section B3.6.
Description of work (ii)
T9.1 Involvement of the scientific community: It will be achieved by the following means:
Standard channels of relevant scientific communities (esp. NLP community and Digital Libraries community).
Project web site.
Participation in CLEF initiatives (e.g. proposing a specific track on query translation)
Dissemination of software through open source distribution channels
T9.2 Awareness construction among potential partners: It will be achieved by the following means:
Service Advertisement Web site;
Presence at major trade fairs in the field of IT and specifically of (digital) libraries.
Production and diffusion of marketing material such as brochures and PowerPoint presentations (from the very beginning of the project) and commercial flyers (from the first six months of activity)
Search Engine Optimization for the service web site;
Publication in IT management oriented reviews;
Press releases (2 press releases, coinciding with the launch of each fee-sustained service)
Standard commercial networking
Showcase as a promotional video showing results of the project.
Deliverables (iii)
D9.1. Dissemination and awareness plan (m.3)
D9.2 Project web site and PowerPoint presentation (m. 3)
D9.3 GALATEAS services web site (m. 18)
D9.4 First report on dissemination activities (m. 18)
D9.5 Project showcase (m. 24)
D9.6 Second report on dissemination activities (m. 36)
B.3.2.b.4 WT4 List of Milestones
Milestone number | Milestone name | WP numbers | Lead beneficiary number | Delivery date from Annex I | Comments
1 | LangLog WS | 2,3,6 | All | 12 | The LangLog service, as resulting from D6.2 (which presupposes all other deliverables of first year)
2 | QueryTrans WS | 2,5,6 | All | 24 | The QueryTrans service as resulting from D6.3 and D8.2.