Europe as a whole has a stake in increasing the use of Open Source Software (OSS) in all branches of IT and in public life in general. By returning all developed software that is subject to an open source license to the open source community, GALATEAS aims to contribute to this joint European effort.
We expect that GALATEAS software made available under an open source license will be adopted and even improved by research organizations willing to experiment on the basis of an already consolidated platform. In order to benefit from this community effort, the commercial partners of the consortium are even ready to accept the possible rise of competitors with products or services based on the GALATEAS software. In any case, three commercial partners (XEROX, CELI, OD) already make use of OSS in their business, and they are aware that for certain kinds of very innovative projects this release modality might become a business accelerator (for instance, many European administrations have published directives pushing towards the adoption of open-source-based software).
It is a fact, however, that merely using an open source licensing scheme, or even an open source development platform such as SourceForge, is not per se a guarantee of success. As pointed out by the EU-funded project TOSSAD (towards Open Source Software adoption and dissemination: http://www.tossad.org ), the open source model needs to be accompanied by a number of technical and promotional initiatives to facilitate its adoption. GALATEAS will promote its software made available under an open source license via the following means:
Good documentation at source code level and at system level;
Constant project activity;
Public records of changes/bugs;
Publicly available tutorials;
Compliance with other open source software (e.g. for data mining (Mondrian), structured log analysis (AWSTATS), communication protocols (UIMA), etc.);
Explicit acknowledgement by customers of the fact that they are using GALATEAS services;
Participation in programmer-oriented conferences and seminars;
Organization of training events, possibly in the form of webinars, in order to minimize costs.
All OSS dissemination activities will be carried out in the respective development work packages and documented in the general “Reports on dissemination activities”.
B.3.1 Consortium and key personnel
B.3.1.a The Consortium
The GALATEAS consortium aims at an optimal balance between academic research (UNITN and UVA), industrial research (XRCE and CELI), system integrators and web technology experts (OD and GONET) and users (UBER and BAL). Such a structure facilitates technology transfer and allows relatively easy take-up of leading-edge technologies. Academic research is mainly involved in algorithm tuning and machine translation. To a certain extent Xerox and CELI are also involved in these tasks, for different reasons: Xerox has a strong group on machine translation, which could actively cooperate with the academic partners. CELI, as one of the main exploitation partners, needs to acquire high-level skills in MT customization, with special attention to MOSES. Expertise on MOSES will be provided by UVA, and in particular by Christof Monz. Moreover, in the context of CACAO, CELI has conceived and implemented the TLike algorithm, which is one of the building blocks of the project.
The innovative aspects brought to the project by the academic and industrial research partners are glued together by the system integrators in order to achieve timely delivery of the proposed services. Besides performing integration, OD and GONET will also lead the technical work packages concerning log analysis (from a strictly formal point of view) and the rendering and delivery of semantic information extracted from logs. Their skills in this area are demonstrated by their commercial activity in the specific sectors (GONET for structured transaction log analysis and OD for data mining). The users (in GALATEAS the user is not the end user, but the potential customer) have been chosen with the intent of maximizing the impact, both commercial and institutional, of the whole project. BAL is one of the most successful European multimedia digital libraries selling reproductions of art works. It will represent the needs of commercial organizations building their business on the marketing and sale of digital objects. UBER represents a large university library and, more importantly, is the coordinator of the EuropeanaConnect work package on multilinguality. It therefore represents a strong bridge towards Europeana and other institutional federations of digital libraries.
In terms of co-operation, the partners involved have a track record of successful co-operation in national and international projects. In particular:
XRCE, CELI, GONET are partners of CACAO;
XRCE, CELI and UBER are partners of EuropeanaConnect and Europeana 1.0;
CELI (France), GONET and OD are partners of LEILAS (Eurostars);
XEROX and CELI are partners of SCOOP (Eurostars);
OD, XEROX and CELI (France) are part of COCLICO (ANR-France: project on Open Source Technologies);
XEROX is subcontractor of CELI in ICT4LAW (Regione Piemonte);
XEROX, UNITN and CELI are part of a joint European Master Program (Mundus);
XEROX, UVA and CELI actively cooperated in the context of the CLEF evaluation campaign.
B.3.1.b The Partners
Xerox Research Centre Europe’s (XRCE) specific mission is to become a Centre of Excellence for the understanding of document processes and for the invention of technologies which support them. While drawing on the strength of the Xerox Corporation around the world, XRCE's focus is Europe. The centre cooperates with the scientific community and with businesses and their customers to ensure that the developed technologies are not only innovative, but also match the requirements of the market. XRCE forms partnerships, collaborates with a wide range of European research organizations, and works with the business divisions of Xerox. The research carried out at XRCE covers the following areas: natural language processing, machine learning, XML, computer vision, computer science, mathematics, data mining and ethnography.
XRCE has significant experience in multilingualism, and has processing tools available for different languages. At a low level of linguistic processing (morphological analyzers and part-of-speech taggers) XRCE has tools for a wide range of languages. At a higher level of linguistic processing, the institution has tools for syntactic processing of German, English, Italian, Spanish, Dutch and French. Advanced semantic processing, i.e. named entity recognition and typing, as well as word sense disambiguation, is available for French and English, and discourse analysis tools are available for English.
Core competencies relevant to the GALATEAS project are multilingual morphological analysis, syntactic and semantic processing, and discourse analysis. These activities are carried out by the "Parsing and Semantics" (ParSem) area, headed by Dr Frédérique Segond, and will be brought into the project. Frédérique has managed various European and national projects such as Thetis, InfoMagic (for the textual part) and CACAO. ParSem has participated, and continues to participate, in European and French national projects with its expertise in all of these domains (CACAO, SAPIR, EERQI, Infom@gic, TANGUY, ALADIN).
Thanks to its expertise in machine learning and statistical machine translation, XRCE will also contribute to the configuration of the MOSES MT system.
Key Personnel
Dr Frédérique Segond is principal scientist and area manager of Parsing and Semantics (ParSem) at XRCE. Since joining Xerox as a research scientist in 1993, she has held various positions, such as project leader on semantic disambiguation and new business development manager working on the transfer of linguistic technologies. She has been involved in many European and national projects and coordinates the CACAO project. In GALATEAS she will act as the coordinator.
Dr Guillaume Jacquet joined XRCE as a Research Engineer in 2006 as part of the Parsing and Semantics group. His current research activities are related to Information Extraction, and more precisely to the use of hybrid methods (combining statistical and symbolic approaches) for semantic concept extraction from textual data. He also works on clustering and textual entailment. He has been involved in the nationally funded InfoMagic project.
Dr Claude Roux is a project leader in the Parsing and Semantics area (ParSem) at XRCE. Claude has developed the Xerox Incremental Parsing (XIP) rule engine, which offers a full formalism to design, implement and run dependency grammars for any natural language. He has also participated in evaluation campaigns (TREC98 and Amaryllis) and is involved in the CACAO project for the NLP web services.
CELI is a company founded in May 1999 by researchers and engineers, previously working in the research sector. The main goal of CELI is to bring the results of the most advanced research activities in the Human Language Technology, Artificial Intelligence, and Web Interaction fields into products and solutions for the enterprise.
CELI works in close co-operation with several European research centres (e.g., DFKI–Saarbruecken, Sheffield University, Scuola Normale Superiore, XRCE) and has participated in the TAL Italian national project (national infrastructure for linguistic resources in the automatic processing of spoken and written natural language).
The company has been a technological partner in the MIETTA and MIETTA II consortia (RTD project, partially funded within the IV and V Framework, European Union, DGXIII) and the Deep Thought consortium (funded within the V Framework, European Union, DGXIII), and it was technical coordinator of the LOIS project (eContent). It is currently participating in the EU-funded projects CACAO (eContent+) and EuropeanaConnect.
The company offers consulting, business modelling, design, and implementation of complete “language sensitive” software solutions, together with a whole range of personalized services.
The current focus of the company is on information extraction technologies applied to several practical domains such as:
• Business Intelligence
• Cross Language Information Retrieval
• Customer opinion monitoring (also known as “sentiment analysis”)
• Question Answering
In November 2006 CELI founded the French company CELI-France, which is one of the European leaders in the field of Sentiment Analysis and a partner of the CARE project.
In recent years strong importance has been attached to research on cross language information retrieval, which culminated in the participation in projects such as CACAO and Europeana, as well as in evaluation activities in the context of the CLEF evaluation challenge (TrebleCLEF). The main contribution of CELI in GALATEAS will focus on cross language and NLP technologies, with a special focus on query log alignment. Moreover, it will integrate skills and resources acquired in the context of the CACAO project and facilitate the integration of language resources and algorithms from several partners.
Dr Luca Dini is President of CELI s.r.l. and CELI France SAS, and was formerly CEO of DIMA Logic. He has been active since 1986 in the field of Natural Language technologies and has participated in several European projects, among them: Eurotra, EAGLES, MIETTA I & II (Site Manager), Deep Thought (Site Manager), LOIS (Project Manager), CACAO (Site Manager/Technical Coordinator) and EuropeanaConnect (Site Manager). In GALATEAS he will act as leader of the technical coordination and site manager.
Paolo Curtoni is Senior Engineer at CELI and Development Director for CELI France. On the EU project side he was active in the projects EUROTRA, MIETTA II, Deep Thought and LOIS. In GALATEAS he will act as a senior engineer and manager of integration operations.
Dr Giampaolo Mazzini is one of the co-founders of CELI s.r.l. He was involved in the Eurotra and MLAP projects (COLSIT and LS-GRAM) at Gruppo Dima and was a visiting researcher for one year at DFKI (Saarbruecken), working on Information Extraction. He has been involved, as a computational linguist and researcher in the field of Information Extraction, in European projects such as MIETTA and MIETTA II.
Objet Direct (OD)
Since 1998, Objet Direct has helped companies and organizations develop their information systems, from design to implementation, using iterative and incremental development methodologies and a distributed object approach.
Services include Project Delivery, Consulting, Software Edition and Skills Transfer.
As a result of Objet Direct's long experience in developing object-oriented projects, our process and tools succeed in:
Building applications that really match users’ requirements, by using prototyping techniques,
Improving development productivity and quality with the help of transformation methods,
Reusing core components thanks to systematic Business Processes and Business Object analysis.
Objet Direct's daily technological environment covers all the major object, distributed-architecture and Internet technologies: UML, J2EE application servers, n-tier architecture, EJB, Java and C++, XML, .Net platform, EAI middleware, Web technology, Corba, UP, XP, …
Objet Direct has fundamental strengths that underpin the successful completion of its assignments:
An Object and n-tier architecture specialization,
A mixed consulting / training / engineering culture ensuring expertise, pedagogy, pragmatism and experience feedback,
A strong knowledge on the main tools in the market,
A well-tried methodology,
A true partnership with each of our customers,
An implementation of technical and human means,
Concrete measures to ensure quality,
An internal tool for prototyping and code generation usable by everybody: D.OM
In GALATEAS Objet Direct will be mainly involved in Log analysis, Data Mining and Integration activities. It will also play a crucial role in business and exploitation activities.
Thibault Parmentier manages the Grenoble branch of Objet Direct. He joined Objet Direct as a computer scientist in 2001. Before that, he worked at Xerox for three years, where he was responsible for the design and implementation of the technical part of an e-learning system. He was also involved in the coordination of the Thetis EC project. During these three years, he was active in writing patents and scientific articles.
Matthieu Chartier is a computer science engineer from INSA Lyon, working as a senior Java engineer and project leader. He has strong skills in Java development, gained on several large-scale web application projects (e.g. a two-and-a-half-year e-business portal project for small and medium businesses developed at HP), and experience as a project leader.
Dr Olivier Baudin
After two years in research, where he was responsible for the industrial project « Stéthoscope Électronique Analogique / Numérique » within IRIS, Olivier moved to industry, joining the PSO (Professional Service Organisation) of Borland France for 6 years. Since 2007, Olivier has been working within Objet Direct as a contractor. For the past year he has been working for Yahoo Search Technology (YST) on several components of the YST search engine.
Bridgeman Art Library (BAL)
The Bridgeman Art Library (BAL) is the leading international resource of digital images of art, artefacts, culture and history, representing 2000 stakeholders in the museums, galleries and archives community worldwide. The Bridgeman Art Library has been the commercial content licensing body for UK museums, galleries and archives for over 37 years. It funds museums to the tune of £2.5m annually and represents nearly all the UK designated museums. The database comprises 360,500 images, all available online.
The Bridgeman Art Library has taken part in 4 EU projects for metadata, imaging and content delivery and is now coordinating eContentplus MILE. It is also an expert in IPR, licensing, content sourcing, accessing, tracking and use of encryption.
We currently operate two UK government funded projects: KTP (Knowledge Transfer Partnership), connected with education, and SILVER (funded by the Technology Strategy Board, TSB), also concerned with education.
As a user partner in Project SILVER, The Bridgeman Art Library defines categories for metadata semantics and ontologies, file and wrapper formats, workflows, and the management of copyright assets. The Bridgeman Art Library is responsible for licensing and contractual structures for the prototypes as the basis for future products in different markets, co-ordinating research on the educational product subject areas, creating semantic objects from digital assets and taking part in building and evaluating prototype tools.
The Bridgeman Art Library has 36,000 clients worldwide, with approximately 65% of its business in publishing, media and education, with sales evenly balanced between the UK, the EU and the rest of the world. BAL’s experience in at least 10 market sectors (including educational sectors addressed by Bridgeman Education) will be invaluable for
The Bridgeman Art Library has four international offices and 55 agents worldwide, and so is well versed in the issues of operating in different language environments. For GALATEAS, the Bridgeman Art Library would act as content provider, supplying its online database or specialist image sets. We would act as advisor on metadata standards and share best practice on cataloguing and indexing images, the types of data fields required in a cataloguing structure, controlled vocabularies and delivery protocols such as IPTC, as well as some of the difficulties already experienced in this area.
The Bridgeman Art Library's 37-year track record and database of 36,000 clients in some 10 different market sectors within the creative industries are testament to its ability to help deliver sustainable business models.
In the context of GALATEAS, Bridgeman will be active in the work packages concerning evaluation, optimisation and business sustainability. Moreover, being an international art portal and a multimedia digital library, BAL will provide the data for the tuning of the Click-through classification algorithm.
Key Personnel
Harriet Bridgeman is founder and Chairman of the Bridgeman Art Library Ltd. She was a founder member of The British Association of Picture Libraries and Agencies and has chaired their executive committee with special responsibility for copyright. She has received many honors, such as the European Women of Achievement Award in the Arts (1997) and International Business Woman of the Year (2005). In 2006, she founded the Artists’ Collecting Society to collect Artists’ Resale Rights (Droit de Suite) on behalf of UK-based artists.
Jessica Tier is Project Manager for The MILE Project. Jessica wrote a proposal for a 3-year EC funded project under the eContentplus programme – The MILE Project. As well as continuing to run this project, Jessica also manages Harriet Bridgeman’s new company under The Bridgeman Art Library's umbrella, Artists' Collecting Society (ACS), whose purpose is to manage Artist's Resale Right or Droit de Suite on behalf of over 250 member artists.
GONETWORK s.r.l. is a company founded in 1999 (initially named TQMNETWORK) with the goal of reaching a good market position in the field of Web Marketing and Search Engine Optimization. The company also works on software development, Internet services, business consultancy and training.
In web marketing, the company has in recent years developed numerous projects for companies that are leaders in their field on the Web, such as AutoScout24, Ebay, Expedia, Lastminute, Meetic and Booking. All these projects required the analysis of user queries and Log Analysis (website access statistics).
From 2002 onwards the company has independently developed a number of marketing projects related to tourism and other fields, in which it has studied the problem of content localization, the relationship with keywords, user behaviour, and the behaviour of search engines with respect to geographically based queries and web content containing geography-related text. Customized Log Analysis systems have been developed to improve relevance in search engine results.
Since 2006 Gonetwork has developed a number of web portals as part of a single project, named 'goturismo', with content related to Italian regions, cities and landmarks. The portals aggregate public content, search engine query results and content uploaded by registered users (companies or private individuals). The project receives more than 3,000,000 visits yearly, with continuous growth. It is composed of several modules, one of which is a search-engine-based harvesting system dedicated to harvesting and processing public content on the web. Another module deals with Log Analysis for web marketing and search engine positioning purposes.
Since its foundation, Gonetwork has successfully participated in EC-funded research projects in the fields of search engines, multilingual content retrieval and territorial marketing, such as MIETTA II, LOIS and CACAO, gaining good experience in project management, project quality control, participation in working groups, technical editorial review and integration of software projects.
For the development of the CACAO Project, for example, Gonetwork has contributed heavily not only to the development of parts of the system but also to the design of the project website and web marketing plan that has already passed its review.
PRIVACY AND DATA PROTECTION
Gonetwork has long experience in this field, on both the Italian and European markets, and will set up procedures to guarantee full respect of the legislation of at least Italy, France, Germany and the United Kingdom.
GoNetwork has acquired wide experience of advertisement policies and netiquette. Its long experience of cooperation with Google AdSense, Tradedoubler, Ebay, Expedia, etc. will be put to good use in the phase of commercial exploitation and dissemination.
In the context of GALATEAS, Gonetwork will contribute its experience in transaction log analysis, web service integration and the development of web graphical user interfaces.
Key Personnel
Dr Massimo Balestrieri is one of the founders of GONETWORK s.r.l., of which he is currently the CEO. He has been active since 1991 in the fields of quality assurance and quality control, company organization and customer care solutions. He was involved in several EC projects, such as MIETTA II, LOIS and CACAO, as project manager for design and development as well as for the business activities connected to the project.
Igor Barsanti is a software engineer working on various aspects such as networks and servers in Unix and Windows environments and hardware maintenance. He masters many programming languages, network planning and electronics. At GONETWORK since 2002, he has been involved in many EC-funded projects, in particular in the design and development of the Harvesting System of the CACAO project.
Università degli Studi di Trento, Italy (UNITN)
The UNITN unit consists of researchers belonging to the Department of Information Engineering and Computer Science (DISI, http://disi.unitn.it) and the inter-departmental Center for Mind/Brain Sciences (CIMeC: http://portale.unitn.it/cimec/).
DISI has an outstanding scientific record and an outstanding performance in attracting R&D funding from industry, local government, and the EU. In the 6th Framework, DISI participated in 22 projects. DISI has also developed highly effective graduate educational curricula which attract students from around the world. DISI currently has 74 members of academic staff, 24 members of research staff, 14 PostDocs, 10 members of technical staff, and 19 members of administrative staff.
CIMeC features, at the moment, 56 researchers and 33 doctoral students, including psychologists, neuroscientists, physicists and computer scientists, all sharing an interest in
Most of the GALATEAS activities will be conducted by scientists from the CIMeC CLIC laboratory (CIMeC Language, Interaction and Computation: http://clic.cimec.unitn.it/), which currently features 2 full professors, 3 tenured researchers, 4 contract researchers and 10 doctoral students.
Despite its recent inception (the lab was inaugurated in 2007), CLIC is one of the largest and most active centers for human language technology in Italy, and is particularly known for its work with web-scale amounts of textual data (WaCKY corpora, CLEANEVAL), information extraction (in particular entity disambiguation, e.g. ELERFED) and the application of machine learning to semantic processing.
In GALATEAS UNITN will be mainly active in the NLP and Algorithm WPs, with special attention to Named Entity Recognition, Clustering and classification. Key personnel:
Pr Massimo Poesio is the Director of the Language Interaction and Computation Lab in the Center for Mind / Brain Sciences. His main research interest is text mining. In 2007, he led the consortium that developed the BART toolkit at the Johns Hopkins workshop. He has been Principal Investigator in 8 grants, including the LiveMemories large research project in Trentino, 3 EPSRC projects in the UK (GNOME, ARRAU, AnaWiki) and 2 EU projects (MATE and TRINDI).
Dr Marco Baroni is Assistant Professor at the University of Trento. He has worked as a computational linguist for Conversay, as a researcher at the Austrian Research Institute for Artificial Intelligence and as a tenured researcher at the University of Bologna. His present research interests include the rapid automated creation of large language resources and knowledge-poor extraction of semantic knowledge from textual data.
Dr Alessandro Moschitti is an Assistant Professor at the Information Engineering and Computer Science Department of the University of Trento. He worked as an associate researcher at the University of Texas at Dallas for two years. His expertise concerns machine learning approaches to NLP, Information Retrieval and Data Mining. He has participated in several EC projects: LIVINGKNOWLEDGE, PRESTOSPACE, NAMIC and TREVI.
University of Amsterdam (UVA)
The University of Amsterdam participates with researchers from the Informatics Institute (Faculty of Science). The Intelligent Systems Laboratory Amsterdam (ISLA) within the Informatics Institute focuses on processing information in pictorial, auditory and/or textual form and on the consequences such information has for subsequent actions. ISLA members are interested in pictorial databases, learning from text and pictures, the theory of computer vision, multimedia and multi-modal information integration, link discovery, and agent technology. All these topics are covered from theory to practice, from basic principles to implementations of applications.
ISLA has gained very substantial experience in many areas relevant for the project: Web Search Engines, Access to Cultural Heritage, XML Retrieval, Web Mining, Language Technology, Machine Translation, Log analysis. ISLA is involved or has been involved in many national and EU funded projects, and collaborates with a large number of research groups, both nationally and internationally. Of special relevance for the present proposal are the MultimediaN project on the development of multimedia information technology for usage in high-end applications (in media, intelligence, and heritage), DuOMAn (Dutch Language Online Media Analysis), TNT (Tracking News Events and their Impact), the AID (Adaptive Information Disclosure) project on semantic access to scientific information within a virtual lab environment for e-science, and the MataHari project on machine translation from automatically harvested Internet resources.
The main tasks of UVA in GALATEAS will concern the adaptation of the MOSES MT system and the tailoring of the various log analysis algorithms. Key personnel:
Pr Maarten de Rijke leads the Information and Language Processing Systems group. He was awarded a Pionier grant (2001), on the basis of a proposal. His current research focus is on intelligent web information access, weakly or semi-structured documents, media analysis and multilingual information. He is currently the director of UvA’s Information Science bachelor program and of its newly founded Center for Content, Creation and Technology (CCCT).
Dr Christof Monz is Assistant Professor of Natural Language Processing within the Information and Language Processing Systems group. He worked at Queen Mary, University of London and at UMIACS, USA. He has led several software implementation efforts, and the group’s statistical machine translation system has been evaluated at numerous international benchmarking events. He is currently the Principal Investigator of a project on statistical machine translation directly funded by a UK governmental end user.
Berlin School of Library and Information Science, Humboldt-Universität zu Berlin (UBER)
The School of Library and Information Science of the Humboldt University (IBI) is the oldest school of library science in Germany, the only library school at a research university, and the only German institution with the right to give a doctorate in library and information science.
The IBI’s aims are to prepare students to take information management and information services positions within companies and other organizations. Specific topics of work are:
to bring scholarship about the latest information technology and technology trends into the classroom,
to provide students with a basis for work in both the digital and traditional aspects of contemporary library and information management work, and
to ensure that students have an understanding of qualitative and quantitative scientific methods that allows them to understand and contribute to research.
Important among the goals of IBI are the aims to engage internationally at both the teaching and research levels and to build a research and teaching program that creates a distinctive Humboldt perspective and a practical set of tools for addressing the changing needs of the world of information.
The IBI’s research focuses on the areas of digital library evaluation, long-term preservation, information management, electronic publishing, knowledge management, information organization and retrieval, and public library management.
At the European level, the IBI is currently involved in the eContentplus projects Europeana v1.0 and EuropeanaConnect, the FP7-SSH funded EERQI project, the eTEN project eBooks on Demand (EOD) and the FP7-ICT funded TrebleCLEF activity.
These points reflect the reality of academic work at the institute and at the same time illustrate IBI’s motivation for playing a role in a project such as GALATEAS. Key personnel:
Dr Vivien Petras is a Junior Professor at the Berlin School of Library and Information Science. She leads work package 2 (Multilingual Access to Content) in the eContentplus EuropeanaConnect project and is a member of the core expert group for the eContentplus project Europeana v1.0. She was also involved in EDLnet (Technical & semantic interoperability) and coordinated the domain-specific and ad-hoc tracks at the European Cross-Language Evaluation Forum (FP7-ICT TrebleCLEF).
B.3.1.c Preliminary Allocations
The allocations are provisional and changes might occur over the three years of the project.