Statistics for omics
A proto-program for the Institut Pascal
Organizers: Marie-Laure Martin-Magniette (INRA, IPS2), Sophie Schbath (INRA-Jouy/UPSay) and Stéphane Robin (INRA Paris/AgroParisTech/UPSay)
Not all of the participants named below have been consulted. This proto-program should be considered a proposal, without any commitment from the researchers mentioned.
Keywords: omics data, high-dimension problem, heterogeneous data integration, statistics for networks
In genomics, the data produced are typically high-dimensional, heterogeneous and multi-scale. High-dimensional data are data sets in which a very large number of variables p is measured on a small number of individuals n (n << p). Heterogeneous data are measurements of different natures, such as concentrations, counts, physical measurements (weights, sizes, ...), genomic and genetic data including phylogenetic or relatedness relationships, and binary data. Data may be temporal or static, and measured at regular intervals or not. They also include functional data, where the observation is a function that itself depends on other variables. Finally, measurements span a range of scales, from the sub-cellular level to the whole-organism level, observed in several different environments.
Thanks to technological improvements, genomics studies and analyses are becoming more and more sophisticated, and new issues are being identified. For example, genome-wide studies require the analysis of very long sequences of several million observations for each individual under study, and metagenomics studies point to the need for methods that identify interactions between the organisms composing the system. The construction of networks of several hundred biological entities requires visualisation tools and statistics to extract new knowledge and to summarize the network.
The proposed program aims to encourage communities already well established in statistics and computer science to work together on priority biological issues. Invitations will be sent to the Paris-Saclay biology community (INRA-Jouy-Versailles-Moulon, IPS2, CGM, IGM, EGCE…) to present their biological issues.
The objective of the program is to develop a modeling approach that makes the best use of these experimental data to answer biological questions or test hypotheses. Some questions of interest are the following:
How to extract significant relationships between heterogeneous, high-dimensional data sets?
How to take advantage of the data structure, whether known or only approximately known?
How to develop methods that can handle huge data sets and complex models in a reasonable time?
How to study, identify and model the interactions between different scale levels? Which variables, or functions of variables, are relevant?
How to combine large-scale data on the one hand with high-precision data for small sub-parts of the model on the other?
How to visualise the results so as to relate them back to the biological issues?
At least 7 UPSay labs (about 30 researchers) are expected to be interested in the program for the methodological issues.
Across France, we have listed more than 15 labs that may be interested (see attached list), representing more than 30 permanent researchers. A preliminary list of potentially interested researchers worldwide is also attached, although the actual number of potentially interested people is obviously much larger.
List of potentially interested labs or permanent researchers:
MIA-Paris: S. Robin, J. Chiquet, J. Aubert, T. Mary-Huard, C. Lévy-Leduc, …
MaIAGE: S. Schbath, M. Mariadassou, P. Nicolas, S. Huet, S. Plancade, C. Larédo, …