Locality-Sensitive Hashing indexing schemes for metagenomics data
Jacques-Henri Sublemontier, DRF, Iramis
28 November 2018, 11h, Evry, Genoscope, salle Jacob
In this study we propose to survey several LSH approaches developed since SimHash and to discuss their practical use for indexing DNA sequences (represented by their compositional profile) in metagenomics. The goal of the indexing is to hash nearby DNA sequences to the same code (or to nearby codes), so that sequences potentially originating from the same species (bacterium, microbe, etc.) are indexed identically. The end goal is a data structure that speeds up downstream analysis tasks such as building proximity graphs between sequences or binning the sequences from good approximations. Several recent approaches exploit this mechanism as a preliminary step to binning based on shared nearest neighbours. The baseline scheme we revisit (SimHash) is based on binarizing a random projection of the data. More recent work improves on this scheme by reducing the cost of sampling the random projector or of computing the projection, by introducing structure into the projector, or by changing the way the projection is binarized. Since it produces a partition of the data, the hashing can also be viewed as a first binning and analysed as such. The hashing schemes under study are evaluated on their ability to approximate a proximity graph such as the k-nearest-neighbour graph between reads, and on their ability to group sequences as well as possible with respect to their class. Finally, we will examine how machine learning can be introduced to fit the data structure to the data.
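As a rough illustration of the baseline scheme revisited in the talk, here is a minimal SimHash sketch in Python; the k-mer profile dimension, the number of bits and the function names are illustrative choices, not the speaker's implementation.

```python
import numpy as np

def simhash_codes(profiles, n_bits=32, seed=0):
    """SimHash: binarize a random projection of each composition profile.

    profiles : (n_sequences, n_features) array of k-mer composition vectors.
    Returns an (n_sequences, n_bits) array of 0/1 codes; sequences with
    similar profiles agree on most bits (small Hamming distance).
    """
    rng = np.random.default_rng(seed)
    # Dense Gaussian projector; structured projectors (e.g. circulant ones)
    # reduce the sampling and projection cost, as discussed in the talk.
    projector = rng.standard_normal((profiles.shape[1], n_bits))
    return (profiles @ projector > 0).astype(np.uint8)

# Illustrative usage on random 4-mer composition profiles (256 features).
profiles = np.random.rand(1000, 256)
codes = simhash_codes(profiles, n_bits=32)
# Bucketing identical codes gives a first, coarse binning of the reads.
buckets = {}
for i, code in enumerate(map(tuple, codes)):
    buckets.setdefault(code, []).append(i)
```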
Graph Degeneracy for social nets and text mining
Michalis Vazirgiannis, DASCIM, Ecole Polytechnique
25 September 2018, 11h, CEA-Saclay, Orme des Merisiers, Bat 709, ground floor, salle Cassini
Graph degeneracy is a popular method to approximate the densest subgraph in almost linear time. In our research work we have extended this method to weighted and directed graphs, and we capitalize on these extensions to investigate its potential in different graph and text mining cases. One of the cases is k-core based community evaluation, specifically metrics that integrate authority and collaboration, properties captured neither by single-node metrics nor by the established community evaluation metrics. We further introduce novel metrics for evaluating the collaborative nature of directed graphs and define a novel D-core metric, extending the classic graph-theoretic notion of k-cores to directed graphs. We applied the D-core approach to large real-world graphs such as Wikipedia and Aminer.org citation data and report interesting results; the D-core metric has been adopted by Aminer as part of its reported metrics. We also investigate the issue of influence maximization in graphs, using degeneracy as a means to select the optimal spreaders. The results are promising and show that starting an epidemic from the densest k-truss leads to faster and broader spreading. We also investigate thoroughly the issue of graph similarity via novel graph kernels and embedding schemes, with applications to graph classification in chemo-informatics, social networks and text mining.
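A minimal sketch of degeneracy-based spreader selection with NetworkX, assuming an undirected, unweighted graph; the weighted and directed D-core extension discussed in the talk is not covered by the library call used here.

```python
import networkx as nx

def degeneracy_spreaders(G, top=10):
    """Rank nodes by their k-core number (graph degeneracy ordering).

    Nodes in the maximal core are candidate influential spreaders, in the
    spirit of the talk; nx.k_truss can be substituted for a truss-based
    selection.
    """
    core = nx.core_number(G)                       # k-core index of every node
    return sorted(core, key=core.get, reverse=True)[:top]

# Illustrative usage on a synthetic scale-free graph.
G = nx.barabasi_albert_graph(1000, 3, seed=42)
print(degeneracy_spreaders(G, top=5))
```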
At the level of text mining, we capitalize on the Graph-of-Words (GoW) model, which relies on a graph representation of documents and inherently captures word order and distances within the document, beyond frequency alone, in order to capture document similarity. We applied graph-of-words to various tasks such as ad-hoc Information Retrieval, Single-Document Keyword Extraction, Text Categorization, Sub-event Detection in Textual Streams (i.e. Twitter) and document summarization. In all these cases the graph-of-words approach, assisted at times by degeneracy, outperforms the state-of-the-art baselines. We are currently investigating the potential of the GoW as input to deep learning architectures for text mining tasks.
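A minimal sketch of the Graph-of-Words construction and of main-core keyword extraction, assuming a fixed sliding window of 4 terms; the window size and the trivial tokenization are illustrative choices.

```python
import networkx as nx

def graph_of_words(tokens, window=4):
    """Build an unweighted Graph-of-Words: one node per distinct term and an
    edge between terms co-occurring within a sliding window, so that word
    order and proximity are captured beyond raw frequency."""
    g = nx.Graph()
    g.add_nodes_from(set(tokens))
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                g.add_edge(tokens[i], tokens[j])
    return g

# Degeneracy-assisted keyword extraction: keep the main core of the GoW.
doc = "graph degeneracy extends k cores to weighted and directed graphs".split()
g = graph_of_words(doc, window=4)
keywords = nx.k_core(g).nodes()    # terms belonging to the densest core
print(sorted(keywords))
```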
2018_09_CEA_vazirgiannis_degeneracy_applications (1).pdf
Omics data processing and analysis for high-throughput phenotyping
Etienne Thévenot, LIST
04 July 2018, 11h, CNRGH
Molecular phenotyping approaches are complementary to genomic strategies for the discovery of robust biomarker signatures of disease or of response to treatment. Metabolomics is the characterization of the low molecular weight molecules involved in the biochemical reactions of metabolism. Mass spectrometry enables comprehensive characterization of the metabolites in a biological or clinical sample. Signal processing, data analysis, and computer science are pivotal for the detection, quantification, selection, and annotation of biomarkers from these high-volume and high-complexity data. We propose to discuss some recent mathematical and computational developments and challenges for high-throughput phenotyping.
180704_computational-phenotyping_ET.pdf
Advances in Machine Learning in High Energy Physics
David Rousseau, LAL
05 May 2018, 11h, CEA-Saclay, Bat 141, salle André Berthelot
Machine Learning (known in HEP as Multivariate Analysis) was used to some extent in HEP in the nineties, then at the Tevatron and recently at the LHC. However, with the birth of the internet giants at the turn of the century, there has been an explosion of Machine Learning tools in industry, leaving HEP behind. A collective effort has been under way for the last few years to bring state-of-the-art Machine Learning tools to high energy physics, and to promote collaborations between HEP physicists and Machine Learning specialists.
This seminar will give a tour d'horizon of Machine Learning in HEP: a review of tools; examples of applications, some usable immediately (e.g. cross validation, novelty detection), some in a (possibly distant) future (e.g. deep learning, computer vision); recent and future HEP ML competitions; and setting up frameworks for Machine Learning collaborations.
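As a minimal illustration of one of the "usable immediately" tools mentioned above, here is a scikit-learn cross-validation sketch on a synthetic signal-versus-background dataset; the model and the dataset are placeholders, not an LHC analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Toy two-class problem standing in for a signal/background selection.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           random_state=0)

clf = GradientBoostingClassifier()             # a typical BDT-style MVA model
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```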
tr180515_davidRousseau_CEA_HEPML.pptx(1).pdf
New Dynamical Systems Tools to study Atmospheric flows
Davide Faranda, LSCE
21 February 2018, 11h, CEA-Saclay, Bat 141, salle André Berthelot
Atmospheric flows are characterized by chaotic dynamics and recurring large-scale patterns. These two characteristics point to the existence of an atmospheric attractor defined by Lorenz as "the collection of all states that the system can assume or approach again and again, as opposed to those that it will ultimately avoid". The average dimension D of the attractor corresponds to the number of degrees of freedom sufficient to describe the atmospheric circulation. However, obtaining reliable estimates of D has proved challenging. Moreover, D does not provide information on transient atmospheric motions, which lead to weather extremes. Using recent developments in dynamical systems theory, we show that such motions can be classified through instantaneous rather than average properties of the attractor. The instantaneous properties are uniquely determined by instantaneous dimension and stability. Their extreme values correspond to specific atmospheric patterns, and match extreme weather occurrences. We further show the existence of a significant correlation between the time series of instantaneous stability and dimension and the mean spread of sea-level pressure fields in an operational ensemble weather forecast at steps of over two weeks. We believe this method provides an efficient and practical way of evaluating and informing operational weather forecasts.
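A minimal sketch of the extreme-value estimator of the instantaneous dimension, applied here to the Lorenz-63 system rather than to sea-level pressure fields; the 0.98 quantile and the crude Euler integration are illustrative choices.

```python
import numpy as np

def instantaneous_dimension(traj, quantile=0.98):
    """Instantaneous (local) attractor dimension via extreme value theory.

    For each state z, the exceedances of -log(distance to z) above a high
    quantile are approximately exponential with mean 1/d(z), so the
    instantaneous dimension is the reciprocal of the mean exceedance.
    traj : (n_times, n_vars) array, e.g. flattened pressure fields.
    """
    dims = np.empty(len(traj))
    for i, z in enumerate(traj):
        dist = np.linalg.norm(traj - z, axis=1)
        dist[i] = np.inf                           # exclude the point itself
        g = -np.log(dist)
        thresh = np.quantile(g[np.isfinite(g)], quantile)
        exceed = g[g > thresh] - thresh
        dims[i] = 1.0 / exceed.mean()
    return dims

def lorenz(n=10000, dt=0.01, s=10.0, r=28.0, b=8.0 / 3.0):
    """Euler integration of the Lorenz-63 system (illustrative only)."""
    x = np.empty((n, 3))
    x[0] = (1.0, 1.0, 1.0)
    for t in range(n - 1):
        dx = np.array([s * (x[t, 1] - x[t, 0]),
                       x[t, 0] * (r - x[t, 2]) - x[t, 1],
                       x[t, 0] * x[t, 1] - b * x[t, 2]])
        x[t + 1] = x[t] + dt * dx
    return x

# The mean over the trajectory should be close to the Lorenz attractor
# dimension (about 2.06), while the pointwise values fluctuate around it.
traj = lorenz()[2000:]
print(np.mean(instantaneous_dimension(traj, quantile=0.98)))
```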
Machine Learning Techniques at the LHC experiments
Özgür Sahin, DPhP
16 January 2018, 11h, CNRGH