Foundation models are machine learning models designed to handle a wide range of datasets and tasks. After pre-training on one specific task and dataset, these models can be fine-tuned for various downstream applications, including different tasks and datasets. Developing such models for physics data could significantly enhance performance in the field and substantially cut down...
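As a schematic of the pre-train-then-fine-tune workflow described above, here is a minimal PyTorch sketch; the dimensions, the downstream task, and the commented-out checkpoint path are illustrative assumptions, not a specific model.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained backbone; in practice its weights would be
# restored from a checkpoint produced by large-scale pre-training, e.g.:
# backbone.load_state_dict(torch.load("pretrained_backbone.pt"))
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

# Freeze the pre-trained weights and attach a small task-specific head.
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(256, 10)  # e.g. a new 10-class downstream task
model = nn.Sequential(backbone, head)

# Fine-tuning then updates only the head's parameters.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```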
Neutrinos are elusive particles that require massive detectors for observation. The IceCube neutrino observatory at the South Pole is a cubic kilometer of Antarctic ice, instrumented with 5,160 digital optical modules. Its results play an essential role in both particle physics and astrophysics.
Deep learning methods, such as graph neural networks, have been successfully applied to the steady...
Scientific Foundation Models (SciFMs) hold the promise of accelerating numerical simulation of physical phenomena. In recent years, a myriad of SciFMs for weather forecasting have been proposed by major companies (e.g., Microsoft's ClimaX and Aurora) as well as research centers (e.g., ECMWF's AIFS). The development of SciFMs in other domains such as Computational Fluid Dynamics (CFD) has not...
This study explores using transformer models to analyze data from the KM3NeT/ORCA neutrino detector. Due to the current detector's size, reconstructing neutrino events is challenging. By training models on simulations for the full detector (115 detection units) and fine-tuning them on smaller configurations, significant performance improvements are achieved compared to models trained from...
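One way a single transformer can be pre-trained on the full 115-unit geometry and then reused on smaller configurations is to treat an event as a padded set of inputs with an attention mask. Below is a minimal sketch of that mechanism; the dimensions and architecture are illustrative assumptions, not the actual KM3NeT/ORCA models.

```python
import torch
import torch.nn as nn

d_model = 64
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

# Batch of 2 events: one from the full detector (115 units) and one from
# a smaller configuration (24 units), padded to a common length.
full = torch.randn(1, 115, d_model)
small = torch.randn(1, 24, d_model)
small_padded = torch.cat([small, torch.zeros(1, 115 - 24, d_model)], dim=1)
batch = torch.cat([full, small_padded], dim=0)

# True marks padding; attention ignores those positions, so weights
# pre-trained on the full geometry transfer to smaller ones.
pad_mask = torch.zeros(2, 115, dtype=torch.bool)
pad_mask[1, 24:] = True

out = encoder(batch, src_key_padding_mask=pad_mask)  # (2, 115, d_model)
```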
There is growing interest in the development of renewable energies, particularly wind power. However, wind turbines generate noise that can affect the sound environment of nearby residents.
This study focuses on isolating the wind turbine noise (WTN) level from the surrounding total noise. Our method is based on a Recurrent Neural Network (RNN) architecture that captures temporal...
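A minimal sketch of such a recurrent estimator follows; the feature set, dimensions, and architecture are illustrative assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn

class WTNLevelEstimator(nn.Module):
    """Toy recurrent model: maps a sequence of spectral frames from the
    total-noise recording to an estimated wind turbine noise level."""
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # scalar WTN level (e.g. in dB)

    def forward(self, x):        # x: (batch, time, n_features)
        _, h = self.rnn(x)       # h: (1, batch, hidden), final hidden state
        return self.head(h[-1])  # (batch, 1)

model = WTNLevelEstimator()
frames = torch.randn(8, 100, 40)  # 8 clips, 100 frames, 40 spectral bands
print(model(frames).shape)        # torch.Size([8, 1])
```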
In January of this year, the European Space Agency officially adopted the space-based gravitational-wave detector LISA as a mission, with launch planned for 2035. LISA will open up a new band in the gravitational-wave frequency spectrum, at millihertz frequencies. This band is expected to be very rich in sources, ranging from binaries of compact stars in our galaxy, to binaries involving supermassive...
An important aspect of gravitational-wave astronomy is solving inverse problems, i.e., determining the properties of astrophysical sources from their gravitational signals. This involves the construction of complex forward models for possible signals by solving the equations of general relativity, as well as the use of these forward models in data-analysis algorithms to extract and...
The future space-based gravitational-wave detector LISA will observe millions of galactic binaries (GBs) constantly present in the data. A small fraction of this population will be individually resolved. One of the challenging tasks will be to estimate the parameters of resolvable GBs while disentangling them from each other and from other gravitational-wave sources present in the data. This...
In this talk, we explore two assumptions ubiquitous in LISA data analysis: Gaussianity and stationarity of astrophysical noise sources (i.e., those arising from source confusion). I will provide an overview of characterization and parameter-estimation techniques for both properties. I will review the most recent findings on the Galactic population of double white dwarfs and the extragalactic one of...
Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations, or to translate signals from one domain to another (as in image captioning, or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and...
Designing requires mixing physical knowledge, experience accumulated from past designs, and constraints defining design objectives.
Proteins are large biomolecules that play crucial roles in all living organisms. They are linear polymers that can be described as a sequence over a 20-letter alphabet (one letter for each amino acid). They can therefore be represented as discrete objects. In water,...
This talk presents a novel distance function and modeling framework for mixed-variable domains, effectively handling heterogeneous data with continuous, integer, and categorical variables, including meta variables that shape the problem structure. The approach, presented in a recent paper, enhances generalization and optimization in large representation models for science without partitioning the data...
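The talk's specific distance function is not reproduced here; as a classical reference point for mixed-variable domains, a Gower-style distance normalizes numeric differences and counts categorical mismatches. All variable names below are illustrative.

```python
def gower_distance(x, y, kinds, ranges):
    """Gower-style distance between two mixed-type points.
    kinds:  per-variable type, 'num' (continuous/integer) or 'cat';
    ranges: per-variable range used to normalize numeric differences.
    (A classical baseline; the talk's distance additionally handles
    meta variables that reshape the problem structure.)"""
    d = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == "num":
            d += abs(xi - yi) / rng   # normalized numeric difference
        else:
            d += float(xi != yi)      # categorical mismatch indicator
    return d / len(kinds)

x = [1.5, 3, "steel"]
y = [2.0, 7, "carbon"]
print(gower_distance(x, y, ["num", "num", "cat"], [10.0, 20.0, None]))  # ~0.417
```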
Linguistics thrives on data, whether it stems from small highly controlled laboratory studies or from large heterogeneous datasets. Speech technology is increasingly providing new and varied tools to test linguistic theories (from sound change to second language learning) on large scale data. This, however, does not come without its challenges. In this presentation, we address one of the key...
The round table is structured into three 30-minute discussion segments focused on the following topics:
• Foundation Models in Science
• Heterogeneous Data and Multimodal Representation Learning
• Inverse Problems - likelihood-free, simulation-based approaches
This talk will introduce the RELEO (REpresentation Learning for Earth Observation) project (2024-2028), a research chair of the Artificial and Natural Intelligence Toulouse Institute (ANITI). RELEO aims at developing new self-supervised representation learning methods to produce semantically meaningful probabilistic representations from high-dimensional multi-modal EO data. The originality of...
An important challenge in DNA replication analysis is to recover a so-called timing profile, that contains important information about the replication dynamics, from nonlinear observations. We show that this challenge can be expressed as a nonlinear sparse coding inverse problem where the unknown timing profile is assumed to be piecewise affine.
We propose a novel formalism and...
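In generic form (notation assumed here rather than taken from the paper), such a problem is a penalized nonlinear least-squares fit in which piecewise-affinity of the profile is encoded as sparsity of its discrete second derivative:

```latex
% y   : observed data,   f  : known nonlinear observation operator,
% u   : unknown timing profile,   D_2 : discrete second-difference operator.
% u piecewise affine  <=>  D_2 u sparse, hence the l1 penalty.
\begin{equation}
  \hat{u} \;=\; \arg\min_{u}\; \tfrac{1}{2}\,\lVert y - f(u) \rVert_2^2
  \;+\; \lambda\,\lVert D_2 u \rVert_1
\end{equation}
```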
Deep Learning has seen a recent paradigm shift, from training specialized models on dedicated datasets to so-called Foundation Models, trained in a self-supervised manner on vast amounts of data and then adapted to solve specific tasks with state-of-the-art performance. This new paradigm has been exceptionally successful not only for large language models (LLMs) but also in other domains such as...
Astronomical facilities generate ever-increasing data volumes, rapidly approaching the exascale. In this talk, I will introduce YOLO-CIANNA, a deep-learning object detector for astronomical images, and present results over simulated 2D continuum images and HI emission cubes from the SKAO SDCs. I will then discuss how the method could be applied to data from the SKA precursor and how we could...
About a quarter of the energy density of the visible Universe consists of Dark Matter (DM), an unfamiliar and elusive form of matter that is yet to be understood. DM particles can be detected by experiments at the Large Hadron Collider (LHC); however, such searches are very challenging. We propose a novel approach based on Graph Neural Networks, combining low- and high-level information to...
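A minimal sketch of an architecture in that spirit, combining message passing over low-level constituents with high-level event variables, built on PyTorch Geometric; all dimensions and names are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class EventClassifier(nn.Module):
    """Toy classifier: message passing over low-level constituents,
    pooled and concatenated with high-level event variables."""
    def __init__(self, n_node_feats=4, n_event_feats=6, hidden=32):
        super().__init__()
        self.conv1 = GCNConv(n_node_feats, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Sequential(
            nn.Linear(hidden + n_event_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # signal-vs-background logit
        )

    def forward(self, x, edge_index, batch, event_feats):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        g = global_mean_pool(h, batch)  # one embedding per event
        return self.head(torch.cat([g, event_feats], dim=1))
```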
In High Energy Physics, experimental data can range from low level hardware information from different sub-detectors to high level reconstructed physics events. To address the need for flexible, multimodal machine learning models within the ATLAS experiment, the Salt framework based upon PyTorch and Lightning has been developed. Salt was initially developed for the identification of...
At the Large Hadron Collider (LHC), proton-proton collisions produce collimated streams of particles, called jets, that arise from particle decay chains. Identifying the particle that originated a jet (flavour tagging) is crucial. Modern taggers use deep learning models with features of the decay products as inputs. We show that integrated gradients reveal how these complex and opaque models use...
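Integrated gradients itself is a standard attribution method: it scales the input-minus-baseline difference by the model's gradient averaged along the straight path from baseline to input. A self-contained sketch follows; the toy model and feature dimension are assumptions.

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """Attribute model(x) to input features by averaging gradients
    along the straight path from baseline to x (Riemann sum)."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)  # (steps, *x.shape)
    path.requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)  # per-feature attributions

# Toy usage: attributions for one jet's 8-dimensional feature vector.
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1))
x, baseline = torch.randn(8), torch.zeros(8)
print(integrated_gradients(model, x, baseline))
```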
High-energy physics (HEP) experiments, e.g. ATLAS and CMS, collide opposing bunches of particles and characterize the collision's final state. The innermost part of a detector consists of many sensors which detect the passage of a charged particle by measuring its energy deposit. A tracking algorithm reconstructs the trajectories of all particles from these measurements, a computationally intensive...
A new paradigm for weather and climate prediction has emerged recently: data-driven prediction models have achieved performance comparable to standard physics-based models, thanks to an accurate (task-specific or task-agnostic) encoding of the data distribution. While these models are able to efficiently use relatively homogeneous data, the next challenge to expand the capabilities of...
The recent emergence of quality data, large scale compute and deep learning advancements has enabled an acceleration in the field of Machine Learning for Weather Forecasting. Today's talk centers on two pieces of work: GraphCast and GenCast, both Medium Range Global Weather Forecasting models. The former produces deterministic forecasts up to 10 days into the future, while the latter makes...
The integration of Ultrasound Localization Microscopy (ULM) into ultrasound imaging has significantly improved resolution, providing precise insights into blood flow direction and velocity. However, despite its potential, ULM remains a complex and time-consuming technique, even as deep learning (DL) continues to drive its optimization. Current DL methods for microbubble (MB) superlocalization...
Processing heterogeneous multimodal data presents challenges. These datasets feature complex, irregular structures due to nested or variable-sized outputs from different sensors, or due to missing data values. The data are typically of mixed types, complicating the preprocessing steps required before they can be fed into algorithms like multimodal representation models. AI practitioners must...
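A common first step for taming such tables before they reach a representation model is a column-wise preprocessing pipeline, for example with scikit-learn; the toy columns below are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy heterogeneous table: mixed types and missing values.
df = pd.DataFrame({
    "temperature": [21.5, None, 19.0],
    "sensor_id":   ["A", "B", "A"],
    "count":       [3, 7, None],
})

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["temperature", "count"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sensor_id"]),
])

X = preprocess.fit_transform(df)  # numeric feature matrix, ready for a model
print(X.shape)                    # (3, 4): 2 scaled numeric + 2 one-hot cols
```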
NVIDIA supports the scientific community in leveraging data-driven and AI approaches in computational physics workflows. This talk will showcase how researchers use NVIDIA's open-source libraries like Modulus to integrate learning methods with scientific solvers. Specifically, it will focus on recent results that facilitate AI in-the-loop approaches and enable on-the-fly training and inference...
In this talk, we'll explore cutting-edge techniques to optimize both training and inference in PyTorch, enabling faster, more efficient model execution. We'll dive into the power of PyTorch's torch.compile to accelerate workflows by fusing operations and generating optimized code, reducing runtime overhead. Additionally, we'll cover the use of custom kernels with tools like Triton, Pallas...
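For instance, wrapping a function in torch.compile is a one-line change; the toy kernel below is an illustration, not an example from the talk.

```python
import torch

def pairwise_energy(x):
    # Small numeric kernel: sum of squared pairwise differences.
    diff = x.unsqueeze(0) - x.unsqueeze(1)
    return (diff ** 2).sum()

# torch.compile traces the function and emits fused, optimized code;
# the first call compiles, later calls reuse the compiled artifact.
fast_energy = torch.compile(pairwise_energy)

x = torch.randn(1024)
print(fast_energy(x))
```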
Artificial intelligence is increasingly relied on to assist with complex tasks by leveraging vast amounts of data. Building useful representations is a core ingredient to the performance of such systems, and arguably goes beyond the mere extraction of statistical information in observed data. One way to express desiderata for such representations using modifications to the data generation...