13–15 Jan 2026
LPNHE
Timezone: Europe/Paris

CEEMS: A Resource Manager Agnostic Energy and Performance Monitoring Stack

14 Jan 2026, 14:20
20m
Amphi Charpak (LPNHE)

Speaker

Dr Mahendra PAIPURI

Description

With the rapid acceleration of ML/AI research in the last couple of years, the energy consumption of the Information and Communication Technology (ICT) domain has increased sharply. Since a major part of this consumption comes from users' workloads, users need to be aware of the energy footprint of their applications. The Compute Energy & Emissions Monitoring Stack (CEEMS) [1] has been designed to address this issue. CEEMS reports the energy consumption and equivalent emissions of user workloads in real time on both SLURM (HPC) and OpenStack (Cloud) platforms. Besides CPU energy usage, it reports the energy usage and performance metrics of workloads running on NVIDIA and AMD GPU accelerators. It supports a variety of energy sources such as BMC (IPMI/Redfish), RAPL, Cray PMC, etc. In addition to the energy consumption of individual workloads, CEEMS offers cluster-level metrics that let Data Center (DC) operators monitor the overall energy consumption of the cluster, its usage by individual users and projects, etc.
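To make the idea of turning an energy source into "equivalent emissions" concrete, here is a minimal sketch that is not taken from CEEMS. It assumes a RAPL-style cumulative energy counter in microjoules (as exposed by the Linux powercap interface through `energy_uj` and `max_energy_range_uj`) and a hypothetical, user-supplied emission factor in gCO2e per kWh. The function names are illustrative only.

```python
def rapl_delta_uj(prev_uj: int, curr_uj: int, max_range_uj: int) -> int:
    """Energy used between two cumulative counter readings, in microjoules.

    Assumption: the counter wrapped at most once between the two readings.
    """
    if curr_uj >= prev_uj:
        return curr_uj - prev_uj
    # Counter wrapped around: add the part before and after the wrap.
    return (max_range_uj - prev_uj) + curr_uj


def emissions_g(energy_uj: float, factor_g_per_kwh: float) -> float:
    """Convert energy in microjoules to grams of CO2-equivalent.

    factor_g_per_kwh is a hypothetical emission factor chosen by the caller.
    """
    kwh = energy_uj / 3.6e12  # 1 kWh = 3.6e6 J = 3.6e12 microjoules
    return kwh * factor_g_per_kwh


if __name__ == "__main__":
    # Illustrative readings, not measured data.
    delta = rapl_delta_uj(prev_uj=900, curr_uj=50, max_range_uj=1000)
    print(delta, emissions_g(3.6e12, 50.0))
```

In practice the emission factor varies with location and time, so a real deployment would obtain it from an external source rather than hard-coding it.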

Although CEEMS was developed with energy estimation of individual workloads as its primary objective, it has been extended to report important performance metrics. It leverages the Linux perf subsystem and eBPF [2] to monitor application performance metrics, which helps end users quickly identify bottlenecks in their workflows and optimize them to reduce their energy and carbon footprint.

CEEMS is built around prominent open-source tools in the observability ecosystem, such as Prometheus and Grafana. It is designed to be extensible and lets DC operators easily customize the energy estimation rules for user workloads based on the underlying hardware. CEEMS also integrates with Grafana Pyroscope to continuously profile user workloads on SLURM and Kubernetes platforms, which has proved to be an effective way to optimize them. The talk will conclude with a quick demonstration of CEEMS monitoring.
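As an illustration of what an energy estimation rule for a workload might look like, the sketch below splits a node-level power reading among jobs in proportion to their CPU time. This is an assumed, simplified rule for illustration and not the rule CEEMS ships; the function name and parameters are hypothetical.

```python
def job_power_watts(node_power_w: float,
                    job_cpu_seconds: float,
                    node_cpu_seconds: float) -> float:
    """Attribute a share of node power to one job.

    Assumed rule: the job's share of node power equals its share of the
    CPU time consumed on the node during the same sampling interval.
    """
    if node_cpu_seconds <= 0:
        # No CPU activity recorded on the node: attribute nothing.
        return 0.0
    share = job_cpu_seconds / node_cpu_seconds
    return node_power_w * share


if __name__ == "__main__":
    # Illustrative numbers: a 400 W node where the job used 30 of 120 CPU-seconds.
    print(job_power_watts(400.0, 30.0, 120.0))
```

A rule of this kind depends on the hardware: a node whose power reading includes GPUs or other devices would need additional terms, which is presumably why the rules are left customizable by operators.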

Contribution type: use case
Talk duration: 20 min

Author

Presentation materials

No documents.