Description
Advances in machine learning, particularly Large Language Models (LLMs), enable more efficient interaction with complex data through tokenization and next-token prediction, providing a novel framework for analyzing high-energy physics datasets. This talk presents and compares various approaches to structuring particle physics data as token sequences, allowing LLM-inspired models to learn event distributions and detect anomalies via next-token (or masked-token) prediction in proton-proton collisions at the Large Hadron Collider (LHC). By training solely on background events, the model learns to reconstruct the expected Standard Model (SM) processes. During inference, both background and signal events are processed, and deviations in reconstruction scores flag anomalous events, offering a data-driven approach to distinguishing processes or uncovering physics beyond the Standard Model (BSM). This technique is particularly relevant for exploring rare or unexpected signatures, such as four-top-quark production or supersymmetric (SUSY) processes. The method is tested using simulated LHC Run 2 (√s = 13 TeV) proton-proton collision data from the Dark Machines Collaboration, replicating ATLAS conditions and specifically targeting SM and BSM four-top-quark final states. The event tokenization strategies presented in this talk not only enable anomaly detection but also represent a potential new approach to training a foundation model at the LHC. By integrating state-of-the-art ML techniques with fundamental physics principles, this approach paves the way for more adaptive, data-driven methods in particle physics, potentially enhancing future searches for new physics at the LHC and beyond.
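The core scoring idea described above can be illustrated with a minimal sketch. This is not the talk's implementation: here a Laplace-smoothed bigram model stands in for the trained LLM, event tokenization is assumed to yield integer sequences, and the per-event mean negative log-likelihood of next-token prediction serves as the anomaly score. Events that the background-only model predicts poorly receive higher scores.

```python
# Hypothetical sketch: a bigram next-token model trained on background-only
# token sequences, with per-event mean negative log-likelihood as the
# anomaly score. All names here are illustrative assumptions.
import math
from collections import defaultdict

def train_bigram(background_events, vocab_size, alpha=1.0):
    """Count token transitions over background events (Laplace-smoothed)."""
    counts = defaultdict(lambda: defaultdict(float))
    for event in background_events:
        for prev, nxt in zip(event, event[1:]):
            counts[prev][nxt] += 1.0
    # Convert counts to next-token probability tables.
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values()) + alpha * vocab_size
        probs[prev] = {t: (row.get(t, 0.0) + alpha) / total
                       for t in range(vocab_size)}
    return probs, vocab_size

def anomaly_score(event, model):
    """Mean negative log-likelihood of the event under the background model."""
    probs, vocab_size = model
    nll = 0.0
    for prev, nxt in zip(event, event[1:]):
        row = probs.get(prev)
        p = row[nxt] if row else 1.0 / vocab_size  # unseen context: uniform
        nll -= math.log(p)
    return nll / max(len(event) - 1, 1)

# Toy usage: background events follow a regular token pattern; an event
# with unusual transitions should receive a higher anomaly score.
background = [[0, 1, 2, 3]] * 50 + [[0, 1, 2, 2]] * 5
model = train_bigram(background, vocab_size=4)
bg_score = anomaly_score([0, 1, 2, 3], model)   # background-like event
sig_score = anomaly_score([3, 0, 0, 1], model)  # signal-like event
assert sig_score > bg_score
```

In the LLM setting, the bigram table is replaced by an autoregressive transformer, but the decision rule is the same: rank events by how surprising they are to a model that has only ever seen background.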