Uncovering latent jet substructure

Barry M Dillon, Jernej F Kamenik, Darius A Faroughy

doi:10.1103/physrevd.100.056002

Open Access

https://doi.org/10.1103/physrevd.100.056002

Abstract

We apply techniques from Bayesian generative statistical modeling to uncover hidden features in jet substructure observables that discriminate between different a priori unknown underlying short distance physical processes in multi-jet events. In particular, we use a mixed membership model known as Latent Dirichlet Allocation to build a data-driven unsupervised top-quark tagger and $t\bar t$ event classifier. We compare our proposal to existing traditional and machine learning approaches to top jet tagging. Finally, employing a toy vector-scalar boson model as a benchmark, we demonstrate the potential for discovering New Physics signatures in multi-jet events in a model independent and unsupervised way.

Highlights

The use of jet substructure techniques in studying large-area jets has played an important role in identifying hadronic decays of Higgs and electroweak gauge bosons in Runs 1 and 2 of the LHC [1,2,3,4].
We have demonstrated a new unsupervised machine learning (ML) technique for disentangling signal and background events in mixed samples by identifying features in jet substructure observables that differentiate between the two.
To do so, we have mapped jet substructure distributions onto an LDA model, a generative probabilistic model widely used in Bayesian approaches to unsupervised ML.
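
As a rough illustration of what "generative" means for Latent Dirichlet Allocation, the sketch below samples one toy jet from the standard LDA generative process: draw a theme mixture from a Dirichlet prior, then for each measurement draw a theme and, from that theme, a bin of a substructure observable. Treating a jet as a "document" whose "words" are bin indices is an assumption made here for illustration; the two themes, the four bins, and all numerical values are hypothetical and not taken from the paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) using normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_jet(n_tokens, alpha, themes, rng):
    """Sample one toy 'jet' following the LDA generative process.

    alpha  : Dirichlet concentration parameters, one per theme
    themes : one probability distribution over observable bins per theme
    """
    theta = sample_dirichlet(alpha, rng)       # per-jet theme mixture
    tokens = []
    for _ in range(n_tokens):
        z = sample_categorical(theta, rng)      # latent theme of this token
        w = sample_categorical(themes[z], rng)  # observed bin index
        tokens.append(w)
    return theta, tokens


rng = random.Random(0)

# Hypothetical themes over 4 bins of some substructure observable.
themes = [
    [0.70, 0.20, 0.05, 0.05],  # assumed "background-like" theme
    [0.05, 0.05, 0.20, 0.70],  # assumed "signal-like" theme
]
alpha = [0.5, 0.5]  # assumed symmetric prior

theta, tokens = generate_jet(20, alpha, themes, rng)
print("theme mixture:", theta)
print("bin indices:  ", tokens)
```

Inference runs in the opposite direction: given only the observed bin indices from many jets, one estimates the themes and the per-jet mixtures, which is what allows signal-like and background-like features to be separated without labels.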

Summary

INTRODUCTION

The use of jet substructure techniques in studying large-area jets has played an important role in identifying hadronic decays of Higgs and electroweak gauge bosons in Runs 1 and 2 of the LHC [1,2,3,4]. In the last few years, machine learning (ML) tools have extended the application of jet substructure in tagging jets at the LHC [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32] through the use of neural networks (NNs) to process and “learn” from vast amounts of training data. Since these approaches rely on theoretical predictions for pure signal and background training data sets [typically through Monte Carlo (MC) generators], they (a) are exposed to MC mismodeling of realistic events as reconstructed from real data and detectors, and (b) require exact model knowledge of both the expected signal and backgrounds. We compare our unsupervised approach to existing conventional and ML approaches and outline possible further improvements and future directions.

GENERATIVE BAYESIAN MODELS OF JET SUBSTRUCTURE

UNSUPERVISED TOP TAGGER

UNSUPERVISED NP SEARCH

CONCLUSIONS