The purpose of this repository is to explore using machine learning techniques to solve high energy physics problems, as preparation for later analysis of long-lived particles arising in hidden valley sector theories of dark matter. Specifically, it is often useful to discriminate between different kinds of 'jets', which are collections of decay products resulting from high energy collisions. In this case, we distinguish between gluon jets and quark jets, where gluon and quark refer to the type of particle that initiated the jet.
This project is intended to be run on the TeV cluster at the UW Department of Physics using Python 3. If you intend to run it on a different cluster, you must modify the `hostfile.txt` file to include appropriate addresses for your cluster, and you may also need to modify the `dask_ssh.sh` script.
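For reference, `hostfile.txt` is presumably consumed by `dask_ssh.sh` via dask's standard `dask-ssh` launcher, which expects one hostname per line. The addresses below are placeholders, not the real TeV machine names:

```
tev01.example.edu
tev02.example.edu
tev03.example.edu
```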
First, clone the repository from git to a convenient location (ssh is recommended). Once you have cloned the repository, source the `package_setup.sh` script in bash. This will install the required packages and set up your environment. Note that pipenv is used to maintain a virtual Python environment, so you must either start the pipenv virtual environment by running `pipenv shell`, or run all scripts by prepending `pipenv run`, e.g., `pipenv run python package_test.py`. See the pipenv documentation for more information.
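Putting the setup steps together (the clone URL is a placeholder; substitute the repository's actual address):

```bash
# Clone over ssh (placeholder URL)
git clone git@github.com:<user>/<repo>.git
cd <repo>

# Install dependencies and configure the environment
source package_setup.sh

# Either enter the virtual environment...
pipenv shell
python package_test.py

# ...or prefix each command instead
pipenv run python package_test.py
```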
The main scripts are `train.py` and `metrics.py`; if you run `pipenv run python train.py -h`, you will see an up-to-date list of available commands. `train.py` supports training several different models on all or part of the quark/gluon data, either locally or using the full TeV cluster, and with or without hyper-parameter optimization. The trained model resulting from `train.py` is stored in a new 'run directory' which is specified relative to `RUNS_PATH` defined in `constants.py`. You can specify a name for the run directory, or a default one will be created for you based on the parameters passed to `train.py`.
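For orientation, here is a minimal sketch of how `constants.py` and the default run-directory naming might look. Only the name `RUNS_PATH` comes from the source; its value and the naming scheme below are illustrative assumptions:

```python
# constants.py (sketch) -- only RUNS_PATH is known from the docs;
# the value and the helper below are illustrative assumptions.
import os
import datetime

RUNS_PATH = os.path.expanduser("~/qg_runs")  # assumed default location


def default_run_dir(model_name, max_events):
    """Build a default run-directory name from training parameters (assumed scheme)."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    return os.path.join(RUNS_PATH, f"{model_name}_{max_events}_{stamp}")
```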
By default (i.e., without `--local` specified), `train.py` uses the TeV cluster to speed up computations. To do so, you must first run `source dask_ssh.sh` on the scheduler machine (tev01); note that this script is currently broken. This will spawn a dask scheduler on tev01 and dask workers on the other tev machines. Once you're done running `train.py`, you can terminate these processes by pressing CTRL + C.
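Once the scheduler is up, client code connects to it over the network. The exact connection logic in `train.py` is not shown here; a minimal sketch using dask's standard API (8786 is dask's default scheduler port, and the address is an assumption):

```python
# Sketch: connecting to the scheduler started by dask_ssh.sh.
# The address/port are assumptions; 8786 is dask's default scheduler port.
from dask.distributed import Client

client = Client("tev01:8786")  # attach to the running scheduler
print(client)                  # shows the workers that joined the cluster
```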
The fastest model to train is the Naive Bayes model, so let's use that as an example. Run `pipenv run python train.py -m NB --local --no_hyper`. If you want to use the cluster (again, currently broken), try running `pipenv run python train.py -m "GBRT" --max_events 1000000`, and a scikit-learn gradient-boosted regression tree (more commonly known as a boosted decision tree, or BDT) will be fit to the data.
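The internals of `train.py` aren't reproduced here, but the underlying fit for the GBRT option is presumably scikit-learn's gradient boosting. A minimal, self-contained sketch on synthetic data (the features, shapes, and labels are stand-ins, not the actual quark/gluon inputs):

```python
# Sketch: fitting a gradient-boosted model, as train.py's GBRT option presumably does.
# The data here is synthetic; real runs use the quark/gluon jet features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                           # stand-in per-jet features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)    # stand-in quark/gluon labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
bdt = GradientBoostingClassifier(n_estimators=100)
bdt.fit(X_train, y_train)
print("test accuracy:", bdt.score(X_test, y_test))
```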
Once you've created and trained a model, you can see performance metrics by running `metrics.py`. Run `pipenv run python metrics.py`, and several plots will be created and saved in the run directory. There is an optional `--run_dir` parameter to specify the name of the run directory your model is saved in, but by default it uses the most recently modified run directory.
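The specific plots are defined in `metrics.py` and aren't reproduced here. As an illustration of the kind of figure such a script typically produces for a binary discriminator, here is a hedged sketch of a ROC curve, building on the synthetic `bdt`, `X_test`, and `y_test` from the example above (the axis interpretations are assumptions):

```python
# Sketch: one plausible metric plot -- a ROC curve for jet discrimination.
# Assumes a fitted classifier `bdt` and held-out (X_test, y_test) as above.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

scores = bdt.predict_proba(X_test)[:, 1]        # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc.png")                          # metrics.py saves plots into the run directory
```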
The next step is to create and test a convolutional neural network (CNN) model in Keras, based on this research, which was presented at Boost 2017. We can initially use a 2-dimensional CNN, with pseudorapidity (eta) and azimuthal angle (phi) as image coordinates and variables such as transverse momentum (pt) as image channels. However, in order to study long-lived particles, it will likely be crucial to extend to a 3-dimensional CNN by including depth information, since an important signature of such particles is a displaced vertex in the third dimension.
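As a starting point, here is a minimal sketch of the kind of 2D CNN described above, treating each jet as an eta-phi image with pt as the single channel. The grid size, layer sizes, and hyper-parameters are illustrative assumptions, not taken from the referenced work:

```python
# Sketch: a small 2D CNN over jet images (eta x phi grid, pt as the channel).
# All sizes and hyper-parameters here are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(33, 33, 1)),           # 33x33 eta-phi grid, 1 pt channel (assumed)
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # quark vs. gluon probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Extending this to the 3-dimensional case would mean swapping in `Conv3D`/`MaxPooling3D` layers over an (eta, phi, depth) volume, at the cost of substantially more data and compute.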