Exploring Neural Networks in ROOT

For this PHYS291 project I hoped to explore different artificial neural network architectures and configurations, using the tools provided with ROOT's Toolkit for Multivariate Analysis (TMVA), to classify Higgs to tau-tau events in the simulated dataset from the 2014 Higgs Boson Machine Learning Challenge.

Link to macros, models and data

Background

Neural Networks

The concept of artificial neural networks stems from the neurological structure of the brain, though it has evolved far from those roots. The neural network accepts a vector of input values {x_0, ..., x_n}, each of which connects to every node in a set of nodes, entering each node as one term of a linear combination with that node's weights. The weighted sum in a given node is passed through a differentiable, bounded activation function (usual choices include the sigmoid and the hyperbolic tangent), which determines how strongly the node activates and contributes to the next layer.
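The computation in a single node can be sketched as follows. This is a minimal illustration, not TMVA's implementation: the function names nodeOutput and sigmoid are hypothetical, and the bias term is an assumption that the text above does not mention.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted sum of the inputs followed by a bounded, differentiable
// activation. tanh is used here; the bias term is an assumed extra.
double nodeOutput(const std::vector<double>& inputs,
                  const std::vector<double>& weights,
                  double bias) {
    assert(inputs.size() == weights.size());
    double sum = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        sum += weights[i] * inputs[i];
    }
    return std::tanh(sum);
}

// Logistic sigmoid, the other common activation choice.
double sigmoid(double z) {
    return 1.0 / (1.0 + std::exp(-z));
}
```

Both activations saturate for large inputs, which is what keeps the output of a node bounded regardless of the size of the weighted sum.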

In a single layer feedforward network, there is one layer of these nodes, which then feed into a single output unit giving a probability for a given event to be signal or background. A multilayer neural network feeds the values from these nodes into another layer of nodes, which has its own set of weights and takes the outputs of the previous layer as its inputs. The number of layers gives the depth of the neural network, whilst the number of nodes in each layer gives the width (which can vary between layers).
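How depth and width enter the computation can be sketched with a small forward pass. The Layer struct and the forward function are hypothetical names, the bias terms are an assumption, and tanh is applied to every layer including the output, so the result lies in [-1, 1] rather than being a probability.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One fully connected layer: weights[j][i] links input i to node j.
struct Layer {
    std::vector<std::vector<double>> weights;
    std::vector<double> biases;
};

// Feed the input through every layer in turn. layers.size() is the depth
// (counting the output layer); weights.size() of a layer is its width.
std::vector<double> forward(const std::vector<Layer>& layers,
                            std::vector<double> values) {
    for (const Layer& layer : layers) {
        assert(layer.weights.size() == layer.biases.size());
        std::vector<double> next(layer.weights.size());
        for (std::size_t j = 0; j < layer.weights.size(); ++j) {
            assert(layer.weights[j].size() == values.size());
            double sum = layer.biases[j];
            for (std::size_t i = 0; i < values.size(); ++i) {
                sum += layer.weights[j][i] * values[i];
            }
            next[j] = std::tanh(sum);
        }
        values = next;
    }
    return values;
}
```

Adding a layer means appending one more Layer to the vector, while widening a layer means adding rows to its weight matrix (and columns to the next layer's).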

The determination of the weights is done through minimising a loss function:
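As an illustration only, a sum-of-squares loss over N training events is assumed here, since it is a common choice; which loss was actually used is not stated. The symbols L, w and N are assumed names, and y(C) is taken for the same event i as y(x_i):

```latex
L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y(x_i) - y(C) \bigr)^2
```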

where y(x_i) is the predicted event type and y(C) is the true event type for a given event. The loss function is minimised through back-propagation, where the weights are adjusted in the direction of steepest descent of the loss function:
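In standard gradient-descent notation, with L denoting the loss function and w_{jk} a single weight (both symbol names are assumptions), the update reads:

```latex
w_{jk} \leftarrow w_{jk} - \eta \, \frac{\partial L}{\partial w_{jk}}
```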

where eta is the learning rate of the model, a positive number set when the training is initialised.
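A single update of this kind can be sketched for the simplest possible case. The example assumes one tanh node and a squared-error loss for a single event; both are illustrative choices, and the names loss and gradientStep are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squared-error loss for one event: 0.5 * (prediction - target)^2,
// where the prediction is tanh of the weighted input sum.
double loss(const std::vector<double>& w, const std::vector<double>& x,
            double target) {
    assert(w.size() == x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) sum += w[i] * x[i];
    const double diff = std::tanh(sum) - target;
    return 0.5 * diff * diff;
}

// One gradient-descent step: w_i <- w_i - eta * dL/dw_i.
// For y = tanh(s), the chain rule gives dL/dw_i = (y - target) * (1 - y^2) * x_i.
void gradientStep(std::vector<double>& w, const std::vector<double>& x,
                  double target, double eta) {
    assert(w.size() == x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) sum += w[i] * x[i];
    const double y = std::tanh(sum);
    const double delta = (y - target) * (1.0 - y * y);
    for (std::size_t i = 0; i < w.size(); ++i) w[i] -= eta * delta * x[i];
}
```

Calling gradientStep repeatedly with a small positive eta moves the weights so that the loss for that event decreases; a full training run applies the same idea to every weight in every layer.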

TMVA

TMVA is an extension to the ROOT framework that provides an environment with a number of easily accessible machine learning methods. Of these, I chose to focus on the artificial neural networks, of which there are three different implementations. The most recent implementation available in ROOT ("kMLP") was used for this project.

Higgs Boson Machine Learning Challenge

The Higgs Boson Machine Learning Challenge dataset is available in its entirety (in a .csv format) from CERN's Open Data platform. Each event has 30 variables associated with it (17 primary; 13 derived), along with extra variables giving an event number, weights, a signal/background label and a subset label, as the original dataset contains the training set given for the challenge, public/private leaderboards and unused data.

Method

The Kaggle dataset was converted from a .csv file into a ROOT file, with the "Label" and "Set" columns being changed into numerical values for easier interpretation by ROOT. The training and testing sets were written to separate .root files, both of which are used to load signal and background trees into a factory object, which handles the training, testing and evaluation of methods.
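The conversion of the text labels into numbers can be sketched as below. The integer codes are an assumption, not necessarily the ones used in the actual conversion, and the letter tags (s/b for the label; t, b, v, u for the subset) are taken from my understanding of the published dataset. The function names are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Map the signal/background label to a number.
// Assumed encoding: "s" -> 1 (signal), "b" -> 0 (background).
int labelToInt(const std::string& label) {
    if (label == "s") return 1;
    if (label == "b") return 0;
    throw std::invalid_argument("unknown Label: " + label);
}

// Map the subset tag to a number. Assumed tags: t = training,
// b = public leaderboard, v = private leaderboard, u = unused.
// The integer codes 0..3 are arbitrary.
int setToInt(const std::string& set) {
    if (set == "t") return 0;
    if (set == "b") return 1;
    if (set == "v") return 2;
    if (set == "u") return 3;
    throw std::invalid_argument("unknown Set: " + set);
}
```

Throwing on an unknown tag, rather than silently mapping it to a default, makes a malformed row in the .csv file visible during conversion.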

To explore the various neural network architectures, a number of models were trained using the macro TrainNetwork.C, changing the setup to investigate a few key hyperparameters and preprocessing techniques, namely:

  • Ratio of Training/Testing events
  • Normalisation of variables
  • Depth/width of network
  • Top ranked variables

The evaluation of the networks was based on the Receiver Operating Characteristic (ROC) curves produced by the macros in the TMVAGUI. These plot the background rejection (1 - background efficiency) versus the signal efficiency at various points when cutting on the classifier output. The ROC curve integral indicates how well the classifier separates signal events from background.

The neuron activation function was held constant as tanh(x) across all training sessions. 30000 signal and background events were used to train each of the models, so as to limit training time. The weight expression used was the normalised weight over each set (given as "KaggleWeight" for each event). The top 10 variables were determined from a training run with 2 hidden layers.
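The construction of a ROC curve and its integral can be sketched as follows. This is not the TMVAGUI implementation: it is an unweighted, brute-force illustration (event weights are ignored), and the names RocPoint, rocCurve and rocIntegral are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct RocPoint {
    double signalEfficiency;
    double backgroundRejection;
};

// Place a cut at every observed classifier output, highest first, and record
// the fraction of signal passing and the fraction of background rejected.
std::vector<RocPoint> rocCurve(const std::vector<double>& signalScores,
                               const std::vector<double>& backgroundScores) {
    assert(!signalScores.empty() && !backgroundScores.empty());
    std::vector<double> cuts(signalScores);
    cuts.insert(cuts.end(), backgroundScores.begin(), backgroundScores.end());
    std::sort(cuts.rbegin(), cuts.rend());  // descending order
    cuts.erase(std::unique(cuts.begin(), cuts.end()), cuts.end());

    std::vector<RocPoint> curve;
    curve.push_back({0.0, 1.0});  // cut above every score: nothing passes
    for (double cut : cuts) {
        const auto passes = [cut](double score) { return score >= cut; };
        const double effS = static_cast<double>(std::count_if(
            signalScores.begin(), signalScores.end(), passes)) / signalScores.size();
        const double effB = static_cast<double>(std::count_if(
            backgroundScores.begin(), backgroundScores.end(), passes)) / backgroundScores.size();
        curve.push_back({effS, 1.0 - effB});
    }
    return curve;
}

// Trapezoidal integral of background rejection over signal efficiency.
double rocIntegral(const std::vector<RocPoint>& curve) {
    double area = 0.0;
    for (std::size_t i = 1; i < curve.size(); ++i) {
        area += 0.5 * (curve[i].backgroundRejection + curve[i - 1].backgroundRejection)
                    * (curve[i].signalEfficiency - curve[i - 1].signalEfficiency);
    }
    return area;
}
```

A classifier that separates the two classes perfectly gives an integral of 1, while one whose output carries no information gives an integral of about 0.5.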

Results

Figure 1: ROC-Curve, Normalised/Not Normalised data, 2 Hidden layers with N nodes

Figure 2: ROC-Curve, 1/2 Hidden layers with N nodes

Figure 3: ROC-Curve, 2 Hidden layers with N+10/N nodes in the first layer

Figure 4: ROC-Curve, 2 Hidden layers with Top 10 variables

Discussion

Using a 50%/50% split between training and testing data seemed to produce a slightly less effective classification than the more conventional 80% training to 20% testing. This might be due to the larger testing set covering more edge cases for which the model was not appropriately trained, hence returning a smaller ROC integral.

One of the clearest results from this exploration is the need for normalisation, at least in the case of this dataset. This could be due to the wide variation in the ranges of the variables (some spanning [0, 3], others [-999, 999]). Normalising these allows the network to respond equally to all variables, giving a more uniform training process. Normalisation in the loading of the tree also seems necessary to maximise the ROC integral.
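The effect of normalisation on a single variable can be sketched with a linear rescaling. The target interval [-1, 1] is an assumption made for this illustration, and the function name normalise is hypothetical; this is not the code TMVA runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linearly rescale one variable so that its minimum maps to -1 and its
// maximum to +1. A constant variable is mapped to 0 everywhere.
std::vector<double> normalise(const std::vector<double>& values) {
    assert(!values.empty());
    const auto range = std::minmax_element(values.begin(), values.end());
    const double lo = *range.first;
    const double hi = *range.second;
    std::vector<double> scaled(values.size(), 0.0);
    if (hi == lo) return scaled;
    for (std::size_t i = 0; i < values.size(); ++i) {
        scaled[i] = 2.0 * (values[i] - lo) / (hi - lo) - 1.0;
    }
    return scaled;
}
```

After this step a variable spanning [0, 3] and one spanning [-999, 999] both occupy the same interval, so neither dominates the weighted sums purely because of its scale.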

One surprising result is how unaffected the networks seem by any change in the width of the first layer. A wider first layer was expected to have a visible effect on the ROC curve, but it was instead slightly less effective than two layers with the same number of nodes. A plausible explanation could be that, given the large number of variables for each event, the additional nodes are superfluous. This carries over to the number of layers, with the curves being nearly identical for an MLP with one or two hidden layers.

Using the top 10 variables produced varying results during training, with the integral still being slightly larger than for the full variable set. This is unexpected, as the loss of information from the remaining variables should make the classification less accurate.

Methodological concerns

Considering the architecture graph for the top 10 variables (for which the integral of the ROC Curve is approximately the same as the full set of variables):

Figure 5: Network Architecture Graph, Top 10 variables

It seems to show that only one of the nodes in the last hidden layer contributes to the output. This is further exemplified by the network architecture of a more complete model:

Figure 6: Network Architecture Graph, All variables

Where only two nodes of the final hidden layer contribute to the final model.

Implementing convergence tests in the training seemed to shorten training time considerably, and should have been considered for more of the models. Considering the convergence graph for a model trained on all the data:

Figure 7: Convergence graph over 1000 cycles, All data

It is somewhat concerning that this is the case, as it implies that the model finds a minimum which is not reflective of the true classification.
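The idea behind a convergence test can be sketched as a stopping rule on the recorded loss values. This is a generic illustration, not TMVA's own criterion: the function name stoppingEpoch and the parameters patience and minImprovement are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the epoch at which training would stop: the epoch that
// completes `patience` consecutive epochs without beating the best loss so far
// by more than `minImprovement`. Returns losses.size() if that never happens.
std::size_t stoppingEpoch(const std::vector<double>& losses,
                          std::size_t patience,
                          double minImprovement) {
    assert(patience > 0);
    if (losses.empty()) return 0;
    double best = losses[0];
    std::size_t stalled = 0;
    for (std::size_t epoch = 1; epoch < losses.size(); ++epoch) {
        if (best - losses[epoch] > minImprovement) {
            best = losses[epoch];
            stalled = 0;
        } else if (++stalled >= patience) {
            return epoch;
        }
    }
    return losses.size();
}
```

A rule of this kind stops as soon as the loss flattens, which saves time but cannot tell a good minimum from a poor one that happens to be reached quickly.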

The use of KaggleWeight as a weight expression might have been misguided, as the original dataset provided to competitors excluded both the weight and the normalised weight as variables to be used during training. This might have led to certain events being deprioritised during the training, and to a reduction in importance for some of the variables.

Issues with the dataset

Some variables in the dataset use the value -999 to signify that the variable is undefined or does not exist for a given event. This might have affected the training, as the normalisation used in most of the training runs might give a skewed result for these variables. This might also be the reason the reader did not function properly, as ROOT interprets this value as "Not a Number". The lack of a working reader is regrettable, as it leaves the trained models without a mode of application.
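How strongly the -999 placeholder can skew a min-max normalisation is easy to quantify. The sketch below computes which fraction of the full value range is occupied by the genuine (non-placeholder) entries; the function name validSpanFraction is hypothetical, and the example numbers are illustrative rather than taken from the dataset.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Fraction of the full min-max span that is occupied by entries different
// from the sentinel. Returns 0 if there are no valid entries or no spread.
double validSpanFraction(const std::vector<double>& values, double sentinel) {
    assert(!values.empty());
    const auto full = std::minmax_element(values.begin(), values.end());
    const double fullSpan = *full.second - *full.first;
    bool anyValid = false;
    double lo = 0.0;
    double hi = 0.0;
    for (double v : values) {
        if (v == sentinel) continue;  // skip placeholder entries
        if (!anyValid) {
            lo = hi = v;
            anyValid = true;
        }
        lo = std::min(lo, v);
        hi = std::max(hi, v);
    }
    if (!anyValid || fullSpan == 0.0) return 0.0;
    return (hi - lo) / fullSpan;
}
```

For a variable whose genuine values span [0, 3], a single -999 entry stretches the range to 1002 units, so after normalisation all genuine values are squeezed into roughly 0.3% of the interval.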

Conclusion

This might have been far too ambitious a project to set out on without a firmer grasp of multivariate methods and machine learning. Most of the results seem to contradict the current understanding of neural networks (both width and depth should contribute to a greater extent than they do). One somewhat plausible explanation is that all the training runs found similar local minima in the function space, given the quick convergence and the similar ROC curves and integrals. This does not explain why the last hidden layers had so few activated nodes, for which I can offer no reasonable explanation.

Addendum: Quirks of TMVA

Sources

A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. von Toerne, and H. Voss (2007), TMVA - Toolkit for Multivariate Data Analysis, PoS ACAT 040, arXiv:physics/0703039
ATLAS collaboration (2014), Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014, CERN Open Data Portal
I. Goodfellow, Y. Bengio and A. Courville (2016), Deep Learning, MIT Press
H. Voss (2012), Data Analysis in TMVA