Hyperparameter optimization

Posted by Ferran Pla - 03/06/2019

In Machine Learning, many parameters are estimated and improved during the learning process. By contrast, a hyperparameter is a variable whose value is set before training; it is therefore neither estimated nor corrected by that process.

In this post I explain different approaches used to perform hyperparameter optimization with the aid of a few visual examples. I try to keep the equations light and I provide links to original articles should the reader require a more in-depth understanding. I also offer a comparison of their performance in different simulated hyperparameter spaces.

Introduction

Neural network models are exceptionally effective and malleable; this is both a blessing and a curse. While neural nets are able to solve many challenging problems, they also present an overwhelming sea of possible configurations. Being able to choose a suitable configuration is often what determines whether the net can accomplish difficult tasks.

Deep learning’s success owes much to the ability of the backpropagation algorithm to efficiently calculate the gradient of an objective function with respect to each model parameter. It is in our hands to design the frame in which the net will learn. In most cases, we must determine:

  • Feature engineering: The quality and quantity of the features extracted from our data are essential and will dramatically influence the results we achieve. It is a difficult and expensive process, usually tackled through domain knowledge. (E.g. when dealing with a moving object, the speed as well as the acceleration are good feature candidates.)
  • Model architecture: We must face the problem of choosing the most appropriate algorithm for our specific problem. Many algorithms are available and each of them offers plenty of hyperparameters to adjust. Searching for solutions to similar problems might be a good starting point, but more often than not, this alone doesn't achieve adequate results and tweaking is needed.
  • Training configuration: The algorithm's learning efficiency will be determined by the training configuration.
  • Evaluation method: In order to choose the best model, a comparison method determining each model's appropriateness must be defined. (E.g. accuracy and ROC AUC over the test sets are metrics often used to compare models.)

Therefore, a large number of decisions must be made before the training and evaluation process.

To simplify, let’s focus on the process after having already established the feature engineering and the learning algorithm, i.e. hyperparameter optimization. Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

 

Hyperparameter Optimization Approaches

In order to explain the methods visually, I will make use of a simple test function for optimization as a hypothetical black-box function representing the final metric of the defined evaluation method given a certain configuration. In this case, our configuration is just one hyperparameter, the learning rate, and the evaluation metric is the accuracy over the test set:

$$\lambda^{*} = \underset{\lambda}{\arg\max}\; f(\lambda)$$

where $\lambda$ is the learning rate and $f(\lambda)$ is the accuracy over the test set, which we want to maximize. The strategies shown are applicable to mixed spaces (numerical and categorical) unless advised otherwise.

The essential trait is having an ordered set; in the case of a categorical variable, rounding the sampled value and indexing into the actual set accomplishes the extension.
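As a minimal illustration of that extension (the optimizer names here are hypothetical), a continuous sample in [0, 1) can be mapped onto a categorical set by scaling and indexing:

```python
# Extending a numerical sampler to a categorical hyperparameter:
# scale the continuous sample and index into the ordered set
optimizers = ["sgd", "adam", "rmsprop"]   # hypothetical ordered set
u = 0.71                                  # continuous sample in [0, 1)
index = min(int(u * len(optimizers)), len(optimizers) - 1)
print(optimizers[index])                  # -> rmsprop
```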

Grid Search

Grid search, or parameter sweep, is the traditional method for performing hyperparameter optimization: simply a brute-force search through a manually specified subset of the hyperparameter space. In our example, we define the following 5 learning rate values manually, evaluate each and save the best result:

[Figure: the manually defined lambda (learning rate) values]

[Figure: grid search evaluations over the learning rate domain]

This means that grid search would train a model with each learning rate value and output the rate that achieved the highest score in the evaluation method defined; in our example, the output would be lr = 0.5. There is, however, a decision to be made before the exploration: we must define the grid manually, which is not only a matter of the number of evaluations (in our case, 5 samples), but also a matter of scale depending on the hyperparameter tuned (e.g. the learning rate is usually sampled on a logarithmic scale, as its useful values span several orders of magnitude).
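To make this concrete, here is a minimal sketch of the loop, where `evaluate` is a toy stand-in for the black-box train-and-score step and the five grid values are assumed (the post's exact values are not recoverable):

```python
def evaluate(lr):
    # Toy stand-in for training a model and returning its test accuracy;
    # this particular function peaks at lr = 0.5
    return 1.0 - (lr - 0.5) ** 2

# Manually defined grid; the learning rate is often swept on a log scale
grid = [0.001, 0.01, 0.1, 0.5, 1.0]

best_lr = max(grid, key=evaluate)
print(f"best lr = {best_lr}, accuracy = {evaluate(best_lr):.3f}")
```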

Obviously, grid search suffers from the curse of dimensionality; on the other hand, since each evaluation is independent of the others, the process can be parallelized whenever possible.

Here is a guide on Support Vector Machines using grid search.

 

Random search

Random search samples random values from the hyperparameter domain. It can be applied to a manually defined discrete setting as described above, but it also generalizes well to continuous and mixed spaces. A stopping mechanism needs to be defined, usually either by setting a number of iterations or by setting an objective threshold to be achieved. It performs well when the optimization problem has a low intrinsic dimensionality. In our example, the samples are drawn uniformly at random from the domain, and the stopping threshold is set to 0.99 accuracy:

[Figure: random search evaluations over the learning rate domain]

Random search allows for parallelism and shares the curse of dimensionality with grid search; additionally, it allows prior knowledge to be included by specifying the distribution from which to sample. Here is an example of academic use of this method.
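A minimal sketch of the loop with uniform sampling and the 0.99 stopping threshold (reusing the toy `evaluate` stand-in; the domain bounds and iteration cap are assumptions):

```python
import numpy as np

def evaluate(lr):
    # Toy stand-in for training a model and returning its test accuracy
    return 1.0 - (lr - 0.5) ** 2

rng = np.random.default_rng(0)
lo, hi = 0.0, 1.0          # assumed learning-rate domain
threshold = 0.99           # stopping threshold from the example

best_lr, best_score = None, -np.inf
for _ in range(1000):          # safety cap on the number of iterations
    lr = rng.uniform(lo, hi)   # prior knowledge could swap in, e.g.,
    score = evaluate(lr)       # a log-uniform distribution here
    if score > best_score:
        best_lr, best_score = lr, score
    if best_score >= threshold:
        break
print(f"best lr = {best_lr:.3f}, accuracy = {best_score:.3f}")
```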

 

Bayesian optimization

Bayesian optimization is a methodology to perform global optimization of multimodal black-box functions. As such, it is applicable to hyperparameter optimization. It combines 4 steps:

  • Choose some prior measure over the space of possible evaluation method functions. As our objective function is unknown (evaluation metric over the configurations function), this step intends to treat it as a random function with a chosen prior distribution. The prior captures our beliefs about the behaviour of the function. (E.g. A multivariate Gaussian distribution is often used.)
  • Combine the prior distribution with some given observations (configurations tested) to get a posterior distribution over the objective (estimation of where the true function resides).
  • Use the posterior distribution to decide where to test the next configuration according to some acquisition function. An acquisition function is a tool that tells us which configuration to test next, i.e. where the most information will be gained.
  • Evaluate the configuration chosen in step 3.

After providing some already-tested initial configurations (at least 2) and a first pass through the steps, Bayesian optimization iterates steps 2 through 4 until the stopping mechanism is triggered.

Without some statistical background, the process might sound complicated, but basically, by using a probabilistic model and iteratively evaluating an interesting hyperparameter configuration based on the current information, it aims to find the optimal configuration.

Let me walk you through our example step by step. First, we assume all distributions to be multivariate normal distributions. Then, we choose 2 initial configurations randomly (in this case, 2 learning rate values inside the domain). Having some initial evaluations, we iterate the process (following the steps above):

  • Use the evaluations to estimate the accuracy function within a confidence interval (green region) around the estimated mean (dashed black line).
  • Use the estimation and the acquisition function (brown line in the figure below) to output the most interesting value (gold star over the brown line).
  • Evaluate the point and check the stopping mechanism (in this case, we set the threshold at 0.9995).

[Figure: Bayesian optimization iterations over the learning rate domain]
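Here is a minimal sketch of this loop. It uses scikit-learn's GaussianProcessRegressor for the posterior and a UCB acquisition; the library choice, kernel and kappa value are my assumptions, not necessarily what produced the figure:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def evaluate(lr):
    # Toy stand-in for training a model and returning its test accuracy
    return 1.0 - (lr - 0.5) ** 2

rng = np.random.default_rng(0)
domain = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)  # assumed lr domain

# Two random initial configurations, already evaluated
X = rng.uniform(0.0, 1.0, size=(2, 1))
y = np.array([evaluate(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
kappa = 2.0  # weighting between exploration and exploitation

for _ in range(30):
    # Posterior over the objective given the observations so far
    gp.fit(X, y)
    mu, sigma = gp.predict(domain, return_std=True)
    # UCB acquisition: pick the most informative next configuration
    x_next = domain[np.argmax(mu + kappa * sigma)]
    # Evaluate it and check the stopping mechanism
    y_next = evaluate(x_next[0])
    X = np.vstack([X, x_next.reshape(1, 1)])
    y = np.append(y, y_next)
    if y.max() >= 0.9995:
        break
print(f"best lr = {X[np.argmax(y)][0]:.4f}, accuracy = {y.max():.4f}")
```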

It is a very powerful and flexible tool that offers a trade-off between two key concepts, exploration and exploitation, so as to minimize the number of evaluations:

  • Exploration: Prioritizes evaluating in places where the variance is large, i.e. where our estimation is most likely to be inaccurate (in our example, broader green regions).
  • Exploitation: Prioritizes evaluating in places where the estimation is closest to our objective (in our example, higher points on the dashed mean line).

Weighting these two concepts is key in every optimization problem; depending on our objective space, strengthening one concept over the other could be vastly beneficial. This trade-off is offered by the acquisition function; examples include probability of improvement, expected improvement, Bayesian expected losses, upper confidence bounds (UCB), Thompson sampling and mixtures of these. For a brief explanation, I suggest reading these notes; two of them are sketched below.
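For instance, two of these acquisition functions can be written in closed form given the posterior mean and standard deviation (a sketch for the maximization setting):

```python
import numpy as np
from scipy.stats import norm

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # Larger kappa weights exploration; smaller weights exploitation
    return mu + kappa * sigma

def expected_improvement(mu, sigma, best_so_far):
    # Expected gain over the best observation so far
    z = (mu - best_so_far) / np.maximum(sigma, 1e-12)
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)
```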

To better understand the entire process, try it on your own with this pure Python open-source implementation.

This approach has been used in a wide variety of situations, even outside machine learning, and has proven to be very effective. Bayesian optimization really shines when the evaluation time is long and the hyperparameter space is large and mixed, which is usually the case for hyperparameter optimization in DL.

 

Evolutionary optimization

Emulating Charles Darwin's natural selection, an evolution strategy is an algorithm that selects the strongest versions from a population and then produces the next generation based on them. Starting with a random population, the iterative process of selecting the strongest and producing the next generation stops once the best known solution is satisfactory to the user.

In hyperparameter optimization, evolutionary optimization is a methodology for the global optimization of noisy black-box functions. Searching the hyperparameter space, this method is a batch-sequential process that aims to optimize the objective function.

There are many evolutionary strategies; I am going to show you one of the most successful to date: the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Like many strategies, it samples from a normal distribution; however, this approach is flexible and can take the observations of each generation and adaptively increase or decrease the search space for the next generation. It will not only adapt the mean and sigma parameters of each hyperparameter, but will also calculate the entire covariance matrix of the hyperparameter space.

CMA-ES modifies the covariance calculation formula in a clever way to make it adapt well to an optimization problem. To simplify, let's assume 2 hyperparameters and that we keep the strongest 25% of the population at each iteration. First, the mean of the stronger samples is computed:

$$\mu_{h_1} = \frac{1}{N}\sum_{i=1}^{N}\hat{h}_{1,i} \qquad \mu_{h_2} = \frac{1}{N}\sum_{i=1}^{N}\hat{h}_{2,i}$$

where $h_1$ is the first hyperparameter, $h_2$ the second hyperparameter and $N$ is the number of samples kept from each population. The hat over the variables indicates the stronger population. Next, instead of computing the usual covariance:

$$\sigma_{h_1 h_2} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{h}_{1,i} - \mu_{h_1}\right)\left(\hat{h}_{2,i} - \mu_{h_2}\right)$$

it computes the deviations around the mean of the complete current generation, $\mu^{(g)}$, instead of around the mean of the stronger observations:

$$\sigma_{h_1 h_2} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{h}_{1,i} - \mu^{(g)}_{h_1}\right)\left(\hat{h}_{2,i} - \mu^{(g)}_{h_2}\right)$$

Armed with these computed metrics, a new population is sampled from a normal distribution with the estimated means and covariance.

Let's take a look at the whole process in our ongoing example. First, we sample an arbitrary number of observations from a predefined distribution over our hyperparameter space; in our case, let's take 10 random samples from a normal distribution centered at 0.85 with a variance of 0.05. Then we start the iterative process:

  • Let evolution do its work and keep the 3 strongest samples (the 3 green points with the highest accuracy).
  • Compute the mean and the modified sigma over the strong population (the vertical dashed black line and the green region, respectively).
  • Sample a new set of 10 samples from the normal distribution with the computed metrics (red ticks over the horizontal axis).
  • Evaluate and check the stopping mechanism (in our case, surpassing 100 evaluations).

 

[Figure: CMA-ES iterations over the learning rate domain]

Observing the progress of the algorithm visually, we can understand the reason for the modification of the covariance matrix. During the first stages, the modified variance grows in order to reach a wider range of learning rates and, as we approach the optimum, the variance decreases, converging to zero.
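A minimal sketch of this simplified loop in our one-hyperparameter example (the initial distribution and stopping rule follow the text; everything else is illustrative, and a full implementation such as Hansen's `cma` package adapts much more than this):

```python
import numpy as np

def evaluate(lr):
    # Toy stand-in for training a model and returning its test accuracy
    return 1.0 - (lr - 0.5) ** 2

rng = np.random.default_rng(0)
mu, sigma = 0.85, np.sqrt(0.05)   # initial mean and std (variance 0.05)
pop_size, n_keep = 10, 3          # keep the 3 strongest per generation
evaluations, best = 0, -np.inf

while evaluations < 100:          # stopping mechanism from the example
    population = rng.normal(mu, sigma, size=pop_size)
    scores = np.array([evaluate(lr) for lr in population])
    evaluations += pop_size
    best = max(best, scores.max())
    elite = population[np.argsort(scores)[-n_keep:]]  # strongest samples
    gen_mean = population.mean()  # mean of the complete current generation
    # Modified variance: deviations around the generation mean,
    # not around the elite mean
    sigma = np.sqrt(np.mean((elite - gen_mean) ** 2))
    mu = elite.mean()             # new mean from the strongest samples

print(f"best accuracy after {evaluations} evaluations: {best:.4f}")
```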

This is a very popular black-box optimization algorithm. It works well in many situations and has been tested by many researchers in all kinds of settings. As a hyperparameter optimization algorithm, it performs well over unstable spaces. For further details, I recommend reading this CMA-ES Tutorial prepared by Nikolaus Hansen, the author of the CMA-ES algorithm.

 

Other Algorithms 

Gradient-based optimization has produced the most effective algorithms in machine learning; even over black-box functions, the gradient can be estimated with outstanding results. In hyperparameter optimization, however, the search space usually isn't algebraic, and navigating in a gradient fashion requires addition and multiplication operations as well as a notion of convergence. (E.g. the decision of whether to use dropout or not cannot be explored in a continuous manner.)

In cases where our hyperparameter space is algebraic (e.g. the learning rate combined with the regularization weight), I recommend using gradient-based optimization. For further details, check the theory behind the Broyden method, a quasi-Newton method that estimates the Jacobian matrix, and feel free to test it on your problems!
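As a small sketch, SciPy's BFGS minimizer, a quasi-Newton method from the same Broyden family, can navigate such an algebraic 2-D space; the objective below is a smooth toy stand-in for the negated accuracy:

```python
from scipy.optimize import minimize

def negative_accuracy(params):
    # Smooth toy stand-in for -accuracy over (learning rate,
    # regularization weight); its minimum sits at (0.1, 0.01)
    lr, reg = params
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

# BFGS builds a curvature estimate from successive gradients,
# in the quasi-Newton spirit described above
result = minimize(negative_accuracy, x0=[0.5, 0.1], method="BFGS")
print(result.x)   # -> approximately [0.1, 0.01]
```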

Method Comparison

In applied mathematics, test functions, known as artificial landscapes, are useful for evaluating characteristics of optimization algorithms. I am going to use three different public-domain test functions to evaluate the algorithms explained above. More specifically, the hypothetical objective functions, chosen to cover a wide variety of situations, are the following:

 

Sphere function:

$$f(x, y) = x^2 + y^2$$

[Figure: surface plot of the Sphere function]

Himmelblau's function:

 

$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$$

[Figure: surface plot of Himmelblau's function]

Rastrigin function:

$$f(x, y) = 20 + x^2 - 10\cos(2\pi x) + y^2 - 10\cos(2\pi y)$$

[Figure: surface plot of the Rastrigin function]

In order to make visualization easier, a 2-D hyperparameter space is assumed.
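For reference, the three landscapes in their 2-D form (standard definitions):

```python
import numpy as np

def sphere(x, y):
    return x ** 2 + y ** 2

def himmelblau(x, y):
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2

def rastrigin(x, y, A=10):
    return 2 * A + (x ** 2 - A * np.cos(2 * np.pi * x)) \
                 + (y ** 2 - A * np.cos(2 * np.pi * y))
```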

Examples

Here are some examples of the hyperparameter optimization algorithms at work:

  • CMA-ES over Himmelblau's function. Configuration: 50 samples per generation, keeping the best 20% (10 samples).

[Figure: CMA-ES over Himmelblau's function]

The modified version of the covariance matrix initially produces a dispersed cloud of points and, as it approaches a minimum, the cloud concentrates around it.

  • Bayesian optimization over the Rastrigin function. Configuration: using an Upper Confidence Bound (UCB) acquisition function favoring exploitation (kappa = 5).

[Figure: Bayesian optimization over the Rastrigin function, kappa = 5]

The algorithm focuses on improving the objective around the explored minima instead of exploring new zones of interest.

  • Bayesian optimization over the Rastrigin function. Configuration: in this case, a UCB acquisition function favoring exploration (kappa = 25).

[Figure: Bayesian optimization over the Rastrigin function, kappa = 25]

By favoring exploration, the algorithm switches its focus to examining a wider, unknown range of values instead of evaluating in places where the estimation is closest to our goal.

Evaluation Method

Each of the algorithms is applied several thousand times, until the metrics shown stabilize. As some of the algorithms need a prior belief over the objective function (e.g. the exploration vs. exploitation weighting in Bayesian optimization, or the grid definition in grid search), random prior beliefs are chosen (e.g. a random grid spacing inside a default domain). The importance of this choice is shown and compared as well. The evaluation metrics are the following:

 

  • Prior Belief Importance (PBI): The influence of our prior belief on the success rate of the algorithm. This value is computed as the variance of the probability of success across all prior beliefs chosen from a default range of values, scaled by a maximum-variance scenario. A higher PBI implies a higher dependency of the algorithm on the user.
  • Success Rate (SR): Probability of reaching the marked threshold with a configuration proven capable of success for the specific objective function.
  • Mean Evaluations Required (MER): The average number of function evaluations required to surpass the desired threshold over the objective function in successful attempts; in other words, the speed at which the algorithm converges. (A computation sketch for SR and MER follows this list.)
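As a sketch of how the last two metrics could be computed from repeated runs (my reading of the definitions above, not necessarily the exact procedure used here):

```python
import numpy as np

def success_rate_and_mer(runs, threshold):
    """runs: list of (best_score, n_evaluations) pairs, one per
    repeated optimization run."""
    successes = [n for score, n in runs if score >= threshold]
    sr = len(successes) / len(runs)                    # Success Rate
    mer = float(np.mean(successes)) if successes else float("nan")
    return sr, mer                                     # SR and MER
```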

 

Function     Algorithm            PBI     SR       MER
Sphere       Grid Search          0.4     0.5%     1720.9
Sphere       Random Search        1.6     4.6%     1944.9
Sphere       Bayesian Opt.        0.1     99.9%    29.3
Sphere       Evolutionary Opt.    26.6    99.8%    213.2
Himmelblau   Grid Search          0.0     0.0%     1765.1
Himmelblau   Random Search        0.1     0.1%     1917.0
Himmelblau   Bayesian Opt.        0.1     99.9%    53.5
Himmelblau   Evolutionary Opt.    60.4    99.7%    554.1
Rastrigin    Grid Search          14.0    25.5%    1766.0
Rastrigin    Random Search        57.8    42.1%    1178.1
Rastrigin    Bayesian Opt.        42.4    87.5%    214.4
Rastrigin    Evolutionary Opt.    40.7    96.2%    1045.9

 

For example, in optimizing over Himmelblau's function, the Bayesian optimization generally shows better performance: barely any importance in the user choices (0.1), very high probability of reaching the goal (99.9%) and a low average in evaluations required (53.5); i.e. regardless of the user's manually selected parameters, the Bayesian optimization consistently reaches the marked threshold and in a very fast manner.

CMA-ES shows better results over the Rastrigin function. As shown in the table, the algorithm reaches the optimum more consistently than its counterparts; however, it also converges slowly.

To conclude, evolutionary algorithms may be advantageous for highly volatile objective functions, yet overall it is evident that Bayesian optimization is the most robust and efficient option.

 

Buguroo's experience

In Buguroo's Deep Learning department, a modified version of Bayesian optimization that balances exploration and exploitation over time has been adopted in order to hone our models' performance. In our experience, this approach has been satisfactory in most cases. Our keystroke dynamics models, the agents entrusted with confirming the identity of an individual based on the manner and rhythm of typing on a keyboard, are a successful example of its use.

 
