In Machine Learning, many parameters are assessed and improved via the learning process. By contrast, a hyperparameter is a variable whose value is set before training; therefore, it is neither assessed nor corrected.
In this post I explain different approaches used to perform hyperparameter optimization with the aid of a few visual examples. I try to keep the equations light and I provide links to original articles should the reader require a more indepth understanding. I also offer a comparison of their performance in different simulated hyperparameter spaces.
Introduction
Neural network models are exceptionally effective and malleable; this is both a blessing and a curse. While neural nets are able to solve many challenging problems, they are also comprised of an overwhelming sea of conﬁgurations. Being able to choose a suitable set of model parameters is often what determines whether the net can accomplish difficult tasks.
Deep learning’s success is notably achieved by the ability to use the backpropagation algorithm to efficiently calculate the gradient of an objective function over each model parameter. It is in our hands to design the frame where the net will learn. In most cases, we must determine:
 Feature engineering: The quality and quantity of the features extracted from our data are essential and will dramatically inﬂuence the results we are going to achieve. It is a difficult and expensive process, usually tackled through use of our domain knowledge. (E.g. when dealing with a moving object, the speed as well as the acceleration are good feature candidates).
 Model architecture: We must face the problem of choosing the most appropriate algorithm for our speciﬁc problem. Many algorithms are available and each of them offer plenty of hyperparameters to adjust. Searching for solutions in similar problems might be a good starting point, but more often than not, it doesn't achieve the adequate results and tweaking is needed.
 Training conﬁguration: The algorithm's learning efficiency will be determined by the training conﬁguration.
 Evaluation method: In order to choose the best model, a comparison method determining the models appropriateness must be deﬁned. (E.g. accuracy and ROC AUC over the test sets are metrics often used to compare models).
Therefore, a large number of decisions must be made before the training and evaluation process.
To simplify, let’s focus on the process after having already established the feature engineering and the learning algorithm, i.e. hyperparameter optimization. Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.
Hyperparameter Optimization Approaches
In order to explain the methods visually, I will make use of a simple test function for optimization as a hypothetical blackbox function representing the ﬁnal metric on the deﬁned evaluation method given a certain conﬁguration. In this case, our conﬁguration is just one hyperparameter: the learning rate and the evaluation metric is the accuracy over the test set:
where lambda is the learning rate and we want to maximize the function value (representing the accuracy). The strategies shown are applicable to mixed spaces (numerical and categorical) unless advised otherwise.
The essential trait is having an ordered set, and in the case of a categorical variable, rounding and indexing the parameter sampling to the actual set accomplishes the extension.
Grid Search
This is the traditional method for performing hyperparameter optimization, grid search or parameter sweep, which is simply a brute force search through a manually speciﬁed subset of the hyperparameter space. In our example, we deﬁne the following 5 learning rate values manually and evaluate saving the best result:
This means that grid search would train a model with each learning rate value and output the rate that achieved the highest score in the evaluation method deﬁned; in our example, the output would be lr = 0.5. There is, however, a decision to be made before the exploration: we must deﬁne the grid manually, which is not only a matter of the number of evaluations (in our case, 5 samples), but also a matter of scale depending on the hyperparameter tuned (E.g. the learning rate is usually sampled in a logarithmic scale as it has a vast utility domain).
Obviously, grid search suffers from the curse of dimensionality; taking into account the hyperparameter independence of each iteration, the process is parallelized whenever possible.
Here is a guide on Support Vector Machines using grid search.
Random search
Random search samples random values selected from the hyperparameter domain. It can be applied to a discrete setting manually deﬁned as described above, but also generalizes well to continuous and mixed spaces. A stopping mechanism needs to be deﬁned, usually either by setting a number of iterations or by setting an objective threshold to be achieved. It performs well when the optimization problem has a low intrinsic dimensionality. In our example, the samples are selected randomly inside the domain from the uniform distribution and setting the stopping threshold to 0.99 accuracy:
Random search also allows for parallelism as well as sharing the curse of dimensionality with the grid search method; additionally, it allows inclusion of prior knowledge by specifying the distribution from which to sample. Here is an example of academic use in this method.
Bayesian optimization
Bayesian optimization is a methodology to perform global optimization of multimodal blackbox functions. As such, it is applicable to hyperparameter optimization. It combines 4 steps:
 Choose some prior measure over the space of possible evaluation method functions. As our objective function is unknown (evaluation metric over the conﬁgurations function), this step intends to treat it as a random function with a chosen prior distribution. The prior captures our beliefs about the behaviour of the function. (E.g. A multivariate Gaussian distribution is often used.)
 Combine the prior distribution with some given observations (conﬁgurations tested) to get a posterior distribution over the objective (estimation of where the true function resides).
 Use the posterior distribution to decide where to test the next conﬁguration according to some acquisition function. An acquisition function is a tool that tells us the conﬁguration to test where the most information will be achieved.
 Evaluate the conﬁguration chosen in step 3.
After giving some already tested initial conﬁgurations (at least 2) and a ﬁrst completion of the steps, iterating between 2 and 4 until the stopping mechanism is triggered closes the Bayesian optimization.
Without some statistical background, the process might sound complicated, but basically, by using a probabilistic model and iteratively evaluating an interesting hyperparameter conﬁguration based on the current information, it aims to ﬁnd the optimal conﬁguration.
Let me show you in our example step by step. First we assume all distributions to be multivariate normal distributions. Then, we choose 2 initial conﬁgurations randomly (in this case, 2 learning rate values inside the domain). Having some initial evaluations, we iterate the process (using previous steps):
 Use the evaluations to estimate the accuracy function in a conﬁdence interval (green region) around the estimated mean (black discontinuous line).
 Use the estimation and the acquisition function (brown line in the figure below) to output the most interesting value (gold star over the brown line).
 Evaluate the point and check the stopping mechanism (in this case, we set the threshold at 0.9995).
It is a very powerful and ﬂexible tool that offers a tradeoff between two key concepts, exploration and exploitation, so as to minimize the number of evaluations:
 Exploration: Prioritizes evaluating in places where the variance is large, i.e. where our estimation is most likely to be inaccurate. (In our example, broader green regions).
 Exploitation: Prioritizes evaluating in places where the estimation is closest to our objective. (In our example, higher points in the discontinuous line).
Weighting these two concepts is key in every optimization problem; depending on our objective space, strengthening one concept over the other could be vastly beneﬁcial. This tradeoff is offered by the acquisition function, examples of them include probability of improvement, expected improvement, Bayesian expected losses, upper conﬁdence bounds (UCB), Thompson sampling and mixtures of these. For a brief explanation, I suggest reading these notes
To better understand the entire process, try it on your own in this pure python open source implementation.
This approach has been used in a wide variety of situations even outside machine learning and has proven to be very effective. Bayesian optimization really shines when the evaluation time is long and the hyperparameter space is large and mixed, which is usually the case of hyperparameter optimization in DL.
Evolutionary optimization
Emulating Charles Darwin, an evolution strategy is an algorithm that provides the stronger versions of a population and then produces the next generation based on these versions. Starting with a random population, the iterative process of selecting the strongest and producing the next generation will stop once the best known solution is satisfactory for the user.
In hyperparameter optimization, evolutionary optimization is a methodology for the global optimization of noisy blackbox functions. Searching the hyperparameter space, this method is a batchsequential process that aims to optimize the objective function.
There are many evolutionary strategies; I am going to show you one of the most successful to date: CovarianceMatrix Adaptation Evolution Strategy (CMAES). As many strategies do, it samples from a normal distribution; however, this approach is ﬂexible and can take the observations of each generation and adaptively increase or decrease the search space for the next generation. It will not only adapt for the mean and sigma parameters of each hyperparameter, but will also calculate the entire covariance matrix of the hyperparameter space.
CMAES modiﬁes the covariance calculation formula in a clever way to make it adapt well to an optimization problem. To simplify, let's assume 2 hyperparameters and that we are ﬁltering 25% of the population by iteration. First, the mean of the stronger samples is computed:
where h1 is the ﬁrst hyperparameter, h2 the second hyperparameter and N is the number of samples ﬁltered of each population. The hat over the variables indicates the stronger population. Next, instead of computing the usual covariance matrix:
it computes the variance over the mean of the complete current generation, instead of the mean over the stronger observations:
Armed with the computed metrics, a new population is sampled from a normal distribution with the means and covariance estimated.
Let's take a look at the whole process in our ongoing example. First we sample an arbitrary number of observations in a predeﬁned distribution over our hyperparameter space; in our case, let's take 10 random samples of a normal distribution centered in 0.85 and with a variance of 0.05. Then we start the iterative process:
 Let evolution do its work and ﬁlter the 3 stronger samples (the 3 green points with higher accuracy).
 Compute the mean and the modiﬁed sigma over the strong population (vertical discontinuous black line and the green region respectively).
 Sample a new set of 10 samples from the normal distribution with the computed metrics (red ticks over the horizontal axis).
 Evaluate and check the stopping mechanism (in our case, surpassing 100 evaluations).
On observing the progress of the algorithm visually, we can understand the reason for the modiﬁcation over the covariance matrix. During the ﬁrst stages, the modiﬁed variance grows in order to reach a wider range of learning rates and, as we approach the optimum, the variance decreases converging to zero.
This algorithm is a very popular black box optimization algorithm. It works well in many situations and has been tested by many researchers in all kinds of situations. As a hyperparameter optimization algorithm, it is a wellperforming algorithm over unstable spaces.For further details, I recommend reading this CMAES Tutorial prepared by Nikolaus Hansen, the author of the CMAES algorithm.
Other Algorithms
Gradient based optimizations have been the most effective algorithms over the Machine Learning environment; even over black box functions, the gradient can be estimated with outstanding results. In hyperparameter optimization, however, the origin space isn't usually algebraic and in order to navigate in a gradient fashion, addition and multiplication operations are needed, as well as a convergence space. (E.g. the decision of whether to use dropout or not cannot be explored in a continuous manner.)
In cases where our hyperparameter space is algebraic (e.g. the learning rate combined with the regularization weight), I recommend using gradient based optimizations. For further details, check the theory over the Broyden method as a quasiNewtonian method (estimating the Jacobian matrix) and feel free to test it on your problems!
Method Comparison
In applied mathematics, test functions, known as artiﬁcial landscapes, are useful for evaluating characteristics of optimization algorithms. I am going to exploit three different test functions offered in a public domain in order to evaluate the algorithms explained above. More speciﬁcally, the hypothetical objective functions chosen to reach a wide variety of situations are the following:
Sphere function:
Himmelblau's function:
Rastrigin function:
In order to make visualization easier, a 2D hyperparameter space is assumed.
Examples
Here are some examples of the hyperparameter optimization:
 CMAES over Himmelblau's function. Conﬁguration: 50 samples by generation and keeping the best 20% (10 samples).
The modiﬁed version of the covariance matrix initially produces a dispersed cloud of points, and as it approaches a minimal point, the cloud concentrates around it.
 Bayesian optimization over the Rastrigin function. Conﬁguration: Using an Upper Conﬁdence Bound kernel favoring exploitation (kappa = 5).
The algorithm focuses on improving the objective around the explored minima instead of exploring new zones of interest.
 Bayesian optimization over the Rastrigin function. Conﬁguration: In this case, UCB kernel favoring exploration (kappa = 25).
By favoring exploration, the algorithm switches its focus to examining a wider and unknown range of values instead of evaluating in places where the estimation is closest to our goal.
Evaluation Method
Each of the algorithms is applied several thousand times until the metrics shown stabilize. As some of the algorithms need a prior belief over the objective function (e.g. exploration vs. exploitation in Bayesian optimization, grid deﬁnition in grid search algorithm), random prior beliefs will be chosen (e.g. random grid distance inside a default domain). The importance of this choice will be shown and compared as well. The evaluation metrics shown are the following:
 Prior Belief Importance (PBI): The inﬂuence of our prior belief over the success rate of the algorithm. This value is computed as the variance of probability of success in all prior beliefs chosen from a default range of values scaled by a maximum variance scenario. A higher PBI implies a higher algorithm dependency on the user.
 Success Rate (SR): Probability of reaching the marked threshold in a proven to be capable of success conﬁguration for the speciﬁc objective function.
 Mean Evaluations Required (MER): The average number of function evaluations required in order to surpass the desired threshold over the objective function in successful attempts. In other words, the speed at which the algorithm converges.
Function 
Algorithm 
PBI 
SR 
MER 
Sphere 
Grid search 
0,4 
0,5% 
1720,9 

Random Search 
1,6 
4,6% 
1944,9 

Bayesian Opt. 
0,1 
99,9% 
29,3 

Evolutionary Opt. 
26,6 
99,8% 
213,2 
Himmelblau 
Grid Search 
0,0 
0,0% 
1765,1 

Random Search 
0,1 
0,1% 
1917,0 

Bayesian Opt. 
0,1 
99,9% 
53,5 

Evolutionary Opt. 
60,4 
99,7% 
554,1 
Rastrigin 
Grid Search 
14,0 
25,5% 
1766,0 

Random Search 
57,8 
42,1% 
1178,1 

Bayesian Opt. 
42,4 
87,5% 
214,4 

Evolutionary Opt. 
40,7 
96,2% 
1045,9 
For example, in optimizing over Himmelblau's function, the Bayesian optimization generally shows better performance: barely any importance in the user choices (0.1), very high probability of reaching the goal (99.9%) and a low average in evaluations required (53.5); i.e. regardless of the user's manually selected parameters, the Bayesian optimization consistently reaches the marked threshold and in a very fast manner.
CMAES shows better results over the Rastrigin function. As shown in the table, the algorithm reaches the optimum more consistently than its counterparts; however, it also offers a slow convergence.
To conclude, evolutionary algorithms may be advantageous in highly volatile objective functions, yet overall it is evident that Bayesian optimization is the most robust and efficient option.
Buguroo's experience
In Buguroo's Deep Learning department, a modiﬁed version of Bayesian optimization balancing the exploration and exploitation over time has been adopted in order to hone our models´ performance. In our experience, this approach has been found to be satisfactory in most cases. Our keystroke dynamics models, the agents entrusted with conﬁrming the identity of an individual based on the manner and the rhythm of typing on a keyboard, are a successful example of use.