January 1, 2021

#87 - Machine Learning classification comparison between GridSearchCV and RandomizedSearchCV

A machine learning model is built for a classification task on a dataset with 13 features that describe different wine properties, such as color intensity and alcohol content. The goal is to predict the class each wine belongs to: class 0, class 1 or class 2.

The baseline classifier uses the Random Forest algorithm, and the source code of this example can be found on GitHub.

Baseline: a Random Forest Classifier with the default number of trees in the forest ('n_estimators'), which is 100.
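A minimal sketch of such a baseline could look like the following. It assumes the data is scikit-learn's bundled wine dataset (which has 13 features and 3 classes) and uses a hypothetical stratified 70/30 train/test split with a fixed seed; neither the split nor the seed is taken from the original code.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the 13-feature wine data is scikit-learn's bundled wine dataset.
X, y = load_wine(return_X_y=True)

# Assumption: a stratified 70/30 split; the original split is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Baseline: every hyperparameter left at its default, so n_estimators is 100.
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

The accuracy printed here depends on the assumed split and seed, so it will not necessarily match the numbers of the original experiment.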


Two different methods are then used to tune the parameters of the model: first GridSearchCV, then RandomizedSearchCV.

GridSearchCV: The main parameters in this example are shown below. The difference from the baseline model is that a parameter grid, a dictionary that maps each parameter to a list of values, is evaluated. The search measures the accuracy of every combination of parameters in the grid. With 2 values of 'max_depth' x 3 values of 'max_features' x 3 values of 'n_estimators', there are 18 possible combinations in total.
In addition, 10-fold cross validation is applied, so 18 parameter combinations x 10 folds gives a total of 180 fits.
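The fit count above can be checked with a short sketch. The grid values here are hypothetical placeholders chosen only to reproduce the 2 x 3 x 3 shape; the wine data is again assumed to be scikit-learn's bundled dataset, and the train/test split is an assumption as well.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hypothetical values: only the 2 x 3 x 3 shape comes from the text.
param_grid = {
    "max_depth": [3, 5],
    "max_features": [2, 4, 6],
    "n_estimators": [10, 25, 50],
}

n_combinations = len(ParameterGrid(param_grid))  # 2 * 3 * 3
n_folds = 10
print(f"{n_combinations} combinations x {n_folds} folds = "
      f"{n_combinations * n_folds} fits")

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=n_folds,
    scoring="accuracy",
)
# With the default refit=True, one extra fit on the whole training set
# follows the 180 cross validation fits.
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross validation accuracy: {grid_search.best_score_:.3f}")
```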


RandomizedSearchCV: This method is similar to the previous one. One of the main differences is the number of fits executed. The algorithm does not try every combination of parameters, only the number of randomly sampled combinations set by the programmer through 'n_iter', in this case 10.
Therefore, the number of fits executed is 10 'n_iter' x 10 cross validation folds, so 100 fits in total.
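As a rough sketch of the randomized variant, the search below samples 10 candidates from a hypothetical parameter space and evaluates each with 10-fold cross validation. The parameter values, the split and the seeds are assumptions made for illustration, and the wine data is assumed to be scikit-learn's bundled dataset.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hypothetical search space; lists are sampled without replacement.
param_distributions = {
    "max_depth": [3, 5],
    "max_features": [2, 4, 6],
    "n_estimators": [10, 25, 50],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,        # only 10 of the 18 combinations are tried
    cv=10,
    scoring="accuracy",
    random_state=42,  # makes the sampled combinations reproducible
)
random_search.fit(X_train, y_train)

n_candidates = len(random_search.cv_results_["params"])
n_fits = n_candidates * random_search.n_splits_
print(f"{n_candidates} candidates x {random_search.n_splits_} folds = {n_fits} fits")
print("Best parameters:", random_search.best_params_)
```

Raising or lowering 'n_iter' changes the number of fits directly, which is what makes the cost of this search easy to control.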


Notes:

  • In this simple example the accuracies of all 3 models (Random Forest baseline, GridSearchCV and RandomizedSearchCV) are very close.
  • The computation time of GridSearchCV is the worst. This is due to the higher number of fits executed, since every parameter combination is tried on every cross validation fold.
  • Some advantages of RandomizedSearchCV are that the number of combinations tried can be adjusted by changing the 'n_iter' parameter, and that the most important features and the selected parameters of the tuned model can be printed. These can guide further tuning, such as selecting only the key features and better targeted parameter ranges, which helps reduce the computational load.
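Printing the selected parameters and the most important features, as mentioned in the last note, might look like the sketch below. The small search space, 'n_iter', the fold count and the seeds are hypothetical and kept small so the example runs quickly; the same 'best_params_' and 'best_estimator_' attributes are also available on a fitted GridSearchCV.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumption: scikit-learn's bundled wine dataset, which carries feature names.
data = load_wine()
X, y = data.data, data.target

# Hypothetical, deliberately small search space.
param_distributions = {
    "max_depth": [3, 5],
    "max_features": [2, 4, 6],
    "n_estimators": [10, 25, 50],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print("Selected parameters:", search.best_params_)

# The best estimator is a fitted random forest, so it exposes importances.
best_forest = search.best_estimator_
ranked = sorted(
    zip(data.feature_names, best_forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

Features near the bottom of this ranking are candidates for removal in a later, cheaper round of tuning.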
