Machine Learning Algorithms - Overview

Linear Regression

Type: Supervised Learning
Target Attribute: Continuous variable
Pre-processing
  1. Remove or replace the NULL and NA values.
  2. Check for outliers and replace or cap them.
  3. Divide the data into training and test sets.
  4. Check for multicollinearity.
  5. Convert the categorical variables to numeric variables
  6. Use feature selection techniques to select only the important features:
     Forward selection
     Backward selection
     Hybrid feature selection
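Some of the pre-processing steps above (imputing missing values, capping outliers, encoding categorical variables, and splitting the data) can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The toy columns `income` and `city`, the median imputation, the 1.5 x IQR capping rule, and the 25% test fraction are all assumptions chosen for the example, not choices prescribed by this post.

```python
import random
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def one_hot(categories):
    """Encode a categorical column as 0/1 indicator columns.

    The first level is dropped so the indicators are not perfectly collinear.
    """
    levels = sorted(set(categories))[1:]
    return [[1 if c == level else 0 for level in levels] for c in categories]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices reproducibly and split the rows into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical toy data: a numeric column with a gap and an extreme value,
# and a categorical column.
income = [30, 32, None, 35, 31, 33, 34, 500]
city = ["A", "B", "A", "C", "B", "A", "C", "B"]

income_clean = cap_outliers_iqr(impute_median(income))
city_encoded = one_hot(city)
rows = [[x] + flags for x, flags in zip(income_clean, city_encoded)]
train_rows, test_rows = train_test_split(rows)
```

In practice, the imputation value and the outlier bounds are usually computed on the training split only and then applied to the test split, so that information from the test set does not leak into the model.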
Build linear regression model (without regularization)

Metrics to consider for evaluation
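  1. R square value – This is the proportion of the variance in the target variable that is explained by the model
  2. Adjusted R square – This penalizes R square for the number of features, so it rises only when an added feature improves the fit more than would be expected by chance
  3. RMSE – Root Mean Squared Error – This is the square root of the mean of the squared differences between the actual and predicted target values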
  4. Mean Absolute Error
  5. Mean Squared Error
  6. AIC and BIC values
  7. Residual Analysis – Residuals should be randomly scattered around zero, with no visible pattern
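The first five metrics in the list can be computed directly from the actual and predicted values. The sketch below is one way to write them out; the function name `regression_metrics` and the sample numbers are hypothetical and only serve the illustration.

```python
import math


def regression_metrics(actual, predicted, n_features):
    """Return common regression metrics for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n

    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares

    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R square penalizes R square for the number of features used.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "r2": r2,
        "adjusted_r2": adjusted_r2,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 8.0, 12.0]
metrics = regression_metrics(actual, predicted, n_features=1)
```

For these numbers, R square is 0.9375, MSE is 0.5 and MAE is 0.6. RMSE and MAE are in the units of the target variable, which usually makes them easier to explain than MSE.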
If the model is overfitting, the following approaches can be used
  1. Normalize the data and rebuild the model with a regularization penalty (for example, ridge or lasso).
  2. Build a new model with only the significant features by performing feature selection.
  3. Ask the customer to provide more data samples.
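To illustrate the first approach, here is a minimal sketch of ridge (L2) regularization for a single feature, where the solution has a simple closed form. The helper names `standardize` and `ridge_slope`, the penalty value, and the data are assumptions made for the example; a real project would more likely use a library implementation.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def ridge_slope(x, y, penalty):
    """Slope of a one-feature ridge regression on standardized x and centered y.

    Minimizing sum((y - b*x)^2) + penalty * b^2 gives
    b = sum(x*y) / (sum(x*x) + penalty); penalty=0 is ordinary least squares.
    """
    x_std = standardize(x)
    y_mean = statistics.fmean(y)
    y_centered = [v - y_mean for v in y]
    sxy = sum(a * b for a, b in zip(x_std, y_centered))
    sxx = sum(a * a for a in x_std)
    return sxy / (sxx + penalty)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = ridge_slope(x, y, penalty=0.0)     # no regularization
shrunk = ridge_slope(x, y, penalty=5.0)  # coefficient pulled toward zero
```

A larger penalty shrinks the coefficient further toward zero. The data is normalized first because the penalty acts on the size of the coefficients, which depends on the scale of each feature.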

If the model is underfitting
  1. Build a new model with polynomial features, or apply feature transformation and feature extraction.
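A small example of why a polynomial feature helps: on data that follows a curve, a straight line on the raw feature underfits, while the same linear model on the squared feature fits well. The data and the helper names `fit_line` and `r_squared` below are hypothetical and chosen to make the effect obvious.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x for one feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R square of the one-feature least squares fit of y on x."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical data with a purely quadratic relationship: y = x^2 + 1.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 + 1 for v in x]

r2_raw = r_squared(x, y)                        # straight line on x: underfits
r2_squared = r_squared([v ** 2 for v in x], y)  # same model on the feature x^2
```

Here the fit on the raw feature explains none of the variance, while the fit on the squared feature explains all of it. Real data is rarely this clean, but the idea is the same: the model stays linear in its coefficients and the added feature supplies the curvature.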

Logistic Regression

Type: Supervised Learning
Target Attribute: Discrete variable / Classes
Pre-processing
  1. Remove or replace the NULL and NA values.
  2. Check for outliers and replace or cap them.
  3. Divide the data into training and test sets.
  4. Check for multicollinearity.
  5. Convert the categorical variables to numeric variables
  6. Use feature selection techniques to select only the important features:
     Forward selection
     Backward selection
     Hybrid feature selection

Build logistic regression model (without regularization)

Metrics to consider for evaluation
  1. Confusion Matrix
  2. Classification accuracy = (TP+TN)/(TP+TN+FP+FN)
  3. Classification error (misclassification rate) = (FP+FN)/(TP+TN+FP+FN)
  4. Recall / Sensitivity / True positive rate = TP/(TP+FN)
  5. Specificity = TN/(TN+FP)
  6. False positive rate = FP/(TN+FP)
  7. Precision = TP/(TP+FP)
  8. AUC value
Note: TP=True Positive, TN=True Negative, FP=False Positive, FN=False Negative. These metrics can be derived from the confusion matrix, which will be explained in upcoming posts.
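The formulas above translate directly into code once the four confusion matrix counts are known. The sketch below assumes binary labels coded as 1 (positive) and 0 (negative); the function names and the sample labels are hypothetical, and every denominator is assumed to be non-zero.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for binary labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Compute the listed metrics; assumes each denominator is non-zero."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
metrics = classification_metrics(tp, tn, fp, fn)
```

For these labels the counts are TP=3, TN=5, FP=1 and FN=1, which gives an accuracy of 0.8 and a recall and precision of 0.75 each. Note that the false positive rate is always 1 minus the specificity.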

Depending on the business problem, an appropriate metric can be chosen to evaluate the model.

In logistic regression, the probability threshold used to assign the classes can be chosen by plotting the ROC curve and picking the point that best meets the metric requirement.
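As a rough sketch of that idea, the code below computes the ROC points for a set of predicted probabilities, the area under the curve, and one candidate threshold. Choosing the threshold that maximizes TPR minus FPR (Youden's J) is an assumption made here for illustration; a business problem may call for a different criterion, such as a minimum recall. The labels and scores are hypothetical.

```python
def roc_points(actual, scores):
    """Return (threshold, FPR, TPR) for each distinct score used as a cut-off.

    An observation is predicted positive when its score is >= the threshold.
    Labels are assumed to be 1 (positive) and 0 (negative).
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule, starting at (0, 0)."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, fpr, tpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area


def best_threshold(points):
    """Pick the threshold that maximizes Youden's J = TPR - FPR."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# Hypothetical labels and predicted probabilities, for illustration only.
actual = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]

points = roc_points(actual, scores)
area = auc(points)
threshold = best_threshold(points)
```

With these scores the AUC is 0.875 and the selected threshold is 0.4, where all positives are caught at a false positive rate of 0.25.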
