Machine Learning Basics

In this post I give an overview of the typical flow of a Machine Learning project.
Most Machine Learning projects involve the steps below:

Understand the client requirement / Problem statement

Data Understanding

Data Collection (CSV file, logs, sensor data, data from SQL etc)

Data Exploration

Data Quality Analysis: Analyse the data to confirm that it holds enough information, in both quantity and quality, to plan and build the ML model.

Data Preparation

Cleaning the data: Check for NULL and NA values in the dataset and handle them in one of two ways.
a. Remove the affected samples if the dataset is large and dropping them doesn't affect the quality of the data.
b. Impute the missing values using the mean, the median, or KNN imputation.
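
A minimal sketch of both options, assuming pandas and scikit-learn are available. The toy DataFrame and its column names are made up for illustration; `SimpleImputer` and `KNNImputer` are scikit-learn's imputers, used here as one possible way to do the imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy dataset with missing values (NaN).
df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 35.0, 50.0],
    "income": [30.0, 45.0, np.nan, 60.0, 80.0],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Option a: drop every row that contains a missing value.
dropped = df.dropna()

# Option b: replace each missing value with the column median.
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# Option b (alternative): fill each gap from the 2 nearest rows.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
```

In a real project the imputer would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.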

Outliers: These might be due to human error, though they can also be genuine extreme values. Detect them using a boxplot or the summary statistics of the data, then remove or replace them accordingly.
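
As a rough illustration, the check a boxplot performs visually can be written with NumPy. The sketch below uses the common 1.5 × IQR rule; the function name `iqr_outlier_mask`, the factor `k`, and the sample numbers are my own assumptions, not a fixed standard.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical measurements with one obvious data-entry error.
data = np.array([10, 12, 11, 13, 12, 11, 300])
mask = iqr_outlier_mask(data)

# Option 1: remove the flagged values.
cleaned = data[~mask]

# Option 2: replace the flagged values with the median.
replaced = np.where(mask, np.median(data), data)
```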

Sample Distribution: Check how the features are distributed using histograms.
Divide the data into training and test sets.
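
A small sketch of both steps on synthetic data, assuming NumPy and scikit-learn. The 80/20 split ratio and the random seeds are illustrative choices only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # hypothetical feature matrix
y = rng.integers(0, 2, size=100)   # hypothetical binary target

# Bin counts for the first feature; plotting these bins gives the histogram.
counts, bin_edges = np.histogram(X[:, 0], bins=10)

# Hold out 20% of the rows for testing, keeping the class ratio similar
# in both parts (stratify=y).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```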

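Feature Selection: This can be done using filter, wrapper, and embedded methods.
a. Filter – Statistical measures computed independently of any model, such as a correlation check.
b. Wrapper – Best-subset, forward, backward, and hybrid (stepwise) selection, and Boruta feature selection.
c. Embedded – Selection that happens as part of model training, such as Lasso or Random Forest variable importance. Ridge shrinks coefficients but does not set them to zero, so on its own it does not remove features.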
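
To make the three families concrete, here is one example of each on synthetic regression data, assuming scikit-learn. The specific choices (Pearson correlation, `RFE` with linear regression as a backward-style wrapper, `Lasso` with `alpha=1.0`) are illustrative assumptions, not the only options.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 8 features, of which only 3 carry signal.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Filter: rank features by absolute correlation with the target.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
filter_top3 = np.argsort(corr)[-3:]

# Wrapper: recursive feature elimination repeatedly refits the model
# and drops the weakest feature until 3 remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
wrapper_selected = np.flatnonzero(rfe.support_)

# Embedded: Lasso's L1 penalty drives some coefficients to exactly zero
# while the model is being trained.
lasso = Lasso(alpha=1.0).fit(X, y)
embedded_selected = np.flatnonzero(lasso.coef_ != 0)
```

The three methods need not agree on the selected features; comparing their outputs is often a useful sanity check.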

Feature Engineering
a. Create new features from the existing features.
b. Apply feature transformations.
c. Apply normalization or standardization so that the features are on comparable scales.
d. Dimensionality reduction: Reduce the dimension of the dataset using techniques such as PCA.
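
The four items above can be sketched as follows, assuming NumPy and scikit-learn. The variables (`height_cm`, `weight_kg`, `income`) and the derived BMI feature are hypothetical examples chosen only to show the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=100)
weight_kg = rng.normal(70, 12, size=100)
income = rng.lognormal(mean=10, sigma=1, size=100)  # right-skewed, positive

# a. New feature built from existing ones (body mass index).
bmi = weight_kg / (height_cm / 100) ** 2

# b. Transformation: a log reduces the right skew of income.
log_income = np.log(income)

X = np.column_stack([height_cm, weight_kg, bmi, log_income])

# c. Standardization (zero mean, unit variance) and
#    normalization (rescale each column to [0, 1]).
X_std = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)

# d. Dimensionality reduction: project onto the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
```

PCA is sensitive to feature scale, which is why the sketch applies it to the standardized matrix rather than the raw one.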

Model Building

Based on the problem statement, decide whether the problem calls for a supervised or an unsupervised model.
For a supervised model, check whether the target variable is continuous or discrete.
    If continuous – use a regression model
    If discrete – use a classification model
Build the model on the training data using an appropriate machine learning algorithm.
Build multiple models and compare them to see which performs best on the chosen metric.
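
One possible way to compare several models on the training data is cross-validation. The sketch below assumes scikit-learn and a discrete target, so it uses two classifiers; the candidate models, the 5 folds, and the dictionary names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy, computed on the training data only.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Refit the best-scoring candidate on the full training set.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
```

Keeping the test set out of this comparison means it can still give an unbiased estimate of the chosen model's performance later.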

Model Evaluation

Measure the accuracy or error rate on both the training data and the test data using appropriate evaluation metrics.
If the accuracy is good on the training data but poor on the test data, the model is overfitting.
   Build a new model using regularization.
   Ask the customer to provide more data samples.
If the accuracy is poor on both the training and test data, the model is underfitting.
   Build a new model by adding more features.
   Build a new model that includes feature transformations.
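
The train-versus-test comparison above can be sketched like this, assuming scikit-learn. The `diagnose` helper and its thresholds (`gap`, `floor`) are hypothetical values I picked for illustration; what counts as "good" or "poor" depends on the problem. Limiting tree depth stands in here for regularization.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def diagnose(train_acc, test_acc, gap=0.10, floor=0.70):
    """Rough heuristic; the thresholds are illustrative assumptions."""
    if train_acc < floor and test_acc < floor:
        return "underfitting"
    if train_acc - test_acc > gap:
        return "overfitting"
    return "ok"

# Noisy synthetic data (20% of labels flipped) to make overfitting likely.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree can memorise the training data.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
deep_train = accuracy_score(y_train, deep.predict(X_train))
deep_test = accuracy_score(y_test, deep.predict(X_test))

# Capping the depth is a simple form of regularization for trees.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
shallow_train = accuracy_score(y_train, shallow.predict(X_train))
shallow_test = accuracy_score(y_test, shallow.predict(X_test))

print("deep:   ", deep_train, deep_test, diagnose(deep_train, deep_test))
print("shallow:", shallow_train, shallow_test, diagnose(shallow_train, shallow_test))
```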

I am currently working on a separate post for each of these steps and will update the links once they are published.

