Machine learning is a class of computer algorithms that enables machines to learn from data without being explicitly programmed.
Today, we see applications of machine learning everywhere: in navigation systems, movie streaming platforms, and e-commerce applications.
In fact, from the moment you wake up in the morning until the moment you go to bed, you've likely interacted with dozens of machine learning models without even realizing it.
The machine learning industry is projected to grow by more than 36% between 2024 and 2030.
Since almost all large organizations are actively investing in AI, you will only benefit from honing your machine learning skills.
Whether you are a data science enthusiast, a developer, or simply someone curious about the subject, here are five commonly used machine learning models you should know about:
1. Linear regression
Linear regression is the most popular machine learning model for quantitative prediction tasks.
This algorithm is used to predict a continuous outcome (y) using one or more independent variables (x).
For example, you would use linear regression if you were given the task of predicting house prices based on their size.
In this case, the size of the house is the independent variable (x) used to predict the price of the house, which is the dependent variable (y).
This is done by fitting a linear equation that models the relationship between x and y, represented by y = mx + c, where m is the slope of the line and c is its intercept.
Below is a diagram representing a linear regression that models the relationship between house price and size:
[Image: a straight regression line fitted through data points of house price plotted against house size]
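To make this concrete, here is a minimal sketch of fitting a linear regression with scikit-learn. The house sizes and prices below are made-up illustrative values, not real data.

```python
# A minimal linear regression sketch using scikit-learn.
# The house sizes (sq ft) and prices are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[650], [785], [1200], [1500], [1850]])  # x: house size
prices = np.array([70_000, 84_000, 125_000, 155_000, 190_000])  # y: price

model = LinearRegression()
model.fit(sizes, prices)

# The fitted line is y = m*x + c.
print(f"slope m = {model.coef_[0]:.2f}, intercept c = {model.intercept_:.2f}")

# Predict the price of a 1,000 sq ft house.
print(model.predict([[1000]]))
```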
Learning resource
To learn more about the intuition behind linear regression and how it works mathematically, I recommend watching Krish Naik's YouTube tutorial on the topic.
2. Logistic regression
Logistic regression is a classification model used to predict a discrete outcome given one or more independent variables.
For example, given the number of negative keywords in a sentence, logistic regression can be used to predict whether a given message should be classified as legitimate or spam.
Here is a graph showing how logistic regression works:
[Image: an S-shaped logistic regression curve plotting the probability of spam against the number of negative keywords]
Note that unlike linear regression which represents a straight line, logistic regression is modeled as an S-shaped curve.
As indicated in the curve above, as the number of negative keywords increases, the probability that the message will be classified as spam also increases.
The x-axis of this curve represents the number of negative keywords and the y-axis shows the probability that the email is spam.
Typically, in logistic regression, a probability of 0.5 or greater indicates a positive result; in this context, it means that the message is spam.
Conversely, a probability less than 0.5 indicates a negative result, meaning the message is not spam.
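As an illustration, here is a small sketch of such a spam classifier built with scikit-learn's logistic regression. The keyword counts and labels are invented for the example.

```python
# Logistic regression sketch: classify a message as spam (1) or
# legitimate (0) based on its number of negative keywords.
# The training data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

negative_keywords = np.array([[0], [1], [1], [2], [4], [5], [6], [8]])
is_spam = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(negative_keywords, is_spam)

# predict_proba returns [P(legitimate), P(spam)] for each message;
# a spam probability of 0.5 or greater is classified as spam.
probs = clf.predict_proba([[3]])
print(f"P(spam | 3 negative keywords) = {probs[0, 1]:.2f}")
print("spam" if probs[0, 1] >= 0.5 else "not spam")
```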
Learning resource
If you want to learn more about logistic regression, StatQuest's logistic regression tutorial is a great place to start.
3. Decision trees
Decision trees are a popular machine learning model used for both classification and regression tasks.
They work by repeatedly splitting the data set based on its features, creating a tree-like structure that models the data.
In simple terms, decision trees allow us to continuously slice data based on specific parameters until a final decision is made.
Below is an example of a simple decision tree that determines whether a person should eat ice cream on a given day:
[Image: a decision tree with a weather node at the root and a health node below it, leading to eat / don't eat decisions]
- The tree starts with the weather, checking whether it is warm enough for ice cream.
- If the weather is warm, the tree moves on to the next node, health. Otherwise, the decision is no, and there are no further splits.
- At the health node, if the person is healthy, they can eat the ice cream. Otherwise, they must refrain from doing so.
Notice how the data is split at each node of the decision tree, breaking the classification process down into simple, manageable questions.
A similar decision tree can be drawn for regression tasks with a quantitative result, and the intuition behind the process will remain the same.
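Here is a minimal sketch of the ice-cream example as a scikit-learn decision tree. The binary features and labels are hand-built to mirror the diagram above.

```python
# Decision tree sketch mirroring the ice-cream example.
# Features: [warm_weather, healthy] (1 = yes, 0 = no).
# Label: 1 = eat ice cream, 0 = don't. Data is hand-built for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 0, 0]  # ice cream only when warm AND healthy

tree = DecisionTreeClassifier().fit(X, y)

# Print the learned splits, which match the diagram above.
print(export_text(tree, feature_names=["warm_weather", "healthy"]))

# A warm day and a healthy person -> eat ice cream.
print(tree.predict([[1, 1]]))  # [1]
```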
Learning resource
For more information on decision trees, I suggest watching StatQuest's video tutorial on the subject.
4. Random forests
The random forest model combines the predictions made by multiple decision trees and returns a single result.
Intuitively, this model should perform better than a single decision tree because it leverages the capabilities of multiple predictive models.
This is done with the help of a technique known as bagging or bootstrap aggregation.
This is how bagging works:
- A statistical technique called the bootstrap is used to sample the data set multiple times with replacement.
- A decision tree is then trained on each sampled data set.
- Finally, the outputs of all the trees are combined to generate a single prediction.
In the case of a regression problem, the final result is generated by averaging the predictions made by each decision tree. For classification problems, the majority class predicted across the trees is chosen.
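Here is a compact sketch of bagging in action using scikit-learn's RandomForestClassifier on a synthetic data set; n_estimators controls how many bootstrapped trees are combined.

```python
# Random forest sketch: many decision trees trained on bootstrap
# samples, with their predictions combined by majority vote.
# The data set is synthetic, generated only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each trained on a bootstrap sample of the training data
# (bootstrap=True is scikit-learn's default, i.e. bagging).
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# The final label is the majority vote across all 100 trees.
print(f"test accuracy: {forest.score(X_test, y_test):.2f}")
```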
Learning resource
You can watch Krish Naik's tutorial on random forests to learn more about the theory and intuition behind the model.
5. K-means clustering
So far, all of the machine learning models we've looked at fall under a paradigm called supervised learning.
Supervised learning is a technique that uses a set of labeled data to train algorithms to predict an outcome.
In contrast, unsupervised learning is a technique that does not deal with labeled data. Instead, it identifies patterns in the data without being trained on what specific results to look for.
K-Means clustering is an unsupervised learning model that ingests unlabeled data and assigns each data point to a cluster.
Each observation is assigned to the cluster whose mean, or centroid, is closest to it.
Here is a visual representation of the K-Means clustering model:
[Image: data points grouped into three colored clusters, each with a red 'x' marking its centroid]
Notice how the algorithm has grouped the data points into three distinct clusters, each represented by a different color. Points are assigned to a cluster according to their proximity to its centroid, indicated by a red 'x' mark.
Simply put, all data points within cluster 1 share similar characteristics, which is why they are grouped together. The same principle applies to clusters 2 and 3.
When creating a K-Means clustering model, you must explicitly specify the number of clusters you want to generate.
This can be achieved using a technique called the elbow method, which plots the model's error score for a range of cluster counts on a line graph. You then choose the inflection point of the curve, or its “elbow,” as the optimal number of clusters.
Here is a visual representation of the elbow method:
[Image: elbow method plot of error score against number of clusters, with a bend at three clusters]
Note that the inflection point on this curve is at the three-cluster mark, which means that the optimal number of clusters for this algorithm is 3.
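Here is a small sketch of K-Means with the elbow method in scikit-learn. The data is synthetic, and the inertia_ attribute (the within-cluster sum of squared distances) serves as the error score plotted for each cluster count.

```python
# K-Means sketch with the elbow method.
# The data is synthetic (three Gaussian blobs), generated for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Elbow method: fit K-Means for several cluster counts and record
# the inertia (within-cluster sum of squared distances).
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
# Plotting inertia against k shows a sharp bend (the "elbow") at k=3,
# the optimal number of clusters for this data.

# Fit the final model and assign each point to its nearest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])       # cluster assignments of the first 10 points
print(kmeans.cluster_centers_)   # the three centroids
```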
Learning resource
If you want to learn more about the topic, StatQuest has an 8-minute video that clearly explains how K-Means clustering works.
Next steps
The machine learning algorithms explained in this article are commonly used in industry-wide applications such as forecasting, spam detection, loan approval, and customer segmentation.
If you have managed to follow along this far, congratulations! You now have a solid understanding of the most commonly used predictive algorithms and have taken the first step into the field of machine learning.
But the journey doesn't end here.
To solidify your understanding of machine learning models and be able to apply them to real-world applications, I suggest learning a programming language like Python or R.
freeCodeCamp's Python course for beginners is a great starting point. If you find yourself stuck in your programming journey, I have a YouTube video that explains how to learn to code from scratch.
Once you learn how to code, you can implement these models in practice using libraries like Scikit-Learn and Keras.
To improve your data science and machine learning skills, I suggest creating a custom learning path using generative AI models like ChatGPT. Here's a more detailed roadmap on using ChatGPT to learn data science to help you get started.
Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on all things data science, a true master of all things data. You can connect with her on LinkedIn or check out her YouTube channel.