Image by author
One of the fields that underpins data science is machine learning. So if you want to get into data science, understanding machine learning is one of the first steps you should take.
But where do you start? Start by understanding the difference between the two main types of machine learning algorithms. Only after that can we talk about individual algorithms that should be on your priority list to learn as a beginner.
The main distinction between algorithms is based on how they learn.
Image by author
Supervised learning algorithms are trained on a labeled data set. This data set serves as supervision (hence the name) because some of the data in it is already labeled with the correct answer. Based on this information, the algorithm can learn and apply that learning to the rest of the data.
On the other hand, unsupervised learning algorithms learn from an unlabeled data set, which means they have to find patterns in the data without humans giving instructions.
You can read in more detail about machine learning algorithms and types of learning.
There are other types of machine learning as well, but not for beginners.
Within each type of machine learning, algorithms are used to solve two main tasks.
Again, there are a few more tasks, but they are not for beginners.
Image by author
Supervised learning tasks
Regression is the task of predicting a numerical value, called the continuous outcome variable or dependent variable. The prediction is based on the predictor variable(s), also called the independent variable(s).
Think about predicting oil prices or air temperatures.
Classification is used to predict the category (class) of the input data. The outcome variable here is categorical or discrete.
Think about predicting whether the email is spam or not or whether the patient will contract a certain disease or not.
Unsupervised learning tasks
Clustering splits data into subsets or clusters. The goal is to group the data in the most natural way possible, meaning that data points within the same cluster are more similar to each other than to data points in other clusters.
Dimensionality reduction refers to reducing the number of input variables in a data set. It basically means shrinking the data set to only a few variables while still capturing its essence.
Here is an overview of the algorithms I will cover.
Image by author
Supervised learning algorithms
When choosing the algorithm for your problem, it is important to know what task the algorithm is used for.
As a data scientist, you will probably apply these algorithms in Python using the scikit-learn library. Although it does (almost) everything for you, it is recommended that you know at least the general principles of the internal workings of each algorithm.
Finally, once the algorithm is trained, you must evaluate its performance. For that, each algorithm has some standard metrics.
1. Linear regression
Used for: Regression
Description: Linear regression draws a straight line, called the regression line, between the variables. This line passes approximately through the center of the data points, thus minimizing the estimation error. It shows the predicted value of the dependent variable based on the values of the independent variables.
Evaluation metrics:
- Mean squared error (MSE): Represents the average squared error, where the error is the difference between the actual and predicted values. The smaller the value, the better the performance of the algorithm.
- R-squared: Represents the percentage of variance of the dependent variable that can be predicted by the independent variable. For this measure, you should strive to get as close to 1 as possible.
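To make this concrete, here is a minimal sketch (not from the original article) of fitting a linear regression with scikit-learn and computing the two metrics above; the synthetic data set and parameter choices are purely illustrative.

```python
# Minimal sketch: linear regression on synthetic data, evaluated with MSE and R-squared.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic data with three independent variables
X, y = make_regression(n_samples=500, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)          # fit the regression line
y_pred = model.predict(X_test)       # predict the dependent variable

print("MSE:", mean_squared_error(y_test, y_pred))   # lower is better
print("R-squared:", r2_score(y_test, y_pred))       # closer to 1 is better
```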
2. Logistic regression
Used for: Classification
Description: Uses a logistic function to translate the data values into a binary category, i.e., 0 or 1. This is done using a threshold, usually set to 0.5, which makes this algorithm perfect for predicting binary outcomes such as YES/NO, TRUE/FALSE, or 0/1.
Evaluation metrics:
- Accuracy: The ratio of correct predictions to total predictions. The closer to 1, the better.
- Precision: Measures the model's accuracy in positive predictions; it is expressed as the ratio of correct positive predictions to the total number of predicted positives. The closer to 1, the better.
- Recall: Also measures the model's accuracy in positive predictions. It is expressed as the ratio of correct positive predictions to the total number of actual positive observations in the class. Read more about these metrics here.
- F1 Score: The harmonic mean of the model's recall and precision. The closer to 1, the better.
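The sketch below (again, illustrative rather than taken from the article) trains a logistic regression on a synthetic binary data set and prints all four metrics.

```python
# Minimal sketch: logistic regression for a binary (0/1) outcome,
# evaluated with accuracy, precision, recall and F1 score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Illustrative synthetic data with a binary target
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression()           # default threshold of 0.5 on the predicted probability
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
```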
3. Decision trees
Used for: Regression and classification
Description: Decision trees are algorithms that use a hierarchical or tree structure to predict a value or a class. The root node represents the entire data set, which then branches into decision nodes, branches, and leaves based on the values of the variables.
Evaluation metrics:
- Accuracy, precision, recall and F1 score -> for classification
- MSE, R squared -> for regression
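As a quick illustration (my example, with arbitrary settings), here is a small decision tree classifier on the built-in Iris data set; the regression variant works the same way.

```python
# Minimal sketch: a decision tree classifier on the Iris data set.
# For regression, DecisionTreeRegressor follows the same pattern.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Limiting the depth keeps the tree small and less prone to overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```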
4. Naive Bayes
Used for: Classification
Description: This is a family of classification algorithms that use Bayes' theorem and assume independence between the features within a class.
Evaluation metrics:
- Accuracy
- Precision
- Recall
- F1 score
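Here is a minimal sketch using Gaussian Naive Bayes, one member of this family; the choice of the breast cancer data set is mine, purely for illustration.

```python
# Minimal sketch: Gaussian Naive Bayes, one member of the Naive Bayes family,
# evaluated with a classification report (precision, recall, F1 and accuracy).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

nb = GaussianNB()
nb.fit(X_train, y_train)

print(classification_report(y_test, nb.predict(X_test)))
```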
5. K-nearest neighbors (KNN)
Used for: Regression and classification
Description: It calculates the distance between the test data and the k closest data points from the training data. The test data point is assigned to the class with the larger number of “neighbors”. For regression, the predicted value is the average of the k chosen training points.
Evaluation metrics:
- Accuracy, precision, recall and F1 score -> for classification
- MSE, R squared -> for regression
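The sketch below (illustrative, with k=5 chosen arbitrarily) shows KNN classification; because KNN is distance-based, the features are scaled first.

```python
# Minimal sketch: KNN classification with k=5 neighbors.
# KNeighborsRegressor averages the neighbors' values for regression tasks.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling matters because KNN relies on distances between data points
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```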
6. Support Vector Machines (SVM)
Used for: Regression and classification
Description: This algorithm draws a hyperplane to separate the different classes of data. The hyperplane is positioned at the greatest distance from the closest points of each class, and the farther a data point lies from the hyperplane, the more strongly it belongs to its class. For regression, the principle is similar: the algorithm fits a hyperplane so that as many data points as possible lie within a margin of tolerance around it.
Evaluation metrics:
- Accuracy, precision, recall and F1 score -> for classification
- MSE, R squared -> for regression
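Here is a minimal classification sketch with SVC (the data set and RBF kernel are my illustrative choices); SVR is the corresponding regression estimator.

```python
# Minimal sketch: an SVM classifier (SVC); SVR is the regression counterpart.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# SVMs are sensitive to feature scale, so standardize before fitting
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```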
7. Random Forest
Used for: Regression and classification
Description: The random forest algorithm uses a set of decision trees, which then form a decision forest. The prediction of the algorithm is based on the prediction of many decision trees. Data will be assigned to the class that receives the most votes. For regression, the predicted value is an average of the predicted values of all trees.
Evaluation metrics:
- Accuracy, precision, recall and F1 score -> for classification
- MSE, R squared -> for regression
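A minimal sketch of a classification forest follows; the number of trees and the data set are illustrative assumptions, not prescriptions from the article.

```python
# Minimal sketch: a random forest of 100 decision trees for classification.
# RandomForestRegressor averages the trees' predictions for regression.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
```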
8. Gradient boosting
Used for: Regression and classification
Description: These algorithms use an ensemble of weak models, where each subsequent model recognizes and corrects the errors of the previous one. This process is repeated until the error, measured by a loss function, is minimized.
Evaluation metrics:
- Accuracy, precision, recall and F1 score -> for classification
- MSE, R squared -> for regression
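Below is a minimal regression sketch with scikit-learn's gradient boosting estimator; the number of trees and learning rate are illustrative values I chose, not recommendations from the article.

```python
# Minimal sketch: gradient boosting for regression, where each new tree
# corrects the errors of the ensemble built so far.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=42)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
```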
Unsupervised learning algorithms
9. K-means clustering
Used for: Clustering
Description: The algorithm divides the data set into k clusters, each represented by its centroid or geometric center. Through the iterative process of dividing the data into k clusters, the goal is to minimize the distance between data points and the centroid of their cluster, while at the same time maximizing the distance of these data points from the centroids of the other clusters. Simply put, data belonging to the same cluster should be as similar as possible and as different as possible from data in other clusters.
Evaluation metrics:
- Inertia: The sum of the squared distances of each data point from the centroid of its nearest cluster. The lower the inertia value, the more compact the cluster.
- Silhouette score: Measures the cohesion (how similar the data is within its own cluster) and the separation (how different it is from other clusters). The value of this score ranges between -1 and +1. The larger the value, the better the data matches its own cluster and the less it matches other clusters.
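Here is a minimal sketch of k-means with both metrics; the synthetic blob data and k=4 are illustrative assumptions on my part.

```python
# Minimal sketch: k-means with k=4 clusters on synthetic blobs,
# evaluated with inertia and the silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Inertia:", kmeans.inertia_)                  # lower means more compact clusters
print("Silhouette:", silhouette_score(X, labels))   # closer to +1 is better
```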
10. Principal Component Analysis (PCA)
Used for: Dimensionality reduction
Description: The algorithm reduces the number of variables by constructing new variables (principal components), while at the same time attempting to maximize the variance captured from the data. In other words, it compresses the data set into its most informative components without losing the essence of the data.
Evaluation metrics:
- Explained Variance: The percentage of variance covered by each principal component.
- Total variance explained: The percentage of variance covered by all principal components.
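To close the list, here is a minimal PCA sketch; reducing the 30 breast-cancer features to 2 components is an illustrative choice, not a recommendation from the article.

```python
# Minimal sketch: PCA reducing the 30 breast-cancer features to 2 principal components.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Explained variance per component:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())
```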
Machine learning is an essential part of data science. With these ten algorithms, you will cover the most common tasks in machine learning. Of course, this overview only gives you a general idea of how each algorithm works. So this is just the beginning.
Now you need to learn how to implement these algorithms in Python and solve real problems. In that, I recommend using scikit-learn. Not only because it is a relatively easy-to-use ML library but also because of its extensive materials about ML algorithms.
Nate Rosidi (twitter.com/StrataScratch) is a data scientist and in product strategy. He is also an adjunct professor teaching analytics and is the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.