
As a data scientist, you’ll want to know how accurate your results are so that you can trust their validity. The data science workflow is a planned project with controlled conditions, which allows you to assess each stage and how it contributed to your outcome.
Probability is the measure of how likely an event is to occur. It is an important element of predictive analytics because it allows you to explore the mathematics behind your results.
As a simple example, let’s look at flipping a coin, which lands on either heads (H) or tails (T). When all outcomes are equally likely, the probability of an event is the number of ways the event can occur divided by the total number of possible outcomes.
- If we want to find the probability of heads, it would be 1 (heads) / 2 (heads and tails) = 0.5.
- If we want to find the probability of tails, it would be 1 (tails) / 2 (heads and tails) = 0.5.
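The counting rule above can be sketched in a few lines of Python. The helper name `probability` and the use of sets to represent outcomes are illustrative choices, not something the article prescribes, and the rule only holds under the assumption that every outcome is equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favorable outcomes / all possible outcomes.

    Assumes every outcome in sample_space is equally likely.
    """
    favorable = event & sample_space  # outcomes that count as the event
    return Fraction(len(favorable), len(sample_space))

coin = {"H", "T"}
p_heads = probability({"H"}, coin)
p_tails = probability({"T"}, coin)

print(p_heads, float(p_heads))  # 1/2 0.5
print(p_tails, float(p_tails))  # 1/2 0.5
```

Using `Fraction` keeps the result exact, which makes it easy to check that the probabilities of all outcomes add up to 1.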
But we don’t want to confuse probability with likelihood; there is a difference. Probability is the measure of a specific event or outcome occurring under a given assumption. Likelihood is applied when the outcome has already been observed and you want to judge how well a particular hypothesis explains it.
To break it down, probability is all about possible outcomes, while likelihood is all about hypotheses.
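One way to see the difference between asking about outcomes and asking about hypotheses (the latter is usually called likelihood) is to reuse the same formula in both directions. The sketch below is illustrative only: the function name `binomial_probability`, the observed data (6 heads in 10 flips), and the list of candidate values are all assumptions made for the example.

```python
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Outcome question: the hypothesis is fixed (a fair coin), the outcome varies.
p_six_heads = binomial_probability(6, 10, 0.5)

# Hypothesis question: the data are fixed (6 heads in 10 flips),
# and we score several candidate values for P(heads).
candidates = [0.3, 0.5, 0.6, 0.8]
likelihoods = {p: binomial_probability(6, 10, p) for p in candidates}
best = max(likelihoods, key=likelihoods.get)

print(round(p_six_heads, 4))  # 0.2051
print(best)                   # 0.6
```

Among these candidates, 0.6 explains the observed flips best, which matches the intuition that 6 heads out of 10 points toward a coin that lands heads about 60% of the time.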
Another term to know is “mutually exclusive events”. These are events that cannot occur at the same time. For example, you can’t go left and right at the same time. Similarly, if we flip a coin, we may get heads or tails, but not both.
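A practical consequence of mutual exclusivity is that the probabilities of such events can simply be added. The sketch below illustrates this with a fair six-sided die; the event names and the `probability` helper are hypothetical and chosen only for this example.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def probability(event):
    """Probability of an event on a fair six-sided die."""
    return Fraction(len(event & die), len(die))

rolls_one = {1}
rolls_six = {6}
rolls_even = {2, 4, 6}

# Mutually exclusive: the events share no outcomes, so probabilities add.
exclusive = rolls_one.isdisjoint(rolls_six)
p_one_or_six = probability(rolls_one | rolls_six)
sum_one_six = probability(rolls_one) + probability(rolls_six)

# Not mutually exclusive: 6 is both "a six" and "even",
# so adding the two probabilities would count it twice.
p_six_or_even = probability(rolls_six | rolls_even)
sum_six_even = probability(rolls_six) + probability(rolls_even)

print(exclusive, p_one_or_six, sum_one_six)  # True 1/3 1/3
print(p_six_or_even, sum_six_even)           # 1/2 2/3
```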
Types of Probability
- Theoretical probability: focuses on how likely an event is to occur and is based on reasoning rather than observation. Using theory, the result is the value you expect. Using the heads and tails example, the theoretical probability of landing on heads is 0.5 or 50%.
- Experimental probability: focuses on how frequently an event occurs over the course of an experiment. Using the heads and tails example, if we were to toss a coin 10 times and it landed on heads 6 times, the experimental probability of the coin landing on heads would be 6/10 or 60%.
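The gap between the two types can be explored with a small simulation. This is a minimal sketch that assumes a fair coin and uses a fixed seed for repeatability; the function name `experimental_probability` is made up for the example.

```python
import random

def experimental_probability(num_tosses, seed=0):
    """Fraction of simulated fair-coin tosses that land on heads."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses

theoretical = 0.5
for n in (10, 100, 10_000):
    observed = experimental_probability(n)
    print(f"{n} tosses: experimental={observed:.3f}, theoretical={theoretical}")
```

With only a handful of tosses the experimental value can sit well away from 0.5, and it tends to move closer to the theoretical value as the number of tosses grows.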
Conditional probability is the chance that an event or result will occur given that another event or result has already occurred. For example, if you are working for an insurance company, you may want to find the probability that a person will be able to pay for their insurance given that they have taken out a home loan.
Conditional probability helps data scientists produce more accurate models and results by using other variables in the data set.
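The insurance example can be sketched with a handful of made-up records. Everything here is hypothetical: the field names `home_loan` and `pays`, the eight customers, and the helper `conditional_probability` exist only to illustrate the calculation P(A | B) = P(A and B) / P(B).

```python
# Hypothetical customer records, invented purely for illustration.
customers = [
    {"home_loan": True, "pays": True},
    {"home_loan": True, "pays": True},
    {"home_loan": True, "pays": False},
    {"home_loan": False, "pays": True},
    {"home_loan": False, "pays": False},
    {"home_loan": False, "pays": False},
    {"home_loan": True, "pays": True},
    {"home_loan": False, "pays": True},
]

def conditional_probability(records, event, given):
    """Estimate P(event | given) as a ratio of counts."""
    given_records = [r for r in records if r[given]]
    if not given_records:
        raise ValueError("the conditioning event never occurs in the data")
    return sum(r[event] for r in given_records) / len(given_records)

p_pays = sum(c["pays"] for c in customers) / len(customers)
p_pays_given_loan = conditional_probability(customers, "pays", "home_loan")

print(p_pays)             # 0.625
print(p_pays_given_loan)  # 0.75
```

In this toy data, knowing that a customer has a home loan raises the estimated chance of payment from 62.5% to 75%, which is the kind of extra signal a model can pick up from a second variable.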
A probability distribution is a statistical function that describes the possible values of a random variable and how likely each of them is. The values fall between a possible minimum and maximum, and where they sit on a distribution plot depends on statistical properties of the data, such as its mean and standard deviation.
The type of data used in your project determines what type of distribution you are working with. I will divide them into two categories: discrete distributions and continuous distributions.
Discrete Distribution
A discrete distribution applies when the data can only take on certain values or has a limited number of outcomes. For example, if you were to roll a die, the possible values are limited to 1, 2, 3, 4, 5, and 6.
There are different types of discrete distributions. For example:
- Discrete uniform distribution is when all outcomes are equally likely. Using the example of rolling a six-sided die, there is an equal probability of ⅙ that it will land on 1, 2, 3, 4, 5, or 6. However, the drawback of the discrete uniform distribution is that it gives data scientists little information that they can use and apply.
- Bernoulli distribution is another type of discrete distribution, where the experiment has only two possible outcomes: yes or no, 1 or 0, true or false. This applies when tossing a coin, which lands on either heads or tails. With the Bernoulli distribution, we have the probability of one of the outcomes (p), and we obtain the probability of the other by subtracting it from the total probability (1), represented as (1-p).
- Binomial distribution describes a sequence of Bernoulli events. It is the discrete probability distribution of the number of successes across experiments that can each produce only two possible outcomes, success or failure. When tossing a coin, the probability of landing on heads will always be 0.5 or ½ in each experiment performed.
- Poisson distribution describes how many times an event is likely to occur over a specified period or distance. Instead of focusing on whether an event occurs, it focuses on how frequently the event occurs in a specific interval. For example, if on average 12 cars are on a particular road at 11 am every day, we can use the Poisson distribution to estimate how many cars are on that road at 11 am over a month.
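The four discrete distributions above can each be written as a short probability function using only the Python standard library. This is a sketch, not a replacement for a statistics package: the function names are made up for the example, and the Poisson rate of 12 cars per interval is an assumption borrowed from the road example.

```python
from math import comb, exp, factorial

def uniform_pmf(num_outcomes):
    """Discrete uniform: every one of num_outcomes values is equally likely."""
    return 1 / num_outcomes

def bernoulli_pmf(x, p):
    """Bernoulli: x is 1 (success) with probability p, or 0 with 1 - p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli outcome must be 0 or 1")
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    """Binomial: exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """Poisson: exactly k events in an interval with the given average rate."""
    return rate**k * exp(-rate) / factorial(k)

print(uniform_pmf(6))                      # chance of any one face of a fair die
print(bernoulli_pmf(0, 0.5))               # chance of tails on a fair coin
print(round(binomial_pmf(6, 10, 0.5), 4))  # 6 heads in 10 fair tosses
print(round(poisson_pmf(12, 12), 4))       # exactly 12 cars when 12 are expected
```

Each function returns the probability of one specific value, so summing a function over all of its possible values should give a total of 1.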
Continuous Distribution
Unlike discrete distributions, which have a limited set of outcomes, continuous distributions can take any value within a range. These distributions often appear as a curve or line on a graph, since the data is continuous.
- Normal distribution is one that you may have heard of, as it is the most widely used. It is a symmetric distribution of values around the mean, without skew. The data follows a bell shape when plotted, where the center is the mean. For example, characteristics such as height and IQ scores follow an approximately normal distribution.
- T distribution is a type of continuous distribution that is used when the population standard deviation (σ) is unknown and the sample size is small (n<30). It follows the same bell-curve shape as the normal distribution, but with heavier tails. For example, if we were looking at how many chocolate bars were sold in one day, we would use the normal distribution. However, if we want to see how many were sold in a specific hour, we would use the t distribution.
- Exponential distribution is a type of continuous probability distribution that focuses on the amount of time until an event occurs. For example, if we want to analyze earthquakes, we can use an exponential distribution to model the amount of time from this point until an earthquake occurs. The exponential distribution is drawn as a curved line whose probability density decays exponentially over time.
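Two of the continuous distributions above can be explored with the Python standard library alone. The numbers below are assumptions chosen for illustration: heights with a mean of 170 cm and a standard deviation of 10 cm, and an event that occurs on average once every 10 time units. The t distribution is left out of this sketch because the standard library does not provide it.

```python
from math import exp
from statistics import NormalDist

# Normal distribution: hypothetical heights, mean 170 cm, std dev 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Probability that a height falls within one standard deviation of the mean.
p_within_one_sd = heights.cdf(180) - heights.cdf(160)

def exponential_cdf(t, rate):
    """P(waiting time <= t) when events occur at a constant average rate."""
    return 1 - exp(-rate * t)

# Hypothetical rate of 0.1 events per time unit (one event per 10 units).
p_event_within_10 = exponential_cdf(t=10, rate=0.1)

print(round(p_within_one_sd, 4))    # 0.6827
print(round(p_event_within_10, 4))  # 0.6321
```

For continuous distributions, probabilities are assigned to ranges of values rather than to single points, which is why both calculations use the cumulative distribution function.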
From the above, you can see how data scientists can use probability to understand more about data and answer questions. Knowing and understanding the chances of an event occurring is very useful for data scientists and can be very effective in the decision-making process.
You will be constantly working with data, and you need to learn more about it before doing any form of analysis. Looking at the data distribution can give you a lot of information, and you can use it to adjust your task, process, and model to suit that distribution.
This reduces the time you spend understanding the data, provides a more efficient workflow, and produces more accurate results.
Many of the data science concepts are based on the fundamentals of probability.
Nisha Aria is a data scientist and freelance technical writer. She is particularly interested in providing Data Science career tips or tutorials and theory-based knowledge about Data Science. She also wants to explore the different ways that Artificial Intelligence benefits, or can benefit, the longevity of human life. She is an enthusiastic student looking to expand her technological knowledge and her writing skills while helping mentor others.