I am a data scientist with a background in computer science.
I am familiar with data structures, object-oriented programming and database management, as I was taught these concepts for 3 years at university.
However, upon entering the field of data science, I noticed a significant skills gap.
I didn't have the math and statistics skills needed for almost all data science positions.
I took a few online statistics courses, but nothing seemed to really stick with me.
Most of the courses were either very basic and aimed at high-level executives, or detailed and built on prior knowledge that I did not have.
I spent time searching the Internet for resources to better understand concepts such as hypothesis testing and confidence intervals.
And after interviewing for multiple data science positions, I found that most statistics interview questions followed a similar pattern.
In this article, I will list 10 of the most popular statistics questions I have encountered in data science interviews, along with sample answers to these questions.
Question 1: What is a p-value?
Answer: Given that the null hypothesis is true, a p-value is the probability that you will see a result at least as extreme as the one observed.
P-values are typically calculated to determine whether the result of a statistical test is significant. Simply put, a p-value below the chosen significance level (commonly 0.05) suggests that there is sufficient evidence to reject the null hypothesis.
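To make the definition concrete, here is a minimal sketch of an exact p-value calculation. The coin-flip scenario (60 heads in 100 flips, tested against a fair coin) is a hypothetical example chosen for illustration, and the function name is my own.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical data: 60 heads in 100 flips. Null hypothesis: the coin is fair.
n_flips, n_heads = 100, 60

# One-sided p-value: chance of seeing 60 or more heads if the coin is fair.
p_one_sided = binomial_tail(n_flips, n_heads, 0.5)

# Two-sided p-value: results at least this far from 50 in either direction.
# Doubling is valid here because the fair-coin distribution is symmetric.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p-value: {p_one_sided:.4f}")
print(f"two-sided p-value: {p_two_sided:.4f}")
```

The one-sided p-value comes out to roughly 0.028, so at a 0.05 significance level we would reject the null hypothesis in the one-sided test, while the two-sided p-value lands just above 0.05.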
Question 2: Explain the concept of statistical power.
Answer: If you run a statistical test to detect whether an effect exists, statistical power is the probability that the test will detect the effect when it is really there.
Here is a simple example to explain:
Let's say we run an ad to a test group of 100 people and get 80 conversions.
The null hypothesis is that the ad had no effect on the number of conversions. However, in reality, the ad did have a significant impact on the number of conversions.
Statistical power is the probability that the null hypothesis is correctly rejected and the effect is actually detected. Higher statistical power indicates that the test is better able to detect an effect, if there is one.
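As a rough illustration of how power depends on sample size, the sketch below uses a normal approximation for a right-tailed test on a single conversion rate. The baseline rate of 10% and the true rate of 15% are assumptions made up for this example, as is the function name.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_proportion(p_null, p_true, n, alpha=0.05):
    """Approximate power of a right-tailed z-test for one proportion."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Smallest sample proportion that leads to rejecting the null hypothesis.
    threshold = p_null + z_crit * sqrt(p_null * (1 - p_null) / n)
    # Probability of exceeding that threshold when the true rate is p_true.
    se_true = sqrt(p_true * (1 - p_true) / n)
    return 1 - std_normal.cdf((threshold - p_true) / se_true)


# Assumed rates for illustration: 10% baseline, 15% once the ad is shown.
powers = {n: power_one_sided_proportion(0.10, 0.15, n) for n in (50, 100, 400)}
for n, power in powers.items():
    print(f"n = {n:>3}: power = {power:.2f}")
```

Under these assumptions, power rises as the test group grows, which is why sample size is usually planned before an experiment is run.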
Question 3: How would you describe confidence intervals to a non-technical stakeholder?
Let's use the same example as before, where you run an ad to a sample of 100 people and get 80 conversions.
Instead of saying the conversion rate is 80%, we would provide a range, since we don't know how the actual population would behave. In other words, if we took an infinite number of samples, how many conversions would we see?
Here is an example of what we could say based solely on the data obtained from our sample:
“If we were to run this ad to a larger group of people, we are 95% confident that the conversion rate would be between roughly 72% and 88%.”
We use this range because we don't know how the entire population will react and we can only generate an estimate based on our test group, which is just a sample.
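One common way to compute such a range is the normal-approximation (Wald) interval for a proportion. The sketch below applies it to the 80 conversions out of 100 from the example. The function name is my own, and other methods, such as the Wilson interval, would give slightly different endpoints.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


lower, upper = proportion_confidence_interval(80, 100)
print(f"95% CI for the conversion rate: {lower:.1%} to {upper:.1%}")
```

A larger sample would shrink the margin of error and give a narrower range around the same 80% estimate.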
Question 4: What is the difference between a parametric and a nonparametric test?
A parametric test assumes that the data set follows an underlying distribution. The most common assumption made when performing a parametric test is that the data are normally distributed.
Some examples of parametric tests include ANOVA, the t-test, and the F-test.
Nonparametric tests, on the other hand, do not make assumptions about the distribution of the data set. Examples include the Chi-square test and the Mann-Whitney U test. If the data set is not normally distributed, or if it contains ranks or outliers, it is advisable to choose a nonparametric test.
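To see why rank-based methods cope better with outliers, the sketch below compares the mean (which parametric tests such as the t-test rely on) with the Mann-Whitney U statistic, a nonparametric measure based only on the ordering of values. The two groups are invented for illustration, and only the U statistic is computed here, not a full test with a p-value.

```python
from statistics import mean


def mann_whitney_u(xs, ys):
    """U statistic: number of (x, y) pairs where x beats y (ties count half)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


# Hypothetical measurements for two groups.
group_a = [12, 15, 14, 10, 13]
group_b = [18, 20, 17, 19, 21]
# Same as group_b, but the largest value is replaced by an extreme outlier.
group_b_outlier = [18, 20, 17, 19, 210]

u_clean = mann_whitney_u(group_b, group_a)
u_outlier = mann_whitney_u(group_b_outlier, group_a)

print("mean of B:", mean(group_b), "-> with outlier:", mean(group_b_outlier))
print("U statistic:", u_clean, "-> with outlier:", u_outlier)
```

The outlier changes the group mean substantially, while the U statistic stays the same because the ordering of the values has not changed.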
Question 5: What is the difference between covariance and correlation?
Covariance measures the direction of the linear relationship between variables. Correlation measures the strength and direction of this relationship.
While both correlation and covariance provide similar information about the relationship between features, the main difference between them is scale.
Correlation ranges from -1 to +1. It is standardized, which makes it easy to see whether the relationship between the variables is positive or negative and how strong it is. Covariance, on the other hand, is expressed in the product of the units of the two variables, so its magnitude depends on the scale of the data and is harder to interpret.
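The difference in scale can be seen in a small sketch. The height and weight values below are made up for illustration, and both measures are computed by hand so that each step is visible.

```python
from math import sqrt
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: height in metres and weight in kilograms.
height_m = [1.60, 1.65, 1.70, 1.75, 1.80]
weight_kg = [55.0, 60.0, 63.0, 70.0, 74.0]
# The same heights expressed in centimetres.
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)

print(f"covariance:  {cov_m:.3f} (metres) vs {cov_cm:.3f} (centimetres)")
print(f"correlation: {corr_m:.3f} (metres) vs {corr_cm:.3f} (centimetres)")
```

Switching from metres to centimetres multiplies the covariance by 100, while the correlation is unchanged.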
Question 6: How would you analyze and handle outliers in a data set?
There are a few ways to detect outliers in your dataset.
- Visual methods: Outliers can be identified visually using graphs such as box plots and scatter plots. Points that fall beyond the whiskers of a box plot are usually outliers. In a scatter plot, outliers appear as points that are far away from the other data points.
- Non-visual methods: A non-visual technique for detecting outliers is the Z-score. A Z-score is calculated by subtracting the mean from a value and dividing the result by the standard deviation. This tells us how many standard deviations away from the mean a value is. Values that are more than 3 standard deviations above or below the mean are commonly considered outliers.
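The Z-score rule described above can be sketched in a few lines. The order counts are hypothetical, and the threshold of 3 is the conventional cutoff mentioned in the answer rather than a fixed rule.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]


# Hypothetical daily order counts with one unusually large value.
orders = [50, 52, 48, 51, 49, 53, 47, 50, 52, 48,
          51, 49, 50, 52, 48, 51, 49, 50, 50, 120]

outliers = z_score_outliers(orders)
print("outliers:", outliers)
```

Note that the outlier itself inflates the mean and standard deviation, and in very small samples no point can reach a Z-score of 3, so this rule works best when there is a reasonable amount of data.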
Question 7: Differentiate between a one-tailed and a two-tailed test.
A one-tailed test checks whether there is a relationship or effect in only one direction. For example, after running an ad, you can use a one-tailed test to check whether there was a positive impact, i.e. an increase in sales. This is a right-tailed test.
A two-tailed test examines the possibility of a relationship in both directions. For example, if a new teaching style has been implemented in all public schools, a two-tailed test would assess whether there is a significant increase or decrease in grades.
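The choice of tails changes the p-value for the same test statistic. The sketch below assumes a z-test and a hypothetical test statistic of 1.8, and the function name is my own.

```python
from statistics import NormalDist


def z_test_p_values(z):
    """One-tailed and two-tailed p-values for a z statistic."""
    std_normal = NormalDist()
    right = 1 - std_normal.cdf(z)  # evidence of an increase
    left = std_normal.cdf(z)       # evidence of a decrease
    two = 2 * min(left, right)     # evidence of a change in either direction
    return {"right_tailed": right, "left_tailed": left, "two_tailed": two}


# Hypothetical test statistic from an experiment.
p_values = z_test_p_values(1.8)
for name, p in p_values.items():
    print(f"{name}: {p:.4f}")
```

With this statistic, the right-tailed test is significant at the 0.05 level while the two-tailed test is not, which is why the direction of the hypothesis should be chosen before looking at the data.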
Question 8: Given the following scenario, which statistical test would you choose to implement?
An online retailer wants to evaluate the effectiveness of a new advertising campaign. It collects daily sales data for 30 days before and after the ad is launched. The company wants to determine whether the ad contributed to a significant difference in daily sales.
Options:
A) Chi-square test
B) Paired t test
C) One-way ANOVA
D) Independent samples t test
Answer: To evaluate the effectiveness of a new advertising campaign, we must use a paired t-test.
A paired t-test is used to compare the means of two related samples and test whether the difference between them is statistically significant.
In this case, we are comparing sales before and after the ad ran. Both sets of measurements come from the same retailer, so the samples are related, which is why we use a paired t-test instead of an independent samples t-test.
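Here is a minimal sketch of the paired t statistic, using seven made-up before and after sales figures rather than the 30 days in the scenario. The critical value of 2.447 is the standard two-tailed 5% cutoff for 6 degrees of freedom. In practice, a library function such as scipy.stats.ttest_rel would also return the p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """t statistic for the mean of the paired differences (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error


# Hypothetical daily sales for 7 matched days before and after the ad.
sales_before = [200, 220, 210, 190, 205, 215, 198]
sales_after = [212, 228, 215, 201, 211, 229, 205]

t_stat = paired_t_statistic(sales_before, sales_after)

# Two-tailed critical value at alpha = 0.05 with 7 - 1 = 6 degrees of freedom.
T_CRITICAL_DF6 = 2.447
significant = abs(t_stat) > T_CRITICAL_DF6

print(f"t = {t_stat:.2f}, significant at 5%: {significant}")
```

Because the test works on the differences within each pair, variation that is shared by both measurements cancels out, which is the main advantage over an independent samples t-test.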
Question 9: What is a Chi-Square test of independence?
A chi-square test of independence compares the frequencies observed in the data with the frequencies we would expect if two categorical variables were unrelated. The null hypothesis (H0) of this test is that the two variables are independent, so any observed differences are due purely to chance.
In simple terms, this test can help us identify whether the relationship between two categorical variables is due to chance or whether there is a statistically significant association between them.
For example, if you want to test whether there is a relationship between gender (male vs female) and ice cream flavor preference (vanilla vs chocolate), you can use a Chi-Square test of independence.
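The gender and ice cream example can be sketched as follows. The counts in the table are invented for illustration, and the statistic is computed without a continuity correction, so a library such as SciPy, which applies Yates' correction to 2x2 tables by default, may report a slightly different value.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count expected in this cell if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed_count - expected) ** 2 / expected
    return statistic


# Hypothetical survey counts. Rows: male, female. Columns: vanilla, chocolate.
observed = [[30, 20],
            [20, 30]]

chi2 = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value at alpha = 0.05 for 1 degree of freedom.
CRITICAL_VALUE_DF1 = 3.841
reject_independence = chi2 > CRITICAL_VALUE_DF1

print(f"chi-square = {chi2:.2f}, df = {degrees_of_freedom}")
print("reject independence at 5%:", reject_independence)
```

For these counts every expected cell is 25, so the statistic is 4.0, which is just above the critical value.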
Question 10: Explain the concept of regularization in regression models.
Regularization is a technique used to reduce overfitting by adding additional information, allowing models to better adapt and generalize to data sets on which they have not been trained.
In regression, there are two commonly used regularization techniques: ridge regression and lasso regression.
These are models that slightly modify the error equation of the regression model by adding a penalty term.
In ridge regression, the penalty term is a tuning parameter multiplied by the sum of the squared coefficients. This means that models with larger coefficients receive a larger penalty. In lasso regression, the penalty term is the tuning parameter multiplied by the sum of the absolute values of the coefficients.
While the main goal of both methods is to reduce the size of the coefficients and minimize the model error, ridge regression penalizes large coefficients more.
On the other hand, lasso regression penalizes every coefficient at the same rate regardless of its size, meaning that coefficients can be reduced to exactly zero in some cases.
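The different behavior of the two penalties can be seen in the simplest possible case: a regression with one feature and no intercept, where both solutions have a closed form. The data, the penalty values, and the function names below are assumptions made for this sketch. Real models with many features are normally fitted with a library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, stopping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def one_feature_fits(xs, ys, lam):
    """Closed-form coefficients for y ~ b * x with one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return {
        "ols": sxy / sxx,
        # Ridge minimizes sum((y - b*x)^2) + lam * b^2.
        "ridge": sxy / (sxx + lam),
        # Lasso minimizes 0.5 * sum((y - b*x)^2) + lam * |b|.
        "lasso": soft_threshold(sxy, lam) / sxx,
    }


# Hypothetical centered data.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.1, -0.4, 0.1, 0.6, 0.9]

results = {lam: one_feature_fits(xs, ys, lam) for lam in (0.0, 2.0, 6.0)}
for lam, fit in results.items():
    print(f"lambda = {lam}: " + ", ".join(f"{k} = {v:.3f}" for k, v in fit.items()))
```

As the penalty grows, the ridge coefficient keeps shrinking but stays above zero, while the lasso coefficient is pushed to exactly zero at the largest penalty.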
10 Statistics Questions to Ace Your Data Science Interview: Next Steps
If you've made it this far, congratulations!
You now have a solid understanding of the statistics questions asked in data science interviews.
As a next step, I recommend taking an online course to review these concepts and put them into practice.
Below are some statistics learning resources that I have found useful:
The final course can be audited for free on edX, while the first two resources are YouTube channels that cover statistics and machine learning extensively.
 
 
Natassha Selvaraj is a self-taught data scientist with a passion for writing. She writes about everything related to data science and is a true master of all things data-related. You can connect with her on LinkedIn or check out her YouTube channel.