Statistical functions are the cornerstone of extracting meaningful insights from raw data. Python offers a powerful set of tools for statisticians and data scientists to understand and analyze datasets. Libraries such as NumPy, Pandas, and SciPy offer a comprehensive set of functions. This guide goes over 10 essential statistical functions in Python within these libraries.
Libraries for statistical analysis
Python offers many libraries designed specifically for statistical analysis. Three of the most commonly used are NumPy, Pandas, and SciPy Stats.
- NumPy: Short for Numerical Python, this library provides support for matrices, arrays, and a variety of mathematical functions.
- Pandas: Pandas is a data analysis and manipulation library that is useful for working with tables and time series. It is based on NumPy and adds additional functions for data manipulation.
- SciPy: Short for Scientific Python, this library is used for scientific and technical computing. Its stats module provides a wealth of probability distributions, statistical functions, and hypothesis tests.
Python libraries must be installed before they can be used. To install a library, run the pip install command in a terminal. Once installed, a library can be loaded into a Python script or Jupyter notebook with the import statement. NumPy is normally imported as np, Pandas as pd, and normally only the stats module is imported from SciPy.
pip install numpy
pip install pandas
pip install scipy
import numpy as np
import pandas as pd
from scipy import stats
Where different functions can be calculated using more than one library, example code using each will be shown.
1. Average (mean)
The mean, also known as the average, is the most fundamental statistical measure. It provides a central value for a set of numbers. Mathematically, it is the sum of all the values divided by the number of values present.
mean_numpy = np.mean(data)
mean_pandas = pd.Series(data).mean()
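As a quick check of the definition, the sketch below computes the mean by hand and compares it with both library calls. The values in `data` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 5, 7, 9, 11]

# Definition: sum of all values divided by the number of values
manual_mean = sum(data) / len(data)

mean_numpy = np.mean(data)
mean_pandas = pd.Series(data).mean()

# All three approaches should give the same value
print(manual_mean, mean_numpy, mean_pandas)
```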
2. Median
The median is another measure of central tendency. It is the middle value of the data set when all values are ordered, or the average of the two middle values when the number of values is even. Unlike the mean, it is largely unaffected by outliers. This makes it a more robust measure for skewed distributions.
median_numpy = np.median(data)
median_pandas = pd.Series(data).median()
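To illustrate the robustness described above, the following sketch adds a single extreme value to a small, hypothetical data set and compares how the mean and the median respond.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 5, 7, 9, 11]
data_with_outlier = data + [1000]  # one extreme value added

mean_before = np.mean(data)
median_before = np.median(data)

mean_after = np.mean(data_with_outlier)
median_after = np.median(data_with_outlier)

# The mean shifts dramatically; the median barely moves
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```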
3. Standard deviation
The standard deviation is a measure of the amount of variation or spread in a set of values. It is calculated using the differences between each data point and the mean. A low standard deviation indicates that the values in the data set tend to be close to the mean, while a higher standard deviation indicates that the values are more spread out.
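std_numpy = np.std(data)  # population standard deviation (ddof=0 by default)
std_pandas = pd.Series(data).std()  # sample standard deviation (ddof=1 by default)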
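Note that NumPy and Pandas use different defaults here: `np.std` divides by n (population standard deviation), while the Pandas `std` method divides by n - 1 (sample standard deviation). The sketch below, using hypothetical values, shows how the `ddof` argument makes the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_std = np.std(data)               # ddof=0: divides by n
sample_std = np.std(data, ddof=1)    # ddof=1: divides by n - 1
pandas_std = pd.Series(data).std()   # Pandas defaults to ddof=1

print(pop_std, sample_std, pandas_std)
```

Which version is appropriate depends on whether the data represent a whole population or a sample drawn from one.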
4. Percentiles
Percentiles indicate the relative position of a value within a data set when all the data are ordered. For example, the 25th percentile is the value below which 25% of the data lies. The median is technically defined as the 50th percentile.
Percentiles can be calculated with NumPy's percentile function, which takes the data and the percentiles of interest as arguments. In the example, the 25th, 50th, and 75th percentiles are calculated, but any percentile value between 0 and 100 is valid.
percentiles = np.percentile(data, (25, 50, 75))
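A short sketch with hypothetical values shows the three quartiles being unpacked and confirms that the 50th percentile matches the median.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Request several percentiles in a single call
q25, q50, q75 = np.percentile(data, [25, 50, 75])

print(q25, q50, q75)

# The 50th percentile is the median
print(q50 == np.median(data))
```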
5. Correlation
The correlation between two variables describes the strength and direction of their relationship. It is the degree to which one variable changes when the other changes. The correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates that there is no linear relationship between the variables.
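The Pearson correlation coefficient can be computed with any of the three libraries. The sketch below uses two hypothetical variables, `x` and `y`; `stats.pearsonr` additionally returns a p-value for the null hypothesis of no linear relationship.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical paired observations, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between x and y
corr_numpy = np.corrcoef(x, y)[0, 1]

corr_pandas = pd.Series(x).corr(pd.Series(y))

# pearsonr returns the coefficient and a p-value
corr_scipy, p_value = stats.pearsonr(x, y)

print(corr_numpy, corr_pandas, corr_scipy)
```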
6. Covariance
Covariance is a statistical measure that represents the degree to which two variables change together. It does not provide the strength of the relationship in the same way that a correlation does, but it does indicate the direction of the relationship between variables. It is also key to many statistical methods that analyze relationships between variables, such as principal component analysis.
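Covariance can be computed with NumPy or Pandas. The sketch below uses two hypothetical variables and also shows how covariance relates to correlation: dividing it by the product of the two sample standard deviations gives the correlation coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# np.cov returns a 2x2 covariance matrix (sample covariance, n - 1);
# the off-diagonal entry is the covariance between x and y
cov_numpy = np.cov(x, y)[0, 1]

cov_pandas = pd.Series(x).cov(pd.Series(y))

# Scaling covariance by both sample standard deviations gives correlation
corr_from_cov = cov_numpy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(cov_numpy, cov_pandas, corr_from_cov)
```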
7. Skewness
Skewness measures the asymmetry of the distribution of a continuous variable. A skewness of zero indicates that the data are symmetrically distributed, as in the normal distribution. Skewness helps to identify potential outliers in the data set, and establishing symmetry is a requirement for some statistical methods and transformations.
skew_scipy = stats.skew(data)
skew_pandas = pd.Series(data).skew()
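The two calls above may return slightly different numbers, because `stats.skew` does not apply a bias correction by default, while the Pandas `skew` method does. The sketch below, using hypothetical right-skewed values, shows the `bias=False` argument bringing the two into agreement.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed data, chosen only for illustration
data = [1, 2, 2, 3, 3, 3, 4, 4, 10]

skew_biased = stats.skew(data)                 # default: bias=True
skew_unbiased = stats.skew(data, bias=False)   # bias-corrected estimate
skew_pandas = pd.Series(data).skew()           # bias-corrected estimate

# A positive value indicates a longer right tail
print(skew_biased, skew_unbiased, skew_pandas)
```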
8. Kurtosis
Kurtosis, often used in conjunction with skewness, describes the heaviness of the tails of a distribution relative to the normal distribution. It is used to indicate the presence of outliers and to describe the overall shape of the distribution, such as whether it is sharply peaked with heavy tails (called leptokurtic) or flatter with light tails (called platykurtic).
kurt_scipy = stats.kurtosis(data)
kurt_pandas = pd.Series(data).kurt()
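By default, both calls report excess kurtosis (Fisher's definition), for which a normal distribution scores 0 rather than 3. As with skewness, SciPy does not correct for bias by default, while Pandas does. The sketch below uses hypothetical values to show both options.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data with one heavy-tailed value, chosen only for illustration
data = [1, 2, 2, 3, 3, 3, 4, 4, 10]

kurt_excess = stats.kurtosis(data)                  # Fisher: normal -> 0
kurt_pearson = stats.kurtosis(data, fisher=False)   # Pearson: normal -> 3
kurt_unbiased = stats.kurtosis(data, bias=False)    # bias-corrected, Fisher
kurt_pandas = pd.Series(data).kurt()                # bias-corrected, Fisher

print(kurt_excess, kurt_pearson, kurt_unbiased, kurt_pandas)
```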
9. T-Test
A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. Or, in the case of a one-sample t-test, it can be used to determine whether a sample mean is significantly different from a predetermined population mean.
This test is run using the statistics module of the SciPy library. The test provides two results: the t-statistic and the p-value. Typically, if the p-value is less than 0.05, the result is considered statistically significant, suggesting that the two means are different from each other.
t_test, p_value = stats.ttest_ind(data1, data2)
onesamp_t_test, onesamp_p_value = stats.ttest_1samp(data, popmean=0)
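The following sketch runs both tests on two small, hypothetical groups and compares each p-value against the conventional 0.05 threshold. The group values and the population mean of 50 are assumptions made only for illustration.

```python
from scipy import stats

# Hypothetical measurements for two independent groups
group_a = [48, 50, 52, 49, 51, 50, 47, 53]
group_b = [55, 57, 54, 56, 58, 55, 53, 57]

alpha = 0.05  # conventional significance threshold

# Two-sample t-test: do the group means differ?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < alpha:
    print("Two-sample: the difference in means is statistically significant")
else:
    print("Two-sample: no statistically significant difference detected")

# One-sample t-test: does group_a differ from an assumed population mean of 50?
t_one, p_one = stats.ttest_1samp(group_a, popmean=50)
if p_one < alpha:
    print("One-sample: the sample mean differs significantly from 50")
else:
    print("One-sample: no statistically significant difference from 50")
```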
10. Chi-square
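The Chi-square test is used to determine whether observed categorical counts differ significantly from expected counts and, in its test-of-independence form, whether there is a significant association between two categorical variables, such as job title and gender. The test also uses the statistics module of the SciPy library. The chisquare function shown here is the goodness-of-fit form and requires the observed frequencies and the expected frequencies as input; for a contingency table of two categorical variables, SciPy provides stats.chi2_contingency. Similar to the t-test, the output includes a Chi-square test statistic and a p-value that can be compared to 0.05.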
chi_square_test, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
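The sketch below shows two common uses with hypothetical counts. `stats.chisquare` compares observed frequencies against expected ones, while `stats.chi2_contingency` takes a contingency table of two categorical variables and derives the expected counts itself.

```python
import numpy as np
from scipy import stats

# Goodness of fit: hypothetical counts from 60 rolls of a die,
# compared with the 10 rolls per face expected from a fair die
observed = [8, 9, 12, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat, p_value)

# Test of independence: hypothetical 2x2 contingency table
# (rows: two job titles, columns: two genders)
table = np.array([[20, 15],
                  [30, 35]])
chi2_ind, p_ind, dof, expected_counts = stats.chi2_contingency(table)
print(chi2_ind, p_ind, dof)
print(expected_counts)  # counts expected if the variables were independent
```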
Summary
This article highlights 10 key Python statistical functions, but there are many more contained in various packages that can be used for more specific applications. Leveraging these tools for statistics and data analysis allows you to gain valuable insights from your data.
Mehrnaz Siavoshi holds a master's degree in data analytics and is a full-time biostatistician, working on developing complex machine learning and statistical analysis in the healthcare field. She has experience with AI and has taught undergraduate courses in biostatistics and machine learning at the University of the People.