Detecting multicollinearity in data sets is an important but challenging step. I will demonstrate how to detect variables with similar behavior in mixed data sets and how to examine their relationships further with interactive graphs.
Understanding the strength of relationships between variables in a data set is important because variables with statistically similar behavior can affect the reliability of models. To detect this so-called multicollinearity, we can use correlation measures for continuous variables. However, when the data set also contains categorical variables, and is therefore mixed, testing for multicollinearity becomes more difficult. Statistical tests, such as the hypergeometric test and the Mann-Whitney U test, can be used to test for associations between variables in mixed data sets. While this is great, it requires several intermediate steps, such as variable typing, one-hot encoding, and multiple-testing correction, among others. This entire process is implemented in a method called HNET. In this blog post, I will demonstrate how to detect variables with similar behavior so that multicollinearity can be easily detected.
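To make those intermediate steps concrete, here is a minimal sketch of what testing two categorical variables against each other could look like: one-hot encode both columns, run a hypergeometric test for every pair of categories, and correct the p-values for multiple testing. This is not HNET's actual implementation. The helper names (`one_hot`, `hypergeom_pvalue`, `benjamini_hochberg`), the choice of Benjamini-Hochberg as the correction, and the toy data are all assumptions made purely for illustration.

```python
from math import comb


def one_hot(values):
    """Map each category to a list of 0/1 flags, one flag per row."""
    return {cat: [int(v == cat) for v in values] for cat in sorted(set(values))}


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n rows without replacement from N rows,
    of which K are 'successes' (one-sided over-representation test)."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data, invented for this example: two categorical columns.
sex = ["f", "f", "f", "f", "m", "m", "m", "m", "m", "m"]
survived = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "yes", "no"]

results = []
for a_name, a in one_hot(sex).items():
    for b_name, b in one_hot(survived).items():
        N = len(a)                               # total number of rows
        K = sum(b)                               # rows where category b holds
        n = sum(a)                               # rows where category a holds
        k = sum(x * y for x, y in zip(a, b))     # rows where both hold
        p = hypergeom_pvalue(N, K, n, k)
        results.append((f"sex={a_name}", f"survived={b_name}", p))

# Correct for the number of category pairs that were tested.
adjusted = benjamini_hochberg([p for _, _, p in results])
for (a_label, b_label, p), q in zip(results, adjusted):
    print(f"{a_label} vs {b_label}: p={p:.3f}, p_adj={q:.3f}")
```

Even for two small columns this already means four tests plus a correction step, which is why doing it by hand across dozens of mixed-type variables quickly becomes impractical. A continuous variable would need a different test per category, such as the Mann-Whitney U test mentioned above, which this sketch leaves out.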
Real-world data often contain measurements with both continuous and discrete values. We need to look at each variable and use common sense to determine whether the variables can be related to each other. But when there are dozens of variables (or more), where each variable can have multiple states, manually checking all of them is time-consuming and error-prone. We can automate this task by combining intensive preprocessing steps with statistical testing methods. This is where HNET (1, 2) comes into play: it uses statistical tests to determine the significant relationships between all variables in a data set. It allows you to input your raw, unstructured data into the model and then generates a network that sheds light on the complex relationships between variables. Let’s move on to the next section, where I will explain how to detect variables with similar behavior using statistics.…