Bored with Kaggle and FiveThirtyEight? These are the alternative strategies that I use to obtain unique and high-quality data sets.
The key to a great data science project is a great data set, but finding great data is much easier said than done.
I remember when I was studying for my master’s degree in Data Science, a little over a year ago. Throughout the course, I found that coming up with project ideas was the easy part: it was finding good data sets that I struggled with the most. I would spend hours scouring the internet, tearing my hair out trying to find interesting data sources and getting nowhere.
Since then, I have come a long way in my approach, and in this article I want to share with you the 5 strategies I use to find data sets. If you’re bored with standard sources like Kaggle and FiveThirtyEight, these strategies will let you get unique data that is much more tailored to the specific use cases you have in mind.
Yes, believe it or not, creating your own data is actually a legitimate strategy. It even has a fancy technical name (“synthetic data generation”).
If you’re testing a new idea or have very specific data requirements, creating synthetic data is a great way to get original, custom data sets.
For example, suppose you are trying to build a churn prediction model: a model that can predict the probability that a customer will leave a company. Churn is a fairly common “operational problem” that many companies face, and tackling a problem like this is a great way to show recruiters that you can use ML to solve business-relevant problems, as I’ve argued previously.
However, if you search online for “churn datasets”, you’ll find that there are (at the time of writing) only two main data sets readily available to the public: the Bank Customer Churn dataset and the Telco Customer Churn dataset. These data sets are a great place to start, but they may not reflect the kind of data needed to model churn in other industries.
Instead, you could try creating synthetic data that better suits your requirements.
If this sounds too good to be true, here’s an example data set I created with just a short prompt to that old chestnut, ChatGPT:
Of course, ChatGPT is limited in the speed and size of the data sets it can create, so if you want to scale this technique up I recommend using the Python library faker or scikit-learn’s sklearn.datasets.make_classification and sklearn.datasets.make_regression functions. These tools are a great way to generate large data sets programmatically in the blink of an eye, and are perfect for building proof-of-concept models without spending a lot of time searching for the perfect data set.
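To give a feel for what that looks like, here’s a minimal sketch that combines the two: faker for fake customer attributes and make_classification for numeric features plus a churn label. The column names and the churn framing are illustrative assumptions, not any real company’s schema.

```python
# A minimal sketch of programmatic synthetic data generation.
# Assumes `pip install faker scikit-learn pandas`; all column names below
# are made up for illustration.
import pandas as pd
from faker import Faker
from sklearn.datasets import make_classification

fake = Faker()
n_customers = 1_000

# Fake customer attributes with faker
customers = pd.DataFrame({
    "customer_id": [fake.uuid4() for _ in range(n_customers)],
    "name": [fake.name() for _ in range(n_customers)],
    "signup_date": [fake.date_between(start_date="-3y") for _ in range(n_customers)],
})

# Numeric features plus a binary target with make_classification
X, y = make_classification(
    n_samples=n_customers,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    weights=[0.8, 0.2],  # imbalanced classes, like most churn problems
    random_state=42,
)
features = pd.DataFrame(
    X, columns=["monthly_spend", "support_calls", "tenure_score", "usage_score"]
)
features["churned"] = y

synthetic_churn = pd.concat([customers, features], axis=1)
print(synthetic_churn.head())
```

Swap make_classification for make_regression if your target is continuous (e.g. customer lifetime value instead of a churn flag).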
In practice, I have rarely needed to use synthetic data generation to create complete data sets (and, as I’ll explain later, you’d be wise to be careful if you intend to do this). Instead, I find it a very good technique for generating adversarial examples or adding noise to my data sets, allowing me to test the weaknesses in my models and build more robust versions. But regardless of how you use it, it’s an incredibly useful tool to have at your disposal.
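For the noise-injection use case, something like the sketch below is what I have in mind: jitter the numeric features of a held-out set and check how gracefully the model’s score degrades. The `model`, `X_test` and `y_test` names are placeholders for your own trained estimator and (numeric) test data, not part of any particular library.

```python
# A rough sketch of noise injection for robustness checks: perturb numeric
# test features and compare the model's score before and after.
# `model`, `X_test`, `y_test` are placeholders for your own objects.
import numpy as np

def evaluate_under_noise(model, X_test, y_test, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Scale the noise to each feature's standard deviation
    noise = rng.normal(0, noise_scale * X_test.std(axis=0), size=X_test.shape)
    return model.score(X_test + noise, y_test)

# e.g. compare model.score(X_test, y_test) against
# evaluate_under_noise(model, X_test, y_test, noise_scale=0.2)
```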
Creating synthetic data is a good solution for situations where you can’t find the kind of data you’re looking for, but the obvious problem is that you have no guarantee that the data is a good representation of real-life populations.
If you want to ensure that your data is realistic, the best way to do that is, surprise surprise…
…to go and find some real data.
One way to do this is to contact companies that might have such data and ask if they would be interested in sharing some of it with you. At the risk of stating the obvious, no company is going to give you data that is highly sensitive, or hand anything over if you plan to use it for commercial or unethical purposes. That would just be stupid.
However, if you intend to use the data for research (for a university project, for example), companies may be open to providing it as part of a quid pro quo joint research agreement.
What do I mean by this? It’s actually quite simple: an agreement whereby they provide you with some (anonymized/desensitized) data and you use that data to conduct research that is of some benefit to them. For example, if you are interested in studying churn modelling, you could write a proposal to compare different churn prediction techniques, share the proposal with a few companies, and ask if there is potential to work together. If you’re persistent and cast a wide net, chances are you’ll find a company that’s willing to provide data for your project, as long as you share your findings with them so that they can benefit from the research.
If that sounds too good to be true, you might be surprised to hear that this is exactly what I did during my master’s. I reached out to a couple of companies with a proposal on how I could use their data for research that would benefit them, signed some documents to confirm that I wouldn’t use the data for any other purpose, and did a really fun project using some real-world data. It really can be done.
The other thing I particularly like about this approach is that it provides a way to exercise and develop a fairly broad set of skills that are important in data science. You need to communicate well, show business awareness, and become a pro at managing stakeholder expectations, all of which are essential skills in the day-to-day life of a data scientist.
Many data sets used in academic studies are not published on platforms like Kaggle, but are still publicly available for other researchers to use.
One of the best ways to find data sets like these is by searching the repositories associated with academic journal articles. Why? Because many journals require their contributors to make the underlying data publicly available. For example, two of the data sources I used during my master’s degree (the Fragile Families data set and the Hate Speech Data website) were not available on Kaggle; I found them through academic papers and their associated code repositories.
How can you find these repositories? It’s actually surprisingly simple: I start by opening paperswithcode.com, search for papers in the area I’m interested in, and look through the available data sets until I find something that looks interesting. In my experience, this is a very good way to find data sets that the Kaggle masses haven’t yet made their way to.
I honestly have no idea why more people don’t use BigQuery’s public data sets. There are literally hundreds of data sets, covering everything from Google search trends to London bike rentals to genomic sequencing of cannabis.
One of the things I like best about this source is that many of these data sets are incredibly business-relevant. You can say goodbye to niche academic topics like flower classification and digit prediction; in BigQuery, there are data sets about real-world business problems, such as ad performance, website visits, and economic forecasts.
Many people steer clear of these data sets because they require knowledge of SQL to query. But even if you don’t know SQL and only know a language like Python or R, I’d encourage you to take an hour or two to learn some basic SQL and then start querying these data sets. It doesn’t take long to get up and running, and this really is a treasure trove of high-value data assets.
To use the data sets in BigQuery Public Datasets, you can sign up for a completely free account and create a sandboxed project by following the instructions here. You don’t need to enter your credit card details or anything like that, just your name, your email, and a little information about the project, and you’re done. If you need more compute power at a later date, you can upgrade to a paid project and access GCP compute resources and advanced BigQuery features, but I’ve personally never needed to do this and have found the sandbox to be more than adequate.
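Once the sandbox project exists, querying a public table from Python takes only a few lines. Here’s a sketch using the London bike rentals data mentioned above; it assumes you’ve installed the google-cloud-bigquery client, authenticated (e.g. via `gcloud auth application-default login`), and substituted your own project ID for the placeholder.

```python
# A minimal sketch of querying a BigQuery public dataset from Python.
# Assumes `pip install google-cloud-bigquery pandas db-dtypes` and an
# authenticated environment; the project ID below is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="your-sandbox-project-id")

sql = """
    SELECT start_station_name,
           COUNT(*) AS num_hires,
           AVG(duration) AS avg_duration_seconds
    FROM `bigquery-public-data.london_bicycles.cycle_hire`
    GROUP BY start_station_name
    ORDER BY num_hires DESC
    LIMIT 10
"""

df = client.query(sql).to_dataframe()  # results come back as a pandas DataFrame
print(df)
```

From there you’re back in familiar territory: the results are just a DataFrame, so you can carry on in pandas or R as you normally would.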
My final piece of advice is to try using a data set search engine. These are amazing tools that have come out in the last few years and make it very easy to quickly see what’s available. Three of my favorites are:
In my experience, searching with these tools can be a much more effective strategy than using generic search engines, since you are often provided with metadata about the data sets and can rank them by how often they have been used and when they were published. A pretty nifty approach, if you ask me.