Image by author
You've read in these pages (and I'm guilty of writing some of those articles) that data science projects are crucial to developing the entire suite of data science technical skills. That's true, they are. But it is just as vital to have high-quality data sets for those projects. Collecting quality data is only one stage of a data science project, but it is the one that can make or break it.
The question is, where to find this damn data? Fortunately, numerous websites offer a wealth of data for various purposes.
You have probably heard of Kaggle, probably the most well-known platform in the data science community. It hosts a wide range of data sets in various formats (CSV, JSON, SQLite, BigQuery) and from multiple industries and topics, such as healthcare, automotive, arts and entertainment, biology, social sciences, investments, social media, and sports. You can also search for data sets by their technical focus, for example computer science, classification, computer vision, NLP, or data visualization.
Currently, there are 274,855 data sets available, so you won't run short of data.
Kaggle's easy-to-use interface and active community forums make it an excellent resource for beginners and professionals alike.
If you are a machine learning enthusiast, the UCI Machine Learning Repository should be your go-to site. As the name suggests, this repository was created by the University of California, Irvine (UCI), which compiled an extensive collection of data sets designed for machine learning. These data sets cover a wide range of topics and are particularly useful for those who want to practice and improve their machine learning skills.
There are currently 653 data sets; you can explore them by data type, subject area, task, number of features and instances, and feature type.
StrataScratch provides 49 data sets and projects from real companies. This is particularly beneficial for those preparing for data science interviews, as it helps users develop their technical skills and their ability to derive business insights from data. This enables a practical, industry-relevant approach to data science projects.
The projects cover various topics such as data exploration, data engineering, business analysis, regression, classification, NLP and clustering.
Google Dataset Search is a tool for finding data sets on the web. You already know how to use it, even if you've never heard of it until now. Why? Well, it looks and works like a normal Google search, only it focuses exclusively on finding data sets. It is extremely useful if you are searching for data from various sources, including academic articles and government databases.
The Amazon AWS Public Data Sets program is another site where you can find a lot of open data. With 494 data sets currently available, it is a valuable resource for data scientists. The data sets you find there can be integrated with AWS cloud services, which can be useful if your projects require more computing resources.
The range of data available includes genomics, meteorology and astronomy, among others.
Data.gov is a data repository sponsored by the US government. It includes 283,935 data sets from 132 US organizations. The data spans a wide range of topics, such as agriculture, public health, finance, education, demographics, economics, and the environment.
Data sets come in almost 50 different formats, the most popular being HTML, XML, ZIP, CSV, PDF, ArcGIS GeoServices REST API, KML, GeoJSON, JSON, and TEXT.
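With so many formats in one catalog, it helps to route each downloaded file to a parser based on its format. Here is a minimal sketch using only Python's standard library. The function name `parse_dataset` and the sample strings are my own illustrative assumptions, not anything provided by Data.gov, and the sketch covers only CSV, JSON, and GeoJSON.

```python
import csv
import io
import json


def parse_dataset(text, fmt):
    """Parse raw dataset text into a list of row dictionaries.

    `fmt` is a format name; only csv, json, and geojson are handled here.
    """
    fmt = fmt.lower()
    if fmt == "csv":
        # DictReader uses the first line as the column names.
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Accept either a list of records or a single record.
        return data if isinstance(data, list) else [data]
    if fmt == "geojson":
        # A GeoJSON FeatureCollection keeps attributes under "properties".
        features = json.loads(text).get("features", [])
        return [feature.get("properties", {}) for feature in features]
    raise ValueError(f"Unsupported format: {fmt}")


# Made-up sample data, for illustration only.
csv_text = "state,population\nA,10\nB,20\n"
rows = parse_dataset(csv_text, "csv")
print(rows[0]["state"], rows[0]["population"])
```

Note that CSV values come back as strings, so numeric columns still need converting before analysis.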
FiveThirtyEight, by ABC News, maintains a repository of the code and data behind its articles and graphics. It is a perfect resource for data journalists and anyone interested in statistical storytelling. If you're interested in doing projects involving current events, politics, sports, and more, this is your source.
It offers more than 160 data sets from 2014 to the present.
World Bank Open Data offers extensive data sets on global development. This data includes indicators on the economy, environment, and social issues of countries around the world. If you are interested in socioeconomic and global development issues, you can find a lot of interesting data here.
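World Bank indicators can also be pulled programmatically. The sketch below assumes the public World Bank API v2 URL pattern and its JSON response shape: a two-element list with paging metadata first, then observation records carrying `date` and `value` fields. The helper names are mine, and the sample payload contains made-up numbers, so check the API documentation before relying on it.

```python
import json


def indicator_url(country, indicator):
    """Build a World Bank API v2 URL for one country and one indicator."""
    return (
        "https://api.worldbank.org/v2/country/"
        f"{country}/indicator/{indicator}?format=json"
    )


def parse_observations(payload):
    """Return (year, value) pairs, oldest first, skipping missing values.

    Assumes annual data, where each record's "date" is a year string.
    """
    # Assumed shape: [paging_metadata, [record, record, ...]]
    records = payload[1] if len(payload) > 1 and payload[1] else []
    pairs = [
        (int(record["date"]), record["value"])
        for record in records
        if record.get("value") is not None
    ]
    return sorted(pairs)


# Made-up payload that mimics the assumed response shape (not real statistics).
sample = json.loads(
    '[{"page": 1}, [{"date": "2021", "value": 2.5},'
    ' {"date": "2020", "value": null},'
    ' {"date": "2019", "value": 1.5}]]'
)
print(indicator_url("BR", "NY.GDP.MKTP.CD"))
print(parse_observations(sample))
```

Keeping the URL builder separate from the parser means you can test the parsing logic on a saved response without any network access.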
GitHub is not just a code sharing platform. It can also be used to search for data sets for data projects. Many organizations and individual users host their data sets in GitHub repositories. This data covers a wide range of topics and is often supported by extensive documentation and code for analysis.
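Files in a public GitHub repository can usually be fetched directly through the `raw.githubusercontent.com` host. Here is a minimal sketch, assuming a public repository. The owner, repository, branch, and path below are hypothetical placeholders, the helper names are mine, and `fetch_csv_rows` needs network access.

```python
import csv
import io
import urllib.request


def raw_github_url(owner, repo, branch, path):
    """Build the raw-content URL for a file in a public GitHub repository."""
    clean_path = path.lstrip("/")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{clean_path}"


def fetch_csv_rows(url):
    """Download a CSV file and return its rows as dictionaries (needs network)."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical repository and file path, for illustration only.
url = raw_github_url("some-org", "some-datasets", "main", "data/example.csv")
print(url)
```

To load a real file, swap in the repository's actual owner, name, default branch, and file path, then pass the resulting URL to `fetch_csv_rows`.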
OpenML is an online platform for machine learning, which also means it gives you access to a lot of data: almost 5,400 data sets, to be specific. It is designed for sharing, organizing, and discussing data and results from machine learning experiments. OpenML can be integrated with popular machine learning environments, which is a plus for your data science learning.
The Datasets subreddit is a community-driven data source. People share everything on Reddit, so of course they also share and request data sets for data projects. Sometimes it is difficult to find data there, but not for lack of it. On the contrary! The place is packed with data, which can make searching quite chaotic. The data ranges from very specific and unusual data sets to more traditional ones. Since this is basically a forum, you can also participate in discussions and ask for help with data sets.
Eurostat is the statistical office of the European Union, and it is a comprehensive source of data. If you are interested in high-quality statistical data on EU member countries, this should be your main data source. Data on EU countries includes topics such as economy, population, health, and trade.
HDX is an open platform where you can find humanitarian data. It is managed by the United Nations Office for the Coordination of Humanitarian Affairs. This platform provides data on humanitarian crises and emergencies in countries around the world. You might find it useful if you are interested in projects focused on global issues, disaster response, and human well-being.
There are 20,344 active and 2,570 archived data sets with various characteristics and formats.
At the Centers for Disease Control and Prevention, you can find health-related data. The data sets focus on various health conditions, risk factors, and public health. So if these are the topics you are interested in, you will find a lot of useful data here.
The BLS site has a lot of data on US economic conditions, the job market, price changes, quality of life, and more. You will find many quality data sets if you are interested in those topics.
The last data source I will mention is NASA. There is a lot of data on aerospace, applied science, applications, Earth science, management/operations, raw data, software, and space science.
It has over 10,000 data sets, so don't get lost in its data universe!
These 16 websites, I'm sure, will give you enough data to work with until the end of time, which was precisely my goal! However, the amount of data is not everything.
I chose these sites because they will provide you with a very diverse range of data sets suitable for a variety of data science projects. The specifics of data sets differ from industry to industry, so working with multiple data sets also helps you gain domain knowledge.
Whether you're delving into machine learning, data analytics, data journalism, statistical analysis, or data visualization, you can always count on these resources.
Now you can make your own data science project! If you need more ideas, here are some data science projects you can do as a beginner.
Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.