Image by author
Data scientist remains one of the most popular job titles of the 21st century, so it is not surprising that there is a lot of curiosity about the field. But first, what is data science?
Data science is a multidisciplinary field that includes different elements from various domains such as data visualization, model building, and data manipulation.
In this article, we'll take a closer look at these elements and explore libraries that will allow you to apply them using Python. Whether you are a professional or consider yourself a beginner, this article is sure to expand your knowledge. Let us begin!
Data collection is the process of gathering information, often from the web.
You may see data projects built on synthetic datasets or Kaggle datasets.
These are fine for beginners, but if you want to land a competitive job, you should go further and collect your own data.
Python offers many options for this. Let's take a closer look at three of them.
Scrapy
This is a web crawling framework for Python, ideal for large-scale data extraction.
It is more sophisticated than BeautifulSoup and allows for more complex data collection.
A unique feature of Scrapy is its ability to handle asynchronous requests efficiently, making it faster for large-scale scraping tasks. If you are new to web scraping, the next library may suit you better.
BeautifulSoup
BeautifulSoup is used to parse HTML and XML documents. It's simpler and easier to use than Scrapy, making it ideal for beginners or simpler scraping tasks.
A distinctive aspect of BeautifulSoup is its flexibility to parse even poorly formatted HTML.
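As a minimal sketch of that tolerance for messy markup, the snippet below parses an HTML fragment whose `div`, `body`, and `html` tags are never closed. The markup, the `links` class, and the URLs are made up for illustration, and the example assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Illustrative, deliberately incomplete HTML: <div>, <body> and <html>
# are opened but never closed.
html = """
<html><body>
<div class="links">
  <a href="/page1">First page</a>
  <a href="/page2">Second page</a>
"""

# "html.parser" is Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

# Collect the text and target of every link despite the missing closing tags.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]
print(links)
```

In a real project, the HTML string would typically come from an HTTP response rather than a hard-coded literal.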
Selenium
Selenium is mainly used to automate web browsers. It is well suited to extracting data from websites that require interaction, such as filling out forms, or that load their content with JavaScript.
Its distinguishing feature is the ability to automate and interact with web pages as if a human were browsing, enabling data collection from dynamic web pages.
Now you have data, but you must explore it to see its characteristics.
SciPy
SciPy is used for scientific and technical computing.
It is more focused on advanced calculations than NumPy and offers additional functionality such as optimization, integration, and interpolation.
A unique feature of Scipy is its extensive collection of submodules for different scientific computing tasks.
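To make the three capabilities mentioned above concrete, here is a small sketch that uses one function from each of the `optimize`, `integrate`, and `interpolate` submodules. The functions being minimized and integrated, and the sample points, are arbitrary choices for illustration.

```python
import numpy as np
from scipy import integrate, interpolate, optimize

# Optimization: find the x that minimizes f(x) = (x - 2)^2 + 1.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)
print("minimum near x =", result.x)

# Integration: numerically integrate x^2 from 0 to 3 (the exact value is 9).
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)
print("integral =", area)

# Interpolation: estimate a value between known sample points.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 10.0, 20.0])
linear = interpolate.interp1d(xs, ys)  # linear interpolation by default
midpoint = float(linear(0.5))
print("value at x = 0.5:", midpoint)
```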
NumPy
NumPy is one of the most important libraries in Python for data science.
Most of its fame comes from its array object. While SciPy is built on NumPy, NumPy also works on its own.
A distinguishing feature is its ability to perform efficient matrix calculations, which is why it is so important in data science. However, the next library is just as important.
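As a brief illustration of NumPy's array object and matrix calculations, the sketch below multiplies two small matrices and applies element-wise operations without writing any loops. The values are arbitrary.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Matrix multiplication with the @ operator.
product = a @ b
print(product)  # [[19 22] [43 50]]

# Element-wise arithmetic applies to every entry at once.
doubled = a * 2

# Aggregate along an axis: the mean of each column.
column_means = a.mean(axis=0)
print(column_means)  # [2. 3.]
```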
Pandas
Pandas offers easy-to-use data structures, such as DataFrames, along with data analysis tools built around them.
What distinguishes Pandas from other data manipulation tools is the DataFrame, which provides extensive capabilities for data manipulation and analysis.
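For data exploration, a first pass often looks something like the sketch below: build or load a DataFrame, then check summary statistics, missing values, and category counts. The `city` and `price` columns and their values are invented for illustration.

```python
import pandas as pd

# Illustrative data with one missing price.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "price": [250.0, 410.0, None, 320.0],
})

print(df.head())                         # first rows of the table
summary = df["price"].describe()         # count, mean, std, min, quartiles, max
missing = df.isna().sum()                # number of missing values per column
city_counts = df["city"].value_counts()  # how often each city appears

print(summary)
print(missing)
print(city_counts)
```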
Data manipulation is the process in which you shape your data to prepare for the next stages.
Pandas
Pandas offers data structures like the DataFrame, which make working with data much easier. It also ships with a large number of built-in functions, which can turn 100 lines of your own code into a couple of function calls.
It also has data visualization capabilities and data exploration functions, making it more versatile than other Python libraries.
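As a small example of how much the built-in functions can compress, the sketch below derives a new column and aggregates it by group in two statements. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 5, 7, 8],
    "unit_price": [2.0, 3.0, 2.0, 3.0],
})

# Derive a new column from existing ones (vectorized, no explicit loop).
sales["revenue"] = sales["units"] * sales["unit_price"]

# Group rows by region and total the revenue for each group.
revenue_by_region = sales.groupby("region")["revenue"].sum()
print(revenue_by_region)
```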
Data visualization allows you to tell the whole story on a single page. In this section, we will cover three libraries that help you do this.
Matplotlib
If you have ever visualized data with Python, you have probably used Matplotlib. It is a Python library for creating a wide range of visualizations, whether static, interactive, or animated.
It is a more customizable data visualization library than others. You can control virtually any element of a plot with it.
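The following sketch shows the kind of fine-grained control described above: line color, marker, line style, title, axis labels, and grid are each set explicitly. The monthly figures are invented, and the `Agg` backend is selected only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Illustrative data.
months = ["Jan", "Feb", "Mar", "Apr"]
units_sold = [120, 135, 150, 170]

fig, ax = plt.subplots(figsize=(6, 4))

# Each visual property of the line is controlled individually.
ax.plot(months, units_sold, color="tab:blue", marker="o",
        linestyle="--", linewidth=2)

ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.grid(True, alpha=0.3)

# fig.savefig("monthly_sales.png", dpi=150)  # uncomment to write the chart to disk
```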
Seaborn
Seaborn is built on top of Matplotlib and offers a higher-level, differently styled take on the same charts, such as bar charts.
It can be easier to use than Matplotlib for creating complex visualizations, and it integrates closely with Pandas DataFrames.
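To illustrate the DataFrame integration, this sketch passes a DataFrame directly to `sns.barplot` and refers to columns by name. Seaborn then averages the values per category on its own. The `day` and `tip` data are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Illustrative data: two tips recorded on each day.
tips = pd.DataFrame({
    "day": ["Thu", "Thu", "Fri", "Fri", "Sat", "Sat"],
    "tip": [2.0, 3.0, 2.5, 3.5, 4.0, 5.0],
})

# Columns are referenced by name; by default each bar shows the mean tip per day.
ax = sns.barplot(data=tips, x="day", y="tip")
ax.set_title("Average tip by day")
```

Because the returned object is a regular Matplotlib `Axes`, any Matplotlib customization can still be applied afterwards.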
Plotly
Plotly is more interactive than the others. You can even build a dashboard with it, and you can also connect your code to Plotly and view your charts on the Plotly website.
If you want to know more, here are the Python Data Visualization Libraries.
Model building is the step where you can finally see the results of your work by making predictions. Here, too, there are many libraries to choose from.
Scikit-learn
The most famous Python library for machine learning is Scikit-learn. It offers simple yet efficient tools for building a model in a few lines of code. Of course, you could develop many of these features yourself, but do you want to write 100 lines of code instead of 1?
Its distinguishing feature is a broad collection of algorithms in a single package.
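A minimal sketch of that workflow is shown below: load a dataset, split it, fit a model, and score it. The choice of the bundled Iris dataset and of logistic regression is an assumption made for illustration, and any other Scikit-learn estimator could be swapped in with the same `fit` and `score` calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset that ships with Scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the classifier and measure accuracy on the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```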
TensorFlow
TensorFlow, created by Google, is better suited than Scikit-learn to advanced models such as deep learning, and it offers high-level features for building large-scale neural networks. Additionally, there are many free tools available online, also created by Google, that make learning TensorFlow easier.
Keras
Keras offers a high-level neural network API and can run on top of TensorFlow. Compared with TensorFlow, it focuses more on enabling rapid experimentation with deep neural networks.
Now you have your model, but it's just a script. To get something more meaningful, you need to convert your model into a web app or API to prepare it for production.
Django
Django is one of the most famous Python web frameworks, and it lets you build an application around your model in a structured way. It is more complicated than Flask and FastAPI, but that is because it has many built-in features, such as an admin panel.
In Flask, for example, you'd have to develop a lot of things from scratch, but if you don't know much about web frameworks, it's a good place to start.
Flask
Flask is a micro web framework for Python that lets you develop your own web application or API with little setup. It is more flexible than Django and better suited to smaller applications.
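As a rough sketch of what wrapping a model in a Flask API might look like, the example below exposes a single endpoint. The `/predict` route, the `features` payload field, and the `predict` helper (which just averages its inputs) are hypothetical stand-ins for a real trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Read the JSON body; fall back to an empty dict if it is missing or invalid.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")

    # Minimal validation, written by hand because Flask does not validate input.
    if not isinstance(features, list) or not features:
        return jsonify(error="'features' must be a non-empty list of numbers"), 400

    return jsonify(prediction=predict(features))
```

Assuming the file is saved as `app.py`, the development server can be started with `flask --app app run`, after which a POST request with a body such as `{"features": [1, 2, 3]}` returns a JSON prediction.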
FastAPI
FastAPI is fast and easy to use, which has made it increasingly popular.
A unique feature of FastAPI is its automatic documentation generation and built-in validation using Python type hints.
If you want to know more, here are the Top 18 Python Libraries.
At this stage you have everything, but only in your own environment. To share your model with the world and test it further, your web application or API must be running on a server.
Heroku
Heroku is a cloud platform as a service (PaaS) that supports multiple programming languages.
It is easier for beginners to use than AWS and offers a simpler deployment process for web applications. If you are a complete beginner, it may suit you better, as may PythonAnywhere.
PythonAnywhere
PythonAnywhere is an online development environment and web hosting service built around the Python programming language, as its name suggests.
It is more focused on Python projects than other tools. If you chose Flask in the previous section, you can upload your model to PythonAnywhere, which also offers a free tier.
AWS (Amazon Web Services)
AWS offers many different options for each feature on its platform. Even if you only plan to choose a database, there are many options to pick from.
It is more complex and comprehensive than other tools and is suitable for large-scale operations.
For example, if you chose Django in the previous section and took your time to build a large-scale web application, your next choice would be AWS.
In this article, we explored the top Python libraries used in data science. When working on your data science projects, remember that there is no single definitive method. I hope this article has introduced you to different tools.
Nate Rosidi is a data scientist who also works in product strategy. He is also an adjunct professor teaching analytics and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.