Image generated with DALL-E
In an age where analytical data processing is the fundamental difference between a successful business or not, we need a tool stack that can meet the needs. The advancement of technology has helped advance all these data tools that we need, namely DuckDB and MotherDuck.
DuckDB is an open source SQL online analytical processing (OLAP) database management system. The database system is designed to quickly handle analytical data queries, regardless of the size of the data. The system implements in-memory processing and OLAP systems that effectively improve our data analysis process.
DuckDB is perfect for storing and processing tabular data that involves data analysis (table joining, data aggregation, etc.) and when our workflow often involves significant changes to the table. On the other hand, DuckDB is not suitable for high-volume data activity and multiple simultaneous processes in a database.
motherduck is a DuckDB managed service in the cloud. It is free to use and open source while maintained by the DuckDB community. It is a service created by partnering with DuckDB Lab to create a cloud services platform that the public can use.
With a combination of DuckDB and Motherduck, we can create an analytics engine that can be easily used in all scenarios. How do we do that? Let's get into it.
We'll use MotherDuck's native user interface to give you an example of how the service works and why DuckDB is a powerful tool for data analysis. Register on the website and purchase the MotherDuck account if you haven't already.
Once you successfully register for the MotherDuck account, we will access the MotherDuck user interface. Try to get familiar with the user interface and you will realize that it is similar to Jupyter Notebook if you ever use one.
We will experiment with the power of DBduck in the MotherDuck UI with DS Salary data from Kaggle. Upload the data using the Add Files button and a new cell will be displayed with the query to be executed. The query should look like this.
CREATE OR REPLACE TABLE ds_salaries AS SELECT * FROM read_csv_auto(('ds_salaries.csv'));
Once you create the table, try querying the data with the following code.
select * from my_db.ds_salaries limit 10;
As you can see, MotherDuck is a lot like performing data analysis in Notebook, but with SQL queries. Let's try the query to perform data analysis in MotherDuck.
select job_title,
avg(salary_in_usd) as average_salary_in_usd
from my_db.ds_salaries
GROUP BY job_title
ORDER BY job_title
You can run the query on the cell; The table result is displayed similar to the image below.
You can filter the data, rotate the table or download the result with the radio button available on the user interface.
MotherDuck also allows the user to access the database through Python on their Notebook. We need to install the DuckDB package using the following code.
pip install duckdb==v0.9.2
The current version supported by MotherDuck is DuckDB 0.9.2; That's why we installed that version.
When the installation is successful, we need to connect DuckDB with Motherduck. There are a few ways to authenticate the connection, but we would use the service token. This token is acquired in the MotherDuck settings.
import duckdb
token = "insert token here"
# initiate the MotherDuck connection
con = duckdb.connect(f'md:?motherduck_token={token}')
If we did not set any database name, MotherDuck would access using the default database, which is my_db. Next, let's use the same query we did earlier in Notebook.
q = """
select job_title,
avg(salary_in_usd) as average_salary_in_usd
from my_db.ds_salaries
GROUP BY job_title
ORDER BY job_title
"""
con.sql(q).show()
You will see a result similar to the following table.
┌─────────────────────────────────────┬───────────────────────┐
│ job_title │ average_salary_in_usd │
│ varchar │ double │
├─────────────────────────────────────┼───────────────────────┤
│ 3D Computer Vision Researcher │ 21352.25 │
│ ai Developer │ 136666.0909090909 │
│ ai Programmer │ 55000.0 │
│ ai Scientist │ 110120.875 │
│ Analytics Engineer │ 152368.63106796116 │
│ Applied Data Scientist │ 113726.3 │
│ Applied Machine Learning Engineer │ 99875.5 │
│ Applied Machine Learning Scientist │ 109452.83333333333 │
│ Applied Scientist │ 190264.4827586207 │
│ Autonomous Vehicle Technician │ 26277.5 │
│ · │ · │
│ · │ · │
│ · │ · │
│ Principal Data Engineer │ 192500.0 │
│ Principal Data Scientist │ 198171.125 │
│ Principal Machine Learning Engineer │ 190000.0 │
│ Product Data Analyst │ 56497.2 │
│ Product Data Scientist │ 8000.0 │
│ Research Engineer │ 163108.37837837837 │
│ Research Scientist │ 161214.19512195123 │
│ Software Data Engineer │ 62510.0 │
│ Staff Data Analyst │ 15000.0 │
│ Staff Data Scientist │ 105000.0 │
├─────────────────────────────────────┴───────────────────────┤
│ 93 rows (20 shown) 2 columns │
└─────────────────────────────────────────────────────────────┘
With the above query, you can use the following code to process them in Pandas DataFrame.
import pandas as pd
df = con.sql(q).fetchdf()
Finally, you can load another set of data into the database using the following query.
con.sql("CREATE TABLE mytable AS SELECT * FROM '~/filepath.csv'")
The above query assumes that your data is a CSV file. Other options include S3 or the local DuckDB for the MotherDuck database.
DuckDB is an open source database system developed specifically for data analysis. The system is designed to handle data processing quickly and efficiently. MotherDuck is an open source cloud managed service for DuckDB.
By combining DuckDB and MotherDuck, we can turn our laptops into a personal analysis engine by having our data in the cloud and quickly processing it with DuckDB.
Cornellius Yudha Wijaya He is an assistant data science manager and data writer. While working full-time at Allianz Indonesia, she loves sharing Python tips and data through social media and print media.