Now let's say you've already fetched a bunch of records by making API requests with the parameters mentioned above, it's time for you to decide how you want to write them to the destination table.
Answer: Merge/Dedup mode (recommended)
This question refers to the choice of Write disposition either Sync modeThe immediate answer is that since you are looking to load your data incrementally, you will probably choose to write the extracted data in append mode or merge mode (also known as deduplication mode).
However, let's step back a bit to examine our options more closely and determine which method is best suited for incremental loading.
Below are the most popular writing layouts.
- overwrite/replace: delete all existing records in the target tables and then insert the extracted records.
- attach: simply add extracted records to the target tables.
- go after: Insert new
records and updates(**) existing records. merge_key
primary_key
How do we know which records are new?: Typically, we'll use a primary key to determine this. If you're using DLT, your merge strategy can be more sophisticated than that, including distinguishing between and dedup_sort
(one is used for merging and one is used for deduplication before merging)either
(which records will be deleted with the same key in the deduplication process) I'll leave that part for another tutorial.(**) This is a simple explanation, if you want to know more about how dlt handles this merging strategy, read more
here
. Here is an example to help us understand the results of different writing arrangements.
On 06/19/2024: We performed the first synchronization. source application
Image by the author destination database
Data uploaded to our No matter which synchronization strategy you choose, the table at the destination is
from the source table. updated_at
Image by the author updated_at
Saved state of
= 2024–06–03, which is the last between the 2 records we synchronized.
source application
Data in
- Image by the author
- Changes in the source table:
- Record id=1 (sales figure) was updated.
Record id=2 was deleted. updated_at
Record id=3 was inserted.
In this sync, we ONLY pull records with the
> 2024–06–03 (state saved since last sync). Therefore, we will only extract records id=1 and id=3. Since record id=2 was removed from the source data, there is no way for us to recognize this change. With the second sync, you will now see the difference between the writing strategies. destination database
Data uploaded to our
Scenario 1: Overwrite
Image by the author The destination table will be overwritten with the 2 records extracted this time.
Scenario 2: Append
Image by the author The 2 extracted records will be added to the target table, existing records will not be affected.
Scenario 3: Merge or deduplication
Image by the author The 2 extracted records with id=1 and 3 will replace the existing records in the destination. This process is called merging or deduplication. The record id=2 in the destination table remains untouched.
Conclusions:The deduplication strategy can be effective in the incremental data loading process, but if your table is very large, this deduplication process can take a considerable amount of time.