Snowflake is a SaaS (Software as a Service) platform that is well suited to running analytics on large volumes of data. The platform is extremely easy to use, allowing business users, analytics teams, and others to derive value from ever-growing data sets. This article looks at the components of building a Snowflake streaming semi-structured analytics platform for healthcare data, along with some key considerations during this phase.
The healthcare industry supports many different data formats, but we will use one of the newer semi-structured formats, FHIR (Fast Healthcare Interoperability Resources), to build our analytics platform. This format generally integrates all patient-focused information into one JSON document, including hospital encounters, laboratory results, and more. Given a searchable data lake over these documents, the analytics team can extract valuable information, such as how many patients were diagnosed with cancer. We assume that all these JSON files are pushed to AWS S3 (or any other public cloud storage) every 15 minutes via different AWS services or API endpoints.
- RAW zone from AWS S3 to Snowflake:
- Data must be continuously streamed from AWS S3 to the Snowflake RAW zone.
- Snowflake offers the Snowpipe managed service, which can continuously read JSON files from S3.
- A table with a VARIANT column must be created in the Snowflake RAW zone to hold the JSON documents in their native format; a sketch of the stage, table, and pipe follows this list.
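A minimal sketch of this layer is shown below. The S3 bucket, storage integration, and object names (raw.fhir_bundles_raw, raw.fhir_s3_stage, raw.fhir_pipe) are assumptions for illustration, not names prescribed by the article:

```sql
-- Sketch only: bucket name, storage integration, and object names are assumptions.
CREATE TABLE IF NOT EXISTS raw.fhir_bundles_raw (
    record    VARIANT,                                   -- full FHIR Bundle in native JSON
    file_name STRING,                                    -- source file for traceability
    loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

CREATE STAGE IF NOT EXISTS raw.fhir_s3_stage
    URL = 's3://my-fhir-landing-bucket/fhir/'            -- hypothetical bucket
    STORAGE_INTEGRATION = s3_fhir_integration            -- assumed to already exist
    FILE_FORMAT = (TYPE = JSON);

-- Snowpipe continuously loads new JSON files as they arrive in S3
-- (AUTO_INGEST relies on S3 event notifications being configured).
CREATE PIPE IF NOT EXISTS raw.fhir_pipe
    AUTO_INGEST = TRUE
AS
COPY INTO raw.fhir_bundles_raw (record, file_name)
FROM (SELECT $1, METADATA$FILENAME FROM @raw.fhir_s3_stage)
FILE_FORMAT = (TYPE = JSON);
```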
- Snowflake RAW Zone to Streams:
- Streams is Snowflake's managed change data capture feature; a stream on the RAW zone table will capture all new incoming JSON documents as they land in the Snowflake RAW zone.
- The stream should point to the Snowflake RAW zone table and be created with APPEND_ONLY = TRUE.
- A stream can be queried like any table, as shown in the sketch below.
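A minimal sketch, reusing the hypothetical raw.fhir_bundles_raw table from above:

```sql
-- Append-only stream: captures only rows inserted by Snowpipe (no updates/deletes).
CREATE STREAM IF NOT EXISTS raw.fhir_bundles_stream
    ON TABLE raw.fhir_bundles_raw
    APPEND_ONLY = TRUE;

-- A stream can be queried like a regular table; a plain SELECT does not advance
-- its offset (only DML that consumes the stream does).
SELECT record, METADATA$ACTION
FROM raw.fhir_bundles_stream
LIMIT 10;
```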
- Snowflake Task 1:
- A Snowflake Task is a scheduler-like object. Queries or stored procedures can be scheduled to run using cron notation.
- In this architecture, we create Task 1 to read the data from the stream and load it into a staging table. This layer is truncated and reloaded on every run (a sketch follows this list).
- This is done to ensure that new JSON documents are processed every 15 minutes.
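A minimal sketch of Task 1, assuming a staging table, an etl schema for tasks, and an etl_wh warehouse that are not named in the article. INSERT OVERWRITE gives the truncate-and-reload behavior, and consuming the stream in DML advances its offset so each run only sees new documents:

```sql
CREATE TABLE IF NOT EXISTS staging.fhir_bundles_stg (
    record    VARIANT,
    file_name STRING
);

-- Task 1: every 15 minutes, if the stream has new rows, truncate-and-reload staging.
CREATE TASK IF NOT EXISTS etl.load_fhir_staging
    WAREHOUSE = etl_wh                                     -- assumed warehouse
    SCHEDULE  = 'USING CRON */15 * * * * UTC'
    WHEN SYSTEM$STREAM_HAS_DATA('raw.fhir_bundles_stream')
AS
INSERT OVERWRITE INTO staging.fhir_bundles_stg (record, file_name)
SELECT record, file_name
FROM raw.fhir_bundles_stream;

-- Tasks are created suspended; this one is resumed after the downstream task
-- (Task 2, next sketch) has been created.
```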
- Snowflake Task 2:
- This layer will convert the raw JSON document into reporting tables that can be easily referenced by the analytics team.
- To convert JSON documents into a structured format, Snowflake's flattening feature can be used.
- LATERAL FLATTEN is an easy-to-use function that explodes nested array elements, which can then be extracted using the ':' notation, as sketched below.
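A minimal sketch of Task 2, chained after Task 1 so it runs once staging has been reloaded. The reporting table, the JSON paths, and the choice to model Condition resources are illustrative assumptions based on the FHIR Bundle layout (entry[].resource), not something prescribed by the article:

```sql
CREATE TABLE IF NOT EXISTS reporting.patient_conditions (
    patient_ref       STRING,
    condition_code    STRING,
    condition_display STRING
);

-- Task 2: flatten each Bundle's "entry" array and keep only Condition resources.
CREATE TASK IF NOT EXISTS etl.build_conditions
    WAREHOUSE = etl_wh
    AFTER etl.load_fhir_staging            -- runs after Task 1 in the same task graph
AS
INSERT INTO reporting.patient_conditions (patient_ref, condition_code, condition_display)
SELECT
    e.value:resource:subject:reference::STRING,
    e.value:resource:code:coding[0]:code::STRING,
    e.value:resource:code:coding[0]:display::STRING
FROM staging.fhir_bundles_stg s,
     LATERAL FLATTEN(input => s.record:entry) e
WHERE e.value:resource:resourceType::STRING = 'Condition';

-- Resume the child first, then the root, so the 15-minute graph starts running.
ALTER TASK etl.build_conditions RESUME;
ALTER TASK etl.load_fhir_staging RESUME;
```

Chaining with AFTER is one design choice here; Task 2 could equally be given its own cron schedule, as long as it runs after the staging reload completes.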
- It is recommended to use Snowpipe with a few large files rather than many small files; the cost can be high if small files on external storage are not consolidated before loading.
- In a production environment, ensure that automated processes are created to monitor the streams (see the sketch after this list), as once they become stale, data cannot be recovered from them.
- The maximum size of a single JSON document that can be loaded into Snowflake is 16 MB compressed. If you have huge JSON documents that exceed this limit, make sure you have a process to split them before ingesting them into Snowflake.
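A minimal sketch of the kind of stream health check that could back such monitoring; the stream name is the hypothetical one used above:

```sql
-- List streams and inspect staleness metadata from the SHOW output.
SHOW STREAMS IN SCHEMA raw;

SELECT "name", "stale", "stale_after"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- TRUE if the stream still has unconsumed changes waiting for Task 1.
SELECT SYSTEM$STREAM_HAS_DATA('raw.fhir_bundles_stream') AS has_unconsumed_changes;
```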
Managing semi-structured data is always a challenge due to the nested structure of elements embedded within JSON documents. Consider the gradual or even exponential increase in the volume of incoming data before designing the final reporting layer. This article is intended to demonstrate how easy it is to create a streaming pipeline with semi-structured data.
Milind Choudhary is an experienced data engineer/data architect who has a decade of working experience creating data lakes/lakehouses using a variety of conventional and modern tools. He is extremely passionate about data streaming architecture and is also a technical reviewer at Packt & O'Reilly.