Working with sensitive data or in a highly regulated environment requires a secure cloud infrastructure for data processing. The cloud can seem like an open environment on the Internet and pose security concerns. When you start your journey with Azure and don’t have enough experience with resource configuration, it’s easy to make design and implementation mistakes that can impact the security and resiliency of your new data platform. In this post, I’ll outline the most important aspects of designing a cloud-ready framework for a data platform on Azure.
An Azure landing zone is the foundation for deploying resources in the public cloud. It contains essential elements for a robust platform. These elements include networking, identity and access management, security, governance, and compliance. By deploying a landing zone, organizations can streamline the process of configuring their infrastructure, ensuring that best practices and guidelines are used.
An Azure landing zone is an environment that follows key design principles to enable application migration, modernization, and development. In Azure, subscriptions are used to isolate and scale application and platform resources. They are categorized as follows:
- Application landing zones: Subscriptions dedicated to hosting application-specific resources.
- Platform landing zones: Subscriptions that contain shared services, such as identity, connectivity, and management resources, provided for application landing zones.
These design principles help organizations successfully operate in a cloud environment and scale a platform.
Deploying a data platform on Azure involves a high-level architecture design where you select resources for data ingestion, transformation, distribution, and exploration. The first step may require a landing zone design. If you need a secure platform that follows best practices, it’s critical to start with a landing zone. It will help you organize resources within subscriptions and resource groups, define network topology, and ensure connectivity to on-premises environments via VPN, while adhering to naming conventions and standards.
Architectural design
Tailoring an architecture to a data platform requires careful selection of resources. Azure offers native resources for data platforms such as Azure Synapse Analytics, Azure Databricks, Azure Data Factory, and Microsoft Fabric. The available services offer a variety of ways to achieve similar goals, allowing flexibility in architecture selection.
For example:
- Data ingestion: Azure Data Factory or Synapse Pipelines.
- Data processing: Azure Databricks or Apache Spark on Synapse.
- Data analysis: Power BI or Databricks dashboards.
We can use Apache Spark and Python or low-code drag-and-drop tools. Various combinations of these tools can help us create the most suitable architecture based on our skills, use cases, and capabilities.
Azure also allows you to use other components such as Snowflake, or to create your own compositions from open-source software using virtual machines (VMs) or Azure Kubernetes Service (AKS). You can take advantage of VMs or AKS to run data processing, exploration, orchestration, or AI/ML services.
Typical structure of a data platform
A typical data platform on Azure should include several key components:
1. Tools for ingesting data from source systems into an Azure storage account: Azure offers services such as Azure Data Factory, Azure Synapse Pipelines, and Microsoft Fabric for this task.
2. Data Warehouse, Data Lake, or Data Lakehouse: depending on architectural preferences and the business model, we can select different services to store data.
- For a Data Lake or Data Lakehouse, we can use Databricks or Fabric.
- For a Data Warehouse, we can select Azure Synapse, Snowflake, or MS Fabric Warehouse.
3. To orchestrate data processing in Azure, we have Azure Data Factory, Azure Synapse Pipelines, Airflow, or Databricks Workflows.
4. Data transformation in Azure can be handled by several services:
- For Apache Spark: Databricks, Azure Synapse Spark Pools, or MS Fabric Notebooks.
- For SQL-based transformations: Spark SQL on Databricks, Azure Synapse, or MS Fabric; T-SQL on SQL Server, MS Fabric, or a Synapse Dedicated SQL Pool. Alternatively, Snowflake offers full SQL capabilities.
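Whichever orchestrator you choose, the underlying idea is the same: tasks run in dependency order, and the engine works out a valid execution sequence. The sketch below is a conceptual illustration in plain Python (with hypothetical task names), not an Azure Data Factory or Airflow API:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_sales":     set(),
    "ingest_customers": set(),
    "transform":        {"ingest_sales", "ingest_customers"},
    "publish_report":   {"transform"},
}

def execution_order(tasks: dict) -> list:
    """Return one valid run order that respects all dependencies."""
    return list(TopologicalSorter(tasks).static_order())

print(execution_order(pipeline))
# e.g. ['ingest_sales', 'ingest_customers', 'transform', 'publish_report']
```

Real orchestrators add scheduling, retries, and parallelism on top, but a dependency graph like this is the mental model behind all of them.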
Subscriptions
An important aspect of platform design is planning the segmentation of subscriptions and resource groups based on business units and software development lifecycle. It is possible to use separate subscriptions for production and non-production environments. With this distinction, we can achieve a more flexible security model, separate policies for production and test environments, and avoid quota limitations.
Networks
A virtual network is similar to a traditional network operating in your datacenter. Azure Virtual Networks (VNet) provide a basic security layer for your platform. By disabling public endpoints for your resources, you will significantly reduce the risk of data leaks in the event of lost keys or passwords. Without public endpoints, data stored in Azure storage accounts is only accessible when connected to your VNet.
Connectivity to an on-premises network enables a direct connection between Azure resources and on-premises data sources. Depending on the connection type, communication traffic can go through an encrypted tunnel over the Internet or through a private connection.
To improve security in a virtual network, you can use network security groups (NSGs) and firewalls to manage inbound and outbound traffic rules. These rules allow you to filter traffic based on IP addresses, ports, and protocols. Additionally, Azure allows you to route traffic between subnets, virtual and on-premises networks, and the Internet. Using custom route tables allows you to control where traffic is routed.
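To make the filtering idea more concrete, here is a minimal, hypothetical sketch of first-match rule evaluation in plain Python. It only illustrates the concept of priority-ordered NSG rules filtering on IP and port; it is not how Azure evaluates them internally:

```python
import ipaddress

# Hypothetical NSG-style rules: checked in ascending priority order,
# and the first matching rule decides the outcome.
RULES = [
    {"priority": 100, "source": "10.0.1.0/24", "port": 443, "action": "Allow"},
    {"priority": 200, "source": "0.0.0.0/0",   "port": 443, "action": "Deny"},
]

def evaluate(source_ip: str, port: int) -> str:
    """Return the action of the first rule matching the source IP and port."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(RULES, key=lambda r: r["priority"]):
        if port == rule["port"] and ip in ipaddress.ip_network(rule["source"]):
            return rule["action"]
    return "Deny"  # default deny, mirroring Azure's implicit DenyAllInbound rule

print(evaluate("10.0.1.25", 443))   # inside the trusted subnet -> Allow
print(evaluate("203.0.113.9", 443)) # matches only the catch-all -> Deny
```

The takeaway is that rule order (priority) matters: a broad deny rule placed at a lower priority than a narrow allow rule would block the traffic you intended to permit.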
Naming convention
A naming convention establishes a standardization for platform resource names, making them more self-descriptive and easier to manage. This standardization helps in navigating different resources and filtering them in the Azure portal. A well-defined naming convention allows you to quickly identify the type, purpose, environment, and Azure region of a resource. This consistency can be beneficial in your CI/CD processes as predictable names are easier to parameterize.
When defining a naming convention, think about the information you want to capture. The standard should be easy to follow, consistent, and practical. It's worth including elements such as the organization, business unit or project, resource type, environment, region, and instance number. You should also consider the scope of the resources to ensure that names are unique within their context. For certain resources, such as storage accounts, names must be globally unique.
A complete naming convention typically combines the following elements:
- Resource type: An abbreviation representing the type of resource.
- Project name: A unique identifier for your project.
- Environment: The environment the resource supports (for example, development, QA, production).
- Region: The Azure region where the resource is deployed.
- Instance: A number to differentiate between multiple instances of the same resource.
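As an illustration, the elements above can be assembled by a small helper. The abbreviations used here (dbw for a Databricks workspace, st for a storage account, and so on) follow commonly used Azure shorthand, but treat them, and the project name, as placeholders for your own standard:

```python
# Illustrative resource-type abbreviations; align these with your own standard.
RESOURCE_ABBREVIATIONS = {
    "databricks_workspace": "dbw",
    "storage_account": "st",
    "data_factory": "adf",
    "synapse_workspace": "synw",
}

def resource_name(resource_type: str, project: str, environment: str,
                  region: str, instance: int) -> str:
    """Build a hyphen-separated name: <type>-<project>-<env>-<region>-<nnn>."""
    abbrev = RESOURCE_ABBREVIATIONS[resource_type]
    name = f"{abbrev}-{project}-{environment}-{region}-{instance:03d}"
    # Storage account names are stricter: lowercase alphanumeric only,
    # at most 24 characters, and globally unique.
    if resource_type == "storage_account":
        name = name.replace("-", "")[:24]
    return name.lower()

print(resource_name("databricks_workspace", "dataplatform", "dev", "weu", 1))
# -> dbw-dataplatform-dev-weu-001
```

Note how the helper strips hyphens and truncates for storage accounts: encoding such per-resource rules in one place keeps names predictable, which in turn makes them easy to parameterize in CI/CD pipelines.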
Infrastructure as Code
Deploying infrastructure through the Azure portal may seem straightforward, but it often involves numerous detailed steps for each resource. A highly secure infrastructure requires configuration of resources, networks, private endpoints, DNS zones, and more. Resources like Azure Synapse or Databricks require additional internal configuration, such as setting up Unity Catalog, managing secret scopes, and configuring security settings (users, groups, etc.).
Once you are done with the test environment, you will need to replicate the same setup in the QA and production environments. This is where it is easy to make mistakes. To minimize errors that could affect the quality of development, it is recommended to use an Infrastructure as Code (IaC) approach. IaC allows you to define your cloud infrastructure as code in Terraform or Bicep, so you can deploy multiple environments with consistent configurations.
In my cloud projects, I use accelerators to quickly launch new infrastructure configurations. Microsoft also offers accelerators that can be used. Storing infrastructure as code in a repository offers additional benefits such as version control, change tracking, performing code reviews, and integration with DevOps pipelines to manage and promote changes across environments.
If your data platform does not handle sensitive information and you do not need a highly secure data platform, you can create a simpler setup with public internet access without virtual networks (VNets), VPNs, etc. However, in a highly regulated area, a completely different implementation plan is required. This plan will involve collaboration with various teams within your organization, such as DevOps, Platform, and Networking teams, or even external resources.
A secure network, infrastructure resources, and access controls will need to be established first. Only when the infrastructure is ready can development of data processing begin.
If you found this article interesting, please let me know by clicking the “clap” button or giving it a “like” on LinkedIn. Your support is very valuable. If you have any questions or would like advice, feel free to contact me on LinkedIn.