What if multiple data pipelines need to interact with the same API endpoint? Would you really have to declare this endpoint in every pipeline? If the endpoint changes in the near future, you would need to update its value in every file.
Airflow variables are a simple but valuable construct for avoiding redundant declarations across multiple DAGs. Each variable is a key-value pair, made up of a key and a JSON-serializable value, stored in the Airflow metadata database.
What if your code uses tokens or other types of secrets? Hardcoding them in plain text is not a safe approach. Beyond reducing repetition, Airflow variables also help manage sensitive information. With six different ways to define variables in Airflow, selecting the appropriate method is crucial to ensure security and portability.
An often overlooked aspect is the impact that variable retrieval has on Airflow performance. Careless use can flood the metadata database with requests every time the Scheduler parses the DAG files (every thirty seconds by default).
It's pretty easy to fall into this trap unless you understand how DAGs are parsed by the Scheduler and how variables are retrieved from the database.
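To make the trap concrete, here is a minimal simulation in plain Python. It does not use Airflow itself: FakeMetadataDB, parse_top_level, parse_deferred, and simulate are hypothetical names invented for illustration, and the endpoint URL is a placeholder. The sketch assumes that the Scheduler executes the top-level code of a DAG file on every parse, so a variable read placed at module level (typically a Variable.get call in a real DAG) costs one database lookup per parse, while a read placed inside the task callable costs one lookup per task run.

```python
# Simulation only: FakeMetadataDB stands in for the Airflow metadata database,
# and each parse_* function stands in for the Scheduler executing a DAG file.


class FakeMetadataDB:
    """Hypothetical stand-in that counts how many lookups it receives."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.queries = 0

    def get(self, key):
        self.queries += 1
        return self.rows[key]


def parse_top_level(db):
    """DAG file that reads the variable at module level (on every parse)."""
    endpoint = db.get("api_endpoint")  # executed each time the file is parsed

    def task():
        return endpoint

    return task


def parse_deferred(db):
    """DAG file that reads the variable only inside the task callable."""

    def task():
        return db.get("api_endpoint")  # executed only when the task runs

    return task


def simulate(parse_fn, parses, runs):
    """Parse the 'DAG file' `parses` times, then run its task `runs` times."""
    db = FakeMetadataDB({"api_endpoint": "https://example.com/api"})
    task = None
    for _ in range(parses):
        task = parse_fn(db)
    for _ in range(runs):
        task()
    return db.queries


# 2880 parses is one day at one parse every thirty seconds.
top_level_queries = simulate(parse_top_level, parses=2880, runs=1)
deferred_queries = simulate(parse_deferred, parses=2880, runs=1)
print(top_level_queries, deferred_queries)
```

In this toy model, a single DAG file with one top-level read generates thousands of lookups per day for a task that runs once, whereas the deferred version generates a single lookup. Real numbers depend on how many DAG files and variables are involved and on how the Scheduler is configured.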
Before discussing how variables are retrieved from the metastore and which best practices help optimize DAGs, it is important to understand the basics. For now, let's focus on how to declare variables in Airflow.
As already mentioned, there are several different ways to declare variables in Airflow. Some of them happen to be more secure and portable than others, so let's examine them all and try to understand their advantages and disadvantages.
1. Create a variable from the user interface
In this first approach, we will create a variable through the user interface. In the top menu, select Admin → Variables → +.