Gyasi Sutton, MD, MPH. Physician in Training, Coder at heart. Python, R, Node, and Rust.

Setting up Airflow for ETL development

This is part of a larger series of development and setup steps to build a self-contained Medical Record Database using ETL and other technologies. While a lot of what I do here can easily be scripted with pure Python, I am always exploring open-source ways to do things in a structured manner. I hope you find the content useful.

Introduction to Apache Airflow

Apache Airflow is a powerful platform used to programmatically author, schedule, and monitor workflows. It is particularly popular in the data engineering and data science communities for orchestrating complex data workflows, including data ingestion and transformation tasks, such as running dbt (data build tool) models. Airflow allows users to define workflows as code, making it easy to manage, version, and share workflows across teams.

Key Features of Apache Airflow

Directed Acyclic Graphs (DAGs)

At the core of Apache Airflow is the concept of Directed Acyclic Graphs (DAGs). A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Each node in the DAG represents a task, and the edges define the order in which tasks should be executed. DAGs ensure that tasks are executed in a specific sequence without any cycles, which means a task cannot depend on itself, either directly or indirectly.

Simple pipeline example. Each node (task) can be a separate script processing the previous node's output.
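To make the idea of workflows as code concrete, here is a minimal sketch of a DAG that mirrors the pipeline above. The dag_id, the placeholder functions, and the daily schedule are illustrative choices for this post, not something the setup script below creates for you.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables; in a real pipeline these would hold your ETL logic.
def extract():
    print("pulling raw records")


def transform():
    print("cleaning and reshaping the records")


def load():
    print("writing the records to the warehouse")


with DAG(
    dag_id="simple_etl_example",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # The edges of the graph: extract runs before transform, transform before load.
    t1 >> t2 >> t3

Dropping a file like this into the ./dags folder created by the setup script below is all it takes for Airflow to pick it up.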

Integration with Various Data Tools

Apache Airflow is designed to be extensible and integrates seamlessly with a wide range of data tools and services. It supports various operators and hooks that allow you to interact with databases, cloud services, and other data processing tools. This makes it easy to create workflows that involve multiple systems, such as extracting data from a database, transforming it using a tool like dbt, and loading it into a data warehouse.
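As a rough sketch of how that looks in a DAG, the example below pairs the SQLExecuteQueryOperator from the common SQL provider (pulled in alongside the Postgres provider installed in the Dockerfile later in this post) with a plain BashOperator that shells out to dbt. The connection ID, the SQL, and the dbt project path are placeholders you would swap for your own.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="postgres_dbt_example",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Stage step: runs SQL against a Postgres connection defined in the
    # Airflow UI (Admin -> Connections); "my_postgres" is a placeholder ID.
    stage_raw_data = SQLExecuteQueryOperator(
        task_id="stage_raw_data",
        conn_id="my_postgres",
        sql="SELECT 1;",   # stand-in for a real staging query
    )

    # Transform step: calls the dbt CLI that the custom image installs.
    # /opt/airflow/dbt is a hypothetical project location.
    run_dbt_models = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )

    stage_raw_data >> run_dbt_models

Running dbt through a BashOperator keeps things simple for this setup; dedicated dbt orchestration packages exist, but they are beyond the scope of this post.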

Web Interface for Monitoring

Airflow provides a rich web-based user interface that allows users to monitor and manage workflows. The UI provides a clear view of the DAGs, their status, and the logs of each task. Users can trigger tasks manually, view task dependencies, and even retry failed tasks directly from the interface. This makes it easy to keep track of complex workflows and quickly identify and resolve issues.

Setting Up the Environment with Docker-Compose

Docker-Compose is particularly useful for setting up Apache Airflow in a development or testing environment, as it allows you to define all the necessary services in a single configuration file and start them with a single command. We will download the official docker-compose.yaml file, modify it, and create a Dockerfile that adds dbt and a few provider packages, all with one simple bash script.

#!/bin/bash

# Define custom ports
WEB_SERVER_PORT=8081
FLOWER_PORT=5556

# Create necessary directories
mkdir -p ./dags ./logs ./plugins

# Set environment variable for Airflow user
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Download the docker-compose.yaml file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.2/docker-compose.yaml'

# Change permissions to make the file writable
chmod u+w docker-compose.yaml

# Modify the docker-compose.yaml file to use custom ports
sed -i "s/8080:8080/${WEB_SERVER_PORT}:8080/g" docker-compose.yaml
sed -i "s/5555:5555/${FLOWER_PORT}:5555/g" docker-compose.yaml

# Uncomment the build line and remove the image line
sed -i '/^\s*image:.*apache\/airflow/d' docker-compose.yaml
sed -i 's/^\s*# build: \./  build: ./' docker-compose.yaml

# Verify the modification
if grep -q '^\s*build: \.$' docker-compose.yaml; then
    echo "Successfully updated docker-compose.yaml with build context."
    sleep 5
else
    echo "Failed to update docker-compose.yaml with build context." >&2
    exit 1
fi

# Create a Dockerfile to install dbt and plugins
cat <<EOF > Dockerfile
FROM apache/airflow:2.10.2

# Install dbt
RUN pip install dbt-core dbt-postgres

# Install Airflow provider packages
RUN pip install apache-airflow-providers-slack
RUN pip install apache-airflow-providers-amazon
RUN pip install apache-airflow-providers-google
RUN pip install apache-airflow-providers-postgres

# Ensure the image keeps running as the default airflow user
USER airflow
EOF

# Build the custom Docker image
docker-compose build

# Initialize the Airflow database
docker-compose up airflow-init

# Start Airflow services in detached mode
docker-compose up -d

# Output the access information
echo "Airflow is running on http://localhost:${WEB_SERVER_PORT}"
echo "Flower is running on http://localhost:${FLOWER_PORT}"

In essence, the script sets a few environment variables, modifies the downloaded docker-compose file, and brings everything up. The generated Dockerfile installs dbt along with some Airflow provider packages, and you can modify it to your preferences. I'll share this bash script as a gist; if your environment is Linux, make sure you run 'chmod +x <gist_file>' to make it executable. It goes without saying that you need Docker and docker-compose installed on your system. One quick point: you can set your own port variables, which can be necessary if you are running a lot of Docker containers. I run many different containers, and I don't want the ports on my host machine conflicting with each other.

A quick explanation of the Docker-Compose file

When setting up Apache Airflow using Docker-Compose, several services are typically included:

  • Scheduler: The scheduler is responsible for scheduling jobs and ensuring that tasks are executed according to their dependencies and schedules. It continuously monitors the DAGs and triggers tasks when their dependencies are met.
  • Webserver: The webserver hosts the Airflow web interface, allowing users to interact with the DAGs, view logs, and manage tasks.
  • Worker: Workers are responsible for executing the tasks defined in the DAGs. In a distributed setup, multiple workers can be used to parallelize task execution and improve performance.
  • Init Service: The init service is used to initialize the Airflow database and ensure that all necessary configurations are in place before starting the other services.

Running Services

Once you execute the script without errors, the final echo statements will print the access URLs for Airflow and Flower, using the ports defined at the top of the script.

You can now navigate to localhost:8081 and log in to the Airflow web interface.

Feel free to explore the interface, or open a bash shell inside the container to verify the install with 'dbt --version'.

You will see a lot of example DAGs.
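If you would rather verify things from code than from the browser, the same webserver also exposes Airflow's stable REST API. Here is a minimal sketch that lists a few of those DAGs, assuming the default airflow/airflow login created by the official docker-compose file and the 8081 port set in the script above.

import requests

# Default credentials from the official docker-compose.yaml; adjust them if you
# changed _AIRFLOW_WWW_USER_USERNAME / _AIRFLOW_WWW_USER_PASSWORD.
AIRFLOW_API = "http://localhost:8081/api/v1"
AUTH = ("airflow", "airflow")

# List the first few DAGs the scheduler has parsed; expect the bundled examples.
response = requests.get(f"{AIRFLOW_API}/dags", auth=AUTH, params={"limit": 5})
response.raise_for_status()

for dag in response.json()["dags"]:
    print(dag["dag_id"], "- paused:", dag["is_paused"])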

Recap

In this tutorial, we explored how to set up Apache Airflow. By leveraging Airflow's powerful workflow orchestration capabilities, you can manage complex data processes more effectively and efficiently. We covered the essential features of Airflow, including Directed Acyclic Graphs (DAGs), integration with various data tools, and the intuitive web interface for monitoring workflows.

Additionally, we walked through the process of using Docker-Compose to simplify the environment setup, ensuring that all necessary services are configured and running seamlessly. With the provided bash script, you can easily customize your Airflow instance to fit your specific needs, including adding plugins and managing port configurations to prevent conflicts.

Now that you have a solid foundation for using Apache Airflow, you can begin to implement it in your own data projects. Whether you're automating data ingestion, transformation, or analysis, Airflow provides a structured and scalable solution that can grow with your needs. I hope this tutorial has equipped you with the knowledge and tools to harness the full potential of Apache Airflow in your workflows.

Hey everyone, let's keep teaching and learning!

"Creating and Managing Scalable Application Environments with Traefik and Docker: A Comprehensive Tutorial"

Traefik is an open-source reverse proxy and load balancer that enables you to define your own route rules, SSL certificates, and more for your applications. It...

Understanding Gists for Sharing Code

IntroductionSharing code snippets efficiently is crucial for collaboration and learning. GitHub Gists offer a streamlined way to share code or text snippets, making them accessible to...