What is Apache Airflow?
Apache Airflow is a platform to programmatically author, schedule and monitor workflows.
When workflows are defined as code, they become more maintainable, version-able, testable, and collaborative.
Use Airflow to author workflows and orchestrate data pipelines as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
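To make the idea of tasks and dependencies concrete, the sketch below models a small pipeline as a DAG and computes an order that a scheduler could follow. It uses only Python's standard library (`graphlib`, Python 3.9+) rather than Airflow's own API, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before that task may start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
    "notify": {"load"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency graph has no cycles (i.e. it is a DAG)."""
    try:
        run_order(deps)
    except CycleError:
        return False
    return True


order = run_order(dependencies)
print(order)
```

"report" and "notify" do not depend on each other, so either may come first. This is also why such tasks can be handed to different workers at the same time. In Airflow itself, the scheduler works out this ordering for you.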
History of Apache Airflow
Apache Airflow is a project started at Airbnb in October 2014. It's a highly flexible tool that can be used to automate all sorts of processes, including ETL pipelines.
Airflow has gained popularity over the past couple of years and is listed among the most popular tools on StackShare. Many companies already use Airflow in production.
What is a workflow?
A workflow is a description of what you want to achieve, while a DAG describes how to achieve it. It's the difference between "I need to get from New York to San Francisco" and "my flight leaves JFK on time and arrives at SFO after a stopover in Denver".
Apache Airflow is a tool to express and execute workflows as directed acyclic graphs (DAGs).
It has a web UI where you can see all your DAGs and their status: how long each run took, whether a DAG is currently running or has failed, which tasks still need to run, and much more.
Example of a workflow on Apache Airflow
Apache Airflow allows data stewards (usually data analysts or engineers) to create workflows that sort and process raw data, transforming it into information that people inside and outside the data organization can understand. For example, a data analyst may set up a workflow that calculates sales earnings on a daily basis: the workflow takes each purchase recorded on the customer's end, loads it, transforms it into a value that a table or database understands, and then creates a chart or other final product that presents the result.
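The daily sales workflow described above can be sketched as three plain Python steps. This is only an illustration of the shape of such a pipeline, not Airflow code: the sample records, field names, and function names are all assumptions, and in Airflow each step would typically become its own task in a DAG.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical raw purchase records, as they might arrive from a checkout system.
RAW_PURCHASES = [
    {"day": "2024-01-01", "amount": "19.99"},
    {"day": "2024-01-01", "amount": "5.01"},
    {"day": "2024-01-02", "amount": "12.50"},
]


def extract():
    """Step 1: fetch the raw purchases (here, just copy the sample data)."""
    return list(RAW_PURCHASES)


def transform(rows):
    """Step 2: sum the purchase amounts per day, using Decimal for money."""
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row["day"]] += Decimal(row["amount"])
    return dict(totals)


def load(totals):
    """Step 3: shape the totals as sorted (day, total) rows for a table or chart."""
    return sorted(totals.items())


daily_sales = load(transform(extract()))
for day, total in daily_sales:
    print(day, total)
```

Keeping each step as a separate function mirrors how a workflow tool treats them: if the transform fails, it can be retried without extracting the data again.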
Components of Apache Airflow Architecture
As described by Qubole, the Apache Airflow architecture has the following characteristics:
- Dynamic: Airflow pipelines are configured as code (Python), allowing for dynamic pipeline generation. This allows users to write code that instantiates pipelines dynamically.
- Extensible: Easily define your own operators and executors, and extend the library so it fits the level of abstraction that suits your environment.
- Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the Jinja templating engine.
- Scalable: Airflow has a modular architecture and uses a message queue to communicate with and orchestrate an arbitrary number of workers.
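As a rough illustration of the "Dynamic" and "Elegant" points, the sketch below generates one pipeline description per source table in a loop and fills a parameterized query with a run date. It does not use Airflow: the pipelines are plain dictionaries, every name is hypothetical, and `string.Template` with `$name` placeholders stands in for the Jinja templating that Airflow actually uses.

```python
from string import Template

# Hypothetical list of source tables; in practice this might come from a config file.
SOURCE_TABLES = ["orders", "customers", "refunds"]

# Stand-in for a templated query. Airflow uses Jinja syntax, not $placeholders.
QUERY_TEMPLATE = Template("SELECT * FROM $table WHERE day = '$ds'")


def render_query(table, ds):
    """Fill the query template for one table and one run date."""
    return QUERY_TEMPLATE.substitute(table=table, ds=ds)


def build_pipeline(table):
    """Return a plain-dict description of an extract -> load pipeline for one table."""
    extract_task = f"extract_{table}"
    load_task = f"load_{table}"
    return {
        "pipeline_id": f"sync_{table}",
        "tasks": [extract_task, load_task],
        # The load task may only start after the extract task has finished.
        "dependencies": {load_task: {extract_task}},
    }


# One pipeline per table, generated by code instead of written out by hand.
pipelines = {p["pipeline_id"]: p for p in map(build_pipeline, SOURCE_TABLES)}

for pipeline_id in pipelines:
    print(pipeline_id)
print(render_query("orders", "2024-01-01"))
```

Adding a fourth table then means adding one entry to the list rather than copying a whole pipeline definition.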