Managed Workflows for Apache Airflow (MWAA) is your best bet for most greenfield automation projects. However, it requires more work and more knowledge, both upfront and as new features are added. If you only need to move and transform data, and if you are willing to commit to the AWS ecosystem, Glue might be a lower-maintenance option. In some cases, it makes sense to use both.
I started out writing an article explaining why Amazon Managed Workflows for Apache Airflow (MWAA) is a better and more flexible option than older workflow offerings like Step Functions, Data Pipeline, Simple Workflow Service (SWF), and Batch. But when I got to AWS Glue, I wasn’t so sure.
What’s so great about Airflow?
As its name suggests, AWS Glue unites numerous other data-related services in the AWS ecosystem. It helps you push data between S3, Redshift, DynamoDB, RDS, and more, performing arbitrary transformations all along the way. Glue will happily scale up to enormous workloads and right back down to zero, billing you only for what you use.
With Airflow–even managed Airflow from AWS–none of that comes out of the box. You can get autoscaling, but you’re ultimately responsible for provisioning workers and schedulers. You can integrate with AWS services, but you do it by writing Python code.
So why do I like Airflow at all? Why wasn’t I writing an article saying that Glue is the one tool to rule them all?
Because it isn’t. Glue is a tool for moving and transforming data, and that’s it. Airflow is a tool for orchestrating whatever black boxes you want. Whether you’re sending API requests, uploading to servers, or launching, terminating, or polling external processes, Airflow has you covered. To do this, Airflow forces you to make more architectural decisions. In return, you get an abstraction layer that seamlessly handles branching decisions, retries, complex schedules, and more.
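To make the orchestration idea concrete, here is a minimal sketch of a DAG that makes a branching decision, retries failed tasks, and runs on a cron schedule. It assumes Airflow 2.4 or later in the 2.x line (where `DAG` accepts a `schedule` argument). The DAG id, the task ids, and the `count_new_rows` stub are hypothetical placeholders rather than anything from a real project.

```python
from datetime import datetime, timedelta

# Retry policy inherited by every task in the DAG (values are illustrative).
DEFAULT_ARGS = {"retries": 3, "retry_delay": timedelta(minutes=5)}


def count_new_rows() -> int:
    """Hypothetical stub: a real task would query a database or call an API."""
    return 42


def choose_next_step(new_rows: int) -> str:
    """Return the task_id of the branch to follow; Airflow skips the other."""
    return "notify_partner_api" if new_rows > 0 else "nothing_to_do"


def decide(ti) -> str:
    """Branch callable: reads the upstream task's return value from XCom."""
    return choose_next_step(ti.xcom_pull(task_ids="count_new_rows"))


def build_dag():
    # Imported here so the helpers above can be unit tested without Airflow.
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.python import BranchPythonOperator, PythonOperator

    with DAG(
        dag_id="nightly_partner_sync",
        start_date=datetime(2024, 1, 1),
        schedule="0 2 * * 1-5",  # 02:00 on weekdays
        catchup=False,
        default_args=DEFAULT_ARGS,
    ) as dag:
        count = PythonOperator(task_id="count_new_rows", python_callable=count_new_rows)
        branch = BranchPythonOperator(task_id="decide", python_callable=decide)
        # Placeholders for real work such as an HTTP call or a file upload.
        notify = EmptyOperator(task_id="notify_partner_api")
        skip = EmptyOperator(task_id="nothing_to_do")
        count >> branch >> [notify, skip]
    return dag
```

In a real DAG file you would assign `dag = build_dag()` at module level so the scheduler can discover it. Keeping the decision logic in a plain function like `choose_next_step` means it can be tested without a running Airflow environment.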
Who’s using which?
As a very coarse approximation, I searched separately for “glue” and “airflow” on LinkedIn Jobs, then looked at the first 50 unique (and relevant) positions. Standardizing the titles, I wound up with the following:
For Airflow, there is a clear tilt towards machine learning and data science. Jobs that require Glue, meanwhile, are more squarely focused on data infrastructure.
Here we see the trade-off in action: Glue gives you more power out of the box, but Airflow is way more flexible. Machine learning engineers need to update their models to reflect the newest data. The process by which this occurs is extremely varied, and (as of this writing) usually custom-made. Airflow provides a single orchestration tool that works from ingestion to delivery, no matter what that delivery looks like.
If you don’t need Airflow’s flexibility, you might like to have something that is highly resilient to isolated failures, gets rid of the scaling problem, and has a clear “right way” to talk to your enterprise cloud resources. That would be Glue.
Is there a world where you’re using both?
Although Glue is presented as an end-to-end solution for data transformation and movement around the AWS ecosystem, there’s no reason why you couldn’t think of a Glue ETL job as a discrete step in a larger ensemble of operations.
Airflow provides you with the means to do just that, with hooks, operators, and sensors for AWS Glue and its related services. There are also a few tutorials, both from AWS and others, that use this technique.
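As a rough sketch of that pattern, the DAG below runs an existing Glue job between two ordinary Python tasks. It assumes Airflow 2.4 or later with the `apache-airflow-providers-amazon` package installed. The Glue job name `orders_etl`, the `--run_date` argument, and the task names are all hypothetical.

```python
from datetime import datetime


def glue_script_args(run_date: str) -> dict:
    """Arguments handed to the Glue script; Glue expects '--'-prefixed keys."""
    return {"--run_date": run_date}


def build_dag():
    # Lazy imports keep the helper above testable without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

    with DAG(
        dag_id="orders_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        prepare = PythonOperator(
            task_id="prepare_inputs",
            python_callable=lambda: print("stage files, call APIs, etc."),
        )
        # Assumes a job named "orders_etl" already exists in Glue.
        # "{{ ds }}" is rendered to the run's logical date, assuming
        # script_args is a templated field in your provider version.
        transform = GlueJobOperator(
            task_id="run_glue_etl",
            job_name="orders_etl",
            script_args=glue_script_args("{{ ds }}"),
            wait_for_completion=True,
        )
        publish = PythonOperator(
            task_id="publish_results",
            python_callable=lambda: print("notify downstream consumers"),
        )
        prepare >> transform >> publish
    return dag
```

From Airflow's point of view the Glue job is just one more task: it gets the same retry, scheduling, and dependency handling as the Python tasks on either side of it.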
One reason you might want to do this is to take advantage of Glue’s two graphical tools:
- DataBrew, a code-free tool for cleaning and organizing data; and
- Glue Studio, a code-generating tool for planning ETL jobs.
I must admit, I’m slightly wary of using tools like these for production processes. They are useful for prototyping, but for data that’s used inside a product (or to make decisions), I’d like to see a more robust, test-driven approach, and that means code.
You might also use both to leverage Glue’s specialization. Minseok Song points out that Glue is easier to debug than Airflow, owing to its laser focus on ETL. For this reason, you might implement your ETL tasks in Glue, then embed these within a larger sequence of tasks.
Another reason is to take advantage of Glue’s crawlers, which can discover new data and record its schema and location in the Glue Data Catalog. The contents of the Data Catalog can be accessed from within Airflow via a hook and used for anything you like.
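Here is a minimal sketch of reading the Data Catalog from Airflow, assuming the `apache-airflow-providers-amazon` package and its `GlueCatalogHook`. The database name `sales`, the table name `orders`, and the `aws_default` connection are hypothetical, and the example assumes the table is partitioned by an ISO-formatted date so that string ordering matches date ordering.

```python
from typing import Optional, Set, Tuple


def latest_partition(partitions: Set[Tuple[str, ...]]) -> Optional[Tuple[str, ...]]:
    """Pick the lexicographically greatest partition, or None if there are none."""
    return max(partitions) if partitions else None


def find_latest_orders_partition() -> Optional[Tuple[str, ...]]:
    # Lazy import so latest_partition can be tested without Airflow installed.
    from airflow.providers.amazon.aws.hooks.glue_catalog import GlueCatalogHook

    hook = GlueCatalogHook(aws_conn_id="aws_default")
    # get_partitions returns a set of tuples, one tuple per partition.
    partitions = hook.get_partitions(
        database_name="sales",  # hypothetical database
        table_name="orders",    # hypothetical table
    )
    return latest_partition(partitions)
```

A function like `find_latest_orders_partition` could be called from a Python task that runs downstream of a crawler, so later tasks know which slice of data to process.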
The final word
Anything you can do in Glue, you can eventually do in Airflow. If Glue can do everything you need, you can do it sooner with Glue. But Airflow is still the best choice in most situations, especially now that it has a managed offering from Amazon.
- David Bruce Borenstein, PhD