From manual to declarative: Terraform and IaC in a fast growing company

Workable Tech Blog
17 min read · Nov 24, 2023

Our journey from manually provisioning and maintaining infrastructure resources, to fully declarative Infrastructure as Code (IaC).

by Theodore Kirkiris, Senior Site Reliability Engineer @ Workable
and Konstantinos Rousopoulos, Senior Site Reliability Engineer @ Workable

Summary

In this article, we discuss our Infrastructure as Code (IaC) journey at Workable. If you have experience with IaC, and in particular have adopted Terraform as your tool of choice, you already understand the challenges and frustrations of organizing and architecting the code so that it is DRY, simple, and maintainable (a bug or a typo can misconfigure or even wipe out a significant number of production resources), while also keeping it scalable, flexible, and extendable to meet the ever-changing needs of a fast-growing company. Although this article traces the evolution of our IaC over the years, if you are just looking for a suggestion on how to structure your own, feel free to skip to the last section. There we describe our current architecture, which we believe is DRY (Don't Repeat Yourself) and flexible, and which has given our team an efficiency and confidence boost in managing our ever-growing infrastructure.

While parts of the engineering evolution at Workable are briefly mentioned to provide context, it is by no means a complete journey. If you are interested in knowing more, you should watch the great talk our VPs of Engineering gave at Voxxed Days.

When Workable started in 2012, a very small team of engineers worked on the first generation, which was a monolith requiring very few infrastructure resources. We used PostgreSQL as the primary persistence layer, Solr for text search, and Redis for distributed caching. Our code was deployed on Heroku, a container-based Platform as a Service (PaaS) that provided integrated data services and a powerful ecosystem for deploying and running modern applications. This allowed developers to focus on the application logic without the added complexity of managing infrastructure at production capacity.

The product kept growing over the years and, fast forward to 2016, the engineering team had already introduced a few microservices to offer a wider range of features that enhanced the user experience. However, these microservices didn't quite fit the monolith's stack and domain, and there were more to come. The code was still deployed on Heroku, but the product had grown significantly, we had already started using services from external providers (cloud storage, databases etc.), and infrastructure maintenance and monitoring remained a shared responsibility among the engineering team. This is when the SRE team was formed.

In the beginning, we were provisioning everything manually while assessing the situation and how our infrastructure should be managed. In 2017, as we were scaling up as a company, we reached a point where we needed a more efficient way to provision, configure, and manage our infrastructure. We needed a process that would move us away from repetitive, manual tasks that were becoming more complex and error-prone, simplify and standardize the configuration of our infrastructure, and allow us to scale with minimal effort.

This is our story with Terraform: a journey of defining Infrastructure as Code that grows and transforms along with our company.

1G IaC — Adopting Terraform

Infrastructure as code (IaC) tools allow you to manage infrastructure using configuration files. These tools help you build, modify, and maintain your infrastructure in a safe, consistent, and repeatable way, by defining resource configurations that you can version, reuse, and share. Terraform is one of these tools, developed by HashiCorp.

Terraform allows you to define resources and infrastructure in human-readable configuration files, using a declarative language also developed by HashiCorp (HCL: HashiCorp Configuration Language). Declarative means that it describes the desired end-state for your infrastructure, as opposed to procedural languages that require well-defined, step-by-step instructions for completing tasks (like Ansible in the IaC realm).

Terraform also enables you to manage the lifecycle of your infrastructure by maintaining a state file that serves as a source of truth, identifying any changes and applying them to ensure that the end state of the infrastructure matches the defined configuration. Additionally, it can determine dependencies between resources and create or destroy them in the correct order.
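
To make the declarative part concrete, here is a minimal, hypothetical example (not one of our actual resources): we state that a bucket with certain properties should exist, and Terraform works out whether it needs to be created, updated, or left alone.

resource "aws_s3_bucket" "assets" {
  # We declare the desired end state; Terraform decides what to do to reach it.
  bucket = "example-staging-assets"

  tags = {
    provisioner = "terraform"
    env         = "staging"
  }
}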

In the early stages, our infrastructure requirements were not very complex and they were limited to a single provider. As a result, the directory structure of our Terraform code followed the standards of the time for small-to-medium complexity infrastructure:

infrastructure
└── aws
    ├── production
    │   ├── eu-west-1
    │   │   ├── ec2
    │   │   ├── lambda
    │   │   └── ...
    │   ├── global
    │   │   └── iam
    │   └── us-east-1
    │       ├── ec2
    │       ├── rds
    │       └── ...
    └── staging
        └── ...

This was fine for managing a few resources here and there: it was clean, it was easy to track what was provisioned and where, and it kept our production and staging environments completely isolated. However, adopting a microservice architecture and scaling out to several environments for development, testing, and production made things complicated:

  • We didn't use modules, so our code was neither DRY nor standardized
  • Configuration files became cumbersome as more and more instances of the same AWS resource were introduced in a single file:
    – Updating/reviewing them required more time and effort
    – Terraform took increasingly more time to finish planning and applying

We needed to manage the infrastructure for a considerable number of microservices where each one:

  • requires a different set of resources
  • is deployed in several environments, with slightly different configuration

and add some business logic to our code for conditionally creating and configuring the infrastructure (a rough sketch follows this list). For example:

  • a microservice requires dedicated resources for production but can use shared resources in development or testing environments, e.g. cloud storage
  • infrastructure for development and testing environments does not need to be highly available
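
In Terraform terms, that kind of conditional provisioning boils down to something like the following hypothetical snippet (variable, resource, and bucket names are illustrative): a dedicated bucket is only created when the environment asks for one, otherwise the microservice points to a shared bucket.

variable "dedicated_storage" {
  description = "Provision a dedicated bucket (production) or reuse a shared one (dev/testing)"
  type        = bool
  default     = false
}

resource "google_storage_bucket" "dedicated" {
  # Created only when dedicated storage is requested for this environment.
  count    = var.dedicated_storage ? 1 : 0
  name     = "production-microservice1-assets"
  location = "EU"
}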

However, the directory structure did not reflect our microservice architecture, so we were unable to bundle all the necessary resources for a single microservice. This made it difficult to provision or deprecate them all at once, potentially leaving orphaned resources behind.

In addition to the above, in 2018, we made the decision to abandon Heroku and adopt Kubernetes as our deployment platform. While Heroku had proven to be an excellent choice, it became cost-prohibitive as our company scaled up and the demand for new resources grew. Additionally, it lacked the flexibility and detailed monitoring capabilities that we required. This decision significantly increased the size and complexity of the infrastructure we needed to manage with Terraform. At that point, we knew that the simple structure we had in place was not scalable and we needed to take a step back and rethink our IaC architecture.

Terraform is an excellent tool for managing infrastructure but it has its limitations, more so back in 2018. Implementing business logic was challenging, sharing configuration files was not as straightforward, and changing the directory structure to support “applications per environment” could not be handled natively. Even if we used Terraform modules, we would still have to manually duplicate a significant amount of code for each module and environment, making our configuration anything but DRY, and would not solve our problem of bundling resources together.

You may be thinking of CI/CD right now. I hear you, but we are not there yet.

The goal was to switch from a resource-based to a microservice-based structure, but Terraform was missing the key features we required to redesign our IaC. This is when Terragrunt entered the game.

2G IaC — Terragrunt to the rescue

Terragrunt is a Terraform wrapper developed by Gruntwork that provides extra tools for keeping configurations DRY, working with multiple Terraform modules, and managing remote state.

Terragrunt has enabled us to:

  • keep Terraform code DRY
  • ditch duplicated backend code
  • inherit configuration from parent directories
  • apply multiple modules at once

which allowed us to create the desired microservice blueprints and reorganize our IaC accordingly.
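
For example, the duplicated backend configuration can be replaced with a single remote_state block in the root terragrunt.hcl, which every child configuration inherits. A hypothetical sketch (backend and bucket name are illustrative):

# Root terragrunt.hcl: the backend is defined once and inherited everywhere.
remote_state {
  backend = "gcs"
  config = {
    bucket = "example-terraform-state"
    prefix = path_relative_to_include()
  }
}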

During this phase we redesigned the directory structure of our code to align with the hierarchy of our infrastructure: a design that is agnostic to cloud vendors and service providers, and has a clear isolation of scope by:

  • organization, e.g. staging, production etc.
  • environment, e.g. testing, development, production etc.
  • microservice

Since we decided to restructure and be more consistent with our IaC, we needed to establish a new and consistent naming scheme for our resources. A scheme that would apply to both Terraform resources and the actual cloud infrastructure. Considering each environment to be the highest level of abstraction and that resources will not be shared among different environments, we came up with a naming convention proposal for Terraform resource names, variable names, and resource tags.

Modules

Having a clear directory structure and naming conventions, we began breaking our infrastructure into modules to create abstractions and describe our infrastructure in terms of its architecture, rather than directly in terms of physical objects. Following Terraform’s best practices, we created reusable modules to bundle different resources for each microservice (often from different providers) and incorporate business logic for conditional provisioning and configuration.

Eventually, our directory structure looked like this:

modules/
├── README.md
└── organization
    └── infrastructure
        └── environments
            ├── gke
            │   ├── firewall.tf
            │   ├── iam.tf
            │   ├── main.tf
            │   ├── nodepools.tf
            │   ├── providers.tf
            │   ├── remote_state.tf
            │   └── variables.tf
            ├── microservice1
            │   ├── README.md
            │   ├── iam.tf
            │   ├── mongo.tf
            │   ├── providers.tf
            │   ├── s3.tf
            │   └── variables.tf
            └── microservice2
                ├── README.md
                ├── cloudfront.tf
                ├── iam.tf
                ├── iam_policy.json
                ├── postgres.tf
                ├── providers.tf
                ├── remote_state.tf
                ├── s3.tf
                ├── s3_policy.json
                └── variables.tf

By switching to Terraform modules, we were able to achieve the abstraction we needed. However, the code needed to instantiate each module (setting input variable values, defining output variables, configuring the providers, and providing a remote state) still created a lot of maintenance overhead.
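
To give a feel for that overhead, here is a simplified, hypothetical example of the wiring each environment needed: backend, providers, and a module instantiation, copy-pasted per environment with only a handful of values changing.

# Repeated for every environment, with only a few values differing.
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"
    prefix = "staging/dev/microservice1"
  }
}

provider "aws" {
  region = "eu-west-1"
}

module "microservice1" {
  source = "../../../modules/organization/infrastructure/environments/microservice1"

  env_name          = "dev"
  microservice_name = "microservice1"
  # ...plus every other input variable, repeated in each environment
}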

Live

With Terragrunt, we were able to add a level of abstraction and promote a versioned, immutable artifact of our code across different environments. The tool fetches remote Terraform configurations, which are written as typical Terraform code and only require input values for anything that should differ between environments.

In a separate repository, following a similar directory structure, we defined the live code for all of our environments. For each microservice in each environment, this live code consists of only three files (a brief hypothetical example follows this list):

  • [REQUIRED] a Terragrunt .hcl file to specify the source of the code
  • [REQUIRED] a Terraform .auto.tfvars file which should only contain the necessary key/value pairs for configuring the resources
  • [OPTIONAL] a Terraform secrets .auto.tfvars file, used only when we need to keep secrets in the configuration. To keep our secrets safe, we use git-crypt, which enables transparent encryption and decryption of files in a git repository.
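
As a simplified, hypothetical example of those files for one microservice (the module repository URL, version tag, and values are illustrative):

terragrunt.hcl:

include {
  path = find_in_parent_folders()
}

terraform {
  # Pin a versioned, immutable artifact of the module code.
  source = "git::git@github.com:example/terraform-modules.git//environments/microservice1?ref=v1.4.0"
}

variables.auto.tfvars:

# Only the values that differ for this microservice in this environment.
microservice_pg_engine_version    = "14.7"
microservice_pg_allocated_storage = 100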

This way, modules were organization- and environment-agnostic, while the configuration code differed for each microservice across environments and organizations. Eventually, our live config looked like this:

live/
├── production
│   ├── org_config.auto.tfvars
│   ├── production1
│   │   ├── env_config.auto.tfvars
│   │   ├── gke
│   │   │   ├── terragrunt.hcl
│   │   │   ├── variables.auto.tfvars
│   │   │   └── secrets.auto.tfvars
│   │   ├── microservice1
│   │   │   ├── terragrunt.hcl
│   │   │   └── variables.auto.tfvars
│   │   └── microservice2
│   │       ├── terragrunt.hcl
│   │       └── variables.auto.tfvars
│   └── production2
│       ├── env_config.auto.tfvars
│       ├── gke
│       │   ├── terragrunt.hcl
│       │   ├── variables.auto.tfvars
│       │   └── secrets.auto.tfvars
│       ├── microservice1
│       │   ├── terragrunt.hcl
│       │   └── variables.auto.tfvars
│       └── microservice2
│           ├── terragrunt.hcl
│           └── variables.auto.tfvars
└── staging
    ├── dev
    │   ├── env_config.auto.tfvars
    │   ├── gke
    │   │   ├── terragrunt.hcl
    │   │   ├── variables.auto.tfvars
    │   │   └── secrets.auto.tfvars
    │   ├── microservice1
    │   │   ├── terragrunt.hcl
    │   │   └── variables.auto.tfvars
    │   └── microservice2
    │       ├── terragrunt.hcl
    │       └── variables.auto.tfvars
    ├── org_config.auto.tfvars
    └── qa
        ├── env_config.auto.tfvars
        ├── gke
        │   ├── terragrunt.hcl
        │   ├── variables.auto.tfvars
        │   └── secrets.auto.tfvars
        ├── microservice1
        │   ├── terragrunt.hcl
        │   └── variables.auto.tfvars
        └── microservice2
            ├── terragrunt.hcl
            └── variables.auto.tfvars

And just like that, we had everything in place in terms of code, and for the most part, it worked like a charm. Terraform is a tool for creating and managing the infrastructure on which your applications will run. You can define the resources and their specifications in a declarative way, and it will map out the dependencies, build everything for you, and even maintain a state to ensure parity between the current and desired end-state of your infrastructure. However, this does not apply to the software. Terraform is designed to create the resources themselves but not to manage the software they are running.

Running most of our workloads on Kubernetes and some of them on VMs meant we needed to manage many clusters and individual servers with a certain degree of customization in terms of software. For example, we wanted software that would enable us to build our GitOps pipelines (Flux), extend the networking capabilities of the cluster (Istio), or bootstrap services (mostly internal tools like Redash and Airflow) that were running on VMs.

Maintaining several environments required a consistent, reliable, and secure way to install, configure, and manage the software while ensuring absolute integrity. This is where Ansible came in. Although Terraform providers were available for some of these managed resources at the time, their limited maturity, combined with the team's capacity and existing familiarity with Ansible, led us to adopt Ansible for specific customization tasks, mostly software installation.

Ansible

Ansible is an IT automation (IaC) tool that can configure systems, deploy software, and orchestrate more advanced IT tasks, such as continuous deployments or zero downtime rolling updates. Ansible executes a set of predefined steps and focuses on the automation process, rather than the desired end state.

While both Terraform and Ansible can be used for IaC, they are not mutually exclusive. Terraform follows a declarative approach that is ideal for provisioning, modifying, managing, and destroying infrastructure resources based on configuration files. Ansible is primarily a configuration management tool that follows a procedural approach, excelling at configuring resources when specific steps need to be executed in a specific order, such as installing/updating software, configuring runtime environments, updating system configuration files, etc.

Combining Terraform with Ansible has created a flexible workflow for spinning up new infrastructure and configuring both the necessary hardware and software. We utilized Terraform’s local-exec provisioner to execute Ansible playbooks from within our modules, and templating to customize the playbooks based on Terraform variables. This hybrid approach allowed us to quickly bootstrap our clusters with dependencies for our workloads. Additionally, since the playbook was integrated into a module, we were able to ensure consistency across environments, simplifying maintenance and troubleshooting since all provisioned resources would be configured in the same way… Or not :)
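
Mechanically, the pattern looked roughly like the following hypothetical sketch (playbook path, trigger, and variables are illustrative, not our actual setup):

variable "cluster_name" {
  type = string
}

resource "null_resource" "cluster_bootstrap" {
  # Re-run the playbook whenever its contents change.
  triggers = {
    playbook_sha = filesha256("${path.module}/playbooks/bootstrap.yml")
  }

  provisioner "local-exec" {
    # Feed Terraform values into the playbook as extra vars.
    command = "ansible-playbook ${path.module}/playbooks/bootstrap.yml --extra-vars 'cluster_name=${var.cluster_name}'"
  }
}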

Having everything in place meant we were off to a good start, but by the time we finished, we started hitting other types of limitations that originated from other areas.

Issues

WET code
The vast majority of our microservices require various types of resources, such as an RDS instance or an S3 bucket, which is why we decided to switch from a resource-based to a microservice-based design. However, duplicating the definition of the same type of resource across different modules not only led to WET code but also created a lack of homogeneity in configuration. This added significant cognitive overhead whenever we needed to make a horizontal change and keep our code consistent, such as enforcing encryption at rest for all our S3 buckets.

One module to rule them all
Bundling different resources into reusable modules and incorporating business logic was the way to go (and we still believe it is). However, managing multiple environments with different requirements (often from different teams) on resource provisioning and configuration of the same microservice has resulted in an overwhelming amount of business logic to accommodate all possible scenarios.

To give an example, let’s consider a single microservice that requires cloud storage and a service account to access it, a PostgreSQL database and a CDN. Some potential scenarios would be:

  • QA team: We need to use a shared bucket created in the X environment so that we don’t have to upload all the files multiple times
  • Development team: We need to use a common CDN created in the Y environment so that we don’t have to maintain multiple origins
  • Production environment: Requires resource separation and isolation

Running Ansible became scary
As our infrastructure needs grew and the product architecture became more complex, we found that the customizations and configurations performed by Ansible were becoming increasingly difficult to manage. We had a single playbook with over 1000 lines and 100 Ansible tasks, which made it hard to control and track changes. Unlike Terraform, Ansible doesn't provide a clear way to determine which tasks will modify resources. This caused anxiety every time we had to run it in production, especially as more and more critical software was managed with Ansible.

At that time we took the decision to:

  1. Move whatever was feasible to Terraform providers: the Kubernetes and Helm providers were more mature at this point, so we started using them instead of running kubectl commands or installing Helm charts directly (see the sketch after this list).
  2. Move the installation of Kubernetes services out of Terraform: when possible and appropriate, we moved the installation process for Kubernetes services into our GitOps workflow managed by Flux, like every other Kubernetes service.
    You may be wondering about new environments, consistency and automation. Although our Helm / Flux setup probably deserves a separate article, we will only briefly mention here that most functionality was shifted towards Helm charts, while still managing the Helm releases for these services through Terraform.
  3. Break down the remaining tasks into individual playbooks: to simplify management and improve flexibility, we split the remaining tasks into smaller playbooks focused on simple, well-contained actions that we can run separately and as needed.
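
For example, installing a cluster add-on through the Terraform Helm provider looks roughly like this (a hypothetical sketch; the chart coordinates and values file are illustrative):

resource "helm_release" "flux" {
  name             = "flux"
  namespace        = "flux-system"
  create_namespace = true

  repository = "https://fluxcd-community.github.io/helm-charts"
  chart      = "flux2"

  # Environment-specific overrides rendered from our own values files.
  values = [file("${path.module}/values/flux.yaml")]
}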

Documentation
The increasing complexity of the modules made them difficult to understand, update, and even use for provisioning resources in new environments. It became more than obvious that we were missing documentation giving a clear overview of each module's functionality, the values we needed to set for its input variables, and the expected outcome.

3G IaC — The return of Terraform

After upgrading Terraform to v1.x.x, we were in a good position to begin evaluating the limitations of our current structure and the issues we were facing.

The microservice-based organization of our IaC had proven beneficial in creating the necessary building blocks and incorporating the business logic required to meet our needs. This approach made it easy to introduce new microservices or improve existing ones. For example, if microservice X needs to utilize a NoSQL database, we just need to update its module to include a Mongo cluster and apply this change to all the environments where microservice X is deployed. However, horizontal changes (e.g. to all NoSQL databases across microservices) were getting more and more tedious. As the company grew, we needed to address all the issues described above sooner rather than later: the codebase was growing bigger and becoming unmanageable, and implementing more complex requirements was getting difficult.

Let’s take a look at our current structure:

Child modules as core resources

To address our previous issues, we decided to add another layer of abstraction by introducing child modules.

To start developing the child modules, we needed to establish a set of rules to guarantee consistency and uniformity:

  1. Bundle Terraform resources that we always need to provision together. For example, a PostgreSQL RDS instance with a parameter group, a subnet group, and a security group.
  2. Set and enforce default values that apply globally to specific resources in our infrastructure. For example, an S3 bucket must block public access, deny HTTP requests, and use server-side encryption.
  3. Do not include any business logic.
    The term “business logic” refers to the way we configure Terraform resources for specific cases per environment and application, and the way we bundle these resources. All business logic should remain in the root modules.
  4. Only include resources for a single provider.

A child module should only contain resources from the same provider. This is better shown in the file structure we decided to follow for these modules:

modules-terraform/
├── aiven
│   └── kafka
│       ├── README.md
│       ├── main.tf
│       ├── outputs.tf
│       ├── variables.tf
│       └── versions.tf
├── aws
│   └── db_instance
│       ├── README.md
│       ├── main.tf
│       ├── outputs.tf
│       ├── variables.tf
│       └── versions.tf
└── gcp
    └── storage_bucket
        ├── README.md
        ├── main.tf
        ├── outputs.tf
        ├── variables.tf
        └── versions.tf
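
To make rule 2 more concrete, the main.tf of such a child module might look roughly like this hypothetical S3 bucket module (variable declarations would live in the module's variables.tf): the secure defaults are hard-coded, so callers cannot opt out of them.

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
  tags   = var.tags

  # Enforced default: server-side encryption is always on.
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

# Enforced default: public access is always blocked.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# A bucket policy denying plain HTTP requests would be attached here as well.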

Child modules can be chained together to create more comprehensive child modules. For example, for an AWS RDS instance running PostgreSQL, we have a Postgres child module with the standard Postgres configuration we want across all our Postgres instances, regardless of the service provider. This child module is then used by the RDS child module, which includes all the standard RDS-related configuration (logging, SSL, encryption at rest etc.) and uses the Postgres child module for all the standard Postgres-specific configuration.
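
A rough, hypothetical sketch of that chaining inside the aws/db_instance child module (module path and output names are illustrative):

# Provider-agnostic Postgres defaults come from the shared postgres child module.
module "postgres_defaults" {
  source = "../../postgres"
}

resource "aws_db_parameter_group" "this" {
  name   = var.db_parameter_group_name
  family = var.db_parameter_group_family

  # Standard parameters from the postgres child module, plus whatever the caller adds.
  dynamic "parameter" {
    for_each = concat(module.postgres_defaults.parameters, var.db_parameter_group_parameters)
    content {
      name  = parameter.value["name"]
      value = parameter.value["value"]
    }
  }
}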

So, our new structure with the addition of child modules can be represented as follows:

Terragrunt root modules for resource bundling and business logic

A Terraform root module instantiates multiple child modules and is responsible for providing all the resources for a specific microservice, alongside the “business logic” that depends on the environment and the microservice's requirements.

For example, let’s assume we need to create a new RDS database for a microservice:

Our requirements:

  • Applying naming conventions for each environment/application (Business logic)
    The naming convention for the cloud infrastructure is:
    <environment_name>-<service_name>-<resource_type>-<random_id>
  • Apply standard tags across our infrastructure (Business logic)
    Tags should always include name, provisioner, team, application, environment, organization
  • Use VPC security groups from a remote state (Business logic)
  • Create everything at once (Child module)
    An RDS PostgreSQL instance, a parameter group that will enforce SSL, a subnet group, and a security group with a rule to allow all egress traffic.
    In this case, everything is handled by the child module since the requirement is to always provision these resources together. The child module even validates that callers cannot override the enforced SSL setting, as SSL is a hard security requirement for all our DB instances.
variable "db_parameter_group_parameters" {
description = "A list of DB parameter maps to apply"
type = list(map(string))
default = []

validation {
condition = alltrue(
[
for parameter in var.db_parameter_group_parameters :
(
!contains(["rds.force_ssl"], parameter["name"])
)
]
)
error_message = "You can't overwrite force_ssl."
}
}

Using our new structure, this can be easily achieved with just 40 lines of code.

locals {
  identifier = (var.microservice_pg_identifier == null
    ? "${var.env_name}-${var.microservice_name}-pg-${random_id.id.hex}"
    : var.microservice_pg_identifier
  )
  db_parameter_group_name = "${var.env_name}-${var.microservice_name}-postgres-${element(split(".", var.microservice_pg_engine_version), 0)}"
  db_subnet_group_name    = "${var.env_name}-${var.microservice_name}-${var.aws_region}-db-subnet"
  db_security_group_name  = "${var.env_name}-${var.microservice_name}-${var.aws_region}-db-sg"

  tags = {
    name        = local.identifier
    provisioner = "terraform"
    team        = var.team_name
    app         = var.microservice_name
    env         = var.env_name
    org         = var.org_name
  }
}

resource "random_id" "id" {
  byte_length = 2
}

module "microservice_pg" {
  source = "../../../../modules-terraform/aws/db_instance"

  identifier        = local.identifier
  engine_version    = var.microservice_pg_engine_version
  allocated_storage = var.microservice_pg_allocated_storage

  vpc_security_group_ids = [
    data.terraform_remote_state.vpc.outputs.postgres_security_group_production_id,
    data.terraform_remote_state.vpc.outputs.postgres_security_group_staging_id,
    data.terraform_remote_state.vpc.outputs.postgres_security_group_services_id,
  ]

  backup_retention_period = var.microservice_pg_backup_retention_period

  db_subnet_group_name       = local.db_subnet_group_name
  db_subnet_group_subnet_ids = data.terraform_remote_state.vpc.outputs.vpc_id

  db_parameter_group_name   = local.db_parameter_group_name
  db_parameter_group_family = var.microservice_pg_parameter_group_family

  db_security_group_name   = local.db_security_group_name
  db_security_group_vpc_id = data.terraform_remote_state.vpc.outputs.vpc_id

  tags = local.tags
}

Live configuration

In the live configuration, Terragrunt is used to set global configuration variables, generate provider versions, and manipulate the IPs that are whitelisted across our infrastructure.

Terraform provider version management

To manage our Terraform provider versions, we use Terragrunt to generate the versions.tf file in each module. In our root Terragrunt configuration we add a generate block, and in the locals block we decode the YAML file containing the provider versions we are currently using. This enables us to update a provider's version globally across all our modules. If we want a specific module to use a different version of a provider, we can override it in that module's live configuration (sketched after the YAML file below).

Root terragrunt.hcl:

locals {
  provider_version = yamldecode(file("provider_versions.yaml"))
  [...]
}

generate "versions" {
  path      = "versions.tf"
  if_exists = "overwrite"
  contents  = <<EOF
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "${local.provider_version.aws}"
    }
    [...]
  }
}
EOF
}

provider_versions.yaml:

aws: "3.74.1" # https://registry.terraform.io/providers/hashicorp/aws/3.74.1
google: "3.90.1" # https://registry.terraform.io/providers/hashicorp/google/3.90.1
[...]
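
Overriding works because a generate block defined in a module's own terragrunt.hcl with the same name takes precedence over the one coming from the included root configuration. A hypothetical per-module override (the pinned version is illustrative):

include {
  path = find_in_parent_folders()
}

# Redefining the "versions" block here overrides the root-generated versions.tf.
generate "versions" {
  path      = "versions.tf"
  if_exists = "overwrite"
  contents  = <<EOF
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.76.1"
    }
  }
}
EOF
}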

What about Ansible?

As already mentioned above, we have decided to simplify Ansible and move as much as possible to more suitable IaC or GitOps operations. In our IaC code we only use Ansible for:

  • Installing Istio. At the time we did the restructuring, the only production-ready installation method for Istio was istioctl, and moving from istioctl to Helm still requires deleting and reinstalling Istio, which needs to be carefully planned for our production environments.
  • Configuring and managing VMs. Procedural tools like Ansible, Chef, and Puppet are still the best options for configuring and maintaining VM software. One of the advantages of Ansible is that it can be run client-side, so a VM does not need to be accessible to any central configuration management tool.

CI

We do not want to elaborate on this as it is out of the scope of this article but stay tuned…

There is a follow-up article coming that will describe our IaC CI/CD journey.
