The Production-Grade Infrastructure Checklist

Jul 10, 2024

☁️ Introduction

I'm reading a book called "Terraform Up and Running", and its Chapter 8, on production-grade infrastructure, is especially interesting. That chapter contains a production-grade infrastructure checklist that is so good I decided to move it into a separate post for future reference.

💎 The Production-Grade Infrastructure Checklist

| # | Task | Description | Example tools |
|---|------|-------------|---------------|
| 1 | Install | Install the software binaries and all dependencies. | Bash, Ansible, Docker, Packer |
| 2 | Configure | Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc. | Chef, Ansible, Kubernetes |
| 3 | Provision | Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc. | Terraform, CloudFormation |
| 4 | Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments. | ASG, Kubernetes, ECS |
| 5 | High availability | Withstand outages of individual processes, servers, services, datacenters, and regions. | Multi-datacenter, multi-region |
| 6 | Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | Auto scaling, replication |
| 7 | Performance | Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling. | Dynatrace, Valgrind, VisualVM |
| 8 | Networking | Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access. | VPCs, firewalls, Route 53 |
| 9 | Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, Let's Encrypt, KMS, Vault |
| 10 | Metrics | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting. | CloudWatch, Datadog, Grafana |
| 11 | Logs | Rotate logs on disk. Aggregate log data to a central location. | Elastic Stack, Sumo Logic |
| 12 | Data backup | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to a separate region/account. | AWS Backup, RDS snapshots |
| 13 | Cost optimization | Pick proper instance types, use spot and reserved instances, use auto scaling, and clean up unused resources. | Auto scaling, Infracost |
| 14 | Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack, IaC |
| 15 | Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest, tflint, OPA, InSpec |
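
One way to put the checklist to work is to track which of the 15 tasks a given service already covers. The snippet below is only a minimal sketch of that idea and does not come from the book: the `missing_items` helper and the sample `covered` set are hypothetical names made up for illustration.

```python
# The 15 task names from the checklist above, in order.
CHECKLIST = [
    "Install", "Configure", "Provision", "Deploy", "High availability",
    "Scalability", "Performance", "Networking", "Security", "Metrics",
    "Logs", "Data backup", "Cost optimization", "Documentation", "Tests",
]


def missing_items(covered):
    """Return the checklist tasks not in `covered`, in checklist order.

    Hypothetical helper; the comparison ignores letter case.
    """
    covered_lower = {name.lower() for name in covered}
    return [task for task in CHECKLIST if task.lower() not in covered_lower]


# Hypothetical service that has handled only the first four tasks so far.
covered = {"Install", "Configure", "Provision", "Deploy"}
remaining = missing_items(covered)

print(f"{len(CHECKLIST) - len(remaining)} of {len(CHECKLIST)} tasks covered")
for task in remaining:
    print(f"[ ] {task}")
```

Keeping the task names in one ordered list means the report always follows the order of the table, whatever order the covered tasks were recorded in.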

🎉 Conclusions

The Production-Grade Infrastructure Checklist in this post is pretty strong, and I'm looking forward to using it in my day job.

Also, I highly recommend reading the book "Terraform Up and Running" where I found this checklist. It contains more practical advice about using Terraform and building modern infrastructure.
