✅ The Production-Grade Infrastructure Checklist

Jul 10, 2024

☁️ Introduction

I'm reading a book called "Terraform Up and Running", and it has a very interesting Chapter 8 about production-grade infrastructure. That chapter contains a production-grade infrastructure checklist that is so good I decided to move it to a separate post for future reference.

💎 The Production-Grade Infrastructure Checklist

#	Task	Description	Example tools
1	Install	Install the software binaries and all dependencies.	Bash, Ansible, Docker, Packer
2	Configure	Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc.	Chef, Ansible, Kubernetes
3	Provision	Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc.	Terraform, CloudFormation
4	Deploy	Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments.	ASG, Kubernetes, ECS
5	High availability	Withstand outages of individual processes, servers, services, datacenters, and regions.	Multi-datacenter, multi-region
6	Scalability	Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers).	Auto scaling, replication
7	Performance	Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling.	Dynatrace, Valgrind, VisualVM
8	Networking	Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access.	VPCs, firewalls, Route 53
9	Security	Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening.	ACM, Let's Encrypt, KMS, Vault
10	Metrics	Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting.	CloudWatch, Datadog, Grafana
11	Logs	Rotate logs on disk. Aggregate log data to a central location.	Elastic Stack, Sumo Logic
12	Data backup	Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account.	AWS Backup, RDS snapshots
13	Cost optimization	Pick proper Instance types, use spot and reserved Instances, use auto scaling, and clean up unused resources.	Auto scaling, Infracost
14	Documentation	Document your code, architecture, and practices. Create playbooks to respond to incidents.	READMEs, wikis, Slack, IaC
15	Tests	Write automated tests for your infrastructure code. Run tests after every commit and nightly.	Terratest, tflint, OPA, InSpec

Author's Quote

Every time you're working on a new piece of infrastructure, go through this checklist. Not every single piece of infrastructure needs every single item on the list, but you should consciously and explicitly document which items you've implemented, which ones you've decided to skip, and why.
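One way to make that "document what you implemented, what you skipped, and why" habit concrete is to keep the decisions next to the code and check them automatically. The snippet below is only a minimal sketch of that idea in Python, not something taken from the book: the names `Status`, `Decision`, `CHECKLIST`, and `review` are made up for illustration, and the only rule it enforces is that every one of the 15 tasks has a decision and every skipped task has a reason.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IMPLEMENTED = "implemented"
    SKIPPED = "skipped"


# The 15 task names from the checklist table above.
CHECKLIST = (
    "Install", "Configure", "Provision", "Deploy", "High availability",
    "Scalability", "Performance", "Networking", "Security", "Metrics",
    "Logs", "Data backup", "Cost optimization", "Documentation", "Tests",
)


@dataclass(frozen=True)
class Decision:
    """A recorded decision for one checklist item (hypothetical structure)."""
    status: Status
    note: str = ""  # the "why"; must be non-empty when the item is skipped


def review(decisions: dict[str, Decision]) -> list[str]:
    """Return a list of problems; an empty list means every item is accounted for."""
    problems = []
    for task in CHECKLIST:
        decision = decisions.get(task)
        if decision is None:
            problems.append(f"{task}: no decision recorded")
        elif decision.status is Status.SKIPPED and not decision.note.strip():
            problems.append(f"{task}: skipped without a reason")
    # Flag keys that do not match any checklist item (likely typos).
    for task in decisions:
        if task not in CHECKLIST:
            problems.append(f"{task}: not a checklist item")
    return problems
```

For example, marking everything as implemented and then recording `Decision(Status.SKIPPED)` for "Data backup" without a note makes `review` report that "Data backup" was skipped without a reason. A check like this could run in CI alongside the infrastructure tests, so a skipped item cannot slip through without an explanation.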

🎉 Conclusions

The Production-Grade Infrastructure Checklist in this post is pretty strong, and I'm looking forward to using it in my day job.

Also, I highly recommend reading the book "Terraform Up and Running" where I found this checklist. It contains more practical advice about using Terraform and building modern infrastructure.