✅ The Production-Grade Infrastructure Checklist
☁️ Introduction
I'm reading a book called "Terraform Up and Running", and its Chapter 8, on production-grade infrastructure, is particularly interesting. That chapter contains a production-grade infrastructure checklist that is so good I decided to move it into a separate post for future reference.
💎 The Production-Grade Infrastructure Checklist
# | Task | Description | Example tools |
---|---|---|---|
1 | Install | Install the software binaries and all dependencies. | Bash, Ansible, Docker, Packer |
2 | Configure | Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc. | Chef, Ansible, Kubernetes |
3 | Provision | Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc. | Terraform, CloudFormation |
4 | Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments. | ASG, Kubernetes, ECS |
5 | High availability | Withstand outages of individual processes, servers, services, datacenters, and regions. | Multi-datacenter, multi-region |
6 | Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | Auto scaling, replication |
7 | Performance | Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling. | Dynatrace, Valgrind, VisualVM |
8 | Networking | Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access. | VPCs, firewalls, Route 53 |
9 | Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, Let's Encrypt, KMS, Vault |
10 | Metrics | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting. | CloudWatch, Datadog, Grafana |
11 | Logs | Rotate logs on disk. Aggregate log data to a central location. | Elastic Stack, Sumo Logic |
12 | Data backup | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. | AWS Backup, RDS snapshots |
13 | Cost optimization | Pick proper Instance types, use spot and reserved Instances, use auto scaling, and clean up unused resources. | Auto scaling, Infracost |
14 | Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack, IaC |
15 | Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest, tflint, OPA, InSpec |
Author's Quote
Every time you're working on a new piece of infrastructure, go through this checklist. Not every single piece of infrastructure needs every single item on the list, but you should consciously and explicitly document which items you've implemented, which ones you've decided to skip, and why.
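One way to act on that advice is to keep the decisions in a small, machine-checkable record next to the infrastructure code. The sketch below is my own illustration, not something from the book: the `Status`, `Decision`, and `review` names are hypothetical, and the example decisions describe an imaginary internal service. It flags checklist items that have no recorded decision and items that were skipped without a stated reason.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IMPLEMENTED = "implemented"
    SKIPPED = "skipped"
    UNDECIDED = "undecided"


# The 15 task names from the checklist table, in order.
CHECKLIST = [
    "Install", "Configure", "Provision", "Deploy", "High availability",
    "Scalability", "Performance", "Networking", "Security", "Metrics",
    "Logs", "Data backup", "Cost optimization", "Documentation", "Tests",
]


@dataclass
class Decision:
    status: Status
    reason: str = ""  # why the item was skipped, or a note on how it was done


def review(decisions: dict[str, Decision]) -> list[str]:
    """Return problems: undecided items and skips without a reason."""
    problems = []
    for task in CHECKLIST:
        # Anything missing from the record counts as undecided.
        decision = decisions.get(task, Decision(Status.UNDECIDED))
        if decision.status is Status.UNDECIDED:
            problems.append(f"{task}: no decision recorded")
        elif decision.status is Status.SKIPPED and not decision.reason.strip():
            problems.append(f"{task}: skipped without a documented reason")
    return problems


if __name__ == "__main__":
    # Hypothetical decisions for an imaginary internal service.
    decisions = {
        "Install": Decision(Status.IMPLEMENTED, "Packer image"),
        "High availability": Decision(
            Status.SKIPPED, "internal tool, downtime is acceptable"
        ),
        "Performance": Decision(Status.SKIPPED),  # no reason given: flagged
    }
    for problem in review(decisions):
        print(problem)
```

A check like this could run in CI so that a new module cannot be merged until every item is either implemented or explicitly skipped with a reason, though a plain table in the README would serve the same purpose.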
🎉 Conclusions
The production-grade infrastructure checklist in this post covers a lot of ground, from installation and provisioning to backups, cost, and tests, and I'm looking forward to using it in my day job.
Also, I highly recommend reading the book "Terraform Up and Running" where I found this checklist. It contains more practical advice about using Terraform and building modern infrastructure.