Building cloud infrastructure used to mean spending hours clicking through the AWS console and trying to remember which settings you used last time. I’ve been there – frantically taking screenshots of configurations and keeping messy notes just to recreate environments.
That’s where Terraform comes in. Instead of all that manual work, you write a few configuration files that describe exactly what you want your infrastructure to look like, and Terraform handles the rest.
Here’s what I am going to build:
A VPC
Public subnets for things that need internet access
Private subnets for your databases and sensitive stuff
All the networking pieces to make everything talk to each other
Security groups to keep the bad guys out
3 EC2 instances
What you’ll need before we start:
An AWS account (the free tier covers most of this; an Elastic IP runs roughly $0.005 per hour while allocated, so if you have free AWS credits this won’t hurt)
Terraform installed on your computer
AWS CLI set up with your credentials
About an hour of your time
Let’s set up the environment first. Go to IAM > Create a User > Security Credentials tab > Create Access Key. This gives you an AWS_ACCESS_KEY_ID and an AWS_SECRET_ACCESS_KEY.
Create a docker-compose file that spins up a containerized Terraform environment with the AWS CLI. Use the latest Terraform image, mount the current directory to /workspace, and set stdin_open and tty to true to enable terminal access.
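Here’s a minimal sketch of what that docker-compose.yml could look like. Note that the official hashicorp/terraform image doesn’t bundle the AWS CLI, so treat the image, service name, and credential variables below as assumptions you may need to adjust (for example by building a small custom image that adds the AWS CLI):
Bash
# Write a minimal docker-compose.yml for a containerized Terraform workspace (sketch).
cat > docker-compose.yml <<'EOF'
services:
  terraform-aws:
    image: hashicorp/terraform:latest   # latest Terraform image
    container_name: terraform-aws
    entrypoint: ["sh"]                  # keep a shell instead of the default terraform entrypoint
    working_dir: /workspace
    volumes:
      - .:/workspace                    # mount the current directory to /workspace
    environment:
      - AWS_ACCESS_KEY_ID
      - AWS_SECRET_ACCESS_KEY
    stdin_open: true                    # enable terminal access (docker -i)
    tty: true                           # enable terminal access (docker -t)
EOF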
Run docker-compose up -d to start, then docker exec -it terraform-aws sh to access the container.
Bash
docker-compose up -d
docker exec -it terraform-aws sh
And this is the Terraform project structure, module by module:
Key Pairs
Key Pairs are cryptographic SSH keys used to securely authenticate to and access EC2 instances without passwords. Think of them as the ID badges that let you into the workstations in your office building: even if someone breaks into the building, they can’t access your employees’ workstations without the right badge. It’s important to note that if you lose the private key file, you lose SSH access to that EC2 instance permanently unless you have set up an alternative access method.
Generates a 4096-bit RSA private/public key pair in memory.
A VPC is a logically isolated section of a cloud provider’s network where you can launch and manage your cloud resources in a virtual network that you define and control. VPCs are fundamental to cloud architecture because they give you the network foundation needed to build secure and scalable applications while maintaining control over your network environment. Think of it like renting an entire floor of a skyscraper: you decide how to divide it into rooms (subnets), who can access each room (security groups), and whether a room has windows to the outside world (internet access) or is an interior office (a private subnet). It’s essentially the cloud provider saying, “here’s your own private piece of the internet where you can build whatever you need.”
A Public Subnet has a route to an Internet Gateway. Think of it as the building’s main entrance that connects your floor directly to the street: any server you put in a public subnet can receive traffic directly from the internet, just like people on the street can see and walk up to those street-facing conference rooms. Note that just because a room has windows doesn’t mean anyone can walk in; you still have security (security groups and firewalls) controlling exactly who can enter and what they can do.
Defines an input variable that accepts a list of subnet CIDR blocks, defaulting to one subnet (10.0.1.0/24).
A Private Subnet is like the interior offices and back rooms: no windows to the street and no way to be reached directly from the outside world. It has no direct route to the Internet Gateway, so these servers can’t receive traffic directly from the internet, just like people on the street can’t walk directly into your back offices. For a private subnet to reach the outside world, its traffic must go through a NAT Gateway, which we’ll come back to later. Even if someone breaks through your perimeter security, your most critical systems (databases and internal applications) sit in these windowless back rooms, where they are much harder to reach and attack directly.
Defines an input variable for private subnet CIDR blocks, defaulting to two subnets (10.0.2.0/24 and 10.0.3.0/24).
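A minimal sketch of those two variables (the names match the networking module shown later; the defaults are the ones above):
HCL
# modules/networking/variables.tf (sketch)
variable "public_subnet_cidrs" {
  type = list(string)
  description = "Public Subnet CIDR values"
  default = ["10.0.1.0/24"]
}

variable "private_subnet_cidrs" {
  type = list(string)
  description = "Private Subnet CIDR values"
  default = ["10.0.2.0/24", "10.0.3.0/24"]
}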
An Internet Gateway is like the main building entrance and lobby: a single point where your office floor connects to the outside world, i.e. the internet. Simply put, it’s an AWS-managed component that connects your VPC to the internet. Without an Internet Gateway, your VPC has no internet connectivity at all; it’s a completely isolated network.
Creates an internet gateway and attaches it to the VPC to enable internet access.
Route Tables are network routing rules that determine where to send traffic based on destination IP addresses. When your web server wants to download updates from the internet, it checks its route table: “to reach 0.0.0.0/0, go to the Internet Gateway,” and sends the traffic there.
Security Groups are virtual firewalls that control inbound and outbound traffic at the instance level (EC2, RDS, and so on). Unlike the building’s main security, which controls who can enter each room or area, security groups are like individual bodyguards that stick with specific employees wherever they go. They are also stateful: if someone approaches your employee and starts a conversation (the inbound request) and the bodyguard approves it, the bodyguard automatically allows your employee to respond (the outbound response).
The Security Groups module accepts two return values from the Networking module, vpc_id and vpc_cidr_block, so create variables for them.
Just like you hire different types of employees with different skills and assign them to different rooms in your office, EC2 instances are virtual computers that you hire from AWS to perform specific tasks. EC2 instances are your actual workforce: the virtual machines doing the real work in your cloud infrastructure, just like employees doing real work in your office building.
Defines the private IP address for the public instance, defaulting to 10.0.1.10.
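As a sketch, that variable could be declared like this (the variable name here is an assumption):
HCL
variable "public_instance_private_ip" {
  type = string
  description = "Private IP address for the public EC2 instance"
  default = "10.0.1.10"
}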
Creates the compute module, passing networking details, security group IDs, and key pair information from other modules, and waits for all dependencies to complete first.
This is one of the hard ways to install and run Kubernetes, and I recommend it for learning purposes rather than production use. For production, Amazon EKS (Elastic Kubernetes Service) gives you a managed control plane instead of setting everything up yourself like in this tutorial.
First, the VPC, subnets, security groups, key pairs, SSM parameters, IAM roles, Network Load Balancer, and EC2 instances (1 bastion host, 3 control planes, 3 worker nodes) need to be set up before Kubernetes can be installed through Bash scripting.
Terraform will be used to spin up the AWS resources. It is an infrastructure-as-code tool that lets you create AWS resources without provisioning them manually with mouse clicks.
Key Pairs are a secure authentication method for accessing EC2 instances via SSH. Each key pair consists of a public key and a private key.
Public Key: Gets installed on EC2 instances during launch
Private Key: Stays on your local machine (like a secret password)
Create the Key Pairs first, name it as terraform-key-pair.pem and save it locally.
HCL
# modules/keypair/main.tf

# Generate an RSA key pair
resource "tls_private_key" "private" {
  algorithm = "RSA"
  rsa_bits = 4096
}

# Create an AWS key pair using the generated public key
resource "aws_key_pair" "generated_key" {
  key_name = "terraform-key-pair"
  public_key = tls_private_key.private.public_key_openssh
}

# Save the private key locally
resource "local_file" "private_key" {
  content = tls_private_key.private.private_key_pem
  filename = "${path.root}/terraform-key-pair.pem"
}
Expose the return values of the above code using outputs so they can be used in other modules.
HCL
# modules/keypair/outputs.tf

output "key_pair_name" {
  description = "Name of the AWS key pair for SSH access to EC2 instances"
  value = aws_key_pair.generated_key.key_name
}

output "tls_private_key_pem" {
  description = "Private key in PEM format for SSH access - keep secure and do not expose"
  value = tls_private_key.private.private_key_pem
  sensitive = true
}
IAM (Identity and Access Management) is AWS’s security system: it controls who can do what in an AWS account. IAM does two things: authentication (who are you?) and authorization (what can you do?).
Authentication
- Users: Individual people (e.g., developers)
- Roles: Temporary identities for services/applications
- Groups: Collections of users with similar permissions

Authorization
- Policies: Rules that define permissions
- Permissions: Specific actions allowed/denied
Example IAM Users:
John (Developer) -> Can create EC2 instances but not delete them
Sarah (Admin) -> Can do everything
CI/CD System -> Can deploy applications but not manage billing
Example IAM Roles:
EC2 Instance Role -> Can read from S3 buckets
Lambda Function Role -> Can write to DynamoDB
Kubernetes Node Role -> Can join cluster and pull images
Generates a random identifier that will be consistent across Terraform runs. It’s like creating a unique “serial number” for your cluster. This ensures all resources are uniquely named and belong to the same cluster deployment. What it does:
Creates: A random 4-byte (32-bit) identifier
Formats: Usually displayed as hexadecimal (like a1b2c3d4)
Persistence: Same value every time you run terraform apply (unless you destroy and recreate)
It is used in aws_iam_role.kubernetes_master, aws_iam_instance_profile.kubernetes_master, aws_iam_role.kubernetes_worker, aws_iam_instance_profile.kubernetes_worker.
This is used to avoid naming conflicts. When multiple people or environments deploy the same Terraform code, IAM resources need unique names because:
IAM names must be unique within an AWS account
Multiple deployments would conflict without unique identifiers
Easy identification of which resources belong to which cluster
With consistent random suffixes in place you get:
No Conflicts: Multiple developers/environments can deploy simultaneously
Easy Cleanup: All resources for one cluster have the same suffix
Clear Ownership: Can identify which resources belong to which deployment
Testing: Can deploy multiple test environments without conflicts
HCL
resource "random_id" "cluster" {
  byte_length = 4
}
Control Plane Master IAM Setup
Master Role – Identity for control plane nodes
This creates an IAM role that EC2 instances can assume to get AWS permissions. The assume_role_policy is a trust policy that says “only EC2 instances can use this role” – it controls WHO can assume the role, not WHAT they can do. The actual permissions (like accessing S3 or Parameter Store) are added later by attaching separate IAM policies to this role.
HCL
# modules/iam/main.tf
resource "aws_iam_role" "kubernetes_master" {
  name = "kubernetes-master-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Name = "${terraform.workspace} - Kubernetes Master Role"
    Description = "IAM role for Kubernetes control plane nodes with AWS API permissions"
    Purpose = "Kubernetes Control Plane"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
    Project = "Kubernetes"
    NodeType = "Control Plane"
    Service = "EC2"
  }
}
Master Instance Profile – Attaches role to EC2.
This creates an IAM instance profile that acts as a bridge between EC2 instances and IAM roles. The instance profile gets attached to EC2 instances and allows them to assume the specified IAM role to obtain temporary AWS credentials. Think of it as the mechanism that lets EC2 instances “wear” the IAM role – without an instance profile, EC2 instances cannot access AWS APIs because they have no way to authenticate or assume roles.
HCL
# modules/iam/main.tf
resource "aws_iam_instance_profile" "kubernetes_master" {
  name = "kubernetes-master-profile-${random_id.cluster.hex}"
  role = aws_iam_role.kubernetes_master.name

  tags = {
    Name = "${terraform.workspace} - Kubernetes Control Plane Instance Profile"
    Description = "Instance profile for control plane nodes - enables AWS API access for cluster management"
    Purpose = "Kubernetes Control Plane"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
Master SSM Policy – Parameter store permissions
This policy gives the control plane nodes permission to store and manage cluster secrets in AWS Parameter Store. When the first control plane node sets up the cluster, it creates a “join command” (like a password) and stores it in AWS Parameter Store so other nodes can retrieve it and join the cluster. The policy restricts access to only parameters that start with /k8s/ for security.
What control plane can do:
PutParameter: Store cluster join command and tokens
GetParameter: Read existing cluster info
DeleteParameter: Clean up old/expired tokens
DescribeParameters: List available parameters
HCL
# modules/iam/main.tf
# SSM parameter access policy for Kubernetes control plane - allows storing/retrieving cluster join tokens
resource "aws_iam_role_policy" "kubernetes_master_ssm" {
  name = "kubernetes-master-ssm-policy"
  role = aws_iam_role.kubernetes_master.id

  policy = jsonencode({
    # Policy grants control plane full access to SSM parameters under /k8s/ namespace
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ssm:PutParameter",      # Store cluster join command with tokens and CA cert hash
          "ssm:GetParameter",      # Retrieve existing parameters for validation
          "ssm:DeleteParameter",   # Clean up expired or invalid join tokens
          "ssm:DescribeParameters" # List and discover available k8s parameters
        ]
        # Restrict access to only k8s namespace parameters for security
        Resource = "arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
      }
    ]
  })
}
Worker Nodes IAM Setup
Worker Role – Identity for worker nodes
This creates an IAM role specifically for worker node EC2 instances. The assume_role_policy is a trust policy that allows only EC2 instances to assume this role and get AWS credentials. This role will later have policies attached that give worker nodes the specific permissions they need (like pulling container images, managing storage volumes, and handling pod networking) – but this just creates the empty role container that worker nodes can use.
HCL
# modules/iam/main.tf
resource "aws_iam_role" "kubernetes_worker" {
  name = "kubernetes-worker-profile-${random_id.cluster.hex}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Name = "${terraform.workspace} - Kubernetes Worker Role"
    Description = "IAM role for Kubernetes worker nodes with permissions for pod networking, storage, and container operations"
    Purpose = "Kubernetes Worker Nodes"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
Worker Instance Profile – Attaches role to EC2
This creates an IAM instance profile that acts as a bridge between worker node EC2 instances and the worker IAM role. The instance profile gets attached to worker EC2 instances and allows them to assume the kubernetes_worker role to obtain AWS credentials. This enables worker nodes to access AWS APIs for tasks like pulling container images, managing EBS volumes, and configuring networking – without it, worker nodes couldn’t authenticate with AWS services.
HCL
# modules/iam/main.tf
resource "aws_iam_instance_profile" "kubernetes_worker" {
  name = "kubernetes-worker-profile"
  role = aws_iam_role.kubernetes_worker.name

  tags = {
    Name = "${terraform.workspace} - Kubernetes Worker Instance Profile"
    Description = "Instance profile for worker nodes - enables AWS API access for container operations and networking"
    Purpose = "Kubernetes Worker Nodes"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
Worker SSM Policy – Read-only parameter access
This creates an IAM policy that gets attached to the worker role, giving worker nodes read-only access to AWS Parameter Store. It allows worker nodes to retrieve the cluster join command that was stored by the control plane, but restricts access to only parameters under the /k8s/ path for security. This is how worker nodes get the secret tokens they need to join the existing Kubernetes cluster.
HCL
# modules/iam/main.tf
# Worker node SSM access - read-only permissions to get cluster join command
resource "aws_iam_role_policy" "kubernetes_worker_ssm" {
  name = "kubernetes-worker-ssm-policy"
  role = aws_iam_role.kubernetes_worker.id

  policy = jsonencode({
    # Policy allows worker nodes to read SSM parameters under /k8s/ path
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ssm:GetParameter", # Read join command stored by control plane
          "ssm:GetParameters" # Batch read multiple parameters if needed
        ]
        # Only allow access to k8s namespace parameters
        Resource = "arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
      }
    ]
  })
}
How IAM works:
Control plane starts -> Gets master role
Kubernetes initializes -> Generates join token
Control plane stores the join command in the SSM parameter /k8s/control-plane/join-command
Worker nodes start -> Get worker role
Workers read join command from SSM
Workers join the cluster using the token
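Condensed into AWS CLI calls, that hand-off looks roughly like this (the parameter path matches the SSM module below; the join command value is illustrative):
Bash
# On the first control plane node: store the join command under the /k8s/ namespace
aws ssm put-parameter \
  --name "/k8s/control-plane/join-command" \
  --value "kubeadm join 10.0.1.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>" \
  --type "SecureString" \
  --overwrite

# On a joining node: read the join command back (needs only ssm:GetParameter)
aws ssm get-parameter \
  --name "/k8s/control-plane/join-command" \
  --with-decryption \
  --query "Parameter.Value" \
  --output text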
Expose the return values to be used in other modules.
HCL
# modules/iam/outputs.tf

output "kubernetes_master_instance_profile" {
  description = "IAM instance profile name for Kubernetes control plane nodes - provides AWS API permissions"
  value = aws_iam_instance_profile.kubernetes_master.name
}

output "kubernetes_worker_instance_profile" {
  description = "IAM instance profile name for Kubernetes worker nodes - provides AWS API permissions for pods and services"
  value = aws_iam_instance_profile.kubernetes_worker.name
}
This SSM parameter provides a secure, automated way for control plane nodes to share fresh join tokens with worker nodes, eliminating manual steps and security risks. You don’t have to SSH into every node just to enter the join command.
HCL
# modules/ssm/main.tf
resource "aws_ssm_parameter" "join_command" {
  name = "/k8s/control-plane/join-command"
  type = "SecureString"
  value = "placeholder-will-be-updated-by-script"
  description = "Kubernetes cluster join command for worker nodes - automatically updated by control plane initialization script"

  lifecycle {
    ignore_changes = [value] # Let the script update the value
  }
}
Name & Path:
HCL
name = "/k8s/control-plane/join-command"
Hierarchical path: Organized under /k8s/ namespace
Specific location: Control plane section for join commands
Matches IAM policy: IAM roles above have access to /k8s/* path
Security Type:
HCL
type = "SecureString"
Encrypted storage: Value is encrypted at rest in AWS
Secure transmission: Encrypted in transit when accessed
Better than plaintext: Protects sensitive cluster tokens
The Join Command Content
What gets stored (after the control plane runs):
# Real example of what replaces the placeholder:
"kubeadm join 10.0.1.10:6443 --token abc123.def456ghi789 --discovery-token-ca-cert-hash sha256:1234567890abcdef..."
Why is ignore_changes needed?
Control plane script updates the value with real join command
Without lifecycle: Terraform would overwrite script’s value back to placeholder
With lifecycle: Terraform ignores value changes, lets script manage it
HCL
lifecycle {
  ignore_changes = [value] # Let the script update the value
}
The networking section creates the foundational network infrastructure for the Kubernetes cluster.
Create the variables first.
HCL
# modules/networking/variables.tf

variable "aws_region" {
  type = map(string)
  description = "AWS region for each environment - maps workspace to region"
  default = {
    "development" = "us-east-1"
    "production" = "us-east-2"
  }
}

variable "public_subnet_cidrs" {
  type = list(string)
  description = "Public Subnet CIDR values for load balancers and internet-facing resources"
  default = ["10.0.1.0/24"]
}

variable "private_subnet_cidrs" {
  type = list(string)
  description = "Private Subnet CIDR values for Kubernetes nodes and internal services"
  default = ["10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
}

variable "azs" {
  type = map(list(string))
  description = "Availability Zones for each environment - ensures high availability across multiple AZs"
  default = {
    "development" = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1f"]
    "production" = ["us-east-2a", "us-east-2b", "us-east-2c", "us-east-2d", "us-east-2f"]
  }
}
VPC (Virtual Private Cloud)
A VPC (Virtual Private Cloud) in Amazon Web Services (AWS) is your own isolated network within the AWS cloud — like a private data center you control.
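The aws_vpc.main resource that the rest of this module references isn’t shown here; a minimal sketch could look like this (the 10.0.0.0/16 CIDR and DNS settings are assumptions chosen to match the subnet CIDRs above):
HCL
# modules/networking/main.tf (sketch)
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  enable_dns_support = true
  enable_dns_hostnames = true

  tags = {
    Name = "${terraform.workspace} - Kubernetes VPC"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}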
A public subnet in AWS is a subnet inside a VPC that can communicate directly with the internet; it is typically used for resources that need to be accessible from outside AWS.
In AWS, a DMZ (Demilitarized Zone) is a subnet or network segment that acts as a buffer zone between the public internet and your private/internal AWS resources. It’s used to host public-facing services while minimizing the exposure of your internal network.
The public subnet contains the bastion host – a dedicated EC2 instance that acts as a secure gateway for accessing private resources. The bastion has a public IP and sits in the public subnet, allowing administrators to SSH into it from the internet, then use it as a stepping stone to securely connect to instances in private subnets that don’t have direct internet access.
HCL
# modules/networking/main.tf
resource "aws_subnet" "public_subnets" {
  count = length(var.public_subnet_cidrs)
  vpc_id = aws_vpc.main.id
  cidr_block = element(var.public_subnet_cidrs, count.index)
  availability_zone = element(var.azs[terraform.workspace], count.index)

  tags = {
    Name = "${terraform.workspace} - Public Subnet ${count.index + 1}"
    Description = "Public subnet for bastion host and load balancers"
    Type = "Public"
    Environment = terraform.workspace
    AvailabilityZone = element(var.azs[terraform.workspace], count.index)
    Purpose = "DMZ"
    ManagedBy = "Terraform"
    Project = "Kubernetes"
    Tier = "DMZ" # Demilitarized Zone
  }
}
Private Subnets (Internal)
A private subnet in AWS is a subnet within your VPC that does NOT have direct access to or from the public internet. It’s used to host internal resources that should remain isolated from external access, such as: Application servers, Databases (e.g., RDS), Internal services (e.g., Redis, internal APIs).
Hosts the Kubernetes control plane and worker nodes. No direct internet access (protected from external access).
HCL
# modules/networking/main.tf
resource "aws_subnet" "private_subnets" {
  count = min(length(var.private_subnet_cidrs), length(var.azs[terraform.workspace]))
  vpc_id = aws_vpc.main.id
  cidr_block = var.private_subnet_cidrs[count.index]
  availability_zone = var.azs[terraform.workspace][count.index] # Ensures 1 AZ per subnet

  tags = {
    Name = "${terraform.workspace} - Private Subnet ${count.index + 1}"
    Description = "Private subnet for Kubernetes worker and control plane nodes"
    Type = "Private"
    Environment = terraform.workspace
    AvailabilityZone = var.azs[terraform.workspace][count.index]
    Purpose = "Kubernetes Nodes"
    ManagedBy = "Terraform"
    Project = "Kubernetes"
    Tier = "Internal"
  }
}
Multi-AZ Distribution: Spreads resources across multiple data centers (High availability). If one AZ fails, others continue running (Fault tolerance).
An Internet Gateway (IGW) in AWS is the component that connects your VPC to the internet. It allows resources in your VPC (like EC2 instances in a public subnet) to send traffic to the internet and receive traffic from it. It is attached to the VPC and referenced by the public route table, which is what enables the bastion host to receive SSH connections. It handles:
Outbound connections (e.g., your EC2 instance accessing a website)
Inbound connections (e.g., users accessing your public web server)
HCL
# modules/networking/main.tf
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${terraform.workspace} - Internet Gateway"
    Purpose = "Internet access for public subnets"
    Description = "Provides internet connectivity for bastion host and load balancers"
    Type = "Gateway"
  }
}
Public route table
A public route table in AWS is a route table associated with one or more public subnets, and it directs traffic destined for the internet to an Internet Gateway (IGW).
Traffic flow: Public subnet -> Internet Gateway -> Internet
HCL
# modules/networking/main.tf
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${terraform.workspace} - Public Route Table"
    Description = "Route table for public subnets - directs traffic to internet gateway"
    Type = "Public"
    Purpose = "Internet routing for DMZ resources"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
    Tier = "DMZ"
    RouteType = "Internet-bound"
    Project = "Kubernetes"
  }
}
Private Route Table
A private route table in AWS is a route table used by private subnets—subnets that do not have direct access to or from the internet.
A private route table does NOT have a route to an Internet Gateway (IGW). Instead, it may have a route to a NAT Gateway or no external route at all, depending on whether you want outbound internet access (e.g., for software updates) or complete isolation.
Traffic flow: Private subnet -> NAT Gateway -> Internet Gateway -> Internet
HCL
# modules/networking/main.tf
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${terraform.workspace} - Private Route Table"
    Description = "Route table for private subnets - directs internet traffic through NAT Gateway"
    Type = "Private"
    Environment = terraform.workspace
    Purpose = "NAT Gateway Routing"
    ManagedBy = "Terraform"
  }
}
Elastic IP for NAT Gateway
A static public IP provides a consistent address and is required for NAT Gateway operation.
What NAT Gateway Does:
Private Subnet (10.0.1.x)-> NAT Gateway -> Internet
It translates private IPs to a public IP for outbound traffic, so it needs a public IP to communicate with the internet on behalf of private resources. Without an EIP, the NAT Gateway won’t work.
Without EIP (Dynamic IP):
Today: NAT uses IP 12.123.45.67
Tomorrow: AWS changes it to 12.234.56.78
Result: External services block your new IP
With EIP (Static IP):
Always: NAT uses IP 54.123.45.67
Result: Consistent external identity
HCL
# modules/networking/main.tf
resource "aws_eip" "nat_eip" {
  domain = "vpc"

  tags = {
    Name = "${terraform.workspace} - NAT Gateway EIP"
    Description = "Elastic IP for NAT Gateway - enables internet access for private subnets"
    Purpose = "NAT Gateway"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
NAT Gateway
Allows private subnets to reach the internet: outbound traffic only, with no inbound connections from the internet. It’s essential for Kubernetes nodes to download images, updates, and so on.
HCL
# modules/networking/main.tf
resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat_eip.id
  subnet_id = aws_subnet.public_subnets[0].id

  tags = {
    Name = "${terraform.workspace} - NAT Gateway"
    Description = "NAT Gateway for private subnet internet access - enables Kubernetes nodes to reach external services"
    Purpose = "Private Subnet Internet Access"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }

  depends_on = [aws_internet_gateway.igw]
}
Add a default route to the internet gateway in the public route table.
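A sketch of that route, together with the private-subnet default route through the NAT Gateway implied by the private route table above and the subnet associations (the resource names here are assumptions):
HCL
# modules/networking/main.tf (sketch)

# Default route: public subnets -> Internet Gateway
resource "aws_route" "public_internet_access" {
  route_table_id = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = aws_internet_gateway.igw.id
}

# Default route: private subnets -> NAT Gateway
resource "aws_route" "private_nat_access" {
  route_table_id = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id = aws_nat_gateway.nat.id
}

# Associate each subnet with its route table
resource "aws_route_table_association" "public" {
  count = length(aws_subnet.public_subnets)
  subnet_id = aws_subnet.public_subnets[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(aws_subnet.private_subnets)
  subnet_id = aws_subnet.private_subnets[count.index].id
  route_table_id = aws_route_table.private.id
}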
Expose the return values to be used in other modules.
HCL
# modules/networking/outputs.tf

output "vpc_id" {
  description = "ID of the VPC for the Kubernetes cluster"
  value = aws_vpc.main.id
}

output "vpc_cidr_block" {
  description = "CIDR block of the VPC for security group rules and network configuration"
  value = aws_vpc.main.cidr_block
}

output "private_subnets" {
  description = "Private subnets for Kubernetes worker nodes and internal services"
  value = aws_subnet.private_subnets
}

output "public_subnets" {
  description = "Public subnets for load balancers, bastion hosts, and internet-facing resources"
  value = aws_subnet.public_subnets
}
The security groups module creates the network security rules (firewalls) for the Kubernetes cluster. Together they form a layered defense where each Kubernetes component can communicate as needed while unauthorized access from the internet is blocked.
# modules/security_groups/variables.tf

// FROM Other Module
variable "vpc_id" {
  description = "VPC ID from AWS module"
  type = string
}

variable "vpc_cidr_block" {
  description = "CIDR block of the VPC for internal network communication"
  type = string
}
1. Bastion Security Group: Creates a firewall group for the bastion host.
HCL
# modules/security_groups/main.tf
resource "aws_security_group" "bastion" {
  name = "bastion-sg"
  vpc_id = var.vpc_id
  description = "Security group for the bastion host"

  tags = {
    Name = "${terraform.workspace} - Bastion Host SG"
  }
}
2. Bastion SSH from Internet: Allows SSH connections to bastion host from anywhere on the internet.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "bastion_ssh_anywhere" {
  security_group_id = aws_security_group.bastion.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow SSH access to bastion host from any IP address"

  tags = {
    Name = "${terraform.workspace} - Bastion SSH Internet Access"
  }
}
3. Bastion SSH to Control Plane: Allows bastion host to SSH to control plane nodes.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "bastion_egress_control_plane" {
  security_group_id = aws_security_group.bastion.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.control_plane.id
  description = "Allow SSH from bastion host to Kubernetes control plane nodes for cluster administration"

  tags = {
    Name = "${terraform.workspace} - Bastion SSH to Control Plane"
  }
}
4. Bastion SSH to Workers: Allows bastion host to SSH to worker nodes.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "bastion_egress_workers" {
  security_group_id = aws_security_group.bastion.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.worker_node.id
  description = "Allow SSH from bastion host to worker nodes for maintenance and troubleshooting"

  tags = {
    Name = "${terraform.workspace} - Bastion SSH to Worker Nodes"
  }
}
5. Control Plane Security Group: Creates a firewall group for Kubernetes master nodes.
HCL
# modules/security_groups/main.tf
resource "aws_security_group" "control_plane" {
  name = "control-plane-sg"
  vpc_id = var.vpc_id
  description = "Security group for the Kubernetes control plane"

  tags = {
    Name = "${terraform.workspace} - Kubernetes Control Plane SG"
  }
}
6. Control Plane SSH Access: Allows SSH to control plane from bastion host only.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_ssh" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.bastion.id
  description = "Allow SSH access to control plane nodes from bastion host for cluster administration"

  tags = {
    Name = "${terraform.workspace} - Control Plane SSH from Bastion"
  }
}
7. Control Plane etcd: Allows etcd database communication. Kubernetes stores all data in etcd.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_etcd" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 2379
  to_port = 2380
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow etcd client and peer communication within VPC for Kubernetes cluster state management"

  tags = {
    Name = "${terraform.workspace} - Control Plane etcd Communication"
  }
}
8. Control Plane kubelet: Allows kubelet API access. Use for monitoring and managing pods.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_self_control_plane" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 10250
  to_port = 10250
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow kubelet API access within VPC for control plane node communication and monitoring"

  tags = {
    Name = "${terraform.workspace} - Control Plane kubelet API"
  }
}
9. Control Plane Scheduler: Allows access to scheduler metrics. Use for health checks and monitoring.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_kube_scheduler" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 10259
  to_port = 10259
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow kube-scheduler metrics and health check access from VPC for cluster monitoring"

  tags = {
    Name = "${terraform.workspace} - Control Plane kube-scheduler"
  }
}
10. Control Plane Controller Manager: Allows access to controller manager metrics. Use for health checks and monitoring.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_kube_controller_manager" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 10257
  to_port = 10257
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow kube-controller-manager metrics and health check access from VPC for cluster monitoring"

  tags = {
    Name = "${terraform.workspace} - Control Plane kube-controller-manager"
  }
}
11. Control Plane All Outbound: Allows the control plane to connect to anything on the internet. Used for downloading updates and calling AWS APIs.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "control_plane_egress_all" {
  security_group_id = aws_security_group.control_plane.id
  ip_protocol = "-1"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow all outbound traffic from control plane for AWS APIs, container registries, and external services"

  tags = {
    Name = "${terraform.workspace} - Control Plane Outbound All"
  }
}
12. Worker Node Security Group: Creates a firewall group for worker nodes.
HCL
# modules/security_groups/main.tf
resource "aws_security_group" "worker_node" {
  name = "worker-node-sg"
  vpc_id = var.vpc_id
  description = "Security group for Kubernetes worker nodes - controls pod and application traffic"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes SG"
  }
}
13. Worker All Outbound: Allows workers to connect to anything on the internet. Used for downloading container images and calling external APIs.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "worker_node_egress_all" {
  security_group_id = aws_security_group.worker_node.id
  ip_protocol = "-1"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow all outbound traffic from worker nodes for container images, application traffic, and AWS services"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes Outbound All"
  }
}
14. Worker SSH Access: Allows SSH to workers from bastion only.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_ssh" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.bastion.id
  description = "Allow SSH access to worker nodes from bastion host for maintenance and troubleshooting"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes SSH from Bastion"
  }
}
15. Worker kubelet API: Allows the control plane to manage worker pods. This is how Kubernetes schedules and monitors pods on workers.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_kubelet_api" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 10250
  to_port = 10250
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.control_plane.id
  description = "Allow control plane access to worker node kubelet API for pod management and monitoring"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes kubelet API"
  }
}
16. Worker kube-proxy: Allows load balancer to check worker health.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_kube_proxy" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 10256
  to_port = 10256
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.elb.id
  description = "Allow load balancer access to kube-proxy health check endpoint on worker nodes"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes kube-proxy"
  }
}
17. Worker NodePort TCP: Allows internet to access applications on workers. Expose web apps and APIs.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_tcp_nodeport_services" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 30000
  to_port = 32767
  ip_protocol = "tcp"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow internet access to Kubernetes NodePort services (TCP 30000-32767) for application traffic"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes NodePort TCP"
  }
}
18. Worker NodePort UDP: Allows internet to access UDP applications on workers. Expose UDP services like DNS and games.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_udp_nodeport_services" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 30000
  to_port = 32767
  ip_protocol = "udp"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow internet access to Kubernetes NodePort services (UDP 30000-32767) for application traffic"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes NodePort UDP"
  }
}
19. Control Plane API Health Check: Allows load balancer to check if API server is healthy.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "allow_nlb_health_check" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 6443
  to_port = 6443
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow Network Load Balancer health checks to Kubernetes API server on port 6443"

  tags = {
    Name = "${terraform.workspace} - Control Plane NLB Health Check"
  }
}
20. Control Plane BGP: Allows BGP, used by service meshes and advanced CNI plugins (such as Calico) for routing.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "allow_bgp" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 179
  to_port = 179
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow BGP protocol communication within VPC for network routing and service mesh"

  tags = {
    Name = "${terraform.workspace} - Control Plane BGP Communication"
  }
}
When to use cidr_ipv4 = var.vpc_cidr_block? Use VPC CIDR when communication needs to happen with:
Multiple different security groups (avoiding many separate rules)
Load balancers or services that don’t have their own security groups
System-level protocols that need broad VPC access
Health checks that come from various AWS services
For example,
# etcd (ports 2379-2380) - Multiple control plane nodes need to communicate
cidr_ipv4 = var.vpc_cidr_block

# kubelet API (port 10250) - Control plane, workers, monitoring all need access
cidr_ipv4 = var.vpc_cidr_block

# kube-scheduler (10259) - Monitoring systems need access
cidr_ipv4 = var.vpc_cidr_block

# BGP (port 179) - Network routing between various nodes
cidr_ipv4 = var.vpc_cidr_block

# NLB health checks (port 6443) - Load balancer health checks
cidr_ipv4 = var.vpc_cidr_block
For Quick Decisions, use cidr_ipv4 = var.vpc_cidr_block when:
“Do multiple types of resources need access?” → YES = VPC CIDR
“Is this a system/infrastructure port?” → YES = VPC CIDR
“Do health checks or monitoring need access?” → YES = VPC CIDR
Use referenced_security_group_id when:
“Is this one specific service talking to another?” → YES = Security Group
“Can I identify exactly who should have access?” → YES = Security Group
“Is this application-level communication?” → YES = Security Group
Expose the return values to be used in other modules.
HCL
# modules/security_groups/outputs.tf

output "bastion_security_group_id" {
  description = "Security group ID for the bastion host - used for SSH access to cluster nodes"
  value = aws_security_group.bastion.id
}

output "control_plane_security_group_id" {
  description = "Security group ID for Kubernetes control plane nodes - manages API server and cluster components"
  value = aws_security_group.control_plane.id
}

output "worker_node_security_group_id" {
  description = "Security group ID for Kubernetes worker nodes - handles application workloads and pod traffic"
  value = aws_security_group.worker_node.id
}
Create a custom module named security_groups and pass it the values returned by the networking module (vpc_id and vpc_cidr_block) so they can be used inside the security group rules.
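A sketch of that wiring in the environment’s root module (the module source paths follow the compute module example shown later):
HCL
# environments/development/main.tf (sketch)
module "networking" {
  source = "../../modules/networking"
}

module "security_groups" {
  source = "../../modules/security_groups"

  # Return values from the networking module
  vpc_id = module.networking.vpc_id
  vpc_cidr_block = module.networking.vpc_cidr_block
}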
EC2 instances are virtual computers that you rent from AWS: servers running in Amazon’s data centers that you can create, configure, and control remotely through code, giving you the flexibility to build infrastructure without buying physical hardware. In this Kubernetes setup, we will use seven EC2 instances for the bastion host, the control planes, and the worker nodes.
Create the variables first.
HCL
variable"public_subnet_cidrs"{type=list(string)description="Public Subnet CIDR values"default=["10.0.1.0/24"]}variable"control_plane_private_ips"{type=list(string)description="List of private IPs for control plane nodes"default=["10.0.2.10","10.0.3.10","10.0.4.10"]}variable"bastion"{description="Configuration for the bastion host used as a secure gateway to access private cluster resources"type=mapdefault={"ami" = "ami-084568db4383264d4""instance_type" = "t3.micro""private_ip" = "10.0.1.10""name" = "Bastion Host"}}variable"common_functions"{description="Configuration for deploying shared utility scripts across all cluster instances"type=anydefault={"source" = "scripts/common-functions.sh""destination" = "/tmp/common-functions.sh""connection" = {"type" = "ssh""user" = "ubuntu""bastion_user" = "ubuntu""timeout" = "30m"# Allow enough time for installation}}}variable"control_plane"{description="Configuration for the primary Kubernetes control plane node including API server, scheduler, and controller manager"type=anydefault={"ami" = "ami-084568db4383264d4""instance_type" = "t3.xlarge""root_block_device" = {"volume_size" = 20"volume_type" = "gp3""delete_on_termination" = true}"init_file" = "scripts/init-control-plane.sh.tmpl""name" = "Control Plane 1"}}variable"wait_for_master_ready"{description="Configuration for the script that waits for the control plane to be fully operational before proceeding with cluster setup"type=mapdefault={"source" = "scripts/wait-for-master.sh.tmpl"}}variable"control_plane_secondary"{description="Configuration for additional control plane nodes to provide high availability for the Kubernetes cluster"type=anydefault={"ami" = "ami-084568db4383264d4"# Replace with a Ubuntu 12 AMI ID"instance_type" = "t3.xlarge""root_block_device" = {"volume_size" = 20"volume_type" = "gp3""delete_on_termination" = true}"init_file" = "scripts/init-control-plane.sh.tmpl""name" = "Control Plane 1"}}variable"worker_nodes"{description="Configuration for Kubernetes worker nodes that run application workloads and pods"type=anydefault={"count" = 3"ami" = "ami-084568db4383264d4""instance_type" = "t3.large""root_block_device" = {"volume_size" = 20"volume_type" = "gp3""delete_on_termination" = true}"init_file" = "scripts/init-worker-node.sh.tmpl""name" = "Worker Node"}}variable"wait_for_workers_to_join"{description="Configuration for the script that waits for all worker nodes to successfully join the Kubernetes cluster"type=mapdefault={"init_file" = "scripts/wait-for-workers.sh.tmpl""log_file" = "/var/log/k8s-wait-for-workers-$(date +%Y%m%d-%H%M%S).log"}}variable"label_worker_nodes"{description="Configuration for applying labels and taints to worker nodes for workload scheduling and node organization"type=anydefault={"init_file" = "scripts/label-worker-nodes.sh.tmpl""expected_worker_count" = 3}}# FROM Other Modulevariable"vpc_id"{description="VPC ID from AWS module where the Kubernetes cluster will be deployed"type=string}variable"private_subnets"{description="Private subnets from AWS module for deploying worker nodes and internal cluster components"type=any}variable"public_subnets"{description="Public subnets from AWS module for deploying bastion host and load balancers"type=any}variable"bastion_security_group_id"{description="Security group ID for the bastion host allowing SSH access from authorized sources"type=string}variable"control_plane_security_group_id"{description="Security group ID for control plane nodes allowing Kubernetes API and inter-node 
communication"type=string}variable"worker_node_security_group_id"{description="Security group ID for worker nodes allowing pod-to-pod communication and kubelet access"type=string}variable"kubernetes_master_instance_profile"{description="IAM instance profile for control plane nodes with permissions for Kubernetes master operations"type=string}variable"kubernetes_worker_instance_profile"{description="IAM instance profile for worker nodes with permissions for Kubernetes worker operations"type=string}variable"tls_private_key_pem"{description="TLS private key in PEM format for secure communication within the Kubernetes cluster"type=stringsensitive=true}variable"key_pair_name"{description="AWS EC2 key pair name for SSH access to cluster instances"type=string}
Bastion Host Instance
Creates the bastion host EC2 instance (one per public subnet). It serves as the SSH gateway for reaching the private cluster nodes.
Creates a static public IP address (Elastic IP) for each bastion host, giving the bastion a fixed IP that doesn’t change when the instance restarts, so you always know which address to SSH to.
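A sketch of those two resources, consistent with the compute variables and the aws_eip.bastion_eip reference used below (the exact arguments are assumptions):
HCL
# modules/compute/main.tf (sketch)
resource "aws_instance" "bastion" {
  count = length(var.public_subnets)
  ami = var.bastion.ami
  instance_type = var.bastion.instance_type
  subnet_id = var.public_subnets[count.index].id
  private_ip = var.bastion.private_ip # assumes a single public subnet
  key_name = var.key_pair_name
  vpc_security_group_ids = [var.bastion_security_group_id]

  tags = {
    Name = "${terraform.workspace} - ${var.bastion.name}"
  }
}

resource "aws_eip" "bastion_eip" {
  count = length(var.public_subnets)
  domain = "vpc"
  instance = aws_instance.bastion[count.index].id

  tags = {
    Name = "${terraform.workspace} - Bastion EIP"
  }
}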
Copies a script file to the control plane node. Uploads shared utility functions used by other scripts. Uses SSH through bastion host to reach control plane.
HCL
resource "null_resource" "upload_common_functions" {
  depends_on = [null_resource.wait_for_master_ready]

  provisioner "file" {
    source = "${path.module}/${var.common_functions.source}"
    destination = var.common_functions.destination

    connection {
      type = var.common_functions.connection.type
      user = var.common_functions.connection.user
      private_key = var.tls_private_key_pem
      host = aws_instance.control_plane["0"].private_ip
      bastion_host = aws_eip.bastion_eip[0].public_ip
      bastion_user = var.common_functions.connection.bastion_user
      bastion_private_key = var.tls_private_key_pem
    }
  }

  # Make sure the file is executable
  provisioner "remote-exec" {
    inline = [
      "chmod +x /tmp/common-functions.sh",
      "echo 'Common functions uploaded successfully'"
    ]

    connection {
      type = var.common_functions.connection.type
      user = var.common_functions.connection.user
      private_key = var.tls_private_key_pem
      host = aws_instance.control_plane["0"].private_ip
      bastion_host = aws_eip.bastion_eip[0].public_ip
      bastion_user = var.common_functions.connection.bastion_user
      bastion_private_key = var.tls_private_key_pem
    }
  }
}
Control Plane Instance
Creates the primary Kubernetes master node. It runs the API server, scheduler, and controller manager, and lives in a private subnet (protected from the internet).
Creates additional master nodes for high availability: if the primary master fails, these can take over. They are placed in different subnets from the primary master. Key difference: is_first_control_plane = "false" in user_data.
Creates the Kubernetes worker nodes (default: 3). They run the application pods and workloads and are distributed across the private subnets using modulo arithmetic.
HCL
resource "aws_instance" "worker_nodes" {
  count = var.worker_nodes.count
  ami = var.worker_nodes.ami
  instance_type = var.worker_nodes.instance_type
  key_name = var.key_pair_name
  vpc_security_group_ids = [var.worker_node_security_group_id]

  # Use modulo to distribute worker nodes across available subnets
  subnet_id = var.private_subnets[count.index % length(var.private_subnets)].id
  iam_instance_profile = var.kubernetes_worker_instance_profile

  root_block_device {
    volume_size = var.worker_nodes.root_block_device.volume_size
    volume_type = var.worker_nodes.root_block_device.volume_type
    delete_on_termination = var.worker_nodes.root_block_device.delete_on_termination
  }

  user_data = templatefile("${path.module}/${var.worker_nodes.init_file}", {
    common_functions = file("${path.module}/${var.common_functions.source}")
  })

  # Wait for at least the master control plane to be ready
  depends_on = [null_resource.wait_for_master_ready]

  tags = {
    Name = "${terraform.workspace} - ${var.worker_nodes.name}${count.index + 1}"
    Environment = terraform.workspace
    Project = "Kubernetes"
    Role = "worker-node"
    ManagedBy = "Terraform"
    CostCenter = "Infrastructure"
    MonitoringEnabled = "true"
    SubnetType = "private"
    NodeType = "compute"
    WorkloadCapable = "true"
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }

  lifecycle {
    ignore_changes = [tags["CreatedDate"]]
  }
}
Wait for Workers to Join
Runs a script that waits for all worker nodes to join the cluster.
Applies labels to worker nodes to organize them for workload scheduling. Labels nodes with the “worker” role so they display properly in kubectl output instead of showing <none> as their role.
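For reference, the effect is the same as running something like this against each worker node (the node name is a placeholder):
Bash
# Label a worker so kubectl shows a role instead of <none>
kubectl label node <worker-node-name> node-role.kubernetes.io/worker=worker
kubectl get nodes   # the ROLES column now shows "worker"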
Creates an internal (private subnets only) Network Load Balancer for the Kubernetes API server. Its purpose is to distribute API requests across multiple master nodes.
HCL
resource "aws_lb" "k8s_api" {
  name = "k8s-api-lb"
  internal = true
  load_balancer_type = "network"
  subnets = [for subnet in var.private_subnets : subnet.id]

  tags = {
    Name = "${terraform.workspace} - Kubernetes API Load Balancer"
    Environment = terraform.workspace
    Project = "Kubernetes"
    Role = "api-load-balancer"
    Component = "networking"
    Purpose = "kubernetes-api-endpoint"
    ManagedBy = "Terraform"
    CostCenter = "Infrastructure"
    MonitoringEnabled = "true"
    LoadBalancerType = "network"
    Scheme = "internal"
    Protocol = "tcp"
    HighAvailability = "true"
    SecurityLevel = "high"
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }

  lifecycle {
    ignore_changes = [tags["CreatedDate"]]
  }
}
API Target Group
Creates the target group for API server health checks. It defines which servers receive traffic and how to check whether they’re healthy: a TCP connection test every 10 seconds.
Port: 6443 (standard Kubernetes API port)
HCL
resource "aws_lb_target_group" "k8s_api" {
  name = "k8s-api-tg"
  port = 6443
  protocol = "TCP"
  vpc_id = var.vpc_id

  health_check {
    protocol = "TCP"
    port = 6443
    healthy_threshold = 2
    unhealthy_threshold = 2
    interval = 10
  }

  tags = {
    Name = "${terraform.workspace} - Kubernetes API Target Group"
    Environment = terraform.workspace
    Project = "Kubernetes"
    Role = "api-target-group"
    Component = "networking"
    Purpose = "kubernetes-api-health-check"
    ManagedBy = "Terraform"
    CostCenter = "Infrastructure"
    MonitoringEnabled = "true"
    Protocol = "TCP"
    Port = "6443"
    HealthCheck = "enabled"
    ServiceType = "kubernetes-api-server"
    TargetType = "control-plane-nodes"
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }

  lifecycle {
    ignore_changes = [tags["CreatedDate"]]
  }
}
Master Target Group Attachment
Adds the primary master node to the load balancer target group so it receives API traffic through the load balancer.
Configures the load balancer to listen on port 6443, accepting incoming API requests and forwarding all traffic to the target group.
The load balancer then distributes API traffic across all healthy masters, as sketched below.
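A sketch of that attachment and listener (the resource names are assumptions; the load balancer and target group are the ones defined above):
HCL
# modules/compute/main.tf (sketch)
resource "aws_lb_target_group_attachment" "k8s_api_master" {
  target_group_arn = aws_lb_target_group.k8s_api.arn
  target_id = aws_instance.control_plane["0"].id
  port = 6443
}

resource "aws_lb_listener" "k8s_api" {
  load_balancer_arn = aws_lb.k8s_api.arn
  port = 6443
  protocol = "TCP"

  default_action {
    type = "forward"
    target_group_arn = aws_lb_target_group.k8s_api.arn
  }
}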
Create a custom module and name it compute. Pass outputs from other modules.
HCL
# environments/development/main.tf
module "compute" {
  source = "../../modules/compute"

  # Pass AWS resources from the other modules
  private_subnets = module.networking.private_subnets
  public_subnets = module.networking.public_subnets
  bastion_security_group_id = module.security_groups.bastion_security_group_id
  control_plane_security_group_id = module.security_groups.control_plane_security_group_id
  worker_node_security_group_id = module.security_groups.worker_node_security_group_id
  kubernetes_master_instance_profile = module.iam.kubernetes_master_instance_profile
  kubernetes_worker_instance_profile = module.iam.kubernetes_worker_instance_profile
  key_pair_name = module.keypair.key_pair_name
  tls_private_key_pem = module.keypair.tls_private_key_pem
  vpc_id = module.networking.vpc_id

  depends_on = [
    module.iam,
    module.keypair,
    module.networking,
    module.security_groups
  ]
}
Kubernetes Control Planes Installation using Bash Script
This script essentially automates the creation of a highly available Kubernetes cluster on AWS, handling both the initial cluster setup and the addition of subsequent control plane nodes.
1. wait_for_variables() Waits for required environment variables to be available.
Polls for 30 attempts (60 seconds total) checking if control_plane_master_private_ip, control_plane_endpoint, and is_first_control_plane are set
Returns 0 if all variables are available, 1 if timeout occurs
Bash
#!/bin/bash
set -e

# Function: Wait for required environment variables to be available
wait_for_variables() {
  max_attempts=30
  sleep_interval=2
  attempt=1

  while [ $attempt -le $max_attempts ]; do
    # Check if all required variables are set and non-empty
    if [ -n "${control_plane_master_private_ip}" ] && [ -n "${control_plane_endpoint}" ] && [ -n "${is_first_control_plane}" ]; then
      return 0
    fi
    sleep $sleep_interval
    attempt=$((attempt + 1))
  done

  return 1
}

# Wait for variables or exit if timeout
if ! wait_for_variables; then
  exit 1
fi

# Validate required environment variables are set
if [ -z "${control_plane_master_private_ip}" ] || [ -z "${control_plane_endpoint}" ] || [ -z "${is_first_control_plane}" ]; then
  exit 1
fi
2. System Preparation Block. Prepares the system for Kubernetes installation.
Swap Management: Disables swap memory (required by Kubernetes) and comments it out in /etc/fstab to prevent re-enabling on reboot
Network Configuration: Enables IP forwarding by setting net.ipv4.ip_forward = 1 for pod-to-pod communication
Package Updates: Updates system packages with retry logic for reliability
Bash
# SYSTEM PREPARATION

# Disable swap (required for Kubernetes)
swapoff -a

# Permanently disable swap by commenting it out in fstab
sed -i '/ swap / s/^/#/' /etc/fstab

# Enable IP forwarding for pod networking
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF

# Apply sysctl settings without reboot
sysctl --system

# Update package lists with retry logic
for attempt in 1 2 3; do
  if apt-get update; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 10
  fi
done
3. Container Runtime Setup Block. Installs and configures containerd as the container runtime.
Package Installation: Installs essential packages including containerd and security tools
Repository Setup: Adds Docker’s GPG key and repository for containerd installation
Containerd Configuration:
Generates default config file
Enables systemd cgroup driver (required for Kubernetes)
Starts and enables the containerd service
Bash
# CONTAINER RUNTIME SETUP (containerd)

# Install required packages
apt-get install -y ca-certificates curl gnupg lsb-release containerd apt-transport-https unzip

# Create directory for APT keyrings
mkdir -p /etc/apt/keyrings

# Add Docker GPG key (for containerd installation)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Make the key readable
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add Docker repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  noble stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null

# Update package list again after adding repository
for attempt in 1 2 3; do
  if apt-get update; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 10
  fi
done

# Configure containerd
mkdir -p /etc/containerd

# Generate default containerd configuration
containerd config default | tee /etc/containerd/config.toml

# Enable systemd cgroup driver (required for Kubernetes)
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

# Start and enable containerd service
systemctl restart containerd
systemctl enable containerd
6. First Control Plane Node Block (is_first_control_plane = true). Initializes the first control plane node and sets up the cluster.
Configuration Validation: Validates kubeadm config before cluster initialization
Cluster Initialization: Creates the Kubernetes cluster with specific networking settings
User Setup: Configures kubectl access for the ubuntu user
Control Plane Health Check: Waits up to 150 seconds for the control plane to become responsive
CNI Installation: Installs Calico for pod networking
Certificate Regeneration:
Backs up existing API server certificates
Regenerates certificates to include load balancer DNS as Subject Alternative Name (SAN)
This allows external access through the load balancer
Join Command Generation:
Creates join commands for both worker nodes and additional control planes
Replaces private IP with load balancer DNS for external access
Parameter Store Operations: Stores join commands in AWS Systems Manager for other nodes to retrieve
Bash
# CLUSTER INITIALIZATION OR JOIN
if [ "${is_first_control_plane}" = "true" ]; then
  # FIRST CONTROL PLANE NODE SETUP
  # Validate kubeadm configuration before initialization
  if ! kubeadm config validate --config <(cat <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "${control_plane_master_private_ip}"
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${control_plane_master_private_ip}:6443"
apiServer:
  certSANs:
    - "${control_plane_endpoint}"
networking:
  podSubnet: "192.168.0.0/16"
EOF
  ); then
    exit 1
  fi

  # Initialize Kubernetes cluster
  kubeadm init \
    --control-plane-endpoint "${control_plane_master_private_ip}:6443" \
    --apiserver-advertise-address="${control_plane_master_private_ip}" \
    --upload-certs \
    --pod-network-cidr=192.168.0.0/16 \
    --apiserver-cert-extra-sans "${control_plane_endpoint}"

  # Setup kubeconfig for ubuntu user
  export KUBE_USER=ubuntu
  mkdir -p /home/$KUBE_USER/.kube
  sudo cp -i /etc/kubernetes/admin.conf /home/$KUBE_USER/.kube/config
  sudo chown $KUBE_USER:$KUBE_USER /home/$KUBE_USER/.kube/config

  # Wait for control plane to become responsive
  control_plane_ready=false
  for i in {1..30}; do
    if KUBECONFIG=/etc/kubernetes/admin.conf kubectl get nodes &>/dev/null; then
      control_plane_ready=true
      break
    fi
    sleep 5
  done
  if [ "$control_plane_ready" = false ]; then
    exit 1
  fi

  # Install Calico CNI (Container Network Interface)
  for attempt in 1 2 3; do
    if KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 10
    fi
  done

  # CERTIFICATE REGENERATION FOR LOAD BALANCER
  # Backup existing certificates
  if [ ! -f /etc/kubernetes/pki/apiserver.crt ]; then
    exit 1
  fi
  sudo mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.crt.bak
  sudo mv /etc/kubernetes/pki/apiserver.key /etc/kubernetes/pki/apiserver.key.bak

  # Create configuration for certificate regeneration with load balancer DNS
  cat <<EOF | sudo tee /root/kubeadm-dns.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${control_plane_endpoint}:6443"
apiServer:
  certSANs:
    - "${control_plane_endpoint}"
    - "${control_plane_master_private_ip}"
EOF

  # Regenerate API server certificates with load balancer DNS as SAN
  sudo kubeadm init phase certs apiserver --config /root/kubeadm-dns.yaml

  # Restart kubelet to pick up new certificates
  sudo systemctl restart kubelet

  # JOIN COMMAND GENERATION
  # Generate join command for worker nodes
  JOIN_COMMAND=$(kubeadm token create --print-join-command 2>/dev/null)
  if [ -z "$JOIN_COMMAND" ]; then
    exit 1
  fi

  # Generate certificate key for control plane nodes
  CERT_KEY=$(sudo kubeadm init phase upload-certs --upload-certs 2>/dev/null | tail -n1)
  if [ -z "$CERT_KEY" ]; then
    exit 1
  fi

  # Create control plane join command
  CONTROL_PLANE_JOIN_COMMAND="$JOIN_COMMAND --control-plane --certificate-key $CERT_KEY"
  WORKER_NODE_JOIN_COMMAND="$JOIN_COMMAND"

  # Replace private IP with load balancer DNS in join commands
  JOIN_COMMAND_WITH_DNS=$(echo "$CONTROL_PLANE_JOIN_COMMAND" | sed "s/${control_plane_master_private_ip}:6443/${control_plane_endpoint}:6443/g")
  WORKER_NODE_JOIN_COMMAND_WITH_DNS=$(echo "$WORKER_NODE_JOIN_COMMAND" | sed "s/${control_plane_master_private_ip}:6443/${control_plane_endpoint}:6443/g")

  # Store join commands in AWS Systems Manager Parameter Store
  for attempt in 1 2 3; do
    if aws ssm put-parameter \
      --name "/k8s/control-plane/join-command" \
      --value "$JOIN_COMMAND_WITH_DNS" \
      --type "SecureString" \
      --overwrite \
      --region "us-east-1" \
      --cli-connect-timeout 10 \
      --cli-read-timeout 30; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 10
    fi
  done

  for attempt in 1 2 3; do
    if aws ssm put-parameter \
      --name "/k8s/worker-node/join-command" \
      --value "$WORKER_NODE_JOIN_COMMAND_WITH_DNS" \
      --type "SecureString" \
      --overwrite \
      --region "us-east-1" \
      --cli-connect-timeout 10 \
      --cli-read-timeout 30; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 10
    fi
  done
else
  ...
7. Additional Control Plane Node Block (is_first_control_plane = false), inside the else statement. Joins additional control plane nodes to the existing cluster
Command Retrieval: Retrieves the control plane join command from AWS Parameter Store with retry logic
Cluster Join: Executes the join command to add this node as an additional control plane
Configuration Update: Updates the local kubeconfig to use the load balancer endpoint instead of the first node’s IP
User Setup: Configures kubectl access for the ubuntu user so that its kubeconfig reflects the load balancer endpoint
Bash
else
  # ADDITIONAL CONTROL PLANE NODE SETUP
  # Wait before retrieving join command
  sleep 120

  # Retrieve join command from AWS Systems Manager Parameter Store
  for attempt in 1 2 3; do
    JOIN_CMD=$(aws ssm get-parameter \
      --region us-east-1 \
      --name "/k8s/control-plane/join-command" \
      --with-decryption \
      --query "Parameter.Value" \
      --output text \
      --no-cli-pager \
      --cli-read-timeout 30 \
      --cli-connect-timeout 10 2>/dev/null)
    if [ $? -eq 0 ] && [ -n "$JOIN_CMD" ] && [[ "$JOIN_CMD" != *"error"* ]] && [[ "$JOIN_CMD" != "None" ]]; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 20
    fi
  done

  # Join the existing cluster as additional control plane
  if eval "sudo $JOIN_CMD"; then
    : # Success
  else
    exit 1
  fi

  # Update kubeconfig to use load balancer endpoint
  if [ -f /etc/kubernetes/admin.conf ]; then
    sudo sed -i "s|https://${control_plane_master_private_ip}:6443|https://${control_plane_endpoint}:6443|g" /etc/kubernetes/admin.conf

    # Setup kubeconfig for ubuntu user
    export KUBE_USER=ubuntu
    mkdir -p /home/$KUBE_USER/.kube
    sudo cp -i /etc/kubernetes/admin.conf /home/$KUBE_USER/.kube/config
    sudo chown $KUBE_USER:$KUBE_USER /home/$KUBE_USER/.kube/config
  else
    exit 1
  fi
fi
Key Design Patterns:
Retry Logic: Most network operations include retry mechanisms for reliability
Conditional Execution: The script branches based on whether this is the first control plane node
Error Handling: Uses set -e to exit on any command failure
High Availability: Configures the cluster to use a load balancer endpoint for external access
Security: Uses proper certificate management and secure parameter storage
Idempotency: Many operations are designed to be safely re-runnable
Kubernetes Master Control Plane Wait Script
This script is typically used where you need to:
Wait for a newly created Kubernetes master node to become fully operational
Verify the installation completed successfully before proceeding with additional configuration
Ensure the cluster is ready to accept worker nodes or workload deployments
Provide debugging information if the setup fails
The script essentially acts as a “health check” that confirms a Kubernetes control plane is not just installed, but fully ready for use.
1. Cloud-Init Completion Wait Block. Waits for the cloud-init process to complete before proceeding.
Timeout Protection: Uses a 20-minute timeout (1200 seconds) to prevent infinite waiting
Status Monitoring: Continuously polls cloud-init status every 30 seconds
Success Detection: Looks for “done” status indicating successful completion
Error Handling: If status contains “error”, displays detailed error information and exits
Bash
#!/bin/bash
# Wait for Kubernetes master control plane to be ready
set -e

# CLOUD-INIT COMPLETION WAIT
# Wait for cloud-init to finish (up to 20 minutes)
timeout 1200 bash -c '
  while true; do
    status=$(sudo cloud-init status 2>/dev/null || echo "unknown")
    if [[ "$status" == *"done"* ]]; then
      break
    elif [[ "$status" == *"error"* ]]; then
      sudo cloud-init status --long 2>&1
      exit 1
    else
      sleep 30
    fi
  done
'
2. Installation Verification Block. Verifies that the Kubernetes installation completed successfully.
Success Log Check: Looks for /var/log/k8s-install-success.txt as proof of successful installation
Success Path: If found, displays the last 10 lines of the success log
Error Log Check: If no success log, checks for /var/log/k8s-install-error.txt
Error Path: If error log exists, displays its contents and exits with failure
Fallback: If neither log exists, shows cloud-init output for debugging and exits
Bash
# INSTALLATION VERIFICATION
# Verify Kubernetes installation completed successfully
if [ -f /var/log/k8s-install-success.txt ]; then
  # Installation success log found
  tail -10 /var/log/k8s-install-success.txt
else
  # No success log found, check for errors
  if [ -f /var/log/k8s-install-error.txt ]; then
    # Error log found - installation failed
    cat /var/log/k8s-install-error.txt
    exit 1
  else
    # No error log either, check cloud-init output
    sudo tail -50 /var/log/cloud-init-output.log
    exit 1
  fi
fi
3. Filesystem Verification Block. Inspects the filesystem to verify expected files and directories exist. Provides diagnostic information about what files were created during installation.
Home Directory Check: Lists contents of /home/ubuntu/ directory
Kube Directory Check: Checks for /home/ubuntu/.kube/ directory (user kubectl config)
Kubernetes Directory Check: Checks for /etc/kubernetes/ directory (system configs)
Non-Fatal: Uses error suppression (2>/dev/null) since some directories might not exist yet
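The filesystem verification block itself isn't reproduced above, so here is a minimal sketch of what it could look like, based only on the checks described in this list (paths match the ones mentioned; the fallback messages are illustrative):
Bash
# FILESYSTEM VERIFICATION (sketch based on the checks described above)
# List the ubuntu user's home directory
ls -la /home/ubuntu/ 2>/dev/null || echo 'No /home/ubuntu directory'

# Check for the user kubectl config directory
ls -la /home/ubuntu/.kube/ 2>/dev/null || echo 'No /home/ubuntu/.kube directory yet'

# Check for the system Kubernetes config directory
ls -la /etc/kubernetes/ 2>/dev/null || echo 'No /etc/kubernetes directory yet'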
4. Kubeconfig Detection Block. Locates and sets up kubectl configuration for cluster access. kubectl requires proper configuration to communicate with the cluster
User Config Priority: First checks for user-specific config at /home/ubuntu/.kube/config
Admin Config Fallback: If user config missing, tries system admin config at /etc/kubernetes/admin.conf
Environment Setup: Sets KUBECONFIG environment variable to point to found config file
Failure Handling: If no config found, lists kubernetes directory contents and exits
Bash
# KUBECONFIG DETECTION
# Check for kubeconfig file and set KUBECONFIG environment variable
if [ -f /home/ubuntu/.kube/config ]; then
  export KUBECONFIG=/home/ubuntu/.kube/config
elif [ -f /etc/kubernetes/admin.conf ]; then
  export KUBECONFIG=/etc/kubernetes/admin.conf
else
  # No kubeconfig found after installation
  ls -la /etc/kubernetes/ 2>/dev/null || echo 'No /etc/kubernetes directory'
  exit 1
fi
5. kubectl Functionality Test Block. Verifies kubectl command-line tool is working properly. Ensures the kubectl tool itself is functional before testing cluster connectivity
Version Check: Runs kubectl version --client to test basic functionality
Binary Verification: Confirms kubectl is installed and accessible
Path Debugging: If kubectl fails, shows where (or if) kubectl is installed and displays PATH
Bash
# KUBECTL FUNCTIONALITY TEST
# Test kubectl client functionality
kubectl version --client 2>&1

if kubectl version --client >/dev/null 2>&1; then
  : # kubectl is working
else
  # kubectl not working
  which kubectl 2>/dev/null || echo 'kubectl not in PATH'
  echo "PATH contents: $PATH"
  exit 1
fi
6. API Server Connectivity Test Block. Tests connectivity to the Kubernetes API server. The API server must be responding before the cluster can be considered ready
Health Endpoint: Uses kubectl get --raw /healthz to test API server health
Timeout Protection: 5-minute timeout (300 seconds) to prevent infinite waiting
Retry Logic: Continuously retries every 10 seconds until success or timeout
Bash
# API SERVER CONNECTIVITY TEST
# Test API server connectivity and readiness
timeout 300 bash -c '
  while ! kubectl get --raw /healthz >/dev/null 2>&1; do
    sleep 10
  done
'
7. System Services Status Check Block. Verifies critical Kubernetes system services are running. These services must be running for the cluster to function properly.
kubelet Status: Checks the Kubernetes node agent service
containerd Status: Checks the container runtime service
Limited Output: Shows only first 10 lines to avoid overwhelming output
Bash
# SYSTEM SERVICES STATUS CHECK
# Check status of critical Kubernetes services
systemctl status kubelet --no-pager 2>&1 | head -10
systemctl status containerd --no-pager 2>&1 | head -10
8. Final Cluster Verification Block. Performs comprehensive cluster functionality tests. Confirms the cluster is not just running, but fully functional
Node Status: Lists all cluster nodes to verify cluster membership
System Pods: Checks status of system pods in kube-system namespace
Pod Verification: Writes pod output to temporary file and displays first 10 entries
Bash
# FINAL CLUSTER VERIFICATION
# Verify cluster is functional
kubectl get nodes 2>&1

# Check system pods status
kubectl get pods -n kube-system --no-headers > /tmp/pods_output 2>&1
head -10 /tmp/pods_output

# SUCCESS - Control plane is ready
Key Design Patterns:
Progressive Validation: Each step builds on the previous one, from basic system readiness to full cluster functionality
Timeout Protection: Critical waits include timeouts to prevent infinite hanging
Graceful Degradation: Provides diagnostic information when things fail
Error Propagation: Uses set -e to exit immediately on any command failure
Comprehensive Testing: Tests multiple layers from file system to cluster API
Kubernetes Worker Nodes Installation using Bash Script
This script essentially automates the process of preparing a server and joining it to an existing Kubernetes cluster as a worker node, handling all the prerequisites and configuration needed for the node to participate in the cluster and run workloads. For example,
You:"I need 3 more worker nodes for my cluster"AWS/Terraform:"Creating 3 new servers..."ThisScript (on eachserver): "Let me become a worker node..."Script:"Preparing system... Installing container runtime... Installing Kubernetes..."Script:"Getting join command from the managers..."Script:"Joining cluster as worker node..."Script:"SUCCESS! I'm now a worker node ready to run applications!"
What happens after this script runs:
The server becomes a worker node in your Kubernetes cluster
It can now run your applications (pods, containers)
The control plane can schedule work on this node
Your cluster has more capacity to run workloads
1. System Preparation Block. Prepares the system for Kubernetes installation.
Swap Management:
Disables active swap memory (Kubernetes requirement)
Comments out swap entries in /etc/fstab to prevent re-enabling on reboot
Network Configuration:
Enables IP forwarding (net.ipv4.ip_forward = 1) for pod-to-pod communication
Applies network settings immediately without requiring a reboot
Package Updates: Updates system packages with retry logic for network reliability
Bash
#!/bin/bash
set -e

# SYSTEM PREPARATION
# Disable swap (required for Kubernetes)
swapoff -a

# Permanently disable swap by commenting it out in fstab
sed -i '/ swap / s/^/#/' /etc/fstab

# Enable IP forwarding for pod networking
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF

# Apply sysctl settings without reboot
sysctl --system

# Update package lists with retry logic
for attempt in 1 2 3; do
  if apt-get update; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 10
  fi
done
2. Container Runtime Setup Block. Installs and configures containerd as the container runtime.
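The containerd block isn't reproduced in full here. As a rough sketch of what this step typically does on Ubuntu (package names and the systemd cgroup tweak are my assumptions, not the exact script):
Bash
# CONTAINER RUNTIME SETUP (illustrative sketch)
# Install containerd from the distro repositories
apt-get install -y containerd

# Generate the default containerd configuration
mkdir -p /etc/containerd
containerd config default | tee /etc/containerd/config.toml

# Use the systemd cgroup driver, which kubelet expects on Ubuntu
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

# Restart and enable containerd so it survives reboots
systemctl restart containerd
systemctl enable containerd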
5. Cluster Join Process Block. Joins this node to the existing Kubernetes cluster as a worker.
Wait Period: Waits 2 minutes to ensure the control plane has stored the join command in Parameter Store
Command Retrieval:
Retrieves worker node join command from AWS Systems Manager Parameter Store
Uses retry logic with 20-second intervals
Validates the command is not empty, doesn’t contain errors, and isn’t “None”
Accesses the /k8s/worker-node/join-command parameter (different from control plane command)
Cluster Join:
Executes the retrieved join command with sudo privileges
The join command typically looks like: kubeadm join <load-balancer-dns>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
Bash
# CLUSTER JOIN PROCESS
# Wait for join command to be available in Parameter Store
sleep 120

# Retrieve worker node join command from AWS Systems Manager Parameter Store
for attempt in 1 2 3; do
  JOIN_CMD=$(aws ssm get-parameter \
    --region us-east-1 \
    --name "/k8s/worker-node/join-command" \
    --with-decryption \
    --query "Parameter.Value" \
    --output text \
    --no-cli-pager \
    --cli-read-timeout 30 \
    --cli-connect-timeout 10 2>/dev/null)
  if [ $? -eq 0 ] && [ -n "$JOIN_CMD" ] && [[ "$JOIN_CMD" != *"error"* ]] && [[ "$JOIN_CMD" != "None" ]]; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 20
  fi
done

# Execute the join command to add this node as a worker to the cluster
if eval "sudo $JOIN_CMD"; then
  : # Success - node joined cluster
else
  exit 1
fi
Key Differences from Control Plane Script:
Simpler Role: Worker nodes only need to join the cluster, not initialize or manage it
No kubectl: Worker nodes don’t need cluster management tools
No Certificate Management: Workers don’t handle cluster certificates
No CNI Installation: Container networking is managed by control plane
Single Join Command: Uses worker-specific join command from Parameter Store
No Additional Configuration: No need to update configs or generate new commands
Design Patterns:
Retry Logic: Network operations include retry mechanisms for reliability
Parameter Store Integration: Uses AWS SSM to retrieve join commands securely
Error Handling: Uses set -e to exit on any command failure
Validation: Checks command retrieval success before execution
Minimal Installation: Only installs components needed for worker node functionality
Kubernetes Wait for Worker Node Script
This script waits and watches for worker nodes to join a Kubernetes cluster and become ready to run applications. Something like:
You’re organizing a team project:
You’re expecting 3 team members to join you. (EXPECTED_WORKERS = 3)
You’re willing to wait up to 30 minutes for everyone to show up. (TIMEOUT_SECONDS = 1800)
Every 30 seconds, you’ll check to see who’s arrived so far. (CHECK_INTERVAL = 30)
What does the script monitor?
Stage 1: Node Join Detection
Counts how many worker nodes have joined the cluster
Like counting how many people walked into the office
Stage 2: Node Readiness Check
Counts how many worker nodes are ready (not just joined)
Like checking if people have their computers set up and are actually ready to work
When you create a Kubernetes cluster:
Control plane starts first (the “manager” nodes)
Worker nodes join later (the “worker” nodes that run your apps)
You need to wait for all workers to join and be ready before you can deploy applications
For example,
You:"I want 5 worker nodes in my cluster"Terraform:"OK, creating 5 worker nodes..."ThisScript:"I'll wait here and watch for all 5 to join and be ready"*Time passes...*Script:"1 worker joined... 2 workers joined... 3 workers joined..."Script:"All 5 joined! Now waiting for them to be ready..."Script:"Worker 1 ready... Worker 2 ready... All ready!"Script:"SUCCESS! Your cluster is ready to use!"
Without this script, you might try to deploy apps too early and get errors like:
“No nodes available to schedule pods”
“Insufficient resources”
Apps failing because nodes aren’t ready yet
With this script, you know for certain that your cluster is 100% ready before you try to use it. In essence: It’s a “safety check” that prevents you from using a cluster before it’s fully operational.
1. Configuration Setup Block. Initializes the script environment and configuration.
Kubeconfig Export: Sets KUBECONFIG=/home/ubuntu/.kube/config for kubectl access
Variable Assignment: Retrieves configuration from Terraform variables:
EXPECTED_WORKERS: Number of worker nodes expected to join
TIMEOUT_SECONDS: Maximum time to wait for nodes
CHECK_INTERVAL: Time between status checks
LOG_FILE: Path where to save detailed logs
Log File Setup: Creates the log file and makes it writable (chmod 666)
Bash
#!/bin/bash
set -e

# CONFIGURATION SETUP
# Export kubeconfig for kubectl access
export KUBECONFIG=/home/ubuntu/.kube/config

# Configuration from Terraform variables
EXPECTED_WORKERS=${expected_workers}
TIMEOUT_SECONDS=${timeout_seconds}
CHECK_INTERVAL=${check_interval}
LOG_FILE="${log_file}"

# Create and configure log file
sudo touch "$LOG_FILE"
sudo chmod 666 "$LOG_FILE"
2. count_worker_nodes() Function. Counts how many worker nodes have joined the cluster (regardless of readiness). Tracks the joining progress of worker nodes
Node Listing: Gets all nodes without headers using kubectl get nodes --no-headers
Filtering: Excludes control plane nodes by filtering out:
Lines containing “control-plane”
Lines containing “master”
Counting: Uses wc -l to count remaining lines
Error Handling: Returns 0 if kubectl fails
Bash
# WORKER NODE COUNTING FUNCTIONS
# Function to count current worker nodes (joined but may not be ready)
count_worker_nodes() {
  kubectl get nodes --no-headers 2>/dev/null | \
    grep -v control-plane | \
    grep -v master | \
    wc -l || echo 0
}
3. count_ready_worker_nodes() Function. Counts how many worker nodes are both joined AND ready for workloads. Ensures nodes are not just joined but actually functional.
Node Listing: Gets all nodes without headers
Multi-Stage Filtering:
Excludes control plane nodes (same as above)
Additionally filters for “Ready” status using grep Ready
Counting: Counts nodes that pass all filters
Error Handling: Returns 0 if kubectl fails
Bash
# Function to count ready worker nodes (joined and ready for workloads)
count_ready_worker_nodes() {
  kubectl get nodes --no-headers 2>/dev/null | \
    grep -v control-plane | \
    grep -v master | \
    grep Ready | \
    wc -l || echo 0
}
4. Main Wait Loop with Timeout Block. Continuously monitors worker node status until completion or timeout.
4a. Timeout Management
Time Calculation: Calculates start time, end time, and current time in Unix timestamps
Timeout Check: Exits with error if current time exceeds end time
Failure Path: Shows current cluster state and exits with code 1
4b. Status Monitoring
Node Counting: Calls both counting functions to get current status
Time Tracking: Calculates elapsed time and remaining time
Progress Display: Shows comprehensive status including:
Current timestamp
Worker node counts (joined vs ready)
Expected count
Time statistics
4c. Completion Logic
Join Check: Verifies if enough nodes have joined (current_workers >= EXPECTED_WORKERS)
Readiness Check: Verifies if enough nodes are ready (ready_workers >= EXPECTED_WORKERS)
Two-Stage Success:
First celebrates when nodes join
Then waits for them to become ready
Loop Exit: Breaks out of loop only when both conditions are met
4d. Status Display and Wait
Cluster State: Shows current node status with kubectl get nodes
Log Recording: Saves output to the log file
Interval Wait: Sleeps for the configured check interval before next iteration
Bash
# MAIN WAIT LOOP WITH TIMEOUT
# Calculate timeout timestamps
start_time=$(date +%s)
end_time=$((start_time + TIMEOUT_SECONDS))

while true; do
  current_time=$(date +%s)

  # Check if timeout has been reached
  if [ $current_time -gt $end_time ]; then
    echo "TIMEOUT: Worker nodes did not join within $TIMEOUT_SECONDS seconds"
    echo "Current cluster state:"
    kubectl get nodes --no-headers 2>&1 | tee -a "$LOG_FILE" || echo "kubectl failed"
    exit 1
  fi

  # Count current worker nodes
  current_workers=$(count_worker_nodes)
  ready_workers=$(count_ready_worker_nodes)

  # Calculate elapsed and remaining time
  elapsed=$((current_time - start_time))
  remaining=$((end_time - current_time))

  echo "Status check at $(date)"
  echo "Current worker nodes: $current_workers"
  echo "Ready worker nodes: $ready_workers"
  echo "Expected: $EXPECTED_WORKERS"
  echo "Elapsed: $elapsed s, Remaining: $remaining s"

  # Check if we have enough worker nodes joined
  if [ "$current_workers" -ge "$EXPECTED_WORKERS" ]; then
    echo "All $EXPECTED_WORKERS worker nodes have joined the cluster!"
    # Check if they are all ready
    if [ "$ready_workers" -ge "$EXPECTED_WORKERS" ]; then
      echo "All worker nodes are also ready!"
      break
    else
      echo "Worker nodes joined but not all are ready yet. Waiting for readiness..."
    fi
  fi

  # Show current cluster state
  echo "Current cluster state:"
  kubectl get nodes --no-headers 2>&1 | tee -a "$LOG_FILE" || echo "kubectl command failed"

  echo "Waiting $CHECK_INTERVAL seconds before next check..."
  sleep $CHECK_INTERVAL
done
5. Final Status Display Block. Shows completion status and saves final results.
Detailed Output: Uses kubectl get nodes -o wide for comprehensive node information
Log Location: Reminds user where detailed logs are saved
Bash
# FINAL STATUS DISPLAY
# Show detailed final cluster state
echo "Final cluster state:"
kubectl get nodes -o wide 2>&1 | tee -a "$LOG_FILE"

echo "Worker nodes join process completed successfully!"
echo "Log saved to: $LOG_FILE"
Key Design Patterns:
Polling Loop: Continuously checks status at regular intervals
Timeout Protection: Prevents infinite waiting with configurable timeout
Two-Stage Validation: Distinguishes between “joined” and “ready” states
Progress Tracking: Provides detailed status updates during the wait
Comprehensive Logging: Saves detailed information for debugging
This script is typically used in infrastructure automation scenarios where:
Terraform/CloudFormation: Waits for worker nodes to join after provisioning
CI/CD Pipelines: Ensures cluster is fully ready before deploying applications
Cluster Scaling: Verifies new nodes are operational after scaling events
Testing: Confirms cluster readiness in automated testing environments
The script implements a two-stage success model:
Stage 1: Worker nodes join the cluster (appear in kubectl get nodes)
Stage 2: Worker nodes become ready (can schedule and run pods)
This is important because nodes can join a cluster but still be initializing, pulling images, or having network issues that prevent them from being ready for workloads.
The script ensures the cluster is not just numerically complete, but functionally ready for production use.
Kubernetes Worker Node Labelling Script
This script assigns proper “worker” labels to nodes in a Kubernetes cluster so they show up with the correct role instead of <none>.
1. Configuration Setup Block. Sets up the environment and logging.
Kubeconfig Export: Sets up kubectl to access the cluster
Worker Count: Gets expected number of workers from configuration
Log File: Creates a timestamped log file to track what happens
Bash
#!/bin/bash
set -e

# CONFIGURATION SETUP
export KUBECONFIG=/home/ubuntu/.kube/config
EXPECTED_WORKERS=${expected_worker_count}

# Create log file with timestamp
LOG_FILE="/var/log/k8s-worker-labeling-$(date +%Y%m%d-%H%M%S).log"
sudo touch $LOG_FILE
sudo chmod 666 $LOG_FILE
2. Initial Cluster State Check Block. Shows what the cluster looks like before making changes.
Display Nodes: Shows all nodes and their current status
Error Handling: Exits if kubectl doesn’t work
Documentation: Saves the “before” state to the log file
Bash
# INITIAL CLUSTER STATE CHECK
echo 'Current cluster state before labeling:'
kubectl get nodes -o wide 2>&1 | tee -a $LOG_FILE || {
  echo 'FAILED to get nodes'
  exit 1
}
3. Stabilization Wait Block. Gives nodes time to fully initialize. Newly joined nodes might still be initializing.
30-Second Wait: Ensures nodes are fully ready before labeling
Bash
# STABILIZATION WAIT
echo 'Waiting 30 seconds for nodes to stabilize...'
sleep 30
4. Node Discovery Block. Finds all nodes in the cluster
Get Node List: Uses kubectl to get all node names
JSONPath Query: Extracts just the names from the full node information
Bash
# NODE DISCOVERY
# Get all node names in the cluster
node_list=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
echo "All nodes found: $node_list"
5. Labeling Function with Retry Logic Block. Creates a reliable function to label individual nodes. What label_node_with_retry() does:
Readiness Check: Waits up to 60 seconds for the node to be “Ready”
Label Application: Applies the node-role.kubernetes.io/worker=worker label with --overwrite
Retry Logic: Tries up to 3 times, waiting 10 seconds between attempts
Bash
# LABELING FUNCTION WITH RETRY LOGIC
label_node_with_retry() {
  local node="$1"
  local max_attempts=3
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    echo "Attempt $attempt/$max_attempts to label node: $node"

    # Wait for node to be ready
    if kubectl wait --for=condition=Ready node/$node --timeout=60s 2>&1 | tee -a $LOG_FILE; then
      echo "$node is ready, attempting to label..."
      # Apply worker label
      if kubectl label node "$node" node-role.kubernetes.io/worker=worker --overwrite 2>&1 | tee -a $LOG_FILE; then
        echo "SUCCEEDED to label $node as worker"
        return 0
      else
        echo "FAILED to label $node (attempt $attempt)"
      fi
    else
      echo "$node not ready yet (attempt $attempt)"
    fi

    attempt=$((attempt + 1))
    if [ $attempt -le $max_attempts ]; then
      echo "Waiting 10 seconds before retry..."
      sleep 10
    fi
  done

  echo "FAILED to label $node after $max_attempts attempts"
  return 1
}
6. First Labeling Pass Block. Goes through each node and labels appropriate ones as workers.
Check Each Node: Loops through all discovered nodes
Role Detection: Checks if node already has “control-plane” or “master” labels
Skip Control Planes: Leaves management nodes alone
Label Workers: Applies worker label to non-control-plane nodes
Bash
# FIRST LABELING PASS
# Process each node and determine if it should be labeled as worker
for node in $node_list; do
  if [ -n "$node" ]; then
    echo "Processing node: $node"

    # Check if node has control-plane or master role
    node_labels=$(kubectl get node "$node" -o jsonpath='{.metadata.labels}' 2>/dev/null || echo '')
    if echo "$node_labels" | grep -E 'control-plane|master' >/dev/null 2>&1; then
      echo "$node is a control plane node, skipping"
    else
      echo "$node appears to be a worker node"
      label_node_with_retry "$node"
    fi
  fi
done

echo 'First labeling pass completed'
7. Second Labeling Pass Block. Catches any nodes that were missed in the first pass.
Some nodes might have joined after the first pass
Network issues might have caused failures
Find Unlabeled: Looks for nodes with <none> role
Final Attempt: Tries to label any remaining unlabeled nodes
Bash
# SECOND LABELING PASS
# Check for any remaining unlabeled worker nodes
echo 'Checking for any remaining unlabeled nodes...'
unlabeled_nodes=$(kubectl get nodes --no-headers | grep '<none>' | awk '{print $1}' || true)

if [ -n "$unlabeled_nodes" ]; then
  echo "Found unlabeled nodes: $unlabeled_nodes"
  for node in $unlabeled_nodes; do
    echo "Final attempt to label remaining node: $node"
    label_node_with_retry "$node"
  done
else
  echo 'No unlabeled nodes found'
fi
8. Final Verification Block. Confirms the job was completed successfully.
Show Results: Displays final cluster state
Count Check: Counts how many nodes still have <none> role
Success/Failure: Exits with error if any nodes remain unlabeled
Log Information: Tells user where to find detailed logs
Bash
# FINAL VERIFICATION
echo 'Labeling process completed'
echo 'Final cluster state:'
kubectl get nodes -o wide 2>&1 | tee -a $LOG_FILE

# Check if any nodes still remain unlabeled
remaining_unlabeled=$(kubectl get nodes --no-headers | grep '<none>' | wc -l || echo '0')

if [ "$remaining_unlabeled" -gt 0 ]; then
  echo "WARNING: $remaining_unlabeled node(s) still have no role assigned"
  kubectl get nodes --no-headers | grep '<none>' 2>&1 | tee -a $LOG_FILE
  exit 1
else
  echo 'SUCCEEDED: All nodes have roles assigned'
fi

echo "Worker labeling process completed. Full log saved to: $LOG_FILE"
echo "To view the log later, run: sudo cat $LOG_FILE"
Common Logging Functions
The installation scripts above share two helper functions, log_step and check_command, which are injected into each script via ${common_functions}. Here is how they behave during a successful run and during a failure.
During Success
Bash
# Script starts
log_step "1" "Starting installation"       # Success log
apt-get update                             # Run command
check_command "1" "Package update failed"  # Check if it worked
log_step "2" "Packages updated"            # Success log
During Failure
Bash
# Script starts
log_step "1" "Starting installation"  # Success log
some_failing_command                  # This command fails
check_command "1" "Command failed"    # Detects failure, logs error, exits
# Script stops here - never reaches the next step
After running scripts with these functions, you get:
/var/log/k8s-install-success.txt - Contains all successful steps
/var/log/k8s-install-error.txt - Contains any errors that occurred
Why This is Useful:
Debugging: If installation fails, you can check the error log to see exactly what went wrong
Progress Tracking: Success log shows how far the installation got
Automation: Scripts can automatically stop when something fails
Consistency: All scripts use the same logging format
Auditing: Permanent record of what happened during installation
In essence: These functions create a “flight recorder” for your Kubernetes installation, tracking every step and automatically stopping if anything goes wrong.
The ${common_functions} you see at the top of other scripts gets replaced with these functions, so every script has access to this logging toolkit.
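The template itself isn't reproduced in this post, but based on how log_step and check_command are used above and on the log files they write, a minimal sketch of these helpers could look like this (variable names and the exact message format are illustrative):
Bash
# Possible shape of the shared logging helpers (signatures match the usage above)
SUCCESS_LOG=/var/log/k8s-install-success.txt
ERROR_LOG=/var/log/k8s-install-error.txt

log_step() {
    # Record a successful step with a timestamp
    echo "$(date '+%Y-%m-%d %H:%M:%S') [STEP $1] $2" >> "$SUCCESS_LOG"
}

check_command() {
    # Inspect the exit code of the previous command; log and stop on failure
    if [ $? -ne 0 ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') [STEP $1] ERROR: $2" >> "$ERROR_LOG"
        exit 1
    fi
}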
There are several reasons why multi-stage builds produce better containers.
Reduced image size: less disk space is used, and smaller images can also speed up builds, deployments, and container startup.
Separation of build time and runtime: keeping build tools and libraries out of the runtime environment gives you a cleaner, more maintainable setup.
Smaller attack surface: with only the needed libraries in the runtime image, attackers have far fewer unnecessary packages and scripts to exploit.
Let’s use Golang as our first example. We will first create a single-stage build and compare the size to the multi-stage build.
main.go file
Go
packagemainimport("fmt""net/http")funchandler(w http.ResponseWriter, r *http.Request){ fmt.Fprintf(w,"Hello World!\n")}funcmain(){ http.HandleFunc("/", handler)// Start the server on port 80err:= http.ListenAndServe(":80",nil)if err !=nil{ fmt.Println("Error starting server:", err)}}
Create a single-stage build Dockerfile named Dockerfile-single-golang.
Dockerfile
FROM ubuntu:latest
WORKDIR /app
RUN apt update \
    && apt install -y golang
COPY /hello-world-go/main.go .
RUN go mod init hello-world-go && go build -o hello-world-go
EXPOSE 80
CMD ["./hello-world-go"]
Let’s use Ubuntu as the base image, install Go, and build the executable. How did I come up with this? At first, I ran a Dockerfile with only FROM ubuntu:latest as its content, then went inside the container and installed Go manually.
Dockerfile
FROM ubuntu:latest
CMD ["bash"]
I then retrieved everything I had run using the history command and put those commands together to create the Dockerfile above.
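To see the cost of the single-stage approach, you can build the image and check its size. A quick sketch (the image tag here is just illustrative):
Bash
# Build the single-stage image and inspect its size
docker build -f Dockerfile-single-golang -t hello-world-go:single .
docker images hello-world-go:single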
In the Go example, we produced a standalone binary executable. How about PHP? Let’s try installing WordPress. You can find the instructions here: https://wiki.alpinelinux.org/wiki/WordPress
Create a Dockerfile named Dockerfile-single-php. I used the base image php:8.4.3-fpm-alpine3.20 which is lightweight.
Dockerfile
FROM php:8.4.3-fpm-alpine3.20

# Set working directory
WORKDIR /usr/share/webapps/

RUN apk add --no-cache \
    bash \
    lighttpd \
    php82 \
    fcgi \
    php82-cgi \
    wget

# Configure lighttpd to enable FastCGI
RUN sed -i 's|# include "mod_fastcgi.conf"|include "mod_fastcgi.conf"|' /etc/lighttpd/lighttpd.conf && \
    sed -i 's|/usr/bin/php-cgi|/usr/bin/php-cgi82|' /etc/lighttpd/mod_fastcgi.conf

# Download and extract WordPress
RUN wget https://wordpress.org/latest.tar.gz && \
    tar -xzvf latest.tar.gz && \
    rm latest.tar.gz && \
    chown -R lighttpd:lighttpd /usr/share/webapps/wordpress

EXPOSE 9000
CMD ["sh", "-c", "php-fpm & lighttpd -D -f /etc/lighttpd/lighttpd.conf"]
Now create the multi-stage version of the same setup.
Dockerfile
FROM php:8.4.3-fpm-alpine3.20 AS build

# Set working directory
WORKDIR /usr/share/webapps/

RUN apk add --no-cache \
    bash \
    lighttpd \
    php82 \
    fcgi \
    php82-cgi \
    wget

# Configure lighttpd to enable FastCGI
RUN sed -i 's|# include "mod_fastcgi.conf"|include "mod_fastcgi.conf"|' /etc/lighttpd/lighttpd.conf && \
    sed -i 's|/usr/bin/php-cgi|/usr/bin/php-cgi82|' /etc/lighttpd/mod_fastcgi.conf

# Download and extract WordPress
RUN wget https://wordpress.org/latest.tar.gz && \
    tar -xzvf latest.tar.gz && \
    rm latest.tar.gz && \
    chown -R lighttpd:lighttpd /usr/share/webapps/wordpress

FROM php:8.4.3-fpm-alpine3.20 AS final

WORKDIR /usr/share/webapps/

# Copy the prepared files from the build stage
COPY --from=build /usr/share/webapps/ /usr/share/webapps/

# Install only the runtime dependencies
RUN apk add --no-cache \
    lighttpd \
    fcgi \
    php82-cgi

EXPOSE 9000
CMD ["sh", "-c", "php-fpm & lighttpd -D -f /etc/lighttpd/lighttpd.conf"]
In our example, since PHP doesn’t produce an executable binary, we don’t use the scratch image in the final stage. Instead, we reuse the php:8.4.3-fpm-alpine3.20 image and still need to install the required dependencies. As demonstrated, we only install the essential ones.
The size reduction was minimal because PHP is an interpreted language. Unlike compiled languages, PHP cannot run independently—it relies on its runtime environment. PHP executes code line by line at runtime, essentially processing it “on the fly.” Additionally, installing dependencies like MySQL, PDO, or others increases the overall image size.
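If you want to verify the numbers yourself, build both variants and compare them. A sketch, assuming the multi-stage file above was saved as Dockerfile-multi-php (the file and tag names are only examples):
Bash
# Build both PHP images
docker build -f Dockerfile-single-php -t wordpress-php:single .
docker build -f Dockerfile-multi-php -t wordpress-php:multi .

# List the two images side by side to compare their sizes
docker images | grep wordpress-php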
Vagrant and VirtualBox on a Mac M2 Max are a no-go for now because ARM support is not fully available yet, so I decided to install them on my Ubuntu 22 machine instead. I ran into a few hurdles, so I am going to discuss how I managed to fix them.
I have a Jenkins server that I hadn’t touched for a year, and when I started it, I was greeted with two upgrades: one from Jenkins and the other from AWS. I upgraded Amazon Linux to 2023.6.20241212. Everything went fine until I checked Docker and noticed I had the outdated version 25, which I believe is expected based on this information from AWS: https://docs.aws.amazon.com/linux/al2023/release-notes/all-packages-AL2023.6.html
What is a Linux service? It is a program that runs in the background without user interaction, and it usually keeps running even after the computer reboots. An example is backing up a database from a production server every day: you run a background task that dumps the whole database and saves the dump somewhere safe.
For example, I will empty the trash on an Ubuntu desktop every minute.
Create a bash script to empty the trash, name it empty-trash.sh, and save it in /opt.
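The script itself can be very small. Here is a sketch that assumes the default GNOME trash location under ~/.local/share/Trash:
Bash
#!/bin/bash
# /opt/empty-trash.sh - empty the current user's trash
rm -rf ~/.local/share/Trash/files/* ~/.local/share/Trash/info/*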
Using Docker Compose, create a service for the aws-cli image and make sure to add stdin_open: true, tty: true, and command: help so the container doesn’t exit immediately.
Go inside the container and configure the AWS CLI by adding your credentials. This should create the .aws/ directory. To keep this tutorial simple, I opted to configure it inside the container rather than using environment variables.
Bash
docker exec -it sample-aws bash
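Once inside the container, the interactive setup is just aws configure; the region and output format below are only examples:
Bash
# Configure the AWS CLI inside the container
aws configure
# AWS Access Key ID [None]: <your AWS_ACCESS_KEY_ID>
# AWS Secret Access Key [None]: <your AWS_SECRET_ACCESS_KEY>
# Default region name [None]: us-east-1
# Default output format [None]: json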
Note: Make sure your instances have a tag like this to be able to identify them without entering the instance ID.
Create a bash script file named change-instance-state.sh and put it inside the .aws/ directory.
Create a function to be re-used when starting and stopping an instance.
Bash
#!/bin/bash

change_instance_state() {
  action='.StartingInstances[0]'
  status='stopped'
  instances='start-instances'

  if [ $1 == "STOP" ]; then
    action='.StoppingInstances[0]'
    status='running'
    instances='stop-instances'
  fi

  # Get all instances with a status of stopped or running and a tag value of PRODUCTION,
  # then take each instance ID and change its state
  aws ec2 describe-instances \
    --query 'Reservations[*].Instances[*].[State.Name, InstanceId, Tags[0].Value]' \
    --output text | grep ${status} | grep PRODUCTION | awk '{print $2}' | while read line; do
      result=$(aws ec2 ${instances} --instance-ids $line)
      current_state=$(echo "$result" | jq -r "$action.CurrentState.Name")
      previous_state=$(echo "$result" | jq -r "$action.PreviousState.Name")
      echo "Previous State is: ${previous_state} and Current State is: ${current_state}"
  done
}
In my example, I used the aws ec2 describe-instances command to list all EC2 instances, then filtered the output with grep and awk to target only the PRODUCTION instances.
How did I get the previous and current state in the output? Install jq and you can parse the result of the $(aws ec2 ${instances} --instance-ids $line) call. You can skip this part; I only added it so I can see the previous and current state of the instance.
Ask the user whether they want to start or stop the instance, and assign the answer to the USER_OPTION variable.
Bash
read -p 'Do you want to START or STOP the production instance? ' USER_OPTION;
Write an if/else statement that checks the value of the USER_OPTION variable and passes the matching value to the change_instance_state function. The double-caret expansion ${USER_OPTION^^} simply converts the input to uppercase.
Bash
if[[${USER_OPTION^^}=="START"]];thenecho'STARTING THE PRODUCTION SERVER...'change_instance_state"START"elsechange_instance_state"STOP"fi
Lastly, to be able to execute this script from outside the container, you can run it through docker exec on the host.
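A minimal sketch, assuming the compose service is named sample-aws and the script was saved under /root/.aws/ inside the container (both names are assumptions):
Bash
# From the host: run the script inside the container
docker exec -it sample-aws bash /root/.aws/change-instance-state.sh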
I decided to install Jenkins on another server because I don’t want it to mess with my containers, and I will execute different remote scripts from different servers.
Prerequisites: a running Docker and Docker Compose setup, a DNS record for jenkins.yourdomain.com, and some knowledge of SSH keys.
Let’s install Jenkins first by creating a Docker Compose file.
The file pulls the latest Jenkins image from Docker Hub and maps it to port 80, so we don’t need to type https://jenkins.yourdomain.com:8080 in the browser.
Make sure to include restart: always; when you install Jenkins plugins, the service restarts automatically, and without this setting you would lose the connection to the container.
From the Jenkins server, copy the public key and save it to the authorized keys of the remote server.
Bash
# On the Jenkins server
cat ~/.ssh/id_rsa.pub
Go to Dashboard > Manage Jenkins and click on the Plugins section.
Install the Publish Over SSH plugin.
Let’s add the private key and passphrase from the remote server. Click the drop-down next to your name and you will see Credentials.
Click the + Add Credentials blue button.
Choose SSH Username with private key as Kind, Global as Scope, enter any ID or leave it blank, your username, and the Private key and passphrase from the remote server.
To get the Private key, ssh to your remote server and execute the command: cat ~/.ssh/id_rsa. Don’t forget to enter the passphrase if there is any.
Let’s create a Pipeline. On the Dashboard, click + New Item.
Enter the pipeline’s name, choose Pipeline as the type of project, and click OK.
Once saved, scroll down to the Pipeline section. Copy and paste the following script:
Select an instance and click Instance state. Choose Stop Instance.
After the instance has been stopped, go to the Storage tab and click the Volume ID. It will take you to the Volume configuration.
Tick the checkbox next to the Volume ID and click Actions. Select Modify volume.
The previous size of my volume is 30GB. I resized it to 60GB.
Click Modify.
SSH into your instance and run df -h. You can see below that my volume /dev/nvme0n1p1 is now 60GB in size.
My volume didn’t need the partition expanded because it was expanded automatically, but you might find that the partition is still the size of the original volume. For example, if the original volume was 30GB and you resize it to 60GB, the partition may stay at 30GB. To fix that, just run:
Bash
growpart /dev/nvme0n1 1
If it’s already expanded, you’ll get a message saying the partition cannot be grown.
I was changing the password of our RDS instance yesterday and thought that right after modifying the database, I could test it in DBeaver immediately. It turns out it takes a few minutes for the instance to become available again. I tested the connection a couple of times in DBeaver and checked whether the production site would load successfully.
It didn’t. When you’re on WordPress, it just says Error Establishing Database Connection. I scratched my head a couple of times because I had just changed the password and tried to reconnect, yet the site kept saying Error Establishing Database Connection. When I checked the log file, it said:
Basically, the answer is flushing hosts. I went to DBeaver and ran:
SQL
FLUSH HOSTS;
I googled why this happens and found out that if a host keeps failing to connect and exceeds max_connect_errors, MySQL will block that host.
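If you want to check the current threshold before flushing, you can query the variable from any MySQL client. A sketch with placeholder connection details:
Bash
# Check the current max_connect_errors threshold on the RDS instance
mysql -h <your-rds-endpoint> -u admin -p \
  -e "SHOW VARIABLES LIKE 'max_connect_errors';"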
Flushing hosts means that MySQL clears its host cache and unblocks any blocked hosts. I tried to reconnect again and it went fine.