Building cloud infrastructure used to mean spending hours clicking through the AWS console and trying to remember which settings you used last time. I’ve been there – frantically taking screenshots of configurations and keeping messy notes just to recreate environments.
That’s where Terraform comes in. Instead of all that manual work, you write a few configuration files that describe exactly what you want your infrastructure to look like, and Terraform handles the rest.
Here’s what I am going to build:
A VPC
Public subnets for things that need internet access
Private subnets for your databases and sensitive stuff
All the networking pieces to make everything talk to each other
Security groups to keep the bad guys out
3 EC2 instances
What you’ll need before we start:
An AWS account (the free tier covers most of this; an Elastic IP runs roughly $0.005 per hour while allocated, so if you have free AWS credits this won’t hurt)
Terraform installed on your computer
AWS CLI set up with your credentials
About an hour of your time
Let’s set up the environment first. Go to IAM > Create a User > Security Credentials tab > Create Access Key. This gives you an AWS_ACCESS_KEY_ID and an AWS_SECRET_ACCESS_KEY.
Create a docker-compose file that spins up a containerized Terraform environment with the AWS CLI. Use the latest Terraform image, mount the current directory to /workspace, and set stdin_open and tty to true to enable terminal access.
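Here’s a minimal sketch of what that docker-compose.yml could look like. Note that the official hashicorp/terraform image doesn’t bundle the AWS CLI, so treat the image, service name, and credential variables below as assumptions you may need to adjust (for example by building a small custom image that adds the AWS CLI):
Bash
# Write a minimal docker-compose.yml for a containerized Terraform workspace (sketch).
cat > docker-compose.yml <<'EOF'
services:
  terraform-aws:
    image: hashicorp/terraform:latest   # latest Terraform image
    container_name: terraform-aws
    entrypoint: ["sh"]                  # keep a shell instead of the default terraform entrypoint
    working_dir: /workspace
    volumes:
      - .:/workspace                    # mount the current directory to /workspace
    environment:
      - AWS_ACCESS_KEY_ID
      - AWS_SECRET_ACCESS_KEY
    stdin_open: true                    # enable terminal access (docker -i)
    tty: true                           # enable terminal access (docker -t)
EOF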
Run docker-compose up -d to start, then docker exec -it terraform-aws sh to access the container.
Bash
docker-compose up -d
docker exec -it terraform-aws sh
And this is the Terraform project structure, module by module:
Key Pairs
Key Pairs are cryptographic SSH keys used to securely authenticate to and access EC2 instances without passwords. Think of them as the ID badges that let you into the workstations in your office building: even if someone breaks into the building, they can’t access your employees’ workstations without the right badge. It’s important to note that if you lose the private key file, you lose SSH access to that EC2 instance permanently unless you have set up an alternative access method.
Generates a 4096-bit RSA private/public key pair in memory.
A VPC is a logically isolated section of a cloud provider’s network where you can launch and manage your cloud resources in a virtual network that you define and control. VPCs are fundamental to cloud architecture because they give you the network foundation needed to build secure and scalable applications while maintaining control over your network environment. Think of it like renting an entire floor of a skyscraper: you decide how to divide it into rooms (subnets), who can access each room (security groups), and whether a room has windows to the outside world (internet access) or is an interior office (a private subnet). It’s essentially the cloud provider saying, “here’s your own private piece of the internet where you can build whatever you need.”
A Public Subnet has a route to an Internet Gateway. Think of it as the building’s main entrance that connects your floor directly to the street: any server you put in a public subnet can receive traffic directly from the internet, just like people on the street can see and walk up to those street-facing conference rooms. Note that just because a room has windows doesn’t mean anyone can walk in; you still have security (security groups and firewalls) controlling exactly who can enter and what they can do.
Defines an input variable that accepts a list of subnet CIDR blocks, defaulting to one subnet (10.0.1.0/24).
A Private Subnet is like the interior offices and back rooms: no windows to the street and no way to be reached directly from the outside world. It has no direct route to the Internet Gateway, so these servers can’t receive traffic directly from the internet, just like people on the street can’t walk directly into your back offices. For a private subnet to reach the outside world, its traffic must go through a NAT Gateway, which we’ll come back to later. Even if someone breaks through your perimeter security, your most critical systems (databases and internal applications) sit in these windowless back rooms, where they are much harder to reach and attack directly.
Defines an input variable for private subnet CIDR blocks, defaulting to two subnets (10.0.2.0/24 and 10.0.3.0/24).
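A minimal sketch of those two variables (the names match the networking module shown later; the defaults are the ones above):
HCL
# modules/networking/variables.tf (sketch)
variable "public_subnet_cidrs" {
  type = list(string)
  description = "Public Subnet CIDR values"
  default = ["10.0.1.0/24"]
}

variable "private_subnet_cidrs" {
  type = list(string)
  description = "Private Subnet CIDR values"
  default = ["10.0.2.0/24", "10.0.3.0/24"]
}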
An Internet Gateway is like the main building entrance and lobby: a single point where your office floor connects to the outside world, i.e. the internet. Simply put, it’s an AWS-managed component that connects your VPC to the internet. Without an Internet Gateway, your VPC has no internet connectivity at all; it’s a completely isolated network.
Creates an internet gateway and attaches it to the VPC to enable internet access.
Route Tables are network routing rules that determine where to send traffic based on destination IP addresses. When your web server wants to download updates from the internet, it checks its route table: “to reach 0.0.0.0/0, go to the Internet Gateway,” and sends the traffic there.
Security Groups are virtual firewalls that control inbound and outbound traffic at the instance level (EC2, RDS, and so on). Unlike the building’s main security, which controls who can enter each room or area, security groups are like individual bodyguards that stick with specific employees wherever they go. They are also stateful: if someone approaches your employee and starts a conversation (the inbound request) and the bodyguard approves it, the bodyguard automatically allows your employee to respond (the outbound response).
The Security Groups module accepts two return values from the Networking module, vpc_id and vpc_cidr_block, so create variables for them.
Just like you hire different types of employees with different skills and assign them to different rooms in your office, EC2 instances are virtual computers that you hire from AWS to perform specific tasks. EC2 instances are your actual workforce: the virtual machines doing the real work in your cloud infrastructure, just like employees doing real work in your office building.
Defines the private IP address for the public instance, defaulting to 10.0.1.10.
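As a sketch, that variable could be declared like this (the variable name here is an assumption):
HCL
variable "public_instance_private_ip" {
  type = string
  description = "Private IP address for the public EC2 instance"
  default = "10.0.1.10"
}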
Creates the compute module, passing networking details, security group IDs, and key pair information from other modules, and waits for all dependencies to complete first.
This is one of the hard ways to install and run Kubernetes, and I recommend it for learning purposes rather than production use. For production, Amazon EKS (Elastic Kubernetes Service) gives you a managed control plane instead of setting everything up yourself like in this tutorial.
First, the VPC, subnets, security groups, key pairs, SSM parameters, IAM roles, Network Load Balancer, and EC2 instances (1 bastion host, 3 control planes, 3 worker nodes) need to be set up before Kubernetes can be installed through Bash scripting.
Terraform will be used to spin up the AWS resources. It is an infrastructure-as-code tool that lets you create AWS resources without provisioning them manually with mouse clicks.
Key Pairs are a secure authentication method for accessing EC2 instances via SSH. Each key pair consists of a public key and a private key.
Public Key: Gets installed on EC2 instances during launch
Private Key: Stays on your local machine (like a secret password)
Create the Key Pairs first, name it as terraform-key-pair.pem and save it locally.
HCL
# modules/keypair/main.tf

# Generate an RSA key pair
resource "tls_private_key" "private" {
  algorithm = "RSA"
  rsa_bits = 4096
}

# Create an AWS key pair using the generated public key
resource "aws_key_pair" "generated_key" {
  key_name = "terraform-key-pair"
  public_key = tls_private_key.private.public_key_openssh
}

# Save the private key locally
resource "local_file" "private_key" {
  content = tls_private_key.private.private_key_pem
  filename = "${path.root}/terraform-key-pair.pem"
}
Expose the return values of the above code using outputs so they can be used in other modules.
HCL
# modules/keypair/outputs.tf

output "key_pair_name" {
  description = "Name of the AWS key pair for SSH access to EC2 instances"
  value = aws_key_pair.generated_key.key_name
}

output "tls_private_key_pem" {
  description = "Private key in PEM format for SSH access - keep secure and do not expose"
  value = tls_private_key.private.private_key_pem
  sensitive = true
}
IAM (Identity and Access Management) is AWS’s security system: it controls who can do what in an AWS account. IAM does two things: authentication (who are you?) and authorization (what can you do?).
Authentication
- Users: Individual people (e.g., developers)
- Roles: Temporary identities for services/applications
- Groups: Collections of users with similar permissions

Authorization
- Policies: Rules that define permissions
- Permissions: Specific actions allowed/denied
Example IAM Users:
John (Developer) -> Can create EC2 instances but not delete them
Sarah (Admin) -> Can do everything
CI/CD System -> Can deploy applications but not manage billing
Example IAM Roles:
EC2 Instance Role -> Can read from S3 buckets
Lambda Function Role -> Can write to DynamoDB
Kubernetes Node Role -> Can join cluster and pull images
Generates a random identifier that will be consistent across Terraform runs. It’s like creating a unique “serial number” for your cluster. This ensures all resources are uniquely named and belong to the same cluster deployment. What it does:
Creates: A random 4-byte (32-bit) identifier
Formats: Usually displayed as hexadecimal (like a1b2c3d4)
Persistence: Same value every time you run terraform apply (unless you destroy and recreate)
It is used in aws_iam_role.kubernetes_master, aws_iam_instance_profile.kubernetes_master, aws_iam_role.kubernetes_worker, aws_iam_instance_profile.kubernetes_worker.
This is used to avoid naming conflicts. When multiple people or environments deploy the same Terraform code, IAM resources need unique names because:
IAM names must be unique within an AWS account
Multiple deployments would conflict without unique identifiers
Easy identification of which resources belong to which cluster
With consistent random suffixes in place you get:
No Conflicts: Multiple developers/environments can deploy simultaneously
Easy Cleanup: All resources for one cluster have the same suffix
Clear Ownership: Can identify which resources belong to which deployment
Testing: Can deploy multiple test environments without conflicts
HCL
resource "random_id" "cluster" {
  byte_length = 4
}
Control Plane Master IAM Setup
Master Role – Identity for control plane nodes
This creates an IAM role that EC2 instances can assume to get AWS permissions. The assume_role_policy is a trust policy that says “only EC2 instances can use this role” – it controls WHO can assume the role, not WHAT they can do. The actual permissions (like accessing S3 or Parameter Store) are added later by attaching separate IAM policies to this role.
HCL
# modules/iam/main.tf
resource "aws_iam_role" "kubernetes_master" {
  name = "kubernetes-master-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Name = "${terraform.workspace} - Kubernetes Master Role"
    Description = "IAM role for Kubernetes control plane nodes with AWS API permissions"
    Purpose = "Kubernetes Control Plane"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
    Project = "Kubernetes"
    NodeType = "Control Plane"
    Service = "EC2"
  }
}
Master Instance Profile – Attaches role to EC2.
This creates an IAM instance profile that acts as a bridge between EC2 instances and IAM roles. The instance profile gets attached to EC2 instances and allows them to assume the specified IAM role to obtain temporary AWS credentials. Think of it as the mechanism that lets EC2 instances “wear” the IAM role – without an instance profile, EC2 instances cannot access AWS APIs because they have no way to authenticate or assume roles.
HCL
# modules/iam/main.tf
resource "aws_iam_instance_profile" "kubernetes_master" {
  name = "kubernetes-master-profile-${random_id.cluster.hex}"
  role = aws_iam_role.kubernetes_master.name

  tags = {
    Name = "${terraform.workspace} - Kubernetes Control Plane Instance Profile"
    Description = "Instance profile for control plane nodes - enables AWS API access for cluster management"
    Purpose = "Kubernetes Control Plane"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
Master SSM Policy – Parameter store permissions
This policy gives the control plane nodes permission to store and manage cluster secrets in AWS Parameter Store. When the first control plane node sets up the cluster, it creates a “join command” (like a password) and stores it in AWS Parameter Store so other nodes can retrieve it and join the cluster. The policy restricts access to only parameters that start with /k8s/ for security.
What control plane can do:
PutParameter: Store cluster join command and tokens
GetParameter: Read existing cluster info
DeleteParameter: Clean up old/expired tokens
DescribeParameters: List available parameters
HCL
# modules/iam/main.tf
# SSM parameter access policy for Kubernetes control plane - allows storing/retrieving cluster join tokens
resource "aws_iam_role_policy" "kubernetes_master_ssm" {
  name = "kubernetes-master-ssm-policy"
  role = aws_iam_role.kubernetes_master.id

  policy = jsonencode({
    # Policy grants control plane full access to SSM parameters under /k8s/ namespace
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ssm:PutParameter",      # Store cluster join command with tokens and CA cert hash
          "ssm:GetParameter",      # Retrieve existing parameters for validation
          "ssm:DeleteParameter",   # Clean up expired or invalid join tokens
          "ssm:DescribeParameters" # List and discover available k8s parameters
        ]
        # Restrict access to only k8s namespace parameters for security
        Resource = "arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
      }
    ]
  })
}
Worker Nodes IAM Setup
Worker Role – Identity for worker nodes
This creates an IAM role specifically for worker node EC2 instances. The assume_role_policy is a trust policy that allows only EC2 instances to assume this role and get AWS credentials. This role will later have policies attached that give worker nodes the specific permissions they need (like pulling container images, managing storage volumes, and handling pod networking) – but this just creates the empty role container that worker nodes can use.
HCL
# modules/iam/main.tf
resource "aws_iam_role" "kubernetes_worker" {
  name = "kubernetes-worker-profile-${random_id.cluster.hex}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Name = "${terraform.workspace} - Kubernetes Worker Role"
    Description = "IAM role for Kubernetes worker nodes with permissions for pod networking, storage, and container operations"
    Purpose = "Kubernetes Worker Nodes"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
Worker Instance Profile – Attaches role to EC2
This creates an IAM instance profile that acts as a bridge between worker node EC2 instances and the worker IAM role. The instance profile gets attached to worker EC2 instances and allows them to assume the kubernetes_worker role to obtain AWS credentials. This enables worker nodes to access AWS APIs for tasks like pulling container images, managing EBS volumes, and configuring networking – without it, worker nodes couldn’t authenticate with AWS services.
HCL
# modules/iam/main.tf
resource "aws_iam_instance_profile" "kubernetes_worker" {
  name = "kubernetes-worker-profile"
  role = aws_iam_role.kubernetes_worker.name

  tags = {
    Name = "${terraform.workspace} - Kubernetes Worker Instance Profile"
    Description = "Instance profile for worker nodes - enables AWS API access for container operations and networking"
    Purpose = "Kubernetes Worker Nodes"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
Worker SSM Policy – Read-only parameter access
This creates an IAM policy that gets attached to the worker role, giving worker nodes read-only access to AWS Parameter Store. It allows worker nodes to retrieve the cluster join command that was stored by the control plane, but restricts access to only parameters under the /k8s/ path for security. This is how worker nodes get the secret tokens they need to join the existing Kubernetes cluster.
HCL
# modules/iam/main.tf
# Worker node SSM access - read-only permissions to get cluster join command
resource "aws_iam_role_policy" "kubernetes_worker_ssm" {
  name = "kubernetes-worker-ssm-policy"
  role = aws_iam_role.kubernetes_worker.id

  policy = jsonencode({
    # Policy allows worker nodes to read SSM parameters under /k8s/ path
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ssm:GetParameter", # Read join command stored by control plane
          "ssm:GetParameters" # Batch read multiple parameters if needed
        ]
        # Only allow access to k8s namespace parameters
        Resource = "arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
      }
    ]
  })
}
How IAM works:
Control plane starts -> Gets master role
Kubernetes initializes -> Generates join token
Control plane stores the join command in the SSM parameter /k8s/control-plane/join-command
Worker nodes start -> Get worker role
Workers read join command from SSM
Workers join the cluster using the token
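Condensed into AWS CLI calls, that hand-off looks roughly like this (the parameter path matches the SSM module below; the join command value is illustrative):
Bash
# On the first control plane node: store the join command under the /k8s/ namespace
aws ssm put-parameter \
  --name "/k8s/control-plane/join-command" \
  --value "kubeadm join 10.0.1.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>" \
  --type "SecureString" \
  --overwrite

# On a joining node: read the join command back (needs only ssm:GetParameter)
aws ssm get-parameter \
  --name "/k8s/control-plane/join-command" \
  --with-decryption \
  --query "Parameter.Value" \
  --output text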
Expose the return values to be used in other modules.
HCL
# modules/iam/outputs.tf

output "kubernetes_master_instance_profile" {
  description = "IAM instance profile name for Kubernetes control plane nodes - provides AWS API permissions"
  value = aws_iam_instance_profile.kubernetes_master.name
}

output "kubernetes_worker_instance_profile" {
  description = "IAM instance profile name for Kubernetes worker nodes - provides AWS API permissions for pods and services"
  value = aws_iam_instance_profile.kubernetes_worker.name
}
This SSM parameter provides a secure, automated way for control plane nodes to share fresh join tokens with worker nodes, eliminating manual steps and security risks. You don’t have to SSH into every node just to enter the join command.
HCL
# modules/ssm/main.tf
resource "aws_ssm_parameter" "join_command" {
  name = "/k8s/control-plane/join-command"
  type = "SecureString"
  value = "placeholder-will-be-updated-by-script"
  description = "Kubernetes cluster join command for worker nodes - automatically updated by control plane initialization script"

  lifecycle {
    ignore_changes = [value] # Let the script update the value
  }
}
Name & Path:
HCL
name = "/k8s/control-plane/join-command"
Hierarchical path: Organized under /k8s/ namespace
Specific location: Control plane section for join commands
Matches IAM policy: IAM roles above have access to /k8s/* path
Security Type:
HCL
type = "SecureString"
Encrypted storage: Value is encrypted at rest in AWS
Secure transmission: Encrypted in transit when accessed
Better than plaintext: Protects sensitive cluster tokens
The Join Command Content
What gets stored (after the control plane runs):
# Real example of what replaces the placeholder:
"kubeadm join 10.0.1.10:6443 --token abc123.def456ghi789 --discovery-token-ca-cert-hash sha256:1234567890abcdef..."
Why is ignore_changes needed?
Control plane script updates the value with real join command
Without lifecycle: Terraform would overwrite script’s value back to placeholder
With lifecycle: Terraform ignores value changes, lets script manage it
HCL
lifecycle {
  ignore_changes = [value] # Let the script update the value
}
The networking section creates the foundational network infrastructure for the Kubernetes cluster.
Create the variables first.
HCL
# modules/networking/variables.tf

variable "aws_region" {
  type = map(string)
  description = "AWS region for each environment - maps workspace to region"
  default = {
    "development" = "us-east-1"
    "production" = "us-east-2"
  }
}

variable "public_subnet_cidrs" {
  type = list(string)
  description = "Public Subnet CIDR values for load balancers and internet-facing resources"
  default = ["10.0.1.0/24"]
}

variable "private_subnet_cidrs" {
  type = list(string)
  description = "Private Subnet CIDR values for Kubernetes nodes and internal services"
  default = ["10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
}

variable "azs" {
  type = map(list(string))
  description = "Availability Zones for each environment - ensures high availability across multiple AZs"
  default = {
    "development" = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1f"]
    "production" = ["us-east-2a", "us-east-2b", "us-east-2c", "us-east-2d", "us-east-2f"]
  }
}
VPC (Virtual Private Cloud)
A VPC (Virtual Private Cloud) in Amazon Web Services (AWS) is your own isolated network within the AWS cloud — like a private data center you control.
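The aws_vpc.main resource that the rest of this module references isn’t shown here; a minimal sketch could look like this (the 10.0.0.0/16 CIDR and DNS settings are assumptions chosen to match the subnet CIDRs above):
HCL
# modules/networking/main.tf (sketch)
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  enable_dns_support = true
  enable_dns_hostnames = true

  tags = {
    Name = "${terraform.workspace} - Kubernetes VPC"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}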
A public subnet in AWS is a subnet inside a VPC that can communicate directly with the internet; it is typically used for resources that need to be accessible from outside AWS.
In AWS, a DMZ (Demilitarized Zone) is a subnet or network segment that acts as a buffer zone between the public internet and your private/internal AWS resources. It’s used to host public-facing services while minimizing the exposure of your internal network.
The public subnet contains the bastion host – a dedicated EC2 instance that acts as a secure gateway for accessing private resources. The bastion has a public IP and sits in the public subnet, allowing administrators to SSH into it from the internet, then use it as a stepping stone to securely connect to instances in private subnets that don’t have direct internet access.
HCL
# modules/networking/main.tf
resource "aws_subnet" "public_subnets" {
  count = length(var.public_subnet_cidrs)
  vpc_id = aws_vpc.main.id
  cidr_block = element(var.public_subnet_cidrs, count.index)
  availability_zone = element(var.azs[terraform.workspace], count.index)

  tags = {
    Name = "${terraform.workspace} - Public Subnet ${count.index + 1}"
    Description = "Public subnet for bastion host and load balancers"
    Type = "Public"
    Environment = terraform.workspace
    AvailabilityZone = element(var.azs[terraform.workspace], count.index)
    Purpose = "DMZ"
    ManagedBy = "Terraform"
    Project = "Kubernetes"
    Tier = "DMZ" # Demilitarized Zone
  }
}
Private Subnets (Internal)
A private subnet in AWS is a subnet within your VPC that does NOT have direct access to or from the public internet. It’s used to host internal resources that should remain isolated from external access, such as: Application servers, Databases (e.g., RDS), Internal services (e.g., Redis, internal APIs).
Hosts the Kubernetes control plane and worker nodes. No direct internet access (protected from external access).
HCL
# modules/networking/main.tf
resource "aws_subnet" "private_subnets" {
  count = min(length(var.private_subnet_cidrs), length(var.azs[terraform.workspace]))
  vpc_id = aws_vpc.main.id
  cidr_block = var.private_subnet_cidrs[count.index]
  availability_zone = var.azs[terraform.workspace][count.index] # Ensures 1 AZ per subnet

  tags = {
    Name = "${terraform.workspace} - Private Subnet ${count.index + 1}"
    Description = "Private subnet for Kubernetes worker and control plane nodes"
    Type = "Private"
    Environment = terraform.workspace
    AvailabilityZone = var.azs[terraform.workspace][count.index]
    Purpose = "Kubernetes Nodes"
    ManagedBy = "Terraform"
    Project = "Kubernetes"
    Tier = "Internal"
  }
}
Multi-AZ Distribution: Spreads resources across multiple data centers (High availability). If one AZ fails, others continue running (Fault tolerance).
An Internet Gateway (IGW) in AWS is the component that connects your VPC to the internet. It allows resources in your VPC (like EC2 instances in a public subnet) to send traffic to the internet and receive traffic from it. It is attached to the VPC and referenced by the public route table, which is what enables the bastion host to receive SSH connections. It handles:
Outbound connections (e.g., your EC2 instance accessing a website)
Inbound connections (e.g., users accessing your public web server)
HCL
# modules/networking/main.tf
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${terraform.workspace} - Internet Gateway"
    Purpose = "Internet access for public subnets"
    Description = "Provides internet connectivity for bastion host and load balancers"
    Type = "Gateway"
  }
}
Public route table
A public route table in AWS is a route table associated with one or more public subnets, and it directs traffic destined for the internet to an Internet Gateway (IGW).
Traffic flow: Public subnet -> Internet Gateway -> Internet
HCL
# modules/networking/main.tf
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${terraform.workspace} - Public Route Table"
    Description = "Route table for public subnets - directs traffic to internet gateway"
    Type = "Public"
    Purpose = "Internet routing for DMZ resources"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
    Tier = "DMZ"
    RouteType = "Internet-bound"
    Project = "Kubernetes"
  }
}
Private Route Table
A private route table in AWS is a route table used by private subnets—subnets that do not have direct access to or from the internet.
A private route table does NOT have a route to an Internet Gateway (IGW). Instead, it may have a route to a NAT Gateway or no external route at all, depending on whether you want outbound internet access (e.g., for software updates) or complete isolation.
Traffic flow: Private subnet -> NAT Gateway -> Internet Gateway -> Internet
HCL
# modules/networking/main.tf
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${terraform.workspace} - Private Route Table"
    Description = "Route table for private subnets - directs internet traffic through NAT Gateway"
    Type = "Private"
    Environment = terraform.workspace
    Purpose = "NAT Gateway Routing"
    ManagedBy = "Terraform"
  }
}
Elastic IP for NAT Gateway
A static public IP provides a consistent address and is required for NAT Gateway operation.
What NAT Gateway Does:
Private Subnet (10.0.1.x)-> NAT Gateway -> Internet
It translates private IPs to a public IP for outbound traffic, so it needs a public IP to communicate with the internet on behalf of private resources. Without an EIP, the NAT Gateway won’t work.
Without EIP (Dynamic IP):
Today: NAT uses IP 12.123.45.67
Tomorrow: AWS changes it to 12.234.56.78
Result: External services block your new IP
With EIP (Static IP):
Always: NAT uses IP 54.123.45.67
Result: Consistent external identity
HCL
# modules/networking/main.tf
resource "aws_eip" "nat_eip" {
  domain = "vpc"

  tags = {
    Name = "${terraform.workspace} - NAT Gateway EIP"
    Description = "Elastic IP for NAT Gateway - enables internet access for private subnets"
    Purpose = "NAT Gateway"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }
}
NAT Gateway
Allows private subnets to reach the internet: outbound traffic only, with no inbound connections from the internet. It’s essential for Kubernetes nodes to download images, updates, and so on.
HCL
# modules/networking/main.tf
resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat_eip.id
  subnet_id = aws_subnet.public_subnets[0].id

  tags = {
    Name = "${terraform.workspace} - NAT Gateway"
    Description = "NAT Gateway for private subnet internet access - enables Kubernetes nodes to reach external services"
    Purpose = "Private Subnet Internet Access"
    Environment = terraform.workspace
    ManagedBy = "Terraform"
  }

  depends_on = [aws_internet_gateway.igw]
}
Add a default route to the internet gateway in the public route table.
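A sketch of that route, together with the private-subnet default route through the NAT Gateway implied by the private route table above and the subnet associations (the resource names here are assumptions):
HCL
# modules/networking/main.tf (sketch)

# Default route: public subnets -> Internet Gateway
resource "aws_route" "public_internet_access" {
  route_table_id = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = aws_internet_gateway.igw.id
}

# Default route: private subnets -> NAT Gateway
resource "aws_route" "private_nat_access" {
  route_table_id = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id = aws_nat_gateway.nat.id
}

# Associate each subnet with its route table
resource "aws_route_table_association" "public" {
  count = length(aws_subnet.public_subnets)
  subnet_id = aws_subnet.public_subnets[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(aws_subnet.private_subnets)
  subnet_id = aws_subnet.private_subnets[count.index].id
  route_table_id = aws_route_table.private.id
}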
Expose the return values to be used in other modules.
HCL
# modules/networking/outputs.tf

output "vpc_id" {
  description = "ID of the VPC for the Kubernetes cluster"
  value = aws_vpc.main.id
}

output "vpc_cidr_block" {
  description = "CIDR block of the VPC for security group rules and network configuration"
  value = aws_vpc.main.cidr_block
}

output "private_subnets" {
  description = "Private subnets for Kubernetes worker nodes and internal services"
  value = aws_subnet.private_subnets
}

output "public_subnets" {
  description = "Public subnets for load balancers, bastion hosts, and internet-facing resources"
  value = aws_subnet.public_subnets
}
The security groups module creates the network security rules (firewalls) for the Kubernetes cluster. Together they form a layered defense where each Kubernetes component can communicate as needed while unauthorized access from the internet is blocked.
# modules/security_groups/variables.tf

// FROM Other Module
variable "vpc_id" {
  description = "VPC ID from AWS module"
  type = string
}

variable "vpc_cidr_block" {
  description = "CIDR block of the VPC for internal network communication"
  type = string
}
1. Bastion Security Group: Creates a firewall group for the bastion host.
HCL
# modules/security_groups/main.tf
resource "aws_security_group" "bastion" {
  name = "bastion-sg"
  vpc_id = var.vpc_id
  description = "Security group for the bastion host"

  tags = {
    Name = "${terraform.workspace} - Bastion Host SG"
  }
}
2. Bastion SSH from Internet: Allows SSH connections to bastion host from anywhere on the internet.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "bastion_ssh_anywhere" {
  security_group_id = aws_security_group.bastion.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow SSH access to bastion host from any IP address"

  tags = {
    Name = "${terraform.workspace} - Bastion SSH Internet Access"
  }
}
3. Bastion SSH to Control Plane: Allows bastion host to SSH to control plane nodes.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "bastion_egress_control_plane" {
  security_group_id = aws_security_group.bastion.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.control_plane.id
  description = "Allow SSH from bastion host to Kubernetes control plane nodes for cluster administration"

  tags = {
    Name = "${terraform.workspace} - Bastion SSH to Control Plane"
  }
}
4. Bastion SSH to Workers: Allows bastion host to SSH to worker nodes.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "bastion_egress_workers" {
  security_group_id = aws_security_group.bastion.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.worker_node.id
  description = "Allow SSH from bastion host to worker nodes for maintenance and troubleshooting"

  tags = {
    Name = "${terraform.workspace} - Bastion SSH to Worker Nodes"
  }
}
5. Control Plane Security Group: Creates a firewall group for Kubernetes master nodes.
HCL
# modules/security_groups/main.tf
resource "aws_security_group" "control_plane" {
  name = "control-plane-sg"
  vpc_id = var.vpc_id
  description = "Security group for the Kubernetes control plane"

  tags = {
    Name = "${terraform.workspace} - Kubernetes Control Plane SG"
  }
}
6. Control Plane SSH Access: Allows SSH to control plane from bastion host only.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_ssh" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.bastion.id
  description = "Allow SSH access to control plane nodes from bastion host for cluster administration"

  tags = {
    Name = "${terraform.workspace} - Control Plane SSH from Bastion"
  }
}
7. Control Plane etcd: Allows etcd database communication. Kubernetes stores all data in etcd.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_etcd" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 2379
  to_port = 2380
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow etcd client and peer communication within VPC for Kubernetes cluster state management"

  tags = {
    Name = "${terraform.workspace} - Control Plane etcd Communication"
  }
}
8. Control Plane kubelet: Allows kubelet API access. Use for monitoring and managing pods.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_self_control_plane" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 10250
  to_port = 10250
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow kubelet API access within VPC for control plane node communication and monitoring"

  tags = {
    Name = "${terraform.workspace} - Control Plane kubelet API"
  }
}
9. Control Plane Scheduler: Allows access to scheduler metrics. Use for health checks and monitoring.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_kube_scheduler" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 10259
  to_port = 10259
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow kube-scheduler metrics and health check access from VPC for cluster monitoring"

  tags = {
    Name = "${terraform.workspace} - Control Plane kube-scheduler"
  }
}
10. Control Plane Controller Manager: Allows access to controller manager metrics. Use for health checks and monitoring.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_kube_controller_manager" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 10257
  to_port = 10257
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow kube-controller-manager metrics and health check access from VPC for cluster monitoring"

  tags = {
    Name = "${terraform.workspace} - Control Plane kube-controller-manager"
  }
}
11. Control Plane All Outbound: Allows the control plane to connect to anything on the internet. Used for downloading updates and calling AWS APIs.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "control_plane_egress_all" {
  security_group_id = aws_security_group.control_plane.id
  ip_protocol = "-1"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow all outbound traffic from control plane for AWS APIs, container registries, and external services"

  tags = {
    Name = "${terraform.workspace} - Control Plane Outbound All"
  }
}
12. Worker Node Security Group: Creates a firewall group for worker nodes.
HCL
# modules/security_groups/main.tf
resource "aws_security_group" "worker_node" {
  name = "worker-node-sg"
  vpc_id = var.vpc_id
  description = "Security group for Kubernetes worker nodes - controls pod and application traffic"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes SG"
  }
}
13. Worker All Outbound: Allows workers to connect to anything on the internet. Used for downloading container images and calling external APIs.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "worker_node_egress_all" {
  security_group_id = aws_security_group.worker_node.id
  ip_protocol = "-1"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow all outbound traffic from worker nodes for container images, application traffic, and AWS services"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes Outbound All"
  }
}
14. Worker SSH Access: Allows SSH to workers from bastion only.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_ssh" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 22
  to_port = 22
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.bastion.id
  description = "Allow SSH access to worker nodes from bastion host for maintenance and troubleshooting"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes SSH from Bastion"
  }
}
15. Worker kubelet API: Allows the control plane to manage worker pods. This is how Kubernetes schedules and monitors pods on workers.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_kubelet_api" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 10250
  to_port = 10250
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.control_plane.id
  description = "Allow control plane access to worker node kubelet API for pod management and monitoring"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes kubelet API"
  }
}
16. Worker kube-proxy: Allows load balancer to check worker health.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_kube_proxy" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 10256
  to_port = 10256
  ip_protocol = "tcp"
  referenced_security_group_id = aws_security_group.elb.id
  description = "Allow load balancer access to kube-proxy health check endpoint on worker nodes"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes kube-proxy"
  }
}
17. Worker NodePort TCP: Allows internet to access applications on workers. Expose web apps and APIs.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_tcp_nodeport_services" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 30000
  to_port = 32767
  ip_protocol = "tcp"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow internet access to Kubernetes NodePort services (TCP 30000-32767) for application traffic"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes NodePort TCP"
  }
}
18. Worker NodePort UDP: Allows internet to access UDP applications on workers. Expose UDP services like DNS and games.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_udp_nodeport_services" {
  security_group_id = aws_security_group.worker_node.id
  from_port = 30000
  to_port = 32767
  ip_protocol = "udp"
  cidr_ipv4 = "0.0.0.0/0"
  description = "Allow internet access to Kubernetes NodePort services (UDP 30000-32767) for application traffic"

  tags = {
    Name = "${terraform.workspace} - Worker Nodes NodePort UDP"
  }
}
19. Control Plane API Health Check: Allows load balancer to check if API server is healthy.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "allow_nlb_health_check" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 6443
  to_port = 6443
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow Network Load Balancer health checks to Kubernetes API server on port 6443"

  tags = {
    Name = "${terraform.workspace} - Control Plane NLB Health Check"
  }
}
20. Control Plane BGP: Allows BGP, used by service meshes and advanced CNI plugins (such as Calico) for routing.
HCL
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "allow_bgp" {
  security_group_id = aws_security_group.control_plane.id
  from_port = 179
  to_port = 179
  ip_protocol = "tcp"
  cidr_ipv4 = var.vpc_cidr_block
  description = "Allow BGP protocol communication within VPC for network routing and service mesh"

  tags = {
    Name = "${terraform.workspace} - Control Plane BGP Communication"
  }
}
When to use cidr_ipv4 = var.vpc_cidr_block? Use VPC CIDR when communication needs to happen with:
Multiple different security groups (avoiding many separate rules)
Load balancers or services that don’t have their own security groups
System-level protocols that need broad VPC access
Health checks that come from various AWS services
For example,
# etcd (ports 2379-2380) - Multiple control plane nodes need to communicate
cidr_ipv4 = var.vpc_cidr_block

# kubelet API (port 10250) - Control plane, workers, monitoring all need access
cidr_ipv4 = var.vpc_cidr_block

# kube-scheduler (10259) - Monitoring systems need access
cidr_ipv4 = var.vpc_cidr_block

# BGP (port 179) - Network routing between various nodes
cidr_ipv4 = var.vpc_cidr_block

# NLB health checks (port 6443) - Load balancer health checks
cidr_ipv4 = var.vpc_cidr_block
For Quick Decisions, use cidr_ipv4 = var.vpc_cidr_block when:
“Do multiple types of resources need access?” → YES = VPC CIDR
“Is this a system/infrastructure port?” → YES = VPC CIDR
“Do health checks or monitoring need access?” → YES = VPC CIDR
Use referenced_security_group_id when:
“Is this one specific service talking to another?” → YES = Security Group
“Can I identify exactly who should have access?” → YES = Security Group
“Is this application-level communication?” → YES = Security Group
Expose the return values to be used in other modules.
HCL
# modules/security_groups/outputs.tf

output "bastion_security_group_id" {
  description = "Security group ID for the bastion host - used for SSH access to cluster nodes"
  value = aws_security_group.bastion.id
}

output "control_plane_security_group_id" {
  description = "Security group ID for Kubernetes control plane nodes - manages API server and cluster components"
  value = aws_security_group.control_plane.id
}

output "worker_node_security_group_id" {
  description = "Security group ID for Kubernetes worker nodes - handles application workloads and pod traffic"
  value = aws_security_group.worker_node.id
}
Create a custom module named security_groups and pass it the values returned by the networking module (vpc_id and vpc_cidr_block) so they can be used inside the security group rules.
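A sketch of that wiring in the environment’s root module (the module source paths follow the compute module example shown later):
HCL
# environments/development/main.tf (sketch)
module "networking" {
  source = "../../modules/networking"
}

module "security_groups" {
  source = "../../modules/security_groups"

  # Return values from the networking module
  vpc_id = module.networking.vpc_id
  vpc_cidr_block = module.networking.vpc_cidr_block
}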
EC2 instances are virtual computers that you rent from AWS: servers running in Amazon’s data centers that you can create, configure, and control remotely through code, giving you the flexibility to build infrastructure without buying physical hardware. In this Kubernetes setup, we will use seven EC2 instances for the bastion host, the control planes, and the worker nodes.
Create the variables first.
HCL
variable"public_subnet_cidrs"{type=list(string)description="Public Subnet CIDR values"default=["10.0.1.0/24"]}variable"control_plane_private_ips"{type=list(string)description="List of private IPs for control plane nodes"default=["10.0.2.10","10.0.3.10","10.0.4.10"]}variable"bastion"{description="Configuration for the bastion host used as a secure gateway to access private cluster resources"type=mapdefault={"ami" = "ami-084568db4383264d4""instance_type" = "t3.micro""private_ip" = "10.0.1.10""name" = "Bastion Host"}}variable"common_functions"{description="Configuration for deploying shared utility scripts across all cluster instances"type=anydefault={"source" = "scripts/common-functions.sh""destination" = "/tmp/common-functions.sh""connection" = {"type" = "ssh""user" = "ubuntu""bastion_user" = "ubuntu""timeout" = "30m"# Allow enough time for installation}}}variable"control_plane"{description="Configuration for the primary Kubernetes control plane node including API server, scheduler, and controller manager"type=anydefault={"ami" = "ami-084568db4383264d4""instance_type" = "t3.xlarge""root_block_device" = {"volume_size" = 20"volume_type" = "gp3""delete_on_termination" = true}"init_file" = "scripts/init-control-plane.sh.tmpl""name" = "Control Plane 1"}}variable"wait_for_master_ready"{description="Configuration for the script that waits for the control plane to be fully operational before proceeding with cluster setup"type=mapdefault={"source" = "scripts/wait-for-master.sh.tmpl"}}variable"control_plane_secondary"{description="Configuration for additional control plane nodes to provide high availability for the Kubernetes cluster"type=anydefault={"ami" = "ami-084568db4383264d4"# Replace with a Ubuntu 12 AMI ID"instance_type" = "t3.xlarge""root_block_device" = {"volume_size" = 20"volume_type" = "gp3""delete_on_termination" = true}"init_file" = "scripts/init-control-plane.sh.tmpl""name" = "Control Plane 1"}}variable"worker_nodes"{description="Configuration for Kubernetes worker nodes that run application workloads and pods"type=anydefault={"count" = 3"ami" = "ami-084568db4383264d4""instance_type" = "t3.large""root_block_device" = {"volume_size" = 20"volume_type" = "gp3""delete_on_termination" = true}"init_file" = "scripts/init-worker-node.sh.tmpl""name" = "Worker Node"}}variable"wait_for_workers_to_join"{description="Configuration for the script that waits for all worker nodes to successfully join the Kubernetes cluster"type=mapdefault={"init_file" = "scripts/wait-for-workers.sh.tmpl""log_file" = "/var/log/k8s-wait-for-workers-$(date +%Y%m%d-%H%M%S).log"}}variable"label_worker_nodes"{description="Configuration for applying labels and taints to worker nodes for workload scheduling and node organization"type=anydefault={"init_file" = "scripts/label-worker-nodes.sh.tmpl""expected_worker_count" = 3}}# FROM Other Modulevariable"vpc_id"{description="VPC ID from AWS module where the Kubernetes cluster will be deployed"type=string}variable"private_subnets"{description="Private subnets from AWS module for deploying worker nodes and internal cluster components"type=any}variable"public_subnets"{description="Public subnets from AWS module for deploying bastion host and load balancers"type=any}variable"bastion_security_group_id"{description="Security group ID for the bastion host allowing SSH access from authorized sources"type=string}variable"control_plane_security_group_id"{description="Security group ID for control plane nodes allowing Kubernetes API and inter-node 
communication"type=string}variable"worker_node_security_group_id"{description="Security group ID for worker nodes allowing pod-to-pod communication and kubelet access"type=string}variable"kubernetes_master_instance_profile"{description="IAM instance profile for control plane nodes with permissions for Kubernetes master operations"type=string}variable"kubernetes_worker_instance_profile"{description="IAM instance profile for worker nodes with permissions for Kubernetes worker operations"type=string}variable"tls_private_key_pem"{description="TLS private key in PEM format for secure communication within the Kubernetes cluster"type=stringsensitive=true}variable"key_pair_name"{description="AWS EC2 key pair name for SSH access to cluster instances"type=string}
Bastion Host Instance
Creates the bastion host EC2 instance (one per public subnet). It serves as the SSH gateway for reaching the private cluster nodes.
Creates a static public IP address (Elastic IP) for each bastion host, giving the bastion a fixed IP that doesn’t change when the instance restarts, so you always know which address to SSH to.
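A sketch of those two resources, consistent with the compute variables and the aws_eip.bastion_eip reference used below (the exact arguments are assumptions):
HCL
# modules/compute/main.tf (sketch)
resource "aws_instance" "bastion" {
  count = length(var.public_subnets)
  ami = var.bastion.ami
  instance_type = var.bastion.instance_type
  subnet_id = var.public_subnets[count.index].id
  private_ip = var.bastion.private_ip # assumes a single public subnet
  key_name = var.key_pair_name
  vpc_security_group_ids = [var.bastion_security_group_id]

  tags = {
    Name = "${terraform.workspace} - ${var.bastion.name}"
  }
}

resource "aws_eip" "bastion_eip" {
  count = length(var.public_subnets)
  domain = "vpc"
  instance = aws_instance.bastion[count.index].id

  tags = {
    Name = "${terraform.workspace} - Bastion EIP"
  }
}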
Copies a script file to the control plane node. Uploads shared utility functions used by other scripts. Uses SSH through bastion host to reach control plane.
HCL
resource "null_resource" "upload_common_functions" {
  depends_on = [null_resource.wait_for_master_ready]

  provisioner "file" {
    source = "${path.module}/${var.common_functions.source}"
    destination = var.common_functions.destination

    connection {
      type = var.common_functions.connection.type
      user = var.common_functions.connection.user
      private_key = var.tls_private_key_pem
      host = aws_instance.control_plane["0"].private_ip
      bastion_host = aws_eip.bastion_eip[0].public_ip
      bastion_user = var.common_functions.connection.bastion_user
      bastion_private_key = var.tls_private_key_pem
    }
  }

  # Make sure the file is executable
  provisioner "remote-exec" {
    inline = [
      "chmod +x /tmp/common-functions.sh",
      "echo 'Common functions uploaded successfully'"
    ]

    connection {
      type = var.common_functions.connection.type
      user = var.common_functions.connection.user
      private_key = var.tls_private_key_pem
      host = aws_instance.control_plane["0"].private_ip
      bastion_host = aws_eip.bastion_eip[0].public_ip
      bastion_user = var.common_functions.connection.bastion_user
      bastion_private_key = var.tls_private_key_pem
    }
  }
}
Control Plane Instance
Creates the primary Kubernetes master node. It runs the API server, scheduler, and controller manager, and lives in a private subnet (protected from the internet).
Creates additional master nodes for high availability: if the primary master fails, these can take over. They are placed in different subnets from the primary master. Key difference: is_first_control_plane = "false" in user_data.
Creates the Kubernetes worker nodes (default: 3). They run the application pods and workloads and are distributed across the private subnets using modulo arithmetic.
HCL
resource "aws_instance" "worker_nodes" {
  count = var.worker_nodes.count
  ami = var.worker_nodes.ami
  instance_type = var.worker_nodes.instance_type
  key_name = var.key_pair_name
  vpc_security_group_ids = [var.worker_node_security_group_id]

  # Use modulo to distribute worker nodes across available subnets
  subnet_id = var.private_subnets[count.index % length(var.private_subnets)].id
  iam_instance_profile = var.kubernetes_worker_instance_profile

  root_block_device {
    volume_size = var.worker_nodes.root_block_device.volume_size
    volume_type = var.worker_nodes.root_block_device.volume_type
    delete_on_termination = var.worker_nodes.root_block_device.delete_on_termination
  }

  user_data = templatefile("${path.module}/${var.worker_nodes.init_file}", {
    common_functions = file("${path.module}/${var.common_functions.source}")
  })

  # Wait for at least the master control plane to be ready
  depends_on = [null_resource.wait_for_master_ready]

  tags = {
    Name = "${terraform.workspace} - ${var.worker_nodes.name}${count.index + 1}"
    Environment = terraform.workspace
    Project = "Kubernetes"
    Role = "worker-node"
    ManagedBy = "Terraform"
    CostCenter = "Infrastructure"
    MonitoringEnabled = "true"
    SubnetType = "private"
    NodeType = "compute"
    WorkloadCapable = "true"
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }

  lifecycle {
    ignore_changes = [tags["CreatedDate"]]
  }
}
Wait for Workers to Join
Runs a script that waits for all worker nodes to join the cluster.
Applies labels to worker nodes to organize them for workload scheduling. Labels nodes with the “worker” role so they display properly in kubectl output instead of showing <none> as their role.
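For reference, the effect is the same as running something like this against each worker node (the node name is a placeholder):
Bash
# Label a worker so kubectl shows a role instead of <none>
kubectl label node <worker-node-name> node-role.kubernetes.io/worker=worker
kubectl get nodes   # the ROLES column now shows "worker"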
Creates an internal (private subnets only) Network Load Balancer for the Kubernetes API server. Its purpose is to distribute API requests across multiple master nodes.
HCL
resource "aws_lb" "k8s_api" {
  name = "k8s-api-lb"
  internal = true
  load_balancer_type = "network"
  subnets = [for subnet in var.private_subnets : subnet.id]

  tags = {
    Name = "${terraform.workspace} - Kubernetes API Load Balancer"
    Environment = terraform.workspace
    Project = "Kubernetes"
    Role = "api-load-balancer"
    Component = "networking"
    Purpose = "kubernetes-api-endpoint"
    ManagedBy = "Terraform"
    CostCenter = "Infrastructure"
    MonitoringEnabled = "true"
    LoadBalancerType = "network"
    Scheme = "internal"
    Protocol = "tcp"
    HighAvailability = "true"
    SecurityLevel = "high"
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }

  lifecycle {
    ignore_changes = [tags["CreatedDate"]]
  }
}
API Target Group
Creates the target group for API server health checks. It defines which servers receive traffic and how to check whether they’re healthy: a TCP connection test every 10 seconds.
Port: 6443 (standard Kubernetes API port)
HCL
resource "aws_lb_target_group" "k8s_api" {
  name = "k8s-api-tg"
  port = 6443
  protocol = "TCP"
  vpc_id = var.vpc_id

  health_check {
    protocol = "TCP"
    port = 6443
    healthy_threshold = 2
    unhealthy_threshold = 2
    interval = 10
  }

  tags = {
    Name = "${terraform.workspace} - Kubernetes API Target Group"
    Environment = terraform.workspace
    Project = "Kubernetes"
    Role = "api-target-group"
    Component = "networking"
    Purpose = "kubernetes-api-health-check"
    ManagedBy = "Terraform"
    CostCenter = "Infrastructure"
    MonitoringEnabled = "true"
    Protocol = "TCP"
    Port = "6443"
    HealthCheck = "enabled"
    ServiceType = "kubernetes-api-server"
    TargetType = "control-plane-nodes"
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }

  lifecycle {
    ignore_changes = [tags["CreatedDate"]]
  }
}
Master Target Group Attachment
Adds the primary master node to the load balancer target group so it receives API traffic through the load balancer.
Configures the load balancer to listen on port 6443, accepting incoming API requests and forwarding all traffic to the target group.
The load balancer then distributes API traffic across all healthy masters, as sketched below.
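A sketch of that attachment and listener (the resource names are assumptions; the load balancer and target group are the ones defined above):
HCL
# modules/compute/main.tf (sketch)
resource "aws_lb_target_group_attachment" "k8s_api_master" {
  target_group_arn = aws_lb_target_group.k8s_api.arn
  target_id = aws_instance.control_plane["0"].id
  port = 6443
}

resource "aws_lb_listener" "k8s_api" {
  load_balancer_arn = aws_lb.k8s_api.arn
  port = 6443
  protocol = "TCP"

  default_action {
    type = "forward"
    target_group_arn = aws_lb_target_group.k8s_api.arn
  }
}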
Create a custom module and name it compute. Pass outputs from other modules.
HCL
# environments/development/main.tf
module "compute" {
  source = "../../modules/compute"

  # Pass AWS resources from the other modules
  private_subnets = module.networking.private_subnets
  public_subnets = module.networking.public_subnets
  bastion_security_group_id = module.security_groups.bastion_security_group_id
  control_plane_security_group_id = module.security_groups.control_plane_security_group_id
  worker_node_security_group_id = module.security_groups.worker_node_security_group_id
  kubernetes_master_instance_profile = module.iam.kubernetes_master_instance_profile
  kubernetes_worker_instance_profile = module.iam.kubernetes_worker_instance_profile
  key_pair_name = module.keypair.key_pair_name
  tls_private_key_pem = module.keypair.tls_private_key_pem
  vpc_id = module.networking.vpc_id

  depends_on = [
    module.iam,
    module.keypair,
    module.networking,
    module.security_groups
  ]
}
Kubernetes Control Planes Installation using Bash Script
This script essentially automates the creation of a highly available Kubernetes cluster on AWS, handling both the initial cluster setup and the addition of subsequent control plane nodes.
1. wait_for_variables() Waits for required environment variables to be available.
Polls for 30 attempts (60 seconds total) checking if control_plane_master_private_ip, control_plane_endpoint, and is_first_control_plane are set
Returns 0 if all variables are available, 1 if timeout occurs
Bash
#!/bin/bash
set -e

# Function: Wait for required environment variables to be available
wait_for_variables() {
  max_attempts=30
  sleep_interval=2
  attempt=1

  while [ $attempt -le $max_attempts ]; do
    # Check if all required variables are set and non-empty
    if [ -n "${control_plane_master_private_ip}" ] && [ -n "${control_plane_endpoint}" ] && [ -n "${is_first_control_plane}" ]; then
      return 0
    fi
    sleep $sleep_interval
    attempt=$((attempt + 1))
  done

  return 1
}

# Wait for variables or exit if timeout
if ! wait_for_variables; then
  exit 1
fi

# Validate required environment variables are set
if [ -z "${control_plane_master_private_ip}" ] || [ -z "${control_plane_endpoint}" ] || [ -z "${is_first_control_plane}" ]; then
  exit 1
fi
2. System Preparation Block. Prepares the system for Kubernetes installation.
Swap Management: Disables swap memory (required by Kubernetes) and comments it out in /etc/fstab to prevent re-enabling on reboot
Network Configuration: Enables IP forwarding by setting net.ipv4.ip_forward = 1 for pod-to-pod communication
Package Updates: Updates system packages with retry logic for reliability
Bash
# SYSTEM PREPARATION

# Disable swap (required for Kubernetes)
swapoff -a

# Permanently disable swap by commenting it out in fstab
sed -i '/ swap / s/^/#/' /etc/fstab

# Enable IP forwarding for pod networking
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF

# Apply sysctl settings without reboot
sysctl --system

# Update package lists with retry logic
for attempt in 1 2 3; do
  if apt-get update; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 10
  fi
done
3. Container Runtime Setup Block. Installs and configures containerd as the container runtime.
Package Installation: Installs essential packages including containerd and security tools
Repository Setup: Adds Docker’s GPG key and repository for containerd installation
Containerd Configuration:
Generates default config file
Enables systemd cgroup driver (required for Kubernetes)
Starts and enables the containerd service
Bash
# CONTAINER RUNTIME SETUP (containerd)

# Install required packages
apt-get install -y ca-certificates curl gnupg lsb-release containerd apt-transport-https unzip

# Create directory for APT keyrings
mkdir -p /etc/apt/keyrings

# Add Docker GPG key (for containerd installation)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Make the key readable
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add Docker repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  noble stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null

# Update package list again after adding repository
for attempt in 1 2 3; do
  if apt-get update; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 10
  fi
done

# Configure containerd
mkdir -p /etc/containerd

# Generate default containerd configuration
containerd config default | tee /etc/containerd/config.toml

# Enable systemd cgroup driver (required for Kubernetes)
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

# Start and enable containerd service
systemctl restart containerd
systemctl enable containerd
6. First Control Plane Node Block (is_first_control_plane = true). Initializes the first control plane node and sets up the cluster.
Configuration Validation: Validates kubeadm config before cluster initialization
Cluster Initialization: Creates the Kubernetes cluster with specific networking settings
User Setup: Configures kubectl access for the ubuntu user
Control Plane Health Check: Waits up to 150 seconds for the control plane to become responsive
CNI Installation: Installs Calico for pod networking
Certificate Regeneration:
Backs up existing API server certificates
Regenerates certificates to include load balancer DNS as Subject Alternative Name (SAN)
This allows external access through the load balancer
Join Command Generation:
Creates join commands for both worker nodes and additional control planes
Replaces private IP with load balancer DNS for external access
Parameter Store Operations: Stores join commands in AWS Systems Manager for other nodes to retrieve
Bash
# CLUSTER INITIALIZATION OR JOIN
if [ "${is_first_control_plane}" = "true" ]; then
  # FIRST CONTROL PLANE NODE SETUP
  # Validate kubeadm configuration before initialization
  if ! kubeadm config validate --config <(cat <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "${control_plane_master_private_ip}"
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${control_plane_master_private_ip}:6443"
apiServer:
  certSANs:
    - "${control_plane_endpoint}"
networking:
  podSubnet: "192.168.0.0/16"
EOF
  ); then
    exit 1
  fi

  # Initialize Kubernetes cluster
  kubeadm init \
    --control-plane-endpoint "${control_plane_master_private_ip}:6443" \
    --apiserver-advertise-address="${control_plane_master_private_ip}" \
    --upload-certs \
    --pod-network-cidr=192.168.0.0/16 \
    --apiserver-cert-extra-sans "${control_plane_endpoint}"

  # Setup kubeconfig for ubuntu user
  export KUBE_USER=ubuntu
  mkdir -p /home/$KUBE_USER/.kube
  sudo cp -i /etc/kubernetes/admin.conf /home/$KUBE_USER/.kube/config
  sudo chown $KUBE_USER:$KUBE_USER /home/$KUBE_USER/.kube/config

  # Wait for control plane to become responsive
  control_plane_ready=false
  for i in {1..30}; do
    if KUBECONFIG=/etc/kubernetes/admin.conf kubectl get nodes &>/dev/null; then
      control_plane_ready=true
      break
    fi
    sleep 5
  done
  if [ "$control_plane_ready" = false ]; then
    exit 1
  fi

  # Install Calico CNI (Container Network Interface)
  for attempt in 1 2 3; do
    if KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 10
    fi
  done

  # CERTIFICATE REGENERATION FOR LOAD BALANCER
  # Backup existing certificates
  if [ ! -f /etc/kubernetes/pki/apiserver.crt ]; then
    exit 1
  fi
  sudo mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.crt.bak
  sudo mv /etc/kubernetes/pki/apiserver.key /etc/kubernetes/pki/apiserver.key.bak

  # Create configuration for certificate regeneration with load balancer DNS
  cat <<EOF | sudo tee /root/kubeadm-dns.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${control_plane_endpoint}:6443"
apiServer:
  certSANs:
    - "${control_plane_endpoint}"
    - "${control_plane_master_private_ip}"
EOF

  # Regenerate API server certificates with load balancer DNS as SAN
  sudo kubeadm init phase certs apiserver --config /root/kubeadm-dns.yaml

  # Restart kubelet to pick up new certificates
  sudo systemctl restart kubelet

  # JOIN COMMAND GENERATION
  # Generate join command for worker nodes
  JOIN_COMMAND=$(kubeadm token create --print-join-command 2>/dev/null)
  if [ -z "$JOIN_COMMAND" ]; then
    exit 1
  fi

  # Generate certificate key for control plane nodes
  CERT_KEY=$(sudo kubeadm init phase upload-certs --upload-certs 2>/dev/null | tail -n1)
  if [ -z "$CERT_KEY" ]; then
    exit 1
  fi

  # Create control plane join command
  CONTROL_PLANE_JOIN_COMMAND="$JOIN_COMMAND --control-plane --certificate-key $CERT_KEY"
  WORKER_NODE_JOIN_COMMAND="$JOIN_COMMAND"

  # Replace private IP with load balancer DNS in join commands
  JOIN_COMMAND_WITH_DNS=$(echo "$CONTROL_PLANE_JOIN_COMMAND" | sed "s/${control_plane_master_private_ip}:6443/${control_plane_endpoint}:6443/g")
  WORKER_NODE_JOIN_COMMAND_WITH_DNS=$(echo "$WORKER_NODE_JOIN_COMMAND" | sed "s/${control_plane_master_private_ip}:6443/${control_plane_endpoint}:6443/g")

  # Store join commands in AWS Systems Manager Parameter Store
  for attempt in 1 2 3; do
    if aws ssm put-parameter \
      --name "/k8s/control-plane/join-command" \
      --value "$JOIN_COMMAND_WITH_DNS" \
      --type "SecureString" \
      --overwrite \
      --region "us-east-1" \
      --cli-connect-timeout 10 \
      --cli-read-timeout 30; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 10
    fi
  done

  for attempt in 1 2 3; do
    if aws ssm put-parameter \
      --name "/k8s/worker-node/join-command" \
      --value "$WORKER_NODE_JOIN_COMMAND_WITH_DNS" \
      --type "SecureString" \
      --overwrite \
      --region "us-east-1" \
      --cli-connect-timeout 10 \
      --cli-read-timeout 30; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 10
    fi
  done
else
  ...
7. Additional Control Plane Node Block (is_first_control_plane = false), inside the else statement. Joins additional control plane nodes to the existing cluster
Command Retrieval: Retrieves the control plane join command from AWS Parameter Store with retry logic
Cluster Join: Executes the join command to add this node as an additional control plane
Configuration Update: Updates the local kubeconfig to use the load balancer endpoint instead of the first node’s IP
User Setup: Configures kubectl access for the ubuntu user so that its kubeconfig reflects the load balancer endpoint
Bash
else
  # ADDITIONAL CONTROL PLANE NODE SETUP
  # Wait before retrieving join command
  sleep 120

  # Retrieve join command from AWS Systems Manager Parameter Store
  for attempt in 1 2 3; do
    JOIN_CMD=$(aws ssm get-parameter \
      --region us-east-1 \
      --name "/k8s/control-plane/join-command" \
      --with-decryption \
      --query "Parameter.Value" \
      --output text \
      --no-cli-pager \
      --cli-read-timeout 30 \
      --cli-connect-timeout 10 2>/dev/null)
    if [ $? -eq 0 ] && [ -n "$JOIN_CMD" ] && [[ "$JOIN_CMD" != *"error"* ]] && [[ "$JOIN_CMD" != "None" ]]; then
      break
    else
      if [ $attempt -eq 3 ]; then
        exit 1
      fi
      sleep 20
    fi
  done

  # Join the existing cluster as additional control plane
  if eval "sudo $JOIN_CMD"; then
    : # Success
  else
    exit 1
  fi

  # Update kubeconfig to use load balancer endpoint
  if [ -f /etc/kubernetes/admin.conf ]; then
    sudo sed -i "s|https://${control_plane_master_private_ip}:6443|https://${control_plane_endpoint}:6443|g" /etc/kubernetes/admin.conf

    # Setup kubeconfig for ubuntu user
    export KUBE_USER=ubuntu
    mkdir -p /home/$KUBE_USER/.kube
    sudo cp -i /etc/kubernetes/admin.conf /home/$KUBE_USER/.kube/config
    sudo chown $KUBE_USER:$KUBE_USER /home/$KUBE_USER/.kube/config
  else
    exit 1
  fi
fi
Key Design Patterns:
Retry Logic: Most network operations include retry mechanisms for reliability
Conditional Execution: The script branches based on whether this is the first control plane node
Error Handling: Uses set -e to exit on any command failure
High Availability: Configures the cluster to use a load balancer endpoint for external access
Security: Uses proper certificate management and secure parameter storage
Idempotency: Many operations are designed to be safely re-runnable
Kubernetes Master Control Plane Wait Script
This script is typically used where you need to:
Wait for a newly created Kubernetes master node to become fully operational
Verify the installation completed successfully before proceeding with additional configuration
Ensure the cluster is ready to accept worker nodes or workload deployments
Provide debugging information if the setup fails
The script essentially acts as a “health check” that confirms a Kubernetes control plane is not just installed, but fully ready for use.
1. Cloud-Init Completion Wait Block. Waits for the cloud-init process to complete before proceeding.
Timeout Protection: Uses a 20-minute timeout (1200 seconds) to prevent infinite waiting
Status Monitoring: Continuously polls cloud-init status every 30 seconds
Success Detection: Looks for “done” status indicating successful completion
Error Handling: If status contains “error”, displays detailed error information and exits
Bash
#!/bin/bash
# Wait for Kubernetes master control plane to be ready
set -e

# CLOUD-INIT COMPLETION WAIT
# Wait for cloud-init to finish (up to 20 minutes)
timeout 1200 bash -c '
  while true; do
    status=$(sudo cloud-init status 2>/dev/null || echo "unknown")
    if [[ "$status" == *"done"* ]]; then
      break
    elif [[ "$status" == *"error"* ]]; then
      sudo cloud-init status --long 2>&1
      exit 1
    else
      sleep 30
    fi
  done
'
2. Installation Verification Block. Verifies that the Kubernetes installation completed successfully.
Success Log Check: Looks for /var/log/k8s-install-success.txt as proof of successful installation
Success Path: If found, displays the last 10 lines of the success log
Error Log Check: If no success log, checks for /var/log/k8s-install-error.txt
Error Path: If error log exists, displays its contents and exits with failure
Fallback: If neither log exists, shows cloud-init output for debugging and exits
Bash
# INSTALLATION VERIFICATION
# Verify Kubernetes installation completed successfully
if [ -f /var/log/k8s-install-success.txt ]; then
  # Installation success log found
  tail -10 /var/log/k8s-install-success.txt
else
  # No success log found, check for errors
  if [ -f /var/log/k8s-install-error.txt ]; then
    # Error log found - installation failed
    cat /var/log/k8s-install-error.txt
    exit 1
  else
    # No error log either, check cloud-init output
    sudo tail -50 /var/log/cloud-init-output.log
    exit 1
  fi
fi
3. Filesystem Verification Block. Inspects the filesystem to verify expected files and directories exist. Provides diagnostic information about what files were created during installation.
Home Directory Check: Lists contents of /home/ubuntu/ directory
Kube Directory Check: Checks for /home/ubuntu/.kube/ directory (user kubectl config)
Kubernetes Directory Check: Checks for /etc/kubernetes/ directory (system configs)
Non-Fatal: Uses error suppression (2>/dev/null) since some directories might not exist yet
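The filesystem verification block itself isn't reproduced above, so here is a minimal sketch of what it could look like, based only on the checks described in this list (paths match the ones mentioned; the fallback messages are illustrative):
Bash
# FILESYSTEM VERIFICATION (sketch based on the checks described above)
# List the ubuntu user's home directory
ls -la /home/ubuntu/ 2>/dev/null || echo 'No /home/ubuntu directory'

# Check for the user kubectl config directory
ls -la /home/ubuntu/.kube/ 2>/dev/null || echo 'No /home/ubuntu/.kube directory yet'

# Check for the system Kubernetes config directory
ls -la /etc/kubernetes/ 2>/dev/null || echo 'No /etc/kubernetes directory yet'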
4. Kubeconfig Detection Block. Locates and sets up kubectl configuration for cluster access. kubectl requires proper configuration to communicate with the cluster
User Config Priority: First checks for user-specific config at /home/ubuntu/.kube/config
Admin Config Fallback: If user config missing, tries system admin config at /etc/kubernetes/admin.conf
Environment Setup: Sets KUBECONFIG environment variable to point to found config file
Failure Handling: If no config found, lists kubernetes directory contents and exits
Bash
# KUBECONFIG DETECTION
# Check for kubeconfig file and set KUBECONFIG environment variable
if [ -f /home/ubuntu/.kube/config ]; then
  export KUBECONFIG=/home/ubuntu/.kube/config
elif [ -f /etc/kubernetes/admin.conf ]; then
  export KUBECONFIG=/etc/kubernetes/admin.conf
else
  # No kubeconfig found after installation
  ls -la /etc/kubernetes/ 2>/dev/null || echo 'No /etc/kubernetes directory'
  exit 1
fi
5. kubectl Functionality Test Block. Verifies kubectl command-line tool is working properly. Ensures the kubectl tool itself is functional before testing cluster connectivity
Version Check: Runs kubectl version --client to test basic functionality
Binary Verification: Confirms kubectl is installed and accessible
Path Debugging: If kubectl fails, shows where (or if) kubectl is installed and displays PATH
Bash
# KUBECTL FUNCTIONALITY TEST
# Test kubectl client functionality
kubectl version --client 2>&1

if kubectl version --client >/dev/null 2>&1; then
  : # kubectl is working
else
  # kubectl not working
  which kubectl 2>/dev/null || echo 'kubectl not in PATH'
  echo "PATH contents: $PATH"
  exit 1
fi
6. API Server Connectivity Test Block. Tests connectivity to the Kubernetes API server. The API server must be responding before the cluster can be considered ready
Health Endpoint: Uses kubectl get --raw /healthz to test API server health
Timeout Protection: 5-minute timeout (300 seconds) to prevent infinite waiting
Retry Logic: Continuously retries every 10 seconds until success or timeout
Bash
# API SERVER CONNECTIVITY TEST
# Test API server connectivity and readiness
timeout 300 bash -c '
  while ! kubectl get --raw /healthz >/dev/null 2>&1; do
    sleep 10
  done
'
7. System Services Status Check Block. Verifies critical Kubernetes system services are running. These services must be running for the cluster to function properly.
kubelet Status: Checks the Kubernetes node agent service
containerd Status: Checks the container runtime service
Limited Output: Shows only first 10 lines to avoid overwhelming output
Bash
# SYSTEM SERVICES STATUS CHECK
# Check status of critical Kubernetes services
systemctl status kubelet --no-pager 2>&1 | head -10
systemctl status containerd --no-pager 2>&1 | head -10
8. Final Cluster Verification Block. Performs comprehensive cluster functionality tests. Confirms the cluster is not just running, but fully functional
Node Status: Lists all cluster nodes to verify cluster membership
System Pods: Checks status of system pods in kube-system namespace
Pod Verification: Writes pod output to temporary file and displays first 10 entries
Bash
# FINAL CLUSTER VERIFICATION
# Verify cluster is functional
kubectl get nodes 2>&1

# Check system pods status
kubectl get pods -n kube-system --no-headers > /tmp/pods_output 2>&1
head -10 /tmp/pods_output

# SUCCESS - Control plane is ready
Key Design Patterns:
Progressive Validation: Each step builds on the previous one, from basic system readiness to full cluster functionality
Timeout Protection: Critical waits include timeouts to prevent infinite hanging
Graceful Degradation: Provides diagnostic information when things fail
Error Propagation: Uses set -e to exit immediately on any command failure
Comprehensive Testing: Tests multiple layers from file system to cluster API
Kubernetes Worker Nodes Installation using Bash Script
This script essentially automates the process of preparing a server and joining it to an existing Kubernetes cluster as a worker node, handling all the prerequisites and configuration needed for the node to participate in the cluster and run workloads. For example,
You:"I need 3 more worker nodes for my cluster"AWS/Terraform:"Creating 3 new servers..."ThisScript (on eachserver): "Let me become a worker node..."Script:"Preparing system... Installing container runtime... Installing Kubernetes..."Script:"Getting join command from the managers..."Script:"Joining cluster as worker node..."Script:"SUCCESS! I'm now a worker node ready to run applications!"
What happens after this script runs:
The server becomes a worker node in your Kubernetes cluster
It can now run your applications (pods, containers)
The control plane can schedule work on this node
Your cluster has more capacity to run workloads
1. System Preparation Block. Prepares the system for Kubernetes installation.
Swap Management:
Disables active swap memory (Kubernetes requirement)
Comments out swap entries in /etc/fstab to prevent re-enabling on reboot
Network Configuration:
Enables IP forwarding (net.ipv4.ip_forward = 1) for pod-to-pod communication
Applies network settings immediately without requiring a reboot
Package Updates: Updates system packages with retry logic for network reliability
Bash
#!/bin/bash
set -e

# SYSTEM PREPARATION
# Disable swap (required for Kubernetes)
swapoff -a

# Permanently disable swap by commenting it out in fstab
sed -i '/ swap / s/^/#/' /etc/fstab

# Enable IP forwarding for pod networking
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF

# Apply sysctl settings without reboot
sysctl --system

# Update package lists with retry logic
for attempt in 1 2 3; do
  if apt-get update; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 10
  fi
done
2. Container Runtime Setup Block. Installs and configures containerd as the container runtime.
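The containerd block isn't reproduced in full here. As a rough sketch of what this step typically does on Ubuntu (package names and the systemd cgroup tweak are my assumptions, not the exact script):
Bash
# CONTAINER RUNTIME SETUP (illustrative sketch)
# Install containerd from the distro repositories
apt-get install -y containerd

# Generate the default containerd configuration
mkdir -p /etc/containerd
containerd config default | tee /etc/containerd/config.toml

# Use the systemd cgroup driver, which kubelet expects on Ubuntu
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

# Restart and enable containerd so it survives reboots
systemctl restart containerd
systemctl enable containerd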
5. Cluster Join Process Block. Joins this node to the existing Kubernetes cluster as a worker.
Wait Period: Waits 2 minutes to ensure the control plane has stored the join command in Parameter Store
Command Retrieval:
Retrieves worker node join command from AWS Systems Manager Parameter Store
Uses retry logic with 20-second intervals
Validates the command is not empty, doesn’t contain errors, and isn’t “None”
Accesses the /k8s/worker-node/join-command parameter (different from control plane command)
Cluster Join:
Executes the retrieved join command with sudo privileges
The join command typically looks like: kubeadm join <load-balancer-dns>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
Bash
# CLUSTER JOIN PROCESS
# Wait for join command to be available in Parameter Store
sleep 120

# Retrieve worker node join command from AWS Systems Manager Parameter Store
for attempt in 1 2 3; do
  JOIN_CMD=$(aws ssm get-parameter \
    --region us-east-1 \
    --name "/k8s/worker-node/join-command" \
    --with-decryption \
    --query "Parameter.Value" \
    --output text \
    --no-cli-pager \
    --cli-read-timeout 30 \
    --cli-connect-timeout 10 2>/dev/null)
  if [ $? -eq 0 ] && [ -n "$JOIN_CMD" ] && [[ "$JOIN_CMD" != *"error"* ]] && [[ "$JOIN_CMD" != "None" ]]; then
    break
  else
    if [ $attempt -eq 3 ]; then
      exit 1
    fi
    sleep 20
  fi
done

# Execute the join command to add this node as a worker to the cluster
if eval "sudo $JOIN_CMD"; then
  : # Success - node joined cluster
else
  exit 1
fi
Key Differences from Control Plane Script:
Simpler Role: Worker nodes only need to join the cluster, not initialize or manage it
No kubectl: Worker nodes don’t need cluster management tools
No Certificate Management: Workers don’t handle cluster certificates
No CNI Installation: Container networking is managed by control plane
Single Join Command: Uses worker-specific join command from Parameter Store
No Additional Configuration: No need to update configs or generate new commands
Design Patterns:
Retry Logic: Network operations include retry mechanisms for reliability
Parameter Store Integration: Uses AWS SSM to retrieve join commands securely
Error Handling: Uses set -e to exit on any command failure
Validation: Checks command retrieval success before execution
Minimal Installation: Only installs components needed for worker node functionality
Kubernetes Wait for Worker Node Script
This script waits and watches for worker nodes to join a Kubernetes cluster and become ready to run applications. Something like:
You’re organizing a team project:
You’re expecting 3 team members to join you. (EXPECTED_WORKERS = 3)
You’re willing to wait up to 30 minutes for everyone to show up. (TIMEOUT_SECONDS = 1800)
Every 30 seconds, you’ll check to see who’s arrived so far. (CHECK_INTERVAL = 30)
What does the script monitor?
Stage 1: Node Join Detection
Counts how many worker nodes have joined the cluster
Like counting how many people walked into the office
Stage 2: Node Readiness Check
Counts how many worker nodes are ready (not just joined)
Like checking if people have their computers set up and are actually ready to work
When you create a Kubernetes cluster:
Control plane starts first (the “manager” nodes)
Worker nodes join later (the “worker” nodes that run your apps)
You need to wait for all workers to join and be ready before you can deploy applications
For example,
You:"I want 5 worker nodes in my cluster"Terraform:"OK, creating 5 worker nodes..."ThisScript:"I'll wait here and watch for all 5 to join and be ready"*Time passes...*Script:"1 worker joined... 2 workers joined... 3 workers joined..."Script:"All 5 joined! Now waiting for them to be ready..."Script:"Worker 1 ready... Worker 2 ready... All ready!"Script:"SUCCESS! Your cluster is ready to use!"
Without this script, you might try to deploy apps too early and get errors like:
“No nodes available to schedule pods”
“Insufficient resources”
Apps failing because nodes aren’t ready yet
With this script, you know for certain that your cluster is 100% ready before you try to use it. In essence: It’s a “safety check” that prevents you from using a cluster before it’s fully operational.
1. Configuration Setup Block. Initializes the script environment and configuration.
Kubeconfig Export: Sets KUBECONFIG=/home/ubuntu/.kube/config for kubectl access
Variable Assignment: Retrieves configuration from Terraform variables:
EXPECTED_WORKERS: Number of worker nodes expected to join
TIMEOUT_SECONDS: Maximum time to wait for nodes
CHECK_INTERVAL: Time between status checks
LOG_FILE: Path where to save detailed logs
Log File Setup: Creates the log file and makes it writable (chmod 666)
Bash
#!/bin/bash
set -e

# CONFIGURATION SETUP
# Export kubeconfig for kubectl access
export KUBECONFIG=/home/ubuntu/.kube/config

# Configuration from Terraform variables
EXPECTED_WORKERS=${expected_workers}
TIMEOUT_SECONDS=${timeout_seconds}
CHECK_INTERVAL=${check_interval}
LOG_FILE="${log_file}"

# Create and configure log file
sudo touch "$LOG_FILE"
sudo chmod 666 "$LOG_FILE"
2. count_worker_nodes() Function. Counts how many worker nodes have joined the cluster (regardless of readiness). Tracks the joining progress of worker nodes
Node Listing: Gets all nodes without headers using kubectl get nodes --no-headers
Filtering: Excludes control plane nodes by filtering out:
Lines containing “control-plane”
Lines containing “master”
Counting: Uses wc -l to count remaining lines
Error Handling: Returns 0 if kubectl fails
Bash
# WORKER NODE COUNTING FUNCTIONS
# Function to count current worker nodes (joined but may not be ready)
count_worker_nodes() {
  kubectl get nodes --no-headers 2>/dev/null | \
    grep -v control-plane | \
    grep -v master | \
    wc -l || echo 0
}
3. count_ready_worker_nodes() Function. Counts how many worker nodes are both joined AND ready for workloads. Ensures nodes are not just joined but actually functional.
Node Listing: Gets all nodes without headers
Multi-Stage Filtering:
Excludes control plane nodes (same as above)
Additionally filters for “Ready” status using grep Ready
Counting: Counts nodes that pass all filters
Error Handling: Returns 0 if kubectl fails
Bash
# Function to count ready worker nodes (joined and ready for workloads)
count_ready_worker_nodes() {
  kubectl get nodes --no-headers 2>/dev/null | \
    grep -v control-plane | \
    grep -v master | \
    grep Ready | \
    wc -l || echo 0
}
4. Main Wait Loop with Timeout Block. Continuously monitors worker node status until completion or timeout.
4a. Timeout Management
Time Calculation: Calculates start time, end time, and current time in Unix timestamps
Timeout Check: Exits with error if current time exceeds end time
Failure Path: Shows current cluster state and exits with code 1
4b. Status Monitoring
Node Counting: Calls both counting functions to get current status
Time Tracking: Calculates elapsed time and remaining time
Progress Display: Shows comprehensive status including:
Current timestamp
Worker node counts (joined vs ready)
Expected count
Time statistics
4c. Completion Logic
Join Check: Verifies if enough nodes have joined (current_workers >= EXPECTED_WORKERS)
Readiness Check: Verifies if enough nodes are ready (ready_workers >= EXPECTED_WORKERS)
Two-Stage Success:
First celebrates when nodes join
Then waits for them to become ready
Loop Exit: Breaks out of loop only when both conditions are met
4d. Status Display and Wait
Cluster State: Shows current node status with kubectl get nodes
Log Recording: Saves output to the log file
Interval Wait: Sleeps for the configured check interval before next iteration
Bash
# MAIN WAIT LOOP WITH TIMEOUT
# Calculate timeout timestamps
start_time=$(date +%s)
end_time=$((start_time + TIMEOUT_SECONDS))

while true; do
  current_time=$(date +%s)

  # Check if timeout has been reached
  if [ $current_time -gt $end_time ]; then
    echo "TIMEOUT: Worker nodes did not join within $TIMEOUT_SECONDS seconds"
    echo "Current cluster state:"
    kubectl get nodes --no-headers 2>&1 | tee -a "$LOG_FILE" || echo "kubectl failed"
    exit 1
  fi

  # Count current worker nodes
  current_workers=$(count_worker_nodes)
  ready_workers=$(count_ready_worker_nodes)

  # Calculate elapsed and remaining time
  elapsed=$((current_time - start_time))
  remaining=$((end_time - current_time))

  echo "Status check at $(date)"
  echo "Current worker nodes: $current_workers"
  echo "Ready worker nodes: $ready_workers"
  echo "Expected: $EXPECTED_WORKERS"
  echo "Elapsed: $elapsed s, Remaining: $remaining s"

  # Check if we have enough worker nodes joined
  if [ "$current_workers" -ge "$EXPECTED_WORKERS" ]; then
    echo "All $EXPECTED_WORKERS worker nodes have joined the cluster!"
    # Check if they are all ready
    if [ "$ready_workers" -ge "$EXPECTED_WORKERS" ]; then
      echo "All worker nodes are also ready!"
      break
    else
      echo "Worker nodes joined but not all are ready yet. Waiting for readiness..."
    fi
  fi

  # Show current cluster state
  echo "Current cluster state:"
  kubectl get nodes --no-headers 2>&1 | tee -a "$LOG_FILE" || echo "kubectl command failed"

  echo "Waiting $CHECK_INTERVAL seconds before next check..."
  sleep $CHECK_INTERVAL
done
5. Final Status Display Block. Shows completion status and saves final results.
Detailed Output: Uses kubectl get nodes -o wide for comprehensive node information
Log Location: Reminds user where detailed logs are saved
Bash
# FINAL STATUS DISPLAY
# Show detailed final cluster state
echo "Final cluster state:"
kubectl get nodes -o wide 2>&1 | tee -a "$LOG_FILE"

echo "Worker nodes join process completed successfully!"
echo "Log saved to: $LOG_FILE"
Key Design Patterns:
Polling Loop: Continuously checks status at regular intervals
Timeout Protection: Prevents infinite waiting with configurable timeout
Two-Stage Validation: Distinguishes between “joined” and “ready” states
Progress Tracking: Provides detailed status updates during the wait
Comprehensive Logging: Saves detailed information for debugging
This script is typically used in infrastructure automation scenarios where:
Terraform/CloudFormation: Waits for worker nodes to join after provisioning
CI/CD Pipelines: Ensures cluster is fully ready before deploying applications
Cluster Scaling: Verifies new nodes are operational after scaling events
Testing: Confirms cluster readiness in automated testing environments
The script implements a two-stage success model:
Stage 1: Worker nodes join the cluster (appear in kubectl get nodes)
Stage 2: Worker nodes become ready (can schedule and run pods)
This is important because nodes can join a cluster but still be initializing, pulling images, or having network issues that prevent them from being ready for workloads.
The script ensures the cluster is not just numerically complete, but functionally ready for production use.
Kubernetes Worker Node Labelling Script
This script assigns proper “worker” labels to nodes in a Kubernetes cluster so they show up with the correct role instead of <none>.
1. Configuration Setup Block. Sets up the environment and logging.
Kubeconfig Export: Sets up kubectl to access the cluster
Worker Count: Gets expected number of workers from configuration
Log File: Creates a timestamped log file to track what happens
Bash
#!/bin/bash
set -e

# CONFIGURATION SETUP
export KUBECONFIG=/home/ubuntu/.kube/config
EXPECTED_WORKERS=${expected_worker_count}

# Create log file with timestamp
LOG_FILE="/var/log/k8s-worker-labeling-$(date +%Y%m%d-%H%M%S).log"
sudo touch $LOG_FILE
sudo chmod 666 $LOG_FILE
2. Initial Cluster State Check Block. Shows what the cluster looks like before making changes.
Display Nodes: Shows all nodes and their current status
Error Handling: Exits if kubectl doesn’t work
Documentation: Saves the “before” state to the log file
Bash
# INITIAL CLUSTER STATE CHECK
echo 'Current cluster state before labeling:'
kubectl get nodes -o wide 2>&1 | tee -a $LOG_FILE || {
  echo 'FAILED to get nodes'
  exit 1
}
3. Stabilization Wait Block. Gives nodes time to fully initialize. Newly joined nodes might still be initializing.
30-Second Wait: Ensures nodes are fully ready before labeling
Bash
# STABILIZATION WAIT
echo 'Waiting 30 seconds for nodes to stabilize...'
sleep 30
4. Node Discovery Block. Finds all nodes in the cluster
Get Node List: Uses kubectl to get all node names
JSONPath Query: Extracts just the names from the full node information
Bash
# NODE DISCOVERY
# Get all node names in the cluster
node_list=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
echo "All nodes found: $node_list"
5. Labeling Function with Retry Logic Block. Creates a reliable function to label individual nodes. What label_node_with_retry() does:
Readiness Check: Waits up to 60 seconds for the node to be “Ready”
Label Application: Applies the node-role.kubernetes.io/worker=worker label with --overwrite
Retry Logic: Tries up to 3 times, waiting 10 seconds between attempts
Bash
# LABELING FUNCTION WITH RETRY LOGIC
label_node_with_retry() {
  local node="$1"
  local max_attempts=3
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    echo "Attempt $attempt/$max_attempts to label node: $node"

    # Wait for node to be ready
    if kubectl wait --for=condition=Ready node/$node --timeout=60s 2>&1 | tee -a $LOG_FILE; then
      echo "$node is ready, attempting to label..."
      # Apply worker label
      if kubectl label node "$node" node-role.kubernetes.io/worker=worker --overwrite 2>&1 | tee -a $LOG_FILE; then
        echo "SUCCEEDED to label $node as worker"
        return 0
      else
        echo "FAILED to label $node (attempt $attempt)"
      fi
    else
      echo "$node not ready yet (attempt $attempt)"
    fi

    attempt=$((attempt + 1))
    if [ $attempt -le $max_attempts ]; then
      echo "Waiting 10 seconds before retry..."
      sleep 10
    fi
  done

  echo "FAILED to label $node after $max_attempts attempts"
  return 1
}
6. First Labeling Pass Block. Goes through each node and labels appropriate ones as workers.
Check Each Node: Loops through all discovered nodes
Role Detection: Checks if node already has “control-plane” or “master” labels
Skip Control Planes: Leaves management nodes alone
Label Workers: Applies worker label to non-control-plane nodes
Bash
# FIRST LABELING PASS
# Process each node and determine if it should be labeled as worker
for node in $node_list; do
  if [ -n "$node" ]; then
    echo "Processing node: $node"

    # Check if node has control-plane or master role
    node_labels=$(kubectl get node "$node" -o jsonpath='{.metadata.labels}' 2>/dev/null || echo '')
    if echo "$node_labels" | grep -E 'control-plane|master' >/dev/null 2>&1; then
      echo "$node is a control plane node, skipping"
    else
      echo "$node appears to be a worker node"
      label_node_with_retry "$node"
    fi
  fi
done

echo 'First labeling pass completed'
7. Second Labeling Pass Block. Catches any nodes that were missed in the first pass.
Some nodes might have joined after the first pass
Network issues might have caused failures
Find Unlabeled: Looks for nodes with <none> role
Final Attempt: Tries to label any remaining unlabeled nodes
Bash
# SECOND LABELING PASS
# Check for any remaining unlabeled worker nodes
echo 'Checking for any remaining unlabeled nodes...'
unlabeled_nodes=$(kubectl get nodes --no-headers | grep '<none>' | awk '{print $1}' || true)

if [ -n "$unlabeled_nodes" ]; then
  echo "Found unlabeled nodes: $unlabeled_nodes"
  for node in $unlabeled_nodes; do
    echo "Final attempt to label remaining node: $node"
    label_node_with_retry "$node"
  done
else
  echo 'No unlabeled nodes found'
fi
8. Final Verification Block. Confirms the job was completed successfully.
Show Results: Displays final cluster state
Count Check: Counts how many nodes still have <none> role
Success/Failure: Exits with error if any nodes remain unlabeled
Log Information: Tells user where to find detailed logs
Bash
# FINAL VERIFICATION
echo 'Labeling process completed'
echo 'Final cluster state:'
kubectl get nodes -o wide 2>&1 | tee -a $LOG_FILE

# Check if any nodes still remain unlabeled
remaining_unlabeled=$(kubectl get nodes --no-headers | grep '<none>' | wc -l || echo '0')

if [ "$remaining_unlabeled" -gt 0 ]; then
  echo "WARNING: $remaining_unlabeled node(s) still have no role assigned"
  kubectl get nodes --no-headers | grep '<none>' 2>&1 | tee -a $LOG_FILE
  exit 1
else
  echo 'SUCCEEDED: All nodes have roles assigned'
fi

echo "Worker labeling process completed. Full log saved to: $LOG_FILE"
echo "To view the log later, run: sudo cat $LOG_FILE"
Common Logging Functions
The installation scripts above share two helper functions, log_step and check_command, which are injected into each script via ${common_functions}. Here is how they behave during a successful run and during a failure.
During Success
Bash
# Script starts
log_step "1" "Starting installation"       # Success log
apt-get update                             # Run command
check_command "1" "Package update failed"  # Check if it worked
log_step "2" "Packages updated"            # Success log
During Failure
Bash
# Script starts
log_step "1" "Starting installation"  # Success log
some_failing_command                  # This command fails
check_command "1" "Command failed"    # Detects failure, logs error, exits
# Script stops here - never reaches the next step
After running scripts with these functions, you get:
/var/log/k8s-install-success.txt - Contains all successful steps
/var/log/k8s-install-error.txt - Contains any errors that occurred
Why This is Useful:
Debugging: If installation fails, you can check the error log to see exactly what went wrong
Progress Tracking: Success log shows how far the installation got
Automation: Scripts can automatically stop when something fails
Consistency: All scripts use the same logging format
Auditing: Permanent record of what happened during installation
In essence: These functions create a “flight recorder” for your Kubernetes installation, tracking every step and automatically stopping if anything goes wrong.
The ${common_functions} you see at the top of other scripts gets replaced with these functions, so every script has access to this logging toolkit.
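The template itself isn't reproduced in this post, but based on how log_step and check_command are used above and on the log files they write, a minimal sketch of these helpers could look like this (variable names and the exact message format are illustrative):
Bash
# Possible shape of the shared logging helpers (signatures match the usage above)
SUCCESS_LOG=/var/log/k8s-install-success.txt
ERROR_LOG=/var/log/k8s-install-error.txt

log_step() {
    # Record a successful step with a timestamp
    echo "$(date '+%Y-%m-%d %H:%M:%S') [STEP $1] $2" >> "$SUCCESS_LOG"
}

check_command() {
    # Inspect the exit code of the previous command; log and stop on failure
    if [ $? -ne 0 ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') [STEP $1] ERROR: $2" >> "$ERROR_LOG"
        exit 1
    fi
}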
There are several reasons why multi-stage builds produce better containers.
Reduced image size: less disk space is used, and smaller images can also speed up builds, deployments, and container startup.
Separation of build time and runtime: keeping build tools and libraries out of the runtime environment gives you a cleaner, more maintainable setup.
Smaller attack surface: with only the needed libraries in the runtime image, attackers have far fewer unnecessary packages and scripts to exploit.
Let’s use Golang as our first example. We will first create a single-stage build and compare the size to the multi-stage build.
main.go file
Go
packagemainimport("fmt""net/http")funchandler(w http.ResponseWriter, r *http.Request){ fmt.Fprintf(w,"Hello World!\n")}funcmain(){ http.HandleFunc("/", handler)// Start the server on port 80err:= http.ListenAndServe(":80",nil)if err !=nil{ fmt.Println("Error starting server:", err)}}
Create a single-stage build Dockerfile named Dockerfile-single-golang.
Dockerfile
FROM ubuntu:latest
WORKDIR /app
RUN apt update \
    && apt install -y golang
COPY /hello-world-go/main.go .
RUN go mod init hello-world-go && go build -o hello-world-go
EXPOSE 80
CMD ["./hello-world-go"]
Let’s use Ubuntu as the base image, install Go, and build the executable. How did I come up with this? At first, I ran a Dockerfile with only FROM ubuntu:latest as its content, then went inside the container and installed Go manually.
Dockerfile
FROM ubuntu:latest
CMD ["bash"]
I then retrieved everything I had run using the history command and put those commands together to create the Dockerfile above.
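To see the cost of the single-stage approach, you can build the image and check its size. A quick sketch (the image tag here is just illustrative):
Bash
# Build the single-stage image and inspect its size
docker build -f Dockerfile-single-golang -t hello-world-go:single .
docker images hello-world-go:single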
In the Go example, we produced a standalone binary executable. How about PHP? Let’s try installing WordPress. You can find the instructions here: https://wiki.alpinelinux.org/wiki/WordPress
Create a Dockerfile named Dockerfile-single-php. I used the base image php:8.4.3-fpm-alpine3.20 which is lightweight.
Dockerfile
FROM php:8.4.3-fpm-alpine3.20

# Set working directory
WORKDIR /usr/share/webapps/

RUN apk add --no-cache \
    bash \
    lighttpd \
    php82 \
    fcgi \
    php82-cgi \
    wget

# Configure lighttpd to enable FastCGI
RUN sed -i 's|# include "mod_fastcgi.conf"|include "mod_fastcgi.conf"|' /etc/lighttpd/lighttpd.conf && \
    sed -i 's|/usr/bin/php-cgi|/usr/bin/php-cgi82|' /etc/lighttpd/mod_fastcgi.conf

# Download and extract WordPress
RUN wget https://wordpress.org/latest.tar.gz && \
    tar -xzvf latest.tar.gz && \
    rm latest.tar.gz && \
    chown -R lighttpd:lighttpd /usr/share/webapps/wordpress

EXPOSE 9000
CMD ["sh", "-c", "php-fpm & lighttpd -D -f /etc/lighttpd/lighttpd.conf"]
Now create the multi-stage version of the same setup.
Dockerfile
FROM php:8.4.3-fpm-alpine3.20 AS build

# Set working directory
WORKDIR /usr/share/webapps/

RUN apk add --no-cache \
    bash \
    lighttpd \
    php82 \
    fcgi \
    php82-cgi \
    wget

# Configure lighttpd to enable FastCGI
RUN sed -i 's|# include "mod_fastcgi.conf"|include "mod_fastcgi.conf"|' /etc/lighttpd/lighttpd.conf && \
    sed -i 's|/usr/bin/php-cgi|/usr/bin/php-cgi82|' /etc/lighttpd/mod_fastcgi.conf

# Download and extract WordPress
RUN wget https://wordpress.org/latest.tar.gz && \
    tar -xzvf latest.tar.gz && \
    rm latest.tar.gz && \
    chown -R lighttpd:lighttpd /usr/share/webapps/wordpress

FROM php:8.4.3-fpm-alpine3.20 AS final

WORKDIR /usr/share/webapps/

# Copy the prepared files from the build stage
COPY --from=build /usr/share/webapps/ /usr/share/webapps/

# Install only the runtime dependencies
RUN apk add --no-cache \
    lighttpd \
    fcgi \
    php82-cgi

EXPOSE 9000
CMD ["sh", "-c", "php-fpm & lighttpd -D -f /etc/lighttpd/lighttpd.conf"]
In our example, since PHP doesn’t produce an executable binary, we don’t use the scratch image in the final stage. Instead, we reuse the php:8.4.3-fpm-alpine3.20 image and still need to install the required dependencies. As demonstrated, we only install the essential ones.
The size reduction was minimal because PHP is an interpreted language. Unlike compiled languages, PHP cannot run independently—it relies on its runtime environment. PHP executes code line by line at runtime, essentially processing it “on the fly.” Additionally, installing dependencies like MySQL, PDO, or others increases the overall image size.
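If you want to verify the numbers yourself, build both variants and compare them. A sketch, assuming the multi-stage file above was saved as Dockerfile-multi-php (the file and tag names are only examples):
Bash
# Build both PHP images
docker build -f Dockerfile-single-php -t wordpress-php:single .
docker build -f Dockerfile-multi-php -t wordpress-php:multi .

# List the two images side by side to compare their sizes
docker images | grep wordpress-php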
Vagrant and VirtualBox on a Mac M2 Max are a no-go for now because ARM support is not fully available yet, so I decided to install them on my Ubuntu 22 machine instead. I ran into a few hurdles, so I am going to discuss how I managed to fix them.
I have a Jenkins server that I hadn’t touched for a year, and when I started it, I was greeted with two upgrades: one from Jenkins and the other from AWS. I upgraded Amazon Linux to 2023.6.20241212. Everything went fine until I checked Docker and noticed I had the outdated version 25, which I believe is expected based on this information from AWS: https://docs.aws.amazon.com/linux/al2023/release-notes/all-packages-AL2023.6.html
What is a Linux service? It is a program that runs in the background without user interaction, and it usually keeps running even after the computer reboots. An example is backing up a database from a production server every day: you run a background task that dumps the whole database and saves the dump somewhere safe.
For example, I will empty the trash on an Ubuntu desktop every minute.
Create a bash script to empty the trash, name it empty-trash.sh, and save it in /opt.
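The script itself can be very small. Here is a sketch that assumes the default GNOME trash location under ~/.local/share/Trash:
Bash
#!/bin/bash
# /opt/empty-trash.sh - empty the current user's trash
rm -rf ~/.local/share/Trash/files/* ~/.local/share/Trash/info/*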
Using Docker Compose, create a service for the aws-cli image and make sure to add stdin_open: true, tty: true, and command: help so the container doesn’t exit immediately.
Go inside the container and configure the AWS CLI by adding your credentials. This should create the .aws/ directory. To keep this tutorial simple, I opted to configure it inside the container rather than using environment variables.
Bash
docker exec -it sample-aws bash
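Once inside the container, the interactive setup is just aws configure; the region and output format below are only examples:
Bash
# Configure the AWS CLI inside the container
aws configure
# AWS Access Key ID [None]: <your AWS_ACCESS_KEY_ID>
# AWS Secret Access Key [None]: <your AWS_SECRET_ACCESS_KEY>
# Default region name [None]: us-east-1
# Default output format [None]: json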
Note: Make sure your instances have a tag like this to be able to identify them without entering the instance ID.
Create a bash script file named change-instance-state.sh and put it inside the .aws/ directory.
Create a function to be re-used when starting and stopping an instance.
Bash
#!/bin/bash

change_instance_state() {
  action='.StartingInstances[0]'
  status='stopped'
  instances='start-instances'

  if [ $1 == "STOP" ]; then
    action='.StoppingInstances[0]'
    status='running'
    instances='stop-instances'
  fi

  # Get all instances with a status of stopped or running and a tag value of PRODUCTION,
  # then take each instance ID and change its state
  aws ec2 describe-instances \
    --query 'Reservations[*].Instances[*].[State.Name, InstanceId, Tags[0].Value]' \
    --output text | grep ${status} | grep PRODUCTION | awk '{print $2}' | while read line; do
      result=$(aws ec2 ${instances} --instance-ids $line)
      current_state=$(echo "$result" | jq -r "$action.CurrentState.Name")
      previous_state=$(echo "$result" | jq -r "$action.PreviousState.Name")
      echo "Previous State is: ${previous_state} and Current State is: ${current_state}"
  done
}
In my example, I used the aws ec2 describe-instances command to list all EC2 instances, then filtered the output with grep and awk to target only the PRODUCTION instances.
How did I get the previous and current state in the output? Install jq and you can parse the result of the $(aws ec2 ${instances} --instance-ids $line) call. You can skip this part; I only added it so I can see the previous and current state of the instance.
Ask the user whether they want to start or stop the instance, and assign the answer to the USER_OPTION variable.
Bash
read -p 'Do you want to START or STOP the production instance? ' USER_OPTION;
Write an if/else statement that checks the value of the USER_OPTION variable and passes the matching value to the change_instance_state function. The double-caret expansion ${USER_OPTION^^} simply converts the input to uppercase.
Bash
if[[${USER_OPTION^^}=="START"]];thenecho'STARTING THE PRODUCTION SERVER...'change_instance_state"START"elsechange_instance_state"STOP"fi
Lastly, to be able to execute this script from outside the container, you can run it through docker exec on the host.
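A minimal sketch, assuming the compose service is named sample-aws and the script was saved under /root/.aws/ inside the container (both names are assumptions):
Bash
# From the host: run the script inside the container
docker exec -it sample-aws bash /root/.aws/change-instance-state.sh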
I decided to install Jenkins on another server because I don’t want it to mess with my containers, and I will execute different remote scripts from different servers.
Prerequisites: a running Docker and Docker Compose setup, a DNS record for jenkins.yourdomain.com, and some knowledge of SSH keys.
Let’s install Jenkins first by creating a Docker Compose file.
The file pulls the latest Jenkins image from Docker Hub and maps it to port 80, so we don’t need to type https://jenkins.yourdomain.com:8080 in the browser.
Make sure to include restart: always; when you install Jenkins plugins, the service restarts automatically, and without this setting you would lose the connection to the container.
From the Jenkins server, copy the public key and save it to the authorized keys of the remote server.
Bash
# On the Jenkins server
cat ~/.ssh/id_rsa.pub
Go to Dashboard > Manage Jenkins and click on the Plugins section.
Install the Publish Over SSH plugin.
Let’s add the private key and passphrase from the remote server. Click the drop-down next to your name and you will see Credentials.
Click the + Add Credentials blue button.
Choose SSH Username with private key as Kind, Global as Scope, enter any ID or leave it blank, your username, and the Private key and passphrase from the remote server.
To get the Private key, ssh to your remote server and execute the command: cat ~/.ssh/id_rsa. Don’t forget to enter the passphrase if there is any.
Let’s create a Pipeline. On the Dashboard, click + New Item.
Enter the pipeline’s name, choose Pipeline as the type of project, and click OK.
Once saved, scroll down to the Pipeline section. Copy and paste the following script:
Select an instance and click Instance state. Choose Stop Instance.
After the instance has been stopped, go to the Storage tab and click the Volume ID. It will take you to the Volume configuration.
Tick the checkbox next to the Volume ID and click Actions. Select Modify volume.
The previous size of my volume is 30GB. I resized it to 60GB.
Click Modify.
SSH into your instance and run df -h. You can see below that my volume /dev/nvme0n1p1 is now 60GB in size.
My volume didn’t need the partition expanded because it was expanded automatically, but you might find that the partition is still the size of the original volume. For example, if the original volume was 30GB and you resize it to 60GB, the partition may stay at 30GB. To fix that, just run:
Bash
growpart /dev/nvme0n1 1
If it’s already expanded, you’ll get a message saying the partition cannot be grown.
I was changing the password of our RDS instance yesterday and thought that right after modifying the database, I could test it in DBeaver immediately. It turns out it takes a few minutes for the instance to become available again. I tested the connection a couple of times in DBeaver and checked whether the production site would load successfully.
It didn’t. When you’re on WordPress, it just says Error Establishing Database Connection. I scratched my head a couple of times because I had just changed the password and tried to reconnect, yet the site kept saying Error Establishing Database Connection. When I checked the log file, it said:
Basically, the answer is flushing hosts. I went to DBeaver and ran:
SQL
FLUSH HOSTS;
I googled why this happens and found out that if a host keeps failing to connect and exceeds max_connect_errors, MySQL will block that host.
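If you want to check the current threshold before flushing, you can query the variable from any MySQL client. A sketch with placeholder connection details:
Bash
# Check the current max_connect_errors threshold on the RDS instance
mysql -h <your-rds-endpoint> -u admin -p \
  -e "SHOW VARIABLES LIKE 'max_connect_errors';"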
Flushing hosts means that MySQL clears its host cache and unblocks any blocked hosts. I tried to reconnect again and it went fine.