
This is one of the harder ways to install and run Kubernetes. I recommend it for learning purposes, not for production use. Amazon EKS (Elastic Kubernetes Service) is a managed service you can use instead of setting up your own cluster the way this tutorial does.
First, the VPC, subnets, security groups, key pairs, SSM, IAM roles, Network Load Balancer, and EC2 instances (1 bastion host, 3 control planes, 3 worker nodes) need to be set up before Kubernetes can be installed through Bash scripting.
Terraform will be used to spin up the AWS resources. It is an infrastructure-as-code tool that lets you create AWS resources without provisioning them manually by mouse clicks.
Table of Contents:
- Key Pair
- IAM
- SSM
- Networking
- Security Groups
- EC2 Instances
- Kubernetes Control Planes Installation using Bash Script
- Kubernetes Master Control Plane Wait Script
- Kubernetes Worker Nodes Installation using Bash Script
- Kubernetes Wait for Worker Node Script
- Kubernetes Worker Node Labelling Script
- Common Functions Script
Key Pair
A key pair is a secure authentication method for accessing EC2 instances via SSH. It consists of a public key and a private key.
- Public Key: Gets installed on EC2 instances during launch
- Private Key: Stays on your local machine (like a secret password)
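Before SSH will accept the saved private key, its file permissions must be restricted. A quick sketch (the touch is only a stand-in for the terraform-key-pair.pem file that Terraform writes):

```shell
# ssh rejects private keys with loose permissions ("UNPROTECTED PRIVATE KEY FILE")
touch terraform-key-pair.pem      # stand-in for the file Terraform writes
chmod 400 terraform-key-pair.pem  # owner read-only
stat -c %a terraform-key-pair.pem # -> 400
```

After terraform apply you would then connect with something like ssh -i terraform-key-pair.pem ec2-user@<bastion-public-ip> (the user name depends on the AMI you pick later).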
Create the key pair first, name it terraform-key-pair.pem, and save it locally.
# modules/keypair/main.tf
# Generate an RSA key pair
resource "tls_private_key" "private" {
algorithm = "RSA"
rsa_bits = 4096
}
# Create an AWS key pair using the generated public key
resource "aws_key_pair" "generated_key" {
key_name = "terraform-key-pair"
public_key = tls_private_key.private.public_key_openssh
}
# Save the private key locally
resource "local_file" "private_key" {
content = tls_private_key.private.private_key_pem
filename = "${path.root}/terraform-key-pair.pem"
}
Expose the return values of the above code using outputs, to be used in other modules.
# modules/keypair/outputs.tf
output "key_pair_name" {
description = "Name of the AWS key pair for SSH access to EC2 instances"
value = aws_key_pair.generated_key.key_name
}
output "tls_private_key_pem" {
description = "Private key in PEM format for SSH access - keep secure and do not expose"
value = tls_private_key.private.private_key_pem
sensitive = true
}
Create a custom module named keypair.
# environment/development/main.tf
module "keypair" {
source = "../../modules/keypair"
}
IAM
IAM (Identity and Access Management) is AWS’s security system: it controls who can do what in an AWS account. IAM does two things: authenticate (who are you?) and authorize (what can you do?).
Authentication
– Users: Individual people (e.g., developers)
– Roles: Temporary identities for services/applications
– Groups: Collections of users with similar permissions
Authorization
– Policies: Rules that define permissions
– Permissions: Specific actions allowed/denied
Example IAM Users:
John (Developer) -> Can create EC2 instances but not delete them
Sarah (Admin) -> Can do everything
CI/CD System -> Can deploy applications but not manage billing
Example IAM Roles:
EC2 Instance Role -> Can read from S3 buckets
Lambda Function Role -> Can write to DynamoDB
Kubernetes Node Role -> Can join cluster and pull images
Example Policies:
{
"Effect": "Allow",
"Action": "ec2:DescribeInstances",
"Resource": "*"
}
1. data "aws_caller_identity" "current" {}
Gets information about the current AWS account and user that Terraform is using. It’s like asking “Who am I?” to AWS. It provides:
- Account ID: The AWS account number (like 123456789012)
- User ID: The unique identifier of the current user/role
- ARN: The full Amazon Resource Name of the current user/role
# modules/iam/main.tf
# IAM
data "aws_caller_identity" "current" {}
Example usage:
"arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
2. resource "random_id" "cluster" { byte_length = 4 }
Generates a random identifier that will be consistent across Terraform runs. It’s like creating a unique “serial number” for your cluster. This ensures all resources are uniquely named and belong to the same cluster deployment. What it does:
- Creates: A random 4-byte (32-bit) identifier
- Formats: Usually displayed as hexadecimal (like a1b2c3d4)
- Persistence: Same value every time you run terraform apply (unless you destroy and recreate)
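To see why a 4-byte identifier shows up as 8 hex characters, here is a minimal Python sketch of what random_id produces (the resource name is illustrative):

```python
import secrets

# byte_length = 4 -> 4 random bytes -> 8 hex characters, e.g. "a1b2c3d4"
suffix = secrets.token_bytes(4).hex()
profile_name = f"kubernetes-master-profile-{suffix}"
print(profile_name)
```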
It is used in aws_iam_role.kubernetes_master, aws_iam_instance_profile.kubernetes_master, aws_iam_role.kubernetes_worker, aws_iam_instance_profile.kubernetes_worker.
This is used to avoid naming conflicts. When multiple people or environments deploy the same Terraform code, IAM resources need unique names because:
- IAM names are globally unique within an AWS account
- Multiple deployments would conflict without unique identifiers
- Easy identification of which resources belong to which cluster
Using consistent random suffixes gives you:
- No Conflicts: Multiple developers/environments can deploy simultaneously
- Easy Cleanup: All resources for one cluster have the same suffix
- Clear Ownership: Can identify which resources belong to which deployment
- Testing: Can deploy multiple test environments without conflicts
resource "random_id" "cluster" {
byte_length = 4
}
Control Plane Master IAM Setup
Master Role – Identity for control plane nodes
This creates an IAM role that EC2 instances can assume to get AWS permissions. The assume_role_policy is a trust policy that says “only EC2 instances can use this role” – it controls WHO can assume the role, not WHAT they can do. The actual permissions (like accessing S3 or Parameter Store) are added later by attaching separate IAM policies to this role.
# modules/iam/main.tf
resource "aws_iam_role" "kubernetes_master" {
name = "kubernetes-master-role-${random_id.cluster.hex}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}
]
})
tags = {
Name = "${terraform.workspace} - Kubernetes Master Role"
Description = "IAM role for Kubernetes control plane nodes with AWS API permissions"
Purpose = "Kubernetes Control Plane"
Environment = terraform.workspace
ManagedBy = "Terraform"
Project = "Kubernetes"
NodeType = "Control Plane"
Service = "EC2"
}
}
Master Instance Profile – Attaches role to EC2.
This creates an IAM instance profile that acts as a bridge between EC2 instances and IAM roles. The instance profile gets attached to EC2 instances and allows them to assume the specified IAM role to obtain temporary AWS credentials. Think of it as the mechanism that lets EC2 instances “wear” the IAM role – without an instance profile, EC2 instances cannot access AWS APIs because they have no way to authenticate or assume roles.
# modules/iam/main.tf
resource "aws_iam_instance_profile" "kubernetes_master" {
name = "kubernetes-master-profile-${random_id.cluster.hex}"
role = aws_iam_role.kubernetes_master.name
tags = {
Name = "${terraform.workspace} - Kubernetes Control Plane Instance Profile"
Description = "Instance profile for control plane nodes - enables AWS API access for cluster management"
Purpose = "Kubernetes Control Plane"
Environment = terraform.workspace
ManagedBy = "Terraform"
}
}
Master SSM Policy – Parameter store permissions
This policy gives the control plane nodes permission to store and manage cluster secrets in AWS Parameter Store. When the first control plane node sets up the cluster, it creates a “join command” (like a password) and stores it in AWS Parameter Store so other nodes can retrieve it and join the cluster. The policy restricts access to only parameters that start with /k8s/ for security.
What the control plane can do:
- PutParameter: Store cluster join command and tokens
- GetParameter: Read existing cluster info
- DeleteParameter: Clean up old/expired tokens
- DescribeParameters: List available parameters
# modules/iam/main.tf
# SSM parameter access policy for Kubernetes control plane - allows storing/retrieving cluster join tokens
resource "aws_iam_role_policy" "kubernetes_master_ssm" {
name = "kubernetes-master-ssm-policy"
role = aws_iam_role.kubernetes_master.id
policy = jsonencode({
# Policy grants control plane full access to SSM parameters under /k8s/ namespace
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"ssm:PutParameter", # Store cluster join command with tokens and CA cert hash
"ssm:GetParameter", # Retrieve existing parameters for validation
"ssm:DeleteParameter", # Clean up expired or invalid join tokens
"ssm:DescribeParameters" # List and discover available k8s parameters
]
# Restrict access to only k8s namespace parameters for security
Resource = "arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
}
]
})
}
Worker Nodes IAM Setup
Worker Role – Identity for worker nodes
This creates an IAM role specifically for worker node EC2 instances. The assume_role_policy is a trust policy that allows only EC2 instances to assume this role and get AWS credentials. This role will later have policies attached that give worker nodes the specific permissions they need (like pulling container images, managing storage volumes, and handling pod networking) – but this just creates the empty role container that worker nodes can use.
# modules/iam/main.tf
resource "aws_iam_role" "kubernetes_worker" {
name = "kubernetes-worker-role-${random_id.cluster.hex}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}
]
})
tags = {
Name = "${terraform.workspace} - Kubernetes Worker Role"
Description = "IAM role for Kubernetes worker nodes with permissions for pod networking, storage, and container operations"
Purpose = "Kubernetes Worker Nodes"
Environment = terraform.workspace
ManagedBy = "Terraform"
}
}
Worker Instance Profile – Attaches role to EC2
This creates an IAM instance profile that acts as a bridge between worker node EC2 instances and the worker IAM role. The instance profile gets attached to worker EC2 instances and allows them to assume the kubernetes_worker role to obtain AWS credentials. This enables worker nodes to access AWS APIs for tasks like pulling container images, managing EBS volumes, and configuring networking – without it, worker nodes couldn’t authenticate with AWS services.
# modules/iam/main.tf
resource "aws_iam_instance_profile" "kubernetes_worker" {
name = "kubernetes-worker-profile-${random_id.cluster.hex}"
role = aws_iam_role.kubernetes_worker.name
tags = {
Name = "${terraform.workspace} - Kubernetes Worker Instance Profile"
Description = "Instance profile for worker nodes - enables AWS API access for container operations and networking"
Purpose = "Kubernetes Worker Nodes"
Environment = terraform.workspace
ManagedBy = "Terraform"
}
}
Worker SSM Policy – Read-only parameter access
This creates an IAM policy that gets attached to the worker role, giving worker nodes read-only access to AWS Parameter Store. It allows worker nodes to retrieve the cluster join command that was stored by the control plane, but restricts access to only parameters under the /k8s/ path for security. This is how worker nodes get the secret tokens they need to join the existing Kubernetes cluster.
# modules/iam/main.tf
# Worker node SSM access - read-only permissions to get cluster join command
resource "aws_iam_role_policy" "kubernetes_worker_ssm" {
name = "kubernetes-worker-ssm-policy"
role = aws_iam_role.kubernetes_worker.id
policy = jsonencode({
# Policy allows worker nodes to read SSM parameters under /k8s/ path
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"ssm:GetParameter", # Read join command stored by control plane
"ssm:GetParameters" # Batch read multiple parameters if needed
]
# Only allow access to k8s namespace parameters
Resource = "arn:aws:ssm:us-east-1:${data.aws_caller_identity.current.account_id}:parameter/k8s/*"
}
]
})
}
How IAM works:
- Control plane starts -> Gets master role
- Kubernetes initializes -> Generates join token
- Control plane stores join command in SSM parameter /k8s/control-plane/join-command
- Worker nodes start -> Get worker role
- Workers read join command from SSM
- Workers join the cluster using the token
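The steps above can be sketched as follows. The AWS CLI calls are shown as comments because they need live credentials; the only thing that actually travels through SSM is the kubeadm join command string, which a worker executes verbatim (the command below is a made-up example):

```shell
# Control plane (needs the master role's ssm:PutParameter):
#   aws ssm put-parameter --name /k8s/control-plane/join-command \
#     --type SecureString --overwrite \
#     --value "$(kubeadm token create --print-join-command)"
#
# Worker node (needs only the worker role's ssm:GetParameter):
#   aws ssm get-parameter --name /k8s/control-plane/join-command \
#     --with-decryption --query Parameter.Value --output text

# The value that moves through SSM - a made-up example:
JOIN_CMD='kubeadm join 10.0.2.10:6443 --token abc123.def456ghi789 --discovery-token-ca-cert-hash sha256:1234deadbeef'
set -- $JOIN_CMD
echo "API endpoint workers connect to: $3"
```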
Expose the return values to be used in other modules.
# modules/iam/outputs.tf
output "kubernetes_master_instance_profile" {
description = "IAM instance profile name for Kubernetes control plane nodes - provides AWS API permissions"
value = aws_iam_instance_profile.kubernetes_master.name
}
output "kubernetes_worker_instance_profile" {
description = "IAM instance profile name for Kubernetes worker nodes - provides AWS API permissions for pods and services"
value = aws_iam_instance_profile.kubernetes_worker.name
}
Create a custom module named iam.
# environments/development/main.tf
module "iam" {
source = "../../modules/iam"
}
SSM Parameter Store
This SSM parameter provides a secure, automated way for control plane nodes to share fresh join tokens with worker nodes, eliminating manual steps and security risks. You don’t have to SSH into every node just to enter the join command.
# modules/ssm/main.tf
resource "aws_ssm_parameter" "join_command" {
name = "/k8s/control-plane/join-command"
type = "SecureString"
value = "placeholder-will-be-updated-by-script"
description = "Kubernetes cluster join command for worker nodes - automatically updated by control plane initialization script"
lifecycle {
ignore_changes = [value] # Let the script update the value
}
}
Name & Path:
name = "/k8s/control-plane/join-command"
- Hierarchical path: Organized under the /k8s/ namespace
- Specific location: Control plane section for join commands
- Matches IAM policy: IAM roles above have access to /k8s/* path
Security Type:
type = "SecureString"
- Encrypted storage: Value is encrypted at rest in AWS
- Secure transmission: Encrypted in transit when accessed
- Better than plaintext: Protects sensitive cluster tokens
The Join Command Content
What gets stored (after control plane runs):
# Real example of what replaces the placeholder:
"kubeadm join 10.0.2.10:6443 --token abc123.def456ghi789 --discovery-token-ca-cert-hash sha256:1234567890abcdef..."
Why is ignore_changes needed?
- Control plane script updates the value with real join command
- Without lifecycle: Terraform would overwrite script’s value back to placeholder
- With lifecycle: Terraform ignores value changes, lets script manage it
lifecycle {
ignore_changes = [value] # Let the script update the value
}
Create a custom module named ssm.
# environments/development/main.tf
module "ssm" {
source = "../../modules/ssm"
}
Networking
The networking section creates the foundational network infrastructure for the Kubernetes cluster.
Create the variables first.
# modules/networking/variables.tf
variable "aws_region" {
type = map(string)
description = "AWS region for each environment - maps workspace to region"
default = {
"development" = "us-east-1"
"production" = "us-east-2"
}
}
variable "public_subnet_cidrs" {
type = list(string)
description = "Public Subnet CIDR values for load balancers and internet-facing resources"
default = ["10.0.1.0/24"]
}
variable "private_subnet_cidrs" {
type = list(string)
description = "Private Subnet CIDR values for Kubernetes nodes and internal services"
default = ["10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
}
variable "azs" {
type = map(list(string))
description = "Availability Zones for each environment - ensures high availability across multiple AZs"
default = {
"development" = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1f"]
"production" = ["us-east-2a", "us-east-2b", "us-east-2c", "us-east-2d", "us-east-2f"]
}
}
VPC (Virtual Private Cloud)
A VPC (Virtual Private Cloud) in Amazon Web Services (AWS) is your own isolated network within the AWS cloud — like a private data center you control.
# modules/networking/main.tf
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${terraform.workspace} - Kubernetes Cluster VPC"
Environment = terraform.workspace
Purpose = "Kubernetes Infrastructure"
}
}
Public Subnet (Internet-facing)
A public subnet in AWS is a subnet inside a VPC that can directly communicate with the internet — typically used for resources that need to be accessible from outside AWS.
In AWS, a DMZ (Demilitarized Zone) is a subnet or network segment that acts as a buffer zone between the public internet and your private/internal AWS resources. It’s used to host public-facing services while minimizing the exposure of your internal network.
The public subnet contains the bastion host – a dedicated EC2 instance that acts as a secure gateway for accessing private resources. The bastion has a public IP and sits in the public subnet, allowing administrators to SSH into it from the internet, then use it as a stepping stone to securely connect to instances in private subnets that don’t have direct internet access.
# modules/networking/main.tf
resource "aws_subnet" "public_subnets" {
count = length(var.public_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = element(var.public_subnet_cidrs, count.index)
availability_zone = element(var.azs[terraform.workspace], count.index)
tags = {
Name = "${terraform.workspace} - Public Subnet ${count.index + 1}"
Description = "Public subnet for bastion host and load balancers"
Type = "Public"
Environment = terraform.workspace
AvailabilityZone = element(var.azs[terraform.workspace], count.index)
Purpose = "DMZ"
ManagedBy = "Terraform"
Project = "Kubernetes"
Tier = "DMZ" # Demilitarized Zone
}
}
Private Subnets (Internal)
A private subnet in AWS is a subnet within your VPC that does NOT have direct access to or from the public internet. It’s used to host internal resources that should remain isolated from external access, such as: Application servers, Databases (e.g., RDS), Internal services (e.g., Redis, internal APIs).
Hosts the Kubernetes control plane and worker nodes. No direct internet access (protected from external access).
# modules/networking/main.tf
resource "aws_subnet" "private_subnets" {
count = min(length(var.private_subnet_cidrs), length(var.azs[terraform.workspace]))
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.azs[terraform.workspace][count.index] # Ensures 1 AZ per subnet
tags = {
Name = "${terraform.workspace} - Private Subnet ${count.index + 1}"
Description = "Private subnet for Kubernetes worker and control plane nodes"
Type = "Private"
Environment = terraform.workspace
AvailabilityZone = var.azs[terraform.workspace][count.index]
Purpose = "Kubernetes Nodes"
ManagedBy = "Terraform"
Project = "Kubernetes"
Tier = "Internal"
}
}
Multi-AZ Distribution: Spreads resources across multiple data centers (high availability). If one AZ fails, others continue running (fault tolerance).
availability_zone = element(var.azs[terraform.workspace], count.index)
Internet Gateway
An Internet Gateway (IGW) in AWS is a component that connects your VPC to the internet. It allows resources in your VPC (like EC2 instances in a public subnet) to send traffic to the internet and receive traffic from the internet. It is attached to the VPC and used by the public subnet’s route table, which is what lets the bastion host receive SSH connections. It handles:
- Outbound connections (e.g., your EC2 instance accessing a website)
- Inbound connections (e.g., users accessing your public web server)
# modules/networking/main.tf
resource "aws_internet_gateway" "igw" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${terraform.workspace} - Internet Gateway"
Purpose = "Internet access for public subnets"
Description = "Provides internet connectivity for bastion host and load balancers"
Type = "Gateway"
}
}
Public Route Table
A public route table in AWS is a route table associated with one or more public subnets, and it directs traffic destined for the internet to an Internet Gateway (IGW).
Traffic flow: Public subnet -> Internet Gateway -> Internet
# modules/networking/main.tf
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${terraform.workspace} - Public Route Table"
Description = "Route table for public subnets - directs traffic to internet gateway"
Type = "Public"
Purpose = "Internet routing for DMZ resources"
Environment = terraform.workspace
ManagedBy = "Terraform"
Tier = "DMZ"
RouteType = "Internet-bound"
Project = "Kubernetes"
}
}
Private Route Table
A private route table in AWS is a route table used by private subnets—subnets that do not have direct access to or from the internet.
A private route table does NOT have a route to an Internet Gateway (IGW). Instead, it may have a route to a NAT Gateway or no external route at all, depending on whether you want outbound internet access (e.g., for software updates) or complete isolation.
Traffic flow: Private subnet -> NAT Gateway -> Internet Gateway -> Internet
# modules/networking/main.tf
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${terraform.workspace} - Private Route Table"
Description = "Route table for private subnets - directs internet traffic through NAT Gateway"
Type = "Private"
Environment = terraform.workspace
Purpose = "NAT Gateway Routing"
ManagedBy = "Terraform"
}
}
Elastic IP for NAT Gateway
An Elastic IP (EIP) is a static public IP. It provides a consistent address and is required for NAT Gateway operation.
What the NAT Gateway does:
Private Subnet (10.0.2.x) -> NAT Gateway -> Internet
It translates private IPs to a public IP for outbound traffic. It needs a public IP to communicate with the internet on behalf of private resources. Without an EIP, the NAT Gateway won’t work.
Without EIP (Dynamic IP):
Today: NAT uses IP 12.123.45.67
Tomorrow: AWS changes it to 12.234.56.78
Result: External services block your new IP
With EIP (Static IP):
Always: NAT uses IP 54.123.45.67
Result: Consistent external identity
# modules/networking/main.tf
resource "aws_eip" "nat_eip" {
domain = "vpc"
tags = {
Name = "${terraform.workspace} - NAT Gateway EIP"
Description = "Elastic IP for NAT Gateway - enables internet access for private subnets"
Purpose = "NAT Gateway"
Environment = terraform.workspace
ManagedBy = "Terraform"
}
}
NAT Gateway
Allows private subnets to reach the internet. Outbound traffic only; no inbound from the internet. It’s essential for Kubernetes nodes to download images, updates, etc.
# modules/networking/main.tf
resource "aws_nat_gateway" "nat" {
allocation_id = aws_eip.nat_eip.id
subnet_id = aws_subnet.public_subnets[0].id
tags = {
Name = "${terraform.workspace} - NAT Gateway"
Description = "NAT Gateway for private subnet internet access - enables Kubernetes nodes to reach external services"
Purpose = "Private Subnet Internet Access"
Environment = terraform.workspace
ManagedBy = "Terraform"
}
depends_on = [aws_internet_gateway.igw]
}
Add a default route to the internet gateway in the public route table.
# modules/networking/main.tf
resource "aws_route" "public_internet_access" {
route_table_id = aws_route_table.public.id
destination_cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.igw.id
}
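Every route table also carries an implicit local route for the VPC CIDR, and AWS always picks the most specific matching route, so this 0.0.0.0/0 entry only catches traffic that is not VPC-internal. A small Python illustration of that matching rule (not AWS code):

```python
import ipaddress

# Simplified public route table: the implicit VPC-local route plus the
# default route to the internet gateway added above
routes = {
    ipaddress.ip_network("10.0.0.0/16"): "local",  # implicit VPC route
    ipaddress.ip_network("0.0.0.0/0"): "igw",      # default route
}

def next_hop(ip: str) -> str:
    matches = [(net, target) for net, target in routes.items()
               if ipaddress.ip_address(ip) in net]
    # longest prefix (most specific route) wins
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("10.0.2.15"))  # -> local (stays inside the VPC)
print(next_hop("8.8.8.8"))    # -> igw (goes out to the internet)
```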
Associate only the first public subnet with the public route table
# modules/networking/main.tf
resource "aws_route_table_association" "public_first_subnet" {
subnet_id = aws_subnet.public_subnets[0].id
route_table_id = aws_route_table.public.id
}
Add a route in the private route table to direct internet traffic through the NAT Gateway.
# modules/networking/main.tf
resource "aws_route" "private_nat" {
route_table_id = aws_route_table.private.id
destination_cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.nat.id
}
Link private subnets to the private route table.
# modules/networking/main.tf
resource "aws_route_table_association" "private" {
count = length(var.private_subnet_cidrs)
subnet_id = element(aws_subnet.private_subnets[*].id, count.index)
route_table_id = aws_route_table.private.id
}
Expose the return values to be used in other modules.
# modules/networking/outputs.tf
output "vpc_id" {
description = "ID of the VPC for the Kubernetes cluster"
value = aws_vpc.main.id
}
output "vpc_cidr_block" {
description = "CIDR block of the VPC for security group rules and network configuration"
value = aws_vpc.main.cidr_block
}
output "private_subnets" {
description = "Private subnets for Kubernetes worker nodes and internal services"
value = aws_subnet.private_subnets
}
output "public_subnets" {
description = "Public subnets for load balancers, bastion hosts, and internet-facing resources"
value = aws_subnet.public_subnets
}
Create a custom module named networking.
# environments/development/main.tf
module "networking" {
source = "../../modules/networking"
}
Security Groups
The security groups create network security rules (firewalls) for the Kubernetes cluster. The result is a secure, layered defense where each Kubernetes component can communicate as needed while unauthorized access from the internet is blocked.
To learn more about Kubernetes ports and protocols, visit https://kubernetes.io/docs/reference/networking/ports-and-protocols/.
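For quick reference, the rules that follow are built around the standard kubeadm port matrix from the page linked above. A summary sketch, with port ranges as (from, to) pairs:

```python
# Standard Kubernetes ports per the ports-and-protocols reference
CONTROL_PLANE_PORTS = {
    "kube-apiserver": (6443, 6443),
    "etcd client/peer": (2379, 2380),
    "kubelet API": (10250, 10250),
    "kube-scheduler": (10259, 10259),
    "kube-controller-manager": (10257, 10257),
}
WORKER_PORTS = {
    "kubelet API": (10250, 10250),
    "kube-proxy health": (10256, 10256),
    "NodePort services": (30000, 32767),
}

for name, (low, high) in {**CONTROL_PLANE_PORTS, **WORKER_PORTS}.items():
    span = f"{low}" if low == high else f"{low}-{high}"
    print(f"{name}: {span}")
```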
Create the variables.
# modules/security_groups/variables.tf
// FROM Other Module
variable "vpc_id" {
description = "VPC ID from AWS module"
type = string
}
variable "vpc_cidr_block" {
description = "CIDR block of the VPC for internal network communication"
type = string
}
1. Bastion Security Group: Creates a firewall group for the bastion host.
# modules/security_groups/main.tf
resource "aws_security_group" "bastion" {
name = "bastion-sg"
vpc_id = var.vpc_id
description = "Security group for the bastion host"
tags = {
Name = "${terraform.workspace} - Bastion Host SG"
}
}
2. Bastion SSH from Internet: Allows SSH connections to the bastion host from anywhere on the internet.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "bastion_ssh_anywhere" {
security_group_id = aws_security_group.bastion.id
from_port = 22
to_port = 22
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
description = "Allow SSH access to bastion host from any IP address"
tags = {
Name = "${terraform.workspace} - Bastion SSH Internet Access"
}
}
3. Bastion SSH to Control Plane: Allows the bastion host to SSH to control plane nodes.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "bastion_egress_control_plane" {
security_group_id = aws_security_group.bastion.id
from_port = 22
to_port = 22
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.control_plane.id
description = "Allow SSH from bastion host to Kubernetes control plane nodes for cluster administration"
tags = {
Name = "${terraform.workspace} - Bastion SSH to Control Plane"
}
}
4. Bastion SSH to Workers: Allows the bastion host to SSH to worker nodes.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "bastion_egress_workers" {
security_group_id = aws_security_group.bastion.id
from_port = 22
to_port = 22
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.worker_node.id
description = "Allow SSH from bastion host to worker nodes for maintenance and troubleshooting"
tags = {
Name = "${terraform.workspace} - Bastion SSH to Worker Nodes"
}
}
5. Control Plane Security Group: Creates a firewall group for Kubernetes master nodes.
# modules/security_groups/main.tf
resource "aws_security_group" "control_plane" {
name = "control-plane-sg"
vpc_id = var.vpc_id
description = "Security group for the Kubernetes control plane"
tags = {
Name = "${terraform.workspace} - Kubernetes Control Plane SG"
}
}
6. Control Plane SSH Access: Allows SSH to the control plane from the bastion host only.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_ssh" {
security_group_id = aws_security_group.control_plane.id
from_port = 22
to_port = 22
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.bastion.id
description = "Allow SSH access to control plane nodes from bastion host for cluster administration"
tags = {
Name = "${terraform.workspace} - Control Plane SSH from Bastion"
}
}
7. Control Plane etcd: Allows etcd database communication. Kubernetes stores all cluster state in etcd.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_etcd" {
security_group_id = aws_security_group.control_plane.id
from_port = 2379
to_port = 2380
ip_protocol = "tcp"
cidr_ipv4 = var.vpc_cidr_block
description = "Allow etcd client and peer communication within VPC for Kubernetes cluster state management"
tags = {
Name = "${terraform.workspace} - Control Plane etcd Communication"
}
}
8. Control Plane kubelet: Allows kubelet API access. Used for monitoring and managing pods.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_self_control_plane" {
security_group_id = aws_security_group.control_plane.id
from_port = 10250
to_port = 10250
ip_protocol = "tcp"
cidr_ipv4 = var.vpc_cidr_block
description = "Allow kubelet API access within VPC for control plane node communication and monitoring"
tags = {
Name = "${terraform.workspace} - Control Plane kubelet API"
}
}
9. Control Plane Scheduler: Allows access to scheduler metrics. Used for health checks and monitoring.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_kube_scheduler" {
security_group_id = aws_security_group.control_plane.id
from_port = 10259
to_port = 10259
ip_protocol = "tcp"
cidr_ipv4 = var.vpc_cidr_block
description = "Allow kube-scheduler metrics and health check access from VPC for cluster monitoring"
tags = {
Name = "${terraform.workspace} - Control Plane kube-scheduler"
}
}
10. Control Plane Controller Manager: Allows access to controller manager metrics. Used for health checks and monitoring.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "control_plane_kube_controller_manager" {
security_group_id = aws_security_group.control_plane.id
from_port = 10257
to_port = 10257
ip_protocol = "tcp"
cidr_ipv4 = var.vpc_cidr_block
description = "Allow kube-controller-manager metrics and health check access from VPC for cluster monitoring"
tags = {
Name = "${terraform.workspace} - Control Plane kube-controller-manager"
}
}
11. Control Plane All Outbound: Allows the control plane to connect to anything on the internet. Used for downloading updates and calling AWS APIs.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "control_plane_egress_all" {
security_group_id = aws_security_group.control_plane.id
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
description = "Allow all outbound traffic from control plane for AWS APIs, container registries, and external services"
tags = {
Name = "${terraform.workspace} - Control Plane Outbound All"
}
}
12. Worker Node Security Group: Creates a firewall group for worker nodes.
# modules/security_groups/main.tf
resource "aws_security_group" "worker_node" {
name = "worker-node-sg"
vpc_id = var.vpc_id
description = "Security group for Kubernetes worker nodes - controls pod and application traffic"
tags = {
Name = "${terraform.workspace} - Worker Nodes SG"
}
}
13. Worker All Outbound: Allows workers to connect to anything on the internet. Used for downloading container images and calling external APIs.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_egress_rule" "worker_node_egress_all" {
security_group_id = aws_security_group.worker_node.id
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
description = "Allow all outbound traffic from worker nodes for container images, application traffic, and AWS services"
tags = {
Name = "${terraform.workspace} - Worker Nodes Outbound All"
}
}
14. Worker SSH Access: Allows SSH to workers from the bastion only.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_ssh" {
security_group_id = aws_security_group.worker_node.id
from_port = 22
to_port = 22
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.bastion.id
description = "Allow SSH access to worker nodes from bastion host for maintenance and troubleshooting"
tags = {
Name = "${terraform.workspace} - Worker Nodes SSH from Bastion"
}
}
15. Worker kubelet API: Allows the control plane to manage pods on the workers. This is how Kubernetes schedules and monitors pods.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_kubelet_api" {
security_group_id = aws_security_group.worker_node.id
from_port = 10250
to_port = 10250
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.control_plane.id
description = "Allow control plane access to worker node kubelet API for pod management and monitoring"
tags = {
Name = "${terraform.workspace} - Worker Nodes kubelet API"
}
}
16. Worker kube-proxy: Allows the load balancer to check worker health.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_kube_proxy" {
security_group_id = aws_security_group.worker_node.id
from_port = 10256
to_port = 10256
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.elb.id
description = "Allow load balancer access to kube-proxy health check endpoint on worker nodes"
tags = {
Name = "${terraform.workspace} - Worker Nodes kube-proxy"
}
}
17. Worker NodePort TCP: Allows the internet to access applications on workers, used to expose web apps and APIs.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_tcp_nodeport_services" {
security_group_id = aws_security_group.worker_node.id
from_port = 30000
to_port = 32767
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
description = "Allow internet access to Kubernetes NodePort services (TCP 30000-32767) for application traffic"
tags = {
Name = "${terraform.workspace} - Worker Nodes NodePort TCP"
}
}
18. Worker NodePort UDP: Allows the internet to access UDP applications on workers, used to expose UDP services such as DNS or game servers.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "worker_node_udp_nodeport_services" {
security_group_id = aws_security_group.worker_node.id
from_port = 30000
to_port = 32767
ip_protocol = "udp"
cidr_ipv4 = "0.0.0.0/0"
description = "Allow internet access to Kubernetes NodePort services (UDP 30000-32767) for application traffic"
tags = {
Name = "${terraform.workspace} - Worker Nodes NodePort UDP"
}
}
19. Control Plane API Health Check: Allows the load balancer to check whether the API server is healthy.
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "allow_nlb_health_check" {
security_group_id = aws_security_group.control_plane.id
from_port = 6443
to_port = 6443
ip_protocol = "tcp"
cidr_ipv4 = var.vpc_cidr_block
description = "Allow Network Load Balancer health checks to Kubernetes API server on port 6443"
tags = {
Name = "${terraform.workspace} - Control Plane NLB Health Check"
}
}
20. Control Plane BGP: Allows the BGP routing protocol, used by service meshes and advanced CNI plugins (such as Calico).
# modules/security_groups/main.tf
resource "aws_vpc_security_group_ingress_rule" "allow_bgp" {
security_group_id = aws_security_group.control_plane.id
from_port = 179
to_port = 179
ip_protocol = "tcp"
cidr_ipv4 = var.vpc_cidr_block
description = "Allow BGP protocol communication within VPC for network routing and service mesh"
tags = {
Name = "${terraform.workspace} - Control Plane BGP Communication"
}
}
When should you use cidr_ipv4 = var.vpc_cidr_block? Use the VPC CIDR when communication needs to happen with:
- Multiple different security groups (avoiding many separate rules)
- Load balancers or services that don’t have their own security groups
- System-level protocols that need broad VPC access
- Health checks that come from various AWS services
For example,
# etcd (ports 2379-2380) - Multiple control plane nodes need to communicate
cidr_ipv4 = var.vpc_cidr_block
# kubelet API (port 10250) - Control plane, workers, monitoring all need access
cidr_ipv4 = var.vpc_cidr_block
# kube-scheduler (10259) - Monitoring systems need access
cidr_ipv4 = var.vpc_cidr_block
# BGP (port 179) - Network routing between various nodes
cidr_ipv4 = var.vpc_cidr_block
# NLB health checks (port 6443) - Load balancer health checks
cidr_ipv4 = var.vpc_cidr_block
For quick decisions, use cidr_ipv4 = var.vpc_cidr_block when:
- “Do multiple types of resources need access?” → YES = VPC CIDR
- “Is this a system/infrastructure port?” → YES = VPC CIDR
- “Do health checks or monitoring need access?” → YES = VPC CIDR
Use referenced_security_group_id when:
- “Is this one specific service talking to another?” → YES = Security Group
- “Can I identify exactly who should have access?” → YES = Security Group
- “Is this application-level communication?” → YES = Security Group
Expose the return values to be used in other modules.
# modules/security_groups/outputs.tf
output "bastion_security_group_id" {
description = "Security group ID for the bastion host - used for SSH access to cluster nodes"
value = aws_security_group.bastion.id
}
output "control_plane_security_group_id" {
description = "Security group ID for Kubernetes control plane nodes - manages API server and cluster components"
value = aws_security_group.control_plane.id
}
output "worker_node_security_group_id" {
description = "Security group ID for Kubernetes worker nodes - handles application workloads and pod traffic"
value = aws_security_group.worker_node.id
}
Create a custom module named security_groups. Pass values from the networking module (vpc_id and vpc_cidr_block) so they can be used inside the security groups.
# environments/development/main.tf
module "security_groups" {
source = "../../modules/security_groups"
vpc_id = module.networking.vpc_id
vpc_cidr_block = module.networking.vpc_cidr_block
depends_on = [module.networking]
}
EC2 Instances
EC2 instances are virtual servers running in Amazon's data centers that you can create, configure, and control remotely through code, giving you the flexibility to build infrastructure without buying physical hardware. In this Kubernetes setup, we will use seven EC2 instances: one bastion host, three control planes, and three worker nodes.
Create the variables first.
variable "public_subnet_cidrs" {
type = list(string)
description = "Public Subnet CIDR values"
default = ["10.0.1.0/24"]
}
variable "control_plane_private_ips" {
type = list(string)
description = "List of private IPs for control plane nodes"
default = ["10.0.2.10", "10.0.3.10", "10.0.4.10"]
}
variable "bastion" {
description = "Configuration for the bastion host used as a secure gateway to access private cluster resources"
type = map
default = {
"ami" = "ami-084568db4383264d4"
"instance_type" = "t3.micro"
"private_ip" = "10.0.1.10"
"name" = "Bastion Host"
}
}
variable "common_functions" {
description = "Configuration for deploying shared utility scripts across all cluster instances"
type = any
default = {
"source" = "scripts/common-functions.sh"
"destination" = "/tmp/common-functions.sh"
"connection" = {
"type" = "ssh"
"user" = "ubuntu"
"bastion_user" = "ubuntu"
"timeout" = "30m" # Allow enough time for installation
}
}
}
variable "control_plane" {
description = "Configuration for the primary Kubernetes control plane node including API server, scheduler, and controller manager"
type = any
default = {
"ami" = "ami-084568db4383264d4"
"instance_type" = "t3.xlarge"
"root_block_device" = {
"volume_size" = 20
"volume_type" = "gp3"
"delete_on_termination" = true
}
"init_file" = "scripts/init-control-plane.sh.tmpl"
"name" = "Control Plane 1"
}
}
variable "wait_for_master_ready" {
description = "Configuration for the script that waits for the control plane to be fully operational before proceeding with cluster setup"
type = map
default = {
"source" = "scripts/wait-for-master.sh.tmpl"
}
}
variable "control_plane_secondary" {
description = "Configuration for additional control plane nodes to provide high availability for the Kubernetes cluster"
type = any
default = {
"ami" = "ami-084568db4383264d4" # Replace with an Ubuntu 24.04 (Noble) AMI ID
"instance_type" = "t3.xlarge"
"root_block_device" = {
"volume_size" = 20
"volume_type" = "gp3"
"delete_on_termination" = true
}
"init_file" = "scripts/init-control-plane.sh.tmpl"
"name" = "Control Plane Secondary"
}
}
variable "worker_nodes" {
description = "Configuration for Kubernetes worker nodes that run application workloads and pods"
type = any
default = {
"count" = 3
"ami" = "ami-084568db4383264d4"
"instance_type" = "t3.large"
"root_block_device" = {
"volume_size" = 20
"volume_type" = "gp3"
"delete_on_termination" = true
}
"init_file" = "scripts/init-worker-node.sh.tmpl"
"name" = "Worker Node"
}
}
variable "wait_for_workers_to_join" {
description = "Configuration for the script that waits for all worker nodes to successfully join the Kubernetes cluster"
type = map
default = {
"init_file" = "scripts/wait-for-workers.sh.tmpl"
"log_file" = "/var/log/k8s-wait-for-workers-$(date +%Y%m%d-%H%M%S).log"
}
}
variable "label_worker_nodes" {
description = "Configuration for applying labels and taints to worker nodes for workload scheduling and node organization"
type = any
default = {
"init_file" = "scripts/label-worker-nodes.sh.tmpl"
"expected_worker_count" = 3
}
}
# FROM Other Module
variable "vpc_id" {
description = "VPC ID from AWS module where the Kubernetes cluster will be deployed"
type = string
}
variable "private_subnets" {
description = "Private subnets from AWS module for deploying worker nodes and internal cluster components"
type = any
}
variable "public_subnets" {
description = "Public subnets from AWS module for deploying bastion host and load balancers"
type = any
}
variable "bastion_security_group_id" {
description = "Security group ID for the bastion host allowing SSH access from authorized sources"
type = string
}
variable "control_plane_security_group_id" {
description = "Security group ID for control plane nodes allowing Kubernetes API and inter-node communication"
type = string
}
variable "worker_node_security_group_id" {
description = "Security group ID for worker nodes allowing pod-to-pod communication and kubelet access"
type = string
}
variable "kubernetes_master_instance_profile" {
description = "IAM instance profile for control plane nodes with permissions for Kubernetes master operations"
type = string
}
variable "kubernetes_worker_instance_profile" {
description = "IAM instance profile for worker nodes with permissions for Kubernetes worker operations"
type = string
}
variable "tls_private_key_pem" {
description = "TLS private key in PEM format for secure communication within the Kubernetes cluster"
type = string
sensitive = true
}
variable "key_pair_name" {
description = "AWS EC2 key pair name for SSH access to cluster instances"
type = string
}
Bastion Host Instance
Creates one bastion host EC2 instance per public subnet. It serves as an SSH gateway for reaching the private cluster nodes.
resource "aws_instance" "bastion" {
count = length(var.public_subnet_cidrs)
ami = var.bastion.ami
instance_type = var.bastion.instance_type
key_name = var.key_pair_name
vpc_security_group_ids = [var.bastion_security_group_id]
subnet_id = var.public_subnets[count.index].id
private_ip = var.bastion.private_ip
tags = {
Name = "${terraform.workspace} - ${var.bastion.name}"
Environment = terraform.workspace
Project = "Kubernetes"
Role = "bastion-host"
ManagedBy = "Terraform"
CostCenter = "Infrastructure"
MonitoringEnabled = "true"
SubnetType = "public"
CreatedDate = formatdate("YYYY-MM-DD", timestamp())
}
lifecycle {
ignore_changes = [tags["CreatedDate"]]
}
}
Bastion Elastic IP
Creates static public IP addresses for the bastion hosts. This gives each bastion a fixed IP that doesn't change when the instance restarts, so you always know which IP to SSH to.
resource "aws_eip" "bastion_eip" {
count = length(var.public_subnet_cidrs)
domain = "vpc"
}
Bastion EIP Association
Attaches the static IP to the bastion instance. The purpose is to link the elastic IP to the actual server.
resource "aws_eip_association" "bastion_eip_assoc" {
count = length(var.public_subnet_cidrs)
instance_id = aws_instance.bastion[count.index].id
allocation_id = aws_eip.bastion_eip[count.index].id
}
Upload Common Functions
Copies a script of shared utility functions to the control plane node for use by the other scripts, connecting over SSH through the bastion host.
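Outside Terraform, the same transfer can be done by hand with OpenSSH's ProxyJump option. The addresses below are placeholders, not values from this setup; the commands are built into variables and printed rather than executed:

```shell
# Placeholder addresses -- substitute your bastion Elastic IP and the
# control plane's private IP from the Terraform outputs.
BASTION="ubuntu@203.0.113.10"
TARGET="ubuntu@10.0.2.10"

# Copy the script through the bastion in one hop, then mark it executable
SCP_CMD="scp -i terraform-key-pair.pem -o ProxyJump=$BASTION scripts/common-functions.sh $TARGET:/tmp/common-functions.sh"
SSH_CMD="ssh -i terraform-key-pair.pem -J $BASTION $TARGET chmod +x /tmp/common-functions.sh"
echo "$SCP_CMD"
echo "$SSH_CMD"
```

This is exactly what the `file` and `remote-exec` provisioners below automate, using `bastion_host` and `bastion_private_key` in their connection blocks.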
resource "null_resource" "upload_common_functions" {
depends_on = [null_resource.wait_for_master_ready]
provisioner "file" {
source = "${path.module}/${var.common_functions.source}"
destination = var.common_functions.destination
connection {
type = var.common_functions.connection.type
user = var.common_functions.connection.user
private_key = var.tls_private_key_pem
host = aws_instance.control_plane["0"].private_ip
bastion_host = aws_eip.bastion_eip[0].public_ip
bastion_user = var.common_functions.connection.bastion_user
bastion_private_key = var.tls_private_key_pem
}
}
# Make sure the file is executable
provisioner "remote-exec" {
inline = [
"chmod +x /tmp/common-functions.sh",
"echo 'Common functions uploaded successfully'"
]
connection {
type = var.common_functions.connection.type
user = var.common_functions.connection.user
private_key = var.tls_private_key_pem
host = aws_instance.control_plane["0"].private_ip
bastion_host = aws_eip.bastion_eip[0].public_ip
bastion_user = var.common_functions.connection.bastion_user
bastion_private_key = var.tls_private_key_pem
}
}
}
Control Plane Instance
Creates the primary Kubernetes master node, which runs the API server, scheduler, and controller manager. It is located in a private subnet (protected from the internet).
resource "aws_instance" "control_plane" {
for_each = { "0" = true }
ami = var.control_plane.ami
instance_type = var.control_plane.instance_type
key_name = var.key_pair_name
vpc_security_group_ids = [var.control_plane_security_group_id]
subnet_id = var.private_subnets[0].id
private_ip = var.control_plane_private_ips[0]
iam_instance_profile = var.kubernetes_master_instance_profile
root_block_device {
volume_size = var.control_plane.root_block_device.volume_size
volume_type = var.control_plane.root_block_device.volume_type
delete_on_termination = var.control_plane.root_block_device.delete_on_termination
}
user_data = templatefile("${path.module}/${var.control_plane.init_file}", {
common_functions = file("${path.module}/${var.common_functions.source}")
control_plane_endpoint = aws_lb.k8s_api.dns_name
control_plane_master_private_ip = var.control_plane_private_ips[0]
is_first_control_plane = "true"
})
tags = {
Name = "${terraform.workspace} - ${var.control_plane.name}"
}
}
Wait for Master Ready
Runs a script that waits for the Kubernetes master to be fully started, ensuring the master is ready before other nodes are created.
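The wait-for-master template is covered in its own section later; its core pattern is a bounded poll, which can be factored into a reusable helper. This is a sketch, not the actual template; the commented kubectl example assumes access to admin.conf:

```shell
# Poll a command until it succeeds or a timeout elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>
wait_until() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0   # command succeeded within the deadline
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1       # deadline exceeded
}

# Example: wait up to 150s for the API server, checking every 5s
# wait_until 150 5 kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
```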
resource "null_resource" "wait_for_master_ready" {
depends_on = [aws_instance.control_plane]
provisioner "remote-exec" {
inline = [
templatefile("${path.module}/${var.wait_for_master_ready.source}", {
common_functions = file("${path.module}/${var.common_functions.source}")
})
]
connection {
type = var.common_functions.connection.type
user = var.common_functions.connection.user
private_key = var.tls_private_key_pem
host = aws_instance.control_plane["0"].private_ip
bastion_host = aws_eip.bastion_eip[0].public_ip
bastion_user = var.common_functions.connection.bastion_user
bastion_private_key = var.tls_private_key_pem
timeout = var.common_functions.connection.timeout
}
}
triggers = {
instance_id = aws_instance.control_plane["0"].id
}
}
Secondary Control Plane Instances
Creates additional master nodes for high availability: if the primary master fails, these can take over. They are placed in different subnets from the primary master. Key difference: is_first_control_plane = "false" in user_data.
resource "aws_instance" "control_plane_secondary" {
for_each = { "1" = 1, "2" = 2 }
ami = var.control_plane_secondary.ami
instance_type = var.control_plane_secondary.instance_type
key_name = var.key_pair_name
vpc_security_group_ids = [var.control_plane_security_group_id]
subnet_id = var.private_subnets[each.value].id
private_ip = var.control_plane_private_ips[each.value]
iam_instance_profile = var.kubernetes_master_instance_profile
root_block_device {
volume_size = var.control_plane_secondary.root_block_device.volume_size
volume_type = var.control_plane_secondary.root_block_device.volume_type
delete_on_termination = var.control_plane_secondary.root_block_device.delete_on_termination
}
user_data = templatefile("${path.module}/${var.control_plane_secondary.init_file}", {
common_functions = file("${path.module}/${var.common_functions.source}")
control_plane_endpoint = aws_lb.k8s_api.dns_name
control_plane_master_private_ip = var.control_plane_private_ips[0]
is_first_control_plane = "false"
})
depends_on = [null_resource.wait_for_master_ready]
tags = {
Name = "${terraform.workspace} - ${var.control_plane_secondary.name}"
Environment = terraform.workspace
Project = "Kubernetes"
Role = "control-plane"
ManagedBy = "Terraform"
CostCenter = "Infrastructure"
MonitoringEnabled = "true"
SubnetType = "private"
CreatedDate = formatdate("YYYY-MM-DD", timestamp())
}
lifecycle {
ignore_changes = [tags["CreatedDate"]]
}
}
Worker Node Instances
Creates the Kubernetes worker nodes (default: 3) that run application pods and workloads. They are distributed across the private subnets using modulo arithmetic.
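The modulo expression simply wraps the worker index around the subnet list, so extra workers reuse subnets in round-robin order. A quick illustration with three subnets:

```shell
# Illustration of count.index % length(private_subnets) with 3 subnets:
# workers 0..5 land on subnets 0,1,2,0,1,2
SUBNET_COUNT=3
for index in 0 1 2 3 4 5; do
  echo "worker $index -> subnet $((index % SUBNET_COUNT))"
done
```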
resource "aws_instance" "worker_nodes" {
count = var.worker_nodes.count
ami = var.worker_nodes.ami
instance_type = var.worker_nodes.instance_type
key_name = var.key_pair_name
vpc_security_group_ids = [var.worker_node_security_group_id]
# Use modulo to distribute worker nodes across available subnets
subnet_id = var.private_subnets[count.index % length(var.private_subnets)].id
iam_instance_profile = var.kubernetes_worker_instance_profile
root_block_device {
volume_size = var.worker_nodes.root_block_device.volume_size
volume_type = var.worker_nodes.root_block_device.volume_type
delete_on_termination = var.worker_nodes.root_block_device.delete_on_termination
}
user_data = templatefile("${path.module}/${var.worker_nodes.init_file}", {
common_functions = file("${path.module}/${var.common_functions.source}")
})
# Wait for at least the master control plane to be ready
depends_on = [null_resource.wait_for_master_ready]
tags = {
Name = "${terraform.workspace} - ${var.worker_nodes.name} ${count.index + 1}"
Environment = terraform.workspace
Project = "Kubernetes"
Role = "worker-node"
ManagedBy = "Terraform"
CostCenter = "Infrastructure"
MonitoringEnabled = "true"
SubnetType = "private"
NodeType = "compute"
WorkloadCapable = "true"
CreatedDate = formatdate("YYYY-MM-DD", timestamp())
}
lifecycle {
ignore_changes = [tags["CreatedDate"]]
}
}
Wait for Workers to Join
Runs a script that waits for all worker nodes to join the cluster.
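The real logic lives in scripts/wait-for-workers.sh.tmpl (shown in a later section); the essential step is counting Ready workers from `kubectl get nodes` output and comparing against the expected total. A sketch of that parsing step, with the node listing hardcoded for illustration:

```shell
# Sample `kubectl get nodes --no-headers` output, hardcoded for illustration
nodes_output="control-plane-1   Ready     control-plane   10m   v1.33.0
worker-1          Ready     <none>          5m    v1.33.0
worker-2          Ready     <none>          4m    v1.33.0
worker-3          NotReady  <none>          1m    v1.33.0"

expected_workers=3

# Count Ready nodes, excluding control planes; -w avoids matching "NotReady"
ready_workers=$(echo "$nodes_output" | grep -v control-plane | grep -cw Ready)
echo "ready workers: $ready_workers / $expected_workers"
```

The real script wraps this check in a timeout loop (600 seconds, checking every 30, per the variables passed to the template).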
resource "null_resource" "wait_for_workers_to_join" {
depends_on = [
aws_instance.worker_nodes,
aws_instance.control_plane_secondary
]
provisioner "remote-exec" {
inline = [
templatefile("${path.module}/${var.wait_for_workers_to_join.init_file}", {
common_functions = file("${path.module}/${var.common_functions.source}")
expected_workers = length(aws_instance.worker_nodes)
timeout_seconds = 600
check_interval = 30
log_file = var.wait_for_workers_to_join.log_file
})
]
connection {
type = var.common_functions.connection.type
user = var.common_functions.connection.user
private_key = var.tls_private_key_pem
host = aws_instance.control_plane["0"].private_ip
bastion_host = aws_eip.bastion_eip[0].public_ip
bastion_user = var.common_functions.connection.bastion_user
bastion_private_key = var.tls_private_key_pem
}
}
triggers = {
worker_instances = join(",", aws_instance.worker_nodes[*].id)
control_plane_instances = join(",", values(aws_instance.control_plane_secondary)[*].id)
}
}
Label Worker Nodes
Applies labels to worker nodes to organize them for workload scheduling. Labels nodes with the “worker” role so they display properly in kubectl output instead of showing <none> as their role.
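A minimal sketch of what such a labeling script might do. The node names are illustrative, and the kubectl commands are printed rather than executed (the real template, scripts/label-worker-nodes.sh.tmpl, runs on the control plane with cluster access):

```shell
# Illustrative worker node names; the real script would discover them
# via `kubectl get nodes` instead of hardcoding them.
workers="worker-1 worker-2 worker-3"

for node in $workers; do
  # The node-role.kubernetes.io/worker label is what makes
  # `kubectl get nodes` show ROLES=worker instead of <none>
  cmd="kubectl label node $node node-role.kubernetes.io/worker=worker --overwrite"
  echo "$cmd"
done
```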
resource "null_resource" "label_worker_nodes" {
depends_on = [null_resource.wait_for_workers_to_join]
provisioner "remote-exec" {
inline = [
templatefile("${path.module}/${var.label_worker_nodes.init_file}", {
common_functions = file("${path.module}/${var.common_functions.source}")
expected_worker_count = var.label_worker_nodes.expected_worker_count
})
]
connection {
type = var.common_functions.connection.type
user = var.common_functions.connection.user
private_key = var.tls_private_key_pem
host = aws_instance.control_plane["0"].private_ip
bastion_host = aws_eip.bastion_eip[0].public_ip
bastion_user = var.common_functions.connection.bastion_user
bastion_private_key = var.tls_private_key_pem
}
}
triggers = {
worker_wait_complete = null_resource.wait_for_workers_to_join.id
}
}
Kubernetes API Load Balancer
Creates internal (private subnets only) Network Load Balancer for Kubernetes API server. The purpose is to distribute API requests across multiple master nodes.
resource "aws_lb" "k8s_api" {
name = "k8s-api-lb"
internal = true
load_balancer_type = "network"
subnets = [for subnet in var.private_subnets : subnet.id]
tags = {
Name = "${terraform.workspace} - Kubernetes API Load Balancer"
Environment = terraform.workspace
Project = "Kubernetes"
Role = "api-load-balancer"
Component = "networking"
Purpose = "kubernetes-api-endpoint"
ManagedBy = "Terraform"
CostCenter = "Infrastructure"
MonitoringEnabled = "true"
LoadBalancerType = "network"
Scheme = "internal"
Protocol = "tcp"
HighAvailability = "true"
SecurityLevel = "high"
CreatedDate = formatdate("YYYY-MM-DD", timestamp())
}
lifecycle {
ignore_changes = [tags["CreatedDate"]]
}
}
API Target Group
Creates the target group for API server health checks. It defines which servers receive traffic and how to check whether they are healthy:
- Health check: TCP connection test every 10 seconds
- Port: 6443 (the standard Kubernetes API port)
resource "aws_lb_target_group" "k8s_api" {
name = "k8s-api-tg"
port = 6443
protocol = "TCP"
vpc_id = var.vpc_id
health_check {
protocol = "TCP"
port = 6443
healthy_threshold = 2
unhealthy_threshold = 2
interval = 10
}
tags = {
Name = "${terraform.workspace} - Kubernetes API Target Group"
Environment = terraform.workspace
Project = "Kubernetes"
Role = "api-target-group"
Component = "networking"
Purpose = "kubernetes-api-health-check"
ManagedBy = "Terraform"
CostCenter = "Infrastructure"
MonitoringEnabled = "true"
Protocol = "TCP"
Port = "6443"
HealthCheck = "enabled"
ServiceType = "kubernetes-api-server"
TargetType = "control-plane-nodes"
CreatedDate = formatdate("YYYY-MM-DD", timestamp())
}
lifecycle {
ignore_changes = [tags["CreatedDate"]]
}
}
Master Target Group Attachment
Adds the primary master node to the load balancer target group so that it receives API traffic through the load balancer.
resource "aws_lb_target_group_attachment" "k8s_api_master" {
target_group_arn = aws_lb_target_group.k8s_api.arn
target_id = aws_instance.control_plane["0"].id
port = 6443
}
Secondary Target Group Attachments
Adds the secondary master nodes to the load balancer target group so that all masters receive API traffic, providing high availability.
resource "aws_lb_target_group_attachment" "k8s_api_secondary" {
for_each = aws_instance.control_plane_secondary
target_group_arn = aws_lb_target_group.k8s_api.arn
target_id = each.value.id
port = 6443
}
Load Balancer Listener
Configures the load balancer to listen on port 6443, accepting incoming API requests and forwarding all traffic to the target group of healthy masters.
resource "aws_lb_listener" "k8s_api" {
load_balancer_arn = aws_lb.k8s_api.arn
port = 6443
protocol = "TCP"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.k8s_api.arn
}
}
Summary of flow:
- Bastion gets created with static IP
- Primary master gets created and initialized
- Wait for master to be ready
- Secondary masters join the cluster
- Worker nodes get created and join
- Wait for all workers to join
- Label workers for organization
- Load balancer distributes API traffic across all masters
Create a custom module and name it compute. Pass outputs from other modules.
# environments/development/main.tf
module "compute" {
source = "../../modules/compute"
# Pass AWS resources from development module
private_subnets = module.networking.private_subnets
public_subnets = module.networking.public_subnets
bastion_security_group_id = module.security_groups.bastion_security_group_id
control_plane_security_group_id = module.security_groups.control_plane_security_group_id
worker_node_security_group_id = module.security_groups.worker_node_security_group_id
kubernetes_master_instance_profile = module.iam.kubernetes_master_instance_profile
kubernetes_worker_instance_profile = module.iam.kubernetes_worker_instance_profile
key_pair_name = module.keypair.key_pair_name
tls_private_key_pem = module.keypair.tls_private_key_pem
vpc_id = module.networking.vpc_id
depends_on = [module.iam, module.keypair, module.networking, module.security_groups]
}
Kubernetes Control Planes Installation using Bash Script
This script essentially automates the creation of a highly available Kubernetes cluster on AWS, handling both the initial cluster setup and the addition of subsequent control plane nodes.
1. wait_for_variables() Waits for required environment variables to be available.
- Polls for 30 attempts (60 seconds total), checking whether control_plane_master_private_ip, control_plane_endpoint, and is_first_control_plane are set
- Returns 0 if all variables are available, 1 if a timeout occurs
#!/bin/bash
set -e
# Function: Wait for required environment variables to be available
wait_for_variables() {
max_attempts=30
sleep_interval=2
attempt=1
while [ $attempt -le $max_attempts ]; do
# Check if all required variables are set and non-empty
if [ -n "${control_plane_master_private_ip}" ] && [ -n "${control_plane_endpoint}" ] && [ -n "${is_first_control_plane}" ]; then
return 0
fi
sleep $sleep_interval
attempt=$((attempt + 1))
done
return 1
}
# Wait for variables or exit if timeout
if ! wait_for_variables; then
exit 1
fi
# Validate required environment variables are set
if [ -z "${control_plane_master_private_ip}" ] || [ -z "${control_plane_endpoint}" ] || [ -z "${is_first_control_plane}" ]; then
exit 1
fi
2. System Preparation Block. Prepares the system for Kubernetes installation.
- Swap Management: Disables swap memory (required by Kubernetes) and comments it out in /etc/fstab to prevent re-enabling on reboot
- Network Configuration: Enables IP forwarding by setting net.ipv4.ip_forward = 1 for pod-to-pod communication
- Package Updates: Updates system packages with retry logic for reliability
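The retry logic that recurs throughout these scripts can be factored into a single helper. This is a sketch of the pattern, not part of the original script, which inlines the loop each time:

```shell
# Retry a command up to N times with a fixed delay between attempts.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  attempts=$1; delay=$2; shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0       # success, stop retrying
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay" # back off before the next attempt
    fi
  done
  return 1           # all attempts failed
}

# Example, equivalent to the inline loops below:
# retry 3 10 apt-get update
```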
# SYSTEM PREPARATION
# Disable swap (required for Kubernetes)
swapoff -a
# Permanently disable swap by commenting it out in fstab
sed -i '/ swap / s/^/#/' /etc/fstab
# Enable IP forwarding for pod networking
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF
# Apply sysctl settings without reboot
sysctl --system
# Update package lists with retry logic
for attempt in 1 2 3; do
if apt-get update; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
3. Container Runtime Setup Block. Installs and configures containerd as the container runtime.
- Package Installation: Installs essential packages including containerd and security tools
- Repository Setup: Adds Docker’s GPG key and repository for containerd installation
- Containerd Configuration:
- Generates default config file
- Enables systemd cgroup driver (required for Kubernetes)
- Starts and enables the containerd service
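The sed edit that flips the cgroup driver can be sanity-checked against a sample of containerd's default config. The snippet below is a hardcoded excerpt for illustration; the real script edits /etc/containerd/config.toml in place:

```shell
# Minimal excerpt of a default containerd config.toml (illustrative)
sample_config='[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = false'

# The same substitution the install script applies to the real file
patched=$(echo "$sample_config" | sed 's/SystemdCgroup = false/SystemdCgroup = true/')
echo "$patched"
```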
# CONTAINER RUNTIME SETUP (containerd)
# Install required packages
apt-get install -y ca-certificates curl gnupg lsb-release containerd apt-transport-https unzip
# Create directory for APT keyrings
mkdir -p /etc/apt/keyrings
# Add Docker GPG key (for containerd installation)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# Make the key readable
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add Docker repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
noble stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
# Update package list again after adding repository
for attempt in 1 2 3; do
if apt-get update; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# Configure containerd
mkdir -p /etc/containerd
# Generate default containerd configuration
containerd config default | tee /etc/containerd/config.toml
# Enable systemd cgroup driver (required for Kubernetes)
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# Start and enable containerd service
systemctl restart containerd
systemctl enable containerd
4. Kubernetes Installation Block. Installs Kubernetes components.
- Repository Setup: Adds Kubernetes GPG key and repository
- Component Installation: Installs kubelet, kubeadm, and kubectl
- Package Protection: Uses apt-mark hold to prevent automatic updates that could break the cluster
- Service Management: Enables kubelet service
# KUBERNETES INSTALLATION
# Add Kubernetes GPG key with retry logic
for attempt in 1 2 3; do
if curl --connect-timeout 30 --max-time 60 -fsSL https://pkgs.k8s.io/core:/stable:/v1.33/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# Add Kubernetes repository
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
# Update package list
apt-get update
# Install Kubernetes components
apt-get install -y kubelet kubeadm kubectl
# Prevent automatic updates of Kubernetes packages
apt-mark hold kubelet kubeadm kubectl
# Enable kubelet service
systemctl enable --now kubelet
5. AWS CLI Installation Block. Installs AWS CLI for Parameter Store operations.
- Downloads, extracts, and installs AWS CLI v2
- Includes retry logic and verification
# AWS CLI INSTALLATION
# Download AWS CLI with retry logic
for attempt in 1 2 3; do
if curl --connect-timeout 30 --max-time 300 "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# Extract and install AWS CLI
unzip awscliv2.zip
sudo ./aws/install
# Verify AWS CLI installation
if ! aws --version; then
exit 1
fi
6. First Control Plane Node Block (is_first_control_plane = true). Initializes the first control plane node and sets up the cluster.
- Configuration Validation: Validates kubeadm config before cluster initialization
- Cluster Initialization: Creates the Kubernetes cluster with specific networking settings
- User Setup: Configures kubectl access for the ubuntu user
- Control Plane Health Check: Waits up to 150 seconds for the control plane to become responsive
- CNI Installation: Installs Calico for pod networking
- Certificate Regeneration:
- Backs up existing API server certificates
- Regenerates certificates to include load balancer DNS as Subject Alternative Name (SAN)
- This allows external access through the load balancer
- Join Command Generation:
- Creates join commands for both worker nodes and additional control planes
- Replaces private IP with load balancer DNS for external access
- Parameter Store Operations: Stores join commands in AWS Systems Manager for other nodes to retrieve
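Swapping the private IP for the load balancer DNS in the join command is a plain string substitution. The values below are placeholders, and the SSM upload is shown commented out since it needs the instance's IAM permissions to actually run:

```shell
# Placeholder values for illustration -- the real script reads these
# from its environment and from `kubeadm token create --print-join-command`
MASTER_IP="10.0.2.10"
LB_DNS="k8s-api-lb-0123456789.elb.us-east-1.amazonaws.com"
join_cmd="kubeadm join $MASTER_IP:6443 --token abcdef.0123456789abcdef --discovery-token-ca-cert-hash sha256:deadbeef"

# Replace the private IP with the load balancer DNS so nodes in other
# subnets join through the NLB endpoint
lb_join_cmd=$(echo "$join_cmd" | sed "s/$MASTER_IP/$LB_DNS/")
echo "$lb_join_cmd"

# Store it for worker nodes to retrieve (requires IAM permissions):
# aws ssm put-parameter --name /k8s/worker-join-command \
#   --type SecureString --value "$lb_join_cmd" --overwrite
```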
# CLUSTER INITIALIZATION OR JOIN
if [ "${is_first_control_plane}" = "true" ]; then
# FIRST CONTROL PLANE NODE SETUP
# Validate kubeadm configuration before initialization
if ! kubeadm config validate --config <(cat <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: "${control_plane_master_private_ip}"
bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${control_plane_master_private_ip}:6443"
apiServer:
certSANs:
- "${control_plane_endpoint}"
networking:
podSubnet: "192.168.0.0/16"
EOF
); then
exit 1
fi
# Initialize Kubernetes cluster
kubeadm init \
--control-plane-endpoint "${control_plane_master_private_ip}:6443" \
--apiserver-advertise-address="${control_plane_master_private_ip}" \
--upload-certs \
--pod-network-cidr=192.168.0.0/16 \
--apiserver-cert-extra-sans "${control_plane_endpoint}"
# Setup kubeconfig for ubuntu user
export KUBE_USER=ubuntu
mkdir -p /home/$KUBE_USER/.kube
sudo cp -i /etc/kubernetes/admin.conf /home/$KUBE_USER/.kube/config
sudo chown $KUBE_USER:$KUBE_USER /home/$KUBE_USER/.kube/config
# Wait for control plane to become responsive
control_plane_ready=false
for i in {1..30}; do
if KUBECONFIG=/etc/kubernetes/admin.conf kubectl get nodes &>/dev/null; then
control_plane_ready=true
break
fi
sleep 5
done
if [ "$control_plane_ready" = false ]; then
exit 1
fi
# Install Calico CNI (Container Network Interface)
for attempt in 1 2 3; do
if KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# CERTIFICATE REGENERATION FOR LOAD BALANCER
# Backup existing certificates
if [ ! -f /etc/kubernetes/pki/apiserver.crt ]; then
exit 1
fi
sudo mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.crt.bak
sudo mv /etc/kubernetes/pki/apiserver.key /etc/kubernetes/pki/apiserver.key.bak
# Create configuration for certificate regeneration with load balancer DNS
cat <<EOF | sudo tee /root/kubeadm-dns.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "${control_plane_endpoint}:6443"
apiServer:
certSANs:
- "${control_plane_endpoint}"
- "${control_plane_master_private_ip}"
EOF
# Regenerate API server certificates with load balancer DNS as SAN
sudo kubeadm init phase certs apiserver --config /root/kubeadm-dns.yaml
# Restart kubelet to pick up new certificates
sudo systemctl restart kubelet
# JOIN COMMAND GENERATION
# Generate join command for worker nodes
JOIN_COMMAND=$(kubeadm token create --print-join-command 2>/dev/null)
if [ -z "$JOIN_COMMAND" ]; then
exit 1
fi
# Generate certificate key for control plane nodes
CERT_KEY=$(sudo kubeadm init phase upload-certs --upload-certs 2>/dev/null | tail -n 1)
if [ -z "$CERT_KEY" ]; then
exit 1
fi
# Create control plane join command
CONTROL_PLANE_JOIN_COMMAND="$JOIN_COMMAND --control-plane --certificate-key $CERT_KEY"
WORKER_NODE_JOIN_COMMAND="$JOIN_COMMAND"
# Replace private IP with load balancer DNS in join commands
JOIN_COMMAND_WITH_DNS=$(echo "$CONTROL_PLANE_JOIN_COMMAND" | sed "s/${control_plane_master_private_ip}:6443/${control_plane_endpoint}:6443/g")
WORKER_NODE_JOIN_COMMAND_WITH_DNS=$(echo "$WORKER_NODE_JOIN_COMMAND" | sed "s/${control_plane_master_private_ip}:6443/${control_plane_endpoint}:6443/g")
# Store join commands in AWS Systems Manager Parameter Store
for attempt in 1 2 3; do
if aws ssm put-parameter \
--name "/k8s/control-plane/join-command" \
--value "$JOIN_COMMAND_WITH_DNS" \
--type "SecureString" \
--overwrite \
--region "us-east-1" \
--cli-connect-timeout 10 \
--cli-read-timeout 30; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
for attempt in 1 2 3; do
if aws ssm put-parameter \
--name "/k8s/worker-node/join-command" \
--value "$WORKER_NODE_JOIN_COMMAND_WITH_DNS" \
--type "SecureString" \
--overwrite \
--region "us-east-1" \
--cli-connect-timeout 10 \
--cli-read-timeout 30; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
else
...
7. Additional Control Plane Node Block (is_first_control_plane = false), inside the else statement. Joins additional control plane nodes to the existing cluster.
- Command Retrieval: Retrieves the control plane join command from AWS Parameter Store with retry logic
- Cluster Join: Executes the join command to add this node as an additional control plane
- Configuration Update: Updates the local kubeconfig to use the load balancer endpoint instead of the first node’s IP
- User Setup: Configures kubectl access for the ubuntu user so that the copied kubeconfig reflects the load balancer endpoint
else
...
# ADDITIONAL CONTROL PLANE NODE SETUP
# Wait before retrieving join command
sleep 120
# Retrieve join command from AWS Systems Manager Parameter Store
for attempt in 1 2 3; do
JOIN_CMD=$(aws ssm get-parameter \
--region us-east-1 \
--name "/k8s/control-plane/join-command" \
--with-decryption \
--query "Parameter.Value" \
--output text \
--no-cli-pager \
--cli-read-timeout 30 \
--cli-connect-timeout 10 2>/dev/null)
if [ $? -eq 0 ] && [ -n "$JOIN_CMD" ] && [[ "$JOIN_CMD" != *"error"* ]] && [[ "$JOIN_CMD" != "None" ]]; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 20
fi
done
# Join the existing cluster as additional control plane
if eval "sudo $JOIN_CMD"; then
# Success
:
else
exit 1
fi
# Update kubeconfig to use load balancer endpoint
if [ -f /etc/kubernetes/admin.conf ]; then
sudo sed -i "s|https://${control_plane_master_private_ip}:6443|https://${control_plane_endpoint}:6443|g" /etc/kubernetes/admin.conf
# Setup kubeconfig for ubuntu user
export KUBE_USER=ubuntu
mkdir -p /home/$KUBE_USER/.kube
sudo cp -i /etc/kubernetes/admin.conf /home/$KUBE_USER/.kube/config
sudo chown $KUBE_USER:$KUBE_USER /home/$KUBE_USER/.kube/config
else
exit 1
fi
fi
Key Design Patterns:
- Retry Logic: Most network operations include retry mechanisms for reliability
- Conditional Execution: The script branches based on whether this is the first control plane node
- Error Handling: Uses set -e to exit on any command failure
- High Availability: Configures the cluster to use a load balancer endpoint for external access
- Security: Uses proper certificate management and secure parameter storage
- Idempotency: Many operations are designed to be safely re-runnable
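The retry pattern listed above recurs throughout the script as a literal `for attempt in 1 2 3` loop. It could be factored into a small helper; the sketch below uses a hypothetical `retry` function (not part of the actual script) exercised against a simulated flaky command:

```shell
# Hypothetical retry helper (illustration only): run a command up to N times,
# sleeping between failed attempts, mirroring the "for attempt in 1 2 3" loops.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Simulate an operation that fails twice, then succeeds on the third attempt
tries=0
flaky() {
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}
retry 3 0 flaky && echo "succeeded after $tries attempts"
```

The real scripts inline this logic because user-data runs as a single flat file, but the helper makes the pattern easier to see.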
Kubernetes Master Control Plane Wait Script
This script is typically used where you need to:
- Wait for a newly created Kubernetes master node to become fully operational
- Verify the installation completed successfully before proceeding with additional configuration
- Ensure the cluster is ready to accept worker nodes or workload deployments
- Provide debugging information if the setup fails
The script essentially acts as a “health check” that confirms a Kubernetes control plane is not just installed, but fully ready for use.
1. Cloud-Init Completion Wait Block. Waits for the cloud-init process to complete before proceeding.
- Timeout Protection: Uses a 20-minute timeout (1200 seconds) to prevent infinite waiting
- Status Monitoring: Continuously polls cloud-init status every 30 seconds
- Success Detection: Looks for “done” status indicating successful completion
- Error Handling: If status contains “error”, displays detailed error information and exits
#!/bin/bash
# Wait for Kubernetes master control plane to be ready
set -e
# CLOUD-INIT COMPLETION WAIT
# Wait for cloud-init to finish (up to 20 minutes)
timeout 1200 bash -c '
while true; do
status=$(sudo cloud-init status 2>/dev/null || echo "unknown")
if [[ "$status" == *"done"* ]]; then
break
elif [[ "$status" == *"error"* ]]; then
sudo cloud-init status --long 2>&1
exit 1
else
sleep 30
fi
done
'
2. Installation Verification Block. Verifies that the Kubernetes installation completed successfully.
- Success Log Check: Looks for /var/log/k8s-install-success.txt as proof of successful installation
- Success Path: If found, displays the last 10 lines of the success log
- Error Log Check: If no success log, checks for /var/log/k8s-install-error.txt
- Error Path: If error log exists, displays its contents and exits with failure
- Fallback: If neither log exists, shows cloud-init output for debugging and exits
# INSTALLATION VERIFICATION
# Verify Kubernetes installation completed successfully
if [ -f /var/log/k8s-install-success.txt ]; then
# Installation success log found
tail -10 /var/log/k8s-install-success.txt
else
# No success log found, check for errors
if [ -f /var/log/k8s-install-error.txt ]; then
# Error log found - installation failed
cat /var/log/k8s-install-error.txt
exit 1
else
# No error log either, check cloud-init output
sudo tail -50 /var/log/cloud-init-output.log
exit 1
fi
fi
3. Filesystem Verification Block. Inspects the filesystem to verify expected files and directories exist. Provides diagnostic information about what files were created during installation.
- Home Directory Check: Lists the contents of the /home/ubuntu/ directory
- Kube Directory Check: Checks for the /home/ubuntu/.kube/ directory (user kubectl config)
- Kubernetes Directory Check: Checks for the /etc/kubernetes/ directory (system configs)
- Non-Fatal: Uses error suppression (2>/dev/null) since some directories might not exist yet
# FILESYSTEM VERIFICATION
# Check filesystem after installation
ls -la /home/ubuntu/
ls -la /home/ubuntu/.kube/ 2>/dev/null || echo 'No .kube directory yet'
ls -la /etc/kubernetes/ 2>/dev/null || echo 'No /etc/kubernetes directory yet'
4. Kubeconfig Detection Block. Locates and sets up kubectl configuration for cluster access. kubectl requires proper configuration to communicate with the cluster.
- User Config Priority: First checks for the user-specific config at /home/ubuntu/.kube/config
- Admin Config Fallback: If the user config is missing, tries the system admin config at /etc/kubernetes/admin.conf
- Environment Setup: Sets the KUBECONFIG environment variable to point to the config file that was found
- Failure Handling: If no config is found, lists the /etc/kubernetes/ directory contents and exits
# KUBECONFIG DETECTION
# Check for kubeconfig file and set KUBECONFIG environment variable
if [ -f /home/ubuntu/.kube/config ]; then
export KUBECONFIG=/home/ubuntu/.kube/config
elif [ -f /etc/kubernetes/admin.conf ]; then
export KUBECONFIG=/etc/kubernetes/admin.conf
else
# No kubeconfig found after installation
ls -la /etc/kubernetes/ 2>/dev/null || echo 'No /etc/kubernetes directory'
exit 1
fi
5. kubectl Functionality Test Block. Verifies the kubectl command-line tool is working properly. Ensures the kubectl tool itself is functional before testing cluster connectivity.
- Version Check: Runs kubectl version --client to test basic functionality
- Binary Verification: Confirms kubectl is installed and accessible
- Path Debugging: If kubectl fails, shows where (or whether) kubectl is installed and displays PATH
# KUBECTL FUNCTIONALITY TEST
# Test kubectl client functionality
kubectl version --client 2>&1
if kubectl version --client >/dev/null 2>&1; then
# kubectl is working
:
else
# kubectl not working
which kubectl 2>/dev/null || echo 'kubectl not in PATH'
echo "PATH contents: $PATH"
exit 1
fi
6. API Server Connectivity Test Block. Tests connectivity to the Kubernetes API server. The API server must be responding before the cluster can be considered ready.
- Health Endpoint: Uses kubectl get --raw /healthz to test API server health
- Timeout Protection: 5-minute timeout (300 seconds) to prevent infinite waiting
- Retry Logic: Continuously retries every 10 seconds until success or timeout
# API SERVER CONNECTIVITY TEST
# Test API server connectivity and readiness
timeout 300 bash -c '
while ! kubectl get --raw /healthz >/dev/null 2>&1; do
sleep 10
done
'
7. System Services Status Check Block. Verifies critical Kubernetes system services are running. These services must be running for the cluster to function properly.
- kubelet Status: Checks the Kubernetes node agent service
- containerd Status: Checks the container runtime service
- Limited Output: Shows only first 10 lines to avoid overwhelming output
# SYSTEM SERVICES STATUS CHECK
# Check status of critical Kubernetes services
systemctl status kubelet --no-pager 2>&1 | head -10
systemctl status containerd --no-pager 2>&1 | head -10
8. Final Cluster Verification Block. Performs comprehensive cluster functionality tests. Confirms the cluster is not just running, but fully functional.
- Node Status: Lists all cluster nodes to verify cluster membership
- System Pods: Checks the status of system pods in the kube-system namespace
- Pod Verification: Writes the pod output to a temporary file and displays the first 10 entries
# FINAL CLUSTER VERIFICATION
# Verify cluster is functional
kubectl get nodes 2>&1
# Check system pods status
kubectl get pods -n kube-system --no-headers > /tmp/pods_output 2>&1
head -10 /tmp/pods_output
# SUCCESS - Control plane is ready
Key Design Patterns:
- Progressive Validation: Each step builds on the previous one, from basic system readiness to full cluster functionality
- Timeout Protection: Critical waits include timeouts to prevent infinite hanging
- Graceful Degradation: Provides diagnostic information when things fail
- Error Propagation: Uses set -e to exit immediately on any command failure
- Comprehensive Testing: Tests multiple layers from file system to cluster API
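The timeout-protected waits described above all follow the same shape: wrap a polling loop in `timeout` so that a stuck condition cannot hang the script forever. A self-contained sketch, with a marker file standing in for the real condition (cloud-init finishing, `/healthz` responding):

```shell
# Sketch of the timeout-protected polling pattern, with a marker file
# standing in for the real readiness condition (hypothetical example).
marker=$(mktemp -u)              # path for the "ready" signal
( sleep 1; touch "$marker" ) &   # simulate the condition becoming true later

if timeout 10 bash -c "until [ -f '$marker' ]; do sleep 0.2; done"; then
  echo "condition met before the deadline"
else
  echo "gave up after the timeout" >&2
  exit 1
fi
rm -f "$marker"
```

If the condition never comes true, `timeout` kills the inner shell and returns a non-zero status (124), which the `if` turns into a clean failure path instead of an infinite hang.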
Kubernetes Worker Nodes Installation using Bash Script
This script essentially automates the process of preparing a server and joining it to an existing Kubernetes cluster as a worker node, handling all the prerequisites and configuration needed for the node to participate in the cluster and run workloads. For example,
You: "I need 3 more worker nodes for my cluster"
AWS/Terraform: "Creating 3 new servers..."
This Script (on each server): "Let me become a worker node..."
Script: "Preparing system... Installing container runtime... Installing Kubernetes..."
Script: "Getting join command from the managers..."
Script: "Joining cluster as worker node..."
Script: "SUCCESS! I'm now a worker node ready to run applications!"
What happens after this script runs:
- The server becomes a worker node in your Kubernetes cluster
- It can now run your applications (pods, containers)
- The control plane can schedule work on this node
- Your cluster has more capacity to run workloads
1. System Preparation Block. Prepares the system for Kubernetes installation.
- Swap Management:
- Disables active swap memory (Kubernetes requirement)
- Comments out swap entries in /etc/fstab to prevent re-enabling on reboot
- Network Configuration:
- Enables IP forwarding (net.ipv4.ip_forward = 1) for pod-to-pod communication
- Applies network settings immediately without requiring a reboot
- Package Updates: Updates system packages with retry logic for network reliability
#!/bin/bash
set -e
# SYSTEM PREPARATION
# Disable swap (required for Kubernetes)
swapoff -a
# Permanently disable swap by commenting it out in fstab
sed -i '/ swap / s/^/#/' /etc/fstab
# Enable IP forwarding for pod networking
cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF
# Apply sysctl settings without reboot
sysctl --system
# Update package lists with retry logic
for attempt in 1 2 3; do
if apt-get update; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
2. Container Runtime Setup Block. Installs and configures containerd as the container runtime.
- Package Installation: Installs essential packages including:
- ca-certificates, curl, gnupg for secure downloads
- containerd for the container runtime
- apt-transport-https, unzip for additional operations
- Repository Setup:
- Creates APT keyring directory
- Downloads and installs Docker’s GPG key (containerd comes from Docker repo)
- Makes the key readable by all users
- Adds Docker repository to APT sources
- Containerd Configuration:
- Creates containerd configuration directory
- Generates default configuration file
- Enables systemd cgroup driver (required for proper Kubernetes integration)
- Restarts and enables containerd service
# CONTAINER RUNTIME SETUP (containerd)
# Install required packages
apt-get install -y ca-certificates curl gnupg lsb-release containerd apt-transport-https unzip
# Create directory for APT keyrings
mkdir -p /etc/apt/keyrings
# Add Docker GPG key (for containerd installation)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# Make the key readable
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add Docker repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
noble stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
# Update package list again after adding repository
for attempt in 1 2 3; do
if apt-get update; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# Configure containerd
mkdir -p /etc/containerd
# Generate default containerd configuration
containerd config default | tee /etc/containerd/config.toml
# Enable systemd cgroup driver (required for Kubernetes)
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# Start and enable containerd service
systemctl restart containerd
systemctl enable containerd
3. Kubernetes Installation Block. Installs the Kubernetes components needed for worker nodes.
- Repository Setup:
- Downloads Kubernetes GPG key with retry logic and timeouts
- Adds official Kubernetes repository to APT sources
- Component Installation:
- Installs kubelet (node agent) and kubeadm (cluster management tool)
- Installs
- Package Protection: Uses apt-mark hold to prevent automatic updates that could break cluster compatibility
- Service Management: Enables kubelet service to start automatically
# KUBERNETES INSTALLATION
# Add Kubernetes GPG key with retry logic
for attempt in 1 2 3; do
if curl --connect-timeout 30 --max-time 60 -fsSL https://pkgs.k8s.io/core:/stable:/v1.33/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# Add Kubernetes repository
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.33/deb/ /' | tee /etc/apt/sources.list.d/kubernetes.list
# Update package list
apt-get update
# Install Kubernetes components (kubelet and kubeadm only - no kubectl needed on worker)
apt-get install -y kubelet kubeadm
# Prevent automatic updates of Kubernetes packages
apt-mark hold kubelet kubeadm
# Enable kubelet service
systemctl enable --now kubelet
4. AWS CLI Installation Block. Installs the AWS CLI for Parameter Store access.
- Download: Downloads AWS CLI v2 installer with retry logic and extended timeout (5 minutes)
- Installation: Extracts ZIP file and runs installer
- Verification: Confirms AWS CLI is properly installed and accessible
# AWS CLI INSTALLATION
# Download AWS CLI with retry logic
for attempt in 1 2 3; do
if curl --connect-timeout 30 --max-time 300 "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 10
fi
done
# Extract and install AWS CLI
unzip awscliv2.zip
sudo ./aws/install
# Verify AWS CLI installation
if ! aws --version; then
exit 1
fi
5. Cluster Join Process Block. Joins this node to the existing Kubernetes cluster as a worker.
- Wait Period: Waits 2 minutes to ensure the control plane has stored the join command in Parameter Store
- Command Retrieval:
- Retrieves worker node join command from AWS Systems Manager Parameter Store
- Uses retry logic with 20-second intervals
- Validates the command is not empty, doesn’t contain errors, and isn’t “None”
- Accesses the /k8s/worker-node/join-command parameter (different from the control plane command)
- Cluster Join:
- Executes the retrieved join command with sudo privileges
- The join command typically looks like:
kubeadm join <load-balancer-dns>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
# CLUSTER JOIN PROCESS
# Wait for join command to be available in Parameter Store
sleep 120
# Retrieve worker node join command from AWS Systems Manager Parameter Store
for attempt in 1 2 3; do
JOIN_CMD=$(aws ssm get-parameter \
--region us-east-1 \
--name "/k8s/worker-node/join-command" \
--with-decryption \
--query "Parameter.Value" \
--output text \
--no-cli-pager \
--cli-read-timeout 30 \
--cli-connect-timeout 10 2>/dev/null)
if [ $? -eq 0 ] && [ -n "$JOIN_CMD" ] && [[ "$JOIN_CMD" != *"error"* ]] && [[ "$JOIN_CMD" != "None" ]]; then
break
else
if [ $attempt -eq 3 ]; then
exit 1
fi
sleep 20
fi
done
# Execute the join command to add this node as a worker to the cluster
if eval "sudo $JOIN_CMD"; then
# Success - node joined cluster
:
else
exit 1
fi
Key Differences from Control Plane Script:
- Simpler Role: Worker nodes only need to join the cluster, not initialize or manage it
- No kubectl: Worker nodes don’t need cluster management tools
- No Certificate Management: Workers don’t handle cluster certificates
- No CNI Installation: Container networking is managed by control plane
- Single Join Command: Uses worker-specific join command from Parameter Store
- No Additional Configuration: No need to update configs or generate new commands
Design Patterns:
- Retry Logic: Network operations include retry mechanisms for reliability
- Parameter Store Integration: Uses AWS SSM to retrieve join commands securely
- Error Handling: Uses set -e to exit on any command failure
- Validation: Checks command retrieval success before execution
- Minimal Installation: Only installs components needed for worker node functionality
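The validation pattern above, checking that the retrieved value is non-empty, is not the literal string `None`, and contains no error text, can be factored into a standalone function before anything is passed to `eval`. `validate_join_cmd` below is a hypothetical illustration, not part of the actual script:

```shell
# Hypothetical helper (illustration only): sanity-check a join command
# retrieved from Parameter Store before it is ever passed to eval.
validate_join_cmd() {
  local cmd="$1"
  [ -n "$cmd" ] || return 1            # must be non-empty
  [ "$cmd" != "None" ] || return 1     # the AWS CLI prints "None" for a missing value
  case "$cmd" in
    *error*) return 1 ;;               # reject captured error text
    "kubeadm join "*) return 0 ;;      # must actually be a kubeadm join command
    *) return 1 ;;
  esac
}

validate_join_cmd "kubeadm join lb.example.com:6443 --token abc.def --discovery-token-ca-cert-hash sha256:1234" \
  && echo "join command looks valid"
```

Because the retrieved string is executed with `eval`, rejecting anything that does not start with `kubeadm join` is a cheap defense against running garbage pulled from a misconfigured parameter.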
Kubernetes Wait for Worker Node Script
This script waits and watches for worker nodes to join a Kubernetes cluster and become ready to run applications. Something like:
You’re organizing a team project:
- You’re expecting 3 team members to join you. (EXPECTED_WORKERS = 3)
- You’re willing to wait up to 30 minutes for everyone to show up. (TIMEOUT_SECONDS = 1800)
- Every 30 seconds, you’ll check to see who’s arrived so far. (CHECK_INTERVAL = 30)
What the scripts monitor?
Stage 1: Node Join Detection
- Counts how many worker nodes have joined the cluster
- Like counting how many people walked into the office
Stage 2: Node Readiness Check
- Counts how many worker nodes are ready (not just joined)
- Like checking if people have their computers set up and are actually ready to work
When you create a Kubernetes cluster:
- Control plane starts first (the “manager” nodes)
- Worker nodes join later (the “worker” nodes that run your apps)
- You need to wait for all workers to join and be ready before you can deploy applications
For example,
You: "I want 5 worker nodes in my cluster"
Terraform: "OK, creating 5 worker nodes..."
This Script: "I'll wait here and watch for all 5 to join and be ready"
*Time passes...*
Script: "1 worker joined... 2 workers joined... 3 workers joined..."
Script: "All 5 joined! Now waiting for them to be ready..."
Script: "Worker 1 ready... Worker 2 ready... All ready!"
Script: "SUCCESS! Your cluster is ready to use!"
Without this script, you might try to deploy apps too early and get errors like:
- “No nodes available to schedule pods”
- “Insufficient resources”
- Apps failing because nodes aren’t ready yet
With this script, you know for certain that your cluster is 100% ready before you try to use it. In essence: It’s a “safety check” that prevents you from using a cluster before it’s fully operational.
1. Configuration Setup Block. Initializes the script environment and configuration.
- Kubeconfig Export: Sets KUBECONFIG=/home/ubuntu/.kube/config for kubectl access
- Variable Assignment: Retrieves configuration from Terraform variables:
- EXPECTED_WORKERS: Number of worker nodes expected to join
- TIMEOUT_SECONDS: Maximum time to wait for nodes
- CHECK_INTERVAL: Time between status checks
- LOG_FILE: Path where detailed logs are saved
- Log File Setup: Creates the log file and makes it writable (chmod 666)
#!/bin/bash
set -e
# CONFIGURATION SETUP
# Export kubeconfig for kubectl access
export KUBECONFIG=/home/ubuntu/.kube/config
# Configuration from Terraform variables
EXPECTED_WORKERS=${expected_workers}
TIMEOUT_SECONDS=${timeout_seconds}
CHECK_INTERVAL=${check_interval}
LOG_FILE="${log_file}"
# Create and configure log file
sudo touch "$LOG_FILE"
sudo chmod 666 "$LOG_FILE"
2. count_worker_nodes() Function. Counts how many worker nodes have joined the cluster (regardless of readiness). Tracks the joining progress of worker nodes.
- Node Listing: Gets all nodes without headers using kubectl get nodes --no-headers
- Filtering: Excludes control plane nodes by filtering out:
- Lines containing "control-plane"
- Lines containing "master"
- Counting: Uses wc -l to count the remaining lines
- Error Handling: Returns 0 if kubectl fails
# WORKER NODE COUNTING FUNCTIONS
# Function to count current worker nodes (joined but may not be ready)
count_worker_nodes() {
kubectl get nodes --no-headers 2>/dev/null | \
grep -v control-plane | \
grep -v master | \
wc -l || echo 0
}
3. count_ready_worker_nodes() Function. Counts how many worker nodes are both joined AND ready for workloads. Ensures nodes are not just joined but actually functional.
- Node Listing: Gets all nodes without headers
- Multi-Stage Filtering:
- Excludes control plane nodes (same as above)
- Additionally filters for nodes whose status is exactly "Ready" using grep -w Ready (a plain grep Ready would also count "NotReady" nodes)
- Counting: Counts nodes that pass all filters
- Error Handling: Returns 0 if kubectl fails
# Function to count ready worker nodes (joined and ready for workloads)
count_ready_worker_nodes() {
kubectl get nodes --no-headers 2>/dev/null | \
grep -v control-plane | \
grep -v master | \
grep -w Ready | \
wc -l || echo 0
}
4. Main Wait Loop with Timeout Block. Continuously monitors worker node status until completion or timeout.
4a. Timeout Management
- Time Calculation: Calculates start time, end time, and current time in Unix timestamps
- Timeout Check: Exits with error if current time exceeds end time
- Failure Path: Shows current cluster state and exits with code 1
4b. Status Monitoring
- Node Counting: Calls both counting functions to get current status
- Time Tracking: Calculates elapsed time and remaining time
- Progress Display: Shows comprehensive status including:
- Current timestamp
- Worker node counts (joined vs ready)
- Expected count
- Time statistics
4c. Completion Logic
- Join Check: Verifies that enough nodes have joined (current_workers >= EXPECTED_WORKERS)
- Readiness Check: Verifies that enough nodes are ready (ready_workers >= EXPECTED_WORKERS)
- Two-Stage Success:
- First celebrates when nodes join
- Then waits for them to become ready
- Loop Exit: Breaks out of loop only when both conditions are met
4d. Status Display and Wait
- Cluster State: Shows the current node status with kubectl get nodes
- Log Recording: Saves output to the log file
- Interval Wait: Sleeps for the configured check interval before next iteration
# MAIN WAIT LOOP WITH TIMEOUT
# Calculate timeout timestamps
start_time=$(date +%s)
end_time=$((start_time + TIMEOUT_SECONDS))
while true; do
current_time=$(date +%s)
# Check if timeout has been reached
if [ $current_time -gt $end_time ]; then
echo "TIMEOUT: Worker nodes did not join within $TIMEOUT_SECONDS seconds"
echo "Current cluster state:"
kubectl get nodes --no-headers 2>&1 | tee -a "$LOG_FILE" || echo "kubectl failed"
exit 1
fi
# Count current worker nodes
current_workers=$(count_worker_nodes)
ready_workers=$(count_ready_worker_nodes)
# Calculate elapsed and remaining time
elapsed=$((current_time - start_time))
remaining=$((end_time - current_time))
echo "Status check at $(date)"
echo "Current worker nodes: $current_workers"
echo "Ready worker nodes: $ready_workers"
echo "Expected: $EXPECTED_WORKERS"
echo "Elapsed: $elapsed s, Remaining: $remaining s"
# Check if we have enough worker nodes joined
if [ "$current_workers" -ge "$EXPECTED_WORKERS" ]; then
echo "All $EXPECTED_WORKERS worker nodes have joined the cluster!"
# Check if they are all ready
if [ "$ready_workers" -ge "$EXPECTED_WORKERS" ]; then
echo "All worker nodes are also ready!"
break
else
echo "Worker nodes joined but not all are ready yet. Waiting for readiness..."
fi
fi
# Show current cluster state
echo "Current cluster state:"
kubectl get nodes --no-headers 2>&1 | tee -a "$LOG_FILE" || echo "kubectl command failed"
echo "Waiting $CHECK_INTERVAL seconds before next check..."
sleep $CHECK_INTERVAL
done
5. Final Status Display Block. Shows completion status and saves final results.
- Detailed Output: Uses kubectl get nodes -o wide for comprehensive node information
- Log Persistence: Saves the final state to the log file
- Success Confirmation: Confirms successful completion
- Log Location: Reminds user where detailed logs are saved
# FINAL STATUS DISPLAY
# Show detailed final cluster state
echo "Final cluster state:"
kubectl get nodes -o wide 2>&1 | tee -a "$LOG_FILE"
echo "Worker nodes join process completed successfully!"
echo "Log saved to: $LOG_FILE"
Key Design Patterns:
- Polling Loop: Continuously checks status at regular intervals
- Timeout Protection: Prevents infinite waiting with configurable timeout
- Two-Stage Validation: Distinguishes between “joined” and “ready” states
- Progress Tracking: Provides detailed status updates during the wait
- Error Resilience: Handles kubectl failures gracefully
- Comprehensive Logging: Saves detailed information for debugging
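The deadline arithmetic driving the polling loop can be exercised on its own. The sketch below shortens the timeout to two seconds and stubs out the kubectl counting calls; unlike the real loop, it always runs to the timeout since there is no success condition:

```shell
# Condensed sketch of the main loop's deadline arithmetic; the kubectl
# counting calls are stubbed out so only the timing logic runs.
TIMEOUT_SECONDS=2
CHECK_INTERVAL=1
start_time=$(date +%s)
end_time=$((start_time + TIMEOUT_SECONDS))
checks=0
while true; do
  now=$(date +%s)
  if [ "$now" -gt "$end_time" ]; then
    echo "timed out after $((now - start_time))s and $checks status checks"
    break
  fi
  checks=$((checks + 1))        # stand-in for count_worker_nodes / readiness checks
  sleep "$CHECK_INTERVAL"
done
```

Computing `end_time` once up front, rather than counting iterations, keeps the deadline honest even if individual checks (here stubbed out) take varying amounts of time.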
This script is typically used in infrastructure automation scenarios where:
- Terraform/CloudFormation: Waits for worker nodes to join after provisioning
- CI/CD Pipelines: Ensures cluster is fully ready before deploying applications
- Cluster Scaling: Verifies new nodes are operational after scaling events
- Testing: Confirms cluster readiness in automated testing environments
The script implements a two-stage success model:
- Stage 1: Worker nodes join the cluster (appear in kubectl get nodes)
- Stage 2: Worker nodes become ready (can schedule and run pods)
This is important because nodes can join a cluster but still be initializing, pulling images, or having network issues that prevent them from being ready for workloads.
The script ensures the cluster is not just numerically complete, but functionally ready for production use.
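The two counting stages can be demonstrated on a canned `kubectl get nodes --no-headers` snapshot (the node names and versions below are invented sample data). Note the use of `grep -w` so that a `NotReady` node is not miscounted as ready:

```shell
# Sample kubectl get nodes --no-headers output (invented data): one control
# plane, one ready worker, one worker that has joined but is not ready yet.
sample='cp-1      Ready      control-plane   10m   v1.33.0
worker-1  Ready      <none>          5m    v1.33.0
worker-2  NotReady   <none>          1m    v1.33.0'

# Stage 1: joined workers (everything that is not a control plane node)
joined=$(echo "$sample" | grep -v control-plane | grep -v master | wc -l)
# Stage 2: ready workers (-w keeps "NotReady" from matching "Ready")
ready=$(echo "$sample"  | grep -v control-plane | grep -v master | grep -w Ready | wc -l)
echo "joined=$joined ready=$ready"   # joined=2 ready=1
```

Here `worker-2` counts toward the joined total but not the ready total, which is exactly the gap the two-stage wait loop is designed to close.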
Kubernetes Worker Node Labelling Script
This script assigns proper “worker” labels to nodes in a Kubernetes cluster so they show up with the correct role instead of <none>.
1. Configuration Setup Block. Sets up the environment and logging.
- Kubeconfig Export: Sets up kubectl to access the cluster
- Worker Count: Gets expected number of workers from configuration
- Log File: Creates a timestamped log file to track what happens
#!/bin/bash
set -e
# CONFIGURATION SETUP
export KUBECONFIG=/home/ubuntu/.kube/config
EXPECTED_WORKERS=${expected_worker_count}
# Create log file with timestamp
LOG_FILE="/var/log/k8s-worker-labeling-$(date +%Y%m%d-%H%M%S).log"
sudo touch $LOG_FILE
sudo chmod 666 $LOG_FILE
2. Initial Cluster State Check Block. Shows what the cluster looks like before making changes.
- Display Nodes: Shows all nodes and their current status
- Error Handling: Exits if kubectl doesn’t work
- Documentation: Saves the “before” state to the log file
# INITIAL CLUSTER STATE CHECK
echo 'Current cluster state before labeling:'
# Note: tee masks kubectl's exit status, so check PIPESTATUS for the kubectl side
kubectl get nodes -o wide 2>&1 | tee -a $LOG_FILE
if [ "${PIPESTATUS[0]}" -ne 0 ]; then
echo 'FAILED to get nodes'
exit 1
fi
3. Stabilization Wait Block. Gives nodes time to fully initialize. Newly joined nodes might still be initializing.
- 30-Second Wait: Ensures nodes are fully ready before labeling
# STABILIZATION WAIT
echo 'Waiting 30 seconds for nodes to stabilize...'
sleep 30
4. Node Discovery Block. Finds all nodes in the cluster.
- Get Node List: Uses kubectl to get all node names
- JSONPath Query: Extracts just the names from the full node information
# NODE DISCOVERY
# Get all node names in the cluster
node_list=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
echo "All nodes found: $node_list"
5. Labeling Function with Retry Logic Block. Creates a reliable function to label individual nodes. What label_node_with_retry() does:
- Readiness Check: Waits up to 60 seconds for the node to be “Ready”
- Apply Label: Adds the node-role.kubernetes.io/worker=worker label
- Retry Logic: Tries up to 3 times if it fails
- Error Recovery: Waits 10 seconds between attempts
# LABELING FUNCTION WITH RETRY LOGIC
label_node_with_retry() {
local node="$1"
local max_attempts=3
local attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Attempt $attempt/$max_attempts to label node: $node"
# Wait for node to be ready
if kubectl wait --for=condition=Ready node/$node --timeout=60s 2>&1 | tee -a $LOG_FILE; then
echo "$node is ready, attempting to label..."
# Apply worker label
if kubectl label node "$node" node-role.kubernetes.io/worker=worker --overwrite 2>&1 | tee -a $LOG_FILE; then
echo "SUCCEEDED to label $node as worker"
return 0
else
echo "FAILED to label $node (attempt $attempt)"
fi
else
echo "$node not ready yet (attempt $attempt)"
fi
attempt=$((attempt + 1))
if [ $attempt -le $max_attempts ]; then
echo "Waiting 10 seconds before retry..."
sleep 10
fi
done
echo "FAILED to label $node after $max_attempts attempts"
return 1
}
6. First Labeling Pass Block. Goes through each node and labels appropriate ones as workers.
- Check Each Node: Loops through all discovered nodes
- Role Detection: Checks if node already has “control-plane” or “master” labels
- Skip Control Planes: Leaves management nodes alone
- Label Workers: Applies worker label to non-control-plane nodes
# FIRST LABELING PASS
# Process each node and determine if it should be labeled as worker
for node in $node_list; do
if [ -n "$node" ]; then
echo "Processing node: $node"
# Check if node has control-plane or master role
node_labels=$(kubectl get node "$node" -o jsonpath='{.metadata.labels}' 2>/dev/null || echo '')
if echo "$node_labels" | grep -E 'control-plane|master' > /dev/null 2>&1; then
echo "$node is a control plane node, skipping"
else
echo "$node appears to be a worker node"
label_node_with_retry "$node"
fi
fi
done
echo 'First labeling pass completed'
7. Second Labeling Pass Block. Catches any nodes that were missed in the first pass.
- Late Joiners: Some nodes might have joined after the first pass
- Transient Failures: Network issues might have caused failures
- Find Unlabeled: Looks for nodes with <none> role
- Final Attempt: Tries to label any remaining unlabeled nodes
# SECOND LABELING PASS
# Check for any remaining unlabeled worker nodes
echo 'Checking for any remaining unlabeled nodes...'
unlabeled_nodes=$(kubectl get nodes --no-headers | grep '<none>' | awk '{print $1}' || true)
if [ -n "$unlabeled_nodes" ]; then
echo "Found unlabeled nodes: $unlabeled_nodes"
for node in $unlabeled_nodes; do
echo "Final attempt to label remaining node: $node"
label_node_with_retry "$node"
done
else
echo 'No unlabeled nodes found'
fi
8. Final Verification Block. Confirms the job was completed successfully.
- Show Results: Displays final cluster state
- Count Check: Counts how many nodes still have the <none> role
- Success/Failure: Exits with an error if any nodes remain unlabeled
- Log Information: Tells user where to find detailed logs
# FINAL VERIFICATION
echo 'Labeling process completed'
echo 'Final cluster state:'
kubectl get nodes -o wide 2>&1 | tee -a $LOG_FILE
# Check if any nodes still remain unlabeled
remaining_unlabeled=$(kubectl get nodes --no-headers | grep '<none>' | wc -l || echo '0')
if [ "$remaining_unlabeled" -gt 0 ]; then
echo "WARNING: $remaining_unlabeled node(s) still have no role assigned"
kubectl get nodes --no-headers | grep '<none>' 2>&1 | tee -a $LOG_FILE
exit 1
else
echo 'SUCCEEDED: All nodes have roles assigned'
fi
echo "Worker labeling process completed. Full log saved to: $LOG_FILE"
echo "To view the log later, run: sudo cat $LOG_FILE"
Before this script:
NAME STATUS ROLES AGE
master-node Ready master 5m
worker-1 Ready <none> 2m
worker-2 Ready <none> 2m
worker-3 Ready <none> 2m
After this script:
NAME STATUS ROLES AGE
master-node Ready master 5m
worker-1 Ready worker 2m
worker-2 Ready worker 2m
worker-3 Ready worker 2m
Why is labelling important?
- Visual Clarity: Makes it clear which nodes do what
- Management Tools: Some Kubernetes tools rely on proper labels
- Best Practices: Follows Kubernetes conventions
- Troubleshooting: Easier to identify node types when debugging
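Because the labels populate the ROLES column, they are also easy to act on in scripts. A small illustration, using captured sample output that mirrors the "after" table above so it runs without a live cluster (on a real cluster you would pipe in `kubectl get nodes --no-headers` instead):

```shell
# Sample `kubectl get nodes --no-headers` output, captured as text
sample='master-node Ready master 5m
worker-1 Ready worker 2m
worker-2 Ready worker 2m
worker-3 Ready worker 2m'

# Column 3 is ROLES; select only the worker nodes
workers=$(echo "$sample" | awk '$3 == "worker" {print $1}')
count=$(echo "$sample" | awk '$3 == "worker"' | grep -c .)

echo "worker nodes:"
echo "$workers"
echo "worker count: $count"   # prints: worker count: 3
```

Against a live cluster, the same selection is more direct with a label selector, e.g. `kubectl get nodes -l node-role.kubernetes.io/worker=worker`.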
Common Functions Script
This script contains utility functions for logging and error handling that are used by all the other Kubernetes setup scripts.
1. log_step() Function. Records successful steps and displays progress.
- Takes 2 inputs: Step number and message
- Dual Logging:
  - Writes to /var/log/k8s-install-success.txt (for a permanent record)
  - Displays on screen with a timestamp
- Format: [STEP_NUMBER] MESSAGE
log_step() {
local step="$1"
local message="$2"
echo "[$step] $message" | sudo tee -a /var/log/k8s-install-success.txt > /dev/null
echo "$(date '+%Y-%m-%d %H:%M:%S') [$step] $message"
}
Example:
log_step "5" "Kubernetes installed successfully"
Output:
2025-08-03 14:30:15 [5] Kubernetes installed successfully
2. log_error() Function. Records errors and displays them prominently.
- Takes 2 inputs: Step number and error message
- Dual Logging:
  - Writes to /var/log/k8s-install-error.txt (for error tracking)
  - Displays on screen with a timestamp, written to stderr (error output)
- Format: ERROR [STEP_NUMBER] MESSAGE
log_error() {
local step="$1"
local message="$2"
echo "ERROR [$step] $message" | sudo tee -a /var/log/k8s-install-error.txt > /dev/null
echo "$(date '+%Y-%m-%d %H:%M:%S') ERROR [$step] $message" >&2
}
Example:
log_error "3" "Failed to install Docker"
Output:
2025-08-03 14:30:15 ERROR [3] Failed to install Docker
3. check_command() Function. Automatically checks if the previous command failed and exits if it did.
- Takes 2 inputs: Step number and error message
- Checks Exit Code: Uses $? to see if the last command failed (non-zero exit code)
- Auto-Exit: If the command failed, logs the error and exits the entire script
check_command() {
if [ $? -ne 0 ]; then
log_error "$1" "$2"
exit 1
fi
}
Example:
apt-get update
check_command "1" "Failed to update packages"If apt-get update fails:
- Logs the error message
- Exits the script immediately with code 1
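To see the auto-exit behavior in isolation, here is a self-contained sketch: a stub log_error() (simplified to print to stderr only, without the /var/log write) plus the check_command() pattern, exercised in a subshell so the exit 1 stops only the subshell, not your terminal:

```shell
# Stub logger (simplified: stderr only, no /var/log write)
log_error() { echo "ERROR [$1] $2" >&2; }

# Same pattern as above: inspect the previous command's exit code
check_command() {
  if [ $? -ne 0 ]; then
    log_error "$1" "$2"
    exit 1
  fi
}

# Run a failing sequence in a subshell; the exit terminates the subshell
# before the final echo is reached
( false; check_command "1" "demo command failed"; echo "never reached" ) || status=$?
echo "subshell exited with: ${status:-0}"   # prints: subshell exited with: 1
```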
4. log_file() Function. Flexible logging that can write to files or just console.
- Takes 2 inputs: Message and optional log file path
- Conditional Logging:
- If no file specified: Just prints to console
- If file specified: Prints to both console AND file
- Uses tee: Shows output on screen while also writing to file
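The walkthrough describes log_file() without listing its body, so here is a minimal sketch that matches the behavior described above; the actual implementation in the repository may differ in its details:

```shell
# Sketch of log_file(): console-only when no file is given,
# console plus append-to-file (via tee) when a path is supplied
log_file() {
  local message="$1"
  local logfile="$2"   # optional second argument

  if [ -n "$logfile" ]; then
    # Show on screen AND append to the log file
    echo "$message" | tee -a "$logfile"
  else
    # Console only
    echo "$message"
  fi
}
```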
Example:
# Console only
log_file "Starting process"
# Console + file
log_file "Debug info" "/tmp/debug.log"
How These Functions Work Together:
During Normal Operation
# Script starts
log_step "1" "Starting installation" # Success log
apt-get update # Run command
check_command "1" "Package update failed" # Check if it worked
log_step "2" "Packages updated" # Success log
During Failure
# Script starts
log_step "1" "Starting installation" # Success log
some_failing_command # This command fails
check_command "1" "Command failed" # Detects failure, logs error, exits
# Script stops here - never reaches next step
After running scripts with these functions, you get:
/var/log/k8s-install-success.txt - Contains all successful steps
/var/log/k8s-install-error.txt - Contains any errors that occurred
Why This is Useful:
- Debugging: If installation fails, you can check the error log to see exactly what went wrong
- Progress Tracking: Success log shows how far the installation got
- Automation: Scripts can automatically stop when something fails
- Consistency: All scripts use the same logging format
- Auditing: Permanent record of what happened during installation
In essence: These functions create a “flight recorder” for your Kubernetes installation, tracking every step and automatically stopping if anything goes wrong.
The ${common_functions} placeholder you see at the top of the other scripts gets replaced with these functions, so every script has access to this logging toolkit.
Sample Output

Sample Healthy Registered Targets

For the full code, visit my GitHub repository: https://github.com/rinavillaruz/terraform-aws-kubernetes.
If you happen to finish reading the tutorial, thank you. I know it’s kind of lengthy 🙂
