Customer Segmentation using K-Means Clustering in R

OVERVIEW

This project applies unsupervised machine learning to analyze and group mall customers based on their purchasing behavior. The dataset includes gender, age, annual income, and spending score, and the project uses these features to identify customer segments that share similar characteristics. The K-Means clustering algorithm uncovers patterns in unlabelled data, helping businesses understand their customers and make data-driven marketing decisions. The analysis covers data visualization, feature selection, and choosing the number of clusters with the Elbow Method. Each cluster represents a customer group (for example, high-income luxury shoppers or low-income budget-conscious customers), so companies can tailor their marketing strategies to each segment. Developed in RStudio using libraries like ggplot2, dplyr, and factoextra, this project demonstrates how data analytics and visualization can turn raw customer data into actionable business insights.

1. Introduction

Customer segmentation is a fundamental data science technique that helps businesses understand and categorize their customers based on common characteristics. Instead of treating every customer the same, segmentation allows companies to design targeted marketing strategies, personalized offers, and optimized business decisions.

In this project, we use unsupervised machine learning — specifically the K-Means Clustering algorithm — to segment mall customers based on their demographic and behavioral data such as gender, age, annual income, and spending score. The implementation is done in R, leveraging visualization and clustering libraries to derive clear, interpretable insights from raw data.

2. Objective

The main objectives of this project are to:

- Identify distinct groups (clusters) of mall customers based on their spending behavior and income levels.
- Understand customer patterns to help businesses target specific groups more effectively.
- Demonstrate how unsupervised learning can be used in marketing and business analytics.

3. Tools and Technologies

Programming Language: R
Software Environment: VS Code
Operating System: Windows 7/8/10
Libraries Used:
- ggplot2 – Data visualization
- dplyr – Data manipulation
- factoextra, NbClust – Cluster evaluation and visualization
- purrr – Functional operations for repeated calculations
- cluster, plotrix, gridExtra – Supporting libraries for clustering and plotting
Dataset: Customer Dataset.csv
This dataset contains customer information including:
- CustomerID
- Gender
- Age
- Annual Income (in $000s)
- Spending Score (1–100)

4. Methodology

Step 1: Data Understanding and Preparation

We begin by importing the Customer Dataset.csv dataset and exploring its structure. Basic descriptive statistics and visualizations (bar charts and pie charts) are used to understand gender distribution, income range, and spending behavior.

- A bar chart shows how many male and female customers are present.
- A pie chart displays the percentage distribution of genders across the dataset.

These visualizations help identify demographic balance and potential biases in the data.
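
The exploration step can be sketched in base R as follows. The small data frame is a made-up stand-in for Customer Dataset.csv (its values are invented, not taken from the real file), and its column names assume what read.csv() would produce from the CSV headers, matching the names used in the plotting code in Step 4.

```r
# Hypothetical stand-in for Customer Dataset.csv; values are invented for illustration.
customer_data <- data.frame(
  CustomerID = 1:8,
  Gender = c("Male", "Female", "Female", "Male", "Female", "Female", "Male", "Female"),
  Age = c(19, 21, 35, 52, 44, 30, 61, 27),
  Annual.Income..k.. = c(15, 16, 40, 42, 78, 80, 99, 103),
  Spending.Score..1.100. = c(39, 81, 50, 47, 15, 88, 20, 90)
)

# Structure and descriptive statistics of the numeric columns
str(customer_data)
summary(customer_data[, c("Age", "Annual.Income..k..", "Spending.Score..1.100.")])

# Gender counts and percentage shares
gender_counts <- table(customer_data$Gender)
gender_pct <- round(100 * prop.table(gender_counts), 1)

pdf(NULL)  # discard plots when run non-interactively; remove this line to view them
barplot(gender_counts, main = "Gender Distribution", xlab = "Gender", ylab = "Count")
pie(gender_counts,
    labels = paste0(names(gender_counts), " (", gender_pct, "%)"),
    main = "Gender Share")
invisible(dev.off())
```

With the real data, the data.frame() call would be replaced by reading the CSV file; the rest of the sketch would stay the same as long as the column names match.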

Step 2: Selecting Features

The clustering process focuses on numerical attributes that influence purchasing decisions — specifically:

- Annual Income
- Spending Score

These two variables are chosen because they strongly represent customer buying power and shopping behavior.
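
A minimal sketch of this selection, assuming the column names that read.csv() would generate for the income and spending score headers. The data frame is again an invented stand-in, and the scaling step at the end is an optional extra that the project description does not mention.

```r
# Hypothetical stand-in for Customer Dataset.csv; values are invented for illustration.
customer_data <- data.frame(
  CustomerID = 1:6,
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  Age = c(19, 21, 35, 52, 44, 61),
  Annual.Income..k.. = c(15, 16, 40, 42, 78, 99),
  Spending.Score..1.100. = c(39, 81, 50, 47, 15, 20)
)

# Keep only the two clustering features, selected by name rather than by position
feature_cols <- c("Annual.Income..k..", "Spending.Score..1.100.")
features <- customer_data[, feature_cols]

# K-means needs numeric input with no missing values
stopifnot(all(sapply(features, is.numeric)), !anyNA(features))

# Optional (an assumption, not part of the described workflow):
# standardize both features to mean 0 and standard deviation 1
features_scaled <- scale(features)
```

Selecting by name keeps the code correct even if the column order in the file changes. Scaling matters mainly when features are measured on very different ranges, because K-Means relies on Euclidean distance.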

Step 3: K-Means Clustering Algorithm

The K-Means algorithm is used to divide the dataset into k clusters based on similarity. The steps are as follows:

- Choose the number of clusters (k).
- Randomly select k initial centroids.
- Assign each customer to the nearest centroid based on Euclidean distance.
- Recalculate the centroids as the mean position of all points in each cluster.
- Repeat the process until the cluster assignments no longer change (convergence).
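
The loop described above can be written from scratch in a few lines. This is only an illustrative sketch: the function name simple_kmeans and the toy data are hypothetical, and the project itself relies on R's built-in kmeans() rather than on code like this.

```r
# From-scratch sketch of the K-Means loop (hypothetical helper, for illustration only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialization: pick k distinct rows at random as the starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Assignment: squared Euclidean distance from every point to every centroid
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    new_assignment <- apply(dists, 1, which.min)

    # Convergence check: stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Update: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Toy data: two well-separated groups (values invented)
set.seed(42)
toy <- rbind(
  cbind(income = c(15, 16, 17, 18), score = c(10, 12, 11, 13)),
  cbind(income = c(90, 92, 95, 97), score = c(85, 88, 90, 86))
)
fit <- simple_kmeans(toy, k = 2)
fit$cluster
fit$centers
```

On this toy data the first four points end up in one cluster and the last four in the other. Which group receives label 1 depends on the random initialization.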

To determine the optimal number of clusters, the Elbow Method is applied:

- Compute the total within-cluster sum of squares (WSS) for different values of k (from 1 to 10).
- Plot k versus WSS.
- The “elbow point” (where the curve bends) indicates a good number of clusters: beyond it, adding more clusters reduces WSS only slightly.
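
To make the WSS quantity concrete, the sketch below computes it by hand on a tiny invented dataset and compares it with the tot.withinss value that kmeans() reports. The variable names and the toy values are assumptions made for illustration.

```r
# Toy data: two features, six points (values invented for illustration)
toy <- cbind(income = c(15, 17, 19, 80, 84, 88),
             score  = c(20, 22, 18, 75, 80, 85))

set.seed(123)
fit <- kmeans(toy, centers = 2, nstart = 10)

# WSS by hand: squared distance from each point to its own cluster center, summed
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
wss_manual <- sum((toy - assigned_centers)^2)

# The hand-computed value should agree with the one kmeans() reports
all.equal(wss_manual, fit$tot.withinss)
```

WSS always decreases as k grows, because each point gets closer to some center. That is why the Elbow Method looks for the bend in the curve instead of the minimum.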

Example Code:

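library(purrr)

# Load the data (assumes the file is in the working directory);
# read.csv() converts the header names to syntactic R names
customer_data <- read.csv("Customer Dataset.csv")

set.seed(123)  # make the random centroid starts reproducible

# Total within-cluster sum of squares for a given k.
# Given the column order listed in Section 3, columns 3:5 are Age, Annual Income,
# and Spending Score; use 4:5 to cluster on the two features from Step 2 only.
iss <- function(k) {
  kmeans(customer_data[, 3:5], k, iter.max = 100, nstart = 100, algorithm = "Lloyd")$tot.withinss
}

k.values <- 1:10
iss_values <- map_dbl(k.values, iss)

# Elbow plot: look for the k where the curve starts to flatten
plot(k.values, iss_values, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters (K)", ylab = "Total Within-Cluster Sum of Squares")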

Step 4: Cluster Visualization

Once the optimal number of clusters (e.g., k = 5 or k = 6) is determined, we visualize the groups using ggplot2:

- Each point represents a customer.
- The x-axis is Annual Income, and the y-axis is Spending Score.
- Different colors represent different clusters.

Example Code:

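library(ggplot2)

# Fit k = 6 clusters (assumed settings: same columns as the elbow code, nstart = 50)
set.seed(1)
k6 <- kmeans(customer_data[, 3:5], 6, iter.max = 100, nstart = 50, algorithm = "Lloyd")

ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
  geom_point(aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name = "Customer Segments",
                       labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6")) +
  ggtitle("Mall Customer Segments", subtitle = "Using K-Means Clustering")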

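This visualization helps interpret the customer segments. Cluster numbers are arbitrary labels assigned by kmeans() and can change between runs, so the mapping below is illustrative: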

- Cluster 1: Low income, low spending → Budget-conscious customers
- Cluster 2: High income, high spending → Luxury shoppers
- Cluster 3: Moderate income, high spending → Impulsive buyers
- Cluster 4: Low income, high spending → Value-seeking customers
- Cluster 5: High income, low spending → Cautious spenders
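
Labels like these are usually assigned by looking at each cluster's average income and spending score. The sketch below shows one way to build such a profile table. The data is an invented stand-in with three obvious groups, and the variable names fit and profile are hypothetical.

```r
# Hypothetical stand-in data: three well-separated groups (values invented)
customer_data <- data.frame(
  Annual.Income..k.. = c(15, 18, 20, 17, 85, 88, 90, 87, 86, 89, 91, 84),
  Spending.Score..1.100. = c(10, 14, 12, 15, 90, 85, 92, 88, 12, 15, 10, 18)
)

set.seed(1)
fit <- kmeans(customer_data, centers = 3, nstart = 25)

# Mean income and spending score per cluster, plus the number of customers in each
profile <- aggregate(customer_data, by = list(cluster = fit$cluster), FUN = mean)
profile$size <- as.vector(table(fit$cluster))
profile
```

Reading the rows of the profile table (for example, high mean income with low mean spending score) gives the basis for a descriptive name such as "cautious spenders".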

5. Results and Insights

After performing clustering, we successfully identified multiple distinct customer groups with similar income and spending habits.
Key findings include:

- Clear segmentation of customer behavior based on income and expenditure levels.
- Insights that can guide personalized marketing, e.g., offering loyalty rewards to high spenders or discounts to cautious shoppers.
- Clusters that reveal potential areas for business strategy improvement, such as under-engaged customer groups.

6. Conclusion

This project demonstrates the practical application of unsupervised machine learning in business analytics.
By implementing K-Means clustering in R, we effectively segmented mall customers into meaningful categories based on spending behavior and income.

These insights can be directly applied to:

- Optimize marketing strategies
- Increase customer retention
- Improve profitability through data-driven decisions

Overall, this project showcases how machine learning can turn simple customer data into actionable business intelligence.
