The Customer Segmentation using K-Means Clustering in R project applies unsupervised machine learning to group mall customers by their purchasing behavior. Using a mall customer dataset with the features gender, age, annual income, and spending score, the project identifies customer segments that share similar characteristics. K-Means clustering uncovers patterns in the unlabelled data, helping businesses understand their customers and make data-driven marketing decisions. The analysis involves data visualization, feature selection, and choosing the number of clusters with the Elbow Method. Each identified cluster represents a customer group (for example, high-income luxury shoppers or low-income budget-conscious customers), which lets companies tailor their marketing strategies to each group. Developed in RStudio using libraries such as ggplot2, dplyr, and factoextra, the project shows how analytics and visualization can turn raw customer data into actionable business insights.

1. Introduction
Customer segmentation is a fundamental data science technique that helps businesses understand and categorize their customers based on common characteristics. Instead of treating every customer the same, segmentation allows companies to design targeted marketing strategies, personalized offers, and optimized business decisions.
In this project, we use unsupervised machine learning — specifically the K-Means Clustering algorithm — to segment mall customers based on their demographic and behavioral data such as gender, age, annual income, and spending score. The implementation is done in R, leveraging visualization and clustering libraries to derive clear, interpretable insights from raw data.
2. Objective
The main objectives of this project are to:
Identify distinct groups (clusters) of mall customers based on their spending behavior and income levels.
Understand customer patterns to help businesses target specific groups more effectively.
Demonstrate how unsupervised learning can be used in marketing and business analytics.
3. Tools and Technologies
Programming Language: R
Software Environment: VS Code
Operating System: Windows 7/8/10
Libraries Used:
ggplot2 – Data visualization
dplyr – Data manipulation
factoextra, NbClust – Cluster evaluation and visualization
purrr – Functional operations for repeated calculations
cluster, plotrix, gridExtra – Supporting libraries for clustering and plotting
Dataset:
Customer Dataset.csv
This dataset contains customer information including:
CustomerID
Gender
Age
Annual Income (in $000s)
Spending Score (1–100)
4. Methodology
Step 1: Data Understanding and Preparation
We begin by importing the Customer Dataset.csv dataset and exploring its structure. Basic descriptive statistics and visualizations (bar charts and pie charts) are used to understand gender distribution, income range, and spending behavior.
A bar chart shows how many male and female customers are present.
A pie chart displays the percentage distribution of genders across the dataset.
These visualizations help identify demographic balance and potential biases in the data.
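The two gender charts described above could be produced along the following lines. This is a minimal sketch using base R graphics; the small `customer_data` frame is a made-up stand-in for Customer Dataset.csv, and the column name `Gender` is assumed from the dataset description.

```r
# Hypothetical stand-in for Customer Dataset.csv (values are illustrative only)
customer_data <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Female",
             "Female", "Male", "Female", "Female", "Male")
)

# Count customers per gender
gender_counts <- table(customer_data$Gender)

# Bar chart of the counts
barplot(gender_counts,
        main = "Gender Comparison",
        xlab = "Gender", ylab = "Count",
        col = c("steelblue", "tomato"))

# Pie chart labelled with percentage shares
gender_pct <- round(100 * gender_counts / sum(gender_counts), 1)
pie(gender_counts,
    labels = paste0(names(gender_counts), " ", gender_pct, "%"),
    main = "Gender Share")
```

With the real data, `customer_data` would instead be loaded from the CSV file, and the rest of the sketch should apply unchanged as long as the column is named `Gender`.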
Step 2: Selecting Features
The clustering process focuses on numerical attributes that influence purchasing decisions — specifically:
Annual Income
Spending Score
These two variables are chosen because they strongly represent customer buying power and shopping behavior.
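As a rough illustration of this step, the two features can be pulled out by name before clustering. The rows below are made-up placeholders, and the column names `Annual.Income..k..` and `Spending.Score..1.100.` are assumed to match what `read.csv` produces for this dataset (they are the names used in the plotting code later in this document).

```r
# Hypothetical stand-in rows; the real data would be read from "Customer Dataset.csv"
customer_data <- data.frame(
  CustomerID = 1:6,
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  Age = c(19, 21, 35, 45, 31, 52),
  Annual.Income..k.. = c(15, 16, 54, 60, 78, 120),
  Spending.Score..1.100. = c(39, 81, 47, 52, 88, 16)
)

# Keep only the two numeric features used for clustering
features <- customer_data[, c("Annual.Income..k..", "Spending.Score..1.100.")]

# Quick look at the ranges before clustering
summary(features)
```

Selecting columns by name rather than by position makes it explicit which variables enter the distance calculation. If the two features were on very different scales, standardizing them with `scale()` might be worth considering, since K-Means relies on Euclidean distance.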
Step 3: K-Means Clustering Algorithm
The K-Means algorithm is used to divide the dataset into k clusters based on similarity. The steps are as follows:
Choose the number of clusters (k).
Randomly select k initial centroids.
Assign each customer to the nearest centroid based on Euclidean distance.
Recalculate the centroids as the mean position of all points in each cluster.
Repeat the assignment and update steps until the cluster assignments no longer change (convergence).
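The steps above can be sketched in a few lines of base R. This is an illustrative toy version, not the implementation behind R's built-in `kmeans()`; `simple_kmeans` is a hypothetical helper name, the data are synthetic, and the sketch assumes no cluster ever becomes empty.

```r
# Minimal K-Means sketch; simple_kmeans is a hypothetical helper, not a package function.
# It assumes no cluster becomes empty during the iterations.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialisation: pick k distinct rows as the starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Assignment: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-d, ties.method = "first")
    # Convergence: stop once the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Update: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Synthetic, well-separated points (illustrative only, not the mall dataset)
set.seed(42)
pts <- rbind(
  cbind(rnorm(20, 20, 2), rnorm(20, 20, 2)),
  cbind(rnorm(20, 80, 2), rnorm(20, 80, 2))
)
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

In practice the built-in `kmeans()` is preferable, since it handles multiple random starts (`nstart`) and edge cases that this sketch ignores.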
To determine the optimal number of clusters, the Elbow Method is applied:
Compute the total within-cluster sum of squares (WSS) for different values of k (from 1 to 10).
Plot k versus WSS.
The “elbow point”, where the curve bends and further increases in k reduce WSS only marginally, indicates a reasonable number of clusters, balancing goodness of fit against simplicity.
Example Code:
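library(purrr)

# Load the dataset described above
customer_data <- read.csv("Customer Dataset.csv")

set.seed(123)

# Total within-cluster sum of squares for a given k.
# Columns 3:5 are Age, Annual Income, and Spending Score;
# use columns 4:5 to cluster on income and spending score only.
iss <- function(k) {
  kmeans(customer_data[, 3:5], k, iter.max = 100, nstart = 100,
         algorithm = "Lloyd")$tot.withinss
}

k.values <- 1:10
iss_values <- map_dbl(k.values, iss)

# Elbow plot: k versus WSS
plot(k.values, iss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters (K)",
     ylab = "Total Within-Cluster Sum of Squares")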
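To make the WSS quantity concrete, the following sketch recomputes it by hand and compares it with the `tot.withinss` value returned by `kmeans()`. The data are synthetic and only serve to illustrate the calculation; they are not the mall dataset.

```r
# Synthetic two-feature data (illustrative only)
set.seed(123)
x <- rbind(
  cbind(rnorm(30, 25, 5), rnorm(30, 30, 5)),
  cbind(rnorm(30, 75, 5), rnorm(30, 70, 5))
)

fit <- kmeans(x, centers = 2, nstart = 10)

# Recompute WSS by hand: squared distance of each point to its own centroid
own_centroid <- fit$centers[fit$cluster, , drop = FALSE]
wss_manual <- sum((x - own_centroid)^2)

# The two values should agree up to floating-point error
all.equal(wss_manual, fit$tot.withinss)
```

Because WSS can only decrease or stay the same as k grows, the elbow plot is read for the point of diminishing returns rather than for a minimum.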
Step 4: Cluster Visualization
Once the optimal number of clusters (e.g., k = 5 or k = 6) is determined, we visualize the groups using ggplot2:
Each point represents a customer.
The x-axis is Annual Income, and the y-axis is Spending Score.
Different colors represent different clusters.
Example Code:
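library(ggplot2)

# Six-cluster fit (assumed to use the same columns and settings as the elbow computation)
set.seed(1)
k6 <- kmeans(customer_data[, 3:5], 6, iter.max = 100, nstart = 100,
             algorithm = "Lloyd")

# Scatter plot of income versus spending score, colored by cluster
ggplot(customer_data, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
  geom_point(aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name = "Customer Segments",
                       labels = c("Cluster 1", "Cluster 2", "Cluster 3",
                                  "Cluster 4", "Cluster 5", "Cluster 6")) +
  ggtitle("Mall Customer Segments", subtitle = "Using K-Means Clustering")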
This visualization helps interpret the customer segments. Because kmeans numbers clusters arbitrarily, the numbering can differ between runs; for example:
Cluster 1: Low income, low spending → Budget-conscious customers
Cluster 2: High income, high spending → Luxury shoppers
Cluster 3: Moderate income, high spending → Impulsive buyers
Cluster 4: Low income, high spending → Value-seeking customers
Cluster 5: High income, low spending → Cautious spenders
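Labels like these are usually assigned by looking at each cluster's average income and spending score. A minimal sketch of such a profile table is shown below; it uses synthetic data with the column names from the plotting code, and three clusters purely to keep the example small.

```r
# Synthetic stand-in with the column names used in the plotting code (illustrative only)
set.seed(7)
customer_data <- data.frame(
  Annual.Income..k.. = c(rnorm(15, 25, 3), rnorm(15, 85, 3), rnorm(15, 55, 3)),
  Spending.Score..1.100. = c(rnorm(15, 20, 3), rnorm(15, 80, 3), rnorm(15, 50, 3))
)

fit <- kmeans(customer_data, centers = 3, nstart = 25)

# Mean income and spending score per cluster
profile <- aggregate(customer_data,
                     by = list(cluster = fit$cluster),
                     FUN = mean)

# Number of customers in each cluster
profile$size <- as.vector(table(fit$cluster))
profile
```

Reading the rows of `profile` side by side (for example, high mean income with low mean spending score) is what supports a descriptive label for each segment.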
5. Results and Insights
After performing clustering, we successfully identified multiple distinct customer groups with similar income and spending habits.
Key findings include:
Clear segmentation of customer behavior based on income and expenditure levels.
Insights that can guide personalized marketing — e.g., offering loyalty rewards to high spenders or discounts to cautious shoppers.
Clusters that reveal potential areas for business strategy improvement, such as under-engaged customer groups.
6. Conclusion
This project demonstrates the practical application of unsupervised machine learning in business analytics.
By implementing K-Means clustering in R, we effectively segmented mall customers into meaningful categories based on spending behavior and income.
These insights can be directly applied to:
Optimize marketing strategies
Increase customer retention
Improve profitability through data-driven decisions
Overall, this project showcases how machine learning can turn simple customer data into actionable business intelligence.