This is a publicly available sales transaction dataset from a UK-based e-commerce (online retail) shop, covering one year.
The shop has been selling gifts and homewares for adults and children through its website since 2007.
This project analyzes one year of UK e-commerce transactions to uncover:
- Customer behavior ➡️ customer segmentation based on Recency-Frequency-Monetary (RFM)
- Product trends ➡️ market basket analysis
- Opportunities for business growth ➡️ cohort analysis
This analysis reveals key drivers of revenue, retention, and product associations.
- 🔍 Dataset - Data source
- 🧹 Data Preprocessing - Cleaning and feature engineering
- 📊 Exploratory Data Analysis - Key insights and patterns
- 👥 Customer Segmentation - RFM Analysis and Pareto Principle
- 🧩 Customer Retention Analysis - Cohort analysis
- 🛒 Market Basket Analysis - Product association rules and cross-selling
- 💡 Business Recommendations - Actionable insights and strategies
- 🚀 Future Enhancements - Next steps and improvements
- 🔁 Reproducibility - Install dependencies
I used the Kaggle API (Program > kaggle_data_download.py) to download the dataset automatically.
See the Reproducibility section for details on setting up the Kaggle API.
More info about data is available at the following link.
- Data Cleaning: Filtered cancelled transactions (~1.6%), removed duplicates (<1%), handled missing values
- Feature Engineering: Created revenue calculations (+ log-transformed), date components and metrics for visualisations
- Category Creation: Assigned product categories using regex patterns
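The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the field names, the "C" prefix for cancelled transactions, the rule of dropping rows without a customer number, the use of `log1p` for the log transform, and the category patterns are all assumptions made for the example.

```python
import math
import re

# Hypothetical category patterns; the project's real regexes are not shown here.
CATEGORY_PATTERNS = {
    "Bags": re.compile(r"\bbags?\b", re.IGNORECASE),
    "Tableware": re.compile(r"\b(teacup|saucer|plate|mug)s?\b", re.IGNORECASE),
    "Garden": re.compile(r"\b(herb|garden)\b", re.IGNORECASE),
}


def categorize(product_name):
    """Return the first category whose pattern matches, else 'Other'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(product_name):
            return category
    return "Other"


def clean(rows):
    """Drop cancelled, incomplete, and duplicate rows, then add derived fields."""
    seen = set()
    cleaned = []
    for row in rows:
        # Assumption: cancelled transactions carry a "C" prefix in their number.
        if str(row["transaction_no"]).startswith("C"):
            continue
        # Assumption: rows without a customer number are dropped.
        if row["customer_no"] is None:
            continue
        key = tuple(sorted(row.items()))
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        revenue = row["price"] * row["quantity"]
        cleaned.append({
            **row,
            "revenue": revenue,
            "log_revenue": math.log1p(revenue),  # log(1 + revenue)
            "category": categorize(row["product"]),
        })
    return cleaned


# Made-up sample rows: one valid, one exact duplicate, one cancelled, one incomplete.
valid = {"transaction_no": "536365", "product": "Jumbo Bag Red Retrospot",
         "price": 2.5, "quantity": 6, "customer_no": "17850"}
rows = [
    valid,
    dict(valid),
    {"transaction_no": "C536379", "product": "Herb Marker Thyme",
     "price": 0.5, "quantity": -2, "customer_no": "14527"},
    {"transaction_no": "536370", "product": "Regency Tea Plate Pink",
     "price": 1.5, "quantity": 12, "customer_no": None},
]
cleaned = clean(rows)
```

Only the first sample row survives the filters; the other three illustrate each removal rule in turn.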
- Recorded period: 2018-12-01 to 2019-11-30 (Removed 2019-12 because the dataset did not contain the whole month)
- Total Revenue: £60,269,235
- Total Customers: 4,671
- Geographic Concentration: UK represents 83% of total revenue
- Seasonal Patterns: September, October, and November show peak revenue performance
- Customer Behavior: Friday/Sunday emerge as the highest revenue days
- Transaction Volume: Average of 278 items per transaction indicates bulk purchasing
- Most customers buy in bulk (70.6% of customers buy more than 100 products per order on average)
- Retail Customers: 1,373 (AVG Quantity Per Transaction <= 100)
- Bulk Customers: 3,298 (AVG Quantity Per Transaction > 100)
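The retail/bulk split can be illustrated with a short sketch. The threshold of 100 comes from the figures above; the function name and the `(customer_id, transaction_id, quantity)` input shape are assumptions for the example.

```python
from collections import defaultdict

BULK_THRESHOLD = 100  # average quantity per transaction, as stated above


def classify_customers(lines):
    """Label each customer 'Bulk' or 'Retail' by average quantity per transaction.

    lines: iterable of (customer_id, transaction_id, quantity) order lines.
    """
    quantities = defaultdict(lambda: defaultdict(int))
    for customer, transaction, quantity in lines:
        quantities[customer][transaction] += quantity

    labels = {}
    for customer, transactions in quantities.items():
        average = sum(transactions.values()) / len(transactions)
        labels[customer] = "Bulk" if average > BULK_THRESHOLD else "Retail"
    return labels
```

A customer whose average is exactly 100 falls on the retail side, matching the `<= 100` rule above.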

Customers were segmented using RFM analysis (Recency, Frequency, Monetary), with each variable divided into 5 equal-sized parts (quintiles) to identify distinct behavioral groups:
- VIP Customers: High-value, frequent, recent purchasers ➡️ Those in the top two quintiles (4th and 5th)
- Loyal Customers: Consistent engagement and spending
- At Risk: Previously valuable customers showing declining activity
- New Customers: Recent acquisitions
- Pareto Principle: VIP customers alone (20%) generate ~61% of total revenue
- Top 3 segments generate 80.7% of total revenue
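As a rough sketch of the quintile scoring described above, the snippet below computes R, F, and M scores from 1 to 5 in plain Python. The rank-based quintile rule, the input shape, and the reading of "VIP" as "all three scores are 4 or 5" are assumptions; the rules for the other segments are not specified here, so they are left out.

```python
from datetime import date


def quintile_scores(values, higher_is_better=True):
    """Map each customer to a 1-5 score using rank-based quintiles.

    Ties are broken by sort order, which is a simplification.
    """
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ordered)
    return {customer: 1 + (rank * 5) // n for rank, customer in enumerate(ordered)}


def rfm_table(orders, snapshot):
    """orders: list of (customer_id, order_date, revenue), one tuple per order."""
    last_order, frequency, monetary = {}, {}, {}
    for customer, day, revenue in orders:
        last_order[customer] = max(last_order.get(customer, day), day)
        frequency[customer] = frequency.get(customer, 0) + 1
        monetary[customer] = monetary.get(customer, 0.0) + revenue

    recency = {c: (snapshot - d).days for c, d in last_order.items()}
    r = quintile_scores(recency, higher_is_better=False)  # fewer days is better
    f = quintile_scores(frequency)
    m = quintile_scores(monetary)
    return {c: (r[c], f[c], m[c]) for c in recency}


def is_vip(scores):
    """Assumed reading of the VIP rule: every score in the 4th or 5th quintile."""
    return all(score >= 4 for score in scores)


# Made-up orders: customer c{i} places i orders in month 2*i, each worth 50*i.
orders = [
    (f"c{i}", date(2019, 2 * i, day), 50.0 * i)
    for i in range(1, 6)
    for day in range(1, i + 1)
]
table = rfm_table(orders, snapshot=date(2019, 11, 30))
vips = sorted(c for c, scores in table.items() if is_vip(scores))
```

With real data, tied values (for example many customers with a single order) make equal-sized quintiles only approximate, so the tie-breaking rule is worth choosing deliberately.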

Cohort analysis groups customers by the month they first made a purchase and tracks how many return in later months.
This helps measure monthly customer retention and spot seasonal patterns.

The first month is excluded from the cohort analysis because the dataset covers only one full year. Including it would give incomplete retention results, since earlier activity and repeat purchases from before the dataset starts cannot be tracked.
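The cohort logic described above can be sketched in plain Python. This is an illustrative version only: the function names and the `(customer_id, (year, month))` input shape are assumptions, and the exclusion of the first cohort month is left to the caller.

```python
from collections import defaultdict


def month_index(year_month):
    """Convert (year, month) to a running month number for offset arithmetic."""
    year, month = year_month
    return year * 12 + month


def cohort_retention(purchases):
    """Return {cohort_month: {months_since_first_purchase: share_of_cohort_active}}.

    purchases: iterable of (customer_id, (year, month)) pairs.
    """
    first_month = {}
    active_months = defaultdict(set)
    for customer, year_month in purchases:
        active_months[customer].add(year_month)
        if customer not in first_month or year_month < first_month[customer]:
            first_month[customer] = year_month

    cohort_size = defaultdict(int)
    active_count = defaultdict(lambda: defaultdict(int))
    for customer, cohort in first_month.items():
        cohort_size[cohort] += 1
        for year_month in active_months[customer]:
            offset = month_index(year_month) - month_index(cohort)
            active_count[cohort][offset] += 1

    return {
        cohort: {offset: count / cohort_size[cohort]
                 for offset, count in sorted(offsets.items())}
        for cohort, offsets in active_count.items()
    }
```

Offset 0 is always 1.0 by construction; offset 1 corresponds to the first-month retention figure reported in this section.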
First Month Dropoff:
- Average first-month retention: 20.3%
Top 3 months for retention:
- September: 24.1%
- October: 23.3%
- April: 20.6%
Bottom 3 months for retention:
- March: 14.1%
- July: 14.7%
- June: 17.5%
- There is a large drop-off after the first month (79.7%), which probably relates to the fact that most customers buy in bulk.
- Customer retention shows a clear seasonal pattern:
  - Summer months (June, July) are among the worst for retention, possibly due to the holiday season.
  - Retention increases from September, likely with the start of the academic year.
Market basket analysis looks at customer purchase data to find out which products are often bought together.
- Support: measures how common the pairing of products A and B is across all transactions
- Confidence: measures how likely product B is to be bought when product A is bought
- Lift: measures how much product A boosts the chances of product B being bought, compared to random chance
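To make the three metrics concrete, here is a small sketch that computes them for a single rule A → B from a list of baskets. It uses the standard definitions (support = share of baskets containing both, confidence = support divided by the share containing A, lift = confidence divided by the share containing B); the function name and the toy baskets are made up for illustration.

```python
def rule_metrics(baskets, a, b):
    """Return (support, confidence, lift) for the rule a -> b.

    baskets: list of sets of product names, one set per transaction.
    """
    n = len(baskets)
    count_a = sum(1 for basket in baskets if a in basket)
    count_b = sum(1 for basket in baskets if b in basket)
    count_ab = sum(1 for basket in baskets if a in basket and b in basket)
    if count_a == 0 or count_b == 0:
        raise ValueError("both products must appear in at least one basket")

    support = count_ab / n              # P(A and B)
    confidence = count_ab / count_a     # P(B | A)
    lift = confidence / (count_b / n)   # P(B | A) / P(B)
    return support, confidence, lift


# Toy baskets, not taken from the dataset.
baskets = [
    {"Herb Marker Thyme", "Herb Marker Rosemary"},
    {"Herb Marker Thyme", "Herb Marker Rosemary", "Jumbo Bag Red Retrospot"},
    {"Jumbo Bag Red Retrospot"},
    {"Herb Marker Thyme"},
]
support, confidence, lift = rule_metrics(
    baskets, "Herb Marker Thyme", "Herb Marker Rosemary"
)
```

Lift is symmetric (A → B and B → A give the same value) while confidence is not, which is why mirrored rules in the tables below share a lift but differ slightly in confidence.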
Highlights
- Transactions with at least two items: 17,527 | 92.4% of total transactions
- Average basket size: 26.2 items
Top 5 Product Associations (by Lift):
- Herb Marker Thyme → Herb Marker Rosemary | Lift: 71.19 | Conf: 93.4%
- Herb Marker Rosemary → Herb Marker Thyme | Lift: 71.19 | Conf: 93.0%
- Herb Marker Chives → Herb Marker Parsley | Lift: 71.07 | Conf: 92.4%
- Herb Marker Chives → Herb Marker Thyme | Lift: 69.98 | Conf: 91.4%
- Regency Tea Plate Pink → Regency Tea Plate Green | Lift: 43.85 | Conf: 91.2%
Top 5 Product Associations (by Support):
- Jumbo Bag Pink Polkadot → Jumbo Bag Red Retrospot | Support: 4.69% | Count: 812
- Jumbo Bag Red Retrospot → Jumbo Bag Pink Polkadot | Support: 4.69% | Count: 812
- Green Regency Teacup And Saucer → Roses Regency Teacup And Saucer | Support: 4.26% | Count: 737
- Roses Regency Teacup And Saucer → Green Regency Teacup And Saucer | Support: 4.26% | Count: 737
- Jumbo Storage Bag Suki → Jumbo Bag Red Retrospot | Support: 4.05% | Count: 701
- Customer Engagement:
  - Launch targeted retention campaigns for "At Risk" customers
  - Offer exclusive deals or perks to VIP customers to encourage loyalty
- Product Bundling:
  - Develop and promote bundles featuring top-performing product pairs identified through market basket analysis
- Market Expansion:
  - Explore marketing strategies and tailored offers for high-potential European countries to diversify revenue streams
- Seasonal Planning:
  - Align inventory and promotions prior to peak sales periods (e.g., October for November)
  - Scale back marketing spend and inventory during low-activity summer months to control costs
- Advanced Category Creation: Create product categories using NLP
- Predictive Modeling: Predict customer value and churn
- Real-time Analytics: Add interactive dashboards for performance monitoring
- Advanced Segmentation: Machine learning algorithms for clustering or classification
- Time Series Models: Predict the next purchase of VIP customers to support strategic marketing
1. Clone the repository
git clone https://github.com/Papagiannopoulos/ecommerce-business-analytics.git
cd ecommerce-business-analytics
2. Create a fresh virtual env
uv venv
Note: If uv is not already installed, install it with the command for your platform.
- On macOS and Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
- On Windows (PowerShell):
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
uv sync
Note: Microsoft Visual C++ is required at this step. If sync crashes, follow the provided steps.
- Create a Kaggle account
- Go to Account Settings and click "Create New API Token" to download the kaggle.json file
- Navigate to C:\Users\<your_user_name> on your computer
- Create a new folder named .kaggle
- Move the downloaded kaggle.json file into the .kaggle folder
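Before running the download script, it can help to confirm that the token landed where the steps above put it. The check below is a small sketch that looks for `.kaggle/kaggle.json` under the user's home directory; the helper names are hypothetical and it does not validate the token's contents.

```python
from pathlib import Path


def kaggle_token_path(home=None):
    """Return the expected location of kaggle.json under the given home folder."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_token(home=None):
    """True if kaggle.json exists at the expected location."""
    return kaggle_token_path(home).is_file()


if __name__ == "__main__":
    if has_kaggle_token():
        print(f"Found token at {kaggle_token_path()}")
    else:
        print(f"No token found; expected it at {kaggle_token_path()}")
```

On Windows, `Path.home()` typically resolves to the same C:\Users\<your_user_name> folder used in the steps above.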






