Exploratory Data Analysis of Sales Data in SQL

This project showcases how I use SQL as an exploratory data analysis tool on a star-schema sales data mart built from CRM and ERP sources. Before building dashboards or advanced models, I wanted to understand the structure of the data, its coverage over time, and the basic behavior of customers and products. The analysis focuses on answering practical questions such as “How big is the business?”, “Who are the most valuable customers?”, and “Over what time horizon do we actually have trustworthy data?”.

1. Understanding the Data Structure

I started with a structural exploration of the warehouse using the INFORMATION_SCHEMA views. Listing all tables across the Bronze, Silver, and Gold schemas gave me a quick inventory of what is available and how the pipeline is organized. This step is useful not only for orientation, but also for validating that all expected objects were created correctly after the ETL processes.

After confirming the list of tables, I drilled into dim_customers to inspect its columns and data types. Understanding fields such as customer_key, birthdate, country, and gender helped me plan later analysis around customer demographics and segmentation. At this stage I was not computing any statistics yet, just making sure I fully understood how the data is modeled and where each business concept lives.

-- List all tables across the Bronze, Silver, and Gold schemas
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_SCHEMA, TABLE_NAME;

-- Inspect columns and data types of the main customer dimension
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'dim_customers';
Listing all tables in the warehouse
Exploring columns in dim_customers

2. Date Coverage and Customer Demographics

Once the structure was clear, I focused on time coverage, because every downstream analysis depends on knowing “from when to when” the data is reliable. By finding the first and last order dates in fact_sales and computing the range in months, I could see exactly how many years of sales history were available. This helps prevent mistakes like drawing conclusions about seasonality when only a short window of data actually exists.
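The date-coverage check can be sketched roughly as follows; the order_date column name is taken from fact_sales as described above, and the DATEDIFF syntax assumes a SQL Server-style dialect:

-- First and last order dates, and the span of history in months
SELECT
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date,
    DATEDIFF(MONTH, MIN(order_date), MAX(order_date)) AS order_range_months
FROM fact_sales;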

In parallel, I examined basic customer demographics by finding the youngest and oldest customers in dim_customers. The age span tells me whether the dataset is focused on a specific life stage, such as young adults, or whether it covers a broader population. That context matters when interpreting future insights about purchasing patterns, average order value, or responsiveness to campaigns across age groups.
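A minimal sketch of the demographics query, using the birthdate column mentioned earlier and assuming a SQL Server-style GETDATE()/DATEDIFF dialect:

-- Oldest and youngest customers, with approximate ages
SELECT
    MIN(birthdate) AS oldest_birthdate,
    DATEDIFF(YEAR, MIN(birthdate), GETDATE()) AS oldest_age,
    MAX(birthdate) AS youngest_birthdate,
    DATEDIFF(YEAR, MAX(birthdate), GETDATE()) AS youngest_age
FROM dim_customers;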

First/last order dates and age range of customers

3. High‑Level Business Metrics

After understanding the temporal and demographic context, I wanted a concise view of the overall business scale. Instead of running separate queries for every metric, I combined several aggregations into one compact “KPI table” using UNION ALL. This report summarizes total sales, total quantity sold, number of distinct orders, number of active products, and total number of customers present in the warehouse.
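The KPI table follows the pattern below; exact measure columns such as quantity, order_number, and product_key are assumptions about the fact_sales schema based on the metrics listed above:

-- Consolidated KPI report built with UNION ALL
SELECT 'Total Sales' AS measure_name, SUM(sales_amount) AS measure_value FROM fact_sales
UNION ALL
SELECT 'Total Quantity', SUM(quantity) FROM fact_sales
UNION ALL
SELECT 'Total Orders', COUNT(DISTINCT order_number) FROM fact_sales
UNION ALL
SELECT 'Total Products', COUNT(DISTINCT product_key) FROM fact_sales
UNION ALL
SELECT 'Total Customers', COUNT(customer_key) FROM dim_customers;

Because each branch returns one labeled row, the result reads top to bottom like the header strip of a dashboard.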

Having these metrics in a single result set is very convenient when sharing early findings with non‑technical stakeholders. It reads almost like the top row of a dashboard, but is generated directly in SQL. It also serves as a quick sanity check for the data engineering pipeline: if the numbers are unexpectedly low or high, that may indicate issues in the ETL or filters applied upstream.

Summary table of key business measures

4. Customer Revenue Distribution

A core EDA question for any commerce dataset is how revenue is distributed across customers. To answer this, I joined fact_sales with dim_customers and summed sales_amount per customer, then ordered the results by total revenue. This query immediately reveals whether the business depends on a small group of high‑value customers or whether revenue is evenly spread across the customer base.
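The join-and-aggregate step looks roughly like this; customer_key and sales_amount come from the schema described earlier, while the name columns (first_name, last_name) are assumed for illustration:

-- Total revenue per customer, highest first
SELECT
    c.customer_key,
    c.first_name,
    c.last_name,
    SUM(f.sales_amount) AS total_revenue
FROM fact_sales AS f
LEFT JOIN dim_customers AS c
    ON c.customer_key = f.customer_key
GROUP BY c.customer_key, c.first_name, c.last_name
ORDER BY total_revenue DESC;

A LEFT JOIN from the fact table keeps sales rows even if a customer record is missing, which itself is a useful data-quality signal during EDA.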

The output can be exported to a spreadsheet or visualization tool to build Pareto charts or cumulative distribution plots, but even in raw table form it already provides powerful insights. Seeing actual names and keys attached to the highest‑revenue rows makes it easier to think about next steps, such as creating VIP segments, designing tailored campaigns, or examining churn risk among the most valuable customers.

Top customers ranked by total revenue

5. Key Insights from the SQL EDA

Overall, the SQL‑based exploratory analysis confirmed that the data warehouse is structurally sound and analytically rich. I verified that the star schema is complete, that sales data spans a meaningful time window, and that the customer base is large enough to support segmentation and behavioral analysis. The quick KPI table provides a baseline understanding of business scale before any dashboards are even opened.

The customer revenue exploration showed that a relatively small set of customers contributes a substantial portion of total sales, suggesting clear opportunities for VIP programs and retention strategies. These EDA findings directly influenced how I designed the Gold‑layer customer report and Tableau dashboards in the broader data warehouse project, ensuring that visualizations and metrics focus on what is actually most important in the underlying data.