Introduction to Preppin' Data Challenges
Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.
In this post, we'll walk through a Python-based solution for the fourth Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.
Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-4.html
You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges
📊 What Does the Script Do?
- Read data from multiple sheets in an Excel file.
- Combine data from different stores.
- Reshape the data by pivoting columns.
- Extract new fields like "Customer Type" and "Product."
- Summarise sales by store and quarter.
- Compare actual sales to targets.
- Rank stores based on their performance.
- Export the cleaned data to a CSV file.
Let's dive into the code and break it down!
1. Importing Libraries
```python
import os
import pandas as pd
import numpy as np
from pandas import concat, ExcelFile, melt, read_csv
```

We start by importing essential libraries:
- os: For handling file paths.
- pandas: The powerhouse for data manipulation.
- concat, ExcelFile, melt, read_csv: Specific pandas functions to manipulate data efficiently.
2. Load the Excel File
```python
file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 4 Input.xlsx"
xls = pd.ExcelFile(file_path)
```

We define the file path and use pd.ExcelFile() to load the Excel workbook. This allows us to interact with multiple sheets easily.
3. Combine Data from All Sheets
```python
dfIn = None
for sheet in [a for a in xls.sheet_names if a != 'Targets']:
    dfNew = xls.parse(sheet)
    dfNew['Store'] = sheet
    dfIn = dfNew if dfIn is None else concat([dfIn, dfNew], ignore_index=True)
```

What's Happening Here?
- Loop through all sheets (except the 'Targets' sheet): instead of writing every sheet out manually, the for loop picks up each store's sheet automatically.
- List comprehension: [a for a in xls.sheet_names if a != 'Targets'] generates a list of sheet names excluding 'Targets'.
- Parse each sheet: xls.parse(sheet) reads the contents of the current sheet into a DataFrame.
- Add a Store column: dfNew['Store'] = sheet creates a new column that records the store name (based on the sheet name).
- Concatenate DataFrames: if dfIn is None (on the first iteration), it is assigned dfNew. Otherwise, pd.concat() appends the current DataFrame to the existing one.
This combines sales data from multiple stores into a single DataFrame.
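The same loop-and-concatenate pattern can be sketched with in-memory DataFrames standing in for the Excel sheets (the store names and figures here are made up for illustration, not the challenge's real data):

```python
import pandas as pd

# Hypothetical stand-ins for the per-store sheets in the workbook
sheets = {
    'Manchester': pd.DataFrame({'Date': ['2021-01-04'], 'New - Saddles': [10]}),
    'London': pd.DataFrame({'Date': ['2021-01-04'], 'New - Saddles': [7]}),
}

dfIn = None
for store, dfNew in sheets.items():
    dfNew = dfNew.copy()
    dfNew['Store'] = store  # record which "sheet" each row came from
    dfIn = dfNew if dfIn is None else pd.concat([dfIn, dfNew], ignore_index=True)

print(dfIn)
```

Because ignore_index=True rebuilds the row index, the combined DataFrame has a clean 0..n-1 index regardless of how many sheets were stacked.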
4. Reshape the Data (Pivot the Columns)
```python
o1 = dfIn.melt(id_vars=['Date', 'Store'],
               var_name='Customer Type - Product',
               value_name='Values')
o1.rename(columns={'Values': 'Products Sold'}, inplace=True)
```

What Does melt() Do?
It un-pivots the data from wide to long format. This makes analysis easier by converting the measure column headers (e.g. 'New - Saddles', 'Existing - Saddles') into rows.
We also rename the new measure column to 'Products Sold'.
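Here is a minimal wide-to-long example of the same melt() call, using one made-up row of store data:

```python
import pandas as pd

# One row in wide format: each measure lives in its own column
wide = pd.DataFrame({
    'Date': ['2021-01-04'],
    'Store': ['Manchester'],
    'New - Saddles': [10],
    'Existing - Saddles': [5],
})

# Un-pivot: the two measure columns become two rows
long_df = wide.melt(id_vars=['Date', 'Store'],
                    var_name='Customer Type - Product',
                    value_name='Products Sold')
print(long_df)
```

One wide row with two measure columns becomes two long rows, each keeping its Date and Store identifiers.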
5. Split Columns into Multiple Fields
```python
o1[['Customer Type', 'Product']] = o1['Customer Type - Product'].str.split(' - ', expand=True)
```

We split the combined column into two separate fields:
- Customer Type: New vs. Existing customers.
- Product: Type of product sold.
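A small standalone example of str.split() with expand=True (the product names here are illustrative):

```python
import pandas as pd

s = pd.Series(['New - Saddles', 'Existing - Mudguards'])

# expand=True returns a DataFrame with one column per split part,
# instead of a single column of lists
parts = s.str.split(' - ', expand=True)
parts.columns = ['Customer Type', 'Product']
print(parts)
```

Without expand=True, str.split() would return a Series of Python lists, which could not be assigned to two columns in one step.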
6. Extract the Quarter from the Date
```python
o1['Date'] = pd.to_datetime(o1['Date'], format="%Y/%m/%d")
o1['Quarter'] = o1['Date'].dt.quarter
```

- Convert the 'Date' column to datetime format.
- Extract the quarter using .dt.quarter for grouping by time periods.
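The .dt.quarter accessor maps each date to 1-4 based on its month, as this quick sketch with sample dates shows:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2021-01-15', '2021-05-02', '2021-11-30']))

# January -> Q1, May -> Q2, November -> Q4
quarters = dates.dt.quarter
print(quarters.tolist())
```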
7. Summarize the Data
```python
output_1 = o1.groupby(
    ['Store', 'Quarter']
).agg(
    Products_Sold=('Products Sold', 'sum'),
).reset_index()
```

We group by Store and Quarter and sum the 'Products Sold'.
reset_index() moves the group keys ('Store' and 'Quarter') out of the index and back into ordinary columns, so the result is a flat DataFrame ready for joining.
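The named-aggregation syntax used above can be tried on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Store': ['A', 'A', 'B'],
    'Quarter': [1, 1, 1],
    'Products Sold': [10, 5, 8],
})

# Named aggregation: Products_Sold=(source column, aggregation function)
summary = df.groupby(['Store', 'Quarter']).agg(
    Products_Sold=('Products Sold', 'sum'),
).reset_index()
print(summary)
```

Named aggregation also lets us rename the column in the same step, swapping the space for an underscore.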
8. Compare to Targets
```python
target = pd.read_excel(file_path, sheet_name='Targets')
joined = output_1.merge(target, how='left', on=['Quarter', 'Store'])
```

- Load the 'Targets' sheet.
- Merge actual sales with targets using a left join on 'Quarter' and 'Store'.
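A left join keeps every actuals row and attaches the matching target where one exists; here is the same merge on two toy tables (store names and figures invented for illustration):

```python
import pandas as pd

actuals = pd.DataFrame({'Store': ['A', 'B'], 'Quarter': [1, 1],
                        'Products_Sold': [15, 8]})
targets = pd.DataFrame({'Store': ['A', 'B'], 'Quarter': [1, 1],
                        'Target': [12, 10]})

# Left join: all rows of actuals survive; Target is filled in by key match
joined = actuals.merge(targets, how='left', on=['Quarter', 'Store'])
print(joined)
```

If a Store/Quarter pair had no target, its Target value would simply come through as NaN rather than dropping the row.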
9. Calculate Variance and Rank Stores
```python
joined['Variance to Target'] = joined['Products_Sold'] - joined['Target']
joined['Rank'] = joined.groupby('Quarter')['Variance to Target'].rank(ascending=False)
```

- Variance to Target: Difference between actual and target sales.
- Rank: Rank stores within each quarter based on the variance. We use ascending=False to rank in descending order, so the best-performing store gets rank 1.
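Grouped ranking restarts the ranking within each quarter, which this small example (with made-up variances) demonstrates:

```python
import pandas as pd

df = pd.DataFrame({
    'Quarter': [1, 1, 2, 2],
    'Variance to Target': [3, -2, 5, 1],
})

# Rank within each quarter; ascending=False puts the largest variance first
df['Rank'] = df.groupby('Quarter')['Variance to Target'].rank(ascending=False)
print(df)
```

Note that rank() returns floats by default (ties would produce values like 1.5), so the ranks come back as 1.0, 2.0, and so on.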
10. Export the Cleaned Data
```python
output_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 4 output.csv"
joined.to_csv(output_path, index=False)
```

Finally, we save the cleaned and enriched dataset to a CSV file.
🏁 Final Thoughts
This script walks through a full data preparation workflow:
- Read and combine multiple sheets.
- Reshape and clean data.
- Summarise, join, and enrich with targets.
- Analyse and rank stores by their performance.
- Output the results for further analysis.
