Introduction to Preppin' Data Challenges
Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.
In this post, we'll walk through a Python-based solution for the fourth Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.
Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-4.html
You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges
📊 What Does the Script Do?
- Read data from multiple sheets in an Excel file.
- Combine data from different stores.
- Reshape the data by pivoting columns.
- Extract new fields like "Customer Type" and "Product."
- Summarise sales by store and quarter.
- Compare actual sales to targets.
- Rank stores based on their performance.
- Export the cleaned data to a CSV file.
Let's dive into the code and break it down!
1. Importing Libraries
```python
import os
import pandas as pd
import numpy as np
from pandas import concat, ExcelFile, melt, read_csv
```

We start by importing essential libraries:
- os: For handling file paths.
- pandas: The powerhouse for data manipulation.
- concat, ExcelFile, melt, read_csv: Specific pandas functions to manipulate data efficiently.
2. Load the Excel File
```python
file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 4 Input.xlsx"
xls = pd.ExcelFile(file_path)
```

We define the file path and use pd.ExcelFile() to load the Excel workbook. This allows us to interact with multiple sheets easily.
3. Combine Data from All Sheets
```python
dfIn = None
for sheet in [a for a in xls.sheet_names if a != 'Targets']:
    dfNew = xls.parse(sheet)
    dfNew['Store'] = sheet
    dfIn = dfNew if dfIn is None else concat([dfIn, dfNew], ignore_index=True)
```

What's Happening Here?
- Loop through all sheets (except the 'Targets' sheet): instead of writing every sheet out manually, the for loop picks up each store's sheet automatically.
- List comprehension: [a for a in xls.sheet_names if a != 'Targets'] generates a list of sheet names excluding 'Targets'.
- Parse each sheet: xls.parse(sheet) reads the contents of the current sheet into a DataFrame.
- Add a Store column: dfNew['Store'] = sheet creates a new column that records the store name (based on the sheet name).
- Concatenate DataFrames: if dfIn is None (on the first iteration), it is assigned dfNew. Otherwise, pd.concat() appends the current DataFrame to the existing one.
This combines sales data from multiple stores into a single DataFrame.
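The same loop-and-concatenate pattern can be sketched with in-memory DataFrames standing in for the Excel sheets (the store names and figures here are made up for illustration, not the challenge's real data):

```python
import pandas as pd

# Hypothetical stand-ins for the per-store sheets in the workbook
sheets = {
    'Manchester': pd.DataFrame({'Date': ['2021-01-04'], 'New - Saddles': [10]}),
    'London': pd.DataFrame({'Date': ['2021-01-04'], 'New - Saddles': [7]}),
}

dfIn = None
for store, dfNew in sheets.items():
    dfNew = dfNew.copy()
    dfNew['Store'] = store  # record which "sheet" each row came from
    dfIn = dfNew if dfIn is None else pd.concat([dfIn, dfNew], ignore_index=True)

print(dfIn)
```

Because ignore_index=True rebuilds the row index, the combined DataFrame has a clean 0..n-1 index regardless of how many sheets were stacked.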
4. Reshape the Data (Pivot the Columns)
```python
o1 = dfIn.melt(id_vars=['Date', 'Store'],
               var_name='Customer Type - Product',
               value_name='Values')
o1.rename(columns={'Values': 'Products Sold'}, inplace=True)
```

What Does melt() Do?
It un-pivots the data from wide to long format. This makes analysis easier by converting the measure column headers (e.g. 'New - Saddles', 'Existing - Saddles') into rows.
We also rename the new measure column to 'Products Sold'.
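Here is a minimal wide-to-long example of the same melt() call, using one made-up row of store data:

```python
import pandas as pd

# One row in wide format: each measure lives in its own column
wide = pd.DataFrame({
    'Date': ['2021-01-04'],
    'Store': ['Manchester'],
    'New - Saddles': [10],
    'Existing - Saddles': [5],
})

# Un-pivot: the two measure columns become two rows
long_df = wide.melt(id_vars=['Date', 'Store'],
                    var_name='Customer Type - Product',
                    value_name='Products Sold')
print(long_df)
```

One wide row with two measure columns becomes two long rows, each keeping its Date and Store identifiers.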
5. Split Columns into Multiple Fields
```python
o1[['Customer Type', 'Product']] = o1['Customer Type - Product'].str.split(' - ', expand=True)
```

We split the combined column into two separate fields:
- Customer Type: New vs. Existing customers.
- Product: Type of product sold.
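A small standalone example of str.split() with expand=True (the product names here are illustrative):

```python
import pandas as pd

s = pd.Series(['New - Saddles', 'Existing - Mudguards'])

# expand=True returns a DataFrame with one column per split part,
# instead of a single column of lists
parts = s.str.split(' - ', expand=True)
parts.columns = ['Customer Type', 'Product']
print(parts)
```

Without expand=True, str.split() would return a Series of Python lists, which could not be assigned to two columns in one step.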
6. Extract the Quarter from the Date
```python
o1['Date'] = pd.to_datetime(o1['Date'], format="%Y/%m/%d")
o1['Quarter'] = o1['Date'].dt.quarter
```

- Convert the 'Date' column to datetime format.
- Extract the quarter using .dt.quarter for grouping by time periods.
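The .dt.quarter accessor maps each date to 1-4 based on its month, as this quick sketch with sample dates shows:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2021-01-15', '2021-05-02', '2021-11-30']))

# January -> Q1, May -> Q2, November -> Q4
quarters = dates.dt.quarter
print(quarters.tolist())
```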
7. Summarize the Data
```python
output_1 = o1.groupby(
    ['Store', 'Quarter']
).agg(
    Products_Sold=('Products Sold', 'sum'),
).reset_index()
```

We group by Store and Quarter and sum the 'Products Sold'.
reset_index() moves the group keys ('Store' and 'Quarter') out of the index and back into ordinary columns, so the result is a flat DataFrame ready for joining.
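The named-aggregation syntax used above can be tried on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Store': ['A', 'A', 'B'],
    'Quarter': [1, 1, 1],
    'Products Sold': [10, 5, 8],
})

# Named aggregation: Products_Sold=(source column, aggregation function)
summary = df.groupby(['Store', 'Quarter']).agg(
    Products_Sold=('Products Sold', 'sum'),
).reset_index()
print(summary)
```

Named aggregation also lets us rename the column in the same step, swapping the space for an underscore.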
8. Compare to Targets
```python
target = pd.read_excel(file_path, sheet_name='Targets')
joined = output_1.merge(target, how='left', on=['Quarter', 'Store'])
```

- Load the 'Targets' sheet.
- Merge actual sales with targets using a left join on 'Quarter' and 'Store'.
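A left join keeps every actuals row and attaches the matching target where one exists; here is the same merge on two toy tables (store names and figures invented for illustration):

```python
import pandas as pd

actuals = pd.DataFrame({'Store': ['A', 'B'], 'Quarter': [1, 1],
                        'Products_Sold': [15, 8]})
targets = pd.DataFrame({'Store': ['A', 'B'], 'Quarter': [1, 1],
                        'Target': [12, 10]})

# Left join: all rows of actuals survive; Target is filled in by key match
joined = actuals.merge(targets, how='left', on=['Quarter', 'Store'])
print(joined)
```

If a Store/Quarter pair had no target, its Target value would simply come through as NaN rather than dropping the row.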
9. Calculate Variance and Rank Stores
```python
joined['Variance to Target'] = joined['Products_Sold'] - joined['Target']
joined['Rank'] = joined.groupby('Quarter')['Variance to Target'].rank(ascending=False)
```

- Variance to Target: Difference between actual and target sales.
- Rank: Rank stores within each quarter based on the variance. We use ascending=False to rank in descending order, so the best-performing store gets rank 1.
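Grouped ranking restarts the ranking within each quarter, which this small example (with made-up variances) demonstrates:

```python
import pandas as pd

df = pd.DataFrame({
    'Quarter': [1, 1, 2, 2],
    'Variance to Target': [3, -2, 5, 1],
})

# Rank within each quarter; ascending=False puts the largest variance first
df['Rank'] = df.groupby('Quarter')['Variance to Target'].rank(ascending=False)
print(df)
```

Note that rank() returns floats by default (ties would produce values like 1.5), so the ranks come back as 1.0, 2.0, and so on.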
10. Export the Cleaned Data
```python
output_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 4 output.csv"
joined.to_csv(output_path, index=False)
```

Finally, we save the cleaned and enriched dataset to a CSV file.
🏁 Final Thoughts
This script walks through a full data preparation workflow:
- Read and combine multiple sheets.
- Reshape and clean data.
- Summarise, join, and enrich with targets.
- Analyse and rank stores by their performance.
- Output the results for further analysis.
