Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.
In this post, we'll walk through a Python-based solution for the sixth Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.
Here is a link to the Challenge: https://preppindata.blogspot.com/2021/02/2021-week-6-comparing-prize-money-for.html
You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges
🏌️♂️ What Does the Script Do?
- Read data from an Excel file containing golfer earnings.
- Calculate additional metrics like average prize money per event.
- Rank golfers within their tours and overall.
- Aggregate data to answer questions about total prize money, player counts, and ranking differences.
- Compare differences between the LPGA and PGA tours.
- Export the cleaned and summarized data to a CSV file.
Let's break down the code step by step!
1. Importing Libraries
import os
import pandas as pd
import numpy as np
import re
pd.options.display.float_format = '{:.2f}'.formatWe import essential libraries:
- os: For handling file paths.
- pandas: The go-to library for data manipulation.
- numpy: Provides numerical operations (used here for calculations).
We also set the display format to suppress scientific notation for better readability of financial figures.
2. Load the Data
file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 6 Input.xlsx"
df = pd.read_excel(file_path, engine='openpyxl', sheet_name='OfficialMoney')We define the file path and load the OfficialMoney sheet from the Excel file using pd.read_excel(). The engine='openpyxl' parameter ensures compatibility with modern Excel formats.
3. Calculate New Metrics
df['Avg_Money_per_Event'] = df['MONEY'] / df['EVENTS']Here, we calculate a new column called Avg_Money_per_Event, which represents how much prize money each player earns per event. This is a simple division of MONEY by EVENTS.
4. Ranking Players
df['Rank_per_Tour'] = df.groupby('TOUR')['MONEY'].rank(ascending=False)
df['Overall_Rank'] = df['MONEY'].rank(ascending=False)
df['Difference_in_Ranking'] = df['Overall_Rank'] - df['Rank_per_Tour']We add three new ranking columns:
- Rank_per_Tour: Ranks players within their respective tours based on prize money (highest earns rank 1).
- Overall_Rank: Ranks players across both tours.
- Difference_in_Ranking: Measures how much a player's rank differs between their tour and the overall ranking.
The rank() function assigns ranks, and setting ascending=False means higher earnings get lower ranks (1st place).
5. Aggregating Data
df = df.groupby(['TOUR']).agg(
Total_Prize_Money=('MONEY', 'sum'),
Number_of_Players=('PLAYER NAME', 'nunique'),
Number_of_Events=('EVENTS', 'sum'),
Avg_Money_per_Event=('Avg_Money_per_Event', 'mean'),
Avg_Diff_in_Ranking=('Difference_in_Ranking', 'mean')
).reset_index()We group by each TOUR and calculate the following metrics:
- Total_Prize_Money: Total earnings for each tour.
- Number_of_Players: Unique player count per tour.
- Number_of_Events: Total number of events.
- Avg_Money_per_Event: Average earnings per event.
- Avg_Diff_in_Ranking: Average difference between a player's tour rank and their overall rank.
reset_index() ensures the grouped output is returned as a DataFrame.
6. Pivoting Data for Better Comparison
df = df.set_index('TOUR').TWe transpose the DataFrame using .T, which swaps rows and columns. Now, each row represents a metric, and each column represents a tour (PGA, LPGA).
7. Comparing the LPGA and PGA Tours
df['Difference between Tours'] = df['LPGA'] - df['PGA']We calculate a new column called Difference between Tours by subtracting PGA values from LPGA values.
- Negative values indicate the LPGA has lower figures than the PGA.
- Positive values indicate the LPGA surpasses the PGA in that metric.
8. Exporting the Cleaned Data
output_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 6 output.csv"
df.to_csv(output_path, index=False)
print(df)Finally, we save the cleaned and transformed data to a CSV file for further analysis.
🏁 Final Thoughts
In this tutorial, we walked through how to:
- Load Excel data into Python.
- Calculate new metrics for player earnings.
- Rank players within and across tours.
- Aggregate and pivot data to summarize key insights.
- Compare earnings between the LPGA and PGA tours.
