How to Incrementally Extract and Deduplicate Your MailChimp Data

Today I spent some time building the start of a data pipeline for Mailchimp campaigns, and it got me thinking about the cadence of data, incremental refreshes, and the need to avoid duplicates.

Project Context

My goal was to extract Mailchimp subscriber email activity for each campaign onto my local machine so that, down the line, this data can be used alongside other sources, like Amplitude engagement data, to measure campaign performance and user engagement.

Mailchimp campaigns are basically mass emails that get sent out, and each campaign has its own subscriber activity report.

When thinking about campaigns in Mailchimp, the sent date is key. Once a campaign is sent, that date never changes. This makes campaigns themselves naturally incremental: yesterday’s campaigns stay the same today.

Subscriber activity, on the other hand, is dynamic. People can open emails, click links, or interact with the campaign at any time after it’s sent. That means the activity API needs to be called every time you want fresh data, even for campaigns that were sent weeks ago.

We use the campaign ID to look up subscriber activity, but I didn’t want to fetch all campaigns every time just to get the activity reports. So the challenge was designing a system that could:

  • Pull only new campaigns (based on sent date)
  • Fetch fresh subscriber activity for both new and existing campaigns without creating duplicates

Solution

  • We maintain a master list of all campaigns in data/campaigns_summary.json. On each run, the script loads this file (or creates it if it doesn’t exist) to know which campaigns have already been processed.
  • A daily scan for new campaigns. Each time the script runs, it looks back over a configurable window (by default the last day), and Mailchimp returns any campaigns that were created or sent during that time (a minimal sketch of this step appears after the deduplication snippet below).
  • Deduplication. Using the campaign IDs already in the summary, the script filters out anything it has seen before and appends only genuinely new campaigns to the summary list:
from datetime import datetime

# Campaign IDs we have already recorded in the summary file
existing_ids = {c["id"] for c in campaign_summary}

new_campaigns = []
for camp in campaigns:
    if camp["id"] not in existing_ids:
        # First time we have seen this campaign: add it to the master list
        summary = {
            "id": camp["id"],
            "title": camp.get("settings", {}).get("title", "Untitled"),
            "send_time": camp.get("send_time"),
            "found_time": datetime.utcnow().isoformat()
        }
        campaign_summary.append(summary)
        existing_ids.add(camp["id"])
        new_campaigns.append(summary)
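
For reference, the daily scan mentioned above can be done with the official mailchimp_marketing client. The sketch below is a minimal, hypothetical version of that step: the environment variable names, the LOOKBACK_DAYS setting, and the use of since_create_time are my assumptions, not necessarily the exact configuration of the real script.

import json
import os
from datetime import datetime, timedelta, timezone

import mailchimp_marketing as MailchimpMarketing

LOOKBACK_DAYS = 1  # hypothetical setting; widen it to backfill older campaigns
SUMMARY_PATH = "data/campaigns_summary.json"

# Configure the client (credentials assumed to come from environment variables)
mailchimp = MailchimpMarketing.Client()
mailchimp.set_config({
    "api_key": os.environ["MAILCHIMP_API_KEY"],
    "server": os.environ["MAILCHIMP_SERVER_PREFIX"],  # e.g. "us21"
})

# Load the master summary, or start a fresh one on the first run
if os.path.exists(SUMMARY_PATH):
    with open(SUMMARY_PATH) as f:
        campaign_summary = json.load(f)
else:
    campaign_summary = []

# Ask Mailchimp only for campaigns created inside the lookback window
since = (datetime.now(timezone.utc) - timedelta(days=LOOKBACK_DAYS)).isoformat()
response = mailchimp.campaigns.list(since_create_time=since, count=1000)
campaigns = response.get("campaigns", [])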

Once we have the master list of campaign IDs, the next step is to fetch click details and subscriber activity by looping through each campaign ID:

import json
import os

from mailchimp_marketing.api_client import ApiClientError

# Make sure the per-campaign output folder exists
os.makedirs("data/campaigns", exist_ok=True)

for campaign in campaign_summary:
    campaign_id = campaign["id"]
    output_file = f"data/campaigns/{campaign_id}.json"

    try:
        activity = mailchimp.reports.get_email_activity_for_campaign(campaign_id)
    except ApiClientError as e:
        # Keep going if one campaign's report fails; record the error instead
        activity = {"error": str(e)}

    campaign_full = {
        "campaign": campaign,
        "activity": activity,
    }

    # Save or update the JSON file for this campaign
    with open(output_file, "w") as f:
        json.dump(campaign_full, f, indent=2)
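
The master summary also needs to be written back to disk so the next run can deduplicate against it. A minimal sketch, assuming the same campaign_summary list and file path as above:

import json
import os

# Persist the updated master list so the next run knows what was processed
os.makedirs("data", exist_ok=True)
with open("data/campaigns_summary.json", "w") as f:
    json.dump(campaign_summary, f, indent=2)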

By separating campaign metadata from subscriber activity and maintaining a master summary with deduplication, we’ve made the script more robust and dynamic. Now, each campaign is stored as its own JSON file, which allows us to:

  • Add new campaigns automatically as they appear in the Mailchimp API.
  • Update existing campaigns with fresh subscriber activity without touching unchanged campaigns.
  • Control the lookback window, so we can fetch campaigns from a short or long period of time.
  • Avoid duplicate records, even if the script runs multiple times a day.

For example, here is the output from a recent run:

Scanning for campaigns created between 2025-09-01 → 2025-10-25
Found 14 campaigns in scan window.
New campaigns added: 4
Refreshing reports for 14 total campaigns...
 → Updating 921dcsd23eb1f...
 → Updating 6dsdc32ed3a2a...
 → Updating 1cx2423d2ew39...
 ...
Done.

Each campaign file is individually refreshed, meaning we can always have the most up-to-date activity without reprocessing the entire dataset.
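
Because each campaign lives in its own file, downstream work (for example, joining against Amplitude engagement data) can simply read the folder back in. The snippet below is a hypothetical illustration of that step; the "emails" key reflects the shape of Mailchimp's email activity report and is an assumption about the stored payload rather than part of the pipeline itself.

import glob
import json

# Hypothetical downstream step: load every per-campaign file for analysis
rows = []
for path in glob.glob("data/campaigns/*.json"):
    with open(path) as f:
        payload = json.load(f)
    activity = payload.get("activity", {})
    rows.append({
        "campaign_id": payload["campaign"]["id"],
        "send_time": payload["campaign"].get("send_time"),
        "subscribers_with_activity": len(activity.get("emails", [])),
    })

print(f"Loaded {len(rows)} campaigns")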

Author:
Otto Richardson