Useful Python bits for Data Engineering

Whether you are building ETL pipelines, processing massive data files, or pulling data from APIs, Python is a powerful tool for the job. Because it is open source, you also get a vast collection of libraries to draw on.

In this post, I'll walk you through some of my favourite Python functions, from directory crawling to requesting API data. Here are some of the most useful ones to get you started as a data engineer.

  • get() - straight from the requests library, this function takes your API URL and returns the response. You can pass query parameters such as your API key or dates through params. Note that auth expects a (username, password) tuple or an auth object rather than a bare key string, for example:

response = requests.get("https://api.com/data", params=parameters, auth=(username, password))

  • raise_for_status() - once you've got a response from your API, you can use this to raise an HTTPError if the HTTP status code signals a client error (4xx) or a server error (5xx). It does nothing for successful responses, which makes it a handy way to catch failed requests early:

response.raise_for_status()

  • try / except blocks - these handle exceptions so the script can keep running. The code under try runs first; if an error occurs, Python jumps straight to the matching except block and runs whatever you put there:

try:
    response.raise_for_status()
    print('Success!')
except requests.exceptions.HTTPError:
    print('API data was not retrieved')

  • raise - useful together with try and except blocks, this lets you raise a specific type of error. For example, once you reach the except part of the block you can raise a ValueError; if nothing else catches it, the script stops and your message is printed in the terminal:

raise ValueError('Broken :(')
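To see how try, except and raise fit together, here is a minimal sketch that needs no network access. The function name parse_record_count and its inputs are hypothetical, chosen purely for illustration; the same pattern applies around a call such as raise_for_status().

```python
def parse_record_count(raw_value):
    """Convert a raw string to an int, raising a clearer error on failure."""
    try:
        # int() raises ValueError for input such as "abc"
        count = int(raw_value)
    except ValueError as err:
        # Re-raise with a message that says what went wrong
        raise ValueError(f"Record count is not a number: {raw_value!r}") from err
    return count


good = parse_record_count("42")

try:
    parse_record_count("abc")
except ValueError as err:
    message = str(err)
```

Catching a specific exception type (here ValueError) rather than using a bare except keeps unrelated bugs from being silently swallowed.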

  • strftime() - formats a datetime object as a string, letting you choose the date format it is displayed in:

date.strftime("%Y-%m-%d")
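As a small illustration, the sketch below formats one fixed timestamp in two ways. The timestamp and the export_ file name are made-up values, used only so the result is reproducible.

```python
from datetime import datetime

# A fixed timestamp so the output is reproducible (hypothetical value)
run_time = datetime(2025, 1, 31, 14, 5, 9)

# Year-month-day, a common format for API date parameters
iso_date = run_time.strftime("%Y-%m-%d")

# A compact stamp that is safe to use in file names
file_stamp = run_time.strftime("%Y%m%d_%H%M%S")
file_name = f"export_{file_stamp}.csv"
```

Here iso_date is "2025-01-31" and file_name is "export_20250131_140509.csv".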

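  • os.walk() - from the os library, this lets you iterate through a directory tree. For each directory it visits, it yields the directory path, the names of its subdirectories and the names of its files. Especially useful in loops:

for dirpath, dirnames, filenames in os.walk(directory):
    print(dirpath, filenames)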

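  • os.path.join() - can be used to concatenate multiple path components. It constructs the path smartly, but it starts over whenever a component is an absolute path (one beginning with '/'), discarding everything before it:

path = os.path.join("/home", "/user", "test.py")

# on Linux or macOS, path is now "/user/test.py"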
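os.walk() and os.path.join() are often used together to collect the full path of every file under a folder. The sketch below builds a small throwaway tree first so it can run anywhere; the helper name list_files and the file names are assumptions made for the example.

```python
import os
import tempfile


def list_files(root):
    """Return the full path of every file below root, sorted."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # dirpath is the folder currently being visited
            found.append(os.path.join(dirpath, name))
    return sorted(found)


# Build a small throwaway tree so the example is self-contained
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "raw", "2025"))
    demo_files = (
        "a.csv",
        os.path.join("raw", "b.csv"),
        os.path.join("raw", "2025", "c.json"),
    )
    for rel in demo_files:
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("demo")

    # Paths relative to root are easier to read than the full temp path
    relative = [os.path.relpath(p, root) for p in list_files(root)]
```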

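  • mkdtemp() - from the tempfile library, this creates a temporary directory in your system's temp location and returns its path, giving you a scratch space to work in. Be aware that it is not deleted automatically: you need to remove it yourself when you are done (for example with shutil.rmtree()), or use tempfile.TemporaryDirectory() in a with block if you want automatic cleanup:

temp_dir = tempfile.mkdtemp()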

  • makedirs() - from the os library, this creates a directory at the path you choose, along with any missing parent directories. You can specify that it won't error if the directory already exists using exist_ok=True:

os.makedirs('output', exist_ok=True)
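The sketch below puts the two directory functions side by side. It shows that a directory made with mkdtemp() stays on disk until you remove it, while tempfile.TemporaryDirectory() cleans up on its own. The folder names output and daily are placeholders.

```python
import os
import shutil
import tempfile

# mkdtemp() creates the directory and returns its path; removing it is up to us
temp_dir = tempfile.mkdtemp()
try:
    output_dir = os.path.join(temp_dir, "output", "daily")
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)  # already exists: no error raised
    created = os.path.isdir(output_dir)
finally:
    # Clean up explicitly, even if something above failed
    shutil.rmtree(temp_dir)

cleaned_up = not os.path.exists(temp_dir)

# TemporaryDirectory() removes the folder when the with block ends
with tempfile.TemporaryDirectory() as auto_dir:
    existed_inside = os.path.isdir(auto_dir)
existed_after = os.path.exists(auto_dir)
```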

  • ZipFile() - from the zipfile library, this lets you open a compressed (zip) file, which is useful for looping through the items within it. Pass the path to the zip file and a mode, "r" for read or "w" for write, for example:

with ZipFile("data.zip", "r") as f:
    print(f.namelist())

  • extractall() - this simply extracts all of the content from an open zip file into the folder you specify:

f.extractall('path')

  • next() - returns the next item from an iterator. A list is an iterable rather than an iterator, so wrap it in iter() first:

next(iter(my_list))

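  • shutil.copyfileobj() - copies the contents of one open file object into another. For example, while looping through a zip file you can open each item and copy it into a file opened at a new path. Both arguments must be open file objects, not paths:

shutil.copyfileobj(source_file, destination_file)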
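The zip-related functions above can be combined as in the following sketch. It writes a tiny archive first so it is self-contained; the member names sales.csv and notes.txt and the output paths are invented for the example.

```python
import os
import shutil
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as work_dir:
    zip_path = os.path.join(work_dir, "data.zip")

    # Create a small archive so the example is self-contained
    with ZipFile(zip_path, "w") as archive:
        archive.writestr("sales.csv", "id,amount\n1,10\n")
        archive.writestr("notes.txt", "demo")

    with ZipFile(zip_path, "r") as archive:
        names = archive.namelist()

        # next() pulls the first matching name out of a generator
        first_csv = next(name for name in names if name.endswith(".csv"))

        # Copy one member to a new file without extracting the whole archive
        copy_path = os.path.join(work_dir, "sales_copy.csv")
        with archive.open(first_csv) as source, open(copy_path, "wb") as target:
            shutil.copyfileobj(source, target)

        # Or unpack everything at once
        extract_dir = os.path.join(work_dir, "extracted")
        archive.extractall(extract_dir)

    with open(copy_path) as fh:
        copied_text = fh.read()
    extracted = sorted(os.listdir(extract_dir))
```

Opening the target in "wb" mode matters here because archive.open() returns a binary file object.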

  • logging.basicConfig() - this is how you set up what your logs will look like. Logging is crucial to development and to error handling, so use it wisely!

logging.basicConfig(level=logging.INFO)

  • logging.log() - lets you log a message. It is similar to a print, but the output follows your configured format and is shown or hidden depending on the level of the message you are logging:

logging.log(logging.INFO, "Start run")
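Here is a minimal sketch of the two logging calls working together. It sends the output to an in-memory buffer only so the result can be inspected; the format string and the messages are assumptions, and a real script would usually log to the terminal or to a file.

```python
import io
import logging

# Send log output to an in-memory buffer so it can be inspected afterwards
log_buffer = io.StringIO()
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s | %(message)s",
    stream=log_buffer,
    force=True,  # replace any handlers configured earlier (Python 3.8+)
)

logging.log(logging.INFO, "Start run")
logging.info("Loaded %d rows", 120)     # shorthand for logging.log(logging.INFO, ...)
logging.debug("Not shown: below INFO")  # filtered out by level=logging.INFO

log_lines = log_buffer.getvalue().splitlines()
```

Only the two INFO messages end up in log_lines, since DEBUG sits below the configured level.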

  • os.remove() - deletes the file at the path you give it:

os.remove('bad_file.csv')

  • .endswith() - returns True if a string (or path ;) ) ends with the suffix specified. Including the dot avoids matching names that merely end in the same letters:

file.endswith('.json')
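A common use of os.remove() and .endswith() together is clearing out files of one type from a folder. This is a sketch only: the helper name remove_by_suffix and the file names are hypothetical, and it runs inside a temporary folder so nothing real is deleted.

```python
import os
import tempfile


def remove_by_suffix(directory, suffix):
    """Delete files in directory whose names end with suffix; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories: os.remove() only works on files
        if os.path.isfile(path) and name.endswith(suffix):
            os.remove(path)
            removed.append(name)
    return removed


with tempfile.TemporaryDirectory() as folder:
    # Create three empty demo files
    for name in ("good.json", "bad_file.csv", "old.csv"):
        open(os.path.join(folder, name), "w").close()

    removed = remove_by_suffix(folder, ".csv")
    remaining = os.listdir(folder)
```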

  • time.sleep() - from the time library, pauses the program's execution for a specific number of seconds:

time.sleep(5)
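One place time.sleep() tends to show up in pipelines is between retries of a step that sometimes fails. The sketch below is one simple way to do that, not a standard recipe: retry, flaky and the tiny delay are all made up for the example, and flaky stands in for something like an API call.

```python
import time


def retry(action, attempts=3, delay_seconds=0.01):
    """Call action until it succeeds, sleeping between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay_seconds)


calls = []


def flaky():
    """Stand-in for an unreliable step: fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = retry(flaky)
```

In a real script the delay would normally be seconds rather than hundredths of a second.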

  • datetime.now() - returns the current local date and time

datetime.now()

  • .split() - splits a string into a list based on a delimiter:

"one, two, three".split(",")

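  • response.content - an attribute rather than a function, this holds the body of an HTTP response as raw bytes (response.text gives you a decoded string instead). Very useful when looking at API calls:

api_call = response.content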

There are many more resources to explore, libraries to study, and scripts to write! I hope these functions are a good start to your data engineering journey. Who knows, maybe a part 2 will come out when I have another list.

Author:
Jules Claeys
© 2025 The Information Lab