Tips: Cache Intermediate Results with pickle

Here’s a useful pattern I’ve been getting a lot of mileage out of lately. If you’re running an analysis that has a time-consuming step, you can save the result as a Python-readable “pickle” file. Addendum: In some cases, pickling a Python object can succeed in storing and retrieving data where a library’s built-in functions for saving and loading data fail.

import os
import pickle as pkl

path = "./data_intermediates/processed_data.pkl"
if os.path.exists(path):
    # Reuse the result saved by an earlier run.
    with open(path, 'rb') as f:
        processed_data = pkl.load(f)
else:
    # Make `processed_data` here
    with open(path, 'wb') as f:
        pkl.dump(processed_data, f)
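If you use this pattern in several places, it can be wrapped in a small helper. This is only a sketch: `load_or_compute`, `compute_fn`, and `slow_step` are made-up names for illustration, and the demo writes to a temporary directory rather than the path above.

```python
import os
import pickle as pkl
import tempfile


def load_or_compute(path, compute_fn):
    """Return the pickled object at `path`, computing and saving it first if needed."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pkl.load(f)
    result = compute_fn()
    # Create the parent directory so the first run does not fail on a missing folder.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pkl.dump(result, f)
    return result


# Hypothetical stand-in for the time-consuming step; `calls` counts how often it runs.
calls = []

def slow_step():
    calls.append(1)
    return {"total": sum(range(1000))}


demo_path = os.path.join(tempfile.mkdtemp(), "intermediates", "processed_data.pkl")
first = load_or_compute(demo_path, slow_step)   # computes and writes the file
second = load_or_compute(demo_path, slow_step)  # loads the saved file instead
```

The second call never runs `slow_step`, which is the whole point: delete the file when you want the step to be recomputed.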

This also lets you batch a process so that you can do more with your resources. For example, here’s a list comprehension that rearranges the weather data into “long” format for each day from 0 to 287. It is concise, but it builds the whole list at once, which takes a lot of memory.

# range(288) yields the integer days 0..287 (np.linspace would yield floats).
sal_long_list = [_get_weather_long(results_list = res,
                                   current_day = ith_day) for ith_day in range(288)]

If we incorporate it into the pattern above, we can hold fewer items in memory at a time and then merge them (e.g. with list.extend()) after the fact.

# (label, days) for each batch; the labels match the file names on disk.
batches = [('0-95', range(0, 96)),       # Batch 1
           ('96-191', range(96, 192)),   # Batch 2
           ('192-287', range(192, 288))] # Batch 3

for label, days in batches:
    file_path = '../data/result_intermediates/sal_df_W_long_part_day' + label + '.pkl'
    if os.path.exists(file_path):
        with open(file_path, 'rb') as f:
            sal_long_list = pkl.load(f)
    else:
        # The original list comprehension, restricted to this batch's days.
        sal_long_list = [_get_weather_long(results_list = res,
                                           current_day = current_day) for current_day in days]
        with open(file_path, 'wb') as f:
            pkl.dump(sal_long_list, f)
    # `sal_long_list` is overwritten on each pass, so only one batch is in memory at a time.
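Merging after the fact could look something like the sketch below. It is not the code from my analysis: it writes three small stand-in lists to a temporary directory (the `part_0.pkl` style file names are made up) and then reads them back one at a time, extending a single list.

```python
import os
import pickle as pkl
import tempfile

# Write three small stand-in batches so the example runs on its own.
tmp_dir = tempfile.mkdtemp()
part_paths = []
for i, part in enumerate([[0, 1], [2, 3], [4, 5]]):
    part_path = os.path.join(tmp_dir, f"part_{i}.pkl")
    with open(part_path, "wb") as f:
        pkl.dump(part, f)
    part_paths.append(part_path)

# Load each batch in order and append its items to one list.
merged = []
for part_path in part_paths:
    with open(part_path, "rb") as f:
        merged.extend(pkl.load(f))
```

Since each batch is itself a list, `extend` keeps `merged` flat, whereas `append` would give you a list of lists.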