## The Task

I have a dataframe that looks like this:

date | money_spent ($) | meals_eaten | weight |
---|---|---|---|
2021-01-01 10:00:00 | 350 | 5 | 140 |
2021-01-02 18:00:00 | 250 | 2 | 170 |
2021-01-03 12:10:00 | 200 | 3 | 160 |
2021-01-04 19:40:00 | 100 | 1 | 150 |

I want to discretize this so that it “cuts” the rows every `$X`. I want to know some statistics on how much is being done for every `$X` I spend.

So if I were to use `$500` as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:

- first `date` of the cut
- average `meals_eaten`
- minimum `weight`
- maximum `weight`

So the final table would be two rows like this:

date | cumulative_spent ($) | meals_eaten | min_weight | max_weight |
---|---|---|---|---|
2021-01-01 10:00:00 | 600 | 3.5 | 140 | 170 |
2021-01-03 12:10:00 | 300 | 2 | 150 | 160 |

## My Approach:

My first instinct is to calculate the `cumsum()` of `money_spent` (assume the data is sorted by date), then use `pd.cut()` to make a new column, call it `spent_bin`, that determines each row’s bin.

**Note:** In this toy example, `spent_bin` would basically be `(0, 500]` for the first two rows and `(500, 1000]` for the last two.

Then it’s fairly simple: I group by `spent_bin` and aggregate as follows:

```python
.agg({'date': 'first', 'meals_eaten': 'mean', 'weight': ['min', 'max']})
```
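As a sketch of that approach on the toy data (assuming the `$500` bin edges are passed to `pd.cut` explicitly), this runs end to end. Note, however, that row 2 (cumulative $600) lands in the second bin rather than the first here, which is exactly the boundary issue discussed under Caveats below:

```python
import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-04 19:40:00", "money_spent": 100, "meals_eaten": 1, "weight": 150},
]
df = pd.DataFrame(rows)
df["cum_spent"] = df["money_spent"].cumsum()                    # 350, 600, 800, 900
df["spent_bin"] = pd.cut(df["cum_spent"], bins=[0, 500, 1000])  # explicit edges
out = df.groupby("spent_bin", observed=True).agg(
    date=("date", "first"),
    meals_eaten=("meals_eaten", "mean"),
    min_weight=("weight", "min"),
    max_weight=("weight", "max"),
)
```

Here only row 1 falls in `(0, 500]`; rows 2-4 all land in `(500, 1000]`.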

## What I’ve Tried

```python
import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-05 22:07:00", "money_spent": 100, "meals_eaten": 1, "weight": 150},
]
df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df.date)
df["cum_spent"] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
```

For some reason, I can’t get the `cut` step to work. Here is my toy code from above. The labels are not cleanly `[0-500], (500-1000]` for some reason. Honestly I’d settle for `[350-500], (500-800]` (these are the actual cumsum values at the edges of the cuts), but I can’t even get that to work, even though I’m doing the same as the documentation example. Any help with this?

### Caveats and Difficulties:

It’s pretty easy to write this in a for loop, of course: just accumulate with a `while cum_spent < 500:`. The problem is that I have millions of rows in my actual dataset; it currently takes 20 minutes to process a single df this way.

There’s also a minor issue: sometimes a row will overshoot the interval. When that happens, I want that last row included. This comes up in the toy example, where row #2 actually ends at $600, not $500. But since it is *the first row that ends at or surpasses $500*, I’m including it in the first bin.
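One vectorized workaround for this boundary behaviour (a sketch, not the only option) is to bin each row on the cumulative total *before* it, so the row that crosses the threshold stays with the earlier bin. Note this never resets the running total after a cut, so large repeated overshoots can drift from true reset-at-the-cut semantics; the numba approach in the Answer below is exact:

```python
import pandas as pd

spent = pd.Series([350, 250, 200, 100])
cum = spent.cumsum()          # 350, 600, 800, 900
# The cumulative total *before* each row decides its bin, so the
# row that first reaches $500 (ending at $600) stays in bin 0.
bins = (cum - spent) // 500   # 0, 0, 1, 1
```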

## Answer

A customized function can achieve the `cumsum` with the reset limitation (using the numba-compiled `cumli` helper defined below):

```python
df['new'] = cumli(df['money_spent'].values, 500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date=('date', 'first'),
    meals_eaten=('meals_eaten', 'mean'),
    min_weight=('weight', 'min'),
    max_weight=('weight', 'max')).sort_index(ascending=False)

Out[81]:
           date  meals_eaten  min_weight  max_weight
new
1    2021-01-01          3.5         140         170
0    2021-01-03          2.0         150         160
```

```python
from numba import njit

@njit
def cumli(x, lim):
    # Keep a running total that resets once it reaches the limit,
    # flagging the row where the reset happens.
    total = 0
    result = []
    for y in x:
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result
```
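If numba is unavailable, the same flag logic runs in plain Python (just slower). A quick sanity check on the toy data also shows how the reversed-`cumsum` trick above turns the flags into group ids: the reversed running sum counts how many cut flags occur at or after each row, so every row up to and including a flagged row shares one id (the `cumli_py` name here is illustrative):

```python
import pandas as pd

def cumli_py(x, lim):
    # Same logic as the @njit version: flag the row where the
    # running total reaches the limit, then reset the total.
    total = 0
    result = []
    for y in x:
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result

flags = cumli_py([350, 250, 200, 100], 500)                   # [0, 1, 0, 0]
group_ids = pd.Series(flags).iloc[::-1].cumsum().sort_index()
# rows 0-1 share group id 1, rows 2-3 share group id 0
```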