pandas groupby that finally clicks: split, apply, combine
Most pandas confusion disappears once you see groupby as three steps, not one magic call. Here's the mental model — agg vs transform vs filter — with runnable examples.
If pandas groupby still feels like guesswork, it's almost always because it's taught as one operation. It's three: split the rows into groups, apply something to each group, then combine the results back. Hold that single picture and the whole API falls into place.
The fast version:
- groupby(...) only describes the split; nothing computes until you apply something.
- Pick the right apply step (agg, transform, or filter) and the answer is one line.
- The three differ by the shape they return, which is the thing that trips people up.
Split, apply, combine
import pandas as pd
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 14, 7, 9, 21],
})
# split by team → apply mean to points → combine into one row per team
df.groupby("team")["points"].mean()
Calling groupby("team") doesn't compute anything yet; it just describes the split. The work happens when you apply something to each group. That laziness is the part people miss, and it's why chaining the wrong method feels unpredictable.
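One way to see the laziness for yourself is to hold the grouped object in a variable before applying anything. This is a minimal sketch on the same toy data; the variable names `grouped` and `team_means` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 14, 7, 9, 21],
})

# The split on its own: this is a GroupBy object, not a table of results.
grouped = df.groupby("team")
print(type(grouped).__name__)  # DataFrameGroupBy

# The apply step (here: mean) is what actually produces numbers.
team_means = grouped["points"].mean()
print(team_means)
```

Printing `grouped` by itself shows only an object description, which is a handy reminder that an apply step is still missing.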
The three things you can apply
The whole API comes down to choosing what to apply, and each choice returns a different shape:
| You want | Method | Result shape |
|---|---|---|
| One value per group | .agg() | one row per group |
| A value per original row | .transform() | same shape as input |
| To keep or drop whole groups | .filter() | subset of input rows |
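The table above can be checked directly by running all three apply steps on the same data and comparing shapes. A small sketch follows; the "more than 2 rows" rule used for filter is an arbitrary condition chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 14, 7, 9, 21],
})

points_by_team = df.groupby("team")["points"]

# agg collapses: one value per team
per_team = points_by_team.agg("sum")

# transform preserves: one value per original row, aligned to df's index
per_row = points_by_team.transform("sum")

# filter subsets: keep only the rows of teams with more than 2 rows
big_teams = df.groupby("team").filter(lambda g: len(g) > 2)

print(per_team.shape)   # (2,)
print(per_row.shape)    # (5,)
print(big_teams.shape)  # (3, 2)
```

Here agg returns two values (one per team), transform returns five (one per input row), and filter returns the three original rows that belong to team B.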
transform is the quiet hero. Need each row's value as a share of its group total? That's a transform, not an aggregate — because you want the answer broadcast back onto every original row:
# group total broadcast back onto every row
df["share"] = df["points"] / df.groupby("team")["points"].transform("sum")
If you reached for agg here, you'd get one number per team and then have to merge it back, fighting the shape instead of using it.
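For comparison, here is roughly what the agg-then-merge detour looks like next to the transform version. This is a sketch on the same toy data; the column name `team_total` and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 14, 7, 9, 21],
})

# Detour: aggregate to one row per team, then merge the totals back onto the rows.
totals = (
    df.groupby("team")["points"]
    .agg("sum")
    .rename("team_total")
    .reset_index()
)
merged = df.merge(totals, on="team", how="left")
merged["share"] = merged["points"] / merged["team_total"]

# Direct: transform already returns one total per original row.
share_direct = df["points"] / df.groupby("team")["points"].transform("sum")

print(merged["share"].tolist())
print(share_direct.tolist())
```

Both routes give the same shares; the transform version just skips the intermediate table and the merge.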
Why the shape is the whole game
Most "why doesn't this work?" pandas moments are shape mismatches: you produced one row per group when you needed one per original row, or vice versa. Naming the three apply steps by their output shape — collapse (agg), preserve (transform), subset (filter) — turns a guessing game into a decision. Ask "what shape do I need back?" first, and the method picks itself.
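A common form of this mismatch is assigning an aggregated result to a new column. The sketch below assumes the default integer index on the toy DataFrame: the aggregate is indexed by team label, so pandas finds no matching index labels to align on and fills the column with NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 14, 7, 9, 21],
})

# Wrong shape: mean() returns one row per team, indexed by "A" and "B".
# Column assignment aligns on index, and df's index is 0..4, so nothing matches.
wrong = df.copy()
wrong["team_mean"] = wrong.groupby("team")["points"].mean()
print(wrong["team_mean"].isna().all())  # True

# Right shape: transform returns one value per original row, on df's own index.
right = df.copy()
right["team_mean"] = right.groupby("team")["points"].transform("mean")
print(right)
```

No error is raised in the first case, which is part of why shape mismatches feel like guesswork until you check what each method returns.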
How this shows up on CodeOak
CodeOak's pandas track is table-native: you're handed a real DataFrame and graded on the table you produce — by deterministic comparison against the expected output, exactly like the SQL track. No AI judges your code; the result table either matches or it doesn't. That means the split-apply-combine instinct is precisely what's being tested. Pick the right apply step and you're done in a line; pick the wrong one and you're wrestling the shape of the result.
Practice a few agg vs transform problems back to back. The moment you stop reaching for a loop, you've got it. Window functions are the SQL version of the same instinct — see window functions explained — and if you want a system that drills the exact pattern you keep missing, start with the assessment.
FAQ
What does split-apply-combine actually mean in pandas?
It's the three stages of every groupby: split the rows into groups by a key, apply a function to each group, then combine the results into a new Series or DataFrame. groupby() only sets up the split; the apply step does the work.
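To make the three stages concrete, they can be written out by hand. The loop below is for illustration only (the one-line groupby call is what you would normally write), and the names `pieces` and `manual` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 14, 7, 9, 21],
})

pieces = {}
for team, group in df.groupby("team"):      # split: one sub-DataFrame per team
    pieces[team] = group["points"].mean()   # apply: a function on each group

manual = pd.Series(pieces, name="points")   # combine: results into one Series
manual.index.name = "team"

# The built-in call performs the same three stages in one line.
builtin = df.groupby("team")["points"].mean()

print(manual)
print(builtin)
```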
When should I use transform instead of agg in pandas?
Use transform when you need a result for every original row — like each value as a share of its group total — because it returns the same shape as the input. Use agg when you want one summarized row per group.
Does groupby compute anything on its own?
No. df.groupby("team") is lazy — it only describes how rows are split. Nothing is calculated until you chain an apply step such as .mean(), .agg(), .transform(), or .filter().