My team (Data Science) at Smart Data Foundry have started to spend an hour of CPD time together once a week. July is our first month doing it, and I figured I would use it as an excuse to blog a little bit.

2025-07-10

Advent of Code 2024 Day 1, Part 1

I did AoC Day 1 last year and challenged myself to only use base R. We figured this would be a nice little problem to start our sessions with.

This time, I wanted to try using Python. I'm kinda enjoying the language now, and between boot.dev and work I'm using it more than R. But! Instead of the standard library, I wanted to try using Polars and compare the solution to my base R one. Recently I've used Polars a fair bit for work, but the operations I needed have been very simple. My spidey senses told me that if I used Polars then I'd end up learning something new.

I also used uv, the hot new Python project/package manager, to start the project off:

```bash
uv init
uv venv
source .venv/bin/activate
uv add ruff
uv add polars
```

That creates a few files, including a `uv.lock` and `pyproject.toml`. These essentially contain metadata such as the Python and package versions. The files can be copied into another project, or a container image, to install the same version of Python and packages with a simple `uv sync`.
Solution
```python
import polars as pl

# The puzzle input has no commas, so read_csv pulls each line in as a
# single string column, which then gets split into two fields.
df = (
    pl.read_csv("aoc_day1/input.txt", has_header=False)
    .with_columns(
        column_1=pl.col("column_1")
        .str.split_exact("   ", 1)
        .struct.rename_fields(["l", "r"])
    )
    .unnest("column_1")
)

# Cast both columns to integers and sort each one independently.
df = df.with_columns(
    l=pl.col("l").cast(pl.Int32).sort(),
    r=pl.col("r").cast(pl.Int32).sort(),
)

# Part 1 answer: the sum of absolute pairwise differences.
res = df.select((pl.col("r") - pl.col("l")).abs().sum()).item()

print(res)
```
Comparing to the R version of the code:

It's more verbose. Most of that comes down to fundamental differences between base R and Polars. In this case, which is common, the R solution uses nested function calls. We also benefit from data.frames being native data structures, and from `read.table`, which does two things for us: it identifies the delimiter (three whitespace characters) and infers the types correctly (numeric).

The Polars solution is fiddlier. The csv parser can only take a single-byte delimiter, meaning no regex and no multiple whitespace. I had to look this up, because it seemed frankly idiotic for a modern data-processing package, and read the author saying they did this on purpose: it puts speed before anything else, and speed is at the heart of Polars. Doing anything else would go against that. Not so idiotic after all!!!

Anyway, this leads to extra processing, because the table gets read as a single string column (the default comma separator never appears in the input, so each line comes through as one field). When it gets split, it becomes a Polars struct. I didn't know what a struct actually was, though I saw the name appear in the user guide as I hacked on projects for work. Turns out they're essentially a typed dictionary.
`str.split_exact()` returns "a struct of n+1 fields". So the result of `split_exact("   ", 1)` on a row of data like `50123   10023` in a column called `column_1` is:

```python
{"field_0": "50123", "field_1": "10023"}
```

Every row gets a dict like that; taken together, they're the values in the struct.
Calling `unnest()` splits a struct into columns. The column names come from whatever the fields are called. So, all in all, it's a little bit like an R list-column situation.
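To make that concrete, here's a minimal sketch on toy data (my own illustration, not the puzzle input):

```python
import polars as pl

# One row of fake input, shaped like the puzzle's "50123   10023" lines.
demo = pl.DataFrame({"column_1": ["50123   10023"]})

# Splitting turns the string column into a struct column; without a
# rename_fields() call, the fields keep default names (field_0, field_1).
split = demo.with_columns(pl.col("column_1").str.split_exact("   ", 1))
print(split.schema)  # column_1 is now a Struct of two string fields

# unnest() then promotes each struct field to its own column.
print(split.unnest("column_1"))
```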
Once the columns were unnested they were still strings, and had to be cast to the correct integer type. And then finally we get to do the calculation, which also looks way more verbose to me. One benefit of chaining methods is that you get to see what happens sequentially. That's not the case with nested function calls, like in the R code. I don't like the amount of noise in having to refer to columns with `pl.col("l")`, but it's not that bad in this case, to be honest.
2025-07-17
Advent of Code 2024 Day 1, Part 2
Same deal as last week, but part two of the problem.
Solution
```python
# df was defined in the week 1 section
counts = df.select("r").to_series().value_counts()

joined = df.join(counts, how="inner", left_on="l", right_on="r")

joined = joined.with_columns(multiplied=pl.col("count") * pl.col("l"))

summed = joined.select("multiplied").sum().item()

print(f"Part 2 result: {summed}")
```
Comparing to the R version:
It's a similar approach: make a frequency table of the values in the right-hand column, join those values onto the original table, multiply the left-hand values against the frequency values, and sum up the results.

I couldn't remember what I did in the R version, but my mind went to almost exactly the same places. In both versions I started by making a frequency table, which was trivial. However, in the R version I first filtered the values in the right-hand column by those of the left-hand column, so that the frequency table would be much smaller (there were only 37 shared values between the columns in my dataset). This seemed more efficient for the later steps, because with a smaller dataframe there would be fewer comparisons to make.
This was what I wanted to do in the Polars version, too, but it was fiddly. There might be a better way of doing this, but my code ended up looking like:
= df.select("r").to_series().is_in(df.select("l").to_series())
mask = df.select("r").to_series().filter(mask).value_counts() counts
That's not terrible, but to me the R way of subsetting a vector looks far more elegant. Once again Polars is more verbose; I get lost trying to read the repetition of `select` and `to_series` and the brackets. There is a `df.to_series()` method, so the code could be streamlined slightly, but it takes a column's position as its argument, rather than a name. I stand resolute and refuse to do `df.to_series(0)`. That… is opaque bullshit!
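For what it's worth, a by-name route does exist: `get_column()` pulls a Series out of a DataFrame by name. Here's a minimal sketch of the same filtering with it (not what my solution uses):

```python
# Sketch only: get_column() returns a Series by name, trimming the
# select()/to_series() repetition from the version above.
l = df.get_column("l")
r = df.get_column("r")
counts = r.filter(r.is_in(l)).value_counts()
```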
Instead, I did an inner join between the two DataFrames. An inner join acts like a "filtering join": the left-hand df gets filtered down to the rows that have a match in the right-hand df. And in the R version I did the same thing using `merge()`.
At this juncture of writing the blog post, something quite nice happened: I realised that the R code was doing some redundant things. Because `merge()` does an inner join by default, there was no need to specify `all.x = TRUE`. In fact, it causes the opposite behaviour of what is needed: it's like saying "yes, make sure you keep all the rows from the left-hand df, and if you don't mind, add NAs whenever there was no matching row in the right-hand df!". That's why the subsequent sum requires `na.rm = TRUE`.
In my opinion the more elegant approach is the Polars version: don't bother filtering the frequency table first; simply allow the inner join to do its job and filter for you.
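To see the difference on toy data (my illustration, not the puzzle input):

```python
# A left join keeps every left-hand row, filling missing counts with
# null (the NA situation the R code had to na.rm away); an inner join
# simply drops the unmatched rows instead.
left_df = pl.DataFrame({"l": [1, 2, 3]})
freq = pl.DataFrame({"r": [1, 3], "count": [2, 5]})

print(left_df.join(freq, how="left", left_on="l", right_on="r"))
# counts: 2, null, 5

print(left_df.join(freq, how="inner", left_on="l", right_on="r"))
# only the rows for 1 and 3 survive
```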
With those changes made, the R code is even slicker (isn't it satisfying to delete things??).
All in all:
- Humbling to look back at simple problems and try to solve them with a new tool
- Satisfying to revisit concepts and end up streamlining old code
- Interesting to observe what feels nicer to read as I look at the code. Never underestimate how important feel is to craft