# Optimizing Game of Life with Goroutines and Bit Packing in Go
## How I Got Here
I stumbled into Conway’s Game of Life from a YouTube video. A zero-player game - the evolution is entirely determined by the initial state, no input needed. Just a grid of cells, a few simple rules, and emergent complexity that still surprises me.
I figured I’d implement it in Go, get it working, and move on. Instead, I spent a weekend trying to make it fast. The naive version was too slow for large grids, which sent me down a rabbit hole of bit packing and goroutines.
## The Rules
The game takes place on a 2D grid. Each cell is alive or dead, and the next generation follows a simple rule:

$$s' = \begin{cases} 1 & \text{if } s = 1 \text{ and } n \in \{2, 3\} \\ 1 & \text{if } s = 0 \text{ and } n = 3 \\ 0 & \text{otherwise} \end{cases}$$

where $s$ is the current cell state, $n$ is the count of live neighbours, and $s'$ is the state in the next generation. In plain English:
- Live cell with 2 or 3 neighbours → survives
- Dead cell with exactly 3 neighbours → becomes alive
- Everything else → dies or stays dead
For example, a live cell [1] with 4 live neighbours dies - overpopulation kills it. A dead cell [0] with exactly 3 live neighbours - reproduction brings it alive.
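The rule is compact enough to express as a pure function. A minimal sketch (the `nextState` helper is my naming, not from the project):

```go
package main

import "fmt"

// nextState applies the Game of Life rule to one cell:
// a live cell with 2 or 3 live neighbours survives; a dead
// cell with exactly 3 live neighbours becomes alive.
func nextState(alive bool, neighbours int) bool {
	if alive {
		return neighbours == 2 || neighbours == 3
	}
	return neighbours == 3
}

func main() {
	fmt.Println(nextState(true, 4))  // false: overpopulation
	fmt.Println(nextState(false, 3)) // true: reproduction
}
```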
## The Naive Version
My first pass used a 2D grid of bool values. Straightforward, easy to reason about.
```go
type Game struct {
	World     []bool // current world
	NextWorld []bool // next world (temporary)
	Rows      int
	Cols      int
}
```
I used a flat 1D array instead of a 2D slice - index math is simpler and memory layout is contiguous, which helps cache performance.
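The index math is just a multiply and an add. A quick sketch (the `index` helper is illustrative, not from the project):

```go
package main

import "fmt"

// index maps (row, col) on a Rows x Cols grid to a flat slice index.
// Cells in the same row are adjacent in memory, which is cache-friendly.
func index(row, col, cols int) int {
	return row*cols + col
}

func main() {
	const cols = 2048
	fmt.Println(index(0, 0, cols)) // 0
	fmt.Println(index(1, 0, cols)) // 2048: the next row starts right after
	fmt.Println(index(1, 5, cols)) // 2053
}
```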
For each cell, I counted live neighbours with a nested loop over the 3x3 neighborhood:
```go
func (g *Game) aliveNeighbours(row, col int) int {
	alive := 0
	for i := row - 1; i <= row+1; i++ {
		for j := col - 1; j <= col+1; j++ {
			if i == row && j == col {
				continue
			}
			ni := (i + g.Rows) % g.Rows
			nj := (j + g.Cols) % g.Cols
			idx := ni*g.Cols + nj
			if g.World[idx] {
				alive++
			}
		}
	}
	return alive
}
```
The modulo wrapping handles edge cells by treating the grid as a torus - top connects to bottom, left to right.
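The wrapping can be checked in isolation. A small sketch, assuming an 8-row grid (the `wrap` helper is my naming):

```go
package main

import "fmt"

// wrap maps an index one step off either edge back onto the grid,
// treating the axis as circular. Adding n first keeps the operand
// non-negative, since Go's % can return negative results.
func wrap(i, n int) int {
	return (i + n) % n
}

func main() {
	const rows = 8
	fmt.Println(wrap(-1, rows)) // 7: above the top edge wraps to the bottom row
	fmt.Println(wrap(8, rows))  // 0: below the bottom edge wraps to the top row
	fmt.Println(wrap(3, rows))  // 3: interior indices are unchanged
}
```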
The generation update was the obvious double-loop:
```go
func (g *Game) NextGeneration() {
	for i := 0; i < g.Rows; i++ {
		for j := 0; j < g.Cols; j++ {
			idx := i*g.Cols + j
			alive := g.aliveNeighbours(i, j)
			if g.World[idx] {
				g.NextWorld[idx] = alive == 2 || alive == 3
			} else {
				g.NextWorld[idx] = alive == 3
			}
		}
	}
	copy(g.World, g.NextWorld)
}
```
It worked. But on a 2048×2048 grid, each generation was painfully slow. The problem: for every single cell, I’m iterating all 8 neighbours and calling aliveNeighbours from scratch. That’s 2048 × 2048 × 8 ≈ 33 million checks per generation.
## Optimization 1: Bit Packing
This is where things got interesting. The key insight: a cell’s state is 1 bit (alive/dead), and the neighbour count fits in 4 bits (max 8 neighbours). That’s 5 bits total - fits in a single byte.
```mermaid
graph LR
    Byte[byte: 8 bits] --> Unused[Bits 7-5<br/>unused]
    Byte --> Count[Bits 4-1<br/>neighbour count]
    Byte --> State[Bit 0<br/>alive/dead]
```
Instead of counting neighbours every generation, I maintain a running count. When a cell changes state, I update its neighbours’ counts incrementally.
```go
type Game struct {
	World     []byte
	NextWorld []byte
	Rows      int
	Cols      int
}
```
The bit operations to manage cell state:
```go
g.World[idx] |= 0x01 // set alive (LSB = 1)
g.World[idx] &= 0xFE // set dead (LSB = 0)
```
And to update neighbour counts when a cell changes:
```go
if alive { // the changed cell is now alive
	g.World[idx] += 0x02 // increment this neighbour's count (bits 4-1)
} else { // the changed cell is now dead
	g.World[idx] -= 0x02 // decrement this neighbour's count
}
```
Since the state occupies bit 0, the neighbour count starts at bit 1. Adding/subtracting 0x02 increments/decrements the count without touching the state bit.
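A standalone check of the packing arithmetic, showing that count updates never clobber the state bit:

```go
package main

import "fmt"

func main() {
	var cell byte = 0x01 // alive, zero neighbours

	cell += 0x02 // one neighbour became alive
	cell += 0x02 // another neighbour became alive

	fmt.Println(cell & 0x01) // 1: state bit untouched by the count updates
	fmt.Println(cell >> 1)   // 2: neighbour count lives in bits 4-1

	cell &= 0xFE             // kill the cell
	fmt.Println(cell >> 1)   // 2: count untouched by the state update
}
```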
The generation update becomes much cheaper:
```go
func (g *Game) NextGeneration() {
	copy(g.NextWorld, g.World)
	tempWorld := g.NextWorld // snapshot of the current generation
	for y := 0; y < g.Rows; y++ {
		for x := 0; x < g.Cols; x++ {
			idx := y*g.Cols + x
			if tempWorld[idx] == 0 {
				continue // skip dead cells with zero live neighbours
			}
			liveNeighbours := tempWorld[idx] >> 1 // extract count from bits 4-1
			if tempWorld[idx]&0x01 == 0x01 { // the cell is alive
				if liveNeighbours < 2 || liveNeighbours > 3 {
					g.UpdateCell(x, y, false)
				}
			} else {
				if liveNeighbours == 3 {
					g.UpdateCell(x, y, true)
				}
			}
		}
	}
}
```
The tempWorld[idx] == 0 check is the big win - if a byte is zero, the cell is dead AND has zero live neighbours, so nothing can possibly change. In a typical Game of Life, most cells are in this state. This skips the vast majority of the grid.
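NextGeneration delegates to UpdateCell, which isn’t shown in the post. Here is a sketch of what it plausibly looks like given the bit operations above - the actual implementation in the repo may differ:

```go
package main

import "fmt"

type Game struct {
	World []byte
	Rows  int
	Cols  int
}

// UpdateCell flips a cell's state bit and incrementally adjusts the
// packed neighbour count (bits 4-1) of all 8 torus neighbours.
func (g *Game) UpdateCell(x, y int, alive bool) {
	idx := y*g.Cols + x
	if alive {
		g.World[idx] |= 0x01 // set the state bit
	} else {
		g.World[idx] &= 0xFE // clear the state bit
	}
	for dy := -1; dy <= 1; dy++ {
		for dx := -1; dx <= 1; dx++ {
			if dx == 0 && dy == 0 {
				continue // skip the cell itself
			}
			ny := (y + dy + g.Rows) % g.Rows
			nx := (x + dx + g.Cols) % g.Cols
			if alive {
				g.World[ny*g.Cols+nx] += 0x02
			} else {
				g.World[ny*g.Cols+nx] -= 0x02
			}
		}
	}
}

func main() {
	g := &Game{World: make([]byte, 16), Rows: 4, Cols: 4}
	g.UpdateCell(1, 1, true)
	fmt.Println(g.World[5] & 1)  // 1: cell (1,1) is alive
	fmt.Println(g.World[0] >> 1) // 1: neighbour (0,0) now counts one live neighbour
}
```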
## Optimization 2: Concurrency with Goroutines
The byte-packed version was already fast, but the grid update is embarrassingly parallel - each cell’s next state depends only on the current generation, so rows can be processed independently.
```mermaid
graph TD
    Grid[Grid: N rows] --> W1[Worker 1<br/>rows 0..k]
    Grid --> W2[Worker 2<br/>rows k..2k]
    Grid --> W3[Worker ...<br/>...]
    Grid --> WN[Worker M<br/>rows N-k..N]
    W1 --> WG[sync.WaitGroup]
    W2 --> WG
    W3 --> WG
    WN --> WG
    WG --> Next[Next Generation]
```
I split the grid into chunks of `Rows / runtime.NumCPU()` rows and assigned each chunk to a goroutine:
```go
var wg sync.WaitGroup
numWorkers := runtime.NumCPU()
rowsPerWorker := g.Rows / numWorkers

worker := func(startRow, endRow int) {
	defer wg.Done()
	// process rows [startRow, endRow) as in NextGeneration
	...
}

for i := 0; i < numWorkers; i++ {
	wg.Add(1)
	startRow := i * rowsPerWorker
	endRow := startRow + rowsPerWorker
	if i == numWorkers-1 {
		endRow = g.Rows // last worker takes the leftover rows
	}
	go worker(startRow, endRow)
}
wg.Wait()
```
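The chunking logic can be exercised on its own. A runnable sketch with hypothetical numbers (10 rows, 3 workers); the `partition` helper is my naming, not from the project:

```go
package main

import (
	"fmt"
	"sync"
)

// partition splits rows into numWorkers contiguous [start, end) ranges;
// the last worker absorbs any remainder from the integer division.
func partition(rows, numWorkers int) [][2]int {
	rowsPerWorker := rows / numWorkers
	ranges := make([][2]int, numWorkers)
	for i := 0; i < numWorkers; i++ {
		start := i * rowsPerWorker
		end := start + rowsPerWorker
		if i == numWorkers-1 {
			end = rows
		}
		ranges[i] = [2]int{start, end}
	}
	return ranges
}

func main() {
	owner := make([]int, 10) // which worker touched each row
	var wg sync.WaitGroup
	for id, r := range partition(10, 3) {
		wg.Add(1)
		go func(id, start, end int) {
			defer wg.Done()
			for row := start; row < end; row++ {
				owner[row] = id // rows are disjoint, so no data race
			}
		}(id, r[0], r[1])
	}
	wg.Wait()
	fmt.Println(owner) // [0 0 0 1 1 1 2 2 2 2]
}
```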
This immediately introduced a race condition - multiple goroutines calling UpdateCell on adjacent rows would modify overlapping neighbour counts. The fix was a mutex:
```go
type Game struct {
	...
	mu sync.Mutex
}

func (g *Game) NextGeneration() {
	...
	if liveNeighbours < 2 || liveNeighbours > 3 {
		g.mu.Lock()
		g.UpdateCell(x, y, false)
		g.mu.Unlock()
	}
	...
}
```
The mutex adds some contention, but the parallelism wins overall - especially on larger grids where there’s more work per row and less relative time spent on boundary cells.
## The Results
I benchmarked all three versions on different grid sizes, measuring time per generation:
```mermaid
graph LR
    V1[Naive<br/>bool grid] -->|x50-85 faster| V2[Byte Packed<br/>bit operations]
    V2 -->|x2-5 faster| V3[Byte + Goroutines<br/>parallel rows]
```
| Implementation | 512×512 | 1024×1024 | 2048×2048 |
|---|---|---|---|
| Naive (bool) | x1 | x1 | x1 |
| Byte packed | x85 | x50 | x79 |
| Byte packed + goroutines | x179 | x265 | x320 |
The byte packing alone gave a 50-85x speedup - mostly from skipping dead cells and avoiding the per-cell neighbour counting loop. Concurrency added roughly another 2-5x on top of that, with the gains scaling as grid size increased (more work to distribute).
The 320x speedup on 2048×2048 was satisfying. What started as a slow simulation running a few generations per second became smooth real-time animation.
## What Made the Biggest Difference
| Optimization | Why It Helped |
|---|---|
| Flat 1D array | Contiguous memory, better cache locality |
| Byte packing | Skip dead cells (byte == 0), no per-cell neighbour loop |
| Incremental counts | Update only affected neighbours when a cell changes |
| Goroutines | Parallel row processing, scales with CPU cores |
| Mutex on writes | Minimal - only locks during cell state changes, not reads |
The biggest lesson: the algorithmic optimization (byte packing + skipping dead cells) mattered far more than the concurrency. Throwing goroutines at the naive version wouldn’t have gotten close to these numbers. Fix the algorithm first, parallelize second.
The code is on GitHub: tasnimzotder/artificial-life
## References
- Conway’s Game of Life. (2023, November 30). In Wikipedia. https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life
- Abrash, M. (1997). Michael Abrash’s Graphics Programming Black Book (Special Edition). Coriolis Group Books.
- Gerrand, A. (2013, January 16). Concurrency is not parallelism. The Go Programming Language. https://go.dev/blog/waza-talk