Grab a brew and gather ’round, folks, for a tale as old as time, or at least as old as the swag bag from that tech conference you barely remember. We’re diving into the world of database design, a topic guaranteed to unleash the inner “thought leader” in every engineer’s soul.

The Prelude

During one of my gigs I was tasked with reviewing a customer app, looking for bottlenecks and potential red flags that would lead to issues down the line. Its database schema caught my eye. It looked like the victim of a raccoon who took revenge after someone locked him in a room with zero snacks and a mechanical keyboard. Scattered tables, redundant fields, and types so bloated they’d need their own zip code.

I made a rookie mistake and started suggesting sensible improvements:
“Hey, maybe we trim cross-links? Normalize to 3NF? Shrink those data types?”
In hindsight, I should’ve NOPEd straight out to “check on that deployment” and spared myself the carnage.

The Resistance

“WHAT?!” one engineer cut in, nearly baptizing his MacBook with his mocha latte.
“WON’T SCALE, WON’T PERFORM!” he bellowed, channelling every HackerNews hot take since the dawn of upvotes.
“And what’s wrong with my suggestions?” I shot back, innocent as a junior dev on day one.

What followed was so akin to ‘cargo culting’ it would’ve qualified 19th-century Melanesians to put ‘Senior Software Architect’ on their LinkedIn profiles.

“If we do that, we’ll have to do JOINs!” the engineer shrieked, as if ThePrimeagen had just released a two-hour rant about JOINs being worse than a C compiler written in TypeScript.

“Our DB’s like that on purpose: fewer JOINs, more performance,” he declared, like he’d cracked the secret to infinite scalability.

The Horror

Then it hit me. Each table looked like a clone of its sibling. Not a typical clone, mind you, but the “something definitely went wrong in the lab” variety, where each subsequent table emerged from the cloning chamber more disfigured and bloated, yet somehow still functional enough to store data…

“And those foreign keys?” I pressed. “There are like ten per table!”
“GENIUS, RIGHT?” he grinned. “Simplifies JOINs when we must write ’em.”
“AND BY THE WAY, type minification is dumb. We’re rocking UUIDs for primary keys—memory’s cheap, bro!”

I nodded, marveling at his foresight. “Sure, gotta prep for when we balloon past 5.3 undecillion records…”

“If foreign keys are the issue, we can put our data into JSON blobs and store them in a single table!” another engineer chimed in before my brain could reboot. “MySQL can handle that!”

I froze, awestruck, thinking, “Where have I heard this before?”
“Got it,” I muttered under my breath. “We’re reinventing NoSQL… inside SQL”.

The Apogee

The clock was ticking; any second now, someone could suggest solving everything by adding more columns.

Perhaps it wasn’t too late, and dropping a truth bomb would slap some sense into anyone who thinks ‘schema’ is just a fancy word for the ORM-induced vomit that comes out after a late-night coding binge.

I needed something big, something proven, something well documented, something with a similar stack. And that something was staring at me from another tab:

The seventh most visited website on the planet, handling 83 billion media requests a month, with over 118 million registered users and over 200,000 monthly sign-ups. No AI-crypto-blockchain nonsense, just good old PHP and MySQL, humming along like a champ on a victory lap.

Its name? Wikipedia.

The Revelation

So, how does this seventh wonder of the digital world organize its data? Let’s pop the hood and find out.
Wikipedia is driven by MediaWiki – an open source content management system. Its database schema is fully documented and well worth examining.

MediaWiki’s schema consists of 59 tables, organized by function:

  • 5 are used for cache management
  • 17 are used for user, user permission, and user activity management
  • 17 are used for page content management
  • 14 are used for link, stat, search and log management
  • 11 are used for maintenance, caching and multi-site support management

Seventeen content-related tables? In 2025!?
What’s even more surprising: each content-related table has, on average, only two foreign keys! The horror! The normalization! The… scalability?

To better understand the structure let’s go digging for gold, I mean content.

page table

We’ll start off with the core of MediaWiki, the page table.
Its name, in its singular form, suggests it comes from years long past: 2005, to be exact.
Every record in this table represents a single Wikipedia article.

  • Its primary key? page_id
  • Its data type? A humble unsigned int

Not a UUID, not a binary(16), not even a bigint, just an int. It’s almost like we’re living in a rational universe where indexed column minimization matters.
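The size of an indexed column matters more than “memory’s cheap, bro” suggests: the smaller the key, the more of the index fits in each page, and in cache. Here’s a back-of-envelope sketch, counting key payload only (real InnoDB pages also carry record headers and pointers, so treat these as rough upper bounds, not exact numbers):

```python
# Back-of-envelope: how many index keys fit in a 16 KB InnoDB page?
# Key payload only -- real pages also hold headers and record pointers.
PAGE_SIZE = 16 * 1024  # InnoDB's default page size, in bytes

key_sizes = {
    "unsigned int (page_id)": 4,
    "unsigned bigint": 8,
    "UUID as binary(16)": 16,
}

for name, size in key_sizes.items():
    print(f"{name:>24}: ~{PAGE_SIZE // size} keys per page")
```

Four times as many int keys as UUID keys per page means a shallower B-tree and fewer trips to RAM or SSD per lookup.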

One thing we won’t find in this table, however, is a direct reference to the page content.
We’ll need to follow the yellow brick road through the page_latest foreign key to the revision table.

revision table

Wikipedia stores all article revisions – yes, all of them! With over 7 million articles averaging 183 revisions each (and some exceeding two million), you’d expect some database-melting primary key…

Instead we find:

+----------------+---------------------+------+-----+---------+
| Field          | Type                | Null | Key | Default |
+----------------+---------------------+------+-----+---------+
| rev_id         | int(10) unsigned    | NO   | PRI | NULL    |
| rev_page       | int(10) unsigned    | NO   | MUL | NULL    |

*rev_id was widened to unsigned bigint on December 21st, 2024
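A quick sanity check using the rough figures quoted above (the article’s estimates, not live Wikipedia statistics) shows why that widening happened: revision IDs were already well on their way toward the unsigned int ceiling.

```python
# How close did revision IDs get to the unsigned int ceiling?
# Figures are the rough estimates quoted above, not live Wikipedia stats.
UINT_MAX = 2**32 - 1  # 4,294,967,295 -- max unsigned int
ARTICLES = 7_000_000
AVG_REVISIONS = 183

revisions = ARTICLES * AVG_REVISIONS
print(f"~{revisions:,} revisions")                         # ~1,281,000,000
print(f"~{revisions / UINT_MAX:.0%} of the uint ceiling")  # ~30%
```

And that’s before counting talk pages, redirects, and everything outside the main namespace, which also consume IDs from the same sequence.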

“But where’s the content?!” someone yells, spilling a bit of that mocha-latte-grande-americano out of an Apple Find-My enabled heated bluetooth coffee mug.

Turns out you can’t get to the content from the revision table directly, and that’s not a bad thing.

The Myth

In well-designed relational databases, tables connect like gods in ancient myths—distinct yet part of a greater cosmos. Primary tables stand as Olympian deities: Zeus rules the skies, Poseidon the seas, Hades the underworld—each sovereign yet linked in a divine family tree. Secondary tables? They’re the demigods—Perseus or Heracles—tracing their lineage to a primary table through a single foreign key. Like Athena guiding Odysseus, the relationship flows one way: clear, purposeful.

Lack of cross-links isn’t isolation—it’s purposeful storytelling. You know your app’s needs: a user table links to orders, not the other way around. Think Hercules wielding his spear—fierce and solo, yet bound to Olympus when the gods call.

Well-placed foreign keys are like Odysseus’s voyage: a clean path from user_id to order_id lets you trace the hero’s tale without unrolling the whole Iliad. But when complex JOINs arise—say, orders to payments to refunds—their paths should gleam like Apollo’s golden chariot, not snarl like the Hydra’s heads.

The Legend

slots table

Wikipedia’s schema tells us a story, a story of a product scaling beyond initial expectations. You can see it unfold in the plural form of the table name, a product of the 2020 schema revisions, and in the fact that you can’t get to this table from either the page or the revision table.

This abstraction allows different types of content to be associated with a single revision.
Naming is a matter of taste, and perhaps a name like article_sections would be more fitting, but who cares; slots, roles, revisions, it clearly works…

+------------------+----------------------+------+-----+
| Field            | Type                 | Null | Key |
+------------------+----------------------+------+-----+
| slot_revision_id | bigint(20) unsigned  | NO   | PRI |
| slot_role_id     | smallint(5) unsigned | NO   | PRI |
| slot_content_id  | bigint(20) unsigned  | NO   |     |

This table is the first to use a much bigger primary key: a composite of an unsigned bigint and an unsigned smallint.
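The composite key is easy to sketch. In this miniature SQLite version (column names follow MediaWiki; types and values are simplified for the demo), one revision can carry several slots, one per role, while the same (revision, role) pair can exist only once:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE slots (
        slot_revision_id INTEGER NOT NULL,
        slot_role_id     INTEGER NOT NULL,
        slot_content_id  INTEGER NOT NULL,
        PRIMARY KEY (slot_revision_id, slot_role_id)
    )
""")

# One revision, two different roles: fine.
db.execute("INSERT INTO slots VALUES (500, 1, 9000)")  # main text
db.execute("INSERT INTO slots VALUES (500, 2, 9001)")  # a second content role

# Same revision and role again: the composite key rejects it.
try:
    db.execute("INSERT INTO slots VALUES (500, 1, 9999)")
except sqlite3.IntegrityError:
    print("duplicate (revision, role) pair rejected")
```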

content table

At last, we reach the content table via the slot_content_id in the slots table.

+-----------------+----------------------+------+-----+
| Field           | Type                 | Null | Key |
+-----------------+----------------------+------+-----+
| content_id      | bigint(20) unsigned  | NO   | PRI |
| content_size    | int(10) unsigned     | NO   |     |
| content_sha1    | varbinary(32)        | NO   |     |
| content_model   | smallint(5) unsigned | NO   |     |
| content_address | varbinary(255)       | NO   |     |
+-----------------+----------------------+------+-----+

But where is the actual content? Who knows? Perhaps behind CDNs and storage providers like Cloudflare or AWS S3, or perhaps stored in other DBs, SQL, NoSQL, maybe even a cheeky CSV here and there.

So was this all a scam, and we found that there’s just a JSON at the end of the rainbow?

No, we found a schema that is built to grow. Want to build a wiki that supports built-in video? How about embedded 3D models? This table supports it all! Just define a new content_model and point to the most appropriate source via content_address.
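Here’s a sketch of what “just define a new content_model” could look like, again in SQLite, with made-up model IDs and addresses (MediaWiki itself resolves models and addresses through its own lookup tables, so this is just the shape of the idea):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE content (
        content_id      INTEGER PRIMARY KEY,
        content_model   INTEGER NOT NULL,
        content_address TEXT NOT NULL
    )
""")

WIKITEXT, VIDEO = 1, 7  # hypothetical model IDs

# Classic article text, addressed into internal text storage...
db.execute("INSERT INTO content VALUES (9000, ?, 'tt:12345')", (WIKITEXT,))
# ...and a brand-new video model pointing at object storage instead.
# No schema change needed -- just a new model ID and a different address.
db.execute("INSERT INTO content VALUES (9001, ?, 's3://wiki-media/clip.mp4')", (VIDEO,))

for row in db.execute("SELECT * FROM content ORDER BY content_id"):
    print(row)
```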

It’s modular magic—new features pop in, scalability holds, and the relational heart keeps beating.

Last Battle

So, having looked at the schema, one question remains: “What would a query that returns the text of a Wikipedia article look like?”

SELECT
   p.page_id,
   p.page_title,
   c.content_address
FROM 
   page p
   -- First hop: From page to its latest revision
   INNER JOIN revision r ON p.page_latest = r.rev_id
   -- Second hop: From revision to content slots
   INNER JOIN slots s ON r.rev_id = s.slot_revision_id 
          AND s.slot_role_id = 1 
          -- 1 holds the page content
   -- Third hop: From slots to content table
   INNER JOIN content c ON s.slot_content_id = c.content_id
   -- Fourth hop: To the actual content blob!
WHERE 
   p.page_namespace = 0  -- Main article namespace
   AND p.page_title = 'MediaWiki'  -- Our quarry!
LIMIT 1;

That’s three JOINs, plus a fourth hop to external storage, just to fetch a single article! According to some, this should bring Wikipedia to its knees. And yet, somehow, the site manages to serve 83 billion requests a month.

Why does this work? Because it’s not so much the JOINs that impact performance as the number of times the database server’s CPU must hit RAM or the SSD. If you carefully manage which columns are indexed and how big those columns are, you can easily fit enough of them into the CPU’s cache that even ten joins will cost you next to nothing.
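To see the shape of those joins without a Wikipedia-sized dataset, here’s the same walk rebuilt in miniature with SQLite (column names follow MediaWiki; the data is invented). Every hop resolves through a primary-key lookup, which is exactly what keeps the path cheap:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page     (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                           page_title TEXT, page_latest INTEGER);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    CREATE TABLE slots    (slot_revision_id INTEGER, slot_role_id INTEGER,
                           slot_content_id INTEGER,
                           PRIMARY KEY (slot_revision_id, slot_role_id));
    CREATE TABLE content  (content_id INTEGER PRIMARY KEY, content_address TEXT);

    INSERT INTO page     VALUES (1, 0, 'MediaWiki', 500);
    INSERT INTO revision VALUES (500, 1);
    INSERT INTO slots    VALUES (500, 1, 9000);
    INSERT INTO content  VALUES (9000, 'tt:424242');
""")

# The same three hops as the query above: page -> revision -> slots -> content.
row = db.execute("""
    SELECT p.page_title, c.content_address
    FROM page p
    JOIN revision r ON p.page_latest = r.rev_id
    JOIN slots s    ON r.rev_id = s.slot_revision_id AND s.slot_role_id = 1
    JOIN content c  ON s.slot_content_id = c.content_id
    WHERE p.page_namespace = 0 AND p.page_title = 'MediaWiki'
""").fetchone()
print(row)  # ('MediaWiki', 'tt:424242')
```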

The Takeaway

I laid out my Wikipedia truth bomb with all the confidence of someone who once fixed production by turning it off and on again. The schema, the joins, the proper normalization – it was beautiful, rational, time-tested.

“So,” I ventured, clearing my throat awkwardly, “what do you think?”

The room fell silent. The kind of silence that makes you wonder if you just committed a career-limiting move. The engineer’s face was unreadable, somewhere between constipation and revelation.

Had I gotten through? Would this be the watershed moment when cargo-culting gave way to rational database design? Or was I about to be escorted from the building while being lectured about “modern approaches” to data storage?

My palms started sweating. Who was I to lecture anyone? Just last week I’d spent three hours debugging only to find I’d misspelled “database” as “databse” in my connection string.

The lead engineer’s coffee mug hovered midway to his mouth. His eyes darted between his MacBook screen and my increasingly uncomfortable face.

“Well…” he finally began, just as someone’s Slack notification shattered the tension.

“Hey team, just checking – how are we doing on those new features?”

We all turned to see the product manager standing in the doorway, tablet in hand, blissfully unaware of our architectural showdown.

The engineer glanced at me, then back at his screen. I could almost hear the gears turning. Would he choose the path of normalization, or would the siren song of “just add another column” prove too seductive?

What would you do? Stick to your guns and potentially alienate the team? Or go with the flow, adding columns until you hit whatever arbitrary limit MySQL sets these days, all while a little voice in your head screams about referential integrity?

As for me, I’m still waiting for an answer. Though I suspect my invitation to the next architecture review mysteriously got lost in the mail.

Maybe I should have just nodded and suggested we look into MongoDB.
