Clik - A Self-Deployable URL Shortener
Building a URL shortener is a classic system design interview question. I built one to find out how many of those whiteboard answers actually survive contact with real code.
The Stack
No abstractions over abstractions. Here is what it runs on:
- Java 21
- Spring Boot 3.4.2 handles HTTP so I can focus on what matters underneath
- MySQL 8.0 (the database choice is load-bearing, explained below)
- Spring Data JPA + Hibernate with Envers for write auditing
- Spring Security with JWT-based auth for URL ownership
- Maven for builds, nothing clever
Two tables. That is the entire data model.
```
short_urls -> internal_id (BIGINT PK), long_url, short_code, user_id (FK, nullable), click_count, created_at
users      -> id (PK), email, password_hash, first_name, last_name, ...
```

The redirect path is a single indexed lookup on short_code. That query runs on every click, so it is the one that matters.
The ID Problem
I did not use AUTO_INCREMENT. I did not use UUID. Here is why neither worked for me.
A UUID is a 128-bit number, not a string. The mistake most people make is storing it as VARCHAR(36) (the human-readable hyphenated form) instead of BINARY(16). If you store a UUID as BINARY(16), comparisons are fast and the index stays compact. But the real problem is still randomness. UUID v4 is randomly generated, so every insert lands at a random position in the B-tree, not the end. That causes node splits, which scatter data across disk pages. At millions of rows a randomly-keyed table fragments badly, and reads that miss the buffer pool pay disk seek latency.
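The BINARY(16) conversion is mechanical in Java. A small illustrative helper (my own, not from the Clik codebase) that packs the UUID's two 64-bit halves into the 16 bytes you would store:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative helper: map a UUID to/from the 16 raw bytes
// you would store in a MySQL BINARY(16) column.
public class UuidBytes {
    public static byte[] toBytes(UUID u) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(u.getMostSignificantBits());   // first 8 bytes
        buf.putLong(u.getLeastSignificantBits());  // last 8 bytes
        return buf.array();
    }

    public static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }
}
```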
UUIDv7 is worth knowing about. It is time-ordered with a 48-bit millisecond timestamp prefix, standardized in RFC 9562, and eliminates the fragmentation problem entirely. If I were starting fresh and wanted a 128-bit ID space without building my own generator, UUIDv7 is the honest answer. MySQL and Postgres support it natively as of recent versions.
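To give a sense of the layout, here is a rough UUIDv7 sketch following the RFC 9562 field order (my own illustration, not Clik code; production libraries also add per-process monotonicity guards):

```java
import java.security.SecureRandom;
import java.util.UUID;

// Rough UUIDv7 sketch per RFC 9562:
// unix_ts_ms (48 bits) | version = 7 (4 bits) | rand_a (12 bits)
// | variant = 10 (2 bits) | rand_b (62 bits)
public class UuidV7 {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static UUID generate() {
        long ms = System.currentTimeMillis();
        long randA = RANDOM.nextLong() & 0xFFFL;            // 12 random bits
        long msb = (ms << 16) | 0x7000L | randA;            // ts(48) | ver(4)=7 | rand_a(12)
        long lsb = (RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL)
                 | 0x8000000000000000L;                     // variant '10' | rand_b(62)
        return new UUID(msb, lsb);
    }
}
```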
I still went with Snowflake-style IDs because the shard ID is baked into every key. You can look at any internal_id and immediately know which machine generated it, which is useful for debugging and operational routing. UUIDv7 does not give you that without a separate column.
AUTO_INCREMENT is sequential and avoids fragmentation. But it couples your ID space to a single database node. The moment you shard, you have a coordination problem.
So I use a Snowflake-inspired 64-bit integer that is sequential, distributed, and carries operational metadata inside the number itself:
```
+--------------------------------------------------------------+
|  Timestamp (41 bits)  |  ShardID (10 bits)  |  Seq (12 bits) |
+--------------------------------------------------------------+
|      bits 62..22      |     bits 21..12     |   bits 11..0   |
+--------------------------------------------------------------+
```

(Bit 63, the sign bit, stays zero.)

- 41 bits of milliseconds since a custom epoch (2023-01-01 00:00:00 UTC). About 69.7 years of headroom before the timestamp field overflows.
- 10 bits of shard ID. 1024 unique shard identifiers embedded in every key. Reading bits 12 through 21 tells you which machine generated the ID.
- 12 bits of sequence. 4096 unique IDs per millisecond per shard before rollover.
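To make the "read the shard out of the key" point concrete, a decoder for this layout might look like the following (the accessor names are mine, not the repo's):

```java
// Illustrative decoder for the 41/10/12 bit layout described above.
public class SnowflakeFields {
    public static long timestamp(long id) {
        return id >>> 22;                       // bits 62..22: ms since custom epoch
    }
    public static int shardId(long id) {
        return (int) ((id >>> 12) & 0x3FF);     // bits 21..12: 10-bit shard ID
    }
    public static int sequence(long id) {
        return (int) (id & 0xFFF);              // bits 11..0: 12-bit sequence
    }
}
```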
The actual implementation:
```java
private synchronized long generateInternalId(int shardId) {
    long timestamp = System.currentTimeMillis() - epoch;     // ms since the custom epoch
    int sequenceNumber = sequence.getAndIncrement() & 0xFFF; // wrap to the low 12 bits
    return (timestamp << 22) | (shardId << 12) | sequenceNumber;
}
```

Two things are worth understanding here.
First, timestamp << 22. Shifting left by 22 puts the timestamp in the most significant bits of the 64-bit integer. Since timestamp dominates the value numerically, IDs generated later are always larger numbers than IDs generated earlier regardless of shard or sequence. A quick example:
```
timestamp = 100 -> 100 << 22 = 419,430,400
timestamp = 101 -> 101 << 22 = 423,624,704
```

The second ID is always bigger. That is why every new insert goes to the rightmost leaf of the B-tree and index fragmentation is essentially zero.
Second, & 0xFFF. The sequence counter keeps incrementing forever, but we only have 12 bits reserved for it. & 0xFFF is just a cheap way to wrap it. 0xFFF in binary is twelve consecutive 1s. ANDing any number with it throws away everything above bit 11:
```
4095 & 0xFFF = 4095   // still in range
4096 & 0xFFF = 0      // wraps back cleanly
```

No if-statement, no modulo, just a single bitwise AND.
The method is synchronized, so one thread enters at a time on this service instance. The AtomicLong on the sequence counter is belt-and-suspenders given the lock already serializes access.
Once the 64-bit integer is generated it gets Base62-encoded into the short code:
```java
private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

public static String encode(long id) {
    if (id == 0) return "0";          // the loop below would emit nothing for zero
    StringBuilder sb = new StringBuilder();
    while (id > 0) {
        sb.append(ALPHABET.charAt((int) (id % 62))); // least significant base-62 digit
        id /= 62;
    }
    return sb.reverse().toString();   // digits were produced in reverse order
}
```

A typical Snowflake ID is on the order of 10¹⁷ to 10¹⁸. That encodes to 10 or 11 Base62 characters (62¹⁰ ≈ 8.4 × 10¹⁷). Short enough to be a URL slug, and unique enough to never collide across shards.
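For completeness, the inverse direction. This decode sketch is my own and simply assumes the same alphabet; the repo's implementation may differ:

```java
// Illustrative Base62 codec matching the alphabet above.
public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            id = id * 62 + ALPHABET.indexOf(c); // Horner's rule, most significant digit first
        }
        return id;
    }

    public static String encode(long id) {
        if (id == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }
}
```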
The end result is a query like this resolving in O(log n):

```sql
SELECT long_url FROM short_urls WHERE internal_id = 286435872092454913;
```

Because internal_id is a sequential BIGINT primary key, MySQL's InnoDB clustered index keeps rows physically ordered on disk. The B-tree traversal is fast, page cache hit rates are high, and there is no fragmentation from random insertions.
The Hard Parts
Sequence overflow, known and unhandled. The current code does not handle generating more than 4,096 IDs within the same millisecond on the same shard. The & 0xFFF mask wraps the counter silently back to zero, which means two calls can produce the same ID if the rate exceeds 4M/s on one shard. The fix is straightforward:
```java
// lastTimestamp tracks the timestamp of the previously issued ID (not shown earlier)
if (sequenceNumber == 0) {
    while (System.currentTimeMillis() <= lastTimestamp) {
        // spin until the clock moves forward
    }
}
```

I have documented this in the source. At current volume it does not trigger. I am not pretending it is solved.
Clock drift, also unhandled. If the system clock moves backward due to an NTP correction or VM live migration, System.currentTimeMillis() - epoch shrinks. The timestamp bits in the new ID become smaller than in the previous one, monotonicity breaks, and you can get duplicate IDs. The standard fix is to track lastTimestamp and throw or wait if the clock regresses. Not implemented. Documented.
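For reference, a generator sketch (mine, not the repo's code) that folds both guards in: track lastTimestamp, spin when a millisecond's 4,096 sequence slots are exhausted, and refuse to issue IDs while the clock is behind:

```java
// Illustrative generator combining the overflow and clock-drift guards.
// Constants match the 41/10/12 layout; EPOCH is the assumed custom epoch.
public class SafeIdGenerator {
    private static final long EPOCH = 1672531200000L; // 2023-01-01T00:00:00Z
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public synchronized long nextId(int shardId) {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // clock regressed (NTP step, VM migration): refuse rather than duplicate
            throw new IllegalStateException("clock moved backwards; refusing to generate IDs");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;
            if (sequence == 0) {                 // 4,096 IDs this millisecond: wait it out
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    // spin until the clock moves forward
                }
            }
        } else {
            sequence = 0;                        // fresh millisecond, reset the counter
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | ((long) shardId << 12) | sequence;
    }
}
```

IDs from a single instance are strictly increasing under this scheme, which is what keeps the B-tree appends sequential.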
Duplicate long URLs. Two users submitting the same long URL should get the same short code. Before generating a new ID, the service checks for an existing long_url record. The lookup is on an indexed column so it is cheap, but it is a read-before-write that adds a round trip on every URL creation. A unique constraint on long_url would make this airtight.
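As an in-memory analogue of what the unique constraint buys you, computeIfAbsent collapses the read and the write into one atomic get-or-create (illustrative only; names are mine, not the repo's):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory analogue of get-or-create on long_url: one atomic step
// instead of a read-before-write that can race with a concurrent insert.
public class ShortCodeIndex {
    private final Map<String, Long> byLongUrl = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    public long idFor(String longUrl) {
        // computeIfAbsent runs the mapping function at most once per key,
        // even under contention, so duplicates cannot slip through
        return byLongUrl.computeIfAbsent(longUrl, u -> nextId.getAndIncrement());
    }
}
```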
Click tracking on the redirect path. Every redirect needs to increment click_count. Doing that synchronously stalls the redirect on an UPDATE with a row-level lock. Instead, click increments go into a thread pool and get processed asynchronously. The 302 returns immediately and the counter catches up in the background. The tradeoff is honest: if the JVM dies between the redirect and the write, that click is gone. A click counter is an approximation, not an accounting ledger. I accepted that.
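A minimal sketch of that fire-and-forget shape, with an AtomicLong standing in for the UPDATE (names are mine, not the repo's):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// The redirect thread hands the increment to a worker pool and returns
// without waiting on the row lock. Counts die with the JVM, as noted.
public class ClickTracker {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final AtomicLong clicks = new AtomicLong(); // stand-in for the DB UPDATE

    public void recordClick(long internalId) {
        pool.submit(() -> clicks.incrementAndGet());    // async; the 302 is already sent
    }

    public long drain() {
        pool.shutdown();                                // finish queued increments
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return clicks.get();
    }
}
```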
Index strategy on redirects. short_code has a unique secondary index and internal_id is the clustered primary key. A redirect traverses the short_code index to locate the row, then does a bookmark lookup into the clustered index to fetch long_url. Two B-tree traversals is the floor for this query shape without a covering index or an in-memory cache in front. If this were serving serious traffic, a Redis layer on short_code -> long_url would collapse the redirect to a single memory read. Not added yet.
Running It
```shell
git clone https://github.com/siddharth1729/clik.git
cd clik
./mvnw spring-boot:run
```

Update src/main/resources/profiles/dev.application.properties with your MySQL credentials. The schema bootstraps itself via Hibernate DDL on first boot.
API surface:
```
POST /api/shorten                                    shorten a URL (anonymous)
POST /api/shorten  (Authorization: Bearer <token>)   shorten a URL (authenticated)
GET  /{shortCode}                                    redirect to the long URL
POST /api/users/register                             create an account
```

Why I Built It
Every explanation of this system at the design level gestures at Base62 encoding and sharding and then moves on. I wanted to know what actually breaks when you implement those things.
The ID generator is the crux. Get the bit layout wrong and your index degrades. Miss the sequence overflow case and you silently produce duplicate keys under load. Handle click tracking synchronously and your redirect latency spikes under write contention.
None of that shows up on a whiteboard. All of it matters in the code.
Source: github.com/siddharth1729/clik