64 Bit Distributed ID Generator

01 March 2025 •

Lodhi Garden, Delhi clicked by siddharth sabron

Lodhi Garden, Delhi

Introduction

Unique ID generation is a crucial part of any application, because it is used to represents a single unique row in the table. Furthermore, this ID is often used as the primary-key of the table and heavily used in searching a specific row uniquely.

Example Query:

select  * from  short_urls where internal_id = 286435872092454913;

You might wonder why we are emphasizing the ID so much…..?. Couldn’t we just generate a unique value using something like Math.random() + epoch..? or some random_string_of_some_size + epoch While that could produce unique values, but it would be a poor choice for several reasons, especially at scale:

Meaningful Data: A simple timestamp (or a random number) doesn’t convey any useful information other than the time of creation. Our Snowflake-like approach embeds a timestamp and a shard ID, which can be useful for operational purposes (e.g., identifying which shard a URL was created on).
Database Performance (Crucially): The data type of the primary key has a major impact on database performance, particularly for search operations. Purely numeric IDs (integers or long integers) are vastly superior to strings for primary keys and indexed columns or both as well(A Indexed Primary).

Why Not Random Strings?

Using random strings (e.g., generated by Math.random() + epoch or random string of some size) as primary keys is generally a bad idea for performance reasons:

Slower Comparisons: String comparisons are inherently slower than numeric comparisons. Databases can compare integers much more efficiently at the hardware level.
Larger Index Size: String primary keys create larger database indexes. Indexes are data structures (usually B-trees or B+trees) that speed up lookups. A larger index means:
- More disk I/O to read the index.
- More memory consumption to keep parts of the index in cache.
- Slower query performance.
Index Fragmentation: Inserting random strings into a B-tree index leads to fragmentation. A fragmented index has its data scattered across the disk. This requires more disk seeks (which are slow) to retrieve data. Numeric, sequential IDs minimize fragmentation because new entries are usually appended to the end of the index.
Poor Locality of Reference: Random strings have poor locality of reference. Successive ID lookups are likely to access very different parts of the index and the table data, leading to more disk seeks and cache misses.

Why Numeric, Sequential IDs are Better:

Numeric IDs, especially when generated in a roughly sequential manner (like our Snowflake-like approach), offer substantial advantages:

Fast Comparisons: Databases can compare numbers extremely quickly.
Smaller Indexes: Numeric IDs (e.g., 64-bit integers) take up less space than long, random strings, resulting in smaller, faster indexes.
Reduced Fragmentation: Sequential IDs cause new rows to be inserted at the end of the B-tree index, minimizing fragmentation and improving write performance.
Improved Locality of Reference: Sequential IDs tend to be clustered together in the index and on disk. This improves cache utilization and reduces disk seeks, speeding up both reads and writes.

Database Internals: B-Trees (Simplified)

Most relational databases (MySQL, PostgreSQL, etc.) use B-trees (or variants like B+trees) to implement indexes. Here’s a simplified explanation:

Structure: A B-tree is a self-balancing tree data structure. Data is kept sorted, and searches, insertions, and deletions can be performed in logarithmic time (O(log n)).
Nodes: Each node in the B-tree contains multiple keys (in our case, the internal_id values) and pointers to child nodes.
Searching: To find a specific ID, the database:
1. Starts at the root node.
2. Compares the target ID with the keys in the node.
3. Follows the appropriate pointer to a child node.
4. Repeats this process until it finds the ID or determines it doesn’t exist. Because the keys are sorted within each node, the database can very efficiently narrow down the search.
Insertions:
- Sequential IDs: New keys are usually added to the rightmost leaf node of the B-tree. This is very efficient.
- Random Strings: The new key could need to be inserted anywhere in the B-tree. This might require splitting nodes (if a node is full), leading to multiple write operations and fragmentation.
Time Complexity: The Time complexity for searching using the internal_id which is primary key, will be O(log n). Where ‘n’ is the number of rows in your table. And, If we had chosen a string-based ID, and even if that string column were indexed, the search complexity within the index itself would still involve string comparisons, which are slower than integer comparisons. The complexity would still be logarithmic (O(log n)) in terms of the number of index nodes accessed, but each node comparison would take longer. And, if your are quering on non-indexed string column then it will be O(n*m), Where ‘n’ is the number of rows, and ‘m’ is the average length of your string ID.

Method: `generateInternalId(int shardId)`

The generateInternalId method creates unique IDs without relying on a central coordinating service (like a database auto-increment). It achieves this by combining a timestamp, a shard ID, and a sequence number into a single 64-bit integer.

ID Structure

The generated ID is a 64-bit long integer, structured as follows:

+-------------------------------------------------------------------------------------+
|    Timestamp (41 bits)   |   ShardID(10) |   Sequence(12)   |      extra bit        |
+--------------------------+---------------+------------------+-----------------------+
|            63            |      22       |        12        |  0  (Bit positions)   |
+-------------------------------------------------------------------------------------+

Timestamp (41 bits): Represents the number of milliseconds since a custom epoch. The epoch in this implementation is 2023-01-01 00:00:00 UTC (defined by the epoch variable).
- Calculation: System.currentTimeMillis() - epoch
- Capacity: 41 bits allow for 2⁴¹ milliseconds, which is approximately 69.7 years:
  - 2⁴¹ milliseconds = 2,199,023,255,552 milliseconds
  - 2,199,023,255,552 ms / 1000 ms/s / 60 s/min / 60 min/hr / 24 hr/day / 365 days/year ≈ 69.7 years
- Lifespan: The system can generate unique IDs for about 69.7 years. After this, the timestamp would wrap around, potentially causing ID collisions. To prevent this, you’d need a strategy like re-sharding with a new epoch or implementing collision detection (not done in this simplified version).
Shard ID (10 bits): Identifies the specific database shard or application instance that generated the ID. This is essential for distributed systems. 10 bits allow for 2¹⁰ = 1024 unique shard IDs.
Sequence Number (12 bits): A counter that increments for each ID generated within the same millisecond on the same shard. 12 bits allow for 2¹² = 4096 unique sequence numbers per millisecond per shard. This handles situations where many IDs are generated very quickly.
- Atomic Increment: sequence.getAndIncrement() ensures the sequence number is incremented atomically, preventing race conditions in a multi-threaded environment.
- Bitwise AND (& 0xFFF): Keeps the sequence number within the 12-bit range (0-4095). 0xFFF is the hexadecimal representation of 4095 (binary 111111111111). The bitwise AND effectively masks all but the last 12 bits.

Bitwise Operations

The code uses bitwise operations to combine the timestamp, shard ID, and sequence number:

Left Shift (<<):
- timestamp << 22: Shifts the timestamp 22 bits to the left. This creates 22 empty bits (filled with zeros) on the right-hand side. These bits are reserved for the shard ID (10 bits) and sequence number (12 bits). Left shifting by n bits is equivalent to multiplying by 2ⁿ.
```
Timestamp (binary, simplified): 1001...00100
Timestamp << 22:                1001...00100000000000000000000000000
```
- shardId << 12: Shifts the shard ID 12 bits to the left, creating 12 empty bits on the right for the sequence number (multiplying by 2¹²).
```
Shard ID (binary): 0000000101
Shard ID << 12:    0000000101000000000000
```
Bitwise OR (|): Combines the shifted timestamp, shifted shard ID, and sequence number. Because the left shifts created non-overlapping sections, the bitwise OR effectively concatenates the three values.

+------------------------------------------------------------------------------------+
|       (timestamp << 22)               |       (shardId << 12)    | sequenceNumber  |
+---------------------------------------+--------------------------+-----------------+
|  1001...0010000000000000000000000000  |  0000000101000000000000  | 000001111011    |
+------------------------------------------------------------------------------------+

Resulting 64-bit ID: 1001…001000000000101000001111011

Visual Representation

+--------------------------------------------------------------------------------------+
|                                         ID (64 bits)                                 |
+--------------------------------------------------------------------------------------+
|      Timestamp (41 bits)        |      Shard ID (10 bits)     |    Seq (12 bits)     |
+---------------------------------+-----------------------------+----------------------+
| 10010011...00100000000000       |          0000000101         |      000001111011    |
+---------------------------------+-----------------------------+----------------------+

Example

Let’s say:

Current time: 2024-03-01 10:30:15.500 UTC
Epoch: 2023-01-01 00:00:00.000 UTC
Shard ID: 5
Sequence: 123

Timestamp: 39,486,615,500 milliseconds (difference between current time and epoch).
Binary Representations (simplified):
- Timestamp: 10010011111000000110000110101100000100100
- Shard ID: 0000000101
- Sequence Number: 000001111011
Bitwise Operations: The code performs the shifts and OR as described above, resulting in the final 64-bit ID.

Code

Here is the java implementation of the above explanation:

    private final long epoch = 1672531200000L;
    private final AtomicInteger sequence = new AtomicInteger(0);

    private synchronized long generateInternalId(int shardId) {
        long timestamp = System.currentTimeMillis() - epoch;
        int sequenceNumber = sequence.getAndIncrement() & 0xFFF;
        return (timestamp << 22) | (shardId << 12) | sequenceNumber;
    }

Also i have used this in one of my personal project Url Shortener

shardId: The ID of the database shard (an integer between 0 and 1023, inclusive).
Return Type: The generated 64-bit unique internal ID.

Important Considerations

Sequence Overflow: This code does not handle the case where more than 4096 IDs are generated within the same millisecond on the same shard. In a production system, you must address this (e.g., by waiting for the next millisecond, throwing an exception, or using a larger sequence number space).
Clock Drift/Jump: This code does not handle system clock changes (clock drift or NTP adjustments). Clock changes can cause the timestamp to go backward, potentially leading to duplicate IDs. A production system must handle this (e.g., by tracking the last generated timestamp and waiting or throwing an exception if the clock moves backward).
Shard ID Assignment: You need a robust mechanism to assign shard IDs and to determine the shard ID for a given operation. Consistent hashing is a common approach.

This detailed explanation should cover all aspects of the generateInternalId method and the Snowflake-like ID generation strategy. Remember to use the more robust version of this method (with clock drift and sequence overflow handling) in a production environment.

If you like the post please connect me on LinkedIn And Twitter

Read Previous

Go to top