This is a short post on an ID design that I like. I have used it for quite a while across several large software systems: a data system for storing a large number of images and videos, a large-scale collaborative visual editing tool, and an ad tech data system.

I dislike some of the common or default approaches for IDs. For example, MySQL tables typically use simple auto-incrementing integers. MongoDB uses 12 bytes written as 24 hexadecimal characters. UUIDs are ugly. Plain numbers are too long. A SHA-256 hash could easily be confused with a Git hash. See the last section for more examples.

My choice of ID looks like this: img_6ykNcuNA7u. It has a type prefix and a base58 body, and the number of entropy bits is chosen based on needs.

Why Prefix and Base58

Prefixed IDs make debugging easy. When I encounter an ID in reports, logs, debugging messages, or random Slack messages, I can quickly understand the context without querying databases or asking another person. The prefix tells me where to look up that ID.

Prefixes also make it easier to mix different data types. A common scenario is that the same data lives in several systems: it is stored in a transactional database and replicated into analytics systems that have different table layouts. Sometimes multiple data types are merged into one place. Uniqueness across data types is then required, and having prefixed IDs usually makes life a lot easier.
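As a rough illustration of how a prefix tells you where to look, here is a minimal Python sketch. The registry contents and the function names (`PREFIX_REGISTRY`, `split_id`, `where_to_look`) are hypothetical and only serve the example.

```python
# Hypothetical registry: prefix -> where records of that type live.
PREFIX_REGISTRY = {
    "img": "images table",
    "vid": "video_clips table",
    "usr": "users table",
}


def split_id(prefixed_id: str) -> tuple[str, str]:
    """Split 'img_6ykNcuNA7u' into ('img', '6ykNcuNA7u')."""
    # Underscore is not a base58 character, so the first one ends the prefix.
    prefix, sep, body = prefixed_id.partition("_")
    if not sep or not prefix or not body:
        raise ValueError(f"not a prefixed ID: {prefixed_id!r}")
    return prefix, body


def where_to_look(prefixed_id: str) -> str:
    """Return the (hypothetical) storage location for an ID, based on its prefix."""
    prefix, _ = split_id(prefixed_id)
    return PREFIX_REGISTRY.get(prefix, "unknown prefix")


print(where_to_look("img_6ykNcuNA7u"))
```

Because the type is carried in the ID itself, IDs of different types can sit in the same column or log line without ambiguity.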

Base58 is easy to read and relatively space efficient compared to hex. For example, compare base58 6ykNcuNA7uFGsA with hex 6a696e6875616e676a69. The other candidate is base64, but base64 uses awkward characters such as /, +, and =, which makes it unsafe for URLs. Base58 also leaves out the ambiguous characters 0, O, I, and l. Note that base58 was made popular by cryptocurrency projects, especially Bitcoin.

I only use underscore as a valid delimiter. One might argue that hyphen could work. I prefer to avoid it because underscore is much more universally accepted. Any other character is bound to cause issues once we cross database, programming language, or UI system boundaries.
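To make the size comparison concrete, here is a small Python sketch of base58 encoding using the Bitcoin alphabet. The function name `base58_encode` is my own, and mapping leading zero bytes to `1` follows the Bitcoin convention, which may or may not matter for your IDs.

```python
# Bitcoin base58 alphabet: digits and letters without 0, O, I, and l.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58. Leading zero bytes are written as '1'."""
    number = int.from_bytes(data, "big")
    chars = []
    while number > 0:
        number, remainder = divmod(number, 58)
        chars.append(BASE58_ALPHABET[remainder])
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(chars))


# The same 10 bytes, written in hex and in base58.
raw = bytes.fromhex("6a696e6875616e676a69")
print(raw.hex(), len(raw.hex()))
print(base58_encode(raw), len(base58_encode(raw)))
```

Hex spends 2 characters per byte, while base58 carries about 5.86 bits per character, so the same 80 bits fit in 14 characters instead of 20.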

Bit Length

There is a tradeoff between entropy and ID length. Long IDs are cumbersome and ugly. I prefer IDs to be as short as possible because short IDs are easier to copy, easier to debug, and more pleasing to my eyes. One key consideration is the collision rate. I strongly discourage incremental IDs, such as 101, 102, 103, etc., because they require centralized coordination to generate. Picking an ID should be possible in a decentralized way. Even client code should be allowed to generate IDs for new data objects. This avoids a dependency on specific database technologies. Without a central coordinator, IDs are simply drawn at random, so the bit length has to make collisions negligible.
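A decentralized generator along these lines can be sketched in a few lines of Python. This is one possible approach, not necessarily how the systems above did it: the name `new_id` and the default of 88 bits are assumptions for illustration, and the body length is rounded up so that it carries at least the requested number of bits.

```python
import math
import secrets

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def new_id(prefix: str, bits: int = 88) -> str:
    """Return '<prefix>_<base58 body>' with at least `bits` bits of randomness."""
    # Each base58 character carries log2(58), roughly 5.86, bits.
    length = math.ceil(bits / math.log2(58))
    # secrets draws from the OS cryptographic random source; no coordinator needed.
    body = "".join(secrets.choice(BASE58_ALPHABET) for _ in range(length))
    return f"{prefix}_{body}"


print(new_id("img"))      # 16 random characters after the prefix
print(new_id("img", 58))  # 10 random characters, like img_6ykNcuNA7u
```

Any server or client can call this independently, which is the point of avoiding incremental IDs.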

We might also want our IDs to be cryptographically strong. For example, suppose we expose a URL that looks like https://video_clip/{id}. We do not need any authentication behind the URL if the ID has enough entropy, because the ID itself acts as its own password. The bit length required for security is quite easy to pick. If we assume an energy cost of 1 nanojoule per key tried and $0.10 per kWh, we get the following table of the energy cost required to crack a key of a given bit length. For almost all enterprise applications, I consider 88 bits to be sufficient.

| Bit Length | Energy Cost |
|---|---|
| 80 | $16 million |
| 88 | $4 billion |
| 90 | $17 billion |
| 100 | $17 trillion |
| 128 | > 1000 years of solar output |
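The dollar figures can be reproduced roughly with a few lines of Python. This sketch relies on one assumption that is not stated in the text: the attacker finds the key after trying half of the key space on average. With that assumption the results land close to the table. The constant and function names are mine.

```python
JOULES_PER_KEY = 1e-9    # 1 nanojoule per key tried (from the text)
DOLLARS_PER_KWH = 0.10   # electricity price (from the text)
JOULES_PER_KWH = 3.6e6   # 1 kWh = 3.6 million joules


def crack_cost_dollars(bits: int) -> float:
    """Expected electricity cost of brute-forcing a key with `bits` bits."""
    # Assumption: on average the key is found after half the key space.
    expected_tries = 2 ** (bits - 1)
    kwh = expected_tries * JOULES_PER_KEY / JOULES_PER_KWH
    return kwh * DOLLARS_PER_KWH


for bits in (80, 88, 90, 100):
    print(bits, f"${crack_cost_dollars(bits):,.0f}")
```

Every extra 8 bits multiplies the cost by 256, which is why the jump from 80 to 88 bits moves the cost from millions to billions of dollars.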

From the perspective of collision rate, it is helpful to look at a few formulae. The probability of not having a single collision when there are \(x\) items and \(N\) possible IDs is

$$ \frac{N}{N} \times \frac{N-1}{N} \times \frac{N-2}{N} \times \dots \times \frac{N-x+1}{N}. $$

That is just the probability that every item picks a different ID. It can be approximated by \(\exp \left( \frac{-x^2}{2N} \right)\). The reason is that \(\ln(1 - y) \approx -y\) for small \(y\), and \(\sum_{i=0}^{x-1} i \approx \int_0^x i \, di = \frac{x^2}{2}\). With \(n\) bits we have \(N = 2^n\), so the probability of at least one collision among \(x\) items is

$$ p \approx 1 - \exp\left(\frac{-x^2}{2 \cdot 2^n}\right). $$

For \(x=10^9\), we have, to the nearest order of magnitude,

| Bit Length (n) | Collision Probability |
|---|---|
| 80 | \(10^{-6}\) |
| 88 | \(10^{-9}\) |
| 100 | \(10^{-12}\) |
| 128 | \(10^{-21}\) |
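The approximation \(p \approx 1 - \exp(-x^2 / 2^{n+1})\) is easy to evaluate in Python. The sketch below uses `math.expm1` so that very small probabilities are not lost to floating-point rounding. The function name `collision_probability` is mine.

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Approximate probability of at least one collision among `items` random IDs."""
    # p ~= 1 - exp(-x^2 / (2 * 2^n)); -expm1(-t) computes 1 - exp(-t) accurately.
    return -math.expm1(-(items ** 2) / (2 * 2 ** bits))


for bits in (80, 88, 100, 128):
    print(bits, f"{collision_probability(10 ** 9, bits):.1e}")
```

The exact values sit within a factor of a few of the powers of ten in the table, which is enough precision for choosing a bit length.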

That is the probability that there is at least one collision anywhere in the system. I would say that for most applications, \(10^{-6}\) is good enough. My rationale is that if the company sponsoring the project going under, or the world facing a total nuclear war, is more likely than the system encountering a single collision, the system is safe enough. For example, when I was building a data platform for images and video clips, I estimated that the total number of images and videos our platform would support was less than a trillion. I chose \(n=100\) bits for the ID. With \(x = 10^{12}\), the collision probability is roughly \(10^{-6}\).

Monotonicity

I had one real-world success story using an ID system that looks like this: img_3eroZu_2YV2svey1WA3U. It uses 32 bits (i.e. 6 base58 characters) to keep track of a timestamp with a resolution of one second. This ensures that the IDs are generated in roughly monotonic order, which is helpful for some database and partitioning operations. The system uses 32 bits for the timestamp and 72 bits for entropy. Even though it only has 72 bits of randomness, that is sufficient because collisions can only happen between IDs generated within the same second.
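Here is a minimal Python sketch of how such an ID could be built. It is an illustration under assumptions, not the original implementation: I assume the timestamp is Unix seconds written as a fixed-width, big-endian base58 number, and the names `encode_fixed` and `new_timestamped_id` are hypothetical.

```python
import secrets
import time

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def encode_fixed(number: int, width: int) -> str:
    """Encode a non-negative integer as exactly `width` base58 characters."""
    chars = []
    for _ in range(width):
        number, remainder = divmod(number, 58)
        chars.append(BASE58_ALPHABET[remainder])
    if number:
        raise ValueError("number does not fit in the requested width")
    return "".join(reversed(chars))


def new_timestamped_id(prefix: str, now=None) -> str:
    """Return '<prefix>_<6 chars of time>_<13 chars of randomness>'."""
    seconds = int(time.time() if now is None else now)
    if not 0 <= seconds < 2 ** 32:
        raise ValueError("timestamp does not fit in 32 bits")
    time_part = encode_fixed(seconds, 6)               # 58**6 > 2**32
    random_part = encode_fixed(secrets.randbits(72), 13)  # 58**13 > 2**72
    return f"{prefix}_{time_part}_{random_part}"


print(new_timestamped_id("img"))
```

The base58 alphabet is in ascending ASCII order, so with a fixed width the time part sorts the same way as the underlying number under plain byte-wise string comparison. That gives the rough monotonicity without decoding the ID.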

Case Studies

I am providing a table. It says it all.

| ID Type | Example |
|---|---|
| Stripe | ch_3MmlLrLkdIwHu7ix0snN0B15 |
| YouTube | dQw4w9WgXcQ |
| Twilio | AC3f5d4e1b9c6a8d72e8c1f3a4b5c6d7e |
| Cloudflare | 023e105f4ecef8ad9ca31a8372d0c353 |
| Twitter | 1358666646167748608 |
| Shopify | gid://shopify/Product/123456789 |
| UUID | 550e8400-e29b-41d4-a716-446655440000 |
| ULID | 01ARZ3NDEKTSV4RRFFQ69G5FAV |
| MySQL | 123456 |
| MongoDB | 507f1f77bcf86cd799439011 |
| Elasticsearch | AV4fpZXG8yMoKS2RwuD6 |
| Git Commit | f4e3d2c1b0a987654321fedcba0987654321 |
| Bitcoin Transaction | f4184fc0ac40b2c6a6cddd08a24bdb1a |
| Reddit Post ID | t3_kx3v5g |
| Discord Message ID | 918273645546738739 |
| Snowflake ID | 1456243454251689984 |
| Instagram Media ID | 17895695668004547 |
| Google Analytics | GA1.2.123456789.987654321 |
| Spotify Track ID | 3n3Ppam7vgaVa1iaRUc9Lp |
| Amazon Order ID | 112-1234567-8901234 |

