This is a short post on an ID design choice that I like. I have used this ID design for quite a while across many large software systems: a data system for storing a large number of images and videos, a large-scale collaborative visual editing tool, and an ad tech data system.
I dislike some of the common or default approaches for IDs. For example, MySQL defaults to simple auto-incrementing integers. MongoDB uses a 12-byte ObjectId written as 24 hexadecimal characters. UUIDs are ugly. Plain numbers are too long. A SHA-256 hash could easily be confused with a git hash. See the last section for more examples.
My choice of ID looks like this: `img_6ykNcuNA7u`. It is prefixed and base58-encoded, and its entropy bit length is chosen based on needs.
Why Prefix and Base58
Prefixed IDs are easy to debug. When I encounter an ID in reports, logs, debugging messages, or random Slack messages, I can quickly understand the context without querying a database or asking another person. The prefix tells me where to look up that ID.
Prefixes make it easier to mix different data types. One common scenario is that different data types will be placed into different databases. For example, the data are stored in a transaction database, but the same data are replicated in some analytics systems. They have different table layouts. Sometimes multiple data types are merged. Uniqueness across data types is required, and having IDs prefixed usually makes life a lot easier.
Base58 is easy to read and relatively space efficient compared to hex. For example, base58 `6ykNcuNA7uFGsA` vs. hex `6a696e6875616e676a69`. The other candidate is base64. Standard base64 uses the symbols `+` and `/` in its alphabet and `=` for padding, which makes it unsafe for URLs. Base58 also avoids characters that are easily mistaken for one another: `0`, `O`, `I`, and `l`. Note that base58 was made popular by cryptocurrency projects, especially Bitcoin.
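The following is a minimal sketch of how such an ID could be generated, not a definitive implementation. It assumes the Bitcoin-style base58 alphabet and fixed-width output padded with the alphabet's first character; the names `base58_encode`, `base58_length`, and `new_id` are hypothetical.

```python
import secrets

# Bitcoin-style base58 alphabet: digits and letters without 0, O, I, and l.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(value: int, length: int) -> str:
    """Encode a non-negative integer as a fixed-width base58 string."""
    chars = []
    for _ in range(length):
        value, remainder = divmod(value, 58)
        chars.append(BASE58_ALPHABET[remainder])
    if value:
        raise ValueError("value does not fit in the requested length")
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def base58_length(bits: int) -> int:
    """Smallest number of base58 characters that can hold `bits` bits."""
    length = 0
    capacity = 1
    while capacity < (1 << bits):
        capacity *= 58
        length += 1
    return length


def new_id(prefix: str, bits: int = 88) -> str:
    """Build `<prefix>_<base58>` from `bits` bits of OS-level randomness."""
    value = secrets.randbits(bits)
    return f"{prefix}_{base58_encode(value, base58_length(bits))}"
```

Calling `new_id("img", bits=80)` yields the prefix, an underscore, and 14 base58 characters. Because the random part comes from `secrets`, no coordination with a database is needed to create an ID.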
I only use the underscore as a delimiter. One might argue that a hyphen could work. I prefer to avoid it because the underscore is much more universally accepted. Any other character is bound to cause issues once we cross database, programming language, or UI system boundaries.
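To illustrate how a prefix can tell us where to look up an ID, here is a small sketch. The mapping `PREFIX_TO_TABLE`, its table names, and the `vid` prefix are made up for illustration; only the `img` prefix appears in this post.

```python
# Hypothetical mapping from ID prefix to the table that stores the object.
PREFIX_TO_TABLE = {
    "img": "images",
    "vid": "video_clips",
}


def split_id(identifier: str) -> tuple[str, str]:
    """Split an ID at the first underscore into (prefix, body)."""
    prefix, separator, body = identifier.partition("_")
    if not separator or not prefix or not body:
        raise ValueError(f"not a prefixed ID: {identifier!r}")
    return prefix, body


def table_for(identifier: str) -> str:
    """Return the table to query for this ID, based only on its prefix."""
    prefix, _ = split_id(identifier)
    return PREFIX_TO_TABLE[prefix]
```

Splitting at the first underscore only means the body may itself contain underscores, so an ID with more than one delimiter still resolves by its leading prefix.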
Bit Length
There is a tradeoff between entropy and ID length. Long IDs are cumbersome and ugly. I prefer IDs to be as short as possible because they are easier to copy and debug, and more pleasing to my eyes. One key consideration is the collision rate. I strongly discourage incremental IDs, such as `101`, `102`, `103`, etc., because generating them requires centralized coordination. Picking an ID should be possible in a decentralized way. Even client code should be allowed to generate IDs for new data objects. This avoids dependency on specific database technologies. Without a central coordinator, IDs are just random.
We might also want our ID to be cryptographically strong. For example, suppose we expose a URL that looks like `https://video_clip/{id}`. We do not need any authentication behind the URL if the ID has enough entropy. The ID itself can act as its own password. The bit length requirement in terms of security is quite easy to pick. If we assume an energy cost of 1 nanojoule per key and $0.10 per kWh, we get the following table of the energy cost required to crack a key of a given bit length. For almost all enterprise applications, I consider 88 bits to be sufficient.
Bit Len | Energy Cost |
---|---|
80 | $16 million |
88 | $4 billion |
90 | $17 billion |
100 | $17 trillion |
128 | > 1000 years of solar output |
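The dollar figures in the table can be reproduced with a few lines of arithmetic. The sketch below uses the stated 1 nanojoule per key and $0.10 per kWh. The table's numbers appear consistent with an attacker searching half of the key space on average; that factor of one half is an assumption of this sketch, not something stated above.

```python
JOULES_PER_KEY = 1e-9      # 1 nanojoule per key tried
DOLLARS_PER_KWH = 0.10     # electricity price
JOULES_PER_KWH = 3.6e6     # 1 kWh = 3.6 million joules


def expected_crack_cost_usd(bits: int) -> float:
    """Energy cost in dollars to try half of a `bits`-bit key space."""
    # Assumption: on average the key is found after half the key space.
    expected_guesses = 2 ** (bits - 1)
    energy_kwh = expected_guesses * JOULES_PER_KEY / JOULES_PER_KWH
    return energy_kwh * DOLLARS_PER_KWH


if __name__ == "__main__":
    for bits in (80, 88, 90, 100):
        print(f"{bits} bits: ${expected_crack_cost_usd(bits):,.0f}")
```

Each extra bit doubles the cost, which is why moving from 80 to 88 bits multiplies it by 256.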
From the perspective of collision rate, it is helpful to look at a few formulae. The probability of not having a single collision when there are \(x\) items and \(N\) total IDs is
\[ P_{\text{no collision}} = \prod_{i=0}^{x-1} \left( 1 - \frac{i}{N} \right) \]
That is just the probability that every item picks a different ID. This can be approximated by \(\exp \left( \frac{-x^2}{2N} \right)\). The reason is that \(\ln(1 - y) \approx -y\) for small \(y\), and \(\sum_{i=0}^{x-1} i \approx \int_0^x i \, di = \frac{x^2}{2}\). When the ID has \(n\) bits, \(N = 2^n\), and the probability of at least one collision among \(x\) items is
\[ P_{\text{collision}} \approx 1 - \exp \left( \frac{-x^2}{2^{n+1}} \right) \]
For \(x=10^9\), we have
Bit Length (n) | Collision Probability |
---|---|
80 | \(10^{-6}\) |
88 | \(10^{-9}\) |
100 | \(10^{-12}\) |
128 | \(10^{-21}\) |
That is the probability that there is at least one collision in the system. I would say for most applications, \(10^{-6}\) is good enough. My rationale is that if the company sponsoring the project going under, or the world facing a total nuclear war, is more likely than the system encountering a single collision, then it is safe enough. For example, when I was building a data platform for images and video clips, I estimated that the total number of images and videos our platform would support is less than a trillion. I chose the ID to have \(n=100\) bits. The collision probability is roughly \(10^{-6}\).
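As a quick check of these numbers, the birthday approximation \(1 - \exp(-x^2 / 2^{n+1})\) for \(x\) items and \(n\) bits can be evaluated directly. This is a sketch; the function name `collision_probability` is hypothetical. It uses `math.expm1` because the probabilities are far too small for `1 - math.exp(...)` to represent accurately in floating point.

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Approximate chance of at least one collision among `items` random IDs."""
    exponent = items ** 2 / 2 ** (bits + 1)
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for bits in (80, 88, 100, 128):
        print(f"n={bits}: {collision_probability(10**9, bits):.1e}")
    # The image and video platform example: a trillion items at 100 bits.
    print(f"{collision_probability(10**12, 100):.1e}")
```

The results are order-of-magnitude estimates, which is all that is needed when choosing a bit length.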
Monotonicity
I had one real-world success story using an ID system that looks like this: `img_3eroZu_2YV2svey1WA3U`. It uses 32 bits (i.e. 6 base58 characters) to keep track of a timestamp with a resolution of one second. This ensures that the IDs are generated with rough monotonicity, which is helpful for some database and partitioning operations. The remaining 72 bits are entropy. Even though the ID has only 72 bits of randomness, that is sufficient because two IDs can only collide if they are generated within the same second, so it is 72 bits for each second.
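A layout like this could be generated as in the sketch below. It is an illustration under assumptions, not the original system's code: the function names are hypothetical, the timestamp is taken as Unix seconds, and both parts are encoded at fixed width with the Bitcoin-style base58 alphabet.

```python
import secrets
import time
from typing import Optional

# Bitcoin-style base58 alphabet; its characters are in ascending ASCII order.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_fixed(value: int, length: int) -> str:
    """Encode a non-negative integer as exactly `length` base58 characters."""
    chars = []
    for _ in range(length):
        value, remainder = divmod(value, 58)
        chars.append(BASE58_ALPHABET[remainder])
    if value:
        raise ValueError("value does not fit in the requested length")
    return "".join(reversed(chars))


def new_time_ordered_id(prefix: str, now: Optional[float] = None) -> str:
    """Build `<prefix>_<timestamp>_<random>` with 32 + 72 bits."""
    seconds = int(time.time() if now is None else now)
    if not 0 <= seconds < 1 << 32:
        raise ValueError("timestamp does not fit in 32 bits")
    time_part = base58_fixed(seconds, 6)                   # 32 bits, 6 chars
    random_part = base58_fixed(secrets.randbits(72), 13)   # 72 bits, 13 chars
    return f"{prefix}_{time_part}_{random_part}"
```

Because the timestamp part has a fixed width and the alphabet is in ascending ASCII order, a plain byte-wise string sort of IDs sharing a prefix orders them by creation second in this sketch. IDs created within the same second are ordered randomly, which is why the monotonicity is only rough.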
Case Studies
I am providing a table. It says it all.
ID Type | Example |
---|---|
Stripe | ch_3MmlLrLkdIwHu7ix0snN0B15 |
YouTube | dQw4w9WgXcQ |
Twilio | AC3f5d4e1b9c6a8d72e8c1f3a4b5c6d7e |
Cloudflare | 023e105f4ecef8ad9ca31a8372d0c353 |
1358666646167748608 |
|
Shopify | gid://shopify/Product/123456789 |
UUID | 550e8400-e29b-41d4-a716-446655440000 |
ULID | 01ARZ3NDEKTSV4RRFFQ69G5FAV |
MySQL | 123456 |
MongoDB | 507f1f77bcf86cd799439011 |
Elasticsearch | AV4fpZXG8yMoKS2RwuD6 |
Git Commit | f4e3d2c1b0a987654321fedcba0987654321 |
Bitcoin Transaction | f4184fc0ac40b2c6a6cddd08a24bdb1a |
Reddit Post ID | t3_kx3v5g |
Discord Message ID | 918273645546738739 |
Snowflake ID | 1456243454251689984 |
Instagram Media ID | 17895695668004547 |
Google Analytics | GA1.2.123456789.987654321 |
Spotify Track ID | 3n3Ppam7vgaVa1iaRUc9Lp |
Amazon Order ID | 112-1234567-8901234 |