Is UUID over UUsed?

2024-Aug-10

You might not need a UUID in all cases

A UUID is a 128-bit number, represented as groups of hexadecimal digits separated by - characters. For example, a UUIDv4 looks like 62b5dcd2-1549-4e1c-8315-0e83b2518b35. UUIDs are primarily intended as identifiers, and they come in several versions with different properties. I'm going to focus on UUIDv4 here and why you might not need it.

Side quest: Why not use sequential integers?

If you use a relational database management system (RDBMS) like PostgreSQL or MySQL, you can set an autoincrementing integer as a table's primary key. I generally advise against this, with few exceptions.

If you suffer something like an insecure direct object reference (IDOR) vulnerability, then an attacker only needs to request IDs 1 to N to get all of the data out of the system. With a UUIDv4, they'll have a much harder time enumerating all of your resources. They'll typically need another vulnerability to get all the IDs they'll need.

Sequential IDs also limit how large your system can get. Any transactions inserting data into that table must get consensus on which transactions get which IDs. If it takes 0.1ms to get consensus, then you're ultimately limited to 10,000 transactions per second (TPS) on that table. You'll never reach 10,000 TPS; you'll see the latency increase as TPS increases until you see timeouts. This problem gets worse if you do it over a network.
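Here's that ceiling as napkin maths. It's a sketch that assumes ID allocation is fully serialised and takes a fixed time per insert; the function name is made up for illustration.

```python
def max_serial_tps(consensus_seconds: float) -> float:
    """Upper bound on inserts per second when every insert must wait
    its turn for the next ID (a simplifying assumption)."""
    return 1.0 / consensus_seconds

# 0.1 ms per allocation, as in the example above
print(max_serial_tps(0.0001))

# Assumed figure for illustration: 2 ms when consensus crosses a network
print(max_serial_tps(0.002))
```

The second figure is only there to show how quickly the ceiling drops once a network round trip is involved; measure your own latency before trusting either number.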

There are cases where a sequential ID is fine. If your largest tables will never reach the limit, then you may be fine. The same goes if you are happy with the additional risk from an IDOR vulnerability, e.g. your data is publicly available or you have other controls. My general rule of thumb here is to spend 10 mins doing the napkin maths to double-check.

There are also cases where sequential IDs are practically mandatory. If I remember correctly, auditors need to be able to show that there were no "shadow" invoices, and they may look for "gaps" in the invoice numbers to find off-books payments.

Problems with UUIDs

First, they're larger than sequential IDs. A 32-bit integer is 4 bytes, whereas a UUID is 16 bytes. The keys in any index over UUIDs will take 4x the storage. You'll fit fewer index entries into memory, increasing how often the database needs to hit disk for an index page (assuming it does not all fit in memory). The exact cost will vary from application to application, so I encourage you to do some napkin maths; the additional cost for your application may be very small compared to, e.g., the increased risk from an IDOR vulnerability.
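A sketch of that napkin maths, counting key bytes only. The row count is a hypothetical figure, and a real index adds per-entry and per-page overhead that this ignores.

```python
def key_bytes(rows: int, bytes_per_key: int) -> int:
    """Raw bytes spent on the keys alone, ignoring index overhead."""
    return rows * bytes_per_key

rows = 100_000_000               # hypothetical table size
int_keys = key_bytes(rows, 4)    # 32-bit integer keys
uuid_keys = key_bytes(rows, 16)  # UUID keys

print(f"int keys:  {int_keys / 1e9:.1f} GB")
print(f"UUID keys: {uuid_keys / 1e9:.1f} GB")
```

Whether the difference matters depends on how much memory your database has to spare for indexes.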

Second, many RDBMSs use B-Trees to index their primary keys. Insertion order affects the write performance of a B-Tree. For large indexes that don't fit in memory, each UUID insertion lands on an effectively random index page, which the system often has to bring in from disk. In contrast, a sequential ID can fill up a page with few cache misses. You can mitigate this with a UUIDv7, but you make enumerating your resources easier, as you'll only have 74 bits of randomness. Some napkin maths will tell you if that matters in your application. There's no difference if you use, e.g., a hash-type index; you'll pay a data locality price either way.

Finally, they're difficult to use. Copy-paste doesn't work well, as most OSs consider - a word-splitter. Double-clicking the ID only gets you part of it, so you have to click and drag. That's annoying, but it gets hard when you've recently broken a finger. Individuals can work around it by changing their OS settings, and companies can work around it by offering "copy" buttons in their UI. This is a small problem, but an easy one to fix at the design stage. In the worst case, deleting the - characters with something like s/-//g works.
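If you're in Python rather than sed, the standard uuid module does the same job in both directions. This is just one way to do it:

```python
import uuid

u = uuid.UUID("62b5dcd2-1549-4e1c-8315-0e83b2518b35")

compact = u.hex                     # 32 hex digits, no hyphens
restored = str(uuid.UUID(compact))  # back to the hyphenated form

print(compact)
print(restored)
```

The compact form selects as a single word on a double click, and the constructor accepts it without the hyphens when it comes back in.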

You don't need global uniqueness all the time

Many applications have resources linked to a single tenant. There will be some natural limit to how many resources a single tenant can have [1]. Those resources' IDs don't need to be globally unique, only unique within the tenant.

Let's say you have a job management system for plumbers who're sole proprietors. A plumber needs some minimum amount of time to do a job, e.g. they have to drive to the location, greet the owner, do the work, and tidy up. Let's make some outlandish assumptions: There's a plumber who works 24hr a day, every day, for 50 years. They work insanely fast -- 5 minutes per job -- and never take a holiday. Such a plumber would generate around 5.3 million job orders.
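The arithmetic behind that figure, ignoring leap days:

```python
MINUTES_PER_JOB = 5
YEARS = 50

jobs_per_day = 24 * 60 // MINUTES_PER_JOB  # 288 jobs in a 24-hour day
jobs_per_year = jobs_per_day * 365         # leap days ignored
total_jobs = jobs_per_year * YEARS

print(f"{total_jobs:,}")  # 5,256,000
```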

A random number with about 50 bits would be sufficient to avoid collisions. This is still larger than a 32-bit sequential ID, but still nowhere near a UUIDv4.
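You can check what "sufficient" means for your own risk tolerance with the birthday approximation. This is a sketch that assumes IDs are drawn uniformly at random; the function name is mine.

```python
import math

def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday approximation: p ≈ 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n_ids * (n_ids - 1) / 2 ** (bits + 1))

# The outlandish plumber's lifetime job count at a few ID sizes
for bits in (32, 50, 64):
    print(bits, collision_probability(5_256_000, bits))
```

By this approximation, 50 bits leaves the outlandish plumber with roughly a 1% chance of ever seeing a collision across all 50 years, and a realistic plumber with far less. If that's still too high for you, a few more bits are cheap.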

On PostgreSQL, this fits in the bigint type. You can choose a random 64-bit number and put it in this table as a key. You can encode them with Base32 to make it a little easier for humans handling the keys.
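A minimal sketch of generating and encoding such a key. One assumption to flag: I draw 63 random bits rather than 64 so the value is never negative in PostgreSQL's signed bigint. The function names are made up for illustration.

```python
import base64
import secrets

def new_job_id() -> int:
    """Random 63-bit integer; fits a signed 64-bit bigint as a non-negative value."""
    return secrets.randbits(63)

def to_base32(job_id: int) -> str:
    """Encode the 8-byte big-endian value as 13 Base32 characters, padding stripped."""
    return base64.b32encode(job_id.to_bytes(8, "big")).decode("ascii").rstrip("=")

def from_base32(text: str) -> int:
    """Inverse of to_base32; restores the padding that was stripped."""
    padded = text.upper() + "=" * (-len(text) % 8)
    return int.from_bytes(base64.b32decode(padded), "big")

job_id = new_job_id()
encoded = to_base32(job_id)
print(job_id, encoded)
```

The encoded form has no hyphens and is shorter than the 36 characters of a UUID string. The standard Base32 alphabet is uppercase letters plus the digits 2 to 7, so 0 and 1 never appear and can't be confused with O or I.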

If you're using DynamoDB, you can [2] use these in your partition key by concatenating them with the plumber's ID: PlumberId:JobId. This pair will be globally unique. You'd probably need to store the PlumberId anyway, so you're not paying much more for the storage.
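Building and splitting that key is plain string handling. The sketch below doesn't touch DynamoDB at all, and it assumes neither ID contains a : character; the function names and sample IDs are hypothetical.

```python
def partition_key(plumber_id: str, job_id: str) -> str:
    """Build the PlumberId:JobId value; rejects IDs containing ':'."""
    if ":" in plumber_id or ":" in job_id:
        raise ValueError("IDs must not contain ':'")
    return f"{plumber_id}:{job_id}"

def split_partition_key(key: str) -> tuple[str, str]:
    """Recover (plumber_id, job_id) from a key built by partition_key."""
    plumber_id, job_id = key.split(":", 1)
    return plumber_id, job_id

key = partition_key("plumber-42", "AEBAGBAFAYDQQ")
print(key)
```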

This doesn't solve the data locality issue. In my experience, it's usually cheaper to physically partition the data than accept the performance hit from making all transactions coordinate on a sequence. Your mileage may vary! Check if it's a problem for you with some napkin-maths or a little experiment.

One final caveat: You need to consider the total number of IDs ever generated, not just the number that are currently active. If you only consider live IDs, then you risk collisions with deleted resources. You can use your create API rate limit as an approximation for this.
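As a sketch of that approximation, assuming the rate limit is saturated for the whole lifetime of the system (a deliberate overestimate). The rate and lifetime below are made-up figures.

```python
def lifetime_ids(creates_per_second: int, years: int) -> int:
    """Worst-case count of IDs ever issued at a saturated create rate limit."""
    seconds = years * 365 * 24 * 60 * 60  # leap days ignored
    return creates_per_second * seconds

# Hypothetical: 10 creates per second per tenant, over a 10-year lifetime
print(f"{lifetime_ids(10, 10):,}")
```

Use this worst-case count, not the number of live rows, when you size the ID.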

What about tenant IDs or other globally-unique resources?

It's largely the same story as above:

Are sequential IDs non-negotiable?
What're the consequences of vulnerabilities like IDOR?
What's the maximum number of tenants you could see over the lifetime of the system?
What's the highest insert rate you might see?
Do humans need to handle these IDs?

These are the key questions to work out if you can "get away" with sequential IDs. If not, they'll guide you to the "right" number of bits for your chosen ID and a reasonable encoding scheme. These don't always suggest UUIDs.

Footnotes

[1]	: You should probably set a limit too, to stop folks turning your DNS service into a database.

[2]	: This is a bad idea in some applications, e.g. if you know a given job can be much more popular than others. I recommend the DynamoDB Book for more info.