Why You Should Graduate from Auto-Increment IDs for Database Records
Moving on from serial id when inserting a new database record.
In any database, we always need an identifier to differentiate one record from another, commonly known as the primary key. Now let’s say we want to create a table to record user information. It’s very common for us to create a table like:
create table "user" (
    -- "user" is a reserved word in PostgreSQL, so it must be quoted
    id serial primary key
    -- remaining columns omitted
);
Most of us are used to declaring the primary key as serial. Remember that serial is implemented through a sequence, which acts a bit like a counter. Each time we insert a new row into the table, the sequence advances and its next value becomes the id of that row. For instance, if the table has 5 rows and the last row has id 5, the next row we insert will get id 6.
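The counter behaviour can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real database, and the table name users and the sample names are made up for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in (assumption: the article's
# `serial` is a PostgreSQL feature; SQLite's AUTOINCREMENT key is close
# enough to show the counter-like behaviour).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
)

# Insert five rows without specifying an id; the database assigns them.
for name in ["ann", "bob", "cid", "dee", "eve"]:
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

# The sixth insert receives the next counter value.
cursor = conn.execute("INSERT INTO users (name) VALUES (?)", ("fay",))
next_id = cursor.lastrowid
print(next_id)
```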
So what’s the problem with using the serial id for the primary key?
Unwanted Information Disclosure
Let's say you create a page to view a user profile, and you create a path in the format
Here user_id refers to the primary key of the user table. If someone wants to estimate the number of rows in a given table, all they have to do is have the system generate a new entity and inspect its id. Imagine Facebook used serial ids: you could create a new account and roughly know the total number of user records inside Facebook.
Easily Iterated Entity
Let’s say the previous endpoint is open to the public. Even the most inexperienced attacker can now harvest our table just by looping over ids from 1 to N and hitting the endpoint for each one. It becomes very easy to scrape all your entities.
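To make the two problems above concrete, here is a small simulation. Nothing here talks to a real service: PROFILES, get_profile and scrape_all are hypothetical names, and the dictionary stands in for a public profile endpoint keyed by serial id.

```python
# Hypothetical data set: 100 users whose ids were assigned serially.
PROFILES = {
    user_id: {"id": user_id, "name": f"user{user_id}"}
    for user_id in range(1, 101)
}


def get_profile(user_id):
    """Stand-in for a public profile lookup; returns None when the id is unknown."""
    return PROFILES.get(user_id)


def scrape_all():
    """Walk ids upward from 1 until the first miss, collecting every profile."""
    harvested = []
    user_id = 1
    while (profile := get_profile(user_id)) is not None:
        harvested.append(profile)
        user_id += 1
    return harvested


harvested = scrape_all()
# The attacker now holds every profile and also knows the table size.
print(len(harvested))
```

The loop needs no knowledge of the system beyond the fact that ids count upward, which is exactly what a serial key gives away.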
A Bottleneck for Horizontal Scalability
Sooner or later our table is going to grow very big, and we want it to keep serving users as fast as possible. At first, the obvious choice is to scale the database vertically, from 2 cores and 4 GB to 4 cores and 8 GB, and so on. Eventually we will hit the limit of how big a single database can grow. Then we learn about horizontal scaling, where we can have several master databases to split the incoming load.
Now imagine we have 5 master databases that handle the insertion of new records, and we define the table just like before, using serial as the primary key. Naturally, we do not want database A and database B to hold the same id value for different users. A record with id 1 should belong to exactly one user across the whole service. This is an important requirement because we commonly have other tables that use foreign keys or references to the user id. Imagine we have user A with id 1 on database A, and user B with id 1 on database B. Then we have an address table with a row that has id 1 and user id 1. Does it belong to user A or user B? Hence, we need to keep ids unique across all of our databases.
In order to keep the user id unique, we need a way to know the latest sequence value before inserting data into any database. Let’s say we do it the most naive way: we ask every database to return its last sequence value, take the biggest one, and increment it by one. As you can see, this introduces a bottleneck, because every insert has to query all of the databases just to find the next id.
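The cost of the naive approach shows up quickly in a toy model. The Shard class and insert_user function below are hypothetical stand-ins for the master databases, written only to count how many round trips each insert needs.

```python
class Shard:
    """Hypothetical stand-in for one master database holding part of the user table."""

    def __init__(self, name):
        self.name = name
        self.rows = {}      # id -> user name
        self.queries = 0    # number of round trips made to this shard

    def last_id(self):
        self.queries += 1
        return max(self.rows, default=0)

    def insert(self, user_id, user_name):
        self.queries += 1
        self.rows[user_id] = user_name


def insert_user(shards, target, user_name):
    # Naive coordination: ask every shard for its last id, take the
    # largest, add one, and only then write to the target shard.
    next_id = max(shard.last_id() for shard in shards) + 1
    target.insert(next_id, user_name)
    return next_id


shards = [Shard(f"db{i}") for i in range(5)]
first = insert_user(shards, shards[0], "alice")
second = insert_user(shards, shards[3], "bob")
total_queries = sum(shard.queries for shard in shards)
print(first, second, total_queries)
```

Each insert touches all five shards before it can write a single row. The sketch also ignores concurrency: two writers that read the same maximum at the same time would pick the same id, so a real system would need locking on top of this.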
I have seen a service that fixes this problem by keeping the sequence in another data store such as Redis, which is fast enough to hand out the next value. I am against this, because it’s pretty weird to run a separate data store just to track the next sequence value. In the worst case, if Redis restarts and the counter has not been persisted, we simply lose track of the sequence. Moreover, we add too much choreography to do something as simple as inserting a new record. As you can see, although this may work up to a certain level, it’s clearly an inappropriate solution.
So what’s the solution?
Instead of using serial id, use uuid as the primary key.
UUID stands for Universally Unique IDentifier.
A UUID is designed to be unique across systems without any central coordination, which makes it a good candidate for the primary key.
Remember that the UUID value is randomly generated (this holds for version 4 UUIDs, the randomly generated kind). This property means:
- It doesn’t expose information such as how many records are currently in our table.
- It cannot be easily iterated because it’s random.
- It doesn’t introduce a bottleneck for horizontal scalability, because we do not need to find out the next sequence value. We can just generate a new UUID and insert it into any of the databases.
Hence, it solves all the problems with using serial id.
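As a rough sketch of the coordination-free insert, the snippet below uses Python's standard uuid module. The five dictionaries are hypothetical stand-ins for five master databases, and picking a shard at random is just one assumed routing strategy.

```python
import random
import uuid

# Hypothetical: each dict plays the role of one master database.
shards = [dict() for _ in range(5)]


def insert_user(user_name):
    # The id is generated locally; no shard is asked for a "next" value.
    user_id = uuid.uuid4()
    shard = random.choice(shards)
    shard[user_id] = user_name
    return user_id


ids = [insert_user(f"user{i}") for i in range(1000)]
print(ids[0])
```

Each insert is a single write to a single shard, and the ids reveal neither the row count nor the id of any other row.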
What’s the drawback of using UUID?
Strictly speaking, uniqueness is not guaranteed. However, the probability of a collision is extremely small. Keep in mind that in the extremely unlikely case of colliding UUIDs, the insert will be rejected by the database thanks to the primary key constraint. In the worst case, we can simply retry the insert with a newly generated UUID.
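The retry idea might look like the sketch below. It again uses an in-memory SQLite database as a stand-in, and the users table, the insert_user helper, and its new_id and max_attempts parameters are assumptions made for the example. A collision is forced on purpose by feeding the same UUID twice, since a real one is too rare to wait for.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT NOT NULL)")


def insert_user(name, new_id=uuid.uuid4, max_attempts=3):
    """Insert a row under a fresh UUID, retrying if the primary key rejects it."""
    for _ in range(max_attempts):
        user_id = str(new_id())
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name)
                )
            return user_id
        except sqlite3.IntegrityError:
            # Assumed here to be a duplicate key: try again with a new UUID.
            continue
    raise RuntimeError("could not insert after several attempts")


# Force a collision: the second insert is first offered an id already in use.
fixed = uuid.uuid4()
candidates = iter([fixed, fixed, uuid.uuid4()])
first_id = insert_user("alice", new_id=lambda: next(candidates))
second_id = insert_user("bob", new_id=lambda: next(candidates))
print(first_id != second_id)
```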
For the curious soul out there, you can read the probability of collision in detail here.
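For a quick feel of the numbers, the collision chance can be estimated with the standard birthday-bound approximation. This assumes version 4 UUIDs, which carry 122 random bits, and a good random source. The function name collision_probability is made up for the example.

```python
import math

RANDOM_BITS = 122  # a version 4 UUID has 122 random bits; the other 6 are fixed


def collision_probability(n):
    """Birthday-bound approximation: p ≈ 1 - exp(-n(n-1) / (2 * 2**122))."""
    exponent = -n * (n - 1) / (2 * 2**RANDOM_BITS)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


p_billion = collision_probability(10**9)
p_half = collision_probability(2_710_000_000_000_000_000)
print(p_billion, p_half)
```

Under this approximation, a table with a billion rows has a collision chance far below one in a quintillion, and it takes roughly 2.71 quintillion UUIDs before the chance reaches about 50%.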
Voilà, and we are done. Now, you guys should have a better understanding of why we should graduate from using auto-increment or serial id and use UUID instead.