Sadraskol

Could not create unique index: how to solve duplication errors

Unique indexes enforce the uniqueness (as their name implies) of one or more columns' values. They can guarantee one-to-one relationships in your model. For instance, let's have a unique shopping cart per user:

=# create unique index uidx_shopping_carts_on_user_id on shopping_carts (user_id);

ERROR:  could not create unique index "uidx_shopping_carts_by_user_id"
DETAIL:  Key (user_id)=(2) is duplicated.

Wow! What was that? It seems that the index can not be created: some existing rows already infringe the constraint.

Oopsy

Let's explore the data a bit. Can we spot culprits?

=# select user_id from shopping_carts order by user_id;

 user_id 
---------
       1
       2
       2
       3
       4
       5

Oopsy, "2" appears twice. This can be quite an issue. It can go wrong for 3 reasons:

Unfortunately, postgres cannot create a unique index as long as the duplicate items exist.

First things first: how many duplicates do you have? Some self joining request does the trick:

=# select
-=   count(*)
-= from shopping_carts a
-= join (select
-=   user_id,
-=   count(*)
-= from shopping_carts
-= group by user_id
-= having count(*) > 1) as b
-=   on b.user_id = a.user_id; 

 count 
-------
 76533

There are too many duplicates for a manual cleanup... Create a dashboard to track the duplicate injection rate. If your table already includes a created_at datetime, you can use a variation of the previous query:

=# select
-#   last_occurrence,
-#   count(*)
-# from shopping_carts a
-# join (select
-#   user_id,
-#   max(date(created_at)) as last_occurrence,
-#   count(*)
-# from shopping_carts
-# group by user_id
-# having count(*) > 1) as b
-#   on b.user_id = a.user_id
-# group by last_occurrence
-# order by last_occurrence DESC;

 last_occurrence | count 
-----------------+-------
 2022-09-13      |   193
 2022-09-12      |   484

We have the duplicate rate per day. This is useful to check if our future fixes work.

Remove the root cause

Let's consider your endpoint implements the following pseudo-java code:

public void create() {
  insert().into("shopping_carts").values(...);
}

A quick fix is to check the existence of the row before creating it:

public void create() {
  if (!exists().shopping_carts().where(user_id: current_user_id)) {
    insert().into("shopping_carts").values(...);
  }
}

This way no duplicate creation, right? You deploy and check the numbers the next day:

 last_occurrence | count 
-----------------+-------
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484

Wait what? Still more duplicates!!!!

Transactions

This is a race condition happening. Two requests check existence and insert at the same time.

request 1request 2select count(*) from shopping_carts where id = 2
exists()... == false-0
-exists()... == false0
insert()...-1
-insert()...2

Transactions are gonna help, but not directly.

public void create() {
  within_transaction(() => {
    if (!exists().shopping_carts().where(user_id: current_user_id)) {
      insert().into("shopping_carts").values(...);
    }
  })
}

You deploy this code and watch its effects:

 last_occurrence | count 
-----------------+-------
 2022-09-15      |    15
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484

The problem is not solved. Transactions are configured to committed read isolation by default. If you replay the previous race condition under this isolation, you'll find no violation. Transactions do not avoid race conditions. Fortunately, there are tools to stop the bleeding: locks.

Locks are usually managed by your database for operations like indexation, etc. Locks are a common way to mutually exclude processes from shared resources. You can also manually lock a row. Mutual exclusion is preventing 2 processes to access the same resource.

In postgres you can lock rows or tables using the for udpate keyword in a select query:

public void create() {
  within_transaction(() => {
    if (execute("select * from shopping_carts where user_id = ? for update;", current_user_id).empty()) {
      insert().into("shopping_carts").values(...);
    }
  })
}

You deploy and check the numbers the next day:

 last_occurrence | count 
-----------------+-------
 2022-09-16      |    11
 2022-09-15      |    15
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484

Oopsy, the lock does not work? Because the select query returns no rows, so no lock is upheld. We need to lock something that exists. Great thing that we have a unique user!

public void create() {
  within_transaction(() => {
    execute("select * from users where user_id = ? for update;", current_user_id);
    if (!exists().shopping_carts().where(user_id: current_user_id)) {
      insert().into("shopping_carts").values(...);
    }
  })
}

Here the transaction is finally helpful. The lock is released when the transaction is committed. You can deploy and see the result:

 last_occurrence | count 
-----------------+-------
 2022-09-16      |    11
 2022-09-15      |    15
 2022-09-14      |     8
 2022-09-13      |   193
 2022-09-12      |   484

No line for 2022-09-17: victory! Next, we remove duplicates and create the unique index.

Removing duplicates

Removing duplicates can be a touchy matter. I find the following heuristic true most of the time: the latest modified row is the most updated. Some frameworks automatically generate updated_at columns. This can be useful to apply the heuristic. We use updated_at column to discriminate the old rows to delete:

select
  b.id
from shopping_carts a
join (select
  user_id,
  max(updated_at) AS last_updated_at
from shopping_carts
group by user_id
having (count(*) > 1)) b
  on a.user_id = b.user_id
  and a.updated_at < b.last_updated_at

This way only the last updated rows remain. Let's check if there's no duplicate left:

=# select
-=   count(*)
-= from shopping_carts a
-= join (select
-=   user_id,
-=   count(*)
-= from shopping_carts
-= group by user_id
-= having count(*) > 1) as b
-=   on b.user_id = a.user_id; 

 count 
-------
     0

Nice!

Creating the unique index

With no duplicate in the table left, we can create the index:

=# create unique index uidx_shopping_carts_on_user_id on shopping_carts (user_id);

This time, there's no error. Every user has at most one shopping cart. When attempting to create duplicates, you encounter the following error:

=# insert into shopping_carts(user_id) values (2), (2);

ERROR:  duplicate key value violates unique constraint "uidx_shopping_carts_by_user_id"

No process can violate the business rule, even under race conditions! Great value for the buck if you ask me.

Conclusion

I hope you learned some sql knowledge. Here are the main takeaways:

  1. Learn some sql syntax to navigate easily in your data
  2. Columns created_at and updated_at managed by your ORM are great tools for maintenance
  3. Transactions do not avoid race conditions
  4. Introduce unique constraints before you feed the data (this applies to other constraints)

I had to solve a similar problem at work and we are hiring talented engineers to solve this kind of problems. Take care.


Bonus: what if you cannot lock a row? You can guarantee unicity outside of a one-to-one relationship.

There's an alternative to locking to check unicity. Remember that read committed isolation does not avoid race conditions? Well, serializable isolation can! It is a whole topic on its own that I won't cover here. Hit me up if you need more information.