Primary Key vs Surrogate Key When Working with 3rd Party Database

Which key should you use as primary key when working with a third party database?

At our company, we use Prisma relational database, and we work with information from a third-party database, something similar to Twitter or Reddit. Let's use Twitter as an example for this blog.

And our decision was how to model our "User" and "Tweets"? Should we use (a) Twitter's IDs as the primary key? (b) Or should we create our own surrogate key, and put Twitter's IDs in another `twitterId` field in the User model?

(a) The benefits of following Twitter's primary key are:

  • Less debugging headache. When we scrape Twitter's database for User and Tweets, we can be assured that the ID they are using is the primary key in our database. Thus, when we are checking for errors or inconsistencies in the information scraped from Twitter, we can easily look up these Users and Tweets based on their primary keys in our database.

(b) The benefits of having our surrogate key are:

  • Less coupling with Twitter's database. If something goes wrong or missing in Twitter's database, our program will be less likely to be affected.

  • Capability to create "User" and "Tweet" without necessarily having to have an existing User and Tweet on Twitter. With this, we can extend the functionalities of our program. For example, we can store user information even before we scrape his Tweets on Twitter.

In the end, we chose a hybrid solution, which was to use a surrogate key in the beginning (when we initially create our User model). And then replace our primary key with Twitter's primary key the first time we scrape the users for their details.

This lets us enjoy the best of both worlds - being able to store Users and Tweets which don't exist in Twitter's database (using surrogate key as primary key), while also being able to keep track of existing Users and Tweets on Twitter (using Twitter id as primary key).

However, we must take note that changing primary keys in the middle of operation may not always be a good idea. This leads to a problem of foreign keys being unlinked.

Imagine if a related "Image" was linked to the our Tweet, and we changed the primary key of the Tweet. We would have to change the foreign key of the Image as well, so that the Image would always be linked to the Tweet.

It is fortunate that Prisma database does that automatically for us. Once we change the primary key of a Tweet, it relinks all the related data such as "Image" back to the Tweet, effectively changing the TweetId foreign key inside of the "Image" table.

The upshot of this is that, when working with third party data, whether you use a primary key or a surrogate key is up to the needs of your company. If you want the model to be as close as possible to the third party database and want to ensure consistency, you have the option use their primary key. However, if you'd like to extend the functionality of their database, then perhaps surrogate key is more appropriate for your use case. You can also use a hybrid solution if your database allows you to change primary keys in the middle of operation.

Let me know what you think!