Database Indexes by Example in Postgres
Subtle changes to SQL queries and schemas can have a huge impact on an application's performance. Sometimes it is hard to keep the database layer in mind while implementing the application (this happens to me too), especially when applying small fixes to a production application after a long time.
This blog post is not a performance guide for SQL by any means. It is a showcase of one particular pitfall, intended to
- raise overall awareness
- motivate to invest time into learning about SQL databases, even when (or especially when) using ORMs
This is a very verbose step-by-step walk-through I used in an internal presentation on my OS X machine. You need to have Docker installed to continue.
First let's start a throw-away local postgres database and connect to it as admin user.
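For example, using the official postgres Docker image (the container name and password here are placeholders of my choosing):

```shell
# start a throw-away Postgres container in the background
docker run --rm -d --name pg-index-demo \
  -e POSTGRES_PASSWORD=secret -p 5432:5432 postgres

# connect to it as the default admin user
docker exec -it pg-index-demo psql -U postgres
```

The --rm flag removes the container once it is stopped, so nothing is left behind.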
Now you should be connected to your database. We want to execute some queries and are mostly interested in the execution time, much less in the data. So let's enable printing of execution times for the following queries.
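In psql this is a built-in meta-command:

```sql
-- toggle printing of the execution time after every statement
\timing
```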
Now let's create a table for our sample data. Note that we have one primary key, one column with a unique constraint and one column with neither.
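A minimal sketch of such a table (the table name and column types are assumptions, not necessarily the original schema):

```sql
CREATE TABLE customers (
    id          serial PRIMARY KEY,          -- primary key
    customer_id varchar(10) UNIQUE NOT NULL, -- unique constraint
    name        varchar(255) NOT NULL        -- neither
);
```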
Performance is never an issue with small data sets. So let's add 10 million customers.
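One way to generate that much sample data directly in SQL (a sketch; the exact fake values do not matter, only the row count does):

```sql
INSERT INTO customers (customer_id, name)
SELECT
    upper(lpad(to_hex(i), 10, '0')),      -- guaranteed-unique, upper-cased ID
    initcap(substr(md5(i::text), 1, 12))  -- pseudo-random name
FROM generate_series(1, 10000000) AS i;
```

Expect this to run for a while on a laptop.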
Done — now we can query a customer by a given Customer ID. This works as expected and we get the result after a short time. This is fine.
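Assuming the sample data from above, such a lookup might look like this (the concrete ID value is only an example):

```sql
SELECT * FROM customers WHERE customer_id = '00000F4240';
```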
However, things look quite different when we query the same customer by name. Execution time increases by two orders of magnitude.
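The same lookup against the name column, which has neither an index nor a constraint (again, the concrete value is only illustrative):

```sql
SELECT * FROM customers WHERE name = 'Somename';
```

Prefixing the query with EXPLAIN ANALYZE would confirm a sequential scan over the whole table.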
At first sight this looks quite arbitrary. In order to understand what is going on, let's take a closer look at our database. Some things to know about PostgreSQL:
- our table is in the public schema
- \dt shows table information
- \di shows index information
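Inside psql this looks like:

```sql
\dt public.*   -- list tables in the public schema
\di public.*   -- list indexes in the public schema
```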
In a nutshell, indexes in Postgres (and other databases) are data structures optimized for finding rows. Creating the table automatically added two indexes as well. For the two columns id and customer_id it is safe to assume that we often have to search rows:
- id — usually the primary key is used to select an individual row
- customer_id — to enforce the Unique Constraint before each write operation rows must be searched for possible customer-ID collisions
If we want to find rows by applying filters to indexed columns, queries are rather fast because the index is optimized for this use case. If we filter by name though, the entire table is scanned sequentially. Reading and checking every row simply takes its time.
Note that indexes improve read performance but come with a penalty for write performance. So we do not want to index everything.
Now let's look at a straightforward solution: manually adding an index for the name column.
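A basic version of this (the index name is my choice):

```sql
CREATE INDEX customers_name_idx ON customers (name);
```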
This does the trick and everything is fine for the moment. Note that creating the index took rather long. Creating an index like that write-locks the entire table and prevents other queries from writing to it. This is not always an option in a production environment. See the CONCURRENTLY flag for an alternative.
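The concurrent variant does not block writes, at the cost of a slower build, and it cannot run inside a transaction block:

```sql
CREATE INDEX CONCURRENTLY customers_name_idx ON customers (name);
```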
Now let's assume several months have passed and the customer requests a new feature: finding customers by name in a case-insensitive manner. This might be done with a tiny adjustment to the SQL query.
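For example, by lower-casing both sides of the comparison:

```sql
SELECT * FROM customers WHERE lower(name) = lower('Somename');
```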
Note that performance dropped again by several orders of magnitude. Why? Because our index only contains the original names, not the lower-cased ones. This means the query once again reads every row of the table. Even worse: it has to lower-case every name as well.
Fortunately we can add an index for the lower-cased names as well. You might also want to drop the original name index, unless it is used somewhere else in the application.
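A sketch of both steps (index names are again my choice):

```sql
-- an expression index, so lower(name) lookups can use it
CREATE INDEX customers_lower_name_idx ON customers (lower(name));

-- optional: drop the old index if nothing else relies on it
DROP INDEX customers_name_idx;
```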
Again this saved the day. Note however that an update to a name now requires updating two indexes. The performance penalty for write operations increased slightly further, as we now have four indexes on our table.
In some cases you might want to adjust the offending query instead of creating an index. Let's assume the following SQL query to find customers by their ID.
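The original query is not shown here, but a plausible offending version would be a case-insensitive match on the ID column, which bypasses the unique index:

```sql
SELECT * FROM customers WHERE lower(customer_id) = lower('00000f4240');
```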
We could add another index, and this would improve performance. However, in our case we can simply assume that all customer IDs are upper-cased (and enforce this in the application). If we can make that assumption, there is no need for an additional index.
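Instead, normalize the input once and compare directly, so the existing unique index on customer_id is used:

```sql
SELECT * FROM customers WHERE customer_id = upper('00000f4240');
```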
Thanks for reading! As usual, questions and feedback are very welcome.