Database Indexes by Example in Postgres
Subtle changes to SQL queries and schemas can have a huge impact on an application's performance. Sometimes it is hard to keep the database layer in mind while implementing the application (this happens to me too), especially when applying small fixes to a production application after a long time.
This blog post is not a performance guide for SQL by any means. It is a showcase of one particular pitfall, intended to
- raise overall awareness
- motivate to invest time into learning about SQL databases, even when (or especially when) using ORMs
This is a very verbose step-by-step walk-through I used in an internal presentation on my OS X machine. You need to have Docker installed to continue.
First let's start a throw-away local postgres database and connect to it as admin user.
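For example, using the official postgres Docker image (the container name and password here are placeholders of my choosing):

```shell
# start a throw-away Postgres container in the background
docker run --rm -d --name pg-index-demo \
  -e POSTGRES_PASSWORD=secret -p 5432:5432 postgres

# connect to it as the default admin user
docker exec -it pg-index-demo psql -U postgres
```

The --rm flag removes the container once it is stopped, so nothing is left behind.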
Now you should be connected to your database. We want to execute some queries and are mostly interested in the execution time, much less in the data. So let's enable printing of execution times for the following queries.
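In psql this is a built-in meta-command:

```sql
-- toggle printing of the execution time after every statement
\timing
```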
Now let's create a table for our sample data. Note that we have one primary key, one column with a unique constraint and one column with neither.
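A minimal sketch of such a table (the table name and column types are assumptions, not necessarily the original schema):

```sql
CREATE TABLE customers (
    id          serial PRIMARY KEY,          -- primary key
    customer_id varchar(10) UNIQUE NOT NULL, -- unique constraint
    name        varchar(255) NOT NULL        -- neither
);
```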
Performance is never an issue with small data sets. So let's add 10 million customers.
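One way to generate that much sample data directly in SQL (a sketch; the exact fake values do not matter, only the row count does):

```sql
INSERT INTO customers (customer_id, name)
SELECT
    upper(lpad(to_hex(i), 10, '0')),      -- guaranteed-unique, upper-cased ID
    initcap(substr(md5(i::text), 1, 12))  -- pseudo-random name
FROM generate_series(1, 10000000) AS i;
```

Expect this to run for a while on a laptop.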
Done — now we can query a customer by a given Customer ID. This works as expected and we get the result after a short time. This is fine.
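Assuming the sample data from above, such a lookup might look like this (the concrete ID value is only an example):

```sql
SELECT * FROM customers WHERE customer_id = '00000F4240';
```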
However, things look quite different when we query the same customer by name. Execution time increases by two orders of magnitude.
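The same lookup against the name column, which has neither an index nor a constraint (again, the concrete value is only illustrative):

```sql
SELECT * FROM customers WHERE name = 'Somename';
```

Prefixing the query with EXPLAIN ANALYZE would confirm a sequential scan over the whole table.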
At first sight this looks quite arbitrary. In order to understand what is going on, let's take a closer look at our database. Some things to know about PostgreSQL:
- our table is in the public schema
- \dt shows table information
- \di shows index information
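Inside psql this looks like:

```sql
\dt public.*   -- list tables in the public schema
\di public.*   -- list indexes in the public schema
```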
In a nutshell, indexes in Postgres (and other databases) are data structures optimized for finding rows. Creating the table automatically added two indexes as well. For the two columns id and customer_id it is safe to assume that we often have to search rows:
- id — usually the primary key is used to select an individual row
- customer_id — to enforce the Unique Constraint before each write operation rows must be searched for possible customer-ID collisions
If we want to find rows by applying filters to indexed columns, queries are rather fast because the index is optimized for this use case. If we filter by name though, the entire table is scanned sequentially. Reading and checking every row simply takes its time.
Note that indexes improve read performance but come with a penalty for write performance. So we do not want to index everything.
Now let's look at a straightforward solution: manually adding an index for the name column.
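A basic version of this (the index name is my choice):

```sql
CREATE INDEX customers_name_idx ON customers (name);
```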
This does the trick and everything is fine for the moment. Note that creating the index took rather long. Creating an index like that write-locks the entire table and prevents other queries from writing to it. This is not always an option in a production environment. See the CONCURRENTLY flag for an alternative.
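The concurrent variant does not block writes, at the cost of a slower build, and it cannot run inside a transaction block:

```sql
CREATE INDEX CONCURRENTLY customers_name_idx ON customers (name);
```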
Now let's assume several months have passed and the customer requests a new feature: finding customers by name in a case-insensitive manner. This might be done with a tiny adjustment to the SQL query.
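For example, by lower-casing both sides of the comparison:

```sql
SELECT * FROM customers WHERE lower(name) = lower('Somename');
```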
Note that performance dropped again by several orders of magnitude. Why? Because our index only contains the original names, not the lower-cased ones. This means the query once again reads every row of the table. Even worse: it has to lower-case every name as well.
Fortunately we can add an index for the lower-cased names as well. You might also want to drop the original name index, unless it is used somewhere else in the application.
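A sketch of both steps (index names are again my choice):

```sql
-- an expression index, so lower(name) lookups can use it
CREATE INDEX customers_lower_name_idx ON customers (lower(name));

-- optional: drop the old index if nothing else relies on it
DROP INDEX customers_name_idx;
```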
Again this saved the day. Note however that an update to a name now requires updating two indexes. The performance penalty for write operations increased slightly further, as we now have four indexes on our table.
In some cases you might want to adjust the offending query instead of creating an index. Let's assume the following SQL query to find customers by their ID.
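The original query is not shown here, but a plausible offending version would be a case-insensitive match on the ID column, which bypasses the unique index:

```sql
SELECT * FROM customers WHERE lower(customer_id) = lower('00000f4240');
```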
We could add another index, and this would improve performance. However, in our case we can simply assume that all customer IDs are upper-cased (and enforce this in the application). If we can make that assumption, there is no need for an additional index.
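Instead, normalize the input once and compare directly, so the existing unique index on customer_id is used:

```sql
SELECT * FROM customers WHERE customer_id = upper('00000f4240');
```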
Thanks for reading! As usual, questions and feedback are very welcome.