The purpose of this article is to explain (mainly to myself) a technique for creating good indexes for mysql database.
Before starting I would like to discuss about selectivity
. Index selectivity is the ratio of number of distinct indexed values ( the cardinality
) over the total numbers of rows in the table (#T), and ranges from 1/#T to 1. A unique index has the selectivity of 1. Cardinality equal to #T (number of rows in the table) over number of rows in the table will give us selectivity = 1. A highly selective index is good because it lets Mysql filter out more rows when it looks for matches.
It is possible to calculate selectivity of a table with cities using this query:
SELECT COUNT(DISTINCT(city))/COUNT(*) FROM db.cities;
In my case it returned:
COUNT(DISTINCT(city))/COUNT(*): 0.0312
In order to save index space we can create an index on first N characters, by choosing N it is important to have a similar cardinality as it is when we calculated it using full length.
Here’s how to find the selectivity for several prefixes lengths in one query:
SELECT COUNT(DISTINCT(LEFT(city,3)))/COUNT(*) as sel3, COUNT(DISTINCT(LEFT(city,4)))/COUNT(*) as sel4, COUNT(DISTINCT(LEFT(city,5)))/COUNT(*) as sel5, COUNT(DISTINCT(LEFT(city,6)))/COUNT(*) as sel6, COUNT(DISTINCT(LEFT(city,7)))/COUNT(*) as sel7 FROM db.cities;
-------------------------------------------- sel3 | sel4 | sel5 | sel6 | sel7 | ------------------------------------------- 0.0239 | 0.0293 | 0.0305 | 0.0309 | 0.0310 | --------------------------------------------
This query shows that increasing the prefix length results in successively smaller improvements as it approaches seven characters.
It’s not a good idea to look only at average selectivity. The caveat is that the worst case selectivity matters, too. The average selectivity might make you think a four-or-five characters prefix is good enough, but if your data is very uneven, that could be a trap – we will see that there are more cities starting which a relatively small number of characters which will result in incorrectly distributed index values (we are talking about B-Tree data structure which is used by default in Mysql).
SELECT COUNT(*) as cnt, LEFT(city, 4) AS pref FROM db.cities GROUP BY pref ORDER BY cnt DESC LIMIT 5;
----------- cnt | pref ----------- 205 | San 200 | Sant 135 | Sout 104 | Chan 91 | Toul -----------
With four characters, the most frequent prefixes occur quite a bit more often than the most frequent full-length values. That is, the selectivity of those values is lower than the average selectivity.
Now that we’ve found a good value for our sample data, here is how to create a prefix index on the column:
ALTER TABLE db.cities ADD KEY (city(7));
Sometimes suffix indexes make sense (e.g., for finding all email addresses from a certain domain). MySQL doesn’t support reversed indexes natively, but it is possible to store reversed strings and index a prefix of it. It is possible to maintain the index with triggers.
Multicolumn indexes
Multicolumn indexes are often very poorly understood. Common mistakes are to index many or all the columns separately, or to index columns in the wrong order.
This strategy of indexing often results when people give vague but authoritative sounding advice such as “create indexes on columns that appear in the WHERE clause”. This advice is wrong. These indexes can be many orders of magnitude slower than truly optimal indexes.
Individual indexes on lots of columns won’t help MySQL improve performance for most queries. MySQL 5.0 and newer can cope a little bit such poorly indexed tables by using a strategy called index merge.
Choosing a Good Column Order
The correct order depends on the queries that will use the index, you must think about how to choose index order such that rows are sorted and grouped in a way that will benefit the query (this refers to B-Tree, hash indexes will not benefit of this).
The order of columns in a multicolumn B-Tree index means that the index is sorted first by the leftmost column, then by next column, and so on. Therefore, the index can scanned in either forward or reverse order, to satisfy queries with ORDER BY, GROUP BY, and DISTINCT clauses that match the column order exactly.
The column order is vitally important in a multicolumn index. There is an old rule of thumb for choosing column order: place the most selective columns first in the index. How useful is this suggestion? It can be helpful in some cases, bit it’s usually much less important than avoiding random I/O on sorting.
Placing the most selective columns first can be a good idea when there is no sorting or grouping to consider, and this the purpose of the index is only to optimize WHERE lookups. You might actually need to choose the column order such that it’s as selective as possible for the queries that you’ll run most.
Let’s use the following query as an example:
SELECT * FROM payment WHERE staff_id = 2 AND customer_id = 584;
Should you create an index on (staff_id, customer_id) or should you reverse column order? We can run some quick queries to help examine the distribution of values in the table and determine which column has a higher selectivity. Let’s transform the query count the cardinality of each predicate in the WHERE clause:
SELECT SUM(staff_id=2), SUM(customer_id=584) FROM payment\G SUM(staff_id=2): 7992 SUM(customer_id=584): 30
According to the rule of thumb, we should place customer_id first in the index, because the predicate matches fewer rows in table. We can then run the query again to see how selective staff_id is within the range of rows selected by this specific customer ID:
SELECT SUM(staff_id=2) FROM payment WHERE customer_id=584\G SUM(staff_id=2): 17
We should be careful with this technique, because the results depend on the specific constants supplied for the chosen query. If you optimize your indexes for this query and other queries don’t fare as well, the server’s performance might suffer overall, or some queries might run unpredictably.
If you are using the “worst” sample query from a report from a tool such as pt-query-digest, this technique can be an effective way to see what might be the most helpful indexes for your queries and your data. But if you don’t have specific samples to run, it might be better to use the old rule of thumb, which is to look at the cardinality across the board, not just for one query:
SELECT COUNT(DISTINCT staff_id)/COUNT(*) as staff_id_selectivity, COUNT(DISTINCt customer_id)/COUNT(*) as customer_id_selectivity, COUNT(*) FROM payment \G staff_id_selectivity: 0.00001 customer_id_selectivity: 0.0373 COUNT(*): 16049
customer_id has higher selectivity, so again the answer is to put that column first in the index:
ALTER TABLE payment ADD KEY(customer_id, staff_id);
This rule of thumb is something that may not work in some special cases.