Composite Indexes¶
Since not all countries with a population >5M are in the continent of Asia, the combination of these two predicates should improve the ability for indexes to eliminate work. That is to say that with this dataset, composite indexes will improve selectivity.
There are two potential choices for composite indexes:
- An index on
p_c (Population, Continent)
- An index on
c_p (Continent, Population)
The difference in ordering in composite indexes matters significantly. In fact because population is a range, the optimizer will only be able [1] to use the first portion of p_c (population, continent)
index. Thus p_c (population, continent)
offers no improvement over an index of only p (population)
. We can see this clearly after forcing the index to be used.
Example 10: The composite index on (population, continent) is a poor choice
ALTER TABLE Country ADD INDEX p_c (Population, Continent);
EXPLAIN FORMAT=JSON
SELECT * FROM Country FORCE INDEX (p_c) WHERE continent='Asia' and population > 5000000;
{
"query_block": {
"select_id": 1,
"cost_info": { # The index has been forced.
"query_cost": "152.21" # Cost is higher than a table scan
},
"table": {
"table_name": "Country",
"access_type": "range",
"possible_keys": [
"p_c"
],
"key": "p_c",
"used_key_parts": [ # Only the population column
"Population" # has been used.
],
"key_length": "4", # population=4 byte integer
"rows_examined_per_scan": 108,
"rows_produced_per_join": 15,
"filtered": "14.29",
"index_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))",
"cost_info": {
"read_cost": "149.12",
"eval_cost": "3.09",
"prefix_cost": "152.21",
"data_read_per_join": "3K"
},
"used_columns": [
"Code",
"Name",
"Continent",
"Region",
"SurfaceArea",
"IndepYear",
"Population",
"LifeExpectancy",
"GNP",
"GNPOld",
"LocalName",
"GovernmentForm",
"HeadOfState",
"Capital",
"Code2"
]
}
}
}
This restriction is due to the indexing structure of a B+tree. A shorthand way to remember it is “ranges to the right” in composite indexes. With this considered, an index on c_p (continent, population)
is demonstrated in Example 11. The combination of these two columns has improved selectivity over an index on Continent alone, and thus the cost has dropped from 28.20 (Example 7) to 24.83. As the two columns are effectively concatenated in the index, the access method is considered a range.
Example 11: A more suitable composite index on Continent, Population
ALTER TABLE Country ADD INDEX c_p (Continent, Population);
EXPLAIN FORMAT=JSON
SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "24.83" # The cost is much lower than the
}, # composite index on p,c (152.21)
"table": {
"table_name": "Country",
"access_type": "range",
"possible_keys": [
"p",
"c",
"p_c",
"c_p"
],
"key": "c_p",
"used_key_parts": [ # Both columns have been used
"Continent", # 1B for ENUM,
"Population" # 4B for INT
],
"key_length": "5", # =5B
"rows_examined_per_scan": 32,
"rows_produced_per_join": 15,
"filtered": "100.00",
"index_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))",
"cost_info": {
"read_cost": "18.00",
"eval_cost": "3.09",
"prefix_cost": "24.83",
"data_read_per_join": "3K"
},
"used_columns": [
"Code",
"Name",
"Continent",
"Region",
"SurfaceArea",
"IndepYear",
"Population",
"LifeExpectancy",
"GNP",
"GNPOld",
"LocalName",
"GovernmentForm",
"HeadOfState",
"Capital",
"Code2"
]
}
}
}
Determining the order for composite indexes¶
Determining the correct order of columns in a composite index is tricky. Here are some considerations to be aware of:
- Left most rule. An index on
(First Name, Last Name)
can be used to also satisfy queries that need an index on(First Name)
, but not on queries that require an index on(Last Name)
. Try and design composite indexes in such a way that they can be reused by the greatest number of queries possible. - Ranges to the right. An index on
(Age, First Name)
can not be used to satisfy a query in the form ofWHERE age BETWEEN x and y AND first_name = 'John'
[1] . Or to state more specifically: the rest of the composite index will not be used after the first range condition. - Most selective columns to the left. Think about how you can eliminate work in the index as fast as possible. This usually has the effect of improving memory fit because fewer index pages need to be accessed.
- Be careful changing index order. Mixing
ASC
orDESC
may affect how much of a composite index can be used.
[1] | (1, 2) An exception to this is a range of fixed values, i.e. IN (1,2,3,4,5) , which is able to use the full index on both population and continent (p_c) . |