Which columns should you choose for indexing?

To decide on the columns of an index you have to consider the most frequent queries running against your tables. You have to examine the WHERE and JOIN clauses (ORDER BY and GROUP BY also benefit from indexes), since SQL query optimizer will try to find the best way to return the results by using a complex mechanism to evaluate available indexes and statistics. An over simplified diagram of the query optimizer’s decision making process is displayed below.

image_thumb_43

Given that query optimizer decides on the fate of your query execution plan, it is important to maintain useful indexes and current statistics. It is not necessary for the query optimizer to use your indexes, if it finds them costly, even if those indexes are on columns stated in WHERE, JOIN, ORDER BY and GROUP BY statements. Consider the following example; Table APPLICATION has a total of 13,235 rows and  a clustered index on app_id column and a non-cluster index on computer_id column.

SELECT computer_id, app_id, name FROM application
image_thumb_30

The above statement would cause the query optimizer to use a clustered index scan to return the results, since there are no filtering criteria. Consider a WHERE clause on the clustered index key (app_id)

SELECT computer_id,app_id,name FROM application WHERE app_id > 300

image_thumb_31

Since there is a WHERE clause in the statement, based on the clustered index key, the query optimizer decides that a clustered index seek is better than a clustered index scan to return a result set of 1,729 rows out of 13,235. A WHERE clause on the non-clustered index computer_id would behave differently.

SELECT computer_id,app_id,name FROM application WHERE computer_id = 5000025

image_thumb_32

Although a non-clustered index exists on the column computer_id defined by the WHERE clause, the query optimizer decided not to use it to return 428 rows out of 13,235 because its bookmark lookup behavior would cause jumps from page-to-page and would be costlier than the scanning of the entire clustered index. Consider the same query with a different value after the equation.

SELECT computer_id,app_id,name FROM application WHERE computer_id = 6000000
image_thumb_33
Although it is the same query with almost the same WHERE clause, the query optimizer determined that it would benefit from the use of the non-clustered index seek as opposed to the previous example because of the high selectivity (fewer return rows) of the WHERE clause.

Query optimizer and statistics

While query optimizer asses the fate of the execution plans based on statistics, it is quite logical for SQL server to create statistics automatically on the key columns of an index when it is created. The index statistics creation is mandatory and not to be confused with the  AUTO_CREATE_STATISTICS option (ON by default) which instructs the query optimizer to automatically create statistics on single non-indexed columns defined on query predicates (JOIN and WHERE clauses). Nevertheless the update of those statistics can be configured to not be updated automatically by using an ALTER DATABASE statement with the AUTO_UPDATE_STATISTICS option. If automatic statistics update are left to the default setting (ON), the query optimizer, in order to optimize CPU utilization, will update statistics based on the following conditions: a table without rows gets a row, a table has fewer than 500 rows and updates 500+ rows, a table has more than 500 rows and updates 500+20% of the number of total rows.

Statistics can be configured to automatically updated either synchronously (default) or asynchronously.  In synchronous updates, the query optimizer is always compiles and executes the queries with up-to-date statistics as opposed to asynchronous where only subsequent queries will take benefit of the up-to-date results that completed asynchronously. As a best practice leave statistics creations and updates ON and to SYNC mode. Consider asynchronous updates when your application experiences request time outs waiting for updated statistics or when your application frequently uses the same or similar queries and cached query plans.

Although it is not recommended, you can turn off the automatic statistics update and creation process. In such cases you can use the CREATE STATISTICS and UPDATE STATISTICS statements to maintain statistics as needed.

All information about statistics objects are contained in the sysindexes table and in a statistics binary large object (statblob).  Statistics information include the number of rows in the table or index, the number of pages occupied, number of rows modified after last statistics updates,  time of statistics collection, number of rows used to produce the histogram, average key length, single-column histogram with the number of steps, estimated number of rows matching the filter, etc.

A histogram is a set of up to 200 values of a column. The sample or the complete set of column values are sorted and divided into up to 199 intervals so that the most statistically significant information is captured. The information on the range of index key values between two consecutive samples is called a step and contains information about:

  • RANGE_HI_KEY: the most significant value of a step
  • EQ_ROWS: the number of rows equal to RANGE_HI_KEY
  • RANGE_ROWS: the range of rows between the previous RANGE_HI_KEY and the current RANGE_HI_KEY
  • AVG_RANGE_ROWS: The average number of rows per distinct value within RANGE_ROWS
  • DISTINCT_RANGE_ROWS:  the number of distinct rows within RANGE_ROWS

Why query optimizer behaved differently in the above example?

Lets analyze the statistics of the application table to understand the different behavior. To find out the available statistics within application table:

sp_helpstats N'TEST.dbo.application', 'ALL'

image_thumb_34

The results indicate that two objects have been created; one for my clustered index CIX_APP_ID on column app_id and another for my non-clustered index IX_COMPUTER_ID on column computer_id. To gather some more information about these indexes:

SELECT name, indid, minlen, xmaxlen, dpages, reserved, used,
rowmodctr,rows
FROM sysindexes WHERE name='CIX_APP_ID'
OR name ='IX_COMPUTER_ID'
image_thumb_35

The results are indicating that CIX_APP_ID is a clustered index (indid=1) while IX_COMPUTER_ID is a non clustered (indid>1). Also there are some useful statistics regarding the minimum and maximum length of a row (minlen and xmaxlen), the number of index pages reserved (reserved) and used (used), the total number of rows (rows) and the total number of rows that inserted, updated or deleted since the last statistics update (rowmodctr). If you recall the conditions on which the query optimizer decides to auto-update statistics, you will realize that the CIX_APP_ID index won’t update its statistics until rowmodctr reach the value of 500+(13235*0,20) = 3147 , meaning that 3146 (3146 – current rowmodctr)updates need to take place before auto-stats fire up on that index.

If you analyze the histogram of the IX_COMPUTER_ID statistics object you will realize that the query optimizer checked the EQ_ROWS values and determined that the filter criteria on the computer_id index key with value 5000025 would return 428 rows (EQ_ROWS=428) while on the 6000000 value would return only 1 row (EQ_ROWS=1), thus it determined that it would not benefit from the non-clustered index on the first filtering criteria.

DBCC SHOW_STATISTICS (N'TEST.dbo.application', IX_COMPUTER_ID)

image_thumb_36

The impact of outdated statistics on query performance

To even better understand the impact of statistics consider the following example. In the application table there is only one row with computer_id value of 6000000. Let’s disable automatic updates on application table and add 4499 rows with computer_id value of 600000.

USE 
[master] GO ALTER DATABASE [TEST] SET AUTO_UPDATE_STATISTICS OFF WITH NO_WAIT DECLARE @ROWS INT SET @ROWS=1 WHILE (@ROWS < 4499) BEGIN INSERT INTO TEST.dbo.application (computer_id,app_id,name) VALUES (6000000,100,'MyApp') SET @ROWS = @ROWS+1 END GO
If you check the histogram of the statistics you will get that key 6000000 has EQ_ROWS=1 which is not true since we added 4499 rows meaning that EQ_ROWS=4500.image_thumb_37
By executing a select statement to return all rows with computer_id equal to 6000000, you get the following results
SET STATISTICS IO ON
SELECT computer_id,app_id,name FROM application WHERE computer_id = 6000000

image_thumb_38

image_thumb_39

Consider updating the outdated statistics to check out the different results.

GO
UPDATE STATISTICS application IX_COMPUTER_ID
GO

GO
SET STATISTICS IO ON
SELECT computer_id,app_id,name FROM application WHERE computer_id = 6000000

GO

image_thumb_40

image_thumb_41

It is quite clear that statistics affected the query performance. While outdated statistics  caused the query optimizer to use a nested loop and an IO operation of 9014 logical reads, updated statistics switched query optimizer to use a clustered index scan with an IO operation of 217 logical reads, thus 41,5 times better performance! That’s the importance of updated IO statistics.

If the statistics automatic update option were ON but in asynchronous mode, the results on the first execution would be a nested loop join and a clustered index scan on the second, with the same results as above. The diagram below explains that in the first execution the current statistics for RANGE_HI_KEY=6000000 were EQ_ROWS=1   while on the second became EQ_ROWS=4500.

image_thumb_42