So, why do you need to index your tables?

Because without an index the SQL server has to scan the entire table to return the requested data. It is like the index page in a book. You check within the index for the keyword you want to learn about. From that point forward, you jump directly to the page where the content belongs, instead of scanning page by page for the material you want to read.

Similarly, a table’s index allows you to locate the data without the need to scan the entire table. You create instead indexes on one or more columns in a table, to aid in the query process. Consider the following example:

SELECT 
[computer_id],[nic_device_id],[nic_vendor_id],[nic_desc] FROM [eXpress].[dbo].[nics]
image_24

You are about to retrieve all computers with computer_id > 5100. The SQL server will have to scan the entire table in order to return the results, without the presence of an index. An index on the computer_id column will speed up this process, by sorting the column’s values.

image_thumb_25

Now, if you want to return all data where computer_id > 5100, the SQL server will know how to locate the first value greater than 5100. Why? Because the computer_id column is sorted, eliminating the need to scan the entire table, thus improving performance.

SQL Index Types

There are two main index types: Clustered index and Non-Clustered index.

A clustered index alters the way that the rows are physically stored. When you create a clustered index on a column (or a number of columns), the SQL server sorts the table’s rows by that column(s).

It is like a dictionary, where all words are sorted in an alphabetical order. Note, that only one clustered index can be created per table. It alters the way the table is physically stored, it couldn’t be otherwise.

In the example below, all rows are sorted by computer_id, as a clustered index on the computer_id column has been created.

CREATE CLUSTERED INDEX [IX_CLUSTERED_COMPUTER_ID] 
ON [dbo].[nics] ([computer_id] ASC)

A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. Instead, it creates a completely different object within the table, that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data.

It is like an index in the last pages of a book. All keywords are sorted and contain a reference back to the appropriate page number. A non-clustered index on the computer_id column, in the previous example, would look like the table below:

CREATE NONCLUSTERED INDEX [IX_NONCLUSTERED_COMPUTER_ID] 
ON [dbo].[nics] ([computer_id] ASC)
Computer_id Row Locator
5000025 234
5000031 345
5000045 112
5000046 348
5000055 234
5000059 984

Note, that SQL server sorts the indexes efficiently by using a B-tree structure. This is a tree data structure that allows SQL Server to keep data sorted, to allow searches, sequential access, insertions and deletions, in a logarithmic amortized time. This methodology minimizes the number of pages accessed, in order to locate the desired index key, therefore resulting in an improved performance.

Relation between clustered and non-clustered indexes

As explained above, a non-clustered index contains a pointer back to the rowID (RID), of the table, in order to relate the index’s column with the rest of the columns in a row.

But this is not always the case:

If a clustered index already exists in the table, the non-clustered index uses the clustered index’s key as the row locator, instead of the RID reference.

In the example below, when a non-clustered index is created on the computer_id column and a clustered index already exists on the nic_desc, the non-clustered index would look like the table below.

Computer_id Row Locator
5000025 VMware Accelerated AMD PCNet Adapter
5000031 Intel(R) PRO/100 VE Network Connection
5000045 Broadcom 440x 10/100 Integrated Controller
5000046 Broadcom NetXtreme 57xx Gigabit Controller
5000055 VMware Accelerated AMD PCNet Adapter
5000059 Broadcom 440x 10/100 Integrated Controller

Index Benefits and Side Effects

A table without a clustered-index is called a “heap table”. A heap table has not its data sorted. The SQL server has to scan the entire table in order to locate the data, in a process called a “scan”.

In the case of a clustered index, the data are sorted on the key values (columns) of the index. The SQL server is now able to locate the data by navigating down from the root node, to the branch and finally to the leaf nodes of the B-tree structure of the index. This process called a “seek”. The later approach is much faster, when you want to filter or sort the data you want to retrieve.

A non-clustered index, on the other hand, is a completely different object in the table. It contains only a subset of the columns. It also contains a row locator looking back to the table’s rows, or to the clustered index’s key.

Because of its smaller size (subset of columns), a non-clustered index can fit more rows in an index page, therefore resulting to an improved I/O performance. Furthermore a non-clustered index can be allocated to a different FileGroup, which can utilize a different physical storage in order to improve performance even more.

The side effects of indexes are related to the cost of INSERT, UPDATE, MERGE and DELETE statements. Such statements can take longer to execute, in the presence of indexes, as it alters the data in the table, thus to the indexes too.

Imagine the situation of an INSERT statement. It has to add new rows in a table with a clustered index. In such case the table rows may need to be re-positioned! Remember…? The clustered index needs to order the data pages themselves! This will cause overhead.

So, it is crucial to take into account the overhead of INSERT, UPDATE and DELETE statements before designing your indexing strategy. Although there is an overhead to the above statements, you have to take into account, that many times, an UPDATE or DELETE statement will execute in a subset of data. This subset can be defined by a WHERE clause, where indexing may outweigh the additional cost of index updates, because the SQL server will have to find the data before updating them.

As explained above, a non-clustered index includes the clustered index’s key as its row locator, in the presence of a clustered index in the table.

This comes with a cost and a benefit:

The cost has to do with the non-clustered index bookmark lookup. What if a query has to return more columns that the ones hosted in the index itself? In the case of a HEAP table, the SQL server would have to check the RID of the non-clustered index, in order to navigate directly to the row, where the rest of the columns belong.

In the case of a clustered index, the SQL server would have to check the row locator of the non-clustered index, in order to do an additional navigation to the B-tree structure of the clustered index, to retrieve the desired row. You see, the row locator does not contain the RID, but the clustered-index key.

image_thumb_26

On the other hand, there is a benefit. It has to do with the clustered index updates. Imagine the following situation: Two new rows with index key values of A2 and A3 have to be added in the clustered index below.

image_thumb_27

Because this is a clustered index page, its physical structure has to be reallocated in order to fit A2 and A3 between A1 and A4. It has to maintain index’s order. Since there is no free space in the index page to accommodate these changes,  a page split will occur. Now, there is enough space to fit A2 and A3 between A1 and A4.

image_thumb_28

The goal achieved and the order maintained within the index. But imagine what would have happened if the non-clustered index was looking at the RID, instead of the clustered index’s key? It would have to change its row locators to reflect the changes. This could have been a huge performance hit! Especially, in the case of large clustered indexes.

Instead of the RID, the row locators now point at the clustered index key. Meaning, that there is no longer needed to change its values. This is quite a benefit if you think of the large clustered indexes, that are usually maintained in many tables.

Clustered indexes VS non-clustered indexes

CLUSTERED NON-CLUSTERED
PROS
  • Fast to return large range of data
  • Fast for presorted results
  • Wide keys do not reflect on other indexes
  • Frequently updated key columns do not reflect on other indexes
  • Can be assigned on different FileGroup
  • Many non-clustered indexes per table
  • Smaller size than clustered indexes due to column subsets
CONS
  • Frequently updated key columns reflect on non-clustered indexes
  • Wide keys increase the size of the non-clustered indexes
  • Only one clustered index per table
  • Generally slower than clustered indexes due to bookmark lookup (except for covering indexes).
  • Not recommended for returning large data sets (except for covering indexes).