“He who has a why to live for can bear almost any how.” ~ Nietzsche
In discussions with colleagues, I’ve noted numerous misconceptions about indexes, specifically the why. When we discuss technical information, we may feel tempted to use other technical information to support our ideas, when in reality, this can create confusion for some people who either are new to the industry, or find it difficult to picture what indexing accomplishes in the world of databases. To an extent, no analogy is perfect, and for the case of indexing, I’ll use some incomplete analogies to demonstrate what we’re trying to do when we index columns in SQL Server.
Imagine that we have a fitness book with various exercises in a random order. The book possesses no chapters, no table of contents, no headings, no page numbers, and no glossary or index. Suppose that we were fitness trainers and one of our clients asked us, “Can you show me the process of doing deadlifts?” How will we find this in the book?
Answer: Outside of reading through the entire book until we find the information, we can’t. Think of this as analogous to a table scan on a heap of data in SQL Server.
Of course, books like this typically don’t exist (or if they do, I have yet to stumble across one). A book like this would prevent users from finding information quickly and each time we tried to locate something, we would be facing a challenge of reading through the book until we found it.
Exceptions might exist, but generally, SQL Server doesn’t know where the data are when we query a table; it makes assumptions and in some cases, SQL Server’s assumptions are correct. Sometimes, it’s assumptions cause major performance problems. Like the book example, a heap of data forces SQL Server to scan the entire table until it finds what we’re looking for – a very bad idea, depending on the table.
Now, imagine that we have a fitness book with page numbers, a table of contents, and an alphabetical organization structure by fitness routine. Our same client comes to us and asks, “Can you show me the process of doing deadlifts?” How will we find this in the book this time?
Answer: More than likely, we will look at the table of contents and try to find deadlifts. Or, we will try to find the D-starting fitness routines. This would be like a last name index on a SQL Server table.
We find books like this all the time and, most of the time, we can identify information in them quickly. The reason that we have tables of contents, glossaries, indexes, page numbers and the like is because it makes it easier for us to find information. In a sense, databases are the new books, and indexes help us find information quickly.
Now, SQL Server can begin making more accurate assumptions, like we can, when indexes exist. Notice that we could try finding the information by the table of contents or by the alphabetical structure. These are like two indexes on the book – we can drastically minimize the time it takes to find information by using these indexes. The same holds true for SQL Server; does it have to go through an entire table to find a last name starting with L when we have an index on last name and it can just find the L and go through it? Of course not! SQL Server can save time and look through the Ls.
Note: this also brings up a point of SARGability on indexing. Although I dislike the term SARG when we’re discussing querying and efficient use of indexing, this is the popular term, so I’ll stick with it. In our example of finding a last name in SQL Server is it effective to find a last name of Johnson by doing WHERE LastName LIKE ‘%nson%’ or WHERE LastName LIKE ‘John%’? The first doesn’t use indexes effectively at all, the second does.
Another book analogy to this: imagine trying to find deadlifts in a book by searching for all use of the word “lift” in a fitness book. The word lift could appear ten thousand times, while only 10 times would relate to it! Plus, since any word could lead “lift”, we can’t assume that the word will start with L – it could start with any letter, meaning that we’re not using our indexes effectively. By having the right starting point, D, we know how we can limit our search.
Points like these demonstrate what a data scientist is – the intersection between a data analyst and a developer. Just because we throw indexes on our tables doesn’t mean we’re effectively using them. We can poorly use indexes.
So, now we know why we need an index. And, at this point, we know that we need to be using our indexes effectively. However, what about the times when we don’t need an index?
It is true that we can over-index a table in SQL Server. Let’s go back to our fitness book analogy. Imagine that we had a book that had a table of contents, a glossary, page numbers, chapters and verses. We can probably all think of a book like this: The Holy Bible and many other religious texts do this. While these texts are well organized (no disputing that), imagine the time it took for these books to be organized (before we had mass printing, of course)? In fact, religious texts like the Bible probably encouraged mass printing more than any other phenomena due to their complexity (I would adduce the same is true for the rise of NoSQL databases – because data are becoming so complex, a simple SQL database can’t always handle data, though in many cases it can).
Our problem here is maintenance. Sure, we can index ever column on a table in SQL Server and welcome some major problems when it comes to rebuilding them due to the increased time. Likewise, not every column needs an index anyway. If we seldom (or never) use it in where clauses, joins or queries, why are we indexing it (note: foreign key columns should be indexed in almost all cases – a few exceptions might exist here).
Hopefully, this post sheds some light on indexing in SQL Server, specifically the why of what we’re doing.