Thursday, June 27, 2013

Problems with Stop Words

     What will you find when you search for Khaled Hosseini’s new book, And the Mountains Echoed, in a library catalog?  Chances are you will retrieve quite a number of hits about ‘mountains’ and ‘echo’ but not the novel itself. Why is that? The book exists. The terms you entered are valid. Why can’t you find it? The problem lies with the words ‘and’ and ‘the’, which unfortunately appear at the beginning of the title. Those words are called ‘stop words’. Stop words are very common words that normally add little to the subject content of the document being indexed. Most of them are there to make a sentence grammatically correct. If you leave them out, the sentence will still make sense, somewhat. The trouble is that search engines leave them out too.

     Most search engines do not index stop words, in order to save disk space, to make searching more efficient, and to cut down on irrelevant results. Some search engines replace each stop word with a placeholder called a marker.

Consider this sentence:
                It is an unforgettable novel about finding a lost piece of yourself in someone else.

There are 7 stop words in this sentence: it, is, an, about, a, of, in. With each stop word replaced by a marker, the sentence would be stored like this:
                *** unforgettable novel * finding * lost piece * yourself * someone else.
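
The marker substitution above can be sketched in a few lines of Python. This is only an illustration, not how any particular catalog or search engine is built: the stop list, the `*` marker, and the function name `mask_stop_words` are all assumptions chosen to reproduce the example sentence.

```python
# Illustrative stop list; real systems each use their own (assumption).
STOP_WORDS = {"it", "is", "an", "about", "a", "of", "in", "and", "the"}
MARKER = "*"

def mask_stop_words(text, stop_words=STOP_WORDS, marker=MARKER):
    """Replace each stop word with a marker, joining adjacent markers."""
    out = []
    for token in text.split():
        # Compare without surrounding punctuation or capitalization.
        core = token.strip(".,;:!?\"'").lower()
        if core in stop_words:
            if out and set(out[-1]) == {marker}:
                out[-1] += marker   # extend a run of markers: * -> ** -> ***
            else:
                out.append(marker)
        else:
            out.append(token)       # keep ordinary words untouched
    return " ".join(out)

sentence = ("It is an unforgettable novel about finding a lost piece "
            "of yourself in someone else.")
print(mask_stop_words(sentence))
print(mask_stop_words("And the Mountains Echoed"))
```

Run on the title And the Mountains Echoed, the same function gives ** Mountains Echoed, with the two leading stop words reduced to markers.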

     Search engines also skip certain terms at search time in order to speed up the process. Consider the title And the Mountains Echoed.  A search engine will look for ‘mountains’ and ‘echoed’. To save time, it will most likely exclude terms that it considers too common, such as ‘and’ and ‘the’.
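
Here is a rough Python sketch of why dropping stop words from the query pulls in unrelated hits. It assumes a very simple engine that matches any title sharing at least one remaining term with the query; real catalogs rank and match in more elaborate ways. The function names and every catalog entry other than the novel itself are made up for illustration.

```python
# Illustrative stop list (assumption).
STOP_WORDS = {"and", "the", "of", "a", "an", "in"}

def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the words, strip punctuation, and drop stop words."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return words - stop_words

def search(query, titles):
    """Return titles sharing at least one non-stop-word term with the query."""
    wanted = index_terms(query)
    return [t for t in titles if wanted & index_terms(t)]

# Hypothetical catalog: only the second title is a real book from this post.
CATALOG = [
    "Hiking the Rocky Mountains",
    "And the Mountains Echoed",
    "Voices Echoed in the Valley",
    "A History of Tea",
]

print(index_terms("And the Mountains Echoed"))
for title in search("And the Mountains Echoed", CATALOG):
    print(title)
```

Only ‘mountains’ and ‘echoed’ survive from the query, so the novel comes back alongside every other title that happens to contain either word.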

     So what should a researcher do, short of asking every author in the world not to use stop words in the titles of their books? One way to avoid this problem is to enter search phrases in the form search engines are programmed to understand.  You can accomplish this by using markers where stop words appear.  Instead of And the Mountains Echoed, enter ** Mountains Echoed. Or skip the leading stop word entirely: enter Dark and Deadly Pool instead of The Dark and Deadly Pool. You can also put the phrase in quotation marks so that the search engine looks for the exact phrase.

     Fortunately, database publishers understand this problem, and they are doing something about it. Some add extra scripting to make their search engines aware of certain terms. Some create a stop list, which is a list of stop words, and apply it to the content indexed in the database. Still, the problem persists and will not go away entirely. The other day Mel, our cataloger, could not find a record for And the Mountains Echoed in the library catalog.  His search turned into a definite stop word dilemma!
