Tech Sense: How Search Engines Work
By John Bell
Updated: Feb 25, 2021
How Search Engines Work
Happy New Year! Welcome to 2021, hopefully a better year. Last week I was asked how search engines like Google or Bing work. This led me to believe that others may wonder about this as well, so this month that is what we will cover.
The Web Crawler
A search engine is composed of several major parts, the first of which is the web crawler. The web crawler tries to visit every web page on the World Wide Web. It loads (or reads) each page and examines every word it finds. Common words, sometimes called noise words, are discarded or ignored. Noise words include pronouns, articles, prepositions, and conjunctions, so words like “and,” “but,” “or,” “to,” “the,” and “from” are typically ignored. Important words are noted and sent to be stored in an index file. When the crawler has finished indexing the words on the page, it looks at the links to other pages and other websites. If a link points to another page on the same website, the crawler moves to the linked page and processes it. This is repeated until all of the links have been followed and the pages on the site have been indexed.
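To make the word-filtering step concrete, here is a minimal Python sketch. The noise-word list and the function name important_words are made up for illustration; real search engines use their own, much longer lists and more careful word splitting.

```python
import re

# Hypothetical noise-word list; real engines use their own, longer lists.
NOISE_WORDS = {"and", "but", "or", "to", "the", "from", "a", "an", "of", "in"}

def important_words(text):
    """Lowercase the text, split it into words, and drop the noise words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in NOISE_WORDS]

print(important_words("The crawler moves from the page to the linked page"))
# prints ['crawler', 'moves', 'page', 'linked', 'page']
```

The words that survive the filter are the ones that would be passed along to the index.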
If a link is found that points to a different site, a new crawler will be sent to create or update the index for that new site. When this happens, an entry is added to the index showing that the two sites are connected via the link.
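The link-following behavior described above can be sketched with a tiny pretend web held in memory. Everything here is an assumption for illustration: the TOY_WEB dictionary, the example addresses, and the crawl_site function. A real crawler would download pages over the network and pull the links out of the HTML.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical toy web: each address maps to the links found on that page.
TOY_WEB = {
    "https://a.example/": ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/"],
    "https://b.example/": [],
}

def crawl_site(start_url, web):
    """Visit every reachable page on start_url's site; note links to other sites."""
    site = urlparse(start_url).netloc
    visited, cross_site_links = [], []
    queue, seen = deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if urlparse(link).netloc != site:
                # Different site: record the connection for another crawler.
                cross_site_links.append((url, link))
            elif link not in seen:
                # Same site: queue the page so it gets processed once.
                seen.add(link)
                queue.append(link)
    return visited, cross_site_links

visited, cross = crawl_site("https://a.example/", TOY_WEB)
print(visited)  # the two pages on a.example
print(cross)    # the one link from a.example to b.example
```

The seen set keeps the crawler from going around in circles when two pages link to each other.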
The Search Index
The Search Index is the second major part of a search engine. The Search Index is built and kept up to date by the web crawlers. The index pairs the address of the page with the words found on the page, the number of times each word was found, and the importance of each word on the page.
The importance of a word may be determined by where the word is used on the page. A word found in normal text may be less important than the same word found in a header or title. Words in a bold style might indicate more importance than words in the normal style. Each page in a website can also declare keywords for itself. The keywords are supposed to show the words that are most important for the topics of the website or a page within the website. The links to other pages on the website and to pages on other websites are also stored. These links will be used by the Search Engine to help determine the relevance of a match.
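One way to picture the index is as a lookup table from each word to the pages that contain it. The sketch below is only an illustration: the WEIGHTS values, the page data, and the build_index function are assumptions, not how any particular search engine stores its index.

```python
from collections import defaultdict

# Hypothetical weights: words in a title count more than bold or normal text.
WEIGHTS = {"title": 3.0, "bold": 2.0, "normal": 1.0}

def build_index(pages):
    """pages maps an address to a list of (word, where_found) pairs.

    Returns word -> {address: {"count": n, "importance": total weight}}.
    """
    index = defaultdict(dict)
    for address, words in pages.items():
        for word, where in words:
            entry = index[word].setdefault(address, {"count": 0, "importance": 0.0})
            entry["count"] += 1
            entry["importance"] += WEIGHTS[where]
    return dict(index)

pages = {
    "https://a.example/": [("nuts", "title"), ("nuts", "normal"), ("bolts", "bold")],
    "https://b.example/": [("nuts", "normal")],
}
index = build_index(pages)
print(index["nuts"])
```

Here the first page mentions “nuts” twice, once in its title, so it ends up with a count of 2 and an importance of 4.0, while the second page gets a count of 1 and an importance of 1.0.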
The Search Engine
The Search Engine is expected to use the Search Index to find and return the most relevant web pages matching the query. The query, typically called the “search term,” is the input to the Search Engine. The Search Engine looks in the index for matches with the search term and sorts the results based on relevance. Relevance is a measure of how well a page matches the query. This relevance calculation is what distinguishes one search engine from another. In general, the algorithms use three things: the importance of the words on a page, the number of linked pages on the same site with the same important words, and the number of sites that link back to the page.
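Real relevance formulas are closely guarded and far more elaborate, but a toy version shows the idea. In this sketch the scoring rule (word importance plus half a point per inbound link), the sample index, and the rank function are all invented for illustration.

```python
def rank(query_words, index, inbound_links):
    """Order pages by a made-up score: summed importance of the matched
    words plus 0.5 for every site that links back to the page."""
    scores = {}
    for word in query_words:
        for address, entry in index.get(word, {}).items():
            scores[address] = scores.get(address, 0.0) + entry["importance"]
    for address in scores:
        scores[address] += 0.5 * inbound_links.get(address, 0)
    return sorted(scores, key=lambda a: scores[a], reverse=True)

# Hypothetical index and link counts.
index = {
    "nuts": {"https://a.example/": {"importance": 4.0},
             "https://b.example/": {"importance": 1.0}},
    "bolts": {"https://b.example/": {"importance": 2.0}},
}
inbound_links = {"https://a.example/": 1, "https://b.example/": 5}

print(rank(["nuts"], index, inbound_links))
# prints ['https://a.example/', 'https://b.example/']
print(rank(["nuts", "bolts"], index, inbound_links))
# prints ['https://b.example/', 'https://a.example/']
```

Notice how adding a second word to the query changes the order, because the second page matches both words and has more sites linking to it.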
Other factors may also be applied based on your own search history. This is search personalization. So, if you typically search for nuts as in the food you can eat, then a search for nuts might return grocery-related nuts before returning hardware nuts. This is because, based on your personal search history, grocery nuts are likely to be more relevant. You may also add terms to a search to affect relevance, like “nearby,” which ranks results higher if they are geographically closer to your location. The ordering of the search results is typically called the relevance order, with the most relevant results showing at the top.
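The nuts example can be sketched as a small score boost. This is a simplified illustration only: the topic labels, the scores, the boost size, and the personalize function are assumptions, and real personalization draws on many more signals.

```python
def personalize(results, history_topics, boost=1.0):
    """results is a list of (address, topic, base_score) tuples.

    Adds a hypothetical boost to pages whose topic appears in the
    user's search history, then returns addresses best-first.
    """
    rescored = [
        (address, score + (boost if topic in history_topics else 0.0))
        for address, topic, score in results
    ]
    ordered = sorted(rescored, key=lambda r: r[1], reverse=True)
    return [address for address, _ in ordered]

results = [
    ("https://hardware.example/nuts", "hardware", 2.0),
    ("https://grocery.example/nuts", "grocery", 1.5),
]
print(personalize(results, history_topics={"grocery"}))
# prints ['https://grocery.example/nuts', 'https://hardware.example/nuts']
```

With an empty search history the hardware page would stay on top, since its base score is higher.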
In most search engines, the search term is actually an expression that, if used properly, can enhance your search. In Google, for example, if you enter multiple words, Google will normally rank pages containing all of the words over pages with just a few of the words. If you use quotes around words in a search, Google will try to match the exact phrase within the quotes. If you put a minus sign (a dash, -) in front of a word, Google will not return matches that contain that word. These advanced search capabilities can be examined here: https://www.google.com/advanced_search, and for Bing they can be found here: https://help.bing.microsoft.com/#apex/18/en/10002/-1. Go ahead and try them out.
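To show how a search expression might be taken apart, here is a small sketch that handles quoted phrases and minus signs. The parse_query and matches functions are hypothetical, and the matching is stricter than what is described above: it requires every word to be present rather than simply ranking such pages higher. It is not how Google or Bing actually process queries.

```python
import re

def parse_query(query):
    """Split a query into exact phrases, excluded words, and ordinary words."""
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]*"', " ", query)  # what is left outside the quotes
    excluded = [w[1:] for w in rest.split() if w.startswith("-") and len(w) > 1]
    words = [w for w in rest.split() if not w.startswith("-")]
    return phrases, excluded, words

def matches(page_text, query):
    """True if the page has every phrase and word and none of the excluded words."""
    phrases, excluded, words = parse_query(query.lower())
    text = page_text.lower()
    page_words = set(re.findall(r"[a-z0-9']+", text))
    return (all(p in text for p in phrases)
            and all(w in page_words for w in words)
            and not any(w in page_words for w in excluded))

print(parse_query('"mixed nuts" recipe -hardware'))
# prints (['mixed nuts'], ['hardware'], ['recipe'])
print(matches("A recipe for roasted mixed nuts", '"mixed nuts" recipe -hardware'))
# prints True
```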
And in the End
I am hoping this serves as a gentle introduction to how search engines work. I think this covers most of the basics, but there is a lot more that goes on behind the scenes. For example, Search Engine Optimization (SEO) is a specialty area that helps websites improve their relevance, so they show up higher in the rankings. Most search companies make their income by selling advertising, and they use relevance calculations to decide which people to show specific ads to, based on their search histories. Most search companies want to remember what you have searched for so they can present you with targeted ads. This is why, when you search for something online, you will often start to see ads about related products and services. You can reduce this by using a search engine like DuckDuckGo or by regularly clearing your cookies and search history for sites like Google and Bing. Until next time, search wisely.