Internet Interlude: Learning about Search Engines

So I was ready to jump aboard the internet highway and deep dive into my ongoing journey, but after a Zoom call with some friends this weekend, I am taking a small detour to discuss search engines!

Nam June Paik, “Electronic Superhighway: Continental U.S., Alaska, Hawaii”, 1995

Every Sunday a close group of friends and I gather on Zoom to recite passages or poems, discuss future travel plans and anything else relevant to our lives in the moment. This past weekend, someone asked me if I knew anything about “DuckDuckGo” and “if it’s better than Google.” Naturally, I had no idea. I have only seen DuckDuckGo in passing, its page flashing by in the browser when one of my peers or coaches is screen sharing during lecture. What is the difference? Why do some people choose DuckDuckGo? Why did I never stop to think I had a choice? Let’s find out!

When I googled DuckDuckGo vs Google, I immediately came across an article by James Temperton from Wired titled “I Ditched Google for DuckDuckGo. Here’s Why You Should Too.” James initially takes a step back from the debate to look at the larger picture: what do we use search engines for, anyway?

I can’t help but agree. I am not using Google for anything more than “how tall is that actress in Tenet?”, let alone holding a strong preference for which search engine I use. Aside from some shame about my rudimentary searches, I also became interested in just how these engines work!

Web Crawlers

What are web crawlers?

A web crawler, or spider, is a type of bot that goes out, downloads content, and indexes it for search engines to use. Think of it as someone who catalogs books or an archive: their goal is to grab just a summary from these sites, starting with the most popular. Then, after scanning those sites for general information, they follow the hyperlinks on those sites to other sites, and so on. After the data is collected, they organize it into a search index for the engine to use. Pretty neat!

But how do they know what they are allowed to collect from each site? Before crawling, they check something called a robots.txt file, which is like a set of rules for which pages on the site the crawler should and should not crawl. The crawlers also use various algorithms to determine whether a page needs to be re-crawled more often because its content changes.

What about images? And other non-text files? Well, some crawlers can access those as well! Here are some of the file types that Google can index (I couldn’t find a good source on which files DuckDuckGo can index, but will update later if I do!)
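Stepping back to the crawl itself, here is a minimal sketch in Python of the loop described above: check the robots.txt rules, collect a page, then follow its links to the next pages. This is only an illustration under my own assumptions, not how Google or DuckDuckGo actually do it. The tiny in-memory "web" (`PAGES`), the `ROBOTS_TXT` rules, the `crawl` function, and the `toy-crawler` name are all made up for the example, and nothing is fetched over the network. The one real library piece is Python's built-in `urllib.robotparser`.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would download these pages over HTTP instead.
PAGES = {
    "https://example.com/": (
        "Welcome home",
        ["https://example.com/about", "https://example.com/private/notes"],
    ),
    "https://example.com/about": ("About us", ["https://example.com/"]),
    "https://example.com/private/notes": ("Secret notes", []),
}

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(start_url, pages, robots_txt, user_agent="toy-crawler"):
    """Breadth-first crawl that skips anything robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    seen = {start_url}          # URLs already queued, so we never loop forever
    queue = deque([start_url])  # URLs waiting to be visited
    collected = {}              # URL -> text, ready to be indexed later

    while queue:
        url = queue.popleft()
        if not rules.can_fetch(user_agent, url):
            continue  # the site asked crawlers not to visit this page
        if url not in pages:
            continue  # dead link in our toy web
        text, links = pages[url]
        collected[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


result = crawl("https://example.com/", PAGES, ROBOTS_TXT)
print(sorted(result))
```

The `seen` set is what keeps the crawler from going in circles when two pages link to each other, which happens constantly on the real web.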

Okay, but the internet is VAST, how do the crawlers index the whole thing?

Well, they don’t. Despite indexing BILLIONS of pages, crawlers only reach an estimated 40%-70% of the internet! On top of that, sites are constantly updating and adding new information, which demands recrawling and makes it close to impossible to index the entire internet.

How do they decide what to index and what not to? Initially, they look at the relative importance of a page: how many other sites link to it, how many visitors it receives, and so on. The idea is that if a site is referenced often and has a lot of visitor traffic, it’s more likely to contain high-quality, authoritative information.
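To make the "how many other sites link to that page" idea concrete, here is a small Python sketch that counts inbound links and visits the most linked-to pages first. It is a deliberately simplified stand-in: real search engines combine many more signals (traffic included), and the link graph `LINKS` and the function names below are invented for this example.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "a.example"],
}


def inbound_link_counts(links):
    """Count how many *other* pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # a page linking twice still counts once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


def crawl_order(links):
    """Most linked-to pages first; ties broken alphabetically."""
    counts = inbound_link_counts(links)
    return sorted(counts, key=lambda page: (-counts[page], page))


print(crawl_order(LINKS))
```

In this toy graph, `c.example` is linked from three other pages, so it would be crawled first, while `d.example` has no inbound links and lands at the back of the line.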

Indexing

After all of the crawling, Google, DuckDuckGo, or any other search engine needs to organize all this information so it can be accessed quickly. This is where indexing comes into play. The main reason to index is to allow for faster search queries: instead of searching each individual page for keywords, the engine uses something called an inverted index, or a reverse index. An inverted index is a hashmap-like data structure that maps a keyword to the documents or webpages that contain it. Let’s take a look at this simple example below:

Credit: Here

Here is the difference between a forward index and an inverted index. If we were to search for “Spain” with the forward index, our engine would have to search through each of these documents to find the geo-scopeID of Spain. You can imagine how long this would take if your DocIDs ran into the millions. In an inverted index, our documents are organized by the geo-scopeID, so “Spain” will return the documents with the associated ID. And in fact, this isn’t a brand new idea; we have been using inverted indexes forever!
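The same comparison can be sketched in a few lines of Python. The document IDs and place names below are made up for illustration (they are not the values from the figure), and the function names are my own; a plain `dict` stands in for the hashmap-like structure.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> the places that document mentions.
FORWARD_INDEX = {
    1: ["Spain", "France"],
    2: ["Italy"],
    3: ["Spain"],
}


def search_forward(forward_index, term):
    """Forward index lookup: every document has to be checked one by one."""
    return [doc_id for doc_id, terms in forward_index.items() if term in terms]


def build_inverted_index(forward_index):
    """Flip the forward index around: term -> list of document IDs."""
    inverted = defaultdict(list)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted[term].append(doc_id)
    return dict(inverted)


INVERTED_INDEX = build_inverted_index(FORWARD_INDEX)

# Both approaches find the same documents...
print(search_forward(FORWARD_INDEX, "Spain"))
# ...but the inverted index answers with a single dictionary lookup.
print(INVERTED_INDEX.get("Spain", []))
```

Building the inverted index costs one pass over every document up front, but after that each search is a direct lookup instead of a scan through millions of DocIDs.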

Books have reverse indexing too! Credit.

This is a random index from a random book. Looks familiar, right? Well, it’s also an inverted index! Instead of listing all the key words on page 1, page 2, page 3 and so on, here we can just look up “romance” and find ALL the pages that mention the topic! Tah-dah!

Despite this journey starting as an exploration of the difference between Google and DuckDuckGo, I don’t have any conclusions as to which search engine is the best to use. However, I hope you have learned more about search engines in general, as I know I have!

Perhaps this is a two-part-er? I cringe to think that all my blogs will be an endless thread, weaving from one topic to the next, not knowing when to stop and cut itself loose, but then I ask myself: Is any journey ever really concluded?

Until next time!

References

  1. James Temperton, Wired, “I Ditched Google for DuckDuckGo. Here’s Why You Should Too” November 2019. https://www.wired.co.uk/article/duckduckgo-google-alternative-search-privacy
  2. Cloudflare, “What Is a Web Crawler? | How Web Spiders Work” https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/
  3. Sam Marsden, DeepCrawl.com, “Search Engine Indexing” May 2018. https://www.deepcrawl.com/knowledge/technical-seo-library/search-engine-indexing/
  4. GeeksforGeeks, “Inverted Index” https://www.geeksforgeeks.org/inverted-index/