Examining Source Diversity in Web Search: A Comparative Study of Top Search Engines

11 May 2024

Authors:

(1) Yagci, Nurce, HAW Hamburg, Germany & nurce.yagci@haw-hamburg.de;

(2) Sünkler, Sebastian, HAW Hamburg, Germany & sebastian.suenkler@haw-hamburg.de;

(3) Häußler, Helena, HAW Hamburg, Germany & helena.haeuessler@haw-hamburg.de;

(4) Lewandowski, Dirk, HAW Hamburg, Germany & dirk.lewandowski@haw-hamburg.de.

Abstract and Introduction

Literature Review

Objectives and Research Questions

Methods

Results

Discussion

Conclusion, Research Data, Acknowledgments, and References

LITERATURE REVIEW

Search result overlap

In the mid-1990s, the overlap of search results between different search engines attracted research interest as a means of estimating the size of the Web (Bharat, 1998; Ding & Marchionini, 1996; Lawrence & Giles, 1999). The generally small degrees of overlap indicated diverse underlying databases, each of limited size. Therefore, metasearch engines that combined the results of several search engines were meant to provide additional value (Chignell et al., 1999) and attracted further research (e.g., Meng et al., 2002). Consequently, Spink et al. (2006) included the metasearch engine Dogpile in their extensive study on the overlap of search results. For a collection of 12,570 queries, the results found on the first search engine result page (SERP) of Ask Jeeves, Google, MSN Search, and Yahoo were captured. 84.9% of the results were unique to one search engine, while only 1.1% were shared by all search engines. Additionally, the low overlap between search engines also manifests in the ranking, since only 7% of the top results were similar. This is consistent with previous studies that reported low overlap in ranking (Bar-Ilan, 2005; Bar-Ilan et al., 2006).
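The two figures reported above (results unique to one engine and results shared by all engines) can be illustrated with a small sketch. It is not the procedure used by Spink et al. (2006). The function name `overlap_shares`, the exact-URL matching, and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def overlap_shares(results_by_engine):
    """Return (share unique to one engine, share returned by all engines).

    Assumption: two results count as the same if their URLs are identical.
    Shares are computed over the distinct URLs seen across all engines.
    """
    counts = Counter()
    for urls in results_by_engine.values():
        counts.update(set(urls))  # count each URL at most once per engine

    total = len(counts)
    if total == 0:
        return 0.0, 0.0

    n_engines = len(results_by_engine)
    unique = sum(1 for c in counts.values() if c == 1)
    shared = sum(1 for c in counts.values() if c == n_engines)
    return unique / total, shared / total


# Hypothetical first-page results for one query on three engines.
toy_results = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u1", "u4"],
    "engine_c": ["u1", "u2", "u5"],
}
unique_share, shared_share = overlap_shares(toy_results)
print(unique_share, shared_share)
```

In the toy data, five distinct URLs appear; three of them occur on exactly one engine and one occurs on all three.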

Subsequent studies report an increase in overlap in search results, while ranking algorithms appear to be the leading cause of differences in the presentation of the results. Bilal and Ellis (2011) compared major web search engines (Google, Bing, Yahoo) with search engines specifically designed for children (Ask Kids, Yahoo Kids) for queries of various lengths. Bing and Yahoo shared the same overlap with Google for nearly all queries, ranging between 22% and 40%. In contrast, Yahoo Kids had mostly unique results. However, the disparate relevance ranking between Bing and Yahoo suggests different ranking algorithms. Cardoso and Magalhães (2011) measured the overlap of search result rankings based on URLs and website contents. Search results for 40,000 queries were retrieved from Google and Bing. The overlap in domains was about 29%, and Google had more exclusive domains. When looking at the result sets without considering the positions, the similarity between the result sets increased. This indicates that Google and Bing have different ranking preferences but index mostly the same sources.
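The distinction between rank-aware and rank-independent overlap can be made concrete with a short sketch. The measures below (Jaccard similarity of domain sets and the share of ranks with matching domains) are assumptions chosen for illustration; the cited studies define their own measures. The helper names and example URLs are hypothetical, and hosts are compared as-is apart from a leading "www.", so subdomains count as separate domains.

```python
from urllib.parse import urlsplit


def domain(url):
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def set_overlap(results_a, results_b):
    """Jaccard similarity of the domains in two result lists, ignoring rank."""
    domains_a = {domain(u) for u in results_a}
    domains_b = {domain(u) for u in results_b}
    union = domains_a | domains_b
    return len(domains_a & domains_b) / len(union) if union else 0.0


def positional_overlap(results_a, results_b):
    """Share of ranks at which both lists show the same domain."""
    n = min(len(results_a), len(results_b))
    if n == 0:
        return 0.0
    same = sum(1 for a, b in zip(results_a, results_b) if domain(a) == domain(b))
    return same / n


# Hypothetical ranked results for the same query on two engines.
engine_a = [
    "https://www.example.org/a",
    "https://news.example.com/b",
    "https://example.net/c",
]
engine_b = [
    "https://example.net/d",
    "https://www.example.org/e",
    "https://other.example/f",
]
print(set_overlap(engine_a, engine_b), positional_overlap(engine_a, engine_b))
```

Here the two lists share two of four distinct domains but agree at no rank, which mirrors the pattern of similar sources combined with different ranking preferences.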

Similarly, Agrawal et al. (2016) based their overlap analysis on content but factored in the description of the result snippets. The top 10 results of 67 informational queries were collected from Google and Bing. The results indicate a high overlap between Google and Bing, with a slightly higher similarity among the top 5 results. Most recently, a study about information on Covid-19 was conducted by Makhortykh et al. (2020). The first 50 results for queries in different languages were collected from Baidu, Bing, DuckDuckGo, Google, Yandex, and Yahoo. DuckDuckGo and Yahoo shared nearly 50% of their results, whereas the other pairs remained under 25%. Google had an overlap of around 10% with Bing and a negligible one with DuckDuckGo, Yandex, Yahoo, and Baidu.

Source diversity

Various studies took a closer look at search results and evaluated the domains, the types of sources, as well as their diversity across result sets and search engines. Comparing Google, Live Search, and Yahoo, Thelwall (2008) reported that Yahoo returned the highest number of different domains for a query and also returned around 10% more top-level domains than other search engines. Höchstötter & Lewandowski (2009) showed that while the sources in top results of various search engines differ, there is a concentration on some popular sources in top results. More recently, Lewandowski & Sünkler (2019) collected search results on queries related to insurance providers and identified the distribution of top-level domains. Only ten different domains were found in the first result across all queries. The five most popular domains in the top 5 results are price comparison websites, which make up 88.4% of all first results. The share of popular domains decreases according to the ranking but remains at 42.9% at position ten.

Unkel and Haim (2019) observed Google search results prior to the German parliamentary elections in 2017. The study found that result lists for candidates and parties exhibit a high share of self-administered websites. In contrast, results about election facts, guidance, and issues are mainly from general interest news, government information, and privately run websites. Among the top ten domains across all search results are seven news websites, which account for a quarter of all search results. By contrast, Wikipedia made up 5.4% of all search results. The high share of news and Wikipedia articles is confirmed by Steiner et al. (2022), who compared the first results for queries on debated topics in Germany across the search engines Ask, Bing, DuckDuckGo, Google, and Ixquick. All search engines positioned Wikipedia in the first rank for queries on climate change and the Transatlantic Trade and Investment Partnership (TTIP); Ask did so for almost every query. In the other search engines, news sites were placed in the first rank most of the time. However, for the topic of Covid-19, the sources are more diverse. As Makhortykh et al. (2020) observed, only Yahoo incorporated recent information from legacy media, whereas Bing strongly relied on healthcare-related sources and Google highlighted government-related websites. Yandex was the only search engine that included alternative media in its top 20 search results. Especially for recent topics that lack an established knowledge basis, the choice of search engine may considerably impact what information a user gets to see.

This paper is available on arXiv under a CC 4.0 license.