# Assessing bias in search results

Search engines help users navigate large collections of documents, be it the Web, the offerings of online shops or digital archives such as the German Digital Library. When a collection is too large to be explored manually, users need to rely on (and trust) the validity of the results obtained through search engines. Providing good search functionality is therefore in the interest of both the service provider and the user.

This can be a challenging task that requires expert knowledge concerning the characteristics of the documents (e.g. quality, content, length), the technical infrastructure used (e.g. retrieval model applied) and good knowledge of the skills and requirements of the target users. Often, however, search engines are only evaluated with regard to how effectively and efficiently they retrieve relevant documents. Bias in search engine result pages is rarely considered.

In this article, I describe the idea behind the retrievability metric which can be used to measure bias in search results.

## Bias

Wikipedia defines bias as “an inclination towards something, or a predisposition, partiality, prejudice, preference, or predilection.” In the context of information access, bias is generally negatively connoted. It is often forgotten that it is the core task of a search engine to induce bias on the ranking of the retrieved documents: Documents should be presented according to their relevance towards a specific search query. Search results should therefore be biased in favor of documents that satisfy the information need of the user. This type of bias is desired.

Unfortunately, numerous examples have shown that even widely used search engines such as Google, Bing and Yandex exhibit undesired bias. A query for “black girls”, for instance, “came back with pages dominated by pornography”. Since the results of search engines have the potential to shape public opinion, it is vital to assess inherent biases.

Bias can be imposed unintentionally. Examples of sources for such bias in document collections are:

• The document collection may not reflect the “real world” very well. One reason for this can be “survivorship bias”, which causes the collection to miss out on a specific class of documents.
• Digitization of documents can be a complex process and its success depends strongly on document characteristics (such as complexity of layout, degree of decay in historic documents or font) and the use of suitable technology for digitization and post-processing (e.g. use of suitable dictionaries for error-correction).
• Content and textual characteristics of documents (e.g. length and repetitiveness) can influence their retrievability.
• Access to documents is usually granted through a search interface. The (non-)existence of functionalities such as facets and logical operators, as well as the retrieval model used, can influence which documents are easier or harder to find than others.
• The way search results and document detail pages are presented to a user influences how (much) information about the retrieved documents is perceived and how many search results a user sees without scrolling (several studies have shown that users typically examine the top ten search results and, if they have not found what they were looking for, issue a new query instead of paginating to subsequent result pages).
• Depending on their familiarity with search functionalities, users may find it difficult to find specific documents, e.g. if they are not familiar with specific terminology necessary to find a document.

Assessment of existing bias is difficult because it requires a thorough understanding of the data and software tools involved.

## Measuring & Visualizing Bias

Bias can be seen as a deviation from equality. We can therefore use the Gini coefficient and the Lorenz curve, which were developed by economists to measure and visualize income inequality in societies.

### Gini Coefficient

The Gini coefficient ($$G$$) can have a value between 0 and 1. The lower the Gini coefficient, the more equal the distribution. A value of $$G=0$$ would be reached if the wealth in a society were spread absolutely equally (= a perfect communist society). A value close to $$G=1$$ means that one individual (or very few) owns the entire wealth, whereas the large majority owns nothing (= a perfect tyranny).
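
The article does not give a formula for $$G$$. One common way to write it, stated here as an assumption about the intended definition, is the normalized mean absolute difference for a population of $$n$$ individuals with wealth values $$x_1, \dots, x_n$$:

$$G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n \sum_{i=1}^{n} x_i}$$

Under this definition, $$G=0$$ when all $$x_i$$ are equal. When a single individual owns everything, $$G = (n-1)/n$$, which approaches 1 as the population grows.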

### Lorenz Curves

Lorenz curves were originally developed by Max Otto Lorenz in 1905 to visualize inequalities in wealth distribution. In a perfect communist society, the Lorenz curve is a perfect diagonal. The following graph shows examples of Lorenz curves for a population of ten individuals (a, b, …, j) and different wealth distributions.

The black line in the graph represents the “perfect communist” society where wealth is spread equally ($$a=1, b=1, c=1, d=1, e=1, f=1, g=1, h=1, i=1, j=1$$) and $$G=0$$.

The green Lorenz curve expresses a less equal society ($$G=0.5$$) where four individuals own no wealth at all, four individuals own 50% and two individuals own the other 50% ($$a=0, b=0, c=0, d=0, e=1, f=1, g=1, h=1, i=2, j=2$$). The orange curve shows a “perfect tyranny” ($$G=0.9$$), where one individual owns the entire wealth and the others own nothing ($$a=0, b=0, c=0, d=0, e=0, f=0, g=0, h=0, i=0, j=1$$).
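
The three example distributions above can be checked with a few lines of Python. This is a minimal sketch, not the code used to produce the graph: the function names `lorenz_points` and `gini` are my own, the Gini coefficient is assumed to be one minus twice the area under the piecewise-linear Lorenz curve, and the total wealth is assumed to be greater than zero.

```python
from itertools import accumulate


def lorenz_points(wealth):
    """Return the Lorenz curve as (population share, wealth share) points."""
    values = sorted(wealth)  # poorest individuals first
    total = sum(values)      # assumed to be > 0
    n = len(values)
    points = [(0.0, 0.0)]
    for i, running in enumerate(accumulate(values), start=1):
        points.append((i / n, running / total))
    return points


def gini(wealth):
    """Gini coefficient as 1 minus twice the area under the Lorenz curve."""
    points = lorenz_points(wealth)
    area = 0.0
    # Trapezoidal rule over consecutive points of the piecewise-linear curve.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area


equal = [1] * 10                          # black line
middle = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]   # green curve
tyranny = [0] * 9 + [1]                   # orange curve

for name, wealth in [("equal", equal), ("middle", middle), ("tyranny", tyranny)]:
    print(name, round(gini(wealth), 2))
```

Up to floating-point rounding, this reproduces the values 0, 0.5 and 0.9 quoted for the three curves. The points returned by `lorenz_points` are the coordinates one would plot to draw each curve.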

## Retrievability Assessment

The most common way to evaluate the performance of a search engine is to compute precision (How many of the retrieved documents are relevant?) and recall (How many relevant documents were retrieved?), or a combination of the two (F1 score). Unfortunately, these measures require relevance judgments to be available for documents and user queries. To create relevance judgments, humans are needed to judge whether a specific document can be considered relevant for a given user query. Such judgments are therefore expensive and almost impossible to obtain for large document collections.

In 2008, Azzopardi et al. introduced the retrievability metric to complement traditional performance measures.

## Retrievability Score

The retrievability score $$r(d)$$ of a document $$d$$ measures how accessible a document is. It is determined by several factors, including the number of documents a user is willing to evaluate.

The retrievability score is the result of a cumulative scoring function, defined as:

$$r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)$$

where

• $$c$$ defines a cutoff which represents the number of documents a user is willing to examine in a ranked list
• $$o_q$$ weights the importance of a query
• $$k_{dq}$$ is the rank of document $$d$$ in the result list for query $$q$$
• $$f$$ is defined to return a value of 1 if the document is retrieved within the top $$c$$ results ($$k_{dq} \leq c$$), and 0 otherwise.

In summary, if all query weights $$o_q$$ are set to 1, $$r(d)$$ counts for how many queries $$q \in Q$$ a document $$d$$ is retrieved within the top $$c$$ results.
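
The cumulative scoring described above can be sketched in Python as follows. This is an illustration under assumptions rather than the original implementation: the function name `retrievability`, the input layout (a dictionary from query to ranked list of document ids) and the example queries and ids are all hypothetical.

```python
from collections import defaultdict


def retrievability(rankings, cutoff, weights=None):
    """Compute r(d) for every document that appears within the cutoff.

    rankings: dict mapping each query to its ranked list of document ids
              (best match first).
    cutoff:   c, the number of results a user is assumed to examine.
    weights:  optional dict mapping a query to its weight o_q (default 1).
    """
    scores = defaultdict(float)
    for query, ranked_docs in rankings.items():
        o_q = 1.0 if weights is None else weights.get(query, 1.0)
        # Enumerate from 1 so that `rank` corresponds to k_dq.
        for rank, doc in enumerate(ranked_docs, start=1):
            if rank > cutoff:
                break  # f(k_dq, c) = 0 beyond the cutoff
            scores[doc] += o_q  # f(k_dq, c) = 1 within the cutoff
    return dict(scores)


# Hypothetical result lists for three queries.
rankings = {
    "amsterdam": ["d1", "d2", "d3"],
    "harbour": ["d2", "d1", "d4"],
    "ship": ["d2", "d5", "d1"],
}
r = retrievability(rankings, cutoff=2)
print(r)
```

Documents that never appear within the cutoff are missing from the returned dictionary. Their score is $$r(d)=0$$, and they need to be added back with that value before computing a Gini coefficient over the whole collection.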

## Retrievability Assessment Setup

In order to assess potential bias in search results, we need three ingredients: a search engine, queries and a collection of documents. Two types of queries can be used for retrievability assessment: real queries collected from users of a retrieval system, or artificial queries. The latter can be created by counting unique terms and bigrams in the document collection (after stemming and stopword removal have been applied) and selecting the top $$n$$ terms as single-term queries and the top $$n$$ bigrams as bi-term queries.
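
The generation of artificial queries could look roughly like this. It is a simplified sketch: the function name `generate_queries` and the toy documents are hypothetical, lowercasing and whitespace splitting stand in for real tokenization, and no stemmer is applied. Bigrams are formed after stopword removal, so they can span a removed stopword.

```python
from collections import Counter


def generate_queries(documents, n, stopwords=frozenset()):
    """Return the top-n terms and top-n bigrams of a collection as queries."""
    term_counts = Counter()
    bigram_counts = Counter()
    for text in documents:
        # Assumption: a real setup would tokenize properly and stem here.
        tokens = [t for t in text.lower().split() if t not in stopwords]
        term_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    single_term = [term for term, _ in term_counts.most_common(n)]
    bi_term = [" ".join(pair) for pair, _ in bigram_counts.most_common(n)]
    return single_term, bi_term


# Toy collection, for illustration only.
docs = [
    "the old ship left the harbour",
    "the old ship sank",
    "an old ship",
]
single_term, bi_term = generate_queries(docs, n=2, stopwords={"the", "an"})
print(single_term, bi_term)
```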

The queries are then issued against the document collection and we evaluate how often each document was retrieved and at which rank. Bias can be evaluated at different cutoff values $$c$$. A value of $$c=10$$, for example, can be used to measure bias that would be experienced by a typical user who only examines the top ten results.

### Validation

Whether or not retrievability scores are meaningful for a given setup can be validated with a known-item-search setup. That way we can confirm that documents with a low $$r(d)$$ score are actually harder to find.

For this, we divide the documents into multiple subsets depending on their $$r(d)$$ score. From each subset, we pick a random sample of $$n$$ documents. For each of the documents, we then count the occurrences of unique terms and select the two or three most frequent terms (ignoring stopwords). These terms are supposed to best represent the document and constitute the queries we issue against the complete collection.

For each of the selected documents, we determine its rank in the result list of the query we generated for it and calculate the Mean Reciprocal Rank (MRR) per subset as a measure of retrieval performance. Using the Kolmogorov-Smirnov test, we can test whether the differences between subsets are significant.
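
The following sketch shows how the MRR and the two-sample Kolmogorov-Smirnov statistic could be computed for two subsets. The function names and the ranks are made up for illustration, and applying the test to the reciprocal ranks of two subsets is my assumption about the setup. Only the statistic $$D$$ is computed here. For a p-value, a statistics library such as SciPy (`scipy.stats.ks_2samp`) could be used instead.

```python
from bisect import bisect_right


def mean_reciprocal_rank(ranks):
    """MRR over known-item queries; a rank of None means 'not retrieved'."""
    reciprocal = [0.0 if rank is None else 1.0 / rank for rank in ranks]
    return sum(reciprocal) / len(reciprocal)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D (no p-value)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        # Empirical CDFs: share of each sample that is <= x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical ranks of the known items for a low and a high r(d) subset.
ranks_low_rd = [12, 40, None, 7, 95]
ranks_high_rd = [1, 2, 1, 5, 3]

print(mean_reciprocal_rank(ranks_low_rd), mean_reciprocal_rank(ranks_high_rd))

# Compare the distributions of reciprocal ranks of the two subsets.
rr_low = [0.0 if r is None else 1.0 / r for r in ranks_low_rd]
rr_high = [1.0 / r for r in ranks_high_rd]
print(ks_statistic(rr_low, rr_high))
```

If retrievability scores are meaningful, the subset with low $$r(d)$$ should show a clearly lower MRR than the subset with high $$r(d)$$.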

## Retrieval Bias

The previous paragraphs outline how inequality can be measured and visualized, and how we can measure the accessibility of a document using the $$r(d)$$ score. We will now look at an example taken from a study I published together with colleagues from CWI at the Joint Conference on Digital Libraries (JCDL) in 2016.

### Bias in Retrieval Models: Comparing BM25 with LM1000

In this study we investigated the inequality in retrievability scores for two retrieval models: Okapi BM25 and a Language Model (LM1000) using Bayesian smoothing with $$\mu = 1,000$$. For this evaluation we used the historic newspaper collection of the National Library of the Netherlands (Delpher), which comprises more than 102 million OCRed news items (articles, advertisements, official notifications and captions of illustrations). We generated simulated queries from the documents’ content, but we were also able to use real user queries from the library’s search logs.

We evaluated the inequality in the results for the top ranked 10, 100 and 1,000 documents ($$c=10, c=100$$ and $$c=1,000$$).

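| Queries | Retrieval Model | c=10 | c=100 | c=1,000 |
|---|---|---|---|---|
| Real queries | BM25 | 0.97 | 0.89 | 0.76 |
| Real queries | LM1000 | 0.97 | 0.90 | 0.78 |
| Simulated queries | BM25 | 0.85 | 0.52 | |
| Simulated queries | LM1000 | 0.89 | 0.71 | |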

The lower Gini coefficients show that BM25 is generally the less biased retrieval model, which is in line with the findings of other studies. This is also visible when we plot the Lorenz curves ($$c=100$$): the curve for LM1000 deviates much more from the diagonal than the curve for BM25.

While the Gini coefficient and the Lorenz curve help us assess the extent of bias, they cannot tell us where it originates. For this we need further analyses.

### Example: Document Length

In the above-mentioned study, we also investigated whether search results are influenced by the length of documents. The length of the texts in the KB (National Library of the Netherlands) collection varies from 33 to 381,563 words (with a mean length of 362 words).

We sorted all documents according to their length and divided them into bins of 20,000 documents (5,135 bins in total). For each bin, we calculated the mean $$r(d)$$.
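
The binning step can be sketched as follows. The function name `mean_rd_per_length_bin`, the `(length, r(d))` input pairs and the tiny bin size are hypothetical and only serve to illustrate the procedure. The study itself used bins of 20,000 documents.

```python
def mean_rd_per_length_bin(docs, bin_size):
    """docs: iterable of (length_in_words, r_d) pairs.

    Returns one (mean length, mean r(d)) tuple per bin, shortest bin first.
    """
    ordered = sorted(docs, key=lambda pair: pair[0])  # sort by length
    result = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]  # last bin may be smaller
        mean_len = sum(length for length, _ in chunk) / len(chunk)
        mean_rd = sum(rd for _, rd in chunk) / len(chunk)
        result.append((mean_len, mean_rd))
    return result


# Made-up (length, r(d)) pairs for six documents.
docs = [(300, 4), (50, 1), (1200, 9), (40, 0), (800, 6), (350, 5)]
bins = mean_rd_per_length_bin(docs, bin_size=2)
print(bins)
```

Plotting the mean $$r(d)$$ of each bin against its position in the length ordering makes trends such as a preference for long documents visible.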

The pattern we obtained for LM1000 shows an upward trend for longer documents, which means that longer documents are easier to retrieve.

The results for BM25, however, indicate that documents of medium length are most retrievable. Documents at both extremes are less retrievable.

We can see a bias in both patterns: LM1000 clearly favors longer documents, whereas BM25 overcompensates for long documents and seems to fail to compensate for short ones.

## Summary

Standard IR evaluation metrics typically focus on the effectiveness and efficiency of document retrieval. The retrievability metric extends this toolset and allows us to measure the accessibility of each document. By analyzing the distributions of $$r(d)$$ scores we can assess potential undesired bias in the search results. The meaningfulness of $$r(d)$$ scores for a given setup can be validated using a known-item-search setup. Finally, we showed the results of an analysis of bias in two retrieval models with respect to document length.

## Myriam Traub

Myriam has a background in Information Retrieval. After working in research for a few years, she returned to industry to work as a Data Scientist. She loves to find insights in messy data sets and to come up with simple ways to tell complex data stories.
