Co-author: Geisa Faustino
Introduction
Cold start problems are a common challenge in AI-driven search and retrieval systems, particularly when there is little to no historical data to guide relevance assessments. In our recent project, we faced this issue while developing an information retrieval system for a customer with a vast collection of technical documents. To address this, we devised a structured approach to build a reliable ground truth dataset, leveraging a combination of Text REtrieval Conference (TREC) pooling and GPT-4o assisted ranking.
This article outlines our methodology, detailing how we streamlined the labor-intensive process of manual labeling while ensuring high-quality search evaluation.
Challenges in Generating Ground Truth
Ground truth datasets are often created by experts who manually label the data, ensuring that the results are accurate and reliable. This process is usually labor-intensive and time-consuming. Additionally, it is difficult to secure dedicated time from experts to provide the required relevance judgments.
In our project:
- Domain experts were on the production floor, and we had limited time with them.
- To provide accurate answers to the queries, domain experts would have to scan numerous files (some significantly large, spanning hundreds of pages). For instance, with 1,000 documents indexed using a page-chunking strategy and 50 queries, a domain expert would need to review 50,000 query-document combinations, which is not feasible.
These challenges make traditional approaches to collecting ground truth impractical. Therefore, we developed a strategy to collect a reliable and robust ground truth dataset while making the most of the domain experts' limited availability. The solution combines the TREC pooling approach, a GPT-4o assisted ranking methodology, and a labeling tool, as presented in the next sections.
Process
In this section, we outline the process used for building a reliable ground truth dataset. It begins with collecting the user queries. Then, we delve into the TREC pooling approach, which we used to manage the high cost of manual document labeling by focusing on the subset of documents most likely to be relevant. Subsequently, these question-and-answer (Q&A) pairs are validated by multiple subject matter experts (SMEs) using the labeling tool developed by our team, as illustrated in Figure 1 below.
TREC Pooling Approach
Pooling is a well-known method used in the TREC evaluations to address the huge cost of manually labeling every document in a large collection. The core idea behind pooling is to focus human assessors’ attention on a subset (or “pool”) of documents that are most likely to be relevant as demonstrated in Figure 1.
Why Is TREC Pooling Effective
In typical TREC evaluations, multiple search systems (or different configurations of a single system) retrieve documents for the same queries. Then, each system outputs its top k-ranked documents — often k is equal to 10, 50, or 100. These results are then combined into a single pool of documents for each query.
By focusing on the top results, we significantly reduce the labeling workload while maintaining a high likelihood of capturing the most relevant documents.
However, this approach involves a trade-off; while some relevant documents may occasionally fall outside the top k, the overall efficiency and cost savings of labeling only the top-ranked results generally outweigh this limitation.
Ground Truth Collection
Steps 1 and 2 below summarize how we collect search results from various methods and create a unified set of documents:
- We utilized hybrid, text-based, and vector searches on our source documents, retrieving the top 100 results for each query. This selection assumes that documents ranked beyond 100 are likely irrelevant, given the effectiveness of these methods.
- Upon gathering the top 100 results from each method, duplicates are eliminated to establish a distinct set of candidate documents for each query. Because the top lists of the different systems overlap, the merged, deduplicated list typically contains fewer than (100 × number of methods) documents. A minimal sketch of this pooling step is shown after this list.
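The sketch below illustrates the merge-and-deduplicate step in Python. The hybrid, text, and vector search callables are placeholders for whatever retrieval methods your system exposes; only the pooling logic reflects the approach described above.

```python
from typing import Callable, Dict, List

# Placeholder type: each search method takes a query and a result count
# and returns a ranked list of document/page IDs.
SearchFn = Callable[[str, int], List[str]]

def build_pool(query: str, search_fns: Dict[str, SearchFn], top_k: int = 100) -> List[str]:
    """Merge the top-k results from every search method and deduplicate,
    keeping the order in which documents are first seen."""
    pool: List[str] = []
    seen = set()
    for search in search_fns.values():
        for doc_id in search(query, top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool

# Example usage (hybrid_search, text_search, vector_search are assumed wrappers):
# pools = {q: build_pool(q, {"hybrid": hybrid_search,
#                            "text": text_search,
#                            "vector": vector_search}) for q in queries}
```

Preserving first-seen order keeps a rough notion of rank within the pool, although pooled documents are ultimately judged independently of where they were retrieved.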
Assumptions and Practical Considerations
- We assume that the set of queries provided is representative and diverse, reflecting the actual questions technicians would ask on the production floor.
- We assume our search systems are robust, so documents beyond the top 100 are unlikely to be relevant.
- Missing some lower-ranked relevant documents is acceptable; TREC pooling balances coverage and cost effectively.
- We continue to depend on domain experts to verify actual relevance within the dataset. This step guarantees high-quality relevance assessments for the documents most likely to matter.
By using TREC’s pooling strategy, we maintain the feasibility of our labeling tasks while retaining a high level of completeness. After defining the pool, these documents are input into our labeling tool for domain experts to evaluate.
Reducing Manual Labeling Effort
To minimize the burden of manual labeling, we implemented a solution based on GPT-4o. This approach utilizes the multi-modal capabilities of GPT-4o to analyze entire page content, including images, and assigns a relevancy score to each document.
Relevancy Scoring Process
Multi-Modal Document Review: GPT-4o can analyze both text and accompanying images or diagrams. By using an actual image (e.g., page 10 from a user manual) instead of only OCR-extracted text, it provides more context for ranking relevancy.
Relevancy Scores (0–5): GPT-4o evaluates the relevance of each document/page on a scale from 0 to 5, with 5 being the highest relevance. This enables us to translate GPT-4o’s analysis into a numerical score.
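As an illustration, the snippet below shows how a single page image could be scored with the OpenAI Python SDK. The prompt wording, the PNG rendering of pages, and the score_page helper are assumptions for this sketch, not the exact implementation used in the project.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK; an Azure OpenAI client works similarly

client = OpenAI()

def score_page(query: str, page_png_path: str) -> int:
    """Ask GPT-4o to rate how relevant a page image is to a query on a 0-5 scale."""
    with open(page_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "You are judging search relevance. Given the query and the page image, "
        "reply with a single integer from 0 (not relevant) to 5 (highly relevant).\n"
        f"Query: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # Sending the rendered page image gives the model layout, tables,
                # and diagrams that OCR-extracted text alone would miss.
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())
```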
Calibration Dataset and Score Threshold
To align GPT-4o’s scores with expert expectations, a subset (approximately 30% of the total set) is selected, and domain experts provide the correct answers (i.e., which documents/pages are relevant to each query). These calibrated results serve as a reference for GPT-4o’s predictions.
Finding The Right Score Threshold
By comparing the domain experts’ judgments from the calibration dataset with GPT-4o’s scores, a threshold is determined that captures most of the relevant documents. In our project, a score of 4 was chosen because it covers approximately 90% of the documents deemed relevant by the domain experts in the calibration set. Specifically, documents with scores of 4 or 5 are considered relevant, while those with scores below 4 (3, 2, 1, and 0) are not.
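The threshold selection itself can be a simple sweep over the calibration data. In this sketch, expert_relevant is the set of document/page IDs the domain experts marked relevant and gpt_scores maps each ID to its GPT-4o score; both names and the 90% coverage target are illustrative.

```python
def coverage_at_threshold(expert_relevant: set, gpt_scores: dict, threshold: int) -> float:
    """Fraction of expert-labeled relevant pages scoring at or above the threshold."""
    if not expert_relevant:
        return 0.0
    captured = sum(1 for doc_id in expert_relevant if gpt_scores.get(doc_id, 0) >= threshold)
    return captured / len(expert_relevant)

def pick_threshold(expert_relevant: set, gpt_scores: dict, target: float = 0.9) -> int:
    """Choose the highest threshold that still captures the target share of relevant pages."""
    for threshold in range(5, -1, -1):
        if coverage_at_threshold(expert_relevant, gpt_scores, threshold) >= target:
            return threshold
    return 0
```

A sweep like this makes the trade-off explicit: lowering the threshold raises coverage but increases the number of documents experts must review.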
With this threshold in place, documents at or above it are marked as “likely relevant” and sent for further review. Instead of reviewing thousands of documents, experts may only need to examine a few hundred, saving time and effort. This GPT-4o-based ranking step refined our TREC pooling approach and labeling workflow, reducing manual work while maintaining accuracy in our dataset.
Labeling Tool for Annotation
To improve efficiency and reduce errors in labeling for domain experts, we developed a labeling tool as a static website:
Inline Content Display: The tool presents the exact page content in the browser instead of just showing a filename and page number as shown in Figure 2. This feature helps save time and minimize context switching.
Relevance Scoring: Domain experts can assign a relevance score (e.g., 0, 1, etc.) directly on the displayed content, eliminating the need to manually open files or switch between Excel and a PDF viewer.
Export to QREL: Upon completion of labeling, the tool exports the judgments in the QREL format. The resulting file, which includes our queries and document IDs, is compatible with search evaluation tools like trec_eval. An example of this file is shown in Figure 2 below.
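For reference, a QREL file is plain text with one judgment per line: query ID, an iteration column (conventionally 0), document or page ID, and the relevance label. The sketch below, with made-up IDs, shows the general shape of such an export; it is not the labeling tool’s actual code.

```python
def export_qrels(judgments, path="qrels.txt"):
    """Write (query_id, doc_id, relevance) triples in the qrels layout trec_eval expects."""
    with open(path, "w") as f:
        for query_id, doc_id, relevance in judgments:
            f.write(f"{query_id} 0 {doc_id} {relevance}\n")

# Produces lines such as:
#   Q001 0 manual_A_page_010 1
#   Q001 0 manual_B_page_042 0
```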
The labeling tool and its documentation, including steps for generating document and query mapping files and launching the tool, are available at pdf-qrel-labeler.
Conclusion
The combination of TREC pooling, the GPT-4o assisted ranking methodology, and the labeling tool has yielded significant results, highlighting both time savings and efficiency gains in the document labeling process.
References
- TREC: Continuing Information Retrieval’s Tradition of Experimentation
- Text Retrieval Conference (TREC) — the official website
- trec_eval GitHub repository — a popular open-source library for evaluating retrieval systems
- Multi-Drop Polling — RAD Data Communications/Pulse Supply. 2007.
- Performance bounds for the effectiveness of pooling in multi-processing systems
- Effectiveness of sample pooling strategies for diagnosis of SARS-CoV-2: Specimen pooling vs. RNA elutes pooling
- Chowdhury, G. (2007), “TREC: Experiment and Evaluation in Information Retrieval”, Online Information Review, Vol. 31 No. 5, pp. 717–718.