
Is Semantic Chunking worth the computational cost?

Discover why fixed-size chunking often outperforms semantic chunking in real-world RAG systems. This study explores the trade-offs between simplicity, computational efficiency, and retrieval accuracy, challenging the assumption that semantic chunking is always the superior choice.


Overview

Semantic chunking is gaining traction in Retrieval-Augmented Generation (RAG) systems for its promise to divide documents into semantically coherent segments. However, our recent study challenges this trend, systematically evaluating semantic chunking against the simpler, computationally lighter method of fixed-size chunking. Spoiler alert: Semantic chunking doesn’t always deliver the promised magic.

What is chunking?

In RAG systems, chunking refers to splitting documents into smaller units, or "chunks," to improve the retrieval of relevant content. Two main approaches to chunking exist:

  1. Fixed-Size Chunking: Documents are divided into uniform segments based on a predefined token or sentence count (see the sketch after this list).
  2. Semantic Chunking: Documents are divided into segments based on the semantic coherence of sentences, grouping together sentences that share similar topics or meanings based on embedding models.
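
To make the distinction concrete, here is a minimal sketch of a fixed-size chunker that groups a set number of sentences per chunk. The regex sentence splitter and the chunk size are illustrative assumptions, not the study's exact setup:

```python
import re

def fixed_size_chunks(text: str, chunk_size: int = 5) -> list[str]:
    # Naive sentence splitter; the study's exact segmentation is not specified.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Group every `chunk_size` consecutive sentences into one chunk.
    return [
        " ".join(sentences[i : i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]
```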

Specifically, we consider two branches of semantic chunking:

  1. Breakpoint-based: Documents are divided based on semantic distance thresholds between consecutive sentences to maintain coherence (see the sketch after this list).
  2. Clustering-based: Semantically similar sentences are grouped together, potentially combining non-consecutive text to form topic-based chunks.
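
A minimal sketch of the breakpoint-based variant: embed neighboring sentences and start a new chunk whenever the cosine distance between them crosses a threshold. The embedding model name and the threshold value here are illustrative assumptions, not the study's configuration:

```python
import re
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def breakpoint_chunks(
    text: str, model: SentenceTransformer, threshold: float = 0.3
) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between neighboring sentences.
        distance = 1.0 - float(cos_sim(embeddings[i - 1], embeddings[i]))
        if distance > threshold:  # semantic breakpoint: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Illustrative model choice, not the one used in the study.
model = SentenceTransformer("all-MiniLM-L6-v2")
```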

Fixed-size chunking is computationally efficient but may scatter related sentences across chunks. Semantic chunking, on the other hand, aims to preserve context but comes with increased computational overhead.

How to evaluate chunking

While the evaluation of RAG systems is an active research area, there is no established framework for evaluating chunk quality directly. To address this gap, we evaluated both chunking strategies across three RAG tasks:

  1. Document Retrieval: We map the retrieved chunks back to their source documents, and then evaluate the documents (see the sketch after this list).
  2. Evidence Retrieval: We locate the specific evidence sentences in the retrieved chunks, and then evaluate the evidence sentences.
  3. Answer Generation: We measure the quality of answers generated by GPT-4o based on the retrieved chunks.
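
For the Document Retrieval task, the chunk-to-document mapping can be as simple as an order-preserving deduplication of each retrieved chunk's source-document id. A sketch, with a hypothetical `doc_id` field on each chunk:

```python
def retrieved_documents(retrieved_chunks: list[dict]) -> list[str]:
    # Order-preserving deduplication of source-document ids.
    seen: set[str] = set()
    doc_ids: list[str] = []
    for chunk in retrieved_chunks:
        doc_id = chunk["doc_id"]  # hypothetical field linking chunk to its source
        if doc_id not in seen:
            seen.add(doc_id)
            doc_ids.append(doc_id)
    return doc_ids
```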

Datasets & data processing

We used datasets from BEIR and RAGBench. Due to the short length of most BEIR documents, we synthesized longer ones by combining shorter documents from six datasets, then sampled 100 queries per dataset to retrieve the top-k chunks. For evidence retrieval and answer generation, we selected five RAGBench datasets and concatenated all documents in each dataset.

Evaluation metrics

Traditional metrics such as Recall@k and NDCG@k are a poor fit here because the number of relevant documents or evidence sentences varies from query to query. We instead adopted F1@5, which balances precision and recall. For each chunking strategy, we report results under its best hyperparameter configuration, using the dunzhang/stella_en_1.5B_v5 embedding model.
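
For reference, a minimal sketch of how F1@k can be computed from the top-k retrieved items and the full set of relevant items (not the study's exact evaluation code):

```python
def f1_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Precision and recall over the top-k retrieved items, combined as F1.
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)
```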

Evaluation results

F1@5 for Document Retrieval (%). Datasets marked with * are stitched. Rows are sorted by the average number of sentences per document (before stitching) in ascending order for easier comparison.

For Document Retrieval, Semantic Chunking showed slight advantages on stitched datasets. However, in realistic scenarios (datasets not artificially stitched), Fixed-size Chunking consistently outperformed Semantic Chunking.

F1@5 for Evidence Retrieval (%). Rows are sorted by the average number of sentences per document in ascending order for easier comparison.

For Evidence Retrieval, Semantic Chunking failed to show a clear advantage in identifying evidence sentences across datasets.

BERTScore for Answer Generation.

Using GPT-4o, differences between chunking strategies were negligible. This suggests that LLMs can compensate for retrieval imperfections by identifying relevant content within chunks.
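
As a reference point, here is a minimal sketch of computing BERTScore with the open-source `bert-score` package; the package choice is an assumption, as the post does not name its implementation:

```python
from bert_score import score

candidates = ["Paris is the capital of France."]  # generated answers
references = ["The capital of France is Paris."]  # reference answers

# Returns per-pair precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```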

Key findings

  1. Performance Differences Are Minimal: While Semantic Chunking showed slight improvements in specific scenarios (e.g., when documents were artificially stitched to contain diverse topics), these advantages were inconsistent.
  2. Fixed-Size Chunking Wins in Real-World Scenarios: In datasets that reflect typical document structures, fixed-size chunking often performed just as well or better. It benefits from simplicity and computational efficiency.
  3. Computational Costs vs. Gains: The added computational burden of semantic chunking was rarely justified. When high-quality embeddings were used, the choice of chunking strategy mattered less than expected.
  4. Task Dependency: The effectiveness of Semantic Chunking varied by task and dataset, and its advantages were often overshadowed by the quality of the embedding model or the structure of the data.

Why fixed-size chunking still holds its ground

Fixed-size chunking’s simplicity gives it a significant edge:

  • Consistency: It doesn’t rely on the nuanced semantics of embeddings, which can vary in quality.
  • Scalability: Less computational cost makes it easier to implement in large-scale systems.
  • Adaptability: Overlapping sentences between chunks can help maintain context without the need for complex algorithms.
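
As a concrete illustration of the overlap idea above, a sketch of fixed-size chunking with a sliding window, so neighboring chunks share sentences of context (sizes are illustrative):

```python
import re

def overlapping_chunks(
    text: str, chunk_size: int = 5, overlap: int = 1
) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Each chunk starts `chunk_size - overlap` sentences after the previous
    # one, so neighbors share `overlap` sentences of context.
    step = max(chunk_size - overlap, 1)
    return [
        " ".join(sentences[i : i + chunk_size])
        for i in range(0, len(sentences), step)
    ]
```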

Semantic chunking, while innovative, has some notable limitations:

  • Inconsistent Gains: Its advantages largely depended on the structure of the dataset, particularly with artificially stitched documents.
  • Error Propagation: Clustering-based chunking methods often introduced errors by misgrouping sentences from different topics when semantic similarity was high but positional coherence was low.
  • Context Loss in Small Chunks: Breakpoint-based chunking sometimes resulted in single-sentence chunks, which lacked meaningful context for retrieval tasks.

Conclusion

If you’re designing a RAG system, don’t abandon Fixed-size Chunking just yet. While Semantic Chunking might offer marginal gains in some scenarios, it’s not a one-size-fits-all solution. Instead, focus on improving embedding quality and consider the trade-offs between computational resources and retrieval accuracy.

Our platform currently uses the fixed-size chunking strategy. To build your own RAG application with Vectara, simply sign up for a free trial account, upload your data, and get started in minutes. If you need help, you can find us in the Vectara discussion forum or on Discord.
