Document Ingestion with Spring AI: Loading Text, JSON, and Custom Chunks into Your RAG Pipeline
In the first post we built a basic RAG system — one text file, default chunking, done. It worked great for a quick demo, but real-world documents don’t come in neat .txt files. You’ll deal with JSON exports, PDFs, maybe DOCX files from that one coworker who still writes everything in Word. And the default chunking strategy? It’s a decent starting point, but you’ll want to tune it once you start caring about retrieval quality.
This post is all about document ingestion — the first phase of any RAG pipeline. We’ll go beyond the single TextReader from Demo 1 and explore how Spring AI handles different formats, how to enrich documents with metadata, and how to control chunk sizes to get better results. Everything maps to Demo 2: Document Ingestion in the rag-spring-ai project.
1. Why Ingestion Matters More Than You Think
Here’s the thing about RAG: your retrieval is only as good as what you put into the vector store. If your chunks are too big, the embedding becomes a vague blob that matches everything loosely. Too small, and you lose context — the LLM gets sentence fragments that don’t make sense on their own.
And if you’re shoving raw text into the store without any metadata, good luck figuring out which document a chunk came from when you’re debugging why the LLM gave a weird answer at 2am.
The ingestion phase is where you set yourself up for success (or frustration) down the line. Let’s get it right.
2. What’s New in Demo 2
Demo 1 had a single TextReader and the default TokenTextSplitter. Demo 2 introduces:
| Feature | What it does |
|---|---|
| Text ingestion | Same as Demo 1, but with richer metadata (source, type) |
| JSON ingestion | Loads structured JSON with JsonReader, mapping specific fields |
| Custom chunking | Configurable TokenTextSplitter — tune chunk size, min size, separators |
| Detailed response | Every endpoint returns stats: docs read, chunks created, format, source |
The code lives in two files: IngestionService.java and IngestionController.java.
3. The IngestionService — Three Ways to Ingest
3.1 Plain Text Ingestion
This is the simplest path. Read a .txt file, attach some metadata, split into chunks, and store:
public Map<String, Object> ingestText() {
    // Tag the document before splitting so every chunk inherits the metadata
    var reader = new TextReader(textDocument);
    reader.getCustomMetadata().put("source", "spring-ai-overview.txt");
    reader.getCustomMetadata().put("type", "text");
    List<Document> documents = reader.get();

    // Split with the default settings, then embed and store
    var splitter = new TokenTextSplitter();
    List<Document> chunks = splitter.apply(documents);
    vectorStore.add(chunks);

    return Map.of(
        "source", "spring-ai-overview.txt",
        "format", "text",
        "documentsRead", documents.size(),
        "chunksCreated", chunks.size(),
        "status", "ingested"
    );
}

It’s pretty close to what we had in Demo 1, but notice the metadata. We’re tagging each document with a source and type before it gets split. That metadata propagates to every chunk, so when you later retrieve a chunk during a query, you can trace it back to the original file. Small thing, but it makes debugging so much easier.
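To make the propagation idea concrete, here is a minimal, framework-free sketch. The `Doc` record and the `splitWithMetadata` helper are hypothetical stand-ins for Spring AI’s `Document` and splitter, not the library’s implementation, and the fixed character cut is only there to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not Spring AI classes.
class MetadataPropagationSketch {

    // A document is just text plus a metadata map.
    record Doc(String text, Map<String, Object> metadata) {}

    // Cuts the text into fixed-size character chunks and copies the
    // parent's metadata onto every chunk.
    static List<Doc> splitWithMetadata(Doc parent, int chunkChars) {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be positive");
        }
        List<Doc> chunks = new ArrayList<>();
        String text = parent.text();
        for (int start = 0; start < text.length(); start += chunkChars) {
            int end = Math.min(start + chunkChars, text.length());
            chunks.add(new Doc(text.substring(start, end), Map.copyOf(parent.metadata())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        Doc parent = new Doc("Spring AI makes RAG pipelines easier to build.",
                Map.of("source", "spring-ai-overview.txt", "type", "text"));
        // Every chunk can still answer "which file did you come from?"
        for (Doc chunk : splitWithMetadata(parent, 20)) {
            System.out.println(chunk.metadata().get("source") + " -> " + chunk.text());
        }
    }
}
```

The point is the copy inside the loop: whatever you attach before splitting is what you get to inspect on each retrieved chunk later.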
3.2 JSON Ingestion
Not everything is a text file. If your data lives in JSON — think FAQ exports, product catalogs, knowledge bases — Spring AI’s JsonReader has you covered:
public Map<String, Object> ingestJson() {
    // Only the listed fields end up in each document's content
    var reader = new JsonReader(jsonDocument, "title", "content", "category");
    List<Document> documents = reader.get();

    // No splitter here: each JSON object is stored as a single chunk
    vectorStore.add(documents);

    return Map.of(
        "source", "ai-concepts.json",
        "format", "json",
        "documentsRead", documents.size(),
        "chunksCreated", documents.size(),
        "status", "ingested"
    );
}

The key part is the JsonReader constructor: new JsonReader(jsonDocument, "title", "content", "category"). Those three strings tell the reader which JSON fields to extract. Each object in the JSON array becomes a separate Document, with the specified fields combined into the document content.
Here’s what our sample JSON looks like:
[
  {
    "title": "What is RAG?",
    "content": "RAG stands for Retrieval-Augmented Generation...",
    "category": "concept"
  },
  {
    "title": "What is a Vector Embedding?",
    "content": "A vector embedding is a numerical representation...",
    "category": "concept"
  }
]

Notice there’s no chunking step here. Each JSON object is already a self-contained piece of knowledge: a question-answer pair, a concept definition, and so on. They’re small enough to embed directly without splitting. This is one of the nice things about structured data: the source format already gives you natural chunk boundaries.
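If you want a feel for what “fields combined into the document content” means, the sketch below builds the text for one entry from a list of selected keys. The `toContent` helper is hypothetical, and the one-line-per-field "key: value" layout is an assumption for illustration; check the content of a real Document in your Spring AI version before relying on the exact format.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; the exact text layout is an assumption.
class JsonFieldMappingSketch {

    // Builds document text from the selected keys, one "key: value" line each.
    // Keys that are missing from the object are skipped.
    static String toContent(Map<String, Object> jsonObject, List<String> keys) {
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            if (jsonObject.containsKey(key)) {
                sb.append(key).append(": ").append(jsonObject.get(key)).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("title", "What is RAG?");
        entry.put("content", "RAG stands for Retrieval-Augmented Generation...");
        entry.put("category", "concept");
        entry.put("internalId", 42); // not selected, so it never reaches the embedding
        System.out.print(toContent(entry, List.of("title", "content", "category")));
    }
}
```

Choosing the fields is effectively choosing what gets embedded, so leave out IDs, timestamps, and other noise that would not help a similarity search.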
3.3 Custom Chunking
This is where things get interesting. The default TokenTextSplitter targets 800 tokens per chunk with a minimum chunk size of 350 characters (it has no overlap setting). That’s reasonable, but what if your documents are very technical (shorter chunks might capture concepts more precisely) or very narrative (longer chunks might preserve flow better)?
public Map<String, Object> ingestWithCustomChunking(int chunkSize, int minChunkSize) {
    // Same reader as the plain-text path, tagged so custom-chunked entries are easy to spot
    var reader = new TextReader(textDocument);
    reader.getCustomMetadata().put("source", "spring-ai-overview.txt");
    reader.getCustomMetadata().put("chunking", "custom");
    List<Document> documents = reader.get();

    // Build a splitter from the request parameters instead of the defaults
    var splitter = TokenTextSplitter.builder()
        .withChunkSize(chunkSize)
        .withMinChunkSizeChars(minChunkSize)
        .withMinChunkLengthToEmbed(5)
        .withMaxNumChunks(100)
        .withKeepSeparator(true)
        .build();
    List<Document> chunks = splitter.apply(documents);
    vectorStore.add(chunks);

    return Map.of(
        "source", "spring-ai-overview.txt",
        "format", "text",
        "chunkSize", chunkSize,
        "minChunkSize", minChunkSize,
        "documentsRead", documents.size(),
        "chunksCreated", chunks.size(),
        "status", "ingested"
    );
}

Let’s break down the TokenTextSplitter.builder() parameters:
| Parameter | What it controls |
|---|---|
| withChunkSize(chunkSize) | Target number of tokens per chunk. Smaller = more precise embeddings, more chunks. Larger = more context per chunk, fewer chunks. |
| withMinChunkSizeChars(minChunkSize) | Minimum chunk size in characters. Prevents tiny, useless chunks at the end of a document. |
| withMinChunkLengthToEmbed(5) | Minimum character length to bother creating an embedding for. Skips chunks that are just whitespace or a few characters. |
| withMaxNumChunks(100) | Safety cap on the number of chunks. Prevents runaway splitting on very large documents. |
| withKeepSeparator(true) | Whether to preserve paragraph separators in the output. Helps maintain readability in retrieved chunks. |
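To see how these knobs interact, here is a deliberately simplified splitter. It is a sketch, not TokenTextSplitter: it pretends every whitespace-separated word is one token, and the `split` helper and its parameter names are made up for this example. It still shows the three effects that matter: chunk size drives chunk count, short fragments get dropped, and the cap stops runaway splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration only: one whitespace-separated word counts as one
// "token". The real TokenTextSplitter works on tokenizer output instead.
class ChunkSizeSketch {

    static List<String> split(String text, int chunkSize, int minLengthToEmbed, int maxNumChunks) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        // Stop early once the safety cap is reached
        for (int start = 0; start < words.length && chunks.size() < maxNumChunks; start += chunkSize) {
            int end = Math.min(start + chunkSize, words.length);
            String chunk = String.join(" ", Arrays.copyOfRange(words, start, end));
            // Skip fragments too short to be worth embedding
            if (chunk.length() >= minLengthToEmbed) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "token ".repeat(1000); // 1000 pretend tokens
        for (int size : new int[] {300, 1000}) {
            System.out.println("chunkSize=" + size + " -> "
                    + split(text, size, 5, 100).size() + " chunks");
        }
    }
}
```

With real documents the numbers will differ, because actual token counts rarely line up with word counts, but the direction of each trade-off is the same.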
The endpoint accepts chunkSize and minChunkSize as request parameters, so you can experiment without changing code:
# Small chunks (300 tokens) — more precise, more chunks
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=300&minChunkSize=50" | jq
# Large chunks (1000 tokens) — more context, fewer chunks
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=1000&minChunkSize=100" | jq

Try both and then ask the same question through the Basic RAG endpoint. You’ll often see different chunks retrieved, and sometimes noticeably different answer quality.
4. The Controller — Clean and Thin
The IngestionController is intentionally minimal. It just maps HTTP endpoints to service methods:
@RestController
@RequestMapping("/api/ingest")
public class IngestionController {

    private final IngestionService ingestionService;

    public IngestionController(IngestionService ingestionService) {
        this.ingestionService = ingestionService;
    }

    @PostMapping("/text")
    public Map<String, Object> ingestText() {
        return ingestionService.ingestText();
    }

    @PostMapping("/json")
    public Map<String, Object> ingestJson() {
        return ingestionService.ingestJson();
    }

    @PostMapping("/custom-chunking")
    public Map<String, Object> ingestCustomChunking(
            @RequestParam(defaultValue = "400") int chunkSize,
            @RequestParam(defaultValue = "50") int minChunkSize) {
        return ingestionService.ingestWithCustomChunking(chunkSize, minChunkSize);
    }
}

Three endpoints, each returning a detailed stats map so you can see exactly what happened:
| Action | HTTP Method | Endpoint | Parameters |
|---|---|---|---|
| Ingest text | POST | /api/ingest/text | none |
| Ingest JSON | POST | /api/ingest/json | none |
| Custom chunking | POST | /api/ingest/custom-chunking | chunkSize (default 400), minChunkSize (default 50) |
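Since chunkSize and minChunkSize come straight from the query string, you may want a small guard before they reach the splitter. The demo does not include one as far as the code above shows; the `validate` helper and the `MAX_CHUNK_SIZE` bound below are hypothetical, and in a Spring controller you would probably translate the exception into a 400 response.

```java
// Hypothetical guard for illustration; not part of the demo project.
class ChunkParamsSketch {

    // Assumed upper bound, chosen arbitrarily for this example
    static final int MAX_CHUNK_SIZE = 2000;

    // Rejects values that could never produce a sensible chunking run.
    static void validate(int chunkSize, int minChunkSize) {
        if (chunkSize < 1 || chunkSize > MAX_CHUNK_SIZE) {
            throw new IllegalArgumentException(
                    "chunkSize must be between 1 and " + MAX_CHUNK_SIZE + ", got " + chunkSize);
        }
        if (minChunkSize < 0) {
            throw new IllegalArgumentException(
                    "minChunkSize must not be negative, got " + minChunkSize);
        }
    }

    public static void main(String[] args) {
        validate(400, 50); // the controller defaults pass
        try {
            validate(-1, 50);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```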
5. Running It Yourself
If you already have the infrastructure running from Demo 1, you’re good to go. If not:
# Clone and start everything
git clone https://github.com/gdunhao/rag-spring-ai.git
cd rag-spring-ai
docker compose up -d
# Wait for models to download
docker compose logs -f ollama-init
# Wait for "All models ready!"
# Start the app
./mvnw spring-boot:run

Try all three ingestion strategies
# 1. Ingest plain text
curl -s -X POST http://localhost:8080/api/ingest/text | jq
# 2. Ingest JSON
curl -s -X POST http://localhost:8080/api/ingest/json | jq
# 3. Ingest with custom chunking (small chunks)
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=300&minChunkSize=50" | jq
# 4. Ingest with custom chunking (large chunks)
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=1000&minChunkSize=100" | jq

Example response from text ingestion:
{
  "source": "spring-ai-overview.txt",
  "format": "text",
  "documentsRead": 1,
  "chunksCreated": 5,
  "status": "ingested"
}

Example response from JSON ingestion:
{
  "source": "ai-concepts.json",
  "format": "json",
  "documentsRead": 8,
  "chunksCreated": 8,
  "status": "ingested"
}

Notice how JSON ingestion produces one chunk per JSON object (no splitting needed), while text ingestion splits the document into multiple chunks.
You can also use the interactive playground at localhost:8080 — the Document Ingestion tab wraps these same endpoints in a visual UI where you can experiment with different chunk sizes and see the results immediately.
6. Choosing the Right Ingestion Strategy
So when should you use what? Here’s my take:
| Scenario | Reader | Chunking |
|---|---|---|
| Plain text docs (README, notes, articles) | TextReader | TokenTextSplitter with defaults is fine for most cases |
| Structured data (FAQ, catalogs, knowledge bases) | JsonReader with field mapping | Usually none; each object is already a natural chunk |
| Long technical docs | TextReader | Custom TokenTextSplitter; try smaller chunks (300-500 tokens) for more precise retrieval |
| Narrative content (stories, reports) | TextReader | Custom TokenTextSplitter; larger chunks (800-1200 tokens) to preserve narrative flow |
| Multi-format (PDF, DOCX, HTML) | TikaDocumentReader | TokenTextSplitter; Tika handles format conversion, then chunk as usual |
The honest answer is: experiment. Ingest the same document with different chunk sizes, ask the same questions, and see which setup gives better answers. The custom chunking endpoint in Demo 2 is built exactly for this kind of experimentation.
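If you ingest a folder of mixed files, the table boils down to a dispatch on file type. Here is a rough sketch of that idea. The `readerFor` helper is hypothetical, it returns the reader’s name as a string so the example runs without Spring AI on the classpath, and the extension list (including treating .md as plain text) is my assumption.

```java
import java.util.Locale;

// Hypothetical dispatch for illustration; returns reader names, not reader instances.
class ReaderChoiceSketch {

    // Picks a reader based on the file extension, case-insensitively.
    static String readerFor(String filename) {
        int dot = filename.lastIndexOf('.');
        String ext = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return switch (ext) {
            case "txt", "md" -> "TextReader";
            case "json" -> "JsonReader";
            case "pdf", "docx", "html", "pptx" -> "TikaDocumentReader";
            default -> throw new IllegalArgumentException("No reader configured for: " + filename);
        };
    }

    public static void main(String[] args) {
        for (String file : new String[] {"notes.txt", "faq.json", "handbook.docx"}) {
            System.out.println(file + " -> " + readerFor(file));
        }
    }
}
```

Failing loudly on unknown extensions is a deliberate choice here: silently skipping files is exactly the kind of thing that makes retrieval gaps hard to debug later.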
7. What About PDF and DOCX?
You might have noticed the IngestionService Javadoc mentions PagePdfDocumentReader and TikaDocumentReader. These are real Spring AI readers: PagePdfDocumentReader reads PDFs page by page, and TikaDocumentReader handles PDFs, DOCX, HTML, and dozens of other formats through Apache Tika. They’re not wired up in Demo 2 (to keep the demo self-contained without extra dependencies), but the pattern is identical:
// PDF: one Document per page
var pdfReader = new PagePdfDocumentReader(pdfResource);
List<Document> pdfDocs = pdfReader.get();

// Tika: any format (PDF, DOCX, HTML, PPTX, etc.)
var tikaReader = new TikaDocumentReader(anyResource);
List<Document> tikaDocs = tikaReader.get();

// Then split and store, same as always (shown here for the Tika output)
var chunks = new TokenTextSplitter().apply(tikaDocs);
vectorStore.add(chunks);

The beauty of Spring AI’s DocumentReader abstraction is that every reader returns a List<Document>, so everything downstream (splitting, embedding, storing) stays exactly the same regardless of the source format. Swap the reader, keep the rest.
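The “swap the reader, keep the rest” idea is easy to demonstrate without any framework. In this sketch a reader is just a `Supplier` of documents and a splitter is a `Function` over them; plain strings stand in for documents, and both readers are made-up lambdas. As far as I know, Spring AI’s reader and transformer interfaces follow the same functional shape, but treat that as my reading rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Framework-free illustration; strings stand in for Document objects.
class ReaderSwapSketch {

    // One pipeline for every format: read, split, store. Returns the chunk count.
    static int ingest(Supplier<List<String>> reader,
                      Function<List<String>, List<String>> splitter,
                      List<String> store) {
        List<String> documents = reader.get();
        List<String> chunks = splitter.apply(documents);
        store.addAll(chunks);
        return chunks.size();
    }

    public static void main(String[] args) {
        // Two hypothetical readers: different sources, same return type
        Supplier<List<String>> textReader = () -> List.of("intro\n\ndetails");
        Supplier<List<String>> pagedReader = () -> List.of("page one", "page two");

        // A toy splitter that cuts on blank lines
        Function<List<String>, List<String>> splitter = docs -> docs.stream()
                .flatMap(doc -> Arrays.stream(doc.split("\n\n")))
                .collect(Collectors.toList());

        List<String> store = new ArrayList<>();
        System.out.println(ingest(textReader, splitter, store) + " chunks from the text reader");
        System.out.println(ingest(pagedReader, splitter, store) + " chunks from the paged reader");
    }
}
```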
8. Key Takeaways
- Metadata is cheap, debugging is expensive. Always tag your documents with at least a source field. Future-you will thank present-you.
- Not everything needs chunking. Structured data (JSON, CSV rows) often has natural boundaries. Don’t blindly split what’s already the right size.
- Chunk size is a tuning knob, not a constant. Smaller chunks = more precise retrieval. Larger chunks = more context. There’s no universal best value; it depends on your documents and your questions.
- Spring AI’s reader abstractions are composable. TextReader, JsonReader, TikaDocumentReader: they all produce List<Document>. Everything after that (splitting, embedding, storing) is the same code path.
- Test without infrastructure. Mock the VectorStore, use ClassPathResource for test documents, and verify your ingestion logic in milliseconds.
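On that last point, here is the shape such a test can take, sketched without Spring or a mocking library. `ChunkStore` is a hypothetical one-method stand-in for VectorStore, `RecordingStore` is a hand-rolled fake, and the blank-line splitting in `ingest` is a toy replacement for the real splitter. In the actual project you would more likely mock VectorStore with your mocking library of choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; not the demo's real service or Spring AI's VectorStore.
class IngestionTestSketch {

    // Minimal store contract the ingestion logic depends on
    interface ChunkStore {
        void add(List<String> chunks);
    }

    // Fake that records what was added instead of calling an embedding model
    static final class RecordingStore implements ChunkStore {
        final List<String> saved = new ArrayList<>();

        @Override
        public void add(List<String> chunks) {
            saved.addAll(chunks);
        }
    }

    // Logic under test: split on blank lines, skip empty parts, store the rest
    static int ingest(String text, ChunkStore store) {
        List<String> chunks = new ArrayList<>();
        for (String part : text.split("\n\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        store.add(chunks);
        return chunks.size();
    }

    public static void main(String[] args) {
        RecordingStore store = new RecordingStore();
        int count = ingest("First paragraph.\n\nSecond paragraph.", store);
        if (count != 2 || !store.saved.equals(List.of("First paragraph.", "Second paragraph."))) {
            throw new AssertionError("unexpected chunks: " + store.saved);
        }
        System.out.println("ingestion logic verified without a vector store");
    }
}
```

No database, no embedding model, no Docker: the test only checks that the right chunks would have been handed to the store.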
Series Roadmap
| Post | Topic | What it adds |
|---|---|---|
| Post 1 | Basic RAG | End-to-end retrieval pipeline with QuestionAnswerAdvisor |
| → You are here | Document Ingestion | Multi-format loading, custom chunk sizes, metadata enrichment |
| Coming next | Vector Store Operations | Direct similarity search, threshold tuning, embedding inspection |
| | Chat with Memory | Conversational RAG with per-session history and context carryover |
| | Advisors | Composing RAG + memory + safety advisors in a pipeline |
| | Structured Output | Extracting typed Java records from LLM responses |
| | Function Calling | Letting the LLM invoke Java methods as tools |
| | Multi-Document RAG | Multiple document collections with smart routing |
| | Metadata Filtering | Scoping vector search with metadata filters |
Source code: github.com/gdunhao/rag-spring-ai. Clone it, run make setup && make run, and open localhost:8080 for the interactive playground.