Document Ingestion with Spring AI: Loading Text, JSON, and Custom Chunks into Your RAG Pipeline

In the first post we built a basic RAG system — one text file, default chunking, done. It worked great for a quick demo, but real-world documents don’t come in neat .txt files. You’ll deal with JSON exports, PDFs, maybe DOCX files from that one coworker who still writes everything in Word. And the default chunking strategy? It’s a decent starting point, but you’ll want to tune it once you start caring about retrieval quality.

This post is all about document ingestion — the first phase of any RAG pipeline. We’ll go beyond the single TextReader from Demo 1 and explore how Spring AI handles different formats, how to enrich documents with metadata, and how to control chunk sizes to get better results. Everything maps to Demo 2: Document Ingestion in the rag-spring-ai project.

1. Why Ingestion Matters More Than You Think

Here’s the thing about RAG: your retrieval is only as good as what you put into the vector store. If your chunks are too big, the embedding becomes a vague blob that matches everything loosely. Too small, and you lose context — the LLM gets sentence fragments that don’t make sense on their own.

And if you’re shoving raw text into the store without any metadata, good luck figuring out which document a chunk came from when you’re debugging why the LLM gave a weird answer at 2am.

The ingestion phase is where you set yourself up for success (or frustration) down the line. Let’s get it right.

2. What’s New in Demo 2

**Figure:** The three ingestion strategies in Demo 2. Text files go through TextReader + default splitting, JSON files use JsonReader with no chunking, and the custom path lets you tune chunk size at runtime. All three converge into the same PgVectorStore.

Demo 1 had a single TextReader and the default TokenTextSplitter. Demo 2 introduces:

Feature	What it does
Text ingestion	Same as Demo 1, but with richer metadata (`source`, `type`)
JSON ingestion	Loads structured JSON with `JsonReader`, mapping specific fields
Custom chunking	Configurable `TokenTextSplitter` — tune chunk size, min size, separators
Detailed response	Every endpoint returns stats: docs read, chunks created, format, source

The code lives in two files: IngestionService.java and IngestionController.java.

3. The IngestionService — Three Ways to Ingest

3.1 Plain Text Ingestion

This is the simplest path. Read a .txt file, attach some metadata, split into chunks, and store:

public Map<String, Object> ingestText() {
    var reader = new TextReader(textDocument);
    reader.getCustomMetadata().put("source", "spring-ai-overview.txt");
    reader.getCustomMetadata().put("type", "text");

    List<Document> documents = reader.get();

    var splitter = new TokenTextSplitter();
    List<Document> chunks = splitter.apply(documents);

    vectorStore.add(chunks);

    return Map.of(
            "source", "spring-ai-overview.txt",
            "format", "text",
            "documentsRead", documents.size(),
            "chunksCreated", chunks.size(),
            "status", "ingested"
    );
}

It’s pretty close to what we had in Demo 1, but notice the metadata. We’re tagging each document with a source and type before it gets split. That metadata propagates to every chunk — so when you later retrieve a chunk during a query, you can trace it back to the original file. Small thing, but it makes debugging so much easier.
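
To make the propagation concrete, here is a toy model in plain Java. It is only a sketch: `Doc`, `splitEvery`, and the fixed character-based split are hypothetical stand-ins, not Spring AI's `Document` or `TokenTextSplitter`. The point is just that each chunk carries a copy of its parent's metadata.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a document: text plus metadata. Not Spring AI's Document class.
record Doc(String text, Map<String, Object> metadata) {}

class MetadataPropagationSketch {

    // Splits a document into pieces of at most `size` characters (a hypothetical,
    // character-based split) and copies the parent's metadata onto every piece.
    static List<Doc> splitEvery(Doc parent, int size) {
        List<Doc> chunks = new ArrayList<>();
        String text = parent.text();
        for (int start = 0; start < text.length(); start += size) {
            int end = Math.min(start + size, text.length());
            // Each chunk gets its own copy so later edits don't leak between chunks.
            chunks.add(new Doc(text.substring(start, end), new HashMap<>(parent.metadata())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        Doc doc = new Doc("Spring AI makes RAG pipelines easier to build.",
                Map.of("source", "spring-ai-overview.txt", "type", "text"));
        for (Doc chunk : splitEvery(doc, 20)) {
            System.out.println(chunk.metadata().get("source") + " -> " + chunk.text());
        }
    }
}
```

Every line printed by `main` starts with the same source file name, which is exactly what you want to see when tracing a retrieved chunk back to its origin.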

3.2 JSON Ingestion

Not everything is a text file. If your data lives in JSON — think FAQ exports, product catalogs, knowledge bases — Spring AI’s JsonReader has you covered:

public Map<String, Object> ingestJson() {
    var reader = new JsonReader(jsonDocument, "title", "content", "category");
    List<Document> documents = reader.get();

    vectorStore.add(documents);

    return Map.of(
            "source", "ai-concepts.json",
            "format", "json",
            "documentsRead", documents.size(),
            "chunksCreated", documents.size(),
            "status", "ingested"
    );
}

The key part is the JsonReader constructor: new JsonReader(jsonDocument, "title", "content", "category"). Those three strings tell the reader which JSON fields to extract. Each object in the JSON array becomes a separate Document, with the specified fields combined into the document content.

Here’s what our sample JSON looks like:

[
  {
    "title": "What is RAG?",
    "content": "RAG stands for Retrieval-Augmented Generation...",
    "category": "concept"
  },
  {
    "title": "What is a Vector Embedding?",
    "content": "A vector embedding is a numerical representation...",
    "category": "concept"
  }
]

Notice there’s no chunking step here. Each JSON object is already a self-contained piece of knowledge — a question-answer pair, a concept definition, etc. They’re small enough to embed directly without splitting. This is one of the nice things about structured data: the source format already gives you natural chunk boundaries.
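
As a rough picture of what "fields combined into the document content" means, the sketch below joins the selected keys as `key: value` lines. That layout is an assumption for illustration; inspect the text of a returned `Document` to see what your Spring AI version actually produces. `contentFrom` is a hypothetical helper, not part of `JsonReader`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JsonFieldMappingSketch {

    // Builds document text from the chosen keys of one JSON object (already parsed
    // into a Map). Keys missing from the object are skipped.
    // The "key: value" line layout is an assumption made for this example.
    static String contentFrom(Map<String, Object> jsonObject, List<String> keys) {
        StringBuilder content = new StringBuilder();
        for (String key : keys) {
            if (jsonObject.containsKey(key)) {
                content.append(key).append(": ").append(jsonObject.get(key)).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("title", "What is RAG?");
        entry.put("content", "RAG stands for Retrieval-Augmented Generation...");
        entry.put("category", "concept");
        entry.put("internalId", 42); // hypothetical extra field, not selected below

        System.out.print(contentFrom(entry, List.of("title", "content", "category")));
    }
}
```

The practical consequence is that fields you don't list never reach the embedding, so pick the ones that carry meaning for search.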

3.3 Custom Chunking

**Figure:** The chunk size trade-off. The same document splits into 6, 3, or 2 chunks at small, default, and large chunk sizes. Smaller chunks (300 tokens) produce more precise embeddings but less context per chunk. Larger chunks (1200 tokens) preserve more context but match less precisely. The default (800 tokens) is a reasonable middle ground.

This is where things get interesting. The default TokenTextSplitter targets 800 tokens per chunk with a 350-character minimum chunk size, and it does not overlap chunks. That’s reasonable, but what if your documents are very technical (shorter chunks might capture concepts more precisely) or very narrative (longer chunks might preserve flow better)?

public Map<String, Object> ingestWithCustomChunking(int chunkSize, int minChunkSize) {
    var reader = new TextReader(textDocument);
    reader.getCustomMetadata().put("source", "spring-ai-overview.txt");
    reader.getCustomMetadata().put("chunking", "custom");

    List<Document> documents = reader.get();
    var splitter = TokenTextSplitter.builder()
            .withChunkSize(chunkSize)
            .withMinChunkSizeChars(minChunkSize)
            .withMinChunkLengthToEmbed(5)
            .withMaxNumChunks(100)
            .withKeepSeparator(true)
            .build();
    List<Document> chunks = splitter.apply(documents);

    vectorStore.add(chunks);

    return Map.of(
            "source", "spring-ai-overview.txt",
            "format", "text",
            "chunkSize", chunkSize,
            "minChunkSize", minChunkSize,
            "documentsRead", documents.size(),
            "chunksCreated", chunks.size(),
            "status", "ingested"
    );
}

Let’s break down the TokenTextSplitter.builder() parameters:

Parameter	What it controls
`withChunkSize(chunkSize)`	Target number of tokens per chunk. Smaller = more precise embeddings, more chunks. Larger = more context per chunk, fewer chunks.
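`withMinChunkSizeChars(minChunkSize)`	Minimum chunk size in characters. The splitter prefers to end a chunk at a sentence or line break, but only when the resulting chunk stays longer than this, which guards against tiny, useless chunks.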
`withMinChunkLengthToEmbed(5)`	Minimum character length to bother creating an embedding for. Skips chunks that are just whitespace or a few characters.
`withMaxNumChunks(100)`	Safety cap on the number of chunks. Prevents runaway splitting on very large documents.
`withKeepSeparator(true)`	Whether to keep line breaks in the chunk text instead of flattening them into spaces. Helps maintain readability in retrieved chunks.
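
For a quick feel of how `chunkSize` drives the chunk count, a back-of-the-envelope estimate is the token count divided by the chunk size, rounded up. The sketch below uses a hypothetical 1,800-token document, and `estimateChunks` is an illustrative helper, not a Spring AI method. Treat the result as a rough lower bound, since a splitter that cuts at sentence boundaries tends to produce chunks shorter than the target.

```java
class ChunkCountEstimate {

    // Ceiling division: how many chunks of at most `chunkSize` tokens are needed
    // to cover `totalTokens` tokens, assuming no overlap between chunks.
    static int estimateChunks(int totalTokens, int chunkSize) {
        if (totalTokens < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("totalTokens must be >= 0 and chunkSize > 0");
        }
        return (totalTokens + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        int totalTokens = 1800; // hypothetical document length, for illustration only
        for (int chunkSize : new int[] {300, 800, 1200}) {
            System.out.println(chunkSize + " tokens/chunk -> about "
                    + estimateChunks(totalTokens, chunkSize) + " chunks");
        }
        // prints 6, 3, and 2 chunks respectively
    }
}
```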

The endpoint accepts chunkSize and minChunkSize as request parameters, so you can experiment without changing code:

# Small chunks (300 tokens) — more precise, more chunks
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=300&minChunkSize=50" | jq

# Large chunks (1000 tokens) — more context, fewer chunks
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=1000&minChunkSize=100" | jq

Try both and then ask the same question through the Basic RAG endpoint — you’ll often see different chunks retrieved, and sometimes noticeably different answer quality.

4. The Controller — Clean and Thin

The IngestionController is intentionally minimal. It just maps HTTP endpoints to service methods:

@RestController
@RequestMapping("/api/ingest")
public class IngestionController {

    private final IngestionService ingestionService;

    public IngestionController(IngestionService ingestionService) {
        this.ingestionService = ingestionService;
    }

    @PostMapping("/text")
    public Map<String, Object> ingestText() {
        return ingestionService.ingestText();
    }

    @PostMapping("/json")
    public Map<String, Object> ingestJson() {
        return ingestionService.ingestJson();
    }

    @PostMapping("/custom-chunking")
    public Map<String, Object> ingestCustomChunking(
            @RequestParam(defaultValue = "400") int chunkSize,
            @RequestParam(defaultValue = "50") int minChunkSize) {
        return ingestionService.ingestWithCustomChunking(chunkSize, minChunkSize);
    }
}

Three endpoints, each returning a detailed stats map so you can see exactly what happened:

Action	HTTP Method	Endpoint	Parameters
Ingest text	`POST`	`/api/ingest/text`	—
Ingest JSON	`POST`	`/api/ingest/json`	—
Custom chunking	`POST`	`/api/ingest/custom-chunking`	`chunkSize` (default 400), `minChunkSize` (default 50)

5. Running It Yourself

If you already have the infrastructure running from Demo 1, you’re good to go. If not:

# Clone and start everything
git clone https://github.com/gdunhao/rag-spring-ai.git
cd rag-spring-ai
docker compose up -d

# Wait for models to download
docker compose logs -f ollama-init
# Wait for "All models ready!"

# Start the app
./mvnw spring-boot:run

Try all three ingestion strategies

# 1. Ingest plain text
curl -s -X POST http://localhost:8080/api/ingest/text | jq

# 2. Ingest JSON
curl -s -X POST http://localhost:8080/api/ingest/json | jq

# 3. Ingest with custom chunking (small chunks)
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=300&minChunkSize=50" | jq

# 4. Ingest with custom chunking (large chunks)
curl -s -X POST "http://localhost:8080/api/ingest/custom-chunking?chunkSize=1000&minChunkSize=100" | jq

Example response from text ingestion:

{
  "source": "spring-ai-overview.txt",
  "format": "text",
  "documentsRead": 1,
  "chunksCreated": 5,
  "status": "ingested"
}

Example response from JSON ingestion:

{
  "source": "ai-concepts.json",
  "format": "json",
  "documentsRead": 8,
  "chunksCreated": 8,
  "status": "ingested"
}

Notice how JSON ingestion produces one chunk per JSON object (no splitting needed), while text ingestion splits the document into multiple chunks.

You can also use the interactive playground at localhost:8080 — the Document Ingestion tab wraps these same endpoints in a visual UI where you can experiment with different chunk sizes and see the results immediately.

6. Choosing the Right Ingestion Strategy

So when should you use what? Here’s my take:

Scenario	Reader	Chunking
Plain text docs (README, notes, articles)	`TextReader`	`TokenTextSplitter` with defaults is fine for most cases
Structured data (FAQ, catalogs, knowledge bases)	`JsonReader` with field mapping	Usually none — each object is already a natural chunk
Long technical docs	`TextReader`	Custom `TokenTextSplitter` — try smaller chunks (300-500 tokens) for more precise retrieval
Narrative content (stories, reports)	`TextReader`	Custom `TokenTextSplitter` — larger chunks (800-1200 tokens) to preserve narrative flow
Multi-format (PDF, DOCX, HTML)	`TikaDocumentReader`	`TokenTextSplitter` — Tika handles format conversion, then chunk as usual

The honest answer is: experiment. Ingest the same document with different chunk sizes, ask the same questions, and see which setup gives better answers. The custom chunking endpoint in Demo 2 is built exactly for this kind of experimentation.
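
If you'd rather script the experiment than retype curl commands, a small sweep with the JDK's built-in HTTP client might look like the sketch below. It assumes the app is running at `http://localhost:8080` as in the earlier examples. The class name, the `ingestUri` helper, and the particular chunk sizes are my own choices for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

class ChunkSizeSweep {

    // Builds the custom-chunking URL for one experiment run.
    static URI ingestUri(String baseUrl, int chunkSize, int minChunkSize) {
        return URI.create(baseUrl + "/api/ingest/custom-chunking?chunkSize=" + chunkSize
                + "&minChunkSize=" + minChunkSize);
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Chunk sizes to try; adjust to taste.
        for (int chunkSize : List.of(300, 500, 800, 1000)) {
            HttpRequest request = HttpRequest
                    .newBuilder(ingestUri("http://localhost:8080", chunkSize, 50))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The response body is the stats map returned by the endpoint.
            System.out.println(chunkSize + " -> " + response.body());
        }
    }
}
```

One caveat: going by the service code shown earlier, each call only adds chunks to the store. Unless you clear the store between runs, later queries will retrieve from a mix of chunk sizes, which muddies the comparison.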

7. What About PDF and DOCX?

**Figure:** Spring AI's DocumentReader abstraction. Every reader (TextReader, JsonReader, PagePdfDocumentReader, TikaDocumentReader) produces `List<Document>`. The downstream pipeline (split, embed, store) is identical regardless of the source format.

You might have noticed the IngestionService Javadoc mentions PagePdfDocumentReader and TikaDocumentReader. These are real Spring AI readers: PagePdfDocumentReader reads PDFs page by page, and TikaDocumentReader handles PDF, DOCX, HTML, and dozens of other formats through Apache Tika. They’re not wired up in Demo 2 (to keep the demo self-contained without extra dependencies), but the pattern is identical:

// PDF: one Document per page
var pdfReader = new PagePdfDocumentReader(pdfResource);
List<Document> pdfDocs = pdfReader.get();

// Tika: any format (PDF, DOCX, HTML, PPTX, etc.)
var tikaReader = new TikaDocumentReader(anyResource);
List<Document> tikaDocs = tikaReader.get();

// Then split and store, same as always (shown here for the PDF documents;
// tikaDocs goes through exactly the same two calls)
var chunks = new TokenTextSplitter().apply(pdfDocs);
vectorStore.add(chunks);

The beauty of Spring AI’s DocumentReader abstraction is that everything downstream (splitting, embedding, storing) stays exactly the same regardless of the source format. Swap the reader, keep the rest.
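
Here is the same idea in miniature, using only the JDK. If every reader is a supplier of documents, the pipeline never needs to know which one it got. Everything in this sketch is a hypothetical stand-in for Spring AI's readers, splitter, and vector store: the `ingest` method, the lambda readers, the blank-line split, and the in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class ReaderAbstractionSketch {

    // One pipeline for every source: read, split on blank lines, store.
    // Returns the number of chunks stored.
    static int ingest(Supplier<List<String>> reader, List<String> store) {
        int stored = 0;
        for (String document : reader.get()) {
            for (String chunk : document.split("\\n\\s*\\n")) {
                if (!chunk.isBlank()) {
                    store.add(chunk.strip());
                    stored++;
                }
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>();
        // Two different "readers" that both hand back a list of documents.
        Supplier<List<String>> textReader =
                () -> List.of("First paragraph.\n\nSecond paragraph.");
        Supplier<List<String>> faqReader =
                () -> List.of("Q: What is RAG?", "Q: What is an embedding?");

        System.out.println(ingest(textReader, store)); // 2
        System.out.println(ingest(faqReader, store));  // 2
        System.out.println(store.size());              // 4
    }
}
```

Swapping `textReader` for `faqReader` changes nothing inside `ingest`, which is the property the real readers give you.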

8. Key Takeaways

Metadata is cheap, debugging is expensive. Always tag your documents with at least a source field. Future-you will thank present-you.
Not everything needs chunking. Structured data (JSON, CSV rows) often has natural boundaries. Don’t blindly split what’s already the right size.
Chunk size is a tuning knob, not a constant. Smaller chunks = more precise retrieval. Larger chunks = more context. There’s no universal best value — it depends on your documents and your questions.
Spring AI’s reader abstractions are composable. TextReader, JsonReader, TikaDocumentReader — they all produce List<Document>. Everything after that (splitting, embedding, storing) is the same code path.
Test without infrastructure. Mock the VectorStore, use ClassPathResource for test documents, and verify your ingestion logic in milliseconds.

Series Roadmap

Post	Topic	What it adds
Post 1	Basic RAG	End-to-end retrieval pipeline with `QuestionAnswerAdvisor`
→ You are here	Document Ingestion	Multi-format loading, custom chunk sizes, metadata enrichment
Coming next	Vector Store Operations	Direct similarity search, threshold tuning, embedding inspection
	Chat with Memory	Conversational RAG with per-session history and context carryover
	Advisors	Composing RAG + memory + safety advisors in a pipeline
	Structured Output	Extracting typed Java records from LLM responses
	Function Calling	Letting the LLM invoke Java methods as tools
	Multi-Document RAG	Multiple document collections with smart routing
	Metadata Filtering	Scoping vector search with metadata filters

Source code: github.com/gdunhao/rag-spring-ai — clone it, run make setup && make run, and open localhost:8080 for the interactive playground.