Complete guide to Spring AI, the official Spring project for integrating AI and Large Language Models (LLMs) into Java applications. Learn the architecture and core components, and build real-world AI-powered features.
1. What is Spring AI?
Spring AI is an official Spring project that provides a unified abstraction layer for integrating artificial intelligence capabilities into Java applications. It simplifies working with Large Language Models (LLMs), embedding models, vector databases, and AI-powered features while maintaining Spring's familiar patterns: dependency injection, auto-configuration, and testing support.
Instead of writing provider-specific code for OpenAI, Anthropic, or other AI services, Spring AI offers a consistent API that lets you switch between providers or models with minimal code changes. It follows Spring Boot's convention-over-configuration philosophy, making AI integration as straightforward as adding a dependency and configuring properties.
2. Why Use Spring AI?
- Provider abstraction: write code once and switch between OpenAI, Anthropic, Azure OpenAI, Ollama, and other providers by changing configuration.
- Spring Boot integration: auto-configuration, property-based setup, and seamless integration with the Spring ecosystem.
- Modular architecture: include only the components you need (chat, embeddings, vector stores, RAG) to keep dependencies minimal.
- Production-ready: built-in support for retries, observability, and error handling; concerns like rate limiting and caching are easy to layer on (see Best Practices).
- Testing support: easy mocking and testing of AI components using Spring's testing framework.
- RAG support: built-in Retrieval Augmented Generation (RAG) framework for context-aware AI applications.
3. Spring AI Architecture
Spring AI follows a modular, layered architecture that separates concerns and promotes flexibility:
3.1 Architecture Layers
- Application Layer: Your Spring Boot application code (controllers, services, repositories).
- Spring AI Abstractions: Core interfaces like ChatModel, EmbeddingModel, and VectorStore.
- Provider Implementations: Concrete implementations for different AI providers (OpenAI, Anthropic, etc.).
- AI Provider APIs: External HTTP APIs or local model servers.
This architecture allows you to:
- Write business logic against stable Spring AI interfaces
- Switch AI providers without changing application code
- Test with mock implementations
- Combine multiple providers in the same application
4. Core Components
4.1 ChatModel
The ChatModel interface is the primary abstraction for interacting with LLMs. It handles conversational AI, text generation, and chat-based interactions.
Key Features:
- Multi-turn conversations with message history
- System prompts and user messages
- Streaming responses for real-time interactions
- Function calling and tool integration
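A minimal sketch of a multi-turn call through this abstraction (named ChatClient in the pre-1.0 releases that this guide's examples use); accessor names vary slightly between releases:
Prompt prompt = new Prompt(List.of(
    new SystemMessage("You are a concise technical assistant."),
    new UserMessage("Explain dependency injection in one sentence.")
));
// call(Prompt) returns a ChatResponse wrapping the generated assistant message
String answer = chatClient.call(prompt).getResult().getOutput().getContent();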
4.2 EmbeddingModel
EmbeddingModel converts text into numerical vectors (embeddings) that capture semantic meaning. Essential for semantic search, similarity matching, and RAG applications.
4.2.1 Understanding Vector Embeddings
Vector Embeddings are numerical representations of text (or other data) in a high-dimensional space. They transform words, sentences, or documents into arrays of numbers that capture semantic meaning.
Imagine representing the meaning of words as numbers:
"cat" → [0.2, 0.8, 0.1, 0.5, ...] (1536 numbers for OpenAI)
"dog" → [0.3, 0.7, 0.2, 0.4, ...]
"car" → [0.1, 0.2, 0.9, 0.3, ...]
Key Properties:
- Similar meanings → Similar vectors: Words with related meanings produce vectors that are close together in the high-dimensional space
- Different meanings → Different vectors: Unrelated words produce vectors that are far apart
- Fixed size: Each embedding has the same number of dimensions (e.g., 1536 for OpenAI's text-embedding-ada-002)
4.2.2 How Embeddings Work
The embedding process involves three main steps:
- Text Input: The model receives text input (e.g., "What is artificial intelligence?")
- Embedding Model Processing: The neural network breaks text into tokens, analyzes context and meaning, and generates a numerical representation
- Vector Output: The model outputs a fixed-size vector (e.g., 1536 numbers for OpenAI ada-002)
In Spring AI, this process is simplified:
// Spring AI makes this simple:
EmbeddingModel embeddingModel; // Auto-configured
// Convert text to vector
List<Double> embedding = embeddingModel.embed("Hello, world!");
// Result: [0.123, -0.456, 0.789, ...] (1536 dimensions)
4.2.3 Embedding Models
An Embedding Model is a neural network that converts text into vectors. Different models have different characteristics:
- OpenAI text-embedding-ada-002: 1536 dimensions, 8191 token context length, affordable and high quality
- OpenAI text-embedding-3-large: 3072 dimensions, higher quality
- OpenAI text-embedding-3-small: 1536 dimensions, faster processing
- Other providers: Cohere, local models, etc.
Use Cases:
- Document similarity search
- Semantic search in knowledge bases
- Clustering and classification
- RAG context retrieval
4.3 VectorStore
VectorStore is an abstraction for storing and querying vector embeddings. A Vector Store is a database optimized for storing and searching vectors efficiently.
4.3.1 Why Vector Stores?
Problem: Traditional databases can't efficiently search by similarity. They're designed for exact matches, not semantic similarity.
Solution: Vector stores use specialized indexes (like HNSW) for fast similarity search, allowing you to find documents with similar meanings rather than exact text matches.
4.3.2 How Vector Stores Work
The vector store process involves several steps:
- Store Document: Add a document (e.g., "AI is transforming...")
- Generate Embedding: Convert the document to a vector [0.123, -0.456, ...]
- Store in Vector Store: Save the vector along with the document content and metadata
- Query: When searching (e.g., "What is AI?"), generate a query embedding
- Find Similar Vectors: Use distance metrics to find the most similar vectors
- Return Top Results: Retrieve the most similar documents
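In code, the whole cycle reduces to two calls on the VectorStore abstraction. A sketch (SearchRequest construction differs across releases; section 7.3 shows a full service):
// Steps 1-3: the store embeds and persists the document
vectorStore.add(List.of(new Document("AI is transforming software development.")));
// Steps 4-6: the query is embedded and the nearest documents are returned
List<Document> results = vectorStore.similaritySearch(
    SearchRequest.builder().withQuery("What is AI?").withTopK(3).build());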
4.3.3 Distance Metrics
Distance metrics measure how similar two vectors are. Spring AI supports several metrics:
1. Cosine Distance
- Formula: 1 - cosine_similarity
- What it measures: Angle between vectors (ignores magnitude)
- Range: 0 (identical) to 2 (opposite)
- Why use it: Focuses on direction, not magnitude; excellent for text embeddings; normalized range
2. Euclidean Distance (L2)
- Formula: √(Σ(xi - yi)²)
- What it measures: Straight-line distance between points
- When to use: When magnitude matters in your use case
3. Dot Product
- Formula: Σ(xi × yi)
- What it measures: How aligned vectors are
- When to use: When you need raw similarity scores
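To make the three metrics concrete, here is a small self-contained sketch computing each one for a pair of vectors (plain Java, no Spring AI types):
static double dotProduct(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum; // raw alignment score
}
static double euclideanDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum); // straight-line distance between the points
}
static double cosineDistance(float[] a, float[] b) {
    double norms = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
    return 1.0 - dotProduct(a, b) / norms; // 0 = same direction, 2 = opposite
}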
4.3.4 Index Types
Vector stores use specialized indexes to enable fast similarity search:
HNSW (Hierarchical Navigable Small World)
- What it is: Graph-based index for fast similarity search
- Benefits: O(log n) search time, high recall rate, scalable to millions of vectors
- Trade-offs: Uses more memory, takes time to build the index
- Best for: Production applications requiring high accuracy and performance
IVFFlat
- What it is: Inverted file index
- Benefits: Memory efficient, fast to build
- Trade-offs: Lower recall than HNSW, slower search for large datasets
- Best for: Smaller datasets or when memory is constrained
4.3.5 Supported Vector Stores
Spring AI supports multiple vector databases:
- PostgreSQL with pgvector: Native SQL integration, HNSW index support, cosine distance optimization
- Pinecone: Managed vector database service
- Chroma: Open-source vector database
- Weaviate: Vector search engine
- Milvus: Open-source vector database
- Redis: In-memory vector storage
- Simple in-memory store: For testing and development
4.3.6 PgVectorStore Configuration
When using PostgreSQL with pgvector, you can configure the distance type (builder method names vary slightly across Spring AI releases):
@Bean
public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel) {
return new PgVectorStore.Builder(jdbcTemplate, embeddingModel)
.withDistanceType(PgVectorStore.PgDistanceType.COSINE_DISTANCE)
.build();
}
4.4 Document
The Document class represents text content with metadata. Used for storing and retrieving documents in vector stores for RAG applications.
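A sketch of constructing one (the metadata keys here are arbitrary examples):
Document document = new Document(
    "Our refund policy allows returns within 30 days of purchase.",
    Map.of("source", "policy.pdf", "section", "refunds"));
// The content is what gets embedded; metadata travels alongside for filtering and attribution.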
4.5 RAG (Retrieval Augmented Generation)
RAG enhances LLM responses with relevant context from your documents. It combines retrieval of relevant information with generation of responses based on that context.
4.5.1 The Problem RAG Solves
Without RAG:
User: "What is our refund policy?"
LLM: [Generic answer based on training data, may be outdated or incorrect]
With RAG:
User: "What is our refund policy?"
System:
1. Search documents for "refund policy"
2. Find relevant sections from YOUR documents
3. Add to prompt: "Based on: [your actual policy]..."
4. LLM: [Answer based on YOUR current documents]
4.5.2 RAG Components
Spring AI provides a complete RAG framework that combines the following (the ingestion side is sketched in code after this list):
- Document Loading: Load documents from various sources (PDF, text files, web pages)
- Text Splitting: Chunk documents into manageable pieces
- Embedding: Convert chunks to vectors
- Storage: Store in vector databases
- Retrieval: Find relevant context for queries
- Generation: Use retrieved context to generate accurate responses
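The query side of this pipeline is shown in full in section 7.4. The ingestion side can be sketched with Spring AI's ETL types (TextReader, TokenTextSplitter); treat the exact signatures as assumptions, since they shifted between releases:
package com.example.ai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import java.util.List;
@Service
public class IngestionService {
    private final VectorStore vectorStore;
    public IngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }
    public void ingest(Resource resource) {
        List<Document> documents = new TextReader(resource).get();        // load
        List<Document> chunks = new TokenTextSplitter().apply(documents); // split into token-sized chunks
        vectorStore.add(chunks);                                          // embed and store (the store embeds internally)
    }
}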
4.5.3 RAG Process Flow
The RAG process follows these steps:
- User Query: User asks a question (e.g., "What is machine learning?")
- Generate Query Embedding: Convert the query to a vector [0.123, -0.456, ...]
- Vector Similarity Search: Find top K similar documents in the vector store
- Retrieve Document Context: Extract the content from the most similar documents
- Build Prompt with Context: Combine retrieved context with the user's question
- Send to LLM: The LLM generates an answer using the provided context
- Return Answer: Return the generated response to the user
Section 7.4 walks through this flow end-to-end in code.
4.6 Prompt Templates
Spring AI supports prompt templating using StringTemplate, allowing dynamic prompt construction with variables and conditionals.
4.6.1 What are Prompt Templates?
Prompt Templates are reusable text patterns for LLM interactions. They provide a structured way to create prompts with placeholders that get filled with dynamic content.
4.6.2 Why Use Templates?
Without Template (ad-hoc string concatenation):
String prompt = "Answer: " + question; // brittle, inconsistent, error-prone
With Template (maintainable and consistent):
PromptTemplate template = new PromptTemplate("""
Answer the following question based on the provided context.
If the answer cannot be found in the context, say "I don't know."
Context: {context}
Question: {question}
""");
Prompt prompt = template.create(Map.of(
"context", context,
"question", question
));
Benefits:
- Consistency: Same format every time, ensuring predictable LLM behavior
- Maintainability: Update prompt structure in one place
- Reusability: Use the same template across different queries and contexts
- Early validation: missing template variables are reported when the prompt is rendered, rather than failing silently
4.6.3 RAG Prompt Template Structure
For RAG (Retrieval-Augmented Generation) applications, prompt templates typically include:
- Instructions: "Answer based on context" - tells the LLM how to use the context
- Context Placeholder: {context} - filled with retrieved documents
- Question Placeholder: {question} - the user's question
- Fallback Instructions: "I don't know" if context is insufficient - prevents hallucination
4.7 Model Context Protocol (MCP)
Spring AI supports the Model Context Protocol, enabling AI models to interact with external tools, databases, and services through a standardized interface.
5. Project Setup
To get started with Spring AI, add the Spring AI BOM (Bill of Materials) and the specific dependencies you need. Note that artifact names and several core interfaces were renamed for the 1.0 GA release; the coordinates and code in this guide follow the pre-1.0 milestone naming, so consult the reference documentation if you target a newer version.
5.1 Gradle Configuration
Create a build.gradle file with the following configuration:
plugins {
id 'java'
id 'org.springframework.boot' version '3.2.0'
id 'io.spring.dependency-management' version '1.1.4'
}
java {
sourceCompatibility = '17'
targetCompatibility = '17'
}
ext {
// 0.8.x matches the pre-1.0 ChatClient API and starter names used throughout this guide;
// the 1.0 GA release renamed both the starters and several core interfaces.
springAiVersion = '0.8.1'
}
dependencyManagement {
imports {
mavenBom "org.springframework.ai:spring-ai-bom:${springAiVersion}"
}
}
repositories {
mavenCentral()
maven { url 'https://repo.spring.io/milestone' } // pre-GA Spring AI artifacts live here
}
dependencies {
// Spring Boot starters
implementation 'org.springframework.boot:spring-boot-starter-web'
implementation 'org.springframework.boot:spring-boot-starter-validation'
// Spring AI OpenAI (or use spring-ai-anthropic, spring-ai-ollama, etc.)
implementation 'org.springframework.ai:spring-ai-openai-spring-boot-starter'
// Optional: Vector Store (e.g., PostgreSQL with pgvector)
implementation 'org.springframework.ai:spring-ai-pgvector-store-spring-boot-starter'
// Testing
testImplementation 'org.springframework.boot:spring-boot-starter-test'
testImplementation 'org.mockito:mockito-core'
testImplementation 'org.mockito:mockito-junit-jupiter'
testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
}
tasks.named('test') {
useJUnitPlatform()
testLogging {
events "passed", "skipped", "failed"
exceptionFormat "full"
}
}
5.2 Maven Configuration (Alternative)
If you prefer Maven, use the following pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.2.0</version>
<relativePath/>
</parent>
<properties>
<spring-ai.version>0.8.1</spring-ai.version>
</properties>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>${spring-ai.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring AI OpenAI -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<!-- Optional: Vector Store -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>
</dependencies>
<!-- Pre-GA Spring AI artifacts are published to the Spring milestone repository -->
<repositories>
<repository>
<id>spring-milestones</id>
<url>https://repo.spring.io/milestone</url>
</repository>
</repositories>
</project>
6. Configuration
Configure Spring AI using application.properties or application.yml. Spring Boot's auto-configuration handles the rest.
6.1 OpenAI Configuration
# application.properties
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4
spring.ai.openai.chat.options.temperature=0.7
spring.ai.openai.chat.options.max-tokens=500
6.2 Anthropic Configuration
# application.properties
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
spring.ai.anthropic.chat.options.model=claude-3-opus-20240229
spring.ai.anthropic.chat.options.temperature=0.7
6.3 Ollama (Local Models) Configuration
# application.properties
spring.ai.ollama.base-url=http://localhost:11434
spring.ai.ollama.chat.options.model=llama2
6.4 Vector Store Configuration (PostgreSQL)
# application.properties
spring.datasource.url=jdbc:postgresql://localhost:5432/vectordb
spring.datasource.username=postgres
spring.datasource.password=password
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
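The pgvector extension (CREATE EXTENSION vector) must be installed in the target database before the store's table can be created. Newer Spring AI releases can also create the schema for you via a property (verify availability in your version):
spring.ai.vectorstore.pgvector.initialize-schema=true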
7. Real-World Examples
7.1 Example 1: Simple Chat Service
A basic service that uses ChatClient to generate text responses:
package com.example.ai.service;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.stereotype.Service;
import java.util.List;
@Service
public class ChatService {
private final ChatClient chatClient;
public ChatService(ChatClient chatClient) {
this.chatClient = chatClient;
}
public String chat(String userMessage) {
return chatClient.call(userMessage);
}
public String chatWithSystemPrompt(String systemPrompt, String userMessage) {
Prompt prompt = new Prompt(List.of(
new SystemMessage(systemPrompt),
new UserMessage(userMessage)
));
return chatClient.call(prompt).getResult().getOutput().getContent();
}
}
7.2 Example 2: REST Controller for Chat
Expose the chat service via REST API:
package com.example.ai.controller;
import com.example.ai.service.ChatService;
import org.springframework.web.bind.annotation.*;
@RestController
@RequestMapping("/api/chat")
public class ChatController {
private final ChatService chatService;
public ChatController(ChatService chatService) {
this.chatService = chatService;
}
@PostMapping
public ChatResponse chat(@RequestBody ChatRequest request) {
String response = chatService.chat(request.message());
return new ChatResponse(response);
}
// DTOs
public record ChatRequest(String message) {}
public record ChatResponse(String response) {}
}
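With the application running locally (default port 8080 assumed), the endpoint can be exercised with curl:
# Send a chat message
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello!"}'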
7.3 Example 3: Document Embedding and Vector Store
Store documents as embeddings and perform semantic search:
package com.example.ai.service;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
import java.util.List;
@Service
public class DocumentService {
// The store computes embeddings internally through its configured EmbeddingModel,
// so the service only needs the VectorStore itself.
private final VectorStore vectorStore;
public DocumentService(VectorStore vectorStore) {
this.vectorStore = vectorStore;
}
public void addDocument(String content, String metadata) {
Document document = new Document(content);
document.getMetadata().put("source", metadata);
vectorStore.add(List.of(document));
}
public List<Document> searchSimilar(String query, int topK) {
// SearchRequest construction differs across Spring AI releases (builder vs. static factory)
return vectorStore.similaritySearch(
SearchRequest.builder()
.withQuery(query)
.withTopK(topK)
.build()
);
}
}
7.4 Example 4: RAG Application
Complete RAG implementation that retrieves relevant context before generating responses:
package com.example.ai.service;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
@Service
public class RAGService {
private final ChatClient chatClient;
private final VectorStore vectorStore;
public RAGService(ChatClient chatClient, VectorStore vectorStore) {
this.chatClient = chatClient;
this.vectorStore = vectorStore;
}
public String ask(String question) {
// 1. Retrieve relevant documents
List<Document> relevantDocs = vectorStore.similaritySearch(
SearchRequest.builder()
.withQuery(question)
.withTopK(5)
.build()
);
// 2. Build context from retrieved documents
String context = relevantDocs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n\n"));
// 3. Create prompt with context and question
String promptTemplate = """
Answer the following question based on the provided context.
If the answer cannot be found in the context, say "I don't know."
Context:
{context}
Question: {question}
""";
PromptTemplate template = new PromptTemplate(promptTemplate);
Prompt prompt = template.create(Map.of(
"context", context,
"question", question
));
// 4. Generate response
return chatClient.call(prompt).getResult().getOutput().getContent();
}
}
7.5 Example 5: Streaming Chat Response
Stream responses in real-time for better user experience:
package com.example.ai.controller;
import org.springframework.ai.chat.StreamingChatClient;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
@RestController
@RequestMapping("/api/chat")
public class StreamingChatController {
// Streaming lives on the StreamingChatClient interface in pre-1.0 releases;
// provider clients (e.g. OpenAI's) implement both ChatClient and StreamingChatClient.
private final StreamingChatClient chatClient;
public StreamingChatController(StreamingChatClient chatClient) {
this.chatClient = chatClient;
}
@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> streamChat(@RequestBody ChatRequest request) {
// stream(Prompt) emits partial ChatResponse chunks as the model generates them
return chatClient.stream(new Prompt(request.message()))
.map(response -> response.getResult().getOutput().getContent());
}
public record ChatRequest(String message) {}
}
7.6 Example 6: Multi-Provider Setup
Use multiple AI providers in the same application. The qualifier values below must match the bean names your auto-configuration actually creates:
package com.example.ai.service;
import org.springframework.ai.chat.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;
@Service
public class MultiProviderService {
private final ChatModel openAiChatModel;
private final ChatModel anthropicChatModel;
public MultiProviderService(
@Qualifier("openAiChatModel") ChatModel openAiChatModel,
@Qualifier("anthropicChatModel") ChatModel anthropicChatModel) {
this.openAiChatModel = openAiChatModel;
this.anthropicChatModel = anthropicChatModel;
}
public String useOpenAI(String prompt) {
// call(Prompt) returns a ChatResponse wrapping the generated message
return openAiChatModel.call(new Prompt(prompt)).getResult().getOutput().getContent();
}
public String useAnthropic(String prompt) {
return anthropicChatModel.call(new Prompt(prompt)).getResult().getOutput().getContent();
}
}
8. Best Practices
8.1 Error Handling and Retries
Implement retry logic for transient failures:
@Service
public class ResilientChatService {
private final ChatClient chatClient;
private final RetryTemplate retryTemplate;
public ResilientChatService(ChatClient chatClient) {
this.chatClient = chatClient;
this.retryTemplate = RetryTemplate.builder()
.maxAttempts(3)
.exponentialBackoff(1000, 2, 10000)
.retryOn(IOException.class)
.build();
}
public String chatWithRetry(String message) {
return retryTemplate.execute(context -> {
return chatClient.call(message);
});
}
}
8.2 Rate Limiting
Use a client-side rate limiter to avoid exhausting provider quotas (the example below uses Guava's RateLimiter):
@Service
public class RateLimitedChatService {
private final ChatClient chatClient;
private final RateLimiter rateLimiter;
public RateLimitedChatService(ChatClient chatClient) {
this.chatClient = chatClient;
this.rateLimiter = RateLimiter.create(10.0); // Guava's RateLimiter: 10 requests per second
}
public String chat(String message) {
rateLimiter.acquire();
return chatClient.call(message);
}
}
8.3 Caching
Cache responses for repeated queries (requires @EnableCaching and a configured cache):
@Service
public class CachedChatService {
private final ChatClient chatClient;
public CachedChatService(ChatClient chatClient) {
this.chatClient = chatClient;
}
@Cacheable(value = "chatResponses", key = "#message")
public String chat(String message) {
return chatClient.call(message);
}
}
8.4 Prompt Engineering
- Use system prompts to define AI behavior and context
- Structure prompts with clear instructions and examples
- Use prompt templates for dynamic content
- Validate and sanitize user inputs before sending to models
- Test prompts with different models to ensure consistency
8.5 Observability
- Log prompts and responses (be careful with PII/sensitive data); a minimal logging/timing wrapper is sketched after this list
- Track token usage and costs
- Monitor latency and error rates
- Use distributed tracing for debugging
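A minimal logging and timing wrapper, assuming the pre-1.0 ChatClient used throughout this guide (the class name is illustrative; Logger/LoggerFactory are SLF4J):
@Service
public class LoggingChatService {
    private static final Logger log = LoggerFactory.getLogger(LoggingChatService.class); // org.slf4j
    private final ChatClient chatClient;
    public LoggingChatService(ChatClient chatClient) {
        this.chatClient = chatClient;
    }
    public String chat(String message) {
        long start = System.nanoTime();
        try {
            String response = chatClient.call(message);
            // Log sizes and latency rather than raw text, to keep PII out of the logs
            log.info("chat ok: promptChars={} responseChars={} latencyMs={}",
                message.length(), response.length(), (System.nanoTime() - start) / 1_000_000);
            return response;
        } catch (RuntimeException e) {
            log.error("chat failed after {} ms", (System.nanoTime() - start) / 1_000_000, e);
            throw e;
        }
    }
}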
9. Testing
Spring AI makes testing easy with mock implementations. The following example demonstrates testing with JUnit 5 and Mockito:
package com.example.ai.service;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.ChatResponse;
import org.springframework.ai.chat.Generation;
import java.util.List;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.when;
@SpringBootTest
class ChatServiceTest {
@MockBean
private ChatClient chatClient;
@Autowired
private ChatService chatService;
@Test
void testChat() {
// Mock response
when(chatClient.call("Hello")).thenReturn("Hello, how can I help?");
// Test
String response = chatService.chat("Hello");
assertEquals("Hello, how can I help?", response);
}
@Test
void testChatWithSystemPrompt() {
// Test with system prompt
// ChatResponse wraps a list of Generations; constructor shapes vary across
// Spring AI versions, so adjust this to your release.
ChatResponse mockResponse = new ChatResponse(List.of(
new Generation("I'm a helpful assistant.")
));
when(chatClient.call(any(Prompt.class)))
.thenReturn(mockResponse);
String response = chatService.chatWithSystemPrompt(
"You are a helpful assistant.",
"Hello"
);
assertEquals("I'm a helpful assistant.", response);
}
}
Run tests using Gradle:
# Run all tests
./gradlew test
# Run specific test class
./gradlew test --tests ChatServiceTest
# Run with coverage
./gradlew test jacocoTestReport
10. Advanced Concepts
10.1 Token Limits
Tokens are pieces of text that LLMs process. They can be words, subwords, or characters depending on the tokenization method.
10.1.1 Token Limits by Model
- GPT-4: ~8,192 tokens for the base model (input + output combined); larger-context variants exist
- GPT-3.5: ~4,096 tokens for the base model; 16k variants exist
- text-embedding-ada-002: 8,191 tokens per input
10.1.2 Why Token Limits Matter
Problem: LLMs have maximum token limits. If your input exceeds this limit, you'll get an error or the model will truncate the input.
Solutions:
- Chunking: Split long documents into smaller pieces that fit within token limits (see the sketch after this list)
- Summarization: Summarize content before sending to the LLM
- Selective Context: Only send the most relevant parts of documents (this is what RAG does!)
- Streaming: For long outputs, use streaming to handle responses that exceed limits
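A naive character-window chunker with overlap, as referenced above (illustrative only; production systems usually split on token counts and sentence boundaries, as Spring AI's TokenTextSplitter does):
import java.util.ArrayList;
import java.util.List;
static List<String> chunk(String text, int chunkSize, int overlap) {
    List<String> chunks = new ArrayList<>();
    int step = chunkSize - overlap; // overlap must be smaller than chunkSize
    for (int start = 0; start < text.length(); start += step) {
        chunks.add(text.substring(start, Math.min(text.length(), start + chunkSize)));
    }
    return chunks;
}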
10.2 Temperature Parameter
Temperature controls the randomness and creativity in LLM responses. It's a parameter you can configure when making API calls.
10.2.1 Temperature Ranges
- Low (0.1-0.3): Deterministic, focused, consistent responses. Best for factual answers, code generation, or when you need reproducible results.
- Medium (0.7): Balanced creativity and consistency. Good default for most applications.
- High (0.9-1.0): Creative, varied, unpredictable. Best for creative writing, brainstorming, or when you want diverse responses.
10.2.2 Example
Prompt: "Complete: The sky is"
Temperature 0.1: "blue" (always the same, most likely answer)
Temperature 0.7: "blue", "cloudy", "clear" (varied but reasonable)
Temperature 1.0: "blue", "purple", "raining cats" (very creative, may be nonsensical)
In Spring AI, you configure temperature in your application properties:
spring.ai.openai.chat.options.temperature=0.7
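The property sets an application-wide default. For per-request control, options can be attached to the Prompt itself; a sketch using the OpenAI module's options class and the injected ChatClient from earlier examples (builder method names vary across releases):
// Options attached to the Prompt override the configured default for this call only
Prompt creative = new Prompt("Write a tagline for a coffee shop.",
    OpenAiChatOptions.builder().withTemperature(0.9f).build());
String tagline = chatClient.call(creative).getResult().getOutput().getContent();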
10.3 Streaming Responses
Streaming sends responses as they're generated, rather than waiting for the complete response before sending it to the client.
10.3.1 Benefits of Streaming
- Faster Perceived Response: Users see text immediately, improving perceived performance
- Better User Experience: Feels more interactive and responsive, similar to ChatGPT
- Lower Latency: Don't wait for complete response before starting to display content
- Handles Long Responses: Can handle responses that exceed token limits by streaming chunks
10.3.2 Implementation in Spring AI
Spring AI supports streaming through reactive streams (Flux):
Flux<ChatResponse> stream = chatClient.stream(prompt);
// Emits partial ChatResponse chunks as the model generates them
This is particularly useful for chat interfaces where users expect to see responses appear in real-time.
11. Production Considerations
11.1 Cost Management
- Monitor token usage and implement budgets
- Use smaller models for simple tasks
- Cache responses when appropriate
- Batch requests when possible
11.2 Security
- Store API keys securely (use environment variables or secret management)
- Validate and sanitize all inputs
- Implement rate limiting to prevent abuse
- Audit AI interactions for compliance
- Consider data privacy when sending sensitive information to external providers
11.3 Performance
- Use streaming for long responses
- Implement async processing for non-interactive tasks
- Optimize RAG retrieval with proper chunking and indexing
- Use connection pooling for vector databases