Noear Solon (OpenSolon) | rag - Loading and splitting documents (Document)

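The most important job in RAG is retrieval: finding documents (Document) in a knowledge base (Repository).

1. Concepts

The content being retrieved is called a document (Document). The object that provides the document retrieval service is the knowledge base (Repository). Knowledge bases fall into two kinds:

Read-only knowledge base: used for retrieval only
Storable knowledge base: supports retrieval and also provides storage management

A storable knowledge base also relies on two important tools:

DocumentLoader, the document loader
DocumentSplitter, the document splitter

For example, a PDF file is loaded through a DocumentLoader, which converts it into a list of Document objects that are then saved into a Repository. When a Document is too large, a DocumentSplitter may also be needed to split it into several smaller Documents.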
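
The load, split, and store flow described above can be sketched in plain Java. The types below (`MiniDoc`, `MiniRepository`, `MiniStorableRepository`, `InMemoryRepository`) are simplified stand-ins invented for this illustration, not the Solon interfaces, and the substring match is an assumption standing in for the embedding-based search a real knowledge base would typically perform.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document: just a piece of text.
class MiniDoc {
    final String content;

    MiniDoc(String content) {
        this.content = content;
    }
}

// Hypothetical read-only knowledge base: retrieval only.
interface MiniRepository {
    List<MiniDoc> search(String query);
}

// Hypothetical storable knowledge base: retrieval plus storage management.
interface MiniStorableRepository extends MiniRepository {
    void save(List<MiniDoc> documents);
}

// In-memory implementation that matches documents by case-insensitive substring
// (an assumption; a real repository would usually compare embeddings instead).
class InMemoryRepository implements MiniStorableRepository {
    private final List<MiniDoc> store = new ArrayList<>();

    @Override
    public void save(List<MiniDoc> documents) {
        store.addAll(documents);
    }

    @Override
    public List<MiniDoc> search(String query) {
        List<MiniDoc> hits = new ArrayList<>();
        for (MiniDoc doc : store) {
            if (doc.content.toLowerCase().contains(query.toLowerCase())) {
                hits.add(doc);
            }
        }
        return hits;
    }
}

class RepositoryFlowSketch {
    // Takes a raw text, splits it on blank lines, and stores the pieces.
    static InMemoryRepository build(String rawText) {
        List<MiniDoc> pieces = new ArrayList<>();
        for (String part : rawText.split("\\n\\s*\\n")) {
            if (!part.trim().isEmpty()) {
                pieces.add(new MiniDoc(part.trim()));
            }
        }
        InMemoryRepository repository = new InMemoryRepository();
        repository.save(pieces);
        return repository;
    }

    public static void main(String[] args) {
        InMemoryRepository repository =
                build("Solon is a Java framework.\n\nRAG retrieves documents.");
        for (MiniDoc doc : repository.search("rag")) {
            System.out.println(doc.content);
        }
    }
}
```

The split between `MiniRepository` and `MiniStorableRepository` mirrors the read-only versus storable distinction: code that only retrieves can depend on the narrower interface.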

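Interface	Description
Document	Document
DocumentLoader	Document loader
DocumentSplitter	Document splitter

Repository	Knowledge base
RepositoryStorable	Storable knowledge base

EmbeddingModel	Embedding model (used by the knowledge base when storing or retrieving)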

2. DocumentLoader and its built-in adapters

Base interface

public interface DocumentLoader {
    // Attach one additional metadata entry
    DocumentLoader additionalMetadata(String key, Object value);

    // Attach a map of additional metadata entries
    DocumentLoader additionalMetadata(Map<String, Object> metadata);

    // Load the documents
    List<Document> load() throws IOException;
}
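
To illustrate how a custom loader might follow this contract, here is a minimal sketch. `SketchDocument` and `SketchLoader` are hypothetical stand-ins that only mirror the shape of the interface above so the example compiles on its own; `LineLoader` and `LineLoaderDemo` are likewise invented for illustration and are not part of any Solon plugin.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Document: text content plus metadata.
class SketchDocument {
    final String content;
    final Map<String, Object> metadata = new HashMap<>();

    SketchDocument(String content) {
        this.content = content;
    }
}

// Stand-in mirroring the shape of the DocumentLoader interface shown above.
interface SketchLoader {
    SketchLoader additionalMetadata(String key, Object value);

    SketchLoader additionalMetadata(Map<String, Object> metadata);

    List<SketchDocument> load() throws IOException;
}

// Hypothetical loader: each non-blank line of the source text becomes one document.
class LineLoader implements SketchLoader {
    private final String source;
    private final Map<String, Object> extraMetadata = new HashMap<>();

    LineLoader(String source) {
        this.source = source;
    }

    @Override
    public SketchLoader additionalMetadata(String key, Object value) {
        extraMetadata.put(key, value);
        return this; // returning this allows chained calls
    }

    @Override
    public SketchLoader additionalMetadata(Map<String, Object> metadata) {
        extraMetadata.putAll(metadata);
        return this;
    }

    @Override
    public List<SketchDocument> load() throws IOException {
        List<SketchDocument> documents = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                SketchDocument document = new SketchDocument(line.trim());
                document.metadata.putAll(extraMetadata); // copy the extra metadata onto each document
                documents.add(document);
            }
        }
        return documents;
    }
}

class LineLoaderDemo {
    // Loads two lines and tags each resulting document with a "source" entry.
    static List<SketchDocument> run() {
        try {
            return new LineLoader("first\n\nsecond\n")
                    .additionalMetadata("source", "demo")
                    .load();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning the loader itself from `additionalMetadata` is what makes the chained call in `LineLoaderDemo.run()` possible.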

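Built-in adapters (customize as your business requires)

Loader	Plugin	Description
TextLoader	solon-ai	Plain-text loader
ExcelLoader	solon-ai-load-excel	Excel file loader
HtmlSimpleLoader	solon-ai-load-html	Simple HTML loader (usually best customized to the characteristics of the target web pages)
MarkdownLoader	solon-ai-load-markdown	Markdown file loader
PdfLoader	solon-ai-load-pdf	PDF file loader
PptLoader	solon-ai-load-ppt	PPT file loader
WordLoader	solon-ai-load-word	Word file loader

Using documents

After loading, documents are usually saved into a document knowledge base and retrieved from there later. They can also be used directly: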

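WordLoader loader = new WordLoader(new File("demo.docx")); // or .doc
List<Document> docs = loader.load();

// Use the documents as context for the user message
ChatMessage message = ChatMessage.ofUserAugment("Is there an introduction to the Noear company here? If so, extract it.", docs);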
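
Conceptually, augmenting a user message means placing the document text next to the question so the model can answer from it. The sketch below shows one possible way to do that with plain strings. The `AugmentSketch` class and its prompt layout are assumptions made for illustration; the actual format produced by `ChatMessage.ofUserAugment` may differ.

```java
import java.util.Arrays;
import java.util.List;

class AugmentSketch {
    // Builds one prompt string: the question first, then the numbered reference texts.
    static String augment(String question, List<String> references) {
        StringBuilder prompt = new StringBuilder();
        prompt.append(question).append("\n\nReferences:\n");
        for (int i = 0; i < references.size(); i++) {
            prompt.append("[").append(i + 1).append("] ")
                  .append(references.get(i)).append("\n");
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        List<String> references = Arrays.asList("Sample paragraph one.", "Sample paragraph two.");
        System.out.println(augment("Which paragraph mentions the company?", references));
    }
}
```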

3. DocumentSplitter and its built-in adapters

Base interface

public interface DocumentSplitter {
    // Split a raw text (wraps it in a single Document first)
    default List<Document> split(String text) {
        return split(Arrays.asList(new Document(text)));
    }

    // Split a list of documents
    List<Document> split(List<Document> documents);
}
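
As an illustration of implementing this contract, the sketch below splits text into fixed-size character chunks. `TextDoc` and `SketchSplitter` are hypothetical stand-ins that copy only the shape of the interface above, and `FixedSizeSplitter` is an invented example, not one of the built-in splitters. Counting characters rather than tokens is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Document: text content only.
class TextDoc {
    final String content;

    TextDoc(String content) {
        this.content = content;
    }
}

// Stand-in mirroring the shape of the DocumentSplitter interface shown above.
interface SketchSplitter {
    // Convenience overload: wrap the text in one document, then split it.
    default List<TextDoc> split(String text) {
        return split(Collections.singletonList(new TextDoc(text)));
    }

    List<TextDoc> split(List<TextDoc> documents);
}

// Hypothetical splitter: cuts every document into chunks of at most maxChars characters.
class FixedSizeSplitter implements SketchSplitter {
    private final int maxChars;

    FixedSizeSplitter(int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
    }

    @Override
    public List<TextDoc> split(List<TextDoc> documents) {
        List<TextDoc> chunks = new ArrayList<>();
        for (TextDoc document : documents) {
            String text = document.content;
            // Walk the text in steps of maxChars; the last chunk may be shorter.
            for (int start = 0; start < text.length(); start += maxChars) {
                int end = Math.min(start + maxChars, text.length());
                chunks.add(new TextDoc(text.substring(start, end)));
            }
        }
        return chunks;
    }
}
```

Because only the list-based method is abstract, an implementation gets the string-based overload for free, for example `new FixedSizeSplitter(4).split("abcdefghij")`.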

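Built-in adapters (customize as your business requires)

Splitter	Plugin	Description
JsonSplitter	solon-ai	Splits by JSON structure (turns a JSON array into multiple documents)
RegexTextSplitter	solon-ai	Splits content by a regular expression
TokenSizeTextSplitter	solon-ai	Splits content by a token size limit

SplitterPipeline	solon-ai	Splitter pipeline (chains a batch of splitters together)

Usage example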

private void load(RepositoryStorable repository, String file) throws IOException {
    // Loader
    HtmlSimpleLoader loader = new HtmlSimpleLoader(new File(file));

    // Split after loading (as needed)
    List<Document> documents = new SplitterPipeline()
            .next(new RegexTextSplitter("\n\n"))
            .next(new TokenSizeTextSplitter(500))
            .split(loader.load());

    // Save into the knowledge base
    repository.save(documents);
}
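
To make the chaining idea concrete, here is a self-contained sketch of a two-stage pipeline working on plain strings. `StageSplitter`, `RegexStage`, `WordLimitStage`, and `PipelineSketch` are hypothetical names invented for this illustration. They are not the Solon classes, and counting whitespace-separated words is only a crude stand-in for token counting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: a splitter maps a list of texts to a list of smaller texts.
interface StageSplitter {
    List<String> split(List<String> texts);
}

// Splits each text on a regular expression and drops empty pieces.
class RegexStage implements StageSplitter {
    private final String regex;

    RegexStage(String regex) {
        this.regex = regex;
    }

    @Override
    public List<String> split(List<String> texts) {
        List<String> out = new ArrayList<>();
        for (String text : texts) {
            for (String piece : text.split(regex)) {
                if (!piece.trim().isEmpty()) {
                    out.add(piece.trim());
                }
            }
        }
        return out;
    }
}

// Cuts each text into pieces of at most maxWords whitespace-separated words.
class WordLimitStage implements StageSplitter {
    private final int maxWords;

    WordLimitStage(int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        this.maxWords = maxWords;
    }

    @Override
    public List<String> split(List<String> texts) {
        List<String> out = new ArrayList<>();
        for (String text : texts) {
            if (text.trim().isEmpty()) {
                continue; // nothing to split
            }
            String[] words = text.trim().split("\\s+");
            for (int start = 0; start < words.length; start += maxWords) {
                int end = Math.min(start + maxWords, words.length);
                out.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            }
        }
        return out;
    }
}

// Runs the stages in order, feeding each stage's output into the next one.
class PipelineSketch implements StageSplitter {
    private final List<StageSplitter> stages = new ArrayList<>();

    PipelineSketch next(StageSplitter stage) {
        stages.add(stage);
        return this;
    }

    @Override
    public List<String> split(List<String> texts) {
        List<String> current = texts;
        for (StageSplitter stage : stages) {
            current = stage.split(current);
        }
        return current;
    }
}
```

Ordering the coarse, structure-aware stage before the size-limiting stage keeps paragraph boundaries intact wherever a paragraph already fits within the limit.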

Solon v4.0.3
