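Chroma: Open-Source Vector Database

Project Overview

Chroma is an open-source embedded vector database designed for AI applications. Its API is concise and easy to use, it integrates readily with frameworks such as LangChain and LlamaIndex, and it is a popular choice for RAG applications.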

GitHub Stars: 26K+

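Key Features

Easy to use - native Python API
Embedded - no separate server required
Metadata filtering - combine vector search with attribute queries
Multiple embeddings - OpenAI, HuggingFace, local models
Persistence - supports local storage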

Installation

pip install chromadb

Quick Start

Basic Usage

import chromadb

# Create an in-memory client
client = chromadb.Client()

# Create a collection
collection = client.create_collection("security_docs")

# Add documents
collection.add(
    documents=[
        "SQL injection is a code injection technique",
        "XSS allows attackers to inject scripts",
        "CSRF forces users to execute unwanted actions"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["What is injection attack?"],
    n_results=2
)

print(results["documents"])

Persistence

Local Storage

import chromadb

# Persistent client
client = chromadb.PersistentClient(path="./chroma_db")

# Collections are persisted automatically
collection = client.get_or_create_collection("my_docs")

Client/Server Mode

# Start the server
chroma run --path ./chroma_data --port 8000

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)

Collection Operations

Creating a Collection

# Basic creation
collection = client.create_collection("docs")

# Specify an embedding function
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

embedding_fn = OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

# Use a different name here: create_collection raises an error
# if a collection with the same name already exists
collection = client.create_collection(
    name="docs_openai",
    embedding_function=embedding_fn
)

Managing Collections

# Get an existing collection
collection = client.get_collection("docs")

# Get or create
collection = client.get_or_create_collection("docs")

# Delete a collection
client.delete_collection("docs")

# List all collections
collections = client.list_collections()

Document Operations

Adding Documents

# Plain text (embeddings are generated automatically)
collection.add(
    documents=["Document 1", "Document 2"],
    ids=["id1", "id2"]
)

# With metadata
collection.add(
    documents=["SQL Injection guide", "XSS prevention"],
    metadatas=[
        {"category": "injection", "severity": "high"},
        {"category": "xss", "severity": "medium"}
    ],
    ids=["doc1", "doc2"]
)

# Provide embeddings directly (vectors are abbreviated here)
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)
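
The calls above pass parallel lists: the entry at each position in ids lines up with the entry at the same position in documents, metadatas, or embeddings. A minimal pre-flight check, sketched here in plain Python, can catch mismatched lengths or duplicate IDs before calling collection.add(). The helper name check_add_args is hypothetical and not part of chromadb.

```python
def check_add_args(ids, documents=None, metadatas=None, embeddings=None):
    """Return a list of problems found in the arguments (empty list = OK).

    Hypothetical helper for illustration; not part of chromadb.
    """
    problems = []
    # Each record needs its own ID.
    if len(set(ids)) != len(ids):
        problems.append("ids contains duplicates")
    # Every list that is supplied must line up one-to-one with ids.
    for name, values in (("documents", documents),
                         ("metadatas", metadatas),
                         ("embeddings", embeddings)):
        if values is not None and len(values) != len(ids):
            problems.append(f"{name} has {len(values)} items, expected {len(ids)}")
    return problems


ok = check_add_args(
    ids=["doc1", "doc2"],
    documents=["SQL Injection guide", "XSS prevention"],
    metadatas=[{"category": "injection"}, {"category": "xss"}],
)
bad = check_add_args(ids=["id1", "id1"], documents=["only one"])
print(ok)
print(bad)
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a large batch in a single pass.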

Updating Documents

collection.update(
    ids=["doc1"],
    documents=["Updated content"],
    metadatas=[{"category": "updated"}]
)

Deleting Documents

# Delete by ID
collection.delete(ids=["doc1", "doc2"])

# Delete by condition
collection.delete(where={"category": "deprecated"})

Querying

Similarity Search

results = collection.query(
    query_texts=["security vulnerability"],
    n_results=5
)

# The result contains
print(results["ids"])        # document IDs
print(results["documents"])  # document contents
print(results["distances"])  # distance scores
print(results["metadatas"])  # metadata
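
Each of these result fields is a list of lists: the outer list has one entry per query text, and the inner list holds the matches for that query. The sketch below zips the parallel fields into one row per match. The helper name rows_for_query is hypothetical, and the sample dictionary is a hand-written placeholder shaped like a query result, not real output.

```python
def rows_for_query(results, query_index=0):
    """Zip the parallel result lists for one query into a list of dicts."""
    return [
        {"id": i, "document": doc, "distance": dist, "metadata": meta}
        for i, doc, dist, meta in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
        )
    ]


# Hand-written placeholder shaped like a query() result for one query text;
# the values are made up for illustration.
sample = {
    "ids": [["doc1", "doc2"]],
    "documents": [["SQL Injection guide", "XSS prevention"]],
    "distances": [[0.25, 0.5]],
    "metadatas": [[{"severity": "high"}, {"severity": "medium"}]],
}

rows = rows_for_query(sample)
for row in rows:
    print(row["id"], row["distance"], row["document"])
```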

Metadata Filtering

# Where filter
results = collection.query(
    query_texts=["injection attack"],
    where={"severity": "high"},
    n_results=5
)

# Compound conditions
results = collection.query(
    query_texts=["attack"],
    where={
        "$and": [
            {"category": {"$eq": "injection"}},
            {"severity": {"$in": ["high", "critical"]}}
        ]
    },
    n_results=10
)

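Where Operators

Operator	Description
`$eq`	Equal to
`$ne`	Not equal to
`$gt`	Greater than
`$gte`	Greater than or equal to
`$lt`	Less than
`$lte`	Less than or equal to
`$in`	In the list
`$nin`	Not in the list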
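
To make the operator semantics concrete, here is a small plain-Python sketch that evaluates a where-style filter against a single metadata dictionary. It illustrates the meaning of the operators in the table plus `$and`; it is not how Chroma implements filtering. The function name matches is hypothetical, and the handling of missing keys is an assumption of this sketch.

```python
import operator

# Comparison operators from the table, mapped to Python equivalents.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter in the list must match.
            if not all(matches(metadata, sub) for sub in condition):
                return False
            continue
        # Assumption of this sketch: a missing key never matches.
        if key not in metadata:
            return False
        # A bare value such as {"severity": "high"} is shorthand for $eq.
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op, expected in condition.items():
            if not COMPARATORS[op](metadata[key], expected):
                return False
    return True


where = {
    "$and": [
        {"category": {"$eq": "injection"}},
        {"severity": {"$in": ["high", "critical"]}},
    ]
}
print(matches({"category": "injection", "severity": "high"}, where))
print(matches({"category": "xss", "severity": "medium"}, where))
```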

Document Filtering

results = collection.query(
    query_texts=["security"],
    where_document={"$contains": "SQL"}
)

Embedding Functions

OpenAI

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

ef = OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

HuggingFace

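from chromadb.utils.embedding_functions import HuggingFaceEmbeddingFunction

# This class calls the Hugging Face Inference API, so it needs an API token
ef = HuggingFaceEmbeddingFunction(
    api_key="your-hf-token",
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)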

Local Models

from chromadb.utils.embedding_functions import OllamaEmbeddingFunction

ef = OllamaEmbeddingFunction(
    url="http://localhost:11434",
    model_name="nomic-embed-text"
)
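
The embedding functions above all play the same role: given a list of texts, they return one fixed-length vector per text. The toy class below shows only that shape, using a hashed bag-of-words in place of a real model. ToyHashEmbedding is a hypothetical name, it is not part of chromadb, its vectors carry no semantic meaning, and Chroma's own embedding function interface may require more than a callable like this.

```python
import hashlib
import math


class ToyHashEmbedding:
    """Toy stand-in for an embedding function: list of texts -> list of vectors."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        vectors = []
        for text in input:
            vec = [0.0] * self.dim
            # Count each lowercased token in a bucket chosen by a stable hash.
            for token in text.lower().split():
                digest = hashlib.md5(token.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0
            # Normalize to unit length; fall back to 1.0 for empty text.
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors


toy_ef = ToyHashEmbedding(dim=8)
vectors = toy_ef(["SQL injection", "sql INJECTION", "XSS"])
print(len(vectors), len(vectors[0]))
```

hashlib is used instead of the built-in hash() because string hashing in Python is randomized per process, which would make the vectors differ between runs.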

LangChain Integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

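# Create from documents
# (docs is assumed to be a list of LangChain Document objects prepared elsewhere)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Load an existing store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Similarity search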
results = vectorstore.similarity_search("query", k=3)

Docker Deployment

version: '3.8'
services:
  chroma:
    image: chromadb/chroma
    ports:
      - "8000:8000"
    volumes:
      - ./chroma_data:/chroma/chroma
    environment:
      - CHROMA_SERVER_AUTHN_CREDENTIALS=admin:password

Performance Optimization

Batch Operations

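# Batch add: collection.add() has no batch_size parameter,
# so split a large list into chunks manually
batch_size = 1000
for start in range(0, len(documents), batch_size):
    end = start + batch_size
    collection.add(
        documents=documents[start:end],  # one chunk of a large document list
        ids=ids[start:end]
    )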

Index Settings

collection = client.create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
)
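
The hnsw:space setting selects the distance function used to compare vectors. As Chroma's documentation describes them, l2 is the squared Euclidean distance, ip is one minus the inner product, and cosine is one minus the cosine similarity. The plain-Python sketch below computes all three on small vectors so the differences are visible. The function names are illustrative, and the exact formulas are worth confirming against the documentation for your Chroma version.

```python
import math


def l2_distance(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """One minus the inner product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
c = [2.0, 0.0]  # same direction as a, twice the length

for name, fn in (("l2", l2_distance), ("ip", ip_distance), ("cosine", cosine_distance)):
    print(name, fn(a, b), fn(a, c))
```

Cosine distance ignores vector length, so a and c are at distance zero, while l2 and ip both change with magnitude. That is why cosine is a common choice when embeddings are not normalized.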
