API Reference

Complete API documentation for all DeepTextSearch classes, methods, and parameters.

`TextEmbedder`

The core class for embedding text and managing vector indexes.

Constructor

TextEmbedder(
    model_name: str = "BAAI/bge-m3",
    vector_store: Union[str, BaseVectorStore] = "faiss",
    store_config: Optional[dict] = None,
    index_dir: str = ".deeptextsearch",
    index_type: str = "flat",
    device: Optional[str] = None,
    batch_size: int = 64,
    normalize: bool = True,
)

Parameters:

Parameter	Type	Default	Description
`model_name`	`str`	`"BAAI/bge-m3"`	Any sentence-transformers HuggingFace model name, preset key, or local path to a saved model directory.
`vector_store`	`str` or `BaseVectorStore`	`"faiss"`	Vector store backend. String options: `"faiss"`, `"chroma"`, `"qdrant"`, `"postgres"`, `"mongo"`. Or pass a custom `BaseVectorStore` instance.
`store_config`	`dict` or `None`	`None`	Backend-specific configuration dict. Keys depend on the chosen backend (see Vector Stores).
`index_dir`	`str`	`".deeptextsearch"`	Directory path for saving and loading index data. Used as the default persistence location.
`index_type`	`str`	`"flat"`	FAISS index type: `"flat"` (exact), `"ivf"` (approximate), `"hnsw"` (approximate). Only used when `vector_store="faiss"`.
`device`	`str` or `None`	`None`	Inference device: `"cpu"`, `"cuda"`, `"mps"`. `None` auto-detects (CUDA > MPS > CPU).
`batch_size`	`int`	`64`	Batch size for encoding text. Larger batches encode faster but use more memory.
`normalize`	`bool`	`True`	Whether to L2-normalize embeddings. Required for cosine similarity when using a FAISS inner-product index.
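The `normalize` note can be checked numerically: once two vectors are scaled to unit L2 length, their inner product equals their cosine similarity. The sketch below uses only the standard library; the helper names are illustrative and are not part of DeepTextSearch.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the normalized vectors versus cosine of the raw vectors.
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)
```

Both values agree up to floating-point error, which is why an inner-product index over normalized embeddings ranks results by cosine similarity.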

store_config keys by backend:

Backend	Key	Type	Default	Description
`faiss`	`index_type`	`str`	`"flat"`	Override FAISS index type
`chroma`	`collection_name`	`str`	`"deep_text_search"`	ChromaDB collection name
`chroma`	`persist_directory`	`str`	`index_dir`	Storage directory
`qdrant`	`collection_name`	`str`	`"deep_text_search"`	Qdrant collection name
`qdrant`	`location`	`str`	`None`	Remote server URL (e.g. `"http://localhost:6333"`)
`qdrant`	`path`	`str`	`index_dir`	Local persistent storage path
`postgres`	`connection_string`	`str`	`"postgresql://localhost:5432/deeptextsearch"`	PostgreSQL connection string
`postgres`	`table_name`	`str`	`"text_vectors"`	Table name
`postgres`	`metadata_schema`	`dict`	`None`	`{"column_name": "SQL_TYPE"}` for dedicated indexed columns
`mongo`	`connection_string`	`str`	`"mongodb://localhost:27017"`	MongoDB connection string
`mongo`	`database_name`	`str`	`"deeptextsearch"`	Database name
`mongo`	`collection_name`	`str`	`"text_vectors"`	Collection name
`mongo`	`index_name`	`str`	`"vector_index"`	Atlas Vector Search index name
`mongo`	`metadata_fields`	`list[str]`	`None`	Fields to store as top-level document fields
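As an illustration, the dicts below use only keys from the table above. The values (paths, collection names, SQL column types) are placeholders chosen for the example, not recommended settings. Each dict would be passed as `store_config` together with the matching `vector_store` string.

```python
# Example store_config dicts, one per backend. Keys come from the table above;
# all values are placeholder assumptions.
store_configs = {
    "faiss": {"index_type": "hnsw"},
    "chroma": {"collection_name": "articles", "persist_directory": "./chroma_data"},
    "qdrant": {"collection_name": "articles", "location": "http://localhost:6333"},
    "postgres": {
        "connection_string": "postgresql://localhost:5432/deeptextsearch",
        "table_name": "text_vectors",
        "metadata_schema": {"category": "TEXT", "year": "INTEGER"},
    },
    "mongo": {
        "connection_string": "mongodb://localhost:27017",
        "database_name": "deeptextsearch",
        "collection_name": "text_vectors",
        "index_name": "vector_index",
        "metadata_fields": ["category", "year"],
    },
}

# Intended use (not executed here, since it needs the library and a running backend):
# embedder = TextEmbedder(vector_store="qdrant", store_config=store_configs["qdrant"])
```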

Properties

Property	Type	Description
`.corpus_size`	`int`	Number of indexed documents.
`.dimension`	`int`	Embedding vector dimension.
`.model_name`	`str`	Resolved HuggingFace model name.
`.device`	`str`	Active inference device.

Methods

`.index(corpus, text_column=None, metadata_columns=None) → TextEmbedder`

Embed and index an entire text corpus. Returns self for method chaining.

Parameter	Type	Description
`corpus`	`list[str]`, `pd.Series`, `pd.DataFrame`	The text corpus to index.
`text_column`	`str` or `None`	Required when `corpus` is a DataFrame. Column containing text.
`metadata_columns`	`list[str]` or `None`	DataFrame columns to store as metadata (DataFrame only).

embedder.index(["text 1", "text 2"])
embedder.index(df, text_column="content", metadata_columns=["title", "date"])

`.add(texts, metadata=None) → TextEmbedder`

Add texts to an existing index incrementally. Returns self for chaining.

Parameter	Type	Description
`texts`	`str` or `list[str]`	Text(s) to add.
`metadata`	`dict`, `list[dict]`, or `None`	Metadata for the new texts.

embedder.add("New document.")
embedder.add(["Doc A", "Doc B"], metadata=[{"src": "web"}, {"src": "book"}])

`.delete(indices) → None`

Delete documents from the index by their corpus indices.

Parameter	Type	Description
`indices`	`list[int]`	Corpus indices to delete.

embedder.delete([0, 5, 12])

`.encode(texts) → np.ndarray`

Encode text(s) into embedding vectors without adding to the index.

Parameter	Type	Description
`texts`	`str` or `list[str]`	Text(s) to encode.

Returns: np.ndarray of shape (N, D), where N is the number of input texts and D is the embedding dimension.

vectors = embedder.encode("Hello world")  # shape: (1, 1024)
vectors = embedder.encode(["A", "B"])     # shape: (2, 1024)

`.save(index_dir=None) → None`

Save index, corpus, and configuration to disk.

Parameter	Type	Description
`index_dir`	`str` or `None`	Override save directory. Uses `self.index_dir` if `None`.

embedder.save("/path/to/my_index")

`.load(index_dir, device=None) → TextEmbedder` (classmethod)

Load a previously saved index from disk.

Parameter	Type	Description
`index_dir`	`str`	Directory containing saved index files.
`device`	`str` or `None`	Override device for inference.

Returns: TextEmbedder instance with loaded index.

embedder = TextEmbedder.load("/path/to/my_index")
embedder = TextEmbedder.load("/path/to/my_index", device="cuda")

`.from_csv(file_path, text_column, ...) → TextEmbedder` (classmethod)

Load a CSV file, embed its text column, and return a ready-to-search instance.

Parameter	Type	Description
`file_path`	`str`	Path to CSV file.
`text_column`	`str`	Column containing text.
`metadata_columns`	`list[str]` or `None`	Columns to store as metadata.
`model_name`	`str`	Embedding model name.
`**kwargs`		Additional arguments passed to the constructor.

embedder = TextEmbedder.from_csv("data.csv", text_column="content")

`TextSearch`

Hybrid search engine combining dense and BM25 retrieval.

Constructor

TextSearch(
    embedder: TextEmbedder,
    mode: str = "hybrid",
    bm25_weight: float = 0.4,
    dense_weight: float = 0.6,
    rrf_k: int = 60,
)

Parameter	Type	Default	Description
`embedder`	`TextEmbedder`	—	A TextEmbedder with an indexed corpus.
`mode`	`str`	`"hybrid"`	Default search mode: `"hybrid"`, `"dense"`, or `"bm25"`.
`bm25_weight`	`float`	`0.4`	Weight applied to the BM25 ranking in hybrid RRF fusion.
`dense_weight`	`float`	`0.6`	Weight applied to the dense ranking in hybrid RRF fusion.
`rrf_k`	`int`	`60`	Reciprocal Rank Fusion constant. Lower values give more weight to top ranks.
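This reference does not spell out the fusion formula. A common weighted form of Reciprocal Rank Fusion, which the three parameters above suggest, scores each document as the sum of `weight / (rrf_k + rank)` over the rankings it appears in. The sketch below assumes 1-based ranks; `weighted_rrf` is a hypothetical helper, not a library function.

```python
def weighted_rrf(dense_ranking, bm25_ranking, dense_weight=0.6, bm25_weight=0.4, rrf_k=60):
    """Fuse two ranked lists of document ids with weighted Reciprocal Rank Fusion.

    Each list is ordered best-first. A document's fused score is the sum of
    weight / (rrf_k + rank) over the lists it appears in (rank starts at 1).
    """
    scores = {}
    for weight, ranking in ((dense_weight, dense_ranking), (bm25_weight, bm25_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Document 0 is second in the dense list but first in the BM25 list.
fused = weighted_rrf(dense_ranking=[2, 0, 1], bm25_ranking=[0, 3, 2])
```

With a small `rrf_k` the gap between rank 1 and rank 2 grows, so top positions dominate the fused score. With a large `rrf_k` the contributions flatten out.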

Methods

`.search(query, top_n=10, mode=None, filter_fn=None, filters=None) → list[SearchResult]`

Search the indexed corpus.

Parameter	Type	Default	Description
`query`	`str`	—	Search query text.
`top_n`	`int`	`10`	Number of results to return.
`mode`	`str` or `None`	`None`	Override default mode for this query.
`filter_fn`	`callable` or `None`	`None`	Function `(text: str, metadata: dict) → bool`. Returns `True` to include.
`filters`	`dict` or `None`	`None`	Metadata filters passed directly to the vector store.

Returns: list[SearchResult] sorted by relevance.

results = search.search("machine learning", top_n=5)
results = search.search("exact term", mode="bm25")
results = search.search("AI", filters={"category": "tech"})
results = search.search("AI", filter_fn=lambda t, m: m.get("year", 0) >= 2023)
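The `filter_fn` contract can be illustrated without the library: the predicate receives a candidate's text and metadata and keeps it when the return value is truthy. The candidate list here is made-up data. Whether the library applies the predicate before or after ranking is not specified in this reference.

```python
def recent_only(text, metadata):
    """Keep documents from 2023 onward; a missing year is treated as 0."""
    return metadata.get("year", 0) >= 2023

# Made-up (text, metadata) pairs standing in for search candidates.
candidates = [
    ("Transformers overview", {"year": 2024}),
    ("Classic IR methods", {"year": 2009}),
    ("Undated note", {}),
]

kept = [text for text, metadata in candidates if recent_only(text, metadata)]
```

Using `metadata.get("year", 0)` rather than `metadata["year"]` keeps the predicate from raising on documents that lack the field.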

`SearchResult`

A single search result returned by TextSearch.search().

Attributes

Attribute	Type	Description
`.index`	`int`	Position in the original corpus.
`.text`	`str`	The matched text.
`.score`	`float`	Relevance score.
`.metadata`	`dict`	Associated metadata.

Methods

`.to_dict() → dict`

Convert to a plain dictionary.

{"index": 0, "text": "...", "score": 0.9523, "metadata": {"category": "AI"}}
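For readers who want to prototype against this shape without installing the library, the stand-in below mirrors the four documented attributes and the `to_dict()` output. `SearchResultSketch` is a hypothetical name; the real class may be implemented differently.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class SearchResultSketch:
    """Stand-in with the four documented attributes; not the library class."""
    index: int
    text: str
    score: float
    metadata: dict = field(default_factory=dict)

    def to_dict(self):
        """Return a plain dict with the same keys as the documented output."""
        return asdict(self)

hit = SearchResultSketch(index=0, text="...", score=0.9523, metadata={"category": "AI"})
as_dict = hit.to_dict()
```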

`Reranker`

Cross-encoder reranker for search result refinement.

Constructor

Reranker(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    max_length: int = 512,
    device: Optional[str] = None,
    batch_size: int = 64,
)

Parameter	Type	Default	Description
`model_name`	`str`	`"cross-encoder/ms-marco-MiniLM-L-6-v2"`	Any HuggingFace cross-encoder model or local path.
`max_length`	`int`	`512`	Maximum token length for query-passage pairs.
`device`	`str` or `None`	`None`	Inference device. `None` auto-detects.
`batch_size`	`int`	`64`	Batch size for scoring.

Methods

`.rerank(request, top_n=None) → list[dict]`

Rerank passages by relevance to a query.

Parameter	Type	Description
`request`	`RerankRequest`	Query and passages to rerank.
`top_n`	`int` or `None`	Return only top N. `None` returns all.

Returns: List of passage dicts sorted by relevance, with "score" added. All original fields are preserved.

`.rerank_texts(query, texts, top_n=None) → list[dict]`

Rerank plain text strings.

Parameter	Type	Description
`query`	`str`	Search query.
`texts`	`list[str]`	Texts to rerank.
`top_n`	`int` or `None`	Return only top N.

Returns: [{"text": "...", "score": 0.95}, ...]

`.rerank_search_results(query, search_results, top_n=None) → list[dict]`

Rerank SearchResult objects from TextSearch.search().

Parameter	Type	Description
`query`	`str`	Search query.
`search_results`	`list[SearchResult]`	Results from `TextSearch.search()`.
`top_n`	`int` or `None`	Return only top N.

Returns: [{"text": "...", "score": 0.95, "index": 3, "metadata": {...}}, ...]

`RerankRequest`

Container for a reranking request.

RerankRequest(query: str, passages: list[dict])

Parameter	Type	Description
`query`	`str`	The search query.
`passages`	`list[dict]`	List of passage dicts. Each must have a `"text"` key. Any other fields (`"id"`, `"source"`, etc.) are allowed and are preserved in the output.
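The passage format and the documented output contract (a `"score"` key added, original fields kept, results sorted, optional `top_n` cut) can be mimicked with a toy scorer. The word-overlap score below is only a placeholder for a cross-encoder, and `rerank_sketch` is a hypothetical function, not the library's `Reranker.rerank`.

```python
def rerank_sketch(query, passages, top_n=None):
    """Mimic the documented output shape with a toy word-overlap scorer.

    A real cross-encoder scores each (query, passage) pair with a model;
    the overlap count here is only a placeholder.
    """
    query_words = set(query.lower().split())
    scored = []
    for passage in passages:
        overlap = len(query_words & set(passage["text"].lower().split()))
        # Copy the passage so the original fields survive and the input is untouched.
        scored.append({**passage, "score": float(overlap)})
    scored.sort(key=lambda p: p["score"], reverse=True)
    return scored if top_n is None else scored[:top_n]

passages = [
    {"id": "a", "text": "Cooking pasta at home", "source": "blog"},
    {"id": "b", "text": "Deep learning for text search", "source": "paper"},
]
ranked = rerank_sketch("text search with deep learning", passages, top_n=1)
```

Extra keys such as `"id"` and `"source"` travel with each passage, so reranked results can be mapped back to their origin.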

Vector Stores

All vector stores implement the BaseVectorStore abstract interface.

`BaseVectorStore` (Abstract)

class BaseVectorStore(ABC):
    def add(self, ids: list[str], vectors: np.ndarray, metadata: list[dict] = None) -> None: ...
    def search(self, query_vector: np.ndarray, k: int = 10, filters: dict = None) -> list[dict]: ...
    def delete(self, ids: list[str]) -> None: ...
    def count(self) -> int: ...
    def save(self, path: str) -> None: ...
    def load(self, path: str) -> None: ...

.search() return format:

[
    {"id": "0", "score": 0.95, "metadata": {"category": "AI"}},
    {"id": "1", "score": 0.87, "metadata": {"category": "ML"}},
]
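To make the interface concrete, here is a toy in-memory store that follows the method names and the `.search()` return format above. It is a sketch under stated assumptions: plain Python lists stand in for `np.ndarray`, the class does not subclass the real `BaseVectorStore`, scores are inner products, and `filters` is assumed to mean exact match on every key. A real custom backend would subclass `BaseVectorStore` from the library.

```python
import json

class InMemoryStore:
    """Toy store following the method names above; lists stand in for np.ndarray."""

    def __init__(self):
        self._vectors = {}   # id -> list[float]
        self._metadata = {}  # id -> dict

    def add(self, ids, vectors, metadata=None):
        metadata = metadata or [{} for _ in ids]
        for doc_id, vector, meta in zip(ids, vectors, metadata):
            self._vectors[doc_id] = [float(x) for x in vector]
            self._metadata[doc_id] = dict(meta)

    def search(self, query_vector, k=10, filters=None):
        hits = []
        for doc_id, vector in self._vectors.items():
            meta = self._metadata[doc_id]
            # Assumed filter semantics: every key must match exactly.
            if filters and any(meta.get(key) != value for key, value in filters.items()):
                continue
            score = sum(q * v for q, v in zip(query_vector, vector))  # inner product
            hits.append({"id": doc_id, "score": score, "metadata": meta})
        hits.sort(key=lambda hit: hit["score"], reverse=True)
        return hits[:k]

    def delete(self, ids):
        for doc_id in ids:
            self._vectors.pop(doc_id, None)
            self._metadata.pop(doc_id, None)

    def count(self):
        return len(self._vectors)

    def save(self, path):
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"vectors": self._vectors, "metadata": self._metadata}, handle)

    def load(self, path):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        self._vectors, self._metadata = data["vectors"], data["metadata"]

store = InMemoryStore()
store.add(["0", "1"], [[1.0, 0.0], [0.0, 1.0]], [{"category": "AI"}, {"category": "ML"}])
top = store.search([0.9, 0.1], k=1)
ml_only = store.search([0.9, 0.1], filters={"category": "ML"})
```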

`FAISSStore`

FAISSStore(dimension: int, index_type: str = "flat")

Parameter	Type	Default	Description
`dimension`	`int`	—	Vector dimension.
`index_type`	`str`	`"flat"`	`"flat"` (exact), `"ivf"` (approximate), `"hnsw"` (approximate).

`ChromaStore`

ChromaStore(collection_name: str = "deep_text_search", persist_directory: str = None)

Parameter	Type	Default	Description
`collection_name`	`str`	`"deep_text_search"`	ChromaDB collection name.
`persist_directory`	`str` or `None`	`None`	Directory for persistent storage. `None` for in-memory.

`QdrantStore`

QdrantStore(collection_name: str = "deep_text_search", location: str = None, path: str = None, dimension: int = 1024)

Parameter	Type	Default	Description
`collection_name`	`str`	`"deep_text_search"`	Qdrant collection name.
`location`	`str` or `None`	`None`	Remote server URL.
`path`	`str` or `None`	`None`	Local persistent storage path.
`dimension`	`int`	`1024`	Vector dimension.

`PostgresStore`

PostgresStore(connection_string: str, table_name: str = "text_vectors", dimension: int = 1024, metadata_schema: dict = None)

Parameter	Type	Default	Description
`connection_string`	`str`	`"postgresql://localhost:5432/deeptextsearch"`	PostgreSQL connection string.
`table_name`	`str`	`"text_vectors"`	Table name.
`dimension`	`int`	`1024`	Vector dimension.
`metadata_schema`	`dict` or `None`	`None`	`{"column_name": "SQL_TYPE"}` for dedicated indexed columns alongside JSONB metadata.

`MongoStore`

MongoStore(connection_string: str, database_name: str = "deeptextsearch", collection_name: str = "text_vectors", index_name: str = "vector_index", dimension: int = 1024, metadata_fields: list[str] = None)

Parameter	Type	Default	Description
`connection_string`	`str`	`"mongodb://localhost:27017"`	MongoDB connection string.
`database_name`	`str`	`"deeptextsearch"`	Database name.
`collection_name`	`str`	`"text_vectors"`	Collection name.
`index_name`	`str`	`"vector_index"`	Atlas Vector Search index name.
`dimension`	`int`	`1024`	Vector dimension.
`metadata_fields`	`list[str]` or `None`	`None`	Fields to store as top-level document fields for native MongoDB indexing.

Agent Tools

`TextSearchTool`

Generic callable tool for AI agent frameworks.

TextSearchTool(embedder: TextEmbedder, mode: str = "hybrid", reranker: Reranker = None)

Parameter	Type	Default	Description
`embedder`	`TextEmbedder`	—	Embedder with indexed corpus.
`mode`	`str`	`"hybrid"`	Default search mode.
`reranker`	`Reranker` or `None`	`None`	Optional reranker for result refinement.

Calling:

result_json = tool("query", k=5, mode="auto")  # Returns JSON string

Properties:

`.tool_definition` → `dict`: JSON schema compatible with OpenAI/Claude function calling.

`create_langchain_retriever()`

create_langchain_retriever(
    embedder: TextEmbedder,
    mode: str = "hybrid",
    reranker: Reranker = None,
    top_n: int = 5,
) → BaseRetriever

Returns a LangChain BaseRetriever instance. Requires langchain-core.

`create_llamaindex_retriever()`

create_llamaindex_retriever(
    embedder: TextEmbedder,
    mode: str = "hybrid",
    reranker: Reranker = None,
    top_n: int = 5,
) → BaseRetriever

Returns a LlamaIndex BaseRetriever instance. Requires llama-index-core.

Configuration

`EMBEDDING_PRESETS`

dict — Recommended embedding models with metadata.

from DeepTextSearch import EMBEDDING_PRESETS
for name, info in EMBEDDING_PRESETS.items():
    print(f"{name}: {info['dimensions']}d, ~{info['size_mb']} MB, {info['languages']}")

`RERANKER_PRESETS`

dict — Recommended reranker models with metadata.

from DeepTextSearch import RERANKER_PRESETS
for name, info in RERANKER_PRESETS.items():
    print(f"{name}: ~{info['size_mb']} MB, {info['languages']}")

`get_device(device=None) → str`

Auto-detect best available device. Returns "cuda", "mps", or "cpu".

from DeepTextSearch.config import get_device
device = get_device()       # auto-detect
device = get_device("cpu")  # force CPU
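The CUDA > MPS > CPU priority described above can be sketched with PyTorch's availability checks. This is an assumption about how such detection is typically written, not the library's source; `detect_device` is a hypothetical name, and the sketch falls back to `"cpu"` when PyTorch is not installed.

```python
def detect_device(device=None):
    """Return the requested device, or pick CUDA > MPS > CPU when none is given."""
    if device is not None:
        return device
    try:
        import torch
    except ImportError:
        return "cpu"  # without PyTorch, only CPU can be reported
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists only in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

chosen = detect_device()       # auto-detect
forced = detect_device("cpu")  # explicit request is returned unchanged
```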
