Hybrid Vector Search Architecture for a Million-Track Local Music Library

When building professional-grade local libraries of music and sound materials, traditional search architectures run into a hard limit. When a user types “epic massive momentum” or “tranquil piano” into the search box, a tokenized keyword search (such as SQLite FTS5) can only match tracks whose fields contain those exact words. As a result, a large number of richly tagged audio files that use different terminology never surface.

To solve this pain point, our latest low-level refactoring moved search to a Hybrid Search architecture that combines the LanceDB vector database with the BGE-M3 embedding model running on-device.

I. Breaking the Limits of SQLite FTS5

In previous versions of the AI Music Organizer, core search relied entirely on SQLite’s full-text search module (FTS5). This architecture is simple and extremely fast, but it hit two bottlenecks in real-world use:

The Semantic Gap: Users’ search terms often do not perfectly align with the actual audio tags (e.g., searching for “happy” might miss tracks tagged as “joyful”).
Weak Cross-Language Matching: Music metadata is often written in several languages, and a simple tokenizer cannot establish semantic associations across them.
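
The semantic gap is easy to reproduce. The following is a minimal sketch using Python's built-in sqlite3 module, assuming the bundled SQLite was compiled with FTS5. The tracks table and its sample rows are invented for illustration and are not the application's real schema.

```python
import sqlite3

# Hypothetical in-memory table; the real schema is not shown in this article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE tracks USING fts5(title, tags)")
conn.executemany(
    "INSERT INTO tracks (title, tags) VALUES (?, ?)",
    [
        ("Sunrise Parade", "joyful upbeat brass"),
        ("Happy Days", "happy pop"),
        ("Still Water", "tranquil piano"),
    ],
)


def lexical_search(query: str) -> list[str]:
    """Return titles of rows whose indexed text contains the query token."""
    rows = conn.execute(
        "SELECT title FROM tracks WHERE tracks MATCH ? ORDER BY rank",
        (query,),
    )
    return [title for (title,) in rows]


# "happy" only finds the row that literally contains that token;
# the track tagged "joyful" is invisible to this query.
print(lexical_search("happy"))
print(lexical_search("joyful"))
```

The two queries return disjoint results even though a listener would consider both tracks relevant to either word, which is the gap the semantic lane is meant to close.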

II. Dual-Lane Retrieval & Hybrid Architecture

Introducing a semantic layer did not mean abandoning lexical retrieval. Instead, we adopted a Dual-Lane Retrieval & RRF (Reciprocal Rank Fusion) model, layering LanceDB on top of the existing SQLite architecture.

1. Independent Projection Tables and BGE-M3 Embeddings

For each normalized user library (Root), we established an independent text_vectors projection table in its .lancedb/ directory.

Local ONNX Inference: When an index rebuild is triggered (Phase B), the system schedules a background engine built on ort, passes the metadata text to the local BGE-M3 Int8 ONNX model, produces 1024-dimensional f32 vectors, and upserts them into the database.
Ledger and Retry Mechanism: Large vectorization jobs can be interrupted on devices with limited performance. To make them recoverable, we designed a text_vector_index_ledger table in SQLite. It tracks the status of each text row (pending, indexed, failed), so the system can resume and retry from the point of interruption even if the program restarts unexpectedly.
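
The ledger idea can be sketched as follows. Only the table name and the three status values come from the description above. The columns row_id and attempts, the MAX_ATTEMPTS retry budget, and the helper functions are assumptions made for this example, and the embed callback stands in for ONNX inference plus the LanceDB upsert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE text_vector_index_ledger (
        row_id   INTEGER PRIMARY KEY,
        status   TEXT NOT NULL DEFAULT 'pending'
                 CHECK (status IN ('pending', 'indexed', 'failed')),
        attempts INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO text_vector_index_ledger (row_id) VALUES (?)",
    [(i,) for i in range(1, 6)],
)

MAX_ATTEMPTS = 3  # assumed retry budget, not taken from the article


def next_batch(limit: int) -> list[int]:
    """Rows that still need work: pending, or failed with retries left."""
    rows = conn.execute(
        "SELECT row_id FROM text_vector_index_ledger "
        "WHERE status = 'pending' OR (status = 'failed' AND attempts < ?) "
        "ORDER BY row_id LIMIT ?",
        (MAX_ATTEMPTS, limit),
    )
    return [row_id for (row_id,) in rows]


def record_result(row_id: int, ok: bool) -> None:
    """Commit each outcome immediately so a crash never loses finished rows."""
    conn.execute(
        "UPDATE text_vector_index_ledger "
        "SET status = ?, attempts = attempts + 1 WHERE row_id = ?",
        ("indexed" if ok else "failed", row_id),
    )
    conn.commit()


def run_indexing(embed, batch_size: int = 2) -> None:
    """Drain the ledger; safe to call again after an unexpected restart."""
    while batch := next_batch(batch_size):
        for row_id in batch:
            record_result(row_id, embed(row_id))


# Simulate a row that always fails to embed (row 3).
run_indexing(lambda row_id: row_id != 3)
print(dict(conn.execute("SELECT row_id, status FROM text_vector_index_ledger")))
```

Because progress lives in SQLite rather than in memory, calling run_indexing again after a restart simply picks up whatever is still pending or retryable.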

2. Dual Providers: Lexical and Semantic

When a user submits a query, the backend dispatches it to two core Providers:

LexicalRetrievalProvider: Still provides strict, exact matching via FTS5, so that direct filename and explicit tag searches are reliably recalled.
SemanticRetrievalProvider: Loads the BGE inference environment, computes the embedding of the user’s query, and passes it to LanceDB’s ANN (Approximate Nearest Neighbor) search for semantic retrieval.
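
Conceptually, the semantic lane is a nearest-neighbor lookup over embeddings. The sketch below is a simplified stand-in: it uses exact brute-force cosine similarity instead of LanceDB's ANN index, and hand-written 3-dimensional toy vectors instead of 1024-dimensional BGE-M3 output. INDEX, semantic_search, and the file names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy embeddings: the first axis loosely means "cheerful", the last "calm".
INDEX = {
    "sunrise_parade.flac": [0.9, 0.1, 0.0],  # tagged "joyful"
    "happy_days.flac": [0.8, 0.2, 0.1],      # tagged "happy"
    "still_water.flac": [0.0, 0.1, 0.9],     # tagged "tranquil"
}


def semantic_search(query_vector: list[float], top_k: int = 2) -> list[str]:
    """Rank every indexed track by similarity to the query, best first."""
    ranked = sorted(
        INDEX.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [track for track, _ in ranked[:top_k]]


# A query embedding pointing along the "cheerful" axis retrieves both
# cheerful tracks, regardless of which exact word each one was tagged with.
print(semantic_search([1.0, 0.0, 0.0]))
```

An ANN index trades a small amount of recall for speed by avoiding the full scan shown here, which matters once the table holds on the order of a million vectors.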

III. The Art of Rank Fusion: Text-Priority RRF

Combining the results of both lanes and presenting the final list (ResultList) is the real challenge.

Rather than using a high-latency Cross-encoder reranker, we use a tuned Text-Priority RRF (Reciprocal Rank Fusion) algorithm for post-processing:

$$
Score_{RRF} = \frac{W_{Lexical}}{K + \text{Rank}_{Lexical}} + \frac{W_{Semantic}}{K + \text{Rank}_{Semantic}}
$$

Note: In our implementation, $K = 60$, the lexical matching weight is $W_{Lexical} = 1.0$, and the semantic matching weight is $W_{Semantic} = 0.95$.

By applying a slight weight discount (0.95) to the semantic side, we ensure that at equal rank a direct lexical hit scores higher than a result that is merely “semantically similar.” With $K = 60$, the top four lexical hits outscore even the best semantic-only result (1/64 ≈ 0.01563 versus 0.95/61 ≈ 0.01557), while tracks found by both lanes accumulate both terms and rise to the top. This preserves traditional expectations of “search” while supplementing the long tail with semantically relevant hidden gems.
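
The fusion step can be sketched in a few lines. The constants match the values given above. The function name, the 1-based ranks, and the convention that a track missing from a lane contributes nothing for that lane are assumptions of this example, as are the track IDs.

```python
K = 60
W_LEXICAL = 1.0
W_SEMANTIC = 0.95


def text_priority_rrf(lexical: list[str], semantic: list[str]) -> list[str]:
    """Fuse two best-first rankings into one list using weighted RRF."""
    scores: dict[str, float] = {}
    for weight, ranking in ((W_LEXICAL, lexical), (W_SEMANTIC, semantic)):
        for rank, track_id in enumerate(ranking, start=1):
            scores[track_id] = scores.get(track_id, 0.0) + weight / (K + rank)
    # Highest score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda track_id: (-scores[track_id], track_id))


lexical_hits = ["happy_days", "happy_hour"]
semantic_hits = ["sunrise_parade", "happy_days", "joy_ride"]

print(text_priority_rrf(lexical_hits, semantic_hits))
# → ['happy_days', 'happy_hour', 'sunrise_parade', 'joy_ride']
```

Here happy_days wins because it appears in both lanes, happy_hour (lexical rank 2, 1/62) still edges out sunrise_parade (semantic rank 1, 0.95/61), and the semantic-only tracks fill in the tail.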

IV. Dual-Library Parallelism: Filtering for Music and Materials

Different application scenarios have entirely different indexing requirements. Each result retrieved from LanceDB (TextVectorSearchHit) therefore carries a library_type (“music” or “material”) and a surface field.

This means that even if a user simultaneously manages hundreds of GB of lossless music albums and sound effect collections, the search pipeline can filter on these fields during ANN recall, so that environmental sound effects do not slip into music genre playlists.
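
The effect of filtering during recall, rather than after it, can be illustrated with a small sketch. Only the TextVectorSearchHit name and its library_type and surface fields come from the text above. The track_id and distance fields, the surface values, and the recall function are hypothetical, and a plain Python list stands in for the vector table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextVectorSearchHit:
    track_id: str      # hypothetical identifier
    distance: float    # hypothetical; smaller means closer to the query
    library_type: str  # "music" or "material"
    surface: str       # example values below are made up


CANDIDATES = [
    TextVectorSearchHit("rain_on_roof.wav", 0.10, "material", "tags"),
    TextVectorSearchHit("forest_birds.wav", 0.12, "material", "tags"),
    TextVectorSearchHit("rainy_day_jazz.flac", 0.15, "music", "tags"),
    TextVectorSearchHit("storm_suite.flac", 0.30, "music", "filename"),
]


def recall(library_type: str, top_k: int) -> list[TextVectorSearchHit]:
    """Restrict the candidate set first, then take the nearest top_k."""
    eligible = [hit for hit in CANDIDATES if hit.library_type == library_type]
    return sorted(eligible, key=lambda hit: hit.distance)[:top_k]


music_hits = recall("music", top_k=2)
print([hit.track_id for hit in music_hits])
```

If the filter were applied only after taking the two nearest candidates overall, both slots would be consumed by sound effects and the music query would come back empty. Applying the predicate as part of the search keeps the full top_k budget for the requested library.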

Conclusion

The retrieval challenge of a million-track local library is essentially about balancing precision and recall under the limited computing resources of an edge device. Because both LanceDB and the lightweight ONNX model run locally, the hybrid search architecture preserves user data privacy while surfacing semantically related tracks that keyword search alone would miss. In future versions, we also plan to explore metadata chunking to support retrieval of longer description fields.