A hybrid search and text-analysis engine for the Quran — BM25 over the Arabic with ISRI stemming, cross-language translation embeddings, and letter/Abjad calculations verified against the Tanzil corpus.
Role: Author and maintainer
Date: Jan 2026
Stack: Python, BM25, ISRI stemmer, translation embeddings, FastAPI
The problem
Searching and analyzing the Quran well is harder than it looks. Arabic is morphologically rich, so naive keyword search misses obvious matches; users query in Turkish and English, not only Arabic; and letter counts and Abjad (numerology) totals circulate widely but are often wrong because nobody shows how they were computed.
Constraints
▸ Every count and Abjad total must be verifiable letter-by-letter against an authoritative corpus, not asserted.
▸ Queries arrive in Turkish and English as well as Arabic; the engine has to bridge languages.
Approach
Mizan combines classical information retrieval with modern cross-language embeddings. BM25 handles morphology-aware lexical search on the Arabic source via an ISRI stemmer; translation embeddings bridge Turkish and English queries to the Arabic text. Letter counting and Abjad numerology are computed and then verified against Tanzil.net letter-by-letter.
Key decisions
Hybrid retrieval instead of pure lexical or pure semantic search
BM25 with an ISRI stemmer gives exact, explainable lexical matches on the Arabic; multilingual embeddings catch cross-language and paraphrase queries. Neither alone covers both a scholar's exact-phrase lookup and a casual Turkish question.
Attach reproducibility metadata to every result
Each search and analysis result carries the dataset SHA, model version, and methodology notes used to compute it, so any number the engine produces can be independently reconstructed and trusted.
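As a rough illustration of what such a tag could look like, here is a minimal sketch. The `Provenance` class, its field names, and the `tag_result` helper are hypothetical, and SHA-256 is an assumption; the text above only says "dataset SHA".

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    """Reproducibility tag for one result (field names are illustrative)."""
    dataset_sha: str    # hash of the exact corpus bytes that were used
    model_version: str  # identifier of the embedding model, if any
    methodology: str    # notes on how the number was computed


def dataset_sha256(corpus_bytes: bytes) -> str:
    # Assumption: SHA-256 over the raw corpus file contents.
    return hashlib.sha256(corpus_bytes).hexdigest()


def tag_result(payload: dict, corpus_bytes: bytes,
               model_version: str, methodology: str) -> dict:
    """Return a copy of the payload with a provenance block attached."""
    prov = Provenance(dataset_sha256(corpus_bytes), model_version, methodology)
    return {**payload, "provenance": asdict(prov)}


# Hypothetical usage: the corpus bytes and version string are placeholders.
result = tag_result({"letter_count": 19}, b"example corpus bytes",
                    model_version="example-model-0",
                    methodology="combining marks skipped before counting")
```

Hashing the corpus bytes rather than a file name means a single changed character in the dataset produces a different tag, which is what lets a reader detect that they are recomputing against a different text.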
Architecture
A query in Arabic, Turkish, or English enters the FastAPI service. The hybrid retriever runs BM25 (with the ISRI stemmer) over the Arabic source in parallel with a multilingual translation-embedding path, then merges the two result lists. The merged results are returned alongside the analysis engine's letter counts and Abjad totals, and every response carries reproducibility metadata.
BM25 + ISRI and a translation-embedding path feed one merged, metadata-tagged result.
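The description does not say how the two result lists are merged. One common option is reciprocal rank fusion, sketched below as an assumption rather than as Mizan's actual method. The verse ids and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["2:255", "1:1", "3:18"]        # hypothetical ids from the lexical path
embedding_hits = ["1:1", "59:23", "2:255"]  # hypothetical ids from the embedding path
merged = reciprocal_rank_fusion([bm25_hits, embedding_hits])
# merged == ["1:1", "2:255", "59:23", "3:18"]
```

Rank-based fusion avoids comparing raw BM25 scores with cosine similarities, which live on different scales. An id that both paths return rises above one that only a single path found.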
Outcome
Mizan ships as a FastAPI service plus a Python client, with 172 tests passing — counts and Abjad totals verified against the Tanzil corpus with byte-level reproducibility. It is designed to plug into the Muhabbet stack later.
By the numbers
Tests passing: 172
Query languages bridged: 3
Count verification: Tanzil corpus
Reproducibility tag per result: SHA + model + notes
Deep dive
Mizan (“the scale” / “the balance”) combines classical information retrieval
with modern cross-language embeddings to make Quranic text searchable and its
letter-level statistics verifiable.
Why hybrid retrieval
Arabic is morphologically rich: the same root surfaces in many inflected forms.
A plain keyword index misses obvious matches, so Mizan runs BM25 over the
Arabic with an ISRI stemmer to get morphology-aware lexical recall. But a
scholar’s exact-phrase lookup and a casual Turkish question are different
problems — the second needs to bridge languages. So a multilingual
translation-embedding path runs alongside BM25 and catches cross-language and
paraphrase queries. The two are merged into one ranked result.
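To make the lexical half concrete, here is a small, self-contained BM25 sketch with a pluggable stemmer. The `BM25Index` class and `light_stem` function are hypothetical, and `light_stem` is a deliberately crude stand-in that only strips a leading definite article; it is not the ISRI algorithm. If NLTK is available, `nltk.stem.isri.ISRIStemmer().stem` could be passed as `stem` instead, though whether Mizan uses NLTK is not stated here.

```python
import math
from collections import Counter


def light_stem(token: str) -> str:
    """Placeholder stemmer (NOT ISRI): strip a leading definite article."""
    return token[2:] if token.startswith("ال") and len(token) > 4 else token


class BM25Index:
    def __init__(self, docs, stem=light_stem, k1=1.5, b=0.75):
        self.stem, self.k1, self.b = stem, k1, b
        self.doc_tfs = [Counter(self._tokens(d)) for d in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.doc_tfs]
        self.n_docs = len(docs)
        self.avgdl = sum(self.doc_lens) / self.n_docs
        # Document frequency: in how many documents each stem occurs.
        self.df = Counter(t for tf in self.doc_tfs for t in tf)

    def _tokens(self, text):
        return [self.stem(t) for t in text.split()]

    def _idf(self, term):
        n = self.df.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, doc_index):
        tf, dl = self.doc_tfs[doc_index], self.doc_lens[doc_index]
        total = 0.0
        for term in set(self._tokens(query)):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self._idf(term) * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=10):
        scored = [(self.score(query, i), i) for i in range(self.n_docs)]
        ranked = sorted(scored, key=lambda p: (-p[0], p[1]))
        return [(i, s) for s, i in ranked if s > 0][:top_k]


# Three short unvocalized lines used purely as toy documents.
docs = ["الحمد لله رب العالمين", "الرحمن الرحيم", "مالك يوم الدين"]
index = BM25Index(docs)
hits = index.search("حمد")  # matches document 0 only because of stemming
```

With an identity stemmer (`stem=lambda t: t`) the same query returns nothing, because the indexed token carries the article and the query does not. That is the recall gap the stemmer closes.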
Verifiable, not just asserted
The part that matters most for trust is the numbers. Letter counts and Abjad
(numerological) totals circulate widely and are frequently wrong because nobody
shows the work. Mizan computes them and then verifies them letter-by-letter
against the Tanzil corpus, and every search and analysis result carries the
dataset SHA, the model version, and the methodology notes used to produce it.
That means any number the engine reports can be reconstructed independently —
the result is reproducible down to the byte.
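A minimal sketch of what "showing the work" can look like for an Abjad total follows. The letter values are the standard eastern Abjad assignment. The `FOLD` table and the choice to skip combining marks and the standalone hamza are assumptions, since conventions differ and the methodology notes are where such choices would be recorded. The function names are hypothetical.

```python
import unicodedata

# Standard eastern Abjad values for the 28 base letters.
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80,
    "ص": 90, "ق": 100, "ر": 200, "ش": 300, "ت": 400, "ث": 500, "خ": 600,
    "ذ": 700, "ض": 800, "ظ": 900, "غ": 1000,
}

# Assumed folding of variant forms onto base letters (conventions vary).
FOLD = {"أ": "ا", "إ": "ا", "آ": "ا", "ٱ": "ا", "ى": "ي", "ة": "ه",
        "ؤ": "و", "ئ": "ي"}


def letters(text):
    """Return the countable base letters of text, in order."""
    out = []
    for ch in text:
        if unicodedata.combining(ch):  # skip harakat and other combining marks
            continue
        ch = FOLD.get(ch, ch)
        if ch in ABJAD:  # spaces, tatweel, standalone hamza are skipped
            out.append(ch)
    return out


def abjad_breakdown(text):
    """Return (letter, value) pairs so the total can be audited step by step."""
    return [(ch, ABJAD[ch]) for ch in letters(text)]


def abjad_total(text):
    return sum(value for _, value in abjad_breakdown(text))


basmala = "بسم الله الرحمن الرحيم"
print(len(letters(basmala)))  # 19
print(abjad_total(basmala))   # 786
print(abjad_breakdown("الله"))  # four pairs summing to 66
```

Returning the per-letter breakdown, not only the total, is what allows a reader to check each step against the reference text instead of trusting the final figure.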
Where it goes next
Mizan is a FastAPI service plus a Python client, with 172 tests passing. It
is built to plug into the Muhabbet stack later, so the same verified retrieval
and analysis can sit behind a conversational interface. The source is public on
GitHub.