Mizan

A hybrid search and text-analysis engine for the Quran — BM25 over the Arabic with ISRI stemming, cross-language translation embeddings, and letter/Abjad calculations verified against the Tanzil corpus.

Role: Author and maintainer
Date: Jan 2026

Stack

  • Python
  • BM25
  • ISRI stemmer
  • Translation embeddings
  • FastAPI

The problem

Searching and analyzing the Quran well is harder than it looks. Arabic is morphologically rich, so naive keyword search misses obvious matches; users query in Turkish and English, not only Arabic; and letter counts and Abjad (numerology) totals circulate widely but are often wrong because nobody shows how they were computed.

Constraints

  • Every count and Abjad total must be verifiable letter-by-letter against an authoritative corpus, not asserted.
  • Queries arrive in Turkish and English as well as Arabic; the engine has to bridge languages.

Approach

Mizan combines classical information retrieval with modern cross-language embeddings. BM25 handles morphology-aware lexical search on the Arabic source via an ISRI stemmer; translation embeddings bridge Turkish and English queries to the Arabic text. Letter counts and Abjad totals are computed and then verified letter by letter against the Tanzil corpus (Tanzil.net).

Key decisions

  • Hybrid retrieval instead of pure lexical or pure semantic search

    BM25 with an ISRI stemmer gives exact, explainable lexical matches on the Arabic; multilingual embeddings catch cross-language and paraphrase queries. Neither alone covers both a scholar's exact-phrase lookup and a casual Turkish question.

  • Attach reproducibility metadata to every result

    Each search and analysis result carries the dataset SHA, model version, and methodology notes used to compute it, so any number the engine produces can be independently reconstructed and trusted.
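A minimal sketch of what such a reproducibility tag could look like. The field names, the `Provenance` class, the `tag_result` helper, and the choice of SHA-256 are all assumptions made for illustration; the source only says that a dataset SHA, a model version, and methodology notes travel with each result.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical metadata record attached to every result."""
    dataset_sha256: str  # hash of the exact corpus bytes that were used
    model_version: str   # embedding model identifier (placeholder below)
    methodology: str     # notes on how the numbers were computed


def dataset_digest(corpus_bytes: bytes) -> str:
    """Hash the raw corpus bytes so a reader can confirm they hold the same text."""
    return hashlib.sha256(corpus_bytes).hexdigest()


def tag_result(payload: dict, corpus_bytes: bytes,
               model_version: str, methodology: str) -> dict:
    """Return a new dict holding the payload plus its provenance."""
    prov = Provenance(dataset_digest(corpus_bytes), model_version, methodology)
    return {**payload, "provenance": asdict(prov)}


# Illustrative values only; none of these come from Mizan itself.
corpus = "بسم الله الرحمن الرحيم".encode("utf-8")
result = tag_result(
    {"query": "الله", "hits": 1},
    corpus,
    model_version="example-model-v0",
    methodology="28 base letters counted, diacritics ignored",
)
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Because the digest is taken over raw bytes, even a single changed character or added space in the corpus yields a different tag, which is what lets a reader detect that they are checking against a different text.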

Architecture

A query in Arabic, Turkish, or English enters the FastAPI service. The hybrid retriever runs BM25 (with the ISRI stemmer) over the Arabic source in parallel with a multilingual embedding fallback, merges the results, and returns them alongside the analysis engine's letter and Abjad counts — every response carrying reproducibility metadata.

flowchart LR
  q["Query<br/>Arabic · Turkish · English"]
  svc["FastAPI service"]
  bm25["BM25 + ISRI stemmer<br/>(Arabic source)"]
  emb["Translation embeddings<br/>(multilingual fallback)"]
  analysis["Analysis engine<br/>letter + Abjad counts"]
  merge["Merge + reproducibility metadata"]
  out["Result"]

  q --> svc
  svc --> bm25 --> merge
  svc --> emb --> merge
  svc --> analysis --> merge
  merge --> out

Figure: BM25 + ISRI and a translation-embedding fallback feed one merged, metadata-tagged result.
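The source does not say how the two ranked lists are merged. One common option is reciprocal rank fusion, sketched here as an assumption rather than as Mizan's actual method. The verse IDs and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["2:255", "1:1", "112:1"]         # hypothetical IDs from the lexical path
embedding_hits = ["112:1", "2:255", "59:22"]  # hypothetical IDs from the embedding path

fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Rank-based fusion sidesteps the fact that BM25 scores and embedding similarities live on different scales: only positions are combined, so a verse ranked well by both paths rises above one that a single path ranks first.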

Outcome

Mizan ships as a FastAPI service plus a Python client, with 172 tests passing — counts and Abjad totals verified against the Tanzil corpus with byte-level reproducibility. It is designed to plug into the Muhabbet stack later.

By the numbers

  • Tests passing: 172
  • Query languages bridged: 3
  • Count verification: Tanzil corpus
  • Reproducibility tag per result: SHA + model + notes

Deep dive

Mizan (“the scale” / “the balance”) combines classical information retrieval with modern cross-language embeddings to make Quranic text searchable and its letter-level statistics verifiable.

Why hybrid retrieval

Arabic is morphologically rich: the same root surfaces in many inflected forms. A plain keyword index misses obvious matches, so Mizan runs BM25 over the Arabic with an ISRI stemmer to get morphology-aware lexical recall. But a scholar’s exact-phrase lookup and a casual Turkish question are different problems — the second needs to bridge languages. So a multilingual translation-embedding path runs alongside BM25 and catches cross-language and paraphrase queries. The two are merged into one ranked result.
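To make the effect of stemming on lexical recall concrete, here is a small pure-Python BM25 sketch with a pluggable stemmer. `BM25Index` and `strip_al` are hypothetical names, and `strip_al` is a deliberately crude stand-in that only drops a leading definite article; it is not the ISRI algorithm. NLTK ships an `ISRIStemmer` class whose `stem` method could be passed in instead, though the source does not say which implementation Mizan uses.

```python
import math
from collections import Counter
from typing import Callable


def strip_al(token: str) -> str:
    """Toy stand-in for a real stemmer: drop a leading definite article."""
    return token[2:] if token.startswith("ال") and len(token) > 3 else token


class BM25Index:
    """Okapi BM25 over whitespace tokens, with stemming applied at index and query time."""

    def __init__(self, docs: list[str], stem: Callable[[str], str],
                 k1: float = 1.5, b: float = 0.75):
        self.stem, self.k1, self.b = stem, k1, b
        self.doc_terms = [Counter(self._tokens(d)) for d in docs]
        self.doc_len = [sum(c.values()) for c in self.doc_terms]
        self.avg_len = sum(self.doc_len) / len(docs)
        # Document frequency: how many documents contain each stem.
        self.df = Counter(term for c in self.doc_terms for term in c)

    def _tokens(self, text: str) -> list[str]:
        return [self.stem(t) for t in text.split()]

    def score(self, query: str, i: int) -> float:
        n_docs, tf, total = len(self.doc_terms), self.doc_terms[i], 0.0
        for term in set(self._tokens(query)):
            if term not in tf:
                continue
            idf = math.log((n_docs - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1.0)
            norm = tf[term] + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
            total += idf * tf[term] * (self.k1 + 1) / norm
        return total


docs = ["الكتاب المبين", "كتاب كريم", "رب العالمين"]
stemmed = BM25Index(docs, stem=strip_al)
unstemmed = BM25Index(docs, stem=lambda t: t)

print([round(stemmed.score("كتاب", i), 3) for i in range(len(docs))])
print([round(unstemmed.score("كتاب", i), 3) for i in range(len(docs))])
```

With the toy stemmer, the query matches both the first and second documents; without it, the form carrying the definite article is missed and only the second document scores above zero.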

Verifiable, not just asserted

The part that matters most for trust is the numbers. Letter counts and Abjad (numerological) totals circulate widely and are frequently wrong because nobody shows the work. Mizan computes them and then verifies them letter-by-letter against the Tanzil corpus, and every search and analysis result carries the dataset SHA, the model version, and the methodology notes used to produce it. That means any number the engine reports can be reconstructed independently — the result is reproducible down to the byte.
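A rough sketch of how letter counts and an Abjad total can be computed and then compared letter by letter with reference figures. The table holds the standard Abjad values for the 28 base letters. Everything else is an assumption for illustration: the function names are hypothetical, the reference counts are hand-written rather than taken from Tanzil, and skipping diacritics and variant forms (hamza carriers, ta marbuta, alef maqsura) is one possible convention, not necessarily the one Mizan documents.

```python
from collections import Counter

# Standard Abjad values for the 28 base letters.
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7,
    "ح": 8, "ط": 9, "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50,
    "س": 60, "ع": 70, "ف": 80, "ص": 90, "ق": 100, "ر": 200, "ش": 300,
    "ت": 400, "ث": 500, "خ": 600, "ذ": 700, "ض": 800, "ظ": 900, "غ": 1000,
}


def letter_counts(text: str) -> Counter:
    """Count only the 28 base letters; every other character is skipped."""
    return Counter(ch for ch in text if ch in ABJAD)


def abjad_total(text: str) -> int:
    """Sum the Abjad value of every counted letter."""
    return sum(ABJAD[ch] * n for ch, n in letter_counts(text).items())


def diff_counts(computed: Counter, reference: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {letter: (computed, reference)} for each letter whose counts disagree."""
    mismatches = {}
    for letter in set(computed) | set(reference):
        pair = (computed.get(letter, 0), reference.get(letter, 0))
        if pair[0] != pair[1]:
            mismatches[letter] = pair
    return mismatches


word = "الله"
counts = letter_counts(word)
print(dict(counts), abjad_total(word))  # 1 + 30 + 30 + 5 = 66

# Hand-written reference counts for this one word, not real corpus data.
print(diff_counts(counts, {"ا": 1, "ل": 2, "ه": 1}))  # empty dict: counts agree
```

Returning the per-letter mismatches, instead of a single pass or fail flag, is what makes a disagreement auditable: the reader sees which letter differs and by how much.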

Where it goes next

Mizan is a FastAPI service plus a Python client, with 172 tests passing. It is built to plug into the Muhabbet stack later, so the same verified retrieval and analysis can sit behind a conversational interface. The source is public on GitHub.
