Letter counts and Abjad totals: verifying a Quranic engine letter-by-letter

Mizan Core Engine (MCE), named after the Quranic concept Al-Mīzān (الميزان), “the Balance”, is a scholarly-grade Quranic text analysis system that puts measurement accuracy above everything else. The repository is at github.com/ahmetabdullahgultekin/Mizan. The stack is Python 3.11+, FastAPI, PostgreSQL 17 with pgvector, Redis 7, and a Next.js frontend at mizan.rollingcatsoftware.com, deployed on a Hetzner CX43 behind Traefik. The codebase has 172 passing tests, every count is verified against Tanzil.net, and every API response carries reproducibility metadata.

This piece is about what “scholarly grade” actually means in code, and why it is much harder than it sounds.

1. “How many letters in Al-Fatiha?” is not a one-line answer

The first time someone asks you to compute the letter count of Sūrat al-Fātiḥa, you write a one-liner. You strip whitespace, count characters, and return a number. That number is wrong. It is wrong in at least four distinct ways, each of which corresponds to a real scholarly debate.

Uthmani vs Imla’i script. The Uthmani Mushaf preserves classical orthographic features — silent letters, alif khanjariyya (ٰ), alif wasla (ٱ) — that the Imla’i (modern simplified) script removes. The same verse can have different letter counts depending on which mushaf you consult.
Base-only vs full counting. Should the alif wasla (the connecting hamzatul wasl in ٱللَّهِ) count as a letter? Tradition says yes; some modern computational standards say no.
Diacritics. Should ḥarakāt (fatḥa, kasra, ḍamma) count? Universally, no. Should shadda? It marks a doubled letter — does that count as one or two? Universally one.
Ligatures. The Quranic ﷲ (Allah ligature) is one Unicode code point (U+FDF2) but orthographically four letters (ا ل ل ه).

We exposed three counting methods explicitly: TRADITIONAL (alif wasla included, alif khanjariyya excluded — matches scholarly consensus and is the default), UTHMANI_FULL (everything Uthmani included), and NO_WASLA (base letters only). A request must pick one. There is no ambient default that hides the choice from the caller — you cannot use the API without acknowledging the methodology.

Counting is not a function of the text. Counting is a function of the text and the methodology applied to it.
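A minimal sketch of what "methodology as an argument" can look like, assuming the Uthmani text encodes alif wasla as U+0671 and alif khanjariyya as U+0670. The names `CountingMethod` and `count_letters` are illustrative and are not taken from the Mizan codebase. The Basmalah is spelled with escapes so the exact code points are visible.

```python
import unicodedata
from enum import Enum


class CountingMethod(Enum):
    TRADITIONAL = "traditional"    # alif wasla counted, alif khanjariyya not
    UTHMANI_FULL = "uthmani_full"  # both counted
    NO_WASLA = "no_wasla"          # neither counted


ALIF_WASLA = "\u0671"
ALIF_KHANJARIYYA = "\u0670"
ALLAH_LIGATURE = "\uFDF2"

# Uthmani Basmalah (assumed encoding; the corpus file may differ in detail).
BASMALAH = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "                                # bismi
    "\u0671\u0644\u0644\u0651\u064e\u0647\u0650 "                          # allahi
    "\u0671\u0644\u0631\u0651\u064e\u062d\u0652\u0645\u064e\u0640\u0670\u0646\u0650 "  # ar-rahmani
    "\u0671\u0644\u0631\u0651\u064e\u062d\u0650\u064a\u0645\u0650"          # ar-rahimi
)


def count_letters(text: str, method: CountingMethod) -> int:
    """Count letters under an explicit methodology; there is no default."""
    # Expand the single-code-point ligature into its four written letters.
    text = text.replace(ALLAH_LIGATURE, "\u0627\u0644\u0644\u0647")
    total = 0
    for ch in text:
        if ch == ALIF_KHANJARIYYA:
            if method is CountingMethod.UTHMANI_FULL:
                total += 1
        elif ch == ALIF_WASLA:
            if method is not CountingMethod.NO_WASLA:
                total += 1
        elif unicodedata.category(ch) == "Lo":
            # "Lo" covers base letters; harakat and shadda are "Mn",
            # tatweel is "Lm", so none of them are counted.
            total += 1
    return total


for method in CountingMethod:
    print(method.name, count_letters(BASMALAH, method))
```

Under these assumptions the same string yields 19 (TRADITIONAL), 20 (UTHMANI_FULL), and 16 (NO_WASLA). Because `method` has no default value, a caller cannot get a number without naming the methodology.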

2. Hybrid retrieval: BM25 + ISRI + multilingual embeddings

Counting letters is the small problem. Searching the Qur’an by meaning, across Arabic and Turkish and English, is the harder one.

We landed on a hybrid retriever with four parallel paths and Reciprocal Rank Fusion (RRF) on top:

Vector search on Arabic verses — intfloat/multilingual-e5-base (768-dimension), 6,236 verse embeddings, pgvector for the index.
Vector search on EN translations — Sahih International, 6,236 embedded translations, same model.
Vector search on TR translations — Elmalılı Hamdi Yazır, also 6,236, also same model. (Primary and fallback embedders MUST share a dimension — mixing 768 with 1024 silently breaks the index.)
BM25 keyword search with PostgreSQL tsvector + GIN, plus an Arabic-aware path: the ISRI stemmer (a pure-Python implementation of the Information Science Research Institute Arabic stemmer) reduces والدين → ولد and صابرين → صبر so a query for “patience” can match وَٱصْبِرْ.

Then RRF fuses the four ranked lists. A cross-encoder reranker (cross-encoder/ms-marco-MiniLM-L-6-v2) sits behind a feature flag — the infrastructure is built, OOM-safe, and disabled by default because on a 16GB CX43 it competes with the embedding model for RAM.
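The fusion step itself is small. The sketch below shows plain Reciprocal Rank Fusion over ranked lists of verse keys. The constant k = 60 is a commonly used value and an assumption here, since the value Mizan uses is not stated above; the four input lists are made-up rankings for illustration, not real retrieval output.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by key for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-path results for one query (verse keys as "surah:ayah").
arabic_vectors = ["2:153", "2:45", "3:200"]
english_vectors = ["2:45", "2:153", "103:3"]
turkish_vectors = ["2:153", "3:200", "2:45"]
bm25_keywords = ["2:153", "3:200"]

fused = reciprocal_rank_fusion(
    [arabic_vectors, english_vectors, turkish_vectors, bm25_keywords]
)
print(fused)
```

RRF only looks at ranks, never at raw scores, which is why it can combine cosine similarities and BM25 scores without any normalization step. A verse that appears in only one list still makes it into the fused result, just with a lower score.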

The lesson: for a multilingual corpus where the source language is the canonical text and translations are auxiliary, you fuse — you don’t pick. A user searching in Turkish for “sabır” should find verses where the Arabic root is صبر, regardless of which translator’s English chose “patience” vs “perseverance”.

3. Reproducibility metadata on every result

Every response from Mizan includes:

The dataset SHA (which Tanzil mushaf, which version).
The model version (which embedder, which dimension).
The methodology (which counting method, which scoring system).

This is not bureaucratic decoration. The corpus we ship today is the Medina Mushaf as published by Tanzil; if Tanzil republishes with a correction (and they have, historically), every cached count downstream becomes stale. Without a SHA in the response payload, downstream consumers have no way to detect that they are looking at a number from a superseded version.

A scholarly tool that returns “139 letters in Al-Fatiha” without telling you which mushaf and which counting method is not a scholarly tool — it is a number generator.
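As a rough illustration of the idea, the sketch below attaches a provenance block to a result and lets a downstream consumer detect a superseded corpus. The field names, the `with_provenance` and `is_stale` helpers, and the dataset label are hypothetical and do not describe Mizan's actual response schema.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    dataset_name: str      # which mushaf edition (hypothetical label)
    dataset_sha256: str    # hash of the exact corpus bytes used
    model_name: str        # which embedder
    embedding_dim: int     # which dimension
    counting_method: str   # which counting method
    abjad_system: str      # which scoring system


def with_provenance(result: dict, corpus: bytes, **fields) -> dict:
    """Wrap a result with the metadata needed to reproduce it."""
    prov = Provenance(dataset_sha256=hashlib.sha256(corpus).hexdigest(), **fields)
    return {"result": result, "provenance": asdict(prov)}


def is_stale(cached_response: dict, current_corpus: bytes) -> bool:
    """True if the cached number was computed from a different corpus."""
    current = hashlib.sha256(current_corpus).hexdigest()
    return cached_response["provenance"]["dataset_sha256"] != current


corpus_v1 = b"placeholder bytes standing in for the corpus file"
response = with_provenance(
    {"surah": 1, "letters": 139},
    corpus_v1,
    dataset_name="tanzil-uthmani",  # hypothetical identifier
    model_name="intfloat/multilingual-e5-base",
    embedding_dim=768,
    counting_method="TRADITIONAL",
    abjad_system="MASHRIQI",
)
print(is_stale(response, corpus_v1))                  # same corpus
print(is_stale(response, corpus_v1 + b" corrected"))  # republished corpus
```

The check costs one string comparison on the consumer side, which is the whole point: staleness becomes detectable without re-running the count.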

4. The verification matrix

We pinned five values as test invariants. If any of these fails, the build fails:

Metric	Value	Source
Al-Fatiha letters (TRADITIONAL)	139	Tanzil.net
Basmalah letters (TRADITIONAL)	19	Traditional consensus
Allah (الله) Abjad (Mashriqi)	66	Universal
Basmalah Abjad (Mashriqi)	786	Universal
All 28 letter values	100% match	Scholarly standard

These are not arbitrary. The number 19 for Basmalah letters is foundational in classical numerological literature. The number 786 for Basmalah Abjad appears at the top of letters across the Indian subcontinent. Allah = 66 is in every classical ‘ilm al-hurūf source ever written. If our engine gets these wrong, no other count it produces is trustworthy. So we treat them as guard rails — failing tests that trip if anyone “optimizes” the counter and breaks the invariant.

The Mashriqi vs Maghribi system distinction is itself a methodological choice we expose: Mashriqi (Eastern) and Maghribi (Western) Abjad orders disagree on the numerical values of several letters. A library that supports only one and calls it “Abjad” is making a choice on behalf of the user without telling them. Ours makes the user pick.
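The pinned Abjad values can be checked by hand with a small calculator. The following is a sketch, not Mizan's implementation: `AbjadSystem` and `abjad_total` are illustrative names, alif wasla is treated as a plain alif, and letters outside the 28-letter table (hamza forms, ta marbuta, alif maqsura) raise an error instead of receiving a silently chosen value. The Maghribi table differs from the Mashriqi one in six letters.

```python
import unicodedata
from enum import Enum


class AbjadSystem(Enum):
    MASHRIQI = "mashriqi"
    MAGHRIBI = "maghribi"


MASHRIQI_VALUES = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80,
    "ص": 90, "ق": 100, "ر": 200, "ش": 300, "ت": 400, "ث": 500, "خ": 600,
    "ذ": 700, "ض": 800, "ظ": 900, "غ": 1000,
}
# Western order: six letters carry different values.
MAGHRIBI_VALUES = {
    **MASHRIQI_VALUES,
    "ص": 60, "ض": 90, "س": 300, "ظ": 800, "غ": 900, "ش": 1000,
}
TABLES = {
    AbjadSystem.MASHRIQI: MASHRIQI_VALUES,
    AbjadSystem.MAGHRIBI: MAGHRIBI_VALUES,
}

ALIF, ALIF_WASLA = "\u0627", "\u0671"
ALLAH = "\u0627\u0644\u0644\u0647"
BASMALAH = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "
    "\u0671\u0644\u0644\u0651\u064e\u0647\u0650 "
    "\u0671\u0644\u0631\u0651\u064e\u062d\u0652\u0645\u064e\u0640\u0670\u0646\u0650 "
    "\u0671\u0644\u0631\u0651\u064e\u062d\u0650\u064a\u0645\u0650"
)


def abjad_total(text: str, system: AbjadSystem) -> int:
    """Sum letter values under an explicitly chosen Abjad system."""
    values = TABLES[system]  # KeyError if the caller passes anything else
    total = 0
    for ch in text:
        if ch == ALIF_WASLA:
            ch = ALIF  # assumption: wasla carries the value of alif
        if ch in values:
            total += values[ch]
        elif unicodedata.category(ch) == "Lo":
            raise ValueError(f"no Abjad value defined for U+{ord(ch):04X}")
        # Diacritics, tatweel, dagger alif, and spaces are skipped, not
        # treated as a stop condition.
    return total


print(abjad_total(ALLAH, AbjadSystem.MASHRIQI))
print(abjad_total(BASMALAH, AbjadSystem.MASHRIQI))
print(abjad_total(BASMALAH, AbjadSystem.MAGHRIBI))
```

With the Mashriqi table this gives 66 for الله and 786 for the Basmalah. The Maghribi table changes the value of س from 60 to 300, so the same Basmalah sums to 1026, which is exactly why the system has to be named in the request.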

5. 172 tests as a quality gate

The 172 tests cover four categories:

Counting correctness. Every method against every pinned value, plus property-based tests via Hypothesis for “no method ever returns a negative count, no method ever exceeds the character count, the no-wasla count is never greater than the traditional count”.
Navigation. Every surah/verse pair must be reachable; every reverse navigation (verse → surah → verse) must round-trip; the 6,236 total must hold.
Integrity. SHA-256/512 checksums of the corpus on every load. If the file on disk has been modified, the engine refuses to start.
Multi-script. Uthmani, Simple (Imla’i), and Uthmani-min must all round-trip through the API. A request for surah 1 in Imla’i and the same in Uthmani must return texts that are different but verifiably the same verse.
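The integrity check in particular is easy to sketch. The version below assumes the expected digests are stored next to the code as a small mapping from algorithm name to hex digest; `verify_corpus`, `load_corpus`, and `CorpusIntegrityError` are hypothetical names, not the engine's real API.

```python
import hashlib
from pathlib import Path


class CorpusIntegrityError(RuntimeError):
    """Raised when the corpus on disk does not match its pinned digests."""


def verify_corpus(data: bytes, expected: dict) -> bytes:
    """Check every pinned digest (e.g. sha256 and sha512) before use."""
    for algorithm, expected_hex in expected.items():
        actual_hex = hashlib.new(algorithm, data).hexdigest()
        if actual_hex != expected_hex:
            raise CorpusIntegrityError(
                f"{algorithm} mismatch: expected {expected_hex[:12]}..., "
                f"got {actual_hex[:12]}..."
            )
    return data


def load_corpus(path: Path, expected: dict) -> str:
    """Read the corpus file and refuse to continue if it was modified."""
    return verify_corpus(path.read_bytes(), expected).decode("utf-8")


# Illustration with placeholder bytes instead of the real corpus file.
original = "placeholder corpus text".encode("utf-8")
pinned = {
    "sha256": hashlib.sha256(original).hexdigest(),
    "sha512": hashlib.sha512(original).hexdigest(),
}
verify_corpus(original, pinned)  # passes silently
try:
    verify_corpus(original + b" ", pinned)  # one extra space
except CorpusIntegrityError as err:
    print("refused:", err)
```

Failing at load time, rather than logging a warning, means a modified file can never produce a count at all.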

What 172 tests catch in practice: regressions in the counter when we add a method, off-by-one bugs in verse ranges (verse 0 vs verse 1 indexing — the Quran is 1-indexed; computer scientists are not), and the time someone tried to “speed up” the abjad calculator by short-circuiting on whitespace and broke the basmalah test in CI.
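The property-style invariants quoted above can be illustrated without any dependencies. This sketch swaps Hypothesis for a seeded `random.Random` loop so that it runs on the standard library alone, and it uses a deliberately simplified counter (no ligature expansion, since expanding ﷲ would let the count exceed the raw character count). All names are illustrative.

```python
import random
import unicodedata

METHODS = ("TRADITIONAL", "UTHMANI_FULL", "NO_WASLA")
ALIF_WASLA, ALIF_KHANJARIYYA, TATWEEL = "\u0671", "\u0670", "\u0640"


def count_letters(text: str, method: str) -> int:
    """Simplified counter used only to demonstrate the invariants."""
    if method not in METHODS:
        raise ValueError(f"unknown counting method: {method!r}")
    total = 0
    for ch in text:
        if ch == ALIF_KHANJARIYYA:
            total += int(method == "UTHMANI_FULL")
        elif ch == ALIF_WASLA:
            total += int(method != "NO_WASLA")
        elif unicodedata.category(ch) == "Lo":
            total += 1
    return total


# Arabic letters and harakat (U+0621..U+0652) plus the special cases.
ALPHABET = [chr(cp) for cp in range(0x0621, 0x0653)]
ALPHABET += [ALIF_WASLA, ALIF_KHANJARIYYA, TATWEEL, " "]


def check_invariants(trials: int = 500, seed: int = 19) -> int:
    """Generate random strings and assert the ordering invariants hold."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 40)
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        counts = {m: count_letters(text, m) for m in METHODS}
        # Never negative, never more than the number of characters.
        assert all(0 <= c <= len(text) for c in counts.values())
        # Each method counts a superset of what the previous one counts.
        assert counts["NO_WASLA"] <= counts["TRADITIONAL"] <= counts["UTHMANI_FULL"]
    return trials


print(check_invariants(), "random strings checked")
```

Invariants like these are what catch an "optimization" that changes behavior: the pinned values guard specific verses, while the generated strings guard the relationships between methods.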

What we’d do differently

Pin Tanzil corpus version in lockfile-style. We do this informally; we should have done it from day one with a checked-in corpus.lock.
Add the methodology field to the URL path, not the query string. A misconfigured client can drop query strings; methodology should be in the path so it cannot be dropped silently.
Build the verification harness first, the UI second. Same lesson as Sarnıç: the test matrix is the spec, and writing it forces decisions.

Reading list

The Tanzil corpus documentation — read this before writing a single line of Quranic-text code.
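The Probabilistic Relevance Framework: BM25 and Beyond by Robertson and Zaragoza (2009), the classical reference.
The original ISRI Arabic stemmer paper: Taghva, Elkhoury, and Coombs, “Arabic Stemming Without a Root Dictionary” (2005).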

Repository: github.com/ahmetabdullahgultekin/Mizan. Live frontend: mizan.rollingcatsoftware.com. API: mizan-api.rollingcatsoftware.com.
