Tool

Lisān OCR

Read Lisān ud-Daʿwat pages into editable text, in your browser.

Upload a scan, photo, or PDF of a Lisān ud-Daʿwat page — typically set in FatemiMaqala or Kanz al-Marjaan — and this tool recognises the text and drops it into an editable box rendered in FatemiMaqala. Multi-page PDFs are read page by page. Recognition runs entirely in your browser; the image isn't uploaded for OCR. (You can separately, and only if you opt in, contribute a correction to help train the model — see below.)

Default model: a custom recogniser trained on Lisān ud-Daʿwat text rendered in FatemiMaqala and Kanz al-Marjaan (voweled and plain), plus real legacy pages. It reads the extended Urdu/Persian letters that stock Arabic OCR misses and captures the iʿrāb when present. Getting the base letters right is the priority; vocalization is a nice-to-have, captured when it is there. Treat the output as a draft to correct, not a finished transcription, and use Export .docx to keep working in Word.

Drop an image or PDF here or click to choose.

PNG, JPG, or PDF — a scan, a photo, or a multi-page document.

Help improve the model

Optional and off by default. If you turn this on, the image you uploaded and your corrected text are sent to an open Lisān ud-Daʿwat OCR training set, so the recogniser improves over time. Please don't contribute anything private or sensitive. See the privacy policy.

How it works & where it's headed

Now: in-browser recognition via Tesseract.js (WASM), with no backend involved in recognition.
Now (opt-in): a correction loop. If you choose to contribute, your edits become real-world training data that improves the model over time.
Next: a custom recogniser trained on synthetic lines, that is, Unicode Lisān ud-Daʿwat text rendered in FatemiMaqala and Kanz al-Marjaan with scan-like augmentation, in both voweled and plain forms, so that it reads the extended letters stock Arabic OCR misses and captures the iʿrāb when it is present.
Then: page layout detection for multi-line scans.
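
The "voweled and plain" pairing above can be illustrated with a small sketch. It is not the tool's actual code: the names `IRAB_MARKS`, `stripIrab`, and `sameBaseLetters` are hypothetical, and the set of marks treated as iʿrāb is an assumption (the common Arabic ḥarakāt, U+064B to U+0652, plus the superscript alef, U+0670). It shows one way a plain form could be derived from a voweled line, and how two readings could be compared on base letters alone.

```typescript
// Assumed, non-exhaustive set of combining marks treated as iʿrāb:
// U+064B..U+0652 (tanwīn, fatḥa, ḍamma, kasra, shadda, sukūn)
// and U+0670 (superscript alef).
const IRAB_MARKS = /[\u064B-\u0652\u0670]/g;

/** Return the plain (unvoweled) form of a string by removing the marks above. */
function stripIrab(text: string): string {
  return text.replace(IRAB_MARKS, "");
}

/** True when two strings agree on their base letters, ignoring iʿrāb. */
function sameBaseLetters(a: string, b: string): boolean {
  return stripIrab(a) === stripIrab(b);
}

// Example: kāf + kasra, tāʾ + fatḥa, alef, bāʾ + ḍammatān.
const voweled = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C";
const plain = stripIrab(voweled); // kāf, tāʾ, alef, bāʾ with no marks

console.log(plain, sameBaseLetters(voweled, plain));
```

A comparison like `sameBaseLetters` matches the stated priority: a reading that gets every base letter right but misses a vowel mark still counts as a match on the letters.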
