Deep dive May 22, 2026 · 5 min read
PaddleOCR in the browser — running 60 MB ML models without uploads
PaddleOCR is one of the most accurate OCR engines available under a permissive Apache 2.0 license. Loft runs it entirely in the browser, downloading the model weights once to your device. Here is how that works, what the engineering tradeoffs are, and why it is the right shape for privacy-sensitive OCR.
By Khine
When I first tried to run PaddleOCR in a browser tab, it took
eighteen seconds to OCR a single page of typewritten English.
The page froze for the full eighteen seconds. The
browser’s “this tab is using significant memory” warning fired.
I gave up on browser-side OCR for about a month and used
Tesseract.js instead, which is roughly half as accurate but at
least usable.
This post is the story of how we got back to PaddleOCR and made
it work — and what was actually slow the first time.
The starting state
PaddleOCR is a Chinese-origin OCR engine, released under the
Apache 2.0 license, that benchmarks at or near the top for text
recognition accuracy on
English, CJK languages, and most Latin-script European languages.
Mature, well-maintained, used in production by enterprises that
have GPUs available.
The “in production” qualifier is the important one. PaddleOCR
was designed for server-side inference on PaddlePaddle’s native
runtime, ideally with CUDA. Running it in a browser tab on a
phone’s CPU is asking the model to operate under conditions
nobody designed it for.
That’s why my first attempt took eighteen seconds.
What was actually slow
Profiling the slow path revealed three separable problems:
The model files were in PaddlePaddle’s native format. Loading
them in a JS runtime required a heavy compatibility shim that
itself was slow.
The detection model — the part that finds text regions on the
page — was the highest-quality version, which was also the
largest version (around 95 MB). It downloaded slowly on first
run.
The recognition model, the part that reads text in each
detected region, ran in WASM via ONNX Runtime Web in
single-threaded mode. Multi-threaded WASM (via
SharedArrayBuffer) was either blocked by the page’s headers
or not enabled.
Each problem was solvable individually. Together they were
producing the eighteen-second outcome.
What we did about it
Three fixes, in order:
Convert to ONNX format up front. PaddleOCR ships native
PaddlePaddle weights; ONNX Runtime Web reads ONNX weights. The
official Paddle-to-ONNX conversion tool was the bridge. We
moved the conversion out of runtime entirely — the
ppu-paddle-ocr package ships pre-converted ONNX weights from
its build pipeline.
Quantize to INT8. Full-precision (FP32) ONNX weights for
the detection + recognition pair were large. INT8 quantization
brought each per-language set down to roughly 10–15 MB at the
cost of about 2–3% accuracy on benchmark sets. The accuracy loss is
real but typically invisible for routine office documents; the
file-size win is huge.
Enable cross-origin isolation. Loft’s /tools/* pages
serve Cross-Origin-Opener-Policy: same-origin and
Cross-Origin-Embedder-Policy: credentialless. That unlocks
SharedArrayBuffer, which ONNX Runtime Web uses for
multi-threaded WASM. Inference time dropped by a factor of
three on devices with multiple cores.
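To make the third fix concrete, here is a minimal sketch of how a page might decide its WASM thread count once the two headers are in place. The helper name `pickWasmThreads` and the cap of four threads are assumptions for illustration, not Loft's actual code; `crossOriginIsolated`, `navigator.hardwareConcurrency`, and `ort.env.wasm.numThreads` are the real browser and ONNX Runtime Web names it would plug into.

```typescript
// Hypothetical helper: decide how many WASM threads to request.
// `isolated` would come from the browser global `crossOriginIsolated`,
// `cores` from `navigator.hardwareConcurrency`.
function pickWasmThreads(isolated: boolean, cores: number, maxThreads = 4): number {
  // Without cross-origin isolation SharedArrayBuffer is unavailable,
  // so multi-threaded WASM cannot start; fall back to a single thread.
  if (!isolated) return 1;
  const usable = Number.isFinite(cores) && cores >= 1 ? Math.floor(cores) : 1;
  // The cap (default 4) is an assumed tuning value, not a measured optimum.
  return Math.max(1, Math.min(usable, maxThreads));
}

// The two response headers named in the text, as a plain record that a
// server or edge config could apply to the /tools/* routes.
const isolationHeaders: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "credentialless",
};

// In a browser with onnxruntime-web loaded as `ort`, this might be wired up as:
//   ort.env.wasm.numThreads =
//     pickWasmThreads(crossOriginIsolated, navigator.hardwareConcurrency);
```

Keeping the decision in a pure function means the single-threaded fallback can be exercised without a browser.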
After all three: roughly three seconds for a single page on a
modern desktop, six on a phone. Acceptable.
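For readers who have not met INT8 quantization, the toy sketch below shows why it cuts file size by about four times and where the small accuracy loss comes from. It uses simple symmetric per-tensor quantization; that scheme and the names `quantizeInt8` and `dequantizeInt8` are assumptions for illustration. The post does not describe the actual tooling used in the build pipeline, and production quantizers are usually more careful (per-channel scales, calibration data).

```typescript
interface QuantizedTensor {
  values: Int8Array; // one byte per weight instead of four
  scale: number;     // multiply by this to recover an approximate float
}

// Symmetric per-tensor quantization: map [-maxAbs, maxAbs] onto [-127, 127].
function quantizeInt8(weights: Float32Array): QuantizedTensor {
  let maxAbs = 0;
  weights.forEach((w) => {
    maxAbs = Math.max(maxAbs, Math.abs(w));
  });
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const values = new Int8Array(weights.length);
  weights.forEach((w, i) => {
    values[i] = Math.max(-127, Math.min(127, Math.round(w / scale)));
  });
  return { values, scale };
}

// Recover approximate floats; each weight is off by at most about half a step.
function dequantizeInt8(q: QuantizedTensor): Float32Array {
  const out = new Float32Array(q.values.length);
  q.values.forEach((v, i) => {
    out[i] = v * q.scale;
  });
  return out;
}
```

The rounding error per weight is bounded by roughly half of `scale`, which is the source of the benchmark accuracy drop mentioned above.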
The architecture now
Three pieces, executing as a pipeline:
The detection model identifies bounding boxes around text
regions on the input image. Output: list of polygons.
For each polygon, the image gets cropped and passed to the
recognition model, which produces the text content. Output:
list of (polygon, text) pairs.
The post-processing step assembles the output into a structured
representation — usually plain text in reading order, sometimes
with positional metadata for tools that need it (e.g. the OCR
PDF tool, which embeds extracted text back into the PDF at the
correct coordinates).
All three pieces run inside a Web Worker so the main thread
stays responsive while OCR runs.
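The post-processing stage is the least obvious of the three, so here is a rough sketch of one way to turn (polygon, text) pairs into plain text in reading order. The type names and the row-grouping rule are assumptions for illustration, and the sketch only handles single-column, left-to-right layouts; it is not a description of Loft's actual post-processor.

```typescript
type Point = { x: number; y: number };
// One recognized region: polygon from the detection stage, text from recognition.
type OcrRegion = { polygon: Point[]; text: string };
type Box = { text: string; left: number; top: number; bottom: number };

// Reduce a polygon to the axis-aligned edges needed for ordering.
function toBox(region: OcrRegion): Box {
  const xs = region.polygon.map((p) => p.x);
  const ys = region.polygon.map((p) => p.y);
  return {
    text: region.text,
    left: Math.min(...xs),
    top: Math.min(...ys),
    bottom: Math.max(...ys),
  };
}

// Group boxes into visual rows (top to bottom), then order each row left to right.
function assembleReadingOrder(regions: OcrRegion[]): string {
  const boxes = regions.map(toBox).sort((a, b) => a.top - b.top);
  const rows: Box[][] = [];
  for (const box of boxes) {
    const row = rows.length > 0 ? rows[rows.length - 1] : undefined;
    const anchor = row !== undefined ? row[0] : undefined;
    // A box joins the current row if it starts above the vertical midpoint
    // of that row's first box; otherwise it opens a new row.
    if (row !== undefined && anchor !== undefined && box.top < (anchor.top + anchor.bottom) / 2) {
      row.push(box);
    } else {
      rows.push([box]);
    }
  }
  return rows
    .map((row) => row.sort((a, b) => a.left - b.left).map((b) => b.text).join(" "))
    .join("\n");
}
```

A tool that needs positional metadata would keep the `Box` values instead of joining them into a string.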
What I underestimated
The first-visit model download is the single biggest UX hurdle.
Downloading 10–15 MB per language on a 4G connection causes a
noticeable pause; on a slower connection it takes longer. We mitigate by
lazy-loading the model only when the OCR tool is opened (not on
first visit to any Loft page) and by showing a clear progress
indicator during download. We also offer Tesseract.js as a
fallback for users who don’t want the PaddleOCR weight.
The mitigation works, but there’s no escaping that “before the
tool works, your phone needs to download the model weights” is a
worse first-run experience than “this tool just works.” Native
OCR apps don’t have this problem because the user downloads them
upfront via the app store.
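The lazy-loading mitigation described above can be sketched roughly as follows. The names `createLazyLoader` and `progressFraction` are hypothetical, and the download function is injected rather than shown, so this illustrates the pattern only and is not Loft's loader.

```typescript
// Wrap a download so it starts only on first use and is shared afterwards.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = load().catch((err: unknown) => {
        pending = undefined; // let a later call retry after a failed download
        throw err;
      });
    }
    return pending;
  };
}

// Fraction in [0, 1] for a progress bar, or null when the total size is
// unknown (for example, no Content-Length header), in which case the UI
// would show an indeterminate indicator instead.
function progressFraction(receivedBytes: number, totalBytes: number | null): number | null {
  if (totalBytes === null || totalBytes <= 0) return null;
  return Math.min(1, Math.max(0, receivedBytes / totalBytes));
}
```

Calling the returned function when the OCR tool opens, and not at page load, keeps the model download off every other page; repeated calls reuse the same in-flight promise.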
What I’d do differently
Two things, with the benefit of hindsight:
Ship the Tesseract fallback first. The lower-accuracy
fallback is more than good enough for most users’ actual
documents. Shipping it as the default and letting users opt
into PaddleOCR for higher accuracy would have given us a
better day-one experience and a reason to download the heavy
model only when the user asked for the upgrade.
Cache the model more aggressively. The current caching is
service-worker-driven and works, but I’ve heard of (and seen,
in my own browsing data) cases where the cache gets evicted
on iOS Safari after a few weeks of disuse. The next iteration
should request persistent storage for the model via
StorageManager.persist().
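A minimal sketch of that persistence request, assuming it is called once after the model has been cached. The `PersistableStorage` interface is a hypothetical stand-in that lists only the two `StorageManager` methods used, so the function can be tried outside a browser; in a page the argument would be `navigator.storage`. The browser may still decline the request, so the result has to be treated as a hint.

```typescript
// The subset of the StorageManager API this sketch relies on.
interface PersistableStorage {
  persisted(): Promise<boolean>;
  persist(): Promise<boolean>;
}

// Returns true if the origin's storage is (now) marked persistent.
async function ensurePersistentStorage(
  storage: PersistableStorage | undefined,
): Promise<boolean> {
  if (storage === undefined) return false; // API not available in this browser
  if (await storage.persisted()) return true; // already granted earlier
  return storage.persist(); // may resolve to false if the browser declines
}

// Browser usage (assumption: called after the model is cached):
//   const kept = await ensurePersistentStorage(navigator.storage);
```

When the result is `false`, the existing service-worker cache still works; it just remains eligible for eviction.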
What’s still hard
Handwritten text. Loft’s PaddleOCR setup is excellent for
printed text in supported languages; it’s meaningfully behind
cloud-managed OCR on handwritten documents, particularly
cursive English and free-form notes. The training data gap is
real and we don’t have a path to close it without a
fundamentally different model.
Rare scripts. The general-purpose PaddleOCR weights handle
common Latin and CJK scripts. For Arabic, Hebrew, Thai, or
Devanagari we fall back to Tesseract. It is the same training-data gap.
Very large documents on phones. Each page is its own
inference cycle, and on a memory-constrained device the
cumulative state can hit the tab ceiling before the document
finishes. Mitigation: process serially with explicit cleanup,
warn the user before starting.
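The "process serially with explicit cleanup" mitigation might look roughly like this. The `PageHandle` interface and the injected `loadPage` and `recognize` functions are hypothetical placeholders; the point of the sketch is the `try`/`finally` shape that keeps only one page's data alive at a time.

```typescript
// Anything holding per-page memory (bitmap, tensors) that can be freed.
interface PageHandle {
  release(): void;
}

// Run OCR one page at a time, releasing each page before loading the next.
async function ocrSerially<P extends PageHandle, R>(
  pageCount: number,
  loadPage: (index: number) => Promise<P>,
  recognize: (page: P) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < pageCount; i++) {
    const page = await loadPage(i);
    try {
      results.push(await recognize(page));
    } finally {
      page.release(); // runs even if recognition throws
    }
  }
  return results;
}
```

Running pages in parallel would finish sooner on a desktop, but on a memory-constrained phone the serial loop trades speed for a bounded peak footprint.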
PaddleOCR’s repo: github.com/PaddlePaddle/PaddleOCR.
ONNX Runtime Web docs:
onnxruntime.ai/docs/tutorials/web/.
The pillar at /docs/how-it-works/ covers
the ML stack briefly in §4.