Peru Prescription OCR

1. Executive Summary

KEY FINDING: No public dataset of Peruvian prescriptions was found on HuggingFace or Kaggle. The client's 1-year historical corpus (~100k labeled pairs) is therefore the critical training asset. The recommended model stack is PP-OCRv5 + Donut.

PP-OCRv5 (PaddleOCR): Primary OCR engine - 70M params, Apache 2.0, runs on CPU, SOTA handwriting accuracy
Donut: Structured JSON extraction - image-to-JSON, 200M params, MIT license
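Claude Vision: Fallback for low-confidence extractions (100% of images during the pilot, ~5% at steady state)
Hybrid architecture: Reduces per-image cost from $0.04 (manual transcription) toward near-zero at scale, since only low-confidence images reach the paid fallback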
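
The cost claim above can be made concrete with a small calculation. This is a minimal sketch, not the project's cost model: the $0.04 manual figure and the 100% / ~5% fallback rates come from the summary, while `ASSUMED_API_COST` and the zero default for `local_cost` are hypothetical placeholders to be replaced with real billing and hosting figures.

```python
MANUAL_COST_PER_IMAGE = 0.04  # USD, manual transcription cost quoted in the summary


def blended_cost(fallback_rate: float, api_cost: float, local_cost: float = 0.0) -> float:
    """Expected per-image cost when only a fraction of images hit the paid fallback."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return local_cost + fallback_rate * api_cost


# Hypothetical API price per image (an assumption, not a quoted figure).
ASSUMED_API_COST = 0.01

pilot = blended_cost(1.0, ASSUMED_API_COST)    # every image goes to the fallback
steady = blended_cost(0.05, ASSUMED_API_COST)  # ~5% of images go to the fallback

print(f"manual: ${MANUAL_COST_PER_IMAGE:.4f}  pilot: ${pilot:.4f}  steady state: ${steady:.4f}")
```

Whatever the real API price turns out to be, the steady-state cost scales linearly with the fallback rate, so lowering that rate through fine-tuning is the main lever on per-image cost.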

2. HuggingFace Model Recommendations

Tier 1 - Production Models (Recommended)

Property	PP-OCRv5	Donut (Medical)
HuggingFace ID	PaddlePaddle/PP-OCRv5_server_det + PaddlePaddle/PP-OCRv5_server_rec	chinmays18/medical-prescription-ocr
Parameters	~70M total	~200M
License	Apache 2.0	MIT
Spanish	Yes - 100+ languages	Via fine-tuning
Handwriting	★★★★★ SOTA	★★★★ 84% word-level
Output	Text + bounding boxes	Direct structured JSON
GPU	CPU OK (370+ char/sec)	1x L4/A10G
Key Advantage	Purpose-built for noisy handwriting	End-to-end image-to-JSON

Tier 2 - Strong Alternatives

Model	HuggingFace ID	Params	License	Best For
TrOCR Prescription	aci-mis-team/trocr-large-handwritten-prescription	560M	MIT	Prescription handwriting
GOT-OCR2	GOT-OCR/GOT-OCR2	1B+	Check	General document OCR
RolmOCR	reducto-ai/RolmOCR	7B	Apache 2.0	Document transcription
Florence-2	microsoft/Florence-2-large	0.77B	MIT	Few-shot extraction
Qwen2.5-VL	Qwen/Qwen2.5-VL-7B-Instruct	7B	Apache 2.0	Joint OCR + interpretation
Claude Vision	Anthropic API (Sonnet 4.6)	API	API	Fallback: pilot (100% of images), steady state (~5%)

Recommended Architecture

Phone Photo --> Pre-processing (deskew, crop, enhance)
        |
        v
PP-OCRv5 (text detection + recognition)
        |
        v
Donut fine-tuned (structured JSON extraction)
        |  (if confidence < 0.85)
        v
Claude Vision (fallback)
        |
        v
Master file matching (product/doctor/outlet via rapidfuzz)
        |
        v
Postgres --> Weekly Excel report
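
The routing and matching steps of this architecture can be sketched as follows. This is an illustrative sketch under stated assumptions: the 0.85 threshold comes from the diagram, but `Extraction`, `route`, `match_master`, and the 0.8 match score are hypothetical names and values. The standard library's `difflib` stands in for rapidfuzz here so the example runs without extra dependencies; it is not the library the report recommends.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.85  # threshold shown in the architecture diagram


@dataclass
class Extraction:
    fields: dict
    confidence: float
    source: str  # which stage produced the result (hypothetical label)


def route(primary: Extraction, fallback: Callable[[], Extraction]) -> Extraction:
    """Keep the primary result unless its confidence is below the threshold."""
    if primary.confidence < CONFIDENCE_THRESHOLD:
        return fallback()  # only called when needed, so the paid API is not hit otherwise
    return primary


def match_master(name: str, master: list, min_score: float = 0.8) -> Optional[str]:
    """Return the closest master-file entry, or None if nothing scores high enough."""
    best, best_score = None, 0.0
    for candidate in master:
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= min_score else None


def fake_fallback() -> Extraction:
    # Stand-in for a vision API call; returns a fixed result for the example.
    return Extraction({"drug": "Amoxicilina"}, 0.99, "fallback")


low = Extraction({"drug": "Amoxicilna"}, 0.62, "donut")
high = Extraction({"drug": "Ibuprofeno"}, 0.93, "donut")
products = ["Amoxicilina", "Ibuprofeno"]

print(route(low, fake_fallback).source)
print(route(high, fake_fallback).source)
print(match_master("Amoxicilna", products))
```

Passing the fallback as a callable keeps the expensive call lazy: it runs only for images that fall below the threshold.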

Full Comparison Matrix

Model	Params	License	Handwriting	Spanish	JSON	GPU
PP-OCRv5	70M	Apache 2.0	★★★★★	Yes	Pipeline	CPU
Donut	200M	MIT	★★★★	Fine-tune	Direct	L4
TrOCR	560M	MIT	★★★★★	Fine-tune	Text	T4
GOT-OCR2	1B+	Check	★★★	Yes	Prompt	A10G
RolmOCR	7B	Apache	★★★	Yes	Prompt	A10G
Florence-2	0.77B	MIT	★★★	Yes	Prompt	A100
Claude	API	API	★★★★	Yes	Prompt	N/A

3. Peru Prescription Format (Receta Única Estandarizada)

Regulated by DIGEMID under MINSA. Official template: DIGEMID Model (PDF)

Standard Fields

Section	Fields	Notes
Institution	Logo, name, address	Printed/stamped
Patient	Nombres, DNI, Edad, Sexo, HC	Often handwritten
Diagnosis	Diagnóstico, CIE-10, Especialidad	ICD-10 codes
Medications	DCI, Concentración, Forma, Dosis, Frecuencia	Core extraction target
Dates	Fecha Expedición, Fecha Validez	dd/mm/yyyy
Prescriber	Nombre, CMP, Firma, Sello	Stamp overlaps text
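
The fields in the table above suggest a target schema for the structured JSON output. The sketch below is one possible shape, not the report's actual schema: the class names, the snake_case attribute names, and the sample values are all hypothetical. Only the field list and the dd/mm/yyyy date format come from the table.

```python
from dataclasses import dataclass, field, asdict
from datetime import date, datetime
from typing import Optional


def parse_rue_date(text: str) -> date:
    """Parse a dd/mm/yyyy date as printed on the prescription."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


@dataclass
class Medication:
    dci: str                 # generic (international nonproprietary) name
    concentracion: str = ""
    forma: str = ""
    dosis: str = ""
    frecuencia: str = ""


@dataclass
class Prescription:
    paciente_nombres: str = ""
    paciente_dni: str = ""
    diagnostico: str = ""
    cie10: str = ""
    medicamentos: list = field(default_factory=list)
    fecha_expedicion: Optional[date] = None
    fecha_validez: Optional[date] = None
    prescriptor_nombre: str = ""
    prescriptor_cmp: str = ""


# Illustrative values only; not taken from any real prescription.
rx = Prescription(
    diagnostico="Faringitis aguda",
    medicamentos=[Medication(dci="Amoxicilina", concentracion="500 mg", forma="tableta")],
    fecha_expedicion=parse_rue_date("05/03/2025"),
)

print(asdict(rx)["medicamentos"][0]["dci"], rx.fecha_expedicion.isoformat())
```

Parsing dates into `date` objects at extraction time surfaces malformed values early, since `strptime` raises `ValueError` on anything that is not dd/mm/yyyy.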

Institutional Variations

Institution	Characteristics	Volume
Hospital de la Solidaridad	City coat of arms, standardized RUE, mostly handwritten	~20%
MINSA Hospitals	MINSA logo, standard RUE, varies by region	~20%
EsSalud	Printed/digital forms, barcodes, insurance number	~15%
Private Clinics	Custom letterhead, same required fields	~15%
Generic Pads	Plain pads, fully handwritten	~15%

OCR Challenges

Challenge	Description
Poor Handwriting	Illegible doctor handwriting with personal shorthand
Phone Distortion	Perspective skew, rotation from quick pharmacy photos
Lighting	Shadows, reflections, flash glare
Compression	WhatsApp JPEG compression reduces detail
Stamp Overlap	Doctor's stamp overlaps medication text
Abbreviations	tab, cap, gts, c/8h, VO, IM, amp
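
The abbreviations in the last row can be normalized after OCR with a small lookup. This is a minimal sketch: the expansions are the common Spanish readings of these abbreviations and should be checked against the client's own vocabulary, and the names `ABBREVIATIONS` and `expand` are hypothetical.

```python
import re

# Common Spanish readings of the abbreviations listed above (to be confirmed
# against the client's vocabulary file).
ABBREVIATIONS = {
    "tab": "tableta",
    "cap": "cápsula",
    "gts": "gotas",
    "vo": "vía oral",
    "im": "intramuscular",
    "amp": "ampolla",
}

FREQUENCY = re.compile(r"\bc/(\d+)\s*h\b", re.IGNORECASE)  # e.g. c/8h
TOKEN = re.compile(r"\b[A-Za-z]+\b")


def expand(text: str) -> str:
    """Expand dosing-frequency shorthand, then whole-word abbreviations."""
    text = FREQUENCY.sub(lambda m: f"cada {m.group(1)} horas", text)
    return TOKEN.sub(lambda m: ABBREVIATIONS.get(m.group(0).lower(), m.group(0)), text)


expanded = expand("Amoxicilina 500mg tab c/8h VO")
print(expanded)  # Amoxicilina 500mg tableta cada 8 horas vía oral
```

Matching whole tokens only avoids rewriting fragments inside longer words, such as the `amp` in a drug name.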

4. Training Datasets

CRITICAL GAP: No public dataset of Peruvian prescriptions was found. The available datasets are in English or Bengali and do not follow Latin American prescription formats.

Dataset	Source	Images	Language
chinmays18/medical-prescription-dataset	HuggingFace	1,000 synthetic	English
avi-kai/Medical_Prescription_Handwritten_Words	HuggingFace	~40	English
RxHandBD	Mendeley	5,578	Bengali/EN
Doctor's Handwritten Prescription BD	Kaggle	~500+	Bengali/EN

Data Strategy

Source	Description
1. Client Historical Corpus	~100k labeled pairs from 1 year of manual transcription (PRIMARY)
2. Synthetic Generation	Generate synthetic Peruvian prescription images
3. Manus Web Collection	47 real images within a 500-entry catalog (COMPLETED)
4. Active Learning Loop	Human corrections feed back into training data
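
The active learning loop in the last row can be sketched as two steps: pick the least confident predictions for human review, then fold the corrections back into the training set. This is an illustration under assumptions, not the project's implementation: the function names, the record layout, the review budget, and the reuse of 0.85 as the review threshold are all hypothetical.

```python
def select_for_review(predictions, threshold=0.85, budget=100):
    """Pick the least confident predictions, up to the review budget."""
    uncertain = [p for p in predictions if p["confidence"] < threshold]
    return sorted(uncertain, key=lambda p: p["confidence"])[:budget]


def apply_corrections(training_set, corrections):
    """Merge human-corrected labels keyed by image id; corrections overwrite old labels."""
    merged = dict(training_set)
    merged.update(corrections)
    return merged


# Illustrative records only.
predictions = [
    {"image_id": "rx_0001", "confidence": 0.95},
    {"image_id": "rx_0002", "confidence": 0.40},
    {"image_id": "rx_0003", "confidence": 0.70},
]
queue = select_for_review(predictions, budget=2)
training = apply_corrections(
    {"rx_0002": "Amoxicilna"},
    {"rx_0002": "Amoxicilina", "rx_0003": "Ibuprofeno"},
)

print([p["image_id"] for p in queue])
print(training)
```

Sorting by ascending confidence spends the review budget on the images the model is least sure about, which is usually where a correction adds the most training signal.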

5. Manus AI Data Collection (Completed)

Task ID: NXRJM6HC3JMMh6TuS3nbNB | Agent: manus-1.6-max

The Manus agent collected 47 real prescription images from 10 source categories (including Scribd, SlideShare, Studocu, government portals, news sites, and social media) and added 453 synthetic metadata entries, for a total of 500 catalog entries.

Deliverable	Size	Contents
peru_rx_dataset.zip	5.3 MB	47 images (rx_0001-rx_0047) in JPG/PNG/WebP
prescription_catalog.csv	104 KB	500 entries with full metadata columns
peru_rx_vocabulary.md	19 KB	Top 100 medications, abbreviations, CIE-10 codes
template_analysis.md	11 KB	RUE template analysis by institution type
sources_summary.md	6.2 KB	10 source categories with result counts
