1. Executive Summary
KEY FINDING: No dataset of Peruvian prescriptions exists on HuggingFace or Kaggle. The client's one-year historical corpus (~100k labeled pairs) is therefore the critical training asset. The recommended model stack is PP-OCRv5 + Donut.
- PP-OCRv5 (PaddleOCR): Primary OCR engine - 70M params, Apache 2.0, runs on CPU, SOTA handwriting accuracy
- Donut: Structured JSON extraction - image-to-JSON, 200M params, MIT license
- Claude Vision: Fallback for low-confidence extractions (handles 100% of images during the pilot, ~5% at steady state)
- Hybrid architecture: Reduces per-image cost from $0.04 (manual) to near-zero at scale
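The cost claim in the last bullet is blended arithmetic over the fallback rate. The following is a minimal sketch, not a measured cost model: only the $0.04 manual figure and the 100% / ~5% fallback rates come from this report, while `LOCAL_COST`, `FALLBACK_COST`, and the function name are hypothetical placeholders. It also assumes every image runs through the local models first.

```python
def blended_cost_per_image(fallback_rate: float,
                           local_cost: float,
                           fallback_cost: float) -> float:
    """Expected cost per image when every image runs through the local
    models and a fraction `fallback_rate` is also sent to the API fallback."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return local_cost + fallback_rate * fallback_cost


MANUAL_COST = 0.04      # from the report: manual transcription, USD per image
LOCAL_COST = 0.0005     # ASSUMPTION: placeholder cost of local inference
FALLBACK_COST = 0.01    # ASSUMPTION: placeholder API cost per image

pilot = blended_cost_per_image(1.00, LOCAL_COST, FALLBACK_COST)
steady = blended_cost_per_image(0.05, LOCAL_COST, FALLBACK_COST)

print(f"manual: ${MANUAL_COST:.4f} per image")
print(f"pilot:  ${pilot:.4f} per image")
print(f"steady: ${steady:.4f} per image")
```

Replacing the two placeholder costs with measured values gives the actual savings per phase.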
2. HuggingFace Model Recommendations
Tier 1 - Production Models (Recommended)
| Property | PP-OCRv5 | Donut (Medical) |
| HuggingFace ID | PaddlePaddle/PP-OCRv5_server_det/rec | chinmays18/medical-prescription-ocr |
| Parameters | ~70M total | ~200M |
| License | Apache 2.0 | MIT |
| Spanish | Yes - 100+ languages | Via fine-tuning |
| Handwriting | ★★★★★ SOTA | ★★★★ (84% word-level accuracy) |
| Output | Text + bounding boxes | Direct structured JSON |
| GPU | CPU OK (370+ char/sec) | 1x L4/A10G |
| Key Advantage | Purpose-built for noisy handwriting | End-to-end image-to-JSON |
Tier 2 - Alternatives (Strong)
| Model | HuggingFace ID | Params | License | Best For |
| TrOCR Prescription | aci-mis-team/trocr-large-handwritten-prescription | 560M | MIT | Prescription handwriting |
| GOT-OCR2 | GOT-OCR/GOT-OCR2 | 1B+ | Check | General document OCR |
| RolmOCR | reducto-ai/RolmOCR | 7B | Apache 2.0 | Document transcription |
| Florence-2 | microsoft/Florence-2-large | 0.77B | MIT | Few-shot extraction |
| Qwen2.5-VL | Qwen/Qwen2.5-VL-7B-Instruct | 7B | Apache 2.0 | Joint OCR + interpretation |
| Claude Vision | Anthropic API (Sonnet 4.6) | API | API | Fallback: 100% of images in pilot, ~5% at steady state |
Recommended Architecture
Phone Photo --> Pre-processing (deskew, crop, enhance)
|
v
PP-OCRv5 (text detection + recognition)
|
v
Donut fine-tuned (structured JSON extraction)
| (if confidence < 0.85)
v
Claude Vision (fallback)
|
v
Master file matching (product/doctor/outlet via rapidfuzz)
|
v
Postgres --> Weekly Excel report
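The confidence gate between Donut and the Claude Vision fallback can be sketched as below. This is an illustration under stated assumptions: the 0.85 threshold comes from the diagram, while `Extraction`, `route`, and the two stub extractors are hypothetical names standing in for the real model calls.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.85  # from the architecture diagram


@dataclass
class Extraction:
    fields: dict
    confidence: float
    source: str  # which extractor produced the result


def route(image_path: str,
          primary: Callable[[str], Extraction],
          fallback: Callable[[str], Extraction],
          threshold: float = CONFIDENCE_THRESHOLD) -> Extraction:
    """Run the primary extractor; call the fallback only when the
    primary confidence is below the threshold."""
    result = primary(image_path)
    if result.confidence < threshold:
        return fallback(image_path)
    return result


# Stub extractors with made-up outputs, standing in for the real models.
def fake_donut(path: str) -> Extraction:
    confidence = 0.60 if "blurry" in path else 0.93
    return Extraction({"dci": "amoxicilina"}, confidence, "donut")


def fake_fallback(path: str) -> Extraction:
    return Extraction({"dci": "amoxicilina"}, 0.99, "fallback")


print(route("rx_clear.jpg", fake_donut, fake_fallback).source)
print(route("rx_blurry.jpg", fake_donut, fake_fallback).source)
```

Passing the extractors in as callables keeps the gate testable without loading either model, and setting the threshold above 1.0 would send every image to the fallback, as in the pilot phase.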
Full Comparison Matrix
| Model | Params | License | Handwriting | Spanish | JSON | GPU |
| PP-OCRv5 | 70M | Apache 2.0 | ★★★★★ | Yes | Pipeline | CPU |
| Donut | 200M | MIT | ★★★★ | Fine-tune | Direct | L4 |
| TrOCR | 560M | MIT | ★★★★★ | Fine-tune | Text | T4 |
| GOT-OCR2 | 1B+ | Check | ★★★ | Yes | Prompt | A10G |
| RolmOCR | 7B | Apache 2.0 | ★★★ | Yes | Prompt | A10G |
| Florence-2 | 0.77B | MIT | ★★★ | Yes | Prompt | A100 |
| Claude | API | API | ★★★★ | Yes | Prompt | N/A |
3. Peru Prescription Format (Receta Unica Estandarizada)
The format is regulated by DIGEMID (the national medicines authority) under MINSA (the Ministry of Health). Official template: DIGEMID Model (PDF)
Standard Fields
| Section | Fields | Notes |
| Institution | Logo, name, address | Printed/stamped |
| Patient | Nombres, DNI, Edad, Sexo, HC | Often handwritten |
| Diagnosis | Diagnostico, CIE-10, Especialidad | ICD-10 codes |
| Medications | DCI, Concentracion, Forma, Dosis, Frecuencia | Core extraction target |
| Dates | Fecha Expedicion, Fecha Validez | dd/mm/yyyy |
| Prescriber | Nombre, CMP, Firma, Sello | Stamp overlaps text |
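The fields above map naturally onto a typed record that the extraction step can fill and validate. The sketch below is one possible shape, not the project's actual schema: the class names, attribute names, and validation rules are assumptions, including the 8-digit DNI check. Only the field list and the dd/mm/yyyy date format come from the table.

```python
import re
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List, Optional


@dataclass
class Medication:
    dci: str
    concentracion: Optional[str] = None
    forma: Optional[str] = None
    dosis: Optional[str] = None
    frecuencia: Optional[str] = None


@dataclass
class Prescription:
    paciente_nombres: Optional[str] = None
    paciente_dni: Optional[str] = None
    cie10: Optional[str] = None
    fecha_expedicion: Optional[date] = None
    fecha_validez: Optional[date] = None
    prescriptor_cmp: Optional[str] = None
    medicamentos: List[Medication] = field(default_factory=list)


def parse_rx_date(text: str) -> Optional[date]:
    """Parse a dd/mm/yyyy date; return None if the text is not a valid date."""
    try:
        return datetime.strptime(text.strip(), "%d/%m/%Y").date()
    except ValueError:
        return None


def validate(rx: Prescription) -> List[str]:
    """Return a list of human-readable problems (empty when none are found)."""
    problems = []
    # ASSUMPTION: a DNI is exactly 8 digits.
    if rx.paciente_dni is not None and not re.fullmatch(r"[0-9]{8}", rx.paciente_dni):
        problems.append("DNI should be 8 digits")
    if (rx.fecha_expedicion and rx.fecha_validez
            and rx.fecha_validez < rx.fecha_expedicion):
        problems.append("Fecha Validez is earlier than Fecha Expedicion")
    if not rx.medicamentos:
        problems.append("no medications extracted")
    return problems


rx = Prescription(
    paciente_dni="1234567",  # made-up value, one digit short
    fecha_expedicion=parse_rx_date("05/03/2025"),
    fecha_validez=parse_rx_date("01/03/2025"),
)
for problem in validate(rx):
    print(problem)
```

Records that fail validation are natural candidates for the fallback or for human review.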
Institutional Variations
| Institution | Characteristics | Volume |
| Hospital de la Solidaridad | City coat of arms, standardized RUE, mostly handwritten | ~20% |
| MINSA Hospitals | MINSA logo, standard RUE, varies by region | ~20% |
| EsSalud | Printed/digital forms, barcodes, insurance number | ~15% |
| Private Clinics | Custom letterhead, same required fields | ~15% |
| Generic Pads | Plain pads, fully handwritten | ~15% |
OCR Challenges
| Challenge | Description |
| Poor Handwriting | Illegible doctor handwriting with personal shorthand |
| Phone Distortion | Perspective skew, rotation from quick pharmacy photos |
| Lighting | Shadows, reflections, flash glare |
| Compression | WhatsApp JPEG compression reduces detail |
| Stamp Overlap | Doctor's stamp overlaps medication text |
| Abbreviations | tab, cap, gts, c/8h, VO, IM, amp |
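Abbreviations are the one challenge in this table that can be handled by post-processing the OCR text. A minimal sketch follows, assuming the common Spanish expansions shown in `ABBREVIATIONS`; these expansions, the regular expressions, and the `expand` function are illustrative and should be checked against the client's own vocabulary.

```python
import re

# ASSUMPTION: common expansions; confirm against the client's vocabulary.
ABBREVIATIONS = {
    "tab": "tableta",
    "cap": "cápsula",
    "gts": "gotas",
    "vo": "vía oral",
    "im": "intramuscular",
    "amp": "ampolla",
}

# Frequency shorthand such as "c/8h" or "c / 12 h".
_FREQ = re.compile(r"\bc\s*/\s*(\d{1,2})\s*h\b", re.IGNORECASE)
# A plain ASCII word, optionally followed by an abbreviation period.
_TOKEN = re.compile(r"\b([A-Za-z]+)\b\.?")


def expand(text: str) -> str:
    """Expand frequency shorthand and known abbreviations in OCR text.
    Plurals are not inflected ("2 tab" becomes "2 tableta")."""
    text = _FREQ.sub(lambda m: f"cada {m.group(1)} horas", text)

    def replace_token(m: re.Match) -> str:
        # Unknown words are returned unchanged, including any trailing period.
        return ABBREVIATIONS.get(m.group(1).lower(), m.group(0))

    return _TOKEN.sub(replace_token, text)


print(expand("Amoxicilina 500 mg cap VO c/8h"))
print(expand("1 tab. c/12h"))
```

Expanding after OCR rather than before training keeps the labels faithful to what is written on the page.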
4. Training Datasets
CRITICAL GAP: No dataset of Peruvian prescriptions exists. All available datasets are in English or Bengali and follow non-Latin-American prescription formats.
| Dataset | Source | Images | Language |
| chinmays18/medical-prescription-dataset | HuggingFace | 1,000 synthetic | English |
| avi-kai/Medical_Prescription_Handwritten_Words | HuggingFace | ~40 | English |
| RxHandBD | Mendeley | 5,578 | Bengali/EN |
| Doctor's Handwritten Prescription BD | Kaggle | ~500+ | Bengali/EN |
Data Strategy
| Source | Description |
| 1. Client Historical Corpus | ~100k labeled pairs from 1 year of manual transcription (PRIMARY) |
| 2. Synthetic Generation | Generate synthetic Peruvian prescription images |
| 3. Manus Web Collection | 47 real images + 500 catalog entries (COMPLETED) |
| 4. Active Learning Loop | Human corrections feed back into training data |
5. Manus AI Data Collection (COMPLETED)
Task ID: NXRJM6HC3JMMh6TuS3nbNB | Agent: manus-1.6-max
The Manus agent collected 47 real prescription images across 10 source categories (including Scribd, SlideShare, Studocu, government portals, news sites, and social media) and generated 453 synthetic metadata entries, for a total of 500 catalog entries.
| Deliverable | Size | Contents |
| peru_rx_dataset.zip | 5.3 MB | 47 images (rx_0001-rx_0047) in JPG/PNG/WebP |
| prescription_catalog.csv | 104 KB | 500 entries with full metadata columns |
| peru_rx_vocabulary.md | 19 KB | Top 100 medications, abbreviations, CIE-10 codes |
| template_analysis.md | 11 KB | RUE template analysis by institution type |
| sources_summary.md | 6.2 KB | 10 source categories with result counts |
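The vocabulary and catalog deliverables can seed the master-file matching step in the recommended architecture. The sketch below uses the standard library's `difflib` as a stand-in for rapidfuzz so that it runs without extra dependencies; a production version would more likely call rapidfuzz's `process.extractOne` with a `score_cutoff`. The `best_match` helper, the 0.8 cutoff, and the sample vocabulary are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def best_match(query: str,
               candidates: Iterable[str],
               min_score: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return the closest candidate and its similarity ratio (0..1),
    or None when no candidate reaches min_score."""
    q = query.strip().lower()
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        score = SequenceMatcher(None, q, candidate.lower()).ratio()
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is None or best[1] < min_score:
        return None
    return best


# Made-up sample entries; real ones would be loaded from peru_rx_vocabulary.md.
VOCAB = ["Amoxicilina", "Paracetamol", "Ibuprofeno", "Metformina"]

print(best_match("amoxicilna", VOCAB))  # OCR output with a dropped letter
print(best_match("xyz", VOCAB))         # nothing close enough
```

Returning `None` below the cutoff, rather than the nearest candidate, lets unmatched names go to human review instead of being silently mapped to the wrong product.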