Example using cairo and Pango (Linux/macOS):
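    from pypdf import PdfReader

    # Print the extracted text of every page.
    reader = PdfReader("khmer_document.pdf")
    for page in reader.pages:
        print(page.extract_text())

Khmer requires reordering of vowels and diacritics at render time. For proper shaping, use a HarfBuzz-based pipeline (for example weasyprint, or cairo with Pango). pyftsubset only subsets font files and does no shaping on its own.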
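To make the reordering point concrete, here is a minimal sketch using only the standard library. Unicode stores Khmer in logical order, so the vowel sign E (U+17C1) comes after its consonant in the string even though it is drawn to the left of it; the shaping engine does that reordering when the text is rendered. The helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Consonant KA followed by vowel sign E: the vowel is stored second,
# although a correctly shaped rendering draws it to the left of KA.
syllable = "\u1780\u17C1"
for code, name in describe(syllable):
    print(code, name)
```

A renderer that skips shaping draws the glyphs in this stored order, which is why the vowel ends up on the wrong side of the consonant.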
Use weasyprint or xhtml2pdf with HTML/CSS that already handles Khmer shaping.

2. Extracting Text from Khmer PDFs

Using PyMuPDF (fitz)

PyMuPDF handles Khmer Unicode extraction well.
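A minimal sketch of what extraction with PyMuPDF might look like, assuming the `pymupdf` package is installed. The function names `extract_page_texts` and `khmer_share` are hypothetical helpers introduced here, not part of PyMuPDF. The second helper is a rough sanity check: a low share of Khmer characters on a page you expect to be Khmer may suggest the PDF is a scan or uses a non-Unicode font encoding, though that is only a heuristic.

```python
def extract_page_texts(path):
    """Return the plain text of each page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so khmer_share works without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def khmer_share(text):
    """Fraction of non-whitespace characters in the Khmer block (U+1780-U+17FF)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    khmer = sum(0x1780 <= ord(ch) <= 0x17FF for ch in chars)
    return khmer / len(chars)
```

Typical use would be to loop over `extract_page_texts("khmer_document.pdf")` and flag any page whose `khmer_share` is unexpectedly low for a manual look or an OCR pass.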
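    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    # Register a Unicode TrueType font that covers Khmer (KhmerOS.ttf must be on disk).
    pdf.add_font('khmer', '', 'KhmerOS.ttf', uni=True)
    pdf.set_font('khmer', size=12)
    # Write one line of Khmer text, then move to the next line.
    pdf.cell(0, 10, txt="ជំរាបសួរ", ln=1)
    pdf.output("fpdf_khmer.pdf")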
    create_khmer_report("data.yaml", "report.pdf")

This guide gives you a complete foundation for handling Khmer PDF tasks, from creation and extraction to rendering and OCR. Always test with real Khmer text, and use fonts that cover the full Khmer Unicode range (U+1780 to U+17FF, plus Khmer Symbols at U+19E0 to U+19FF).
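The font-coverage advice can be checked programmatically. The sketch below is one possible approach, not a definitive tool: it lists the assigned code points in the two Khmer blocks according to Python's own Unicode database and reports which of them a font does not map. All function names are hypothetical, and `font_codepoints` assumes the `fontTools` package is installed.

```python
import unicodedata

# Khmer (U+1780-U+17FF) and Khmer Symbols (U+19E0-U+19FF) blocks.
KHMER_BLOCKS = [(0x1780, 0x17FF), (0x19E0, 0x19FF)]


def assigned_khmer_codepoints():
    """Code points in the Khmer blocks that Python's Unicode database names."""
    return [
        cp
        for start, end in KHMER_BLOCKS
        for cp in range(start, end + 1)
        if unicodedata.name(chr(cp), None) is not None  # skip unassigned slots
    ]


def missing_khmer_codepoints(supported):
    """Return assigned Khmer code points absent from `supported` (a set of ints)."""
    return [cp for cp in assigned_khmer_codepoints() if cp not in supported]


def font_codepoints(font_path):
    """Read the code points a font maps, using fontTools (imported lazily)."""
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    try:
        return set(font.getBestCmap() or {})
    finally:
        font.close()
```

For example, `missing_khmer_codepoints(font_codepoints("KhmerOS.ttf"))` would return an empty list for a font with full coverage. Note that a complete character map does not by itself guarantee correct shaping, which also depends on the font's OpenType tables and the renderer.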