
Text Recognition: Marathi OCR + NLP System

This project implements a complete pipeline for handwritten text recognition using OCR and NLP, targeting Marathi, which is written in the Devanagari script. Developed as part of my M.Tech in AI & Data Science (NFSU) coursework, it extracts, cleans, and processes native-language text from scanned documents to enable further linguistic analysis.

Academic Context: CTMTAIDS SII P3 - Natural Language Processing (TA2 Assignment)

Problem & Objectives

  • Challenge: Native Language Handwritten Text

    The primary challenge was accurately recognizing handwritten Marathi characters. The complexity of the script and the variation in handwriting call for robust preprocessing and fine-tuned OCR tools.

  • Objective: End-to-End Linguistic Analysis

    Build a system capable of handling the entire lifecycle: image ingestion, text extraction (OCR), cleaning/tokenization (NLP), and finally, utility tasks like machine translation.

Technical Stack & Methodology

    Core Technologies
  • Python 3.10+ (Primary language)
  • Tesseract OCR (with Marathi Language Pack)
  • OpenCV (Image Preprocessing)
  • Indic NLP Library (Script-specific processing)
  • Deep Translator (Translation Utility)

Implementation Flow

The solution was built sequentially across four stages:

  1. Stage 1: Image Preprocessing (OpenCV)

    Applied thresholding and noise reduction with OpenCV to make the handwritten text clearer before OCR, which improved Tesseract's recognition accuracy.

  2. Stage 2: Optical Character Recognition (Tesseract)

    Tesseract was configured specifically for the Marathi language to accurately extract the raw Devanagari script from the prepared image.

  3. Stage 3: Natural Language Processing (Indic NLP)

    The raw OCR output was processed with the Indic NLP Library for linguistic analysis, including tokenization (splitting the text into words and punctuation units) and validation of the detected language.

  4. Stage 4: Utility & Output (Deep Translator)

    The cleaned text was translated (e.g., Marathi to English) using Deep Translator, and the tokenized output was presented clearly in a tabular format (using Pandas).

Results & Demonstration

The system extracted the sample text with high fidelity (see the preview below), demonstrating the feasibility of building native-language NLP systems on top of open-source OCR tools.

Extracted & Translated Text Preview:

# Sample OCR Extraction (After Cleaning)
आम्ही सारे भारतीय आहोत.

# Sample Translation (Marathi to English)
We are all Indians.
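
The translation in this preview corresponds to Stage 4. A minimal sketch of that step is shown here, assuming the `GoogleTranslator` backend of Deep Translator (the project does not state which backend it used). The helper names `make_google_translator` and `translate_text` are hypothetical.

```python
from typing import Callable


def make_google_translator(source: str = "mr", target: str = "en") -> Callable[[str], str]:
    """Build a Marathi-to-English translate function (needs network access)."""
    # Imported lazily so the rest of the module works offline.
    from deep_translator import GoogleTranslator

    return GoogleTranslator(source=source, target=target).translate


def translate_text(text: str, translate: Callable[[str], str]) -> str:
    """Translate each non-empty line, keeping blank lines in place."""
    translated = []
    for line in text.splitlines():
        # Blank lines are passed through so paragraph breaks survive.
        translated.append(translate(line) if line.strip() else line)
    return "\n".join(translated)
```

In use, `translate_text(cleaned_text, make_google_translator())` would send each line to the translation service. Passing the translator in as a function keeps the line-handling logic testable without a network connection.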
        

Tokenization Output Example:

| Index | Token |
|-------|-------|
| 1     | आम्ही |
| 2     | सारे |
| 3     | भारतीय |
| 4     | आहोत |
| 5     | .     |
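
A table like the one above could be produced along these lines. This is a hedged sketch: it uses `indic_tokenize.trivial_tokenize` from the Indic NLP Library when that package is installed, and otherwise falls back to a simple regex splitter that is only an assumed stand-in. The helper names `tokenize_marathi`, `token_rows`, and `to_dataframe` are hypothetical.

```python
import re

try:
    from indicnlp.tokenize import indic_tokenize
except ImportError:  # the library is optional for this sketch
    indic_tokenize = None

# Fallback only: split off the danda marks and common ASCII punctuation.
_PUNCT = re.compile(r"([\u0964\u0965.,!?;:])")


def tokenize_marathi(text: str) -> list:
    """Split Marathi text into word and punctuation tokens."""
    if indic_tokenize is not None:
        return indic_tokenize.trivial_tokenize(text, lang="mr")
    # Stand-in tokenizer: pad punctuation with spaces, then split.
    return _PUNCT.sub(r" \1 ", text).split()


def token_rows(tokens: list) -> list:
    """Pair each token with a 1-based index, matching the table layout."""
    return list(enumerate(tokens, start=1))


def to_dataframe(tokens: list):
    """Build the Index/Token table with pandas (imported lazily)."""
    import pandas as pd

    return pd.DataFrame(token_rows(tokens), columns=["Index", "Token"])
```

For the sample sentence, `to_dataframe(tokenize_marathi("आम्ही सारे भारतीय आहोत."))` is expected to give the five rows shown in the table, with the full stop as its own token.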
        

Source Code & Demo

Review the complete implementation details, code, and notebook files via the links below:

  • View GitHub Repository
  • Run in Google Colab