Text Recognition: Marathi OCR + NLP System
This project implements a complete pipeline for Handwritten Text Recognition using OCR and NLP, specifically targeting the complex Marathi script. Developed as part of my M.Tech in AI & Data Science (NFSU) coursework, the system extracts, cleans, and processes native-language text from scanned documents to enable further linguistic analysis.
Academic Context: CTMTAIDS SII P3 - Natural Language Processing (TA2 Assignment)
Problem & Objectives
- Challenge: Native Language Handwritten Text
  The primary challenge was accurately recognizing Marathi handwritten characters, which requires robust preprocessing and fine-tuned OCR tools due to the complexity of the script and variations in handwriting.
- Objective: End-to-End Linguistic Analysis
  Build a system capable of handling the entire lifecycle: image ingestion, text extraction (OCR), cleaning/tokenization (NLP), and, finally, utility tasks such as machine translation.
Technical Stack & Methodology
Core Technologies
- Python 3.10+ (primary language)
- Tesseract OCR (with the Marathi language pack)
- OpenCV (image preprocessing)
- Indic NLP Library (script-specific processing)
- Deep Translator (translation utility)
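
Before running the pipeline, it is worth confirming that the Marathi traineddata is visible to Tesseract. A minimal check, assuming pytesseract is installed and the Tesseract binary is on the system PATH:

```python
import pytesseract

# List the language packs Tesseract can see; 'mar' (Marathi) must be present.
print(pytesseract.get_languages(config=""))
```
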
Implementation Flow
The solution was built sequentially across four stages (an illustrative code sketch for each stage follows the list):
- Stage 1: Image Preprocessing (OpenCV)
  Applied techniques such as thresholding and noise reduction using OpenCV to enhance the clarity of the handwritten text, significantly boosting Tesseract's recognition accuracy.
- Stage 2: Optical Character Recognition (Tesseract)
  Tesseract was configured specifically for the Marathi language to accurately extract the raw Devanagari script from the prepared image.
- Stage 3: Natural Language Processing (Indic NLP)
  The raw OCR output was processed with the Indic NLP Library to perform linguistic analysis, including tokenization (breaking text into words/units) and language-detection validation.
- Stage 4: Utility & Output (Deep Translator)
  The cleaned text was translated (e.g., Marathi to English) using Deep Translator, and the tokenized output was presented clearly in a tabular format (using Pandas).
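
Stage 1 sketch: a minimal preprocessing pass, assuming a scanned page at a hypothetical path scan.png; the actual filters and parameters used in the notebook may differ:

```python
import cv2

# Load the scanned page in grayscale (hypothetical file name).
image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Light noise reduction before binarization.
denoised = cv2.medianBlur(image, 3)

# Otsu thresholding separates ink from background automatically.
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("preprocessed.png", binary)
```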
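
Stage 2 sketch: OCR on the prepared image. This assumes the Tesseract binary and its Marathi language pack ('mar') are installed; the --psm setting is an illustrative choice, not necessarily the one used in the notebook:

```python
import cv2
import pytesseract

# Read the binarized page produced in Stage 1.
binary = cv2.imread("preprocessed.png", cv2.IMREAD_GRAYSCALE)

# 'mar' is Tesseract's language code for Marathi; --psm 6 treats the page as one uniform text block.
raw_text = pytesseract.image_to_string(binary, lang="mar", config="--psm 6")
print(raw_text)
```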
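
Stage 3 sketch: tokenization with the Indic NLP Library, reusing the project's own sample sentence; the normalization step is an assumption based on the library's standard workflow:

```python
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize

raw_text = "आम्ही सारे भारतीय आहोत."

# Normalize the Devanagari text (canonical forms of matras, nukta, etc.).
normalizer = IndicNormalizerFactory().get_normalizer("mr")
clean_text = normalizer.normalize(raw_text)

# Word-level tokenization for Marathi ('mr').
tokens = indic_tokenize.trivial_tokenize(clean_text, lang="mr")
print(tokens)  # ['आम्ही', 'सारे', 'भारतीय', 'आहोत', '.']
```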
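
Stage 4 sketch: translation and the tabular token view, using deep-translator's GoogleTranslator backend and Pandas; the variable values reuse the sample output shown in the next section:

```python
import pandas as pd
from deep_translator import GoogleTranslator

clean_text = "आम्ही सारे भारतीय आहोत."
tokens = ["आम्ही", "सारे", "भारतीय", "आहोत", "."]

# Marathi -> English translation via the Google backend.
translation = GoogleTranslator(source="mr", target="en").translate(clean_text)
print(translation)  # "We are all Indians."

# Present the tokens as a small table, mirroring the output shown below.
table = pd.DataFrame({"Index": range(1, len(tokens) + 1), "Token": tokens})
print(table.to_string(index=False))
```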
Results & Demonstration
The system achieved high fidelity in character extraction, successfully demonstrating the feasibility of building native language NLP systems on top of open-source OCR tools.
Extracted & Translated Text Preview:
# Sample OCR Extraction (After Cleaning)
आम्ही सारे भारतीय आहोत.
# Sample Translation (Marathi to English)
We are all Indians.
Tokenization Output Example:
| Index | Token |
|-------|-------|
| 1 | आम्ही |
| 2 | सारे |
| 3 | भारतीय |
| 4 | आहोत |
| 5 | . |
Source Code & Demo
Review the complete implementation details, code, and notebook files via the links below: