Text Recognition: Marathi OCR + NLP System
This project implements a complete pipeline for Handwritten Text Recognition using OCR and NLP, specifically targeting the complex Marathi script. Developed as part of my M.Tech in AI & Data Science (NFSU) coursework, the system extracts, cleans, and processes native-language text from scanned documents to enable further linguistic analysis.
Academic Context: CTMTAIDS SII P3 - Natural Language Processing (TA2 Assignment)
Problem & Objectives
- Challenge: Native Language Handwritten Text
  The primary challenge was accurately recognizing Marathi handwritten characters, which requires robust preprocessing and fine-tuned OCR tools due to the complexity of the script and variations in handwriting.
- Objective: End-to-End Linguistic Analysis
  Build a system capable of handling the entire lifecycle: image ingestion, text extraction (OCR), cleaning/tokenization (NLP), and, finally, utility tasks such as machine translation.
Technical Stack & Methodology
Core Technologies
- Python 3.10+ (primary language)
- Tesseract OCR (with the Marathi language pack)
- OpenCV (image preprocessing)
- Indic NLP Library (script-specific processing)
- Deep Translator (translation utility)
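
Before running the pipeline, it is worth confirming that the Marathi traineddata is visible to Tesseract. A minimal check, assuming pytesseract is installed and the Tesseract binary is on the system PATH:

```python
import pytesseract

# List the language packs Tesseract can see; 'mar' (Marathi) must be present.
print(pytesseract.get_languages(config=""))
```
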
Implementation Flow
The solution was built sequentially across four stages (an illustrative code sketch for each stage follows the list):
- Stage 1: Image Preprocessing (OpenCV)
  Applied techniques such as thresholding and noise reduction using OpenCV to enhance the clarity of the handwritten text, significantly boosting Tesseract's recognition accuracy.
- Stage 2: Optical Character Recognition (Tesseract)
  Tesseract was configured specifically for the Marathi language to accurately extract the raw Devanagari script from the prepared image.
- Stage 3: Natural Language Processing (Indic NLP)
  The raw OCR output was processed with the Indic NLP Library to perform linguistic analysis, including tokenization (breaking text into words/units) and language-detection validation.
- Stage 4: Utility & Output (Deep Translator)
  The cleaned text was translated (e.g., Marathi to English) using Deep Translator, and the tokenized output was presented clearly in a tabular format (using Pandas).
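
Stage 1 sketch: a minimal preprocessing pass, assuming a scanned page at a hypothetical path scan.png; the actual filters and parameters used in the notebook may differ:

```python
import cv2

# Load the scanned page in grayscale (hypothetical file name).
image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Light noise reduction before binarization.
denoised = cv2.medianBlur(image, 3)

# Otsu thresholding separates ink from background automatically.
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("preprocessed.png", binary)
```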
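
Stage 2 sketch: OCR on the prepared image. This assumes the Tesseract binary and its Marathi language pack ('mar') are installed; the --psm setting is an illustrative choice, not necessarily the one used in the notebook:

```python
import cv2
import pytesseract

# Read the binarized page produced in Stage 1.
binary = cv2.imread("preprocessed.png", cv2.IMREAD_GRAYSCALE)

# 'mar' is Tesseract's language code for Marathi; --psm 6 treats the page as one uniform text block.
raw_text = pytesseract.image_to_string(binary, lang="mar", config="--psm 6")
print(raw_text)
```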
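
Stage 3 sketch: tokenization with the Indic NLP Library, reusing the project's own sample sentence; the normalization step is an assumption based on the library's standard workflow:

```python
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize

raw_text = "आम्ही सारे भारतीय आहोत."

# Normalize the Devanagari text (canonical forms of matras, nukta, etc.).
normalizer = IndicNormalizerFactory().get_normalizer("mr")
clean_text = normalizer.normalize(raw_text)

# Word-level tokenization for Marathi ('mr').
tokens = indic_tokenize.trivial_tokenize(clean_text, lang="mr")
print(tokens)  # ['आम्ही', 'सारे', 'भारतीय', 'आहोत', '.']
```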
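
Stage 4 sketch: translation and the tabular token view, using deep-translator's GoogleTranslator backend and Pandas; the variable values reuse the sample output shown in the next section:

```python
import pandas as pd
from deep_translator import GoogleTranslator

clean_text = "आम्ही सारे भारतीय आहोत."
tokens = ["आम्ही", "सारे", "भारतीय", "आहोत", "."]

# Marathi -> English translation via the Google backend.
translation = GoogleTranslator(source="mr", target="en").translate(clean_text)
print(translation)  # "We are all Indians."

# Present the tokens as a small table, mirroring the output shown below.
table = pd.DataFrame({"Index": range(1, len(tokens) + 1), "Token": tokens})
print(table.to_string(index=False))
```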
Results & Demonstration
The system achieved high fidelity in character extraction, successfully demonstrating the feasibility of building native language NLP systems on top of open-source OCR tools.
Extracted & Translated Text Preview:
# Sample OCR Extraction (After Cleaning)
आम्ही सारे भारतीय आहोत.
# Sample Translation (Marathi to English)
We are all Indians.
Tokenization Output Example:
| Index | Token |
|-------|-------|
| 1 | आम्ही |
| 2 | सारे |
| 3 | भारतीय |
| 4 | आहोत |
| 5 | . |
Source Code & Demo
Review the complete implementation details, code, and notebook files via the links below: