Lexicon reduction for Urdu/Arabic script based character recognition: A multilingual OCR

Naz, Saeeda; Umar, Arif Iqbal; Razzak, Imran

File(s) under permanent embargo


journal contribution

posted on 2016-04-01, 00:00 authored by Saeeda Naz, Arif Iqbal Umar, Imran Razzak

Arabic script character recognition is a challenging task because of the complexity of the script and its huge number of ligatures. We present a method for developing a multilingual Arabic script OCR (Optical Character Recognition) system and for reducing the lexicon of Arabic script and its derivative languages. The objective of the proposed method is to overcome the large dataset required for Urdu and similar scripts by using the GCT (Ghost Character Theory) concept. Arabic and its sibling script languages share a similar character set, i.e. their character sets differ in diacritics and in writing styles such as Naskh or Nasta'liq. With the proposed method, the lexicon for Arabic and Arabic script based languages can be reduced by a factor of up to approximately 20. The proposed multilingual Arabic script OCR approach has been evaluated on online Arabic and derivative languages such as Urdu using a BPNN (back-propagation neural network). The results show that the proposed method not only reduces the lexicon but also helps in developing a multi-language character recognition system for Arabic script.
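The lexicon-reduction idea in the abstract can be illustrated with a minimal sketch. It assumes that a "ghost" form is obtained by replacing each dotted letter with the dotless base shape it shares with its siblings. The mapping table, the function names `ghost_form` and `reduce_lexicon`, and the sample words are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch covers only the lexicon side, not the recognition step.

```python
# Illustrative mapping only (an assumption, not the paper's table):
# a few dotted letters and the dotless base shape they share.
GHOST_BASE = {
    "\u0628": "\u066E",  # beh   -> dotless beh
    "\u062A": "\u066E",  # teh   -> dotless beh
    "\u062B": "\u066E",  # theh  -> dotless beh
    "\u067E": "\u066E",  # peh (Urdu/Persian) -> dotless beh
    "\u062C": "\u062D",  # jeem  -> hah
    "\u062E": "\u062D",  # khah  -> hah
    "\u0686": "\u062D",  # tcheh (Urdu/Persian) -> hah
    "\u0630": "\u062F",  # thal  -> dal
    "\u0632": "\u0631",  # zain  -> reh
    "\u0634": "\u0633",  # sheen -> seen
}


def ghost_form(word):
    """Replace every mapped letter with its dotless base shape."""
    return "".join(GHOST_BASE.get(ch, ch) for ch in word)


def reduce_lexicon(words):
    """Group words by ghost form; each key is one entry of the reduced lexicon."""
    groups = {}
    for word in words:
        groups.setdefault(ghost_form(word), []).append(word)
    return groups


# Five sample letter strings: beh-alef-beh, teh-alef-beh, peh-alef-peh,
# jeem-alef-reh, hah-alef-reh.
lexicon = [
    "\u0628\u0627\u0628",
    "\u062A\u0627\u0628",
    "\u067E\u0627\u067E",
    "\u062C\u0627\u0631",
    "\u062D\u0627\u0631",
]

reduced = reduce_lexicon(lexicon)
print(len(lexicon), "entries ->", len(reduced), "ghost forms")
```

In this toy lexicon, strings that differ only in their dots collapse to a single ghost form, so the number of shapes to model shrinks. The reduction reported in the abstract depends on the real lexicon and character table, which this sketch does not reproduce.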

History

Journal

Mehran University Research Journal Of Engineering & Technology

Volume

35

Issue

2

Pagination

209 - 216

Publisher

Mehran University of Engineering and Technology

Location

Jamshoro, Pakistan

ISSN

0254-7821

eISSN

2413-7219

Language

eng

Publication classification

C1.1 Refereed article in a scholarly journal

Copyright notice

2016, Mehran University of Engineering & Technology

Keywords

Urdu Optical Character Recognition; Multilingual Optical Character Recognition; Naskh; Nasta'liq; Science & Technology; Technology; Engineering, Multidisciplinary; Engineering; Optical character recognition devices; Scripting languages (Computer science); Research--Methodology; Multilingualism; Artificial Intelligence and Image Processing
