Naučno-tehnički pregled
1995, vol. 45, no. 6-7 

Automatic error correction in optical character recognition of Serbian text

Pero Šipka

Faculty of Philosophy in Novi Sad, Department of Psychology

Biljana Kosanović

Institute for Military Technics, Belgrade

Abstract: The effectiveness of PAKoST, a program for postprocessing optically read Serbian text developed by the present authors, was tested experimentally. The sample consisted of about 70,000 words of printed Serbian text in Latin script, compiled from nine sources that evenly covered three types of discourse (scientific, literary, and political), different fonts, and various printing qualities. The text was optically read by Recognita Plus, the only prestigious commercial OCR software that at present supports the Yugoslav character set. Recognita's ASCII output was then processed by PAKoST in the automatic correction mode.

The effectiveness of the postprocessing was checked using two algorithms: a hybrid contextual postprocessor (HCP) that was built into the previous version of PAKoST, and a new, more complex algorithm called MiniMax, implemented in the latest version of the program. Both the general correctability (against Recognita) and the incremental correctability (against HCP) of MiniMax were calculated and statistically tested.

The tests demonstrated that the new algorithm brought a substantial improvement in correctability to PAKoST. It reduced Recognita's word error rate from 7.90% to 4.39%. Furthermore, MiniMax, unlike HCP, produces a tolerable number of type II errors (new errors of its own), which encourages the use of PAKoST as an automatic postprocessor.
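
The two word error rates reported above can be put side by side with a little arithmetic. The sketch below is illustrative only: the function name `relative_error_reduction` is hypothetical, and the relative reduction it prints is derived here from the two reported percentages, not a figure stated in the abstract.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Return the fraction of the baseline error rate removed by postprocessing."""
    if before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (before - after) / before


# Word error rates (in percent) as reported in the abstract.
recognita_wer = 7.90  # Recognita Plus output before postprocessing
pakost_wer = 4.39     # after PAKoST with the MiniMax algorithm

absolute_drop = recognita_wer - pakost_wer
relative_drop = relative_error_reduction(recognita_wer, pakost_wer)

print(f"absolute reduction: {absolute_drop:.2f} percentage points")
print(f"relative reduction: {relative_drop:.1%}")
```

On these figures the drop is 3.51 percentage points, which corresponds to removing roughly 44% of the word errors left by Recognita.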

