Visual domain knowledge-based multimodal zoning for textual region localization in noisy historical document images

Chulwoo Pack, Leen Kiat Soh, Elizabeth Lorang

Research output: Contribution to journal › Article › peer-review


Document layout analysis, or zoning, is important for textual content analysis such as optical character recognition. Zoning document images such as digitized historical newspaper pages is challenging due to the noise and poor quality of the images. Recently, effective data-driven approaches, such as those leveraging deep learning, have been proposed, albeit with the concern of requiring large training datasets and thus incurring the additional cost of ground truthing. We propose a zoning solution that incorporates a knowledge-driven document representation, the gravity map, into a multimodal deep learning framework to reduce the amount of time and data required for training. We first generate a gravity map for each image, considering the centroid distance and area between a cell in a Voronoi tessellation and its content, to encode visual domain knowledge of the zoning task. Second, we inject the gravity maps into a deep convolutional neural network (DCNN) during training as an additional modality to boost performance. We report on two investigations using two state-of-the-art DCNN architectures and three datasets: two sets of historical newspapers and a set of born-digital contemporary documents. Evaluations show that our solution achieved comparable segmentation accuracy using fewer training epochs and less training data compared to a naïve training scheme.
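The gravity-map idea described above can be illustrated with a minimal sketch. Here each pixel is assigned to its nearest content centroid (which is exactly a Voronoi partition of the page), and its value grows with the content element's area and decays with the centroid distance. The function name `gravity_map` and the combination formula `area / (1 + distance)` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def gravity_map(shape, centroids, areas):
    """Toy gravity map: assign each pixel to its nearest content
    centroid (its Voronoi cell) and score it by the content's area
    divided by (1 + distance to that centroid), normalized to [0, 1].

    NOTE: an illustrative sketch, not the paper's exact definition.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Distance from every pixel to every centroid: shape (n, h, w).
    dists = np.stack([np.hypot(ys - cy, xs - cx) for cy, cx in centroids])
    nearest = dists.argmin(axis=0)        # Voronoi cell label per pixel
    d = dists.min(axis=0)                 # distance to the owning centroid
    a = np.asarray(areas, dtype=float)[nearest]  # owning content's area
    g = a / (1.0 + d)                     # larger, closer content pulls harder
    return g / g.max()                    # normalize to [0, 1]

# Example: two content elements of different sizes on a 10x10 page.
g = gravity_map((10, 10), centroids=[(2, 2), (7, 7)], areas=[10, 40])
```

In a multimodal setup such as the one described, a map like this could then be stacked with the grayscale page image as an extra input channel before it is fed to the DCNN.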

Original language: English (US)
Article number: 063028
Journal: Journal of Electronic Imaging
Issue number: 6
State: Published - Nov 1 2021


Keywords
  • document image processing
  • image analysis
  • image decomposition
  • image recognition
  • image segmentation

ASJC Scopus subject areas

  • Atomic and Molecular Physics, and Optics
  • Computer Science Applications
  • Electrical and Electronic Engineering


