11. OCR: Digitized Data

Data Science for Economists

2026-03-01

Today’s plan

  • “Non-computable information”
  • Lloyd’s shipping list: “The Wind of Change: Maritime Technology, Trade, and Economic Development” (Pascali, 2017)
  • Plantation records: “The Development Effects of the Extractive Colonial Economy: The Dutch Cultivation System in Java” (Dell and Olken, 2020)
  • Clay tablets: “Trade, Merchants, and the Lost Cities of the Bronze Age” (Barjamovic et al., 2019)

Non-computable information

  • Standard digitization methods, such as off-the-shelf OCR, often fail on historical documents
    • Especially for less frequently used languages, scripts and settings
  • Data may also be trapped in various types of images
  • Text data contains a significant amount of non-computable information

Economics and data

  • Key economic questions (misallocation, inequality, social mobility, welfare effects of trade) require disaggregated data
  • Long-run, digitized, disaggregated data are rare
    • Existing data come predominantly from high-resource contexts
  • Growing academic interest, driven in part by much better computing power and methods

Digitizing data