OCR

Optical Character Recognition (OCR) is software that converts scanned text into digital text that can be selected, copied, pasted, edited and searched. A historian might compare OCR to a monk in a scriptorium: the software copies an old book by typing its contents, letter for letter, into a text document.

OCR has been around for a long time. In the early twentieth century, devices were being developed that converted written text into braille or specific tones to assist blind people. OCR has come a long way since then. Nowadays it has many uses, from speed cameras to processing checks.

The Digital Shift
Within historical research, OCR is a powerful tool that many historians will interact with, if not directly then certainly indirectly. With the large-scale digitization of literature since the beginning of the 21st century, OCR has been important for making books searchable within a large corpus. When you enter a search query into a digital archive whose books are OCR’ed, your results will consist not only of matches from the metadata, but also of matches from the text of the resource itself. On the narrower level of a single book, OCR lets a search query serve as a replacement for the age-old index at the back of a publication. Such a search can be done by pressing Ctrl+F (Cmd+F on Mac) in most browsers and PDF readers.
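
The difference between a metadata match and a match in the OCR’ed text can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Book record, the two sample books and the function names are hypothetical, and a real digital archive would use a proper search index rather than a simple loop.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str     # metadata, as entered by a cataloguer
    ocr_text: str  # full text produced by OCR (hypothetical sample)


def search(corpus, query):
    """Return (title, source) pairs, where source is 'metadata' or 'text'."""
    q = query.lower()
    hits = []
    for book in corpus:
        if q in book.title.lower():
            hits.append((book.title, "metadata"))
        if q in book.ocr_text.lower():
            hits.append((book.title, "text"))
    return hits


def find_in_book(book, query, context=20):
    """Mimic Ctrl+F: return a short snippet around every occurrence."""
    text = book.ocr_text
    lower, q = text.lower(), query.lower()
    snippets = []
    start = lower.find(q)
    while start != -1:
        begin = max(0, start - context)
        end = min(len(text), start + len(q) + context)
        snippets.append(text[begin:end])
        start = lower.find(q, start + 1)
    return snippets


# Two made-up books: one matches only in its text, the other only in its title.
corpus = [
    Book("A History of Printing",
         "The monk copied the psalter by hand in the scriptorium."),
    Book("Monks and Manuscripts",
         "Parchment was prepared long before any ink was used."),
]

for title, source in search(corpus, "monk"):
    print(f"{title}: match in {source}")
```

Without OCR, only the second book would be found, because the word appears in its title. The first book is found only because its full text is searchable. Calling find_in_book(corpus[0], "scriptorium") plays the role of Ctrl+F within a single book.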

In Oral History, OCR can be useful when digitizing transcripts.

Below is an example of a BBC document that was meant to be read aloud in the radio broadcast of June 6th, 1944, at 8 a.m. (D-Day).

The 8 pages of the original document have been OCR’ed, and the text is read aloud again by Benedict Cumberbatch (for more information, see the BBC website).

audio fragment.