OCLC Support

OCR

CONTENTdm provides an extension that enables the Project Client to generate file transcripts by using Optical Character Recognition (OCR). This allows the text characters in an image file to be searched.

Additionally, when an end user searches for a term generated by the OCR processing, either with a general search or within a compound object, the search term is highlighted in the image. (Search term highlighting is not supported for Hebrew, Chinese, Japanese, and Korean.)

For compound objects, the OCR extension also provides an option to create a PDF of the entire compound object for ease of printing.

For information about how to use OCR processing on items already in your collection, see Adding OCR to Items in a Collection.

The accuracy of OCR depends on:

  • The quality of the scan

  • The quality of the original document being scanned

  • Whether the characters being recognized are typewritten, computer-generated, hand printed, or cursive

  • The font face of the typewritten or computer-generated text

  • Whether you use the OCR fast mode option (see OCR Settings)

OCR can be performed on JPEG2000, JPEG, GIF, PNG, and TIFF files.
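Before queuing a large batch, it can help to pre-check which files are in one of the formats listed above. The following is a minimal sketch, not part of CONTENTdm: the helper name `ocr_format` and the extension-to-format mapping are assumptions, since this article names formats only and not file extensions.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from common file extensions to the formats named above.
# The article lists formats only; these extensions are illustrative.
OCR_EXTENSIONS = {
    ".jp2": "JPEG2000",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".gif": "GIF",
    ".png": "PNG",
    ".tif": "TIFF",
    ".tiff": "TIFF",
}

def ocr_format(filename: str) -> Optional[str]:
    """Return the format name if the extension is in the assumed map, else None."""
    return OCR_EXTENSIONS.get(Path(filename).suffix.lower())

for name in ["page_001.TIF", "cover.jpeg", "scan.pdf"]:
    print(name, "->", ocr_format(name) or "not in the supported list")
```

The check looks at the file name only; it does not inspect file contents.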

Supported languages

CONTENTdm OCR supports the languages below.

Supported languages
Abkhaz
Adyghe
Afrikaans
Agul
Albanian
Altaic
Armenian (Eastern)
Armenian (Grabar)
Armenian (Western)
Avar
Aymara
Azerbaijani (Cyrillic)
Azerbaijani (Latin)
Bashkir
Basic programming language
Basque
Belarussian
Bemba
Blackfoot
Breton
Bugotu
Bulgarian
Buryat
Catalan
Chamorro
Chechen
Chemistry (simple chemical formulas)
Chinese Simplified
Chinese Traditional
Chukcha
Chuvash
Corsican
Crimean Tatar
Croatian
Crow
Czech
Danish
Dargwa
Digits (Numbers)
Dungan
Dutch (Netherlands)
Dutch (Belgium)
E-13B (MICR text type)
English
English and Russian
Eskimo (Cyrillic)
Eskimo (Latin)
Esperanto
Estonian
Even
Evenki
Faeroese
Fijian
Finnish
Fortran programming language
French
Frisian
Friulian
Gagauz
Galician
Ganda
German
German (Luxembourg)
German (new spelling)
Greek
Guarani
Hani
Hausa
Hawaiian
Hebrew
Hungarian
Icelandic
Ido
Indonesian
Ingush
Interlingua
Irish
Italian
Japanese + English
Kabardian
Kalmyk
Karachay-Balkar
Karakalpak
Kasub
Kawa
Kazakh
Khakas
Khanty
Kikuyu
Kirghiz
Kongo
Korean and English
Koryak
Kpelle
Kumyk
Kurdish
Lak
Lappish (Sami)
Latin
Latvian
Lezgin
Lithuanian
Luba
Macedonian
Malagasy
Malay
Malinke
Maltese
Mansi
Maori
Mari
Maya
Miao
Minankabaw
Mohawk
Mongol
Mordvin
Nahuatl
Nenets
Nivkh
Nogay
Norwegian (Bokmal and Nynorsk)
Norwegian (Bokmal)
Norwegian (Nynorsk)
Nyanja
Occidental
Ojibway
Ossetian
Papiamento
Pascal programming language
Pidgin English (Tok Pisin)
Polish
Portuguese (Brazil)
Portuguese (Portugal)
Provencal
Quechua
Rhaeto-Romanic
Romanian
Romanian (Moldavia)
Romany
Ruanda
Rundi
Russian
Russian (old spelling)
Samoan
Scottish Gaelic
Selkup
Serbian (Cyrillic)
Serbian (Latin)
Shona
Sioux (Dakota)
Slovak
Slovenian
Somali
Sorbian
Sotho
Spanish
Sunda
Swahili
Swazi
Swedish
Tabassaran
Tagalog
Tahitian
Tajik
Tatar
Tinpo
Tongan
Tswana
Tun
Turkish
Turkmen
Tuvan
Udmurt
Ukrainian
Uzbek (Cyrillic)
Uzbek (Latin)
Visayan
Welsh
Wolof
Xhosa
Yakut
Yiddish
Zapotec
Zulu

Activating and Resetting OCR

Optical Character Recognition (OCR) is provided by the CONTENTdm OCR Extension, powered by ABBYY® FineReader®. You need to activate your FineReader license before using the OCR Extension.

To have a license code, you must have purchased the CONTENTdm OCR Extension. You must be connected to the Internet to activate or reset your OCR license.

To activate OCR:

  1. Access the Project Settings Manager, and click the OCR tab. The OCR page displays.

  2. Click Activate to display the License Manager dialog box.

  3. Enter your 20-character serial number (the number is not case-sensitive). Click OK. The License Manager displays.

  4. Click Activate. The Activation Wizard opens.

  5. Select via Internet as the license activation method, and click Next.

  6. Confirm your computer is connected to the Internet, and click Next. After the license has successfully activated (it may take a moment), a success message is displayed.

  7. Click Finish.

  8. Close the Project Client, and then restart the Project Client.

  9. Confirm that OCR has been activated.

Checking OCR Activation

To check that OCR has been activated for your Project Client:

  1. Access the Project Settings Manager, and click the OCR tab. The OCR page displays.

  2. Confirm that the OCR Extension is activated (the OCR license code should be visible, and you can check the number of remaining pages you can process for the month, selected recognition languages, and whether fast mode is selected).

Resetting OCR Activation

If you need to decommission an OCR license or if activation fails, you can reset the FineReader license by clearing the OCR license and activating it again. Use caution when clearing an OCR license (see note below).

Note: Each license code can be activated once. If you need to clear and reactivate a license code more than once (for example, to install the extension on a different machine), contact CONTENTdm Support.

Note: Microsoft Windows XP requires Admin privileges to clear an OCR license.

To reset OCR activation:

  1. From the OCR page, click Clear OCR License. A confirmation dialog box displays.

    Caution: If you clear and reactivate a license more than once, the activation will fail and you will need to contact Support for assistance.

  2. Click Yes to confirm. The OCR page displays.

  3. Click OK, and then access the OCR page again to view the updated OCR settings.

OCR Settings

Using the Optical Character Recognition (OCR) settings, you can select a faster processing mode and choose one or more languages to use for OCR processing. Choosing the fast mode recognition option increases the processing speed. However, fast mode is not as accurate as the standard processing mode and is most appropriate for documents with a simple layout, and good scanning and print quality.

OCR processing must be activated before you can use this processing option. For more information, see Activating and Resetting OCR.

OCR settings are managed per project using the Project Settings Manager. When the OCR Extension is activated, the OCR license code is displayed. You can then check the number of remaining pages you can process for the month, select recognition languages, and choose whether to use fast mode.

To enable the fast mode recognition option:

  1. In the Project Client, select your project tab. On the left task pane in Other Tasks, click Edit Project Settings.

  2. Select the OCR tab. The OCR page displays.

  3. Select Fast Mode, and then click OK.

Note: Selecting the fast mode option increases the processing speed of the OCR, but results will be less accurate.

To change the recognition language:

  1. In the Project Client, select your project tab. On the left task pane in Other Tasks, click Edit Project Settings.

  2. Select the OCR tab. The OCR page displays.

  3. Click Change in the OCR Options section. The Recognition Language dialog box opens.

  4. Select the desired language or languages. The current language is displayed in the text box at the top of the dialog box. Additional language selections are added to the text box, separated by commas. To remove a language from the list, clear the box next to the language.

    Note: Some languages are not supported in combination. For example, OCR processing may not process some languages when also combined with Chinese, Japanese, or Korean. If you have more than one recognition language selection and receive an error when trying to process, you may need to select only the primary language for the particular item.

  5. Click OK to save changes.

Generating Transcripts Using OCR

If you have the OCR Extension, you can use the Add Compound Objects wizard or the Add OCR text option in the Project and Item Editing tabs to generate transcripts using OCR for single files, multiple files or compound objects.

Generating Transcripts Using OCR with the Add Compound Objects Wizard

The compound object wizards provide an option for generating transcripts by using OCR, if you have the OCR extension. All compound object wizards provide the OCR option within the Page Information screen. You also can choose to create a PDF during the OCR processing, which can be used for printing.

To generate transcripts using OCR with a compound object wizard:

  1. The administrator must edit field properties of the collection to enable full text searching. The administrator can add a new field for the transcript or designate an existing field as the full text search field.

  2. On the project tab, click Add Compound Objects in the left task bar. The Add Multiple Compound Objects screen displays.

  3. Select a wizard to use and click Add. Follow the wizard screens.

  4. On the Page Information screen, select Generate transcripts using OCR.

  5. If desired, select Create print PDF.

  6. When you are finished with the wizard, you can review the compound object by going to the project tab and finding the object in the project spreadsheet.

Generating Transcripts Using OCR with Items in the Project

The Project spreadsheet and the Item Editing tab provide another option for generating transcripts by using OCR, if you have the OCR extension. You can OCR items you select in the Project spreadsheet or open items and compound objects in the Item Editing tab to add OCR text.

To generate transcripts using OCR in the Project tab:

  1. Full text searching must be enabled in the collection. In the Project spreadsheet, check the boxes next to the items to OCR.

  2. From the Edit menu or the More Actions menu, click Add OCR Text.

  3. A progress bar displays while the OCR is performed. When complete, a summary screen displays the summary and any errors or warnings.

  4. Click Close. The OCR text is displayed in the full text field of the items.

To generate transcripts using OCR in the Item Editing tab:

  1. Full text searching must be enabled in the collection. From the Project spreadsheet, open the item or compound object in a new tab.

  2. From the Edit menu or the More Actions menu, click Add OCR Text.

  3. For compound objects, you can choose to OCR the entire compound and create a print PDF or OCR only selected pages. To OCR selected pages within the compound object, click on the names of pages while pressing the Ctrl key. Click Perform OCR.

  4. A progress bar displays while the OCR is performed. When complete, the OCR text is displayed in the full text field of the item or compound object pages.

OCR Processing Page Limits

The CONTENTdm OCR Extension enables you to process a certain number of pages per month, depending on your license level. (You can check your page counts by reviewing the page limit on the OCR tab in the Project Settings Manager.)

The pages are measured according to the international paper standard of A4: approximately 8.27 inches x 11.69 inches, which is 96.68 square inches. The US standard letter size of 8.5 inches x 11 inches, which is 93.5 square inches, is about three square inches smaller than A4 and counts as one processed page. If a page exceeds size A4, you will receive a warning that the page exceeds the single-page scan size and will be counted as more than one page. You can cancel the process if you do not want to proceed. If you do not want to be warned about oversized images in the future, you can choose to suppress the warning message.

If the page that you are scanning is larger than A4, the number of pages counted will be equal to the area of the page divided by the A4 area (96.68 square inches). The result is rounded up to the next whole number. For example, if you are processing a tabloid page that is 11 inches x 17 inches, the area of that page is 187 square inches. Dividing 187 by 96.68 gives 1.93, which rounds up to 2. This means that an 11 x 17 page will count as two processed pages.
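The page-count rule can be sketched in a few lines of Python. This is an illustration of the arithmetic described in this section, not the extension's actual implementation; the function name `pages_counted` is hypothetical.

```python
import math

A4_AREA_SQ_IN = 96.68  # 8.27 in x 11.69 in, the A4 area used in this article

def pages_counted(width_in: float, height_in: float) -> int:
    """Estimate how many processed pages a sheet of the given size counts as."""
    area = width_in * height_in
    # Sheets up to A4 count as one page; larger sheets are rounded up.
    return max(1, math.ceil(area / A4_AREA_SQ_IN))

print(pages_counted(8.5, 11))  # US letter
print(pages_counted(11, 17))   # tabloid
```

With these inputs, US letter comes out as one page and tabloid as two, matching the figures given in the text.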

If you know the dimensions of your image in pixels, use the following formula to determine the area in square inches:

(Pixel width) / (X resolution) * (Pixel height) / (Y Resolution)

For example, if you have an image that was scanned at 72 pixels per inch and the image is 1200 pixels wide by 1600 pixels high, using the above formula (1200/72 x 1600/72), the dimensions are 16.67 inches wide x 22.22 inches high (370.37 square inches). Divide that by the A4 value, which results in 3.83 pages (or 4 pages, rounded up to the next whole number).
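The same calculation can be run directly from pixel dimensions. The sketch below is illustrative only: the function name `pages_from_pixels` is hypothetical, and it assumes the Y resolution equals the X resolution when only one value is given.

```python
import math

A4_AREA_SQ_IN = 96.68  # A4 area in square inches, as used in this article

def pages_from_pixels(width_px, height_px, x_dpi, y_dpi=None):
    """Return (area in square inches, estimated processed-page count)."""
    if y_dpi is None:
        y_dpi = x_dpi  # assumption: same resolution on both axes
    area = (width_px / x_dpi) * (height_px / y_dpi)
    # Images up to A4 count as one page; larger images are rounded up.
    return area, max(1, math.ceil(area / A4_AREA_SQ_IN))

area, pages = pages_from_pixels(1200, 1600, 72)
print(f"{area:.2f} square inches -> {pages} page(s)")
```

Working from unrounded inch values, the 1200 x 1600 image at 72 pixels per inch comes to about 370.37 square inches and counts as 4 pages.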

General guidelines for A4 dimensions in pixels are:

72 dpi = 595 X 842 pixels
300 dpi = 2480 X 3508 pixels
600 dpi = 4960 X 7016 pixels

The following table is a quick reference for the above formulas and dimensions.

A4 paper size in inches: 8.27 x 11.69 (96.68 square inches)
To determine area in square inches when given pixels: (Pixel width)/(X resolution) * (Pixel height)/(Y Resolution)
To determine number of pages counted toward processing: Area of the page/Area of A4 (96.68), rounded up to the next whole number

 
