OCR

When a PDF file contains scanned pages or graphics that look like text, you can use OCR (Optical Character Recognition) text recovery to create an editable document.
Additionally, you can apply OCR to accurately recreate the text in a Word document when the font information in the PDF file is incomplete (resulting in corrupt or incorrect characters).

This converter uses SolidOCR for extracting text from scanned images.

OCR accuracy depends on image quality, so we recommend a scan resolution of 200 dpi or greater for best results.
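As a rough way to check a scan against this recommendation, the effective resolution of a page image can be estimated from its pixel width and the page width. The sketch below is illustrative only: the function names are hypothetical and not part of SimplyPDF or Solid Framework. It relies on the fact that PDF page sizes are measured in points, with 72 points per inch.

```python
POINTS_PER_INCH = 72       # PDF page dimensions are expressed in points
RECOMMENDED_MIN_DPI = 200  # recommendation stated above


def effective_dpi(pixel_width: int, page_width_points: float) -> float:
    """Estimate the resolution of a scanned image that fills the page width."""
    page_width_inches = page_width_points / POINTS_PER_INCH
    return pixel_width / page_width_inches


def meets_recommendation(pixel_width: int, page_width_points: float) -> bool:
    """Return True if the estimated resolution is at least 200 dpi."""
    return effective_dpi(pixel_width, page_width_points) >= RECOMMENDED_MIN_DPI


# A US Letter page is 612 points (8.5 inches) wide.
print(effective_dpi(1700, 612))         # 1700 px over 8.5 in
print(meets_recommendation(1275, 612))  # 1275 px over 8.5 in is 150 dpi
```

A scan that falls short of the recommendation can still be submitted; the estimate only indicates when recognition quality is likely to suffer.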

Auto

If this option is selected, OCR is performed on images only. Any non-scanned text within the document is extracted directly as text. This is usually the best choice.

Never

No attempt will be made to convert images to text. If you have chosen to include images in the reconstructed document, they will remain images in the Word document.

Always

The entire document will be treated as a scanned image, even if it contains text. As such, OCR will be performed on the text as well as on the scanned images. This option is most useful if OCR has already been performed on the document but failed to extract the text effectively, or if the PDF contains "Non-Standard Encoded" text.

Note: OCR is a premium feature and is limited to 10 scanned pages when using this free service.

If your PDF contains more than 10 scanned pages and OCR is not set to 'Always', SimplyPDF will not perform any OCR. If your PDF contains more than 10 scanned pages and OCR is set to 'Always', SimplyPDF will convert only the first 10 pages of your document.
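The free-service rules above can be summarized as a small decision function. This is a minimal sketch of the documented behavior, not SimplyPDF's actual implementation: `plan_ocr` and `OcrPlan` are hypothetical names, and the sketch assumes that a document at or under the limit is converted in full.

```python
from dataclasses import dataclass

FREE_OCR_PAGE_LIMIT = 10  # scanned-page limit stated for the free service


@dataclass
class OcrPlan:
    run_ocr: bool         # whether any OCR is performed
    pages_converted: int  # how many pages of the document are converted


def plan_ocr(mode: str, scanned_pages: int, total_pages: int) -> OcrPlan:
    """Hypothetical model of the rules described above."""
    if mode not in ("Auto", "Never", "Always"):
        raise ValueError(f"unknown OCR mode: {mode!r}")
    if mode == "Never":
        # Images are never converted to text.
        return OcrPlan(False, total_pages)
    if scanned_pages > FREE_OCR_PAGE_LIMIT:
        if mode == "Always":
            # Only the first 10 pages of the document are converted.
            return OcrPlan(True, min(total_pages, FREE_OCR_PAGE_LIMIT))
        # Over the limit and not 'Always': no OCR at all.
        return OcrPlan(False, total_pages)
    # Within the limit: 'Always' runs OCR on every page, 'Auto' only on images.
    return OcrPlan(mode == "Always" or scanned_pages > 0, total_pages)


print(plan_ocr("Auto", 12, 20))    # over the limit, so no OCR is performed
print(plan_ocr("Always", 12, 20))  # OCR runs, but only 10 pages are converted
```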

For unlimited OCR, see our Desktop Products or Solid Framework.

OCR Language

While SolidFramework can attempt to identify the language of a document based on its content, explicitly specifying the language can improve the accuracy of reconstruction.

The following languages are supported:
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Italian
  • Japanese
  • Korean
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Slovenian
  • Spanish
  • Swedish
  • Turkish

Note: SolidFramework uses Tesseract for Chinese, Japanese, Korean, and Greek language OCR.