About Working with PDF Files
Portable Document Format (PDF) is a file format created by Adobe® for electronic document distribution and exchange.
CONTENTdm provides features for efficient processing of born-digital documents in PDF format. PDF files and PDF compound objects can be displayed inline in the Item Viewer and Compound Object Viewer by using Adobe Reader®.
The PDF features include: automatic conversion of multiple-page PDF files into compound objects, creation of thumbnail images from PDF files, and full text extraction. Additionally, pages of a compound object automatically generated from a PDF file will not count toward the total number of items on the server.
Before you decide to use PDF over another format, consider whether your source materials are well-suited to this format, and whether your end-user experience would be optimized by using PDF. For example, PDF files are ideal for documents that were initially created as digital documentation, such as theses and city council minutes. PDF files are neither efficient nor optimal for the end-user experience when used for scanned images, books, maps, or newspapers.
Additionally, PDF is not ideal for scanned images because an item that has been scanned does not automatically contain embedded text. For scanned images, you can use the CONTENTdm OCR Extension for generating full text. PDF files created from images can be very large and slow to download for online viewing. For a better end-user experience, you can use CONTENTdm to create JPEG2000 or JPEG display images from scanned TIFF files, rather than converting the TIFF files to PDF files. For more information, see the Using PDF Files in CONTENTdm tutorial (PDF).
To view PDF files and PDF compound objects in the Project Client, you must have Adobe Reader. If it is not already installed, install Adobe Reader.
A single PDF file can contain many pages. Regardless of the number of pages, it is a single file and is uploaded as a single file. You can import multiple PDF files using the Add Multiple Items wizard.
Depending on how your collection is configured, multiple-page PDF files can be added to your collection to be viewed as single items or, if PDF conversion is enabled, they can be automatically converted to PDF compound objects.
Note: To ensure an optimal end-user experience, PDF files (or pages of a compound object) larger than 20 MB are not loaded inline in any of the item viewers. These larger files can be saved to the desktop or opened outside of the browser.
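The size limit in the note above can be checked before import. The following is a minimal sketch, not part of CONTENTdm: the function names are hypothetical, and treating "20 MB" as 20 × 1024 × 1024 bytes is an assumption (the product may count megabytes differently).

```python
from pathlib import Path

# Limit taken from the note above. Interpreting "20 MB" as binary
# megabytes is an assumption made for this sketch.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def exceeds_inline_limit(size_bytes: int) -> bool:
    """Return True if a file of this size is larger than the inline limit."""
    return size_bytes > INLINE_LIMIT_BYTES


def find_oversized_pdfs(folder: str) -> list[Path]:
    """List the .pdf files directly inside `folder` that exceed the limit."""
    return sorted(
        path
        for path in Path(folder).glob("*.pdf")
        if exceeds_inline_limit(path.stat().st_size)
    )
```

Running `find_oversized_pdfs` on a staging folder before using the Add Multiple Items wizard would flag files that users will have to save or open outside the browser.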
Another way that PDF files are different from other files is that the text from PDF files is extracted and placed in a full text search field when PDF files are approved and added to a collection. Automatic text extraction occurs when:
Note: If your PDF file was created from a born-digital document, such as a Microsoft Word file, it will almost always have embedded text. If your PDF file was created from scanned TIFF images, it does not have embedded text unless you have taken the additional step to OCR the image (or PDF file) and add that text to the PDF.
CONTENTdm supports integrated OCR functionality through the OCR Extension. Using the OCR Extension, full text can be generated from JPEG2000, JPEG, PNG, GIF and TIFF files. OCR is not supported for PDF files. (The automatic text extraction for PDF files mentioned above is separate functionality and does not require the OCR Extension.)
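The split between the two full text mechanisms can be summarized in a short sketch. It is illustrative only: the function name and return labels are hypothetical, and the mapping from the formats named above to file extensions (for example, `.jp2` for JPEG2000) is an assumption.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats the OCR Extension
# supports (JPEG2000, JPEG, PNG, GIF, TIFF). The mapping is an assumption.
OCR_EXTENSIONS = {".jp2", ".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def full_text_route(filename: str) -> str:
    """Return which mechanism, if any, could supply full text for a file."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        # PDF text comes from embedded text, not from OCR.
        return "embedded-text extraction"
    if suffix in OCR_EXTENSIONS:
        return "OCR Extension"
    return "unsupported"
```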
Thumbnail images can be automatically generated for PDF files based on the first page of the PDF, or you can specify a custom thumbnail.
Single-Item PDF Files
Single-item PDF files are created for PDF files that contain only one page. Single-item PDF files are also created by default for multiple-page PDF files, unless your CONTENTdm administrator has configured the collection for PDF conversion. (This setting can be turned on and off for each collection.) You can override the collection setting on the server by editing the Processing settings in the Project Settings Manager.
PDF Compound Objects
PDF compound objects (of the type monograph) are automatically created when multiple-page PDF files are added to a collection and approved, if that collection has been configured to enable PDF conversion or if you have configured the Processing settings in the Project Settings Manager.
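The rules for single items and compound objects can be sketched as one decision function. This is a simplified model, not CONTENTdm code: the function and parameter names are hypothetical, and it assumes that a Processing setting in the Project Settings Manager, when present, takes precedence over the collection setting.

```python
from typing import Optional


def import_mode(
    page_count: int,
    collection_conversion_enabled: bool,
    project_override: Optional[bool] = None,
) -> str:
    """Return how a PDF would be added under the rules described above.

    project_override models the Processing setting in the Project
    Settings Manager; None means "use the collection setting".
    """
    if page_count <= 1:
        # One-page PDFs are always single items.
        return "single item"
    conversion = (
        collection_conversion_enabled
        if project_override is None
        else project_override
    )
    return "compound object" if conversion else "single item"
```

For example, `import_mode(12, False)` models the default behavior for a 12-page file, and `import_mode(12, False, project_override=True)` models enabling conversion for a single project.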
The page order of the PDF compound object matches the page order of the original, multiple-page PDF file. Each page of the PDF file has a metadata record after it is added to a collection, but the digital item associated with it in CONTENTdm is virtual (i.e., a link to the related page in the PDF file). The individual pages of PDF files do not exist separately on the server; they are extracted and displayed only when the user requests them. This improves the end-user's access speed because the entire PDF file does not have to download to display the requested page. You cannot set permissions on individual pages. You also cannot edit the individual pages of PDF compound objects unless you remove the PDF file from the collection, edit the original PDF, and then add it to the collection again.
When a multiple-page PDF is added to a project in the Project Client, you can create compound object–level metadata by editing the record in the project spreadsheet view or in the Item Editing tab. When the multiple-page PDF file is added to the collection, text is extracted from each page and added to the full text field in the associated page-level metadata records.
Pages of the compound object are named based on collection configuration settings or PDF processing settings that you have specified using the Project Settings Manager in the Project Client. You can also rename individual pages in the Item Editing tab.
If you choose to automatically generate thumbnails for the PDF compound object, thumbnails are created for each page. The thumbnail that represents the PDF compound object itself is based on the first page of the PDF file. (If you choose to use a custom thumbnail for a PDF compound object, the custom thumbnail is used for the compound object, as well as for each page of the object.)
You can import PDF files to display in your collections as single items, whether they contain one or more pages. For information about changing the default display so that single-item PDF files display inline, see Configuring the Website.
To import a single-item PDF:
Note: You can import multiple, single-item PDF files using the Add Multiple Items wizard.
When the file is added, a thumbnail is automatically generated. (Alternatively, you can use Images & Thumbnails to select a custom thumbnail for all PDF files.)
When the PDF is added to the project, the first 128,000 single-byte characters (64,000 double-byte characters) are extracted from the PDF and put into the full text search field. If the text in the PDF is longer than that, the text is truncated.
If the full text search field already contains data, the text is not extracted.
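The extraction limit and the "existing data wins" rule can be modeled as follows. This is a rough sketch with hypothetical names. It assumes the limit behaves like a budget of 128,000 bytes and that ASCII characters cost one byte while all other characters cost two. The actual encoding used by CONTENTdm is not stated here.

```python
# Budget implied by "128,000 single-byte characters (64,000 double-byte
# characters)". Modeling it as a byte budget is an assumption.
MAX_FULL_TEXT_BYTES = 128_000


def char_cost(ch: str) -> int:
    """Assumed cost model: ASCII counts as one byte, anything else as two."""
    return 1 if ord(ch) < 0x80 else 2


def truncate_full_text(text: str, budget: int = MAX_FULL_TEXT_BYTES) -> str:
    """Keep the longest prefix of `text` that fits within the budget."""
    used = 0
    for index, ch in enumerate(text):
        used += char_cost(ch)
        if used > budget:
            return text[:index]
    return text


def fill_full_text(existing_field: str, extracted_text: str) -> str:
    """Populate the full text field only when it is currently empty."""
    if existing_field:
        # Existing data is kept and extraction is skipped.
        return existing_field
    return truncate_full_text(extracted_text)
```

Under this model, a PDF with 130,000 ASCII characters keeps its first 128,000, and a field that was filled in by hand is left untouched.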
You can import multiple-page PDF files to display in your collections in the Compound Object Viewer.
To import a multiple-page PDF file as a compound object:
Note: You can import a batch of multiple-page PDF files by using the Add Multiple Items wizard.
When the file is added, thumbnails are automatically generated. (Alternatively, you can use Images & Thumbnails to select a custom thumbnail for all PDF files.)
When the PDF compound object is added to the project, the text from each page (up to 128,000 single-byte characters or 64,000 double-byte characters) is extracted from the PDF and put into the full text search field for the metadata record for each page.
CONTENTdm PDF functionality uses the Adobe® PDF Library™. Adobe, Adobe PDF Library, and the Adobe logo are trademarks of Adobe Systems Incorporated.
CONTENTdm® is a registered trademark of OCLC.