Description

PDE is an R package that extracts information and tables from PDF files. PDE_analyzer_i() performs the sentence and table extraction, while the included PDE_reader_i() provides user-friendly visualization and quick processing of the obtained results.


Installation

Install the dependent packages

install.packages("tcltk2")    # Install the dependent package tcltk2

The package requires the Xpdf command line tools by Glyph & Cog, LLC. Please download the Xpdf command line tools from https://www.xpdfreader.com/download.html and install them on your local disk. Alternatively, the following commands can be used to install the correct Xpdf command line tools and to check the installation:

PDE_install_Xpdftools4.02()    # Download and install the Xpdf command line tools
PDE_check_Xpdf_install()        # Check if all required Xpdf tools are installed correctly

Install the package through CRAN

install.packages("PDE", dependencies = TRUE)

or download the latest PDE_*.*.*.tar.gz and install it from the local path.

filename <- file.choose()     # Choose the downloaded PDE_*.*.*.tar.gz file
install.packages(filename, type = "source", repos = NULL)

NOTE: The PDE package was tested on Microsoft Windows, Mac and Linux machines. Major differences include the visual appearance of the interfaces and the directory structures, but all functions are preserved.


Execution

The PDE analyzer can be accessed through different functions which are outlined below.

PDE_analyzer()
PDE_analyzer_i()
PDE_extr_data_from_pdfs()
PDE_pdfs2table()
PDE_pdfs2table_searchandfilter()
PDE_pdfs2txt_searchandfilter()

The PDE reader is only available as an interactive user interface requiring the R package tcltk2.

PDE_reader_i()

NOTE: If an error occurs when starting PDE_analyzer_i() or PDE_reader_i() on Mac, see Troubleshoot - Error when starting interactive user interface on Mac (failed to allocate tcl font).

Quick guide to get started

PDE_analyzer_i()

  1. Run
library("PDE")
PDE_analyzer_i()
  2. This should open a user interface.
    Figure: PDE_analyzer_i() user interface on Mac
  3. Fill out the form from top to bottom (standard parameters are preselected).
  4. The filled form can and should be saved as a TSV file at any time by clicking the Save form as tsv button at the top center of the form.
    NOTE: Choose an empty folder or create a new one as the output directory, since each analysis creates at least one file per analyzed PDF file.

PDE_reader_i()

  1. Run
library("PDE")
PDE_reader_i()
  2. This should open a user interface.
    Figure: PDE_reader_i() user interface on Linux

  3. Load either a sentence analysis file or a folder with such files.
    NOTE: Analysis files are the files created by PDE_analyzer_i() that contain “txt+-” in their names.

  4. The user can browse through all analysis files in the folder to get an overview of the data.

  5. Additional functions can be enabled by loading the PDF folder as well as the TSV file used for the analysis.
    NOTE: Flagging and marking change the file names but can be reversed in the program at any time.


Parameters

PDE_analyzer_i()

NOTE: The corresponding argument of the R function PDE_extr_data_from_pdfs() is listed below each description.

  1. Run
library("PDE")
PDE_analyzer_i()
Figure: PDE_analyzer_i() user interface on Microsoft Windows


Choose the locations for the required files:

Figure: PDE_analyzer_i() user interface - Choose the locations for the required files


  1. Load form from tsv OR Save form as tsv: The filled form can and should be saved as a TSV file at any time. The saved parameters can later be loaded from the saved TSV file.

  2. Reset form: This will clear all fields and variables.

Input/Output:

Figure: PDE_analyzer_i() user interface - Input/Output


  1. Open PDF folder: Open a folder with the PDF files you want to analyze. All PDF files in the folder and its subfolders will be analyzed.
    or
    Load PDF files: Select one or more PDF files you want to analyze (use Ctrl and/or Shift to select multiple). Multiple PDF files will be separated by ; without a space.
    Argument for PDE_extr_data_from_pdfs(): pdfs

  2. Choose what to extract: The PDE analyzer has two main functions, A) PDF2TXT (extract sentences from PDF files) and B) PDF2TABLE (export tables from PDF files to Excel-readable files), which can be combined or executed separately. Each function can be combined with filters and search words. A file with the sentences carrying the search words will have the name format: [search words]txt+-[context][PDF file name] in the corresponding subfolder. Tables will be named: [PDF file name][number of table][table heading].
    Argument for PDE_extr_data_from_pdfs(): whattoextr

  3. Open output folder: All analysis files will be created inside this folder. Choose an empty folder or create a new one as the output directory, since each analysis creates at least one file per analyzed PDF file. If no output folder is chosen, the results will be saved in the R working directory.
    Argument for PDE_extr_data_from_pdfs(): out

  4. Choose output format: The resulting analysis files can be generated either as comma-separated values files (.csv) or as tab-separated values files (.tsv). The former are easier to open and save in Microsoft Excel, while the latter lead to fewer errors when opened in Microsoft Excel (as tabs are rare in texts). Depending on the operating system on which the output files will be opened, it is recommended to choose the Microsoft Windows (WINDOWS-1252), Mac (macintosh) or Linux (UTF-8) encoding.
    Argument for PDE_extr_data_from_pdfs(): out.table.format
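
The Input/Output fields above correspond to plain R values. As a minimal sketch (not part of the package), the following base R code collects all PDF files in a folder and its subfolders, joins them with ";" the way the form does, and prepares an empty output folder. The folder and file names are placeholders created in a temporary directory. The commented-out call at the end only illustrates where the values would go: the values shown for whattoextr and out.table.format, and passing a vector of paths to pdfs, are assumptions that should be checked against ?PDE_extr_data_from_pdfs.

```r
# Placeholder input folder with two empty files standing in for real PDFs
# (hypothetical names, created only so the sketch runs on its own).
pdf_dir <- file.path(tempdir(), "pde_example_pdfs")
dir.create(file.path(pdf_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(pdf_dir, "paper1.pdf"),
            file.path(pdf_dir, "sub", "paper2.pdf"))

# Collect all PDF files in the folder and its subfolders.
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", recursive = TRUE,
                        full.names = TRUE, ignore.case = TRUE)

# The form stores multiple files as one string separated by ";" without spaces.
pdfs_field <- paste(pdf_files, collapse = ";")

# Use a fresh output folder, since at least one file per PDF is created.
out_dir <- file.path(tempdir(), "pde_example_output")
dir.create(out_dir, showWarnings = FALSE)

print(pdfs_field)

# Illustrative call only; all argument values below are assumptions.
# PDE_extr_data_from_pdfs(pdfs = pdf_files,
#                         whattoextr = "tabandtxt",
#                         out = out_dir,
#                         out.table.format = ".csv (WINDOWS-1252)")
```

Keeping the output folder separate from the PDF folder makes it easy to discard the results of one run without touching the input files.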

Parameters:

Figure: PDE_analyzer_i() user interface - Parameters


  1. Enter table headings: Standard scientific articles have their tables labeled with “TABLE”, “TAB”, “Table” or “table” plus a number, and these tables are detected accordingly. If a table is expected to have a different heading, it should be typed in this field. Separate multiple headings with “;” without extra spaces.
    Argument for PDE_extr_data_from_pdfs(): table.heading.words

  2. Table heading case sensitive: For example, for “HEADING”, choosing “no” detects “HEADING”, “heading”, “Heading”, etc., whereas choosing “yes” detects only “HEADING”.
    Argument for PDE_extr_data_from_pdfs(): ignore.case.th

  3. Pixel deviation: For some tables, the heading is slightly indented, which would make the algorithm treat it as a separate column. The pixel deviation sets the maximum indentation that is still considered part of the same column.
    Argument for PDE_extr_data_from_pdfs(): dev

  4. Filter words?: In some cases, only articles on a certain topic should be analyzed. Filter words provide a way to analyze only articles that contain words from a list at least n times.

  5. Filter words: Type in the list of filter words separated by “;” without spaces in between. A hit will be counted every time a word from the list is detected in the article.
    Argument for PDE_extr_data_from_pdfs(): filter.words

  6. Filter words case sensitive: For example, for “Word”, choosing “no” detects “word”, “WORD”, “Word”, etc., whereas choosing “yes” detects only “Word”.
    Argument for PDE_extr_data_from_pdfs(): ignore.case.fw

  7. Filter word times: This is the minimum number of hits (as described above) that must be detected for a paper to be analyzed further. If the threshold is not met, a documentation file can be exported if selected in the documentation section.
    Argument for PDE_extr_data_from_pdfs(): filter.word.times

  8. Search words?: The algorithm can extract sentences, tables, or sentences and tables that contain one of the search words. If the “tables”-only analysis was chosen, the algorithm can also extract all tables detected in the paper (choose this option here). In the latter case, the search words field should remain empty.

  9. Search words: Type in the list of search words separated by “;” without spaces in between.
    Argument for PDE_extr_data_from_pdfs(): search.words

  10. Search words case sensitive: For example, for “Word”, choosing “no” detects “word”, “WORD”, “Word”, etc., whereas choosing “yes” detects only “Word”.
    Argument for PDE_extr_data_from_pdfs(): ignore.case.sw

  11. Number of sentences before and after: When 0 is chosen, only the sentence with the search word is extracted. If a number n is chosen, n sentences before and n sentences after the sentence with the search word will be extracted as well. A sentence is currently defined as text that starts and ends with a “. ” (a period followed by a space).
    Argument for PDE_extr_data_from_pdfs(): context

  12. Evaluate abbreviations?: If “yes” was chosen, all abbreviations that were used in the PDF documents for the search words will be saved and then replaced by abbreviation (search word)$*, e.g., MTX will be replaced by MTX (Methotrexate)$*. In addition, plurals of the abbreviations, i.e., the abbreviation with an “s” at the end, will be replaced accordingly.
    Argument for PDE_extr_data_from_pdfs(): eval.abbrevs
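
The filter word and context rules described above can be illustrated with a short base R sketch. This is not the package's implementation. It only mirrors the documented behavior under simplifying assumptions: words are matched literally, a hit is counted for every occurrence, and sentences are split at a period followed by a space. The helper names split_words(), count_hits() and extract_context() are hypothetical.

```r
# Split a ";"-separated form field (no spaces around ";") into single words.
split_words <- function(field) strsplit(field, ";", fixed = TRUE)[[1]]

# Count every occurrence of every word in the text (literal matching).
count_hits <- function(text, words, ignore.case = FALSE) {
  if (ignore.case) {
    text <- tolower(text)
    words <- tolower(words)
  }
  sum(vapply(words, function(w) {
    m <- gregexpr(w, text, fixed = TRUE)[[1]]
    sum(m > 0)  # gregexpr() returns -1 when there is no match
  }, integer(1)))
}

# Return each sentence containing the word plus `context` sentences
# before and after it (clipped at the start and end of the text).
extract_context <- function(sentences, word, context = 0) {
  hits <- grep(word, sentences, fixed = TRUE)
  lapply(hits, function(i) {
    sentences[max(1, i - context):min(length(sentences), i + context)]
  })
}

text <- paste("Methotrexate (MTX) was given weekly.",
              "The dose was 10 mg.",
              "No adverse events occurred.",
              "Follow-up lasted 12 weeks.")

# Filter step: analyze the text only if at least 2 hits are found.
filter_words <- split_words("methotrexate;MTX")
hits_sensitive <- count_hits(text, filter_words)
hits_insensitive <- count_hits(text, filter_words, ignore.case = TRUE)
passes_filter <- hits_insensitive >= 2

# Search step: one sentence of context on each side of the hit.
sentences <- strsplit(text, "(?<=\\.) ", perl = TRUE)[[1]]
dose_context <- extract_context(sentences, "dose", context = 1)
```

In this example the case-sensitive count misses the capitalized “Methotrexate”, which is why the case sensitivity setting can decide whether a paper passes the filter word threshold.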

Documentation/Debugging:

Figure: PDE_analyzer_i() user interface - Documentation/Debugging


  1. Table values in file: This option is relevant only when “tables” detection/export was chosen. For “yes”, a separate file will be generated with the headings of all tables, their relative location in the generated HTML and TXT files, and information on whether search words were found. The file names start with “htmltablelines”, “txttablelines” or “keeplayouttablelines”, followed by the PDF file name, and the files can be found in the html.docu, txt.docu and keeptxt.docu subfolders.
    Argument for PDE_extr_data_from_pdfs(): write.table.locations

  2. Export tables with problems: For “yes”, if a table was detected in a PDF file but is an image or cannot be read, the page with the table will be exported as a portable network graphics (PNG) file. The documentation file will have the name format: [PDF name]page[page number]w.table-[page number].png
    Argument for PDE_extr_data_from_pdfs(): exp.nondetc.tabs

  3. Table documentation files?: For “yes”, if search words are used for table detection and no search words were found in the tables of a PDF file, a file will be created with the PDF name followed by “no.table.w.search.words” in the folder with the name no_tab_w_sw.
    Argument for PDE_extr_data_from_pdfs(): write.tab.doc.file

  4. Sentence documentation file?: For “yes”, if no search words were found in the sentences of a PDF file, a file will be created with the PDF file name followed by “no.txt.w.search.words” in the no_txt_w_sw folder. If the PDF file is empty, a file will be created with the PDF file name followed by “non-readable” in the nr folder. For PDF files that were filtered out by the filter words, a file will be created with the PDF file name followed by “no.txt.w.filter.words” in the excl_by_fw folder.
    Argument for PDE_extr_data_from_pdfs(): write.txt.doc.file

  5. Delete intermediate files: The program generates a txt, keeplayouttxt and HTML copy of the PDF file, which will be deleted if intermediate file deletion is chosen. If deletion was accidentally not chosen, there are two ways to delete the .txt and .html files afterwards. 1) Slow and easy option: Rerun the analysis with this option set to yes. 2) Quick and slightly more complicated option: Open the file explorer and search for *.txt and *.html in the PDF folder. Then select all files and folders in the search result and press delete.
    Argument for PDE_extr_data_from_pdfs(): delete
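
The manual cleanup described in the last item can also be done from R. The following is a minimal base R sketch, not a function of the package. It assumes the leftover intermediate files sit in the PDF folder or its subfolders and end in .txt or .html; the folder and file names are placeholders created in a temporary directory. Note that the pattern also matches any unrelated .txt or .html files stored in the same folder, and that the sketch removes files only, not folders.

```r
# Placeholder PDF folder with one stand-in PDF and two leftover intermediate
# files (hypothetical names, created only so the sketch runs on its own).
pdf_dir <- file.path(tempdir(), "pde_example_cleanup")
dir.create(pdf_dir, showWarnings = FALSE)
file.create(file.path(pdf_dir, c("paper1.pdf", "paper1.txt", "paper1.html")))

# List .txt and .html files in the PDF folder and its subfolders.
leftovers <- list.files(pdf_dir, pattern = "\\.(txt|html)$",
                        recursive = TRUE, full.names = TRUE)

# Review the list before deleting anything.
print(basename(leftovers))

# Delete the listed files; the PDF files themselves are not matched.
removed <- file.remove(leftovers)
```

Printing the list before calling file.remove() gives a chance to spot files that should be kept.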


PDE_reader_i()