boilerpipeR: Interface to the Boilerpipe Java Library

Generic extraction of main text content from HTML files; removal of ads, sidebars, and headers using the boilerpipe (http://code.google.com/p/boilerpipe/) Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
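
As a rough illustration of the extraction described above, the following sketch passes an HTML string to one of the package's extractor functions. The sample page is invented for this example, and the code assumes that boilerpipeR and a working Java runtime (required by rJava) are installed; if they are not, the call is skipped.

```r
# Minimal sketch: extract the main text from an HTML string with boilerpipeR.
# The HTML below is a made-up sample page, not taken from the package.
sentence <- "This sentence stands in for the main article text of the page."
html <- paste0(
  "<html><head><title>Demo page</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<div id='article'><h1>Sample headline</h1>",
  "<p>", paste(rep(sentence, 8), collapse = " "), "</p></div>",
  "<div id='footer'>Copyright notice</div>",
  "</body></html>"
)

if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # ArticleExtractor() targets news-style articles; DefaultExtractor() is a
  # more general-purpose alternative exported by the same package.
  main_text <- boilerpipeR::ArticleExtractor(html)
} else {
  # Assumption for this sketch: fall back to NA when the package is missing.
  main_text <- NA_character_
  message("boilerpipeR is not available; skipping extraction.")
}

print(main_text)
```

For a live page, the HTML could first be fetched as a string, for example with `RCurl::getURL()` from the suggested RCurl package, and then passed to the extractor in the same way.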

Version: 1.3
Imports: rJava
Suggests: RCurl
Published: 2015-05-11
Author: See AUTHORS file.
Maintainer: Mario Annau <mario.annau at gmail.com>
BugReports: https://github.com/mannau/boilerpipeR/issues
License: Apache License (== 2.0)
URL: https://github.com/mannau/boilerpipeR
NeedsCompilation: no
Materials: NEWS
In views: NaturalLanguageProcessing, WebTechnologies
CRAN checks: boilerpipeR results

Downloads:

Reference manual: boilerpipeR.pdf
Vignettes: Introduction to the tm.plugin.webmining Package
Package source: boilerpipeR_1.3.tar.gz
Windows binaries: r-devel: boilerpipeR_1.3.zip, r-release: boilerpipeR_1.3.zip, r-oldrel: boilerpipeR_1.3.zip
OS X Snow Leopard binaries: r-oldrel: boilerpipeR_1.3.tgz
OS X Mavericks binaries: r-release: boilerpipeR_1.3.tgz
Old sources: boilerpipeR archive

Reverse dependencies:

Reverse imports: tm.plugin.webmining