Generic extraction of main text content from HTML files; removes ads, sidebars, and headers using the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
| Version: | 1.0 |
| Depends: | rJava |
| Published: | 2012-12-21 |
| Author: | Mario Annau [aut, cre] |
| Maintainer: | Mario Annau <mario.annau at gmail.com> |
| License: | Apache License (== 2.0) |
| NeedsCompilation: | no |
| CRAN checks: | boilerpipeR results |
| Package source: | boilerpipeR_1.0.tar.gz |
| MacOS X binary: | boilerpipeR_1.0.tgz |
| Windows binary: | boilerpipeR_1.0.zip |
| Reference manual: | boilerpipeR.pdf |
| Vignettes: | Introduction to the tm.plugin.webmining Package |
| Reverse depends: | tm.plugin.webmining |