mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines

Utilities for preprocessing text corpora into data structures suitable for natural language models: integer sequences or matrices, vocabulary embedding matrices, and term-document, document-term, and term co-occurrence matrices. All functions support full or partial hashing of the terms in the vocabulary.
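
The data structures named above can be illustrated with a small, language-neutral sketch. The Python below is not the mlvocab API: every function name (build_vocab, bucket, term_index, to_sequences, doc_term_matrix) and every parameter (max_terms, n_buckets) is hypothetical and chosen only for illustration. It assumes whitespace tokenization and reads "partial hashing" as: terms kept in the vocabulary get their own index, and all other terms are hashed into a fixed number of extra buckets. How the package itself implements hashing may differ.

```python
import hashlib
from collections import Counter


def build_vocab(corpus, max_terms):
    """Map the max_terms most frequent terms to indices 0..max_terms-1."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    terms = sorted(counts, key=lambda t: (-counts[t], t))[:max_terms]
    return {term: i for i, term in enumerate(terms)}


def bucket(term, n_buckets):
    """Hash a term into one of n_buckets using a stable (unsalted) digest."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def term_index(term, vocab, n_buckets):
    """Known terms keep their vocabulary index; unknown terms share hash buckets."""
    if term in vocab:
        return vocab[term]
    return len(vocab) + bucket(term, n_buckets)


def to_sequences(corpus, vocab, n_buckets):
    """Turn each document into a list of integer term indices."""
    return [[term_index(t, vocab, n_buckets) for t in doc.split()] for doc in corpus]


def doc_term_matrix(seqs, n_cols):
    """Dense document-term count matrix: one row per document, one column per index."""
    rows = []
    for seq in seqs:
        row = [0] * n_cols
        for ix in seq:
            row[ix] += 1
        rows.append(row)
    return rows


corpus = ["the cat sat", "the dog sat down", "a cat and a dog"]
n_buckets = 2
vocab = build_vocab(corpus, max_terms=4)
seqs = to_sequences(corpus, vocab, n_buckets)
dtm = doc_term_matrix(seqs, n_cols=len(vocab) + n_buckets)
```

In this sketch the matrix has a fixed width of len(vocab) + n_buckets columns, so terms that were never seen when the vocabulary was built still map to a valid column. A real implementation would normally store the matrix in sparse form; the dense lists here only keep the example short.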

Version: 0.0.1
Depends: R (≥ 3.4.0)
Imports: Rcpp (≥ 0.12), Matrix, digest (≥ 0.6.8), sparsepp (≥ 0.2.0)
LinkingTo: Rcpp, digest (≥ 0.6.8), sparsepp (≥ 0.2.0)
Suggests: testthat, knitr
Published: 2018-04-13
Author: Vitalie Spinu [aut, cre]
Maintainer: Vitalie Spinu <spinuvit at>
License: GPL-3
NeedsCompilation: yes
SystemRequirements: C++11
Materials: README
CRAN checks: mlvocab results


Reference manual: mlvocab.pdf
Package source: mlvocab_0.0.1.tar.gz
Windows binaries: r-devel:, r-release:, r-oldrel:
OS X binaries: r-release: mlvocab_0.0.1.tgz, r-oldrel: mlvocab_0.0.1.tgz


