text2vec 0.5.1 [2018-01-10]
- Removed rank* columns from collocation_stat - they were never used internally, and users can easily calculate ranks themselves
- Added Bi-Normal Separation transformation, thanks to Pavel Shashkin (@pshashk)
- Added Dunning's log-likelihood ratio for collocations, thanks to Chris Lee (@Chrisss93)
- Early stopping for collocation learning
- Fixed several bugs: #219, #217, #205

text2vec 0.5.0

- Decreased number of dependencies - no more
- Removed distributed LDA, which didn't work correctly
- Tokenization is now based on the tokenizers and stringi packages
- Models API follows the mlapi package. No API changes on the text2vec side - we just moved the abstract scikit-learn-like classes to a separate package in order to make them more reusable
- Added additional filters to prune_vocabulary - filter by document counts
- Cleaned up LSA and fixed its transform method. Added an option to use a randomized SVD algorithm from
- API breaking change - vocabulary format changed - now a plain data.frame with meta-information in attributes (stopwords, ngram, number of docs, etc.)
- No longer relies on RcppModules
- API breaking change - removed lda_c from formats in DTM construction
- itoken_parallel high-level functions for parallel computing
- API breaking change - chunks_number parameter renamed to
- API breaking change - removed create_corpus from public API; moved co-occurrence-related options from vectorizers to create_tcm
- Added ability to supply custom weights for co-occurrence statistics calculations
- Noticeable speedup (1.5x) and an even more noticeable reduction in memory usage (2x less!) for create_tcm. The package now relies on the sparsepp library for its underlying hash maps
- Collocations - detection of multi-word phrases using different heuristics: PMI, gensim, LFMD
- Fixed bug in

text2vec 0.4.0 [2016-10-03]

See the 0.4 milestone tags.
- Now under GPL (>= 2) Licence
- “immutable” iterators - no need to reinitialize them
- unified models interface
- New models: LSA, LDA, GloVe with L1 regularization
- Fast similarity and distance calculations: Cosine, Jaccard, Relaxed Word Mover's Distance, Euclidean
- Better handling of UTF-8 strings, thanks to @qinwf
- Iterators and models rely on

text2vec 0.3.0

- 2016-01-13 fix for #46, thanks to @buhrmann for reporting
- 2016-01-16 format of vocabulary changed:
  - do not keep doc_proportions; see #52
  - added stop_words argument to prune_vocabulary; its signature also changed
- 2016-01-17 fix for #51. If an iterator over tokens returns a list with names, these names will be:
  - stored as
  - rownames in dtm
  - names for dtm list in
- 2016-02-02 high-level functions for corpus and vocabulary construction:
  - construction of vocabulary from a list of
  - construction of dtm from a list of
- 2016-02-10 renamed transformers - all transformers now start with transform_* - more intuitive, and simpler usage with autocompletion
- 2016-03-29 (accumulated since 2016-02-10)
  - new functions
  - all core functions can benefit from multicore machines (users have to register a parallel backend themselves)
  - fixed progress bars - they now reach 100%, and ticks are incremented after computation
  - added ids argument to itoken - simplifies assignment of ids to rows of the DTM
  - create_vocabulary now can handle
  - see all updates here
- 2016-03-30 more robust

text2vec 0.2.0 [2016-01-10]
First CRAN release of text2vec.
- Fast text vectorization with stable streaming API on arbitrary n-grams.
- Functions for vocabulary extraction and management
- Hash vectorizer (based on digest murmurhash3)
- Vocabulary vectorizer
- GloVe algorithm for word embeddings
- Fast term co-occurrence matrix factorization via parallel async AdaGrad
- All core functions written in C++.