A small release correcting some minor bugs.
Fixed a bug in estimateEffect when a character or factor variable had 2+ levels with the same number of observations. Thanks to APuzyk on GitHub for catching this.
The last release turned on a different recovery method for the spectral algorithm by default. Changed the default back to exponential gradient as documented. Thanks to Simone Zhang for this catch.
Better defaults for some of the labeling in plot.STM
Better argument matching and errors for plot.estimateEffect
Small updates to the documentation and vignette.
Thanks to efforts by Ken Benoit, stm can now take a quanteda dfm object.
Thanks to help from Chris Baker we are now using roxygen for our documentation.
Jeffrey Arnold fixed a small bug in toLDAvis. Thanks!
Carsten Schwemmer helped us find a bug where plot.estimateEffect() didn't work when dplyr was loaded. This is now fixed.
In making the change to roxygen we unfortunately broke backwards compatibility. The package's S3 methods such as plot.estimateEffect() and plot.STM() can now only be called through the generic plot() rather than by their full name.
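As a minimal sketch of the new calling convention, the following uses the gadarian data and the gadarianFit model that ship with stm. The choice of covariate and plot options is only for illustration.

```r
library(stm)

# gadarianFit is a fitted 3-topic model shipped with stm; gadarian is its metadata.
# Call the generic plot(); S3 dispatch selects the STM method.
plot(gadarianFit, type = "summary")

# The same applies to estimateEffect objects.
prep <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
plot(prep, "treatment", model = gadarianFit, method = "pointestimate")
```

Code that previously called plot.STM(...) or plot.estimateEffect(...) directly only needs the function name changed to plot(...).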
We document and export optimizeDocument which provides access to the document level E-step.
We have documented and exported several of the labeling functions including calcfrex, calcscore, calclift and js.estimate. These are marked with keyword internal because they don't have much error checking and most users will want labelTopics anyhow. But they can be accessed with ? and are linked from labelTopics.
After much popular demand we have released a fitNewDocuments() function which will calculate topic proportions for documents not used to fit the models. There are many different options here.
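A minimal sketch of the simplest use, assuming the default options and the poliblog5k data shipped with stm. For illustration only, the first five training documents stand in for new documents; genuinely new documents must first be indexed against the model's vocabulary.

```r
library(stm)

# Fit a small model on the poliblog5k data shipped with stm.
# max.em.its is kept tiny so the sketch runs quickly; run real models to convergence.
set.seed(2138)
mod <- stm(poliblog5k.docs, poliblog5k.voc, K = 3,
           max.em.its = 2, verbose = FALSE)

# Treat the first five documents as "new" (an assumption made for brevity).
new_fit <- fitNewDocuments(model = mod, documents = poliblog5k.docs[1:5])

# One row of estimated topic proportions per new document.
new_fit$theta
```

With no further arguments the prior on topic prevalence is averaged over the training documents. See ?fitNewDocuments for the covariate-based alternatives.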
estimateEffect now has a summary function which will make regression tables
A number of the internals to plot.estimateEffect have been improved, which should eliminate some edge case bugs.
Much of the documentation has been updated as has the vignette.
Our wrapper s() for the splines package function bs() now has predict functions associated with it so it should work in contexts like lm()
We have documented and exported all the metrics for searchK()
We made a change in the spectral initialization which ensures that only the top 10000 (a modifiable default) words are used in the initialization. This allows it to be used effectively with much larger vocab.
Added a modifiable max iteration timeout error for the prevalence regression in stm. This will only matter for people using covariate sets which are very, very large.
Added a new recovery algorithm for spectral initialization. In addition to exponential gradient descent, there is now a more accurate and generally faster algorithm based on quadratic programming. This can be turned on with control=list(recoverEG=FALSE). Eventually we may change the default recovery method. Note: while this more accurately solves the actual problem, we've seen better results with the early stopping produced by exponential gradient not fully converging. This has been confirmed by the Arora group as well.
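A minimal sketch of switching between the two recovery methods, assuming the poliblog5k data shipped with stm. The number of topics and the iteration cap are arbitrary choices made to keep the example fast.

```r
library(stm)

set.seed(2138)

# Default: exponential gradient recovery in the spectral initialization.
mod_eg <- stm(poliblog5k.docs, poliblog5k.voc, K = 3, init.type = "Spectral",
              max.em.its = 2, verbose = FALSE)

# Opt in to the quadratic programming recovery instead.
mod_qp <- stm(poliblog5k.docs, poliblog5k.voc, K = 3, init.type = "Spectral",
              control = list(recoverEG = FALSE),
              max.em.its = 2, verbose = FALSE)
```

Only the initialization differs between the two calls; the variational EM steps that follow are the same.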
Clarified some of the documentation in textProcessor thanks to James Gibbon.
Fixed a problem with registration of S3 methods for textProcessor()
Added access to the information criterion parameter for L1 mode prevalence covariates in stm. See the gamma.ic.k option in the control parameters of stm.
searchK() can now be used with content covariates thanks to GitHub user rosemm.
Added a querying function based on data.table into findThoughts().
Fixed a rare bug in the K=0 feature for spectral initialization where words with the exact same appearance pattern would cause the projection to fail.
Fixed the unexported findTopic()
Improved some documentation
Small fine-tuning in toLDAvis.
Fixed a small bug that caused readCorpus to fail on dense document term matrices
Fixed a small bug in the random projections algorithm
Improved warnings in stm when restarting models (Hat tip to Andrew Goldstone)
Added the convertCorpus function for converting documents in stm format to other formats.
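A minimal sketch of a conversion, assuming the poliblog5k data shipped with stm:

```r
library(stm)

# Convert the poliblog5k documents (stm's list format) to a slam
# simple triplet matrix and to a sparse matrix from the Matrix package.
poli_slam <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type = "slam")
poli_mat  <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type = "Matrix")

class(poli_slam)
dim(poli_mat)
```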
Formatting changes to the vignette
Performance improvements via various optimizations including porting some components to C++
Various new experimental features including K=0
Improved documentation including a new version of the vignette.
Better error messages in several places
Experimental options for random projections with spectral initializations
Fixes a problem in make.heldout where a document could be completely emptied by the procedure. Hat tip to Jesse Rhodes for the bug report.
gamma.prior="L1" now coerces the mu object back to a matrix class object. This should fix a speed hit introduced in 1.0.10 for this case.
Prevalence covariates can now use sparse matrices which will result in better performance for large factors.
textProcessor() and prepDocuments() now do a better job of preserving labels and keeping track of dropped elements. Special thanks to GitHub users gtuckerkellog and elbamos for pull requests.
Fixed an edge case in init.type="Spectral" where words appearing only in documents by themselves would throw an error. The error was correct but hard to address in certain cases, so now it temporarily removes the words and then reintroduces them before starting inference, spreading a tiny bit of mass evenly across the topics. Hat tip to Nathan Sanders for bringing this to our attention.
New function findTopic() which helps locate topics containing particular words or phrases.
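A minimal sketch, assuming the gadarianFit model shipped with stm. The search terms are stems because that model's vocabulary was stemmed.

```r
library(stm)

# Report which topics in the fitted model contain any of these
# (stemmed) words among their top words.
findTopic(gadarianFit, c("poor", "immigr", "peopl"))
```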
New function topicLasso() helps build predictive models with topics.
Fixed a minor bug in prepDocuments which arises in cases where there are vocab elements which do not appear in the data.
Fixed a minor bug in frex calculation that caused some models not to label.
Fixed a minor bug in searchK that caused heldout results to report incorrectly.
Rewrite of plot.estimateEffect() which fixed a bug in some interaction models. Also returns results invisibly for creating custom plots.
Increased the stability of the spectral methods for stm initialization.
Complete rewrite of plotRemoved() which makes it much faster for larger datasets.
A minor patch to deal with textProcessor() in older versions of R.
Large changes, many of which are not backwards compatible.
Numerous speed improvements to the core algorithm.
Introduction of several new options for the core stm function including spectral initialization, memoized inference, and model restarts.
Content covariate models are now estimated using the distributed multinomial formulation which is dramatically faster. Default prior also changed to L1.
Handling of document level convergence was changed to ensure positive definiteness in the document-level covariance matrices
Fixed bug in binary/binary interactions.
Numerous new diagnostic and summary functions
Expanded the console printing of many of the preprocessing functions.
Fixed an error with vignettes building on Linux machines.
sageLabels exported but not documented
factorCheck diagnostic function exported
Bug fix in the semanticCoherence function that affected content covariate models.
Bug fix to plot.STM(): for content covariate models where only a subset of topics was requested, the labels would show up as mostly NA. Thanks to Jetson Leder-Luis for pointing this out.
Bug fix for the readCorpus() function with txtorg vocab. Thanks to Justin Farrell for pointing this out.
Added some diagnostics to notify the user when words have been dropped in preprocessing.
Automatically coerce dates to numeric in spline function.
Very minor change to textProcessor() to accommodate an API change in tm version 0.6.
New option for plot.STM() which plots the distribution of theta values. Thanks to Antonio Coppola for coauthoring this component.
Deprecated option "custom" in "labeltype" of plot.STM(). Now you can simply specify the labels. Added additional functionality to specify custom topic names rather than the default "Topic #:"
Bug fixes to various portions of plot.STM() that would cause labels to not print.
Added numerous error messages.
Added permutationTest() function and associated plot capabilities
Updates to the vignette.
Added functionality to a few plotting functions.
When using summary() and labelTopics(), content covariate models now have labels thresholded by a small value. Thus one may see no labels or very few labels, particularly for topic-covariate interactions, which indicates that there are no sizable positive deviations from the baseline.
S3 method for findThoughts and ability to threshold by theta.
Allow estimateEffect() to receive a data frame. (Thanks to Baoqiang Cao for pointing this out)
Major updates to the vignette
Minor Updates to several plotting functions
Fixed an error where labelTopics() would mislabel when passed topic numbers out of order (Thanks to Jetson Leder-Luis for pointing this out)
Introduction of the termitewriter function.
Version for submission to CRAN (2/28/2014)
Introduced new dataset poliblog5k and shrunk the footprint of the package
Numerous alternate options changed and some slight syntax changes to stm to finalize the API.
New build 2/14/2014
Fixing a small bug introduced in the last version which kept defaults of manyTopics() from working.
Updated version posted to Github (2/13/2014)
Various improvements to plotting functions.
Setting the seed in selectModel() threw an error. This is now corrected. Thanks to Mark Bell for pointing this out.
First public version released on Github (2/5/2014)
This is a beta release and we may change some of the API before submission to CRAN.