Package: randomSurvivalForest Version: 3.6.3 BUILD: bld20100520 --------------------------------------------------------------------------------- CHANGES TO RELEASE 3.6.3 RELEASE 3.6.3 represents a minor update of the product, and will not affect most users of the prior version of the product. Key changes are as follows: o Updates and streamlining of Rd files. o Removal of unnecessary data sets. NOTE: The 3.6.x stream represents the last and final major upgrade of this product. Current and future functionality will migrate to the new CRAN package, Random Forests for Survival, Regression, and Classification, that will be released in the coming months. --------------------------------------------------------------------------------- CHANGES TO RELEASE 3.6.2 RELEASE 3.6.2 represents a minor update of the product, and will not affect most users of the prior version of the product. Key changes are as follows: o Updates to Rd files. o Minor updates to R functions. o Removal of all trace functionality to decrease execution time. NOTE: The 3.6.x stream represents the last and final major upgrade of this product. Current and future functionality will migrate to the new CRAN package, Random Forests for Survival, Regression, and Classification, that will be released in the coming months. --------------------------------------------------------------------------------- RELEASE 3.6.1 represents a critical update of the product resolving platform portability on 64-bit Windows machines. Key changes are as follows: o Fix to 64-bit Windows build resulting in naming conflict with variable reserved for implementation. No other platforms affected. Thanks to Brian Ripley for this. NOTE: The 3.6.x stream represents the last and final major upgrade of this product. Current and future functionality will migrate to the new CRAN package, Random Forests for Survival, Regression, and Classification, that will be released in the coming months. --------------------------------------------------------------------------------- RELEASE 3.6.0 represents the last and final major upgrade of this product. Current and future functionality will migrate to the new CRAN package, Random Forests for Survival, Regression, and Classification, that will be released in the coming months. Key changes to the current release are as follows: o The ability to fully analyze competing risk data including ensemble estimation, error rates, and VIMP by event type. Can also predict on test data. Missing data imputation is also available. See rsf() and competing.risk() for details. o Automatic variable selection using minimal depth theory implemented in the new function varSel(). Also see the core function max.subtree(). o Pairwise interactions between variables using minimal depth theory. See find.interaction() for more details. o The ability to analyze mean minimal depth information by individual (for variable selection at the individual level). See the documentation for rsf() for more information. o Impute only mode for extremely fast imputation of data. See the documentation for impute.rsf() fro more details. o The ability to export forests to a Java (JUNG) based plug-in. This allows one to visualize individual trees in the forest. See the documentation for rsf2rfz() for more information. o Modification to random splitting to grow deeper trees. o Fix of memory leak in INTERACTION mode. Thanks to Xi Chen for finding this feature. Other miscellaneous bug fixes. --------------------------------------------------------------------------------- Release 3.5.1 represents a minor upgrade of the product, and will not affect most users of the prior version of the product. Key changes are as follows: o An adaptive upper bound for the number of splits attempted is now in force. It is not greater than the number of data points in a node. --------------------------------------------------------------------------------- Release 3.5.0 represents a significant upgrade in the functionality of the product. Key changes are as follows: o A random version of all computed split rules is now available. See the documentation for more details. o The ability to handle factors has been implemented. If the factor is ordered, then splits are similar to real valued variables. If the factor is unordered, a split will move a subset of the levels in the parent node to the left daughter, and the complementary subset to the right daughter. All possible complementary pairs are considered and apply to factors with an unlimited number of levels. However, for deterministic splitting there is an optimization check to ensure that the number of splits is not greater than the number of complementary pairs in a node (this internal check will be overridden in random splitting mode if nsplit is set high enough). Note that when predicting on test data involving factors, the factor labels in the test data must be the same as in the grow (training) data. Consider setting labels that are unique in the test data to missing. --------------------------------------------------------------------------------- Release 3.2.3 represents a critical update of the product resolving unexpected program termination under certain conditions. Key changes are as follows: o In the case of missing data, it is possible for imputation to fail catastrophically in PREDICT or INTERACTION mode. This can occur when the GROW data set has no missing data, but the non-GROW data set does have missingness. Thanks to Andy J. Minn for finding this feature. o In the case of a single predictor, PREDICT mode can fail to dimension the data properly causing unpredictable results. Thanks to Eric A. Macklin for finding this feature. --------------------------------------------------------------------------------- Release 3.2.2 represents a critical update of the product resolving unexpected program termination under certain conditions. Key changes are as follows: o In the case of missing data, it is possible for naive imputation to fail catastrophically in PREDICT or INTERACTION mode. This can occur when the dimension of the mode specific data set is less than the GROW data set. Thanks to Andy J. Minn for finding this feature. --------------------------------------------------------------------------------- Release 3.2.1 represents a minor upgrade of the product, and will not affect most users of the prior version of the product. Key changes are as follows: o An additional option in INTERACTION mode can now be specified. The option 'rough' can significantly decrease the processing time necessary to determine variable importance for some large data sets. This is accomplished by replacing the cumulative hazard function that is based on the Nelson-Aalen estimate with the mean death time in a node as the measure of mortality. See the documentation for more details. o A minor but longstanding (since Release 1.0.0) issue in booking potential split points has been discovered. The problem manifests itself by slightly favoring splits on continuous covariates under certain conditions such as low values of 'mtry'. However, it is not expected that users will notice significant changes. The issue has been resolved. --------------------------------------------------------------------------------- Release 3.2.0 represents a significant upgrade in the functionality of the product. Key changes are as follows: o A second method of perturbing the data set in order to calculate variable importance (VIMP) has been implemented. In addition to permuting the values for a single variable, a random split approach has been taken in which a data point is randomly assigned to the left or right daughter node when a split occurs on the specified variable. o The joint VIMP among multiple variables of a (potentially proper) subset of the GROW data can now be calculated using the new function interaction.rsf(). This represents a third mode of operation (INTERACTION) for the application, and follows rsf.default (GROW) and predict.rsf (PREDICT). See the documentation for details. o An additional option in GROW mode can now be specified. The option 'varUsed' allows users to quantify which variables have been split upon within a single tree or over the entire forest. See the documentation for more details. o The ability to multiply impute data has been implemented. This involves imputing data while growing a forest and using the results to grow a new forest in order to better impute the data. o In GROW mode, the application now outputs both the in-bag and OOB summary imputed values. o An additional split rule 'randomsplit' has been implemented. See the documentation for more details. o The split rule 'logrankscore' is now calculated correctly. o The split rule 'logrankapprox' has been removed and replaced by the new split rule 'logrankrandom'. See the documentation for more details. --------------------------------------------------------------------------------- Release 3.0.1 represents a minor upgrade of the product, and will not affect most users of the prior version of the product. Key changes are as follows: o A slight adjustment has been made to the variance used in the "logrankapprox" splitting rule. This fixes the issue where a sizable fraction of trees in the forest were being stumped in some examples. o Illegal syntax fixes to C-code that manifest themselves on some compilers. Thanks to Brian Ripley for pointing this out. --------------------------------------------------------------------------------- Release 3.0.0 represents a major upgrade in the functionality of the product. Key changes are as follows: o Missing data can be imputed in both GROW and PREDICT mode. This applies to variables as well as time and censoring outcome values. Values are imputed dynamically as the tree is grown using a new tree imputation methodology. This produces an imputed forest which can be used for prediction purposes on test data sets with missing data. o Importance values for variables are returned in PREDICT mode when test data contains outcomes as well as variables. o Fixed some bugs in plot.variable(). Thanks to Andy J. Minn for pointing this out. o Minor modification of PMML representation of RSF forest output to accomodate imputation. The method of random seed chain recovery has been altered. Note that forests produced with prior releases will have to be regenerated using this release. We apologize for the inconvenience. --------------------------------------------------------------------------------- Release 2.1.0 represents a minor upgrade of the product, and will not affect most users of the prior version of the product. Key changes are as follows: o R 2.5.0 compliance issues and necessitated modifications. o Modification of PMML representation of RSF forest output. The RSF custom extension has been moved from the DataDictionary node to a new MiningBuildTask node. Note that forests produced with Release 2.0.0 will have to be regenerated using Release 2.1.0. We apologize for the inconvenience. o Fast processing of data involving large numbers of predictors (as in many genomic examples) by using the option big.data=TRUE. This option bypasses the huge overhead needed by R in creating design matrices and parsing formula. However, users should be aware of some side effects. See the RSF help file for more details. Thanks to Steven (Xi) Chen for pointing out the problem. o Only the top 100 predictors are now printed to the terminal when calling plot.error(). This deals with settings as above when one might have thousands of predictors. o Introduced a new wrapper "find.interaction()" for testing of pairwise interactions between predictors. --------------------------------------------------------------------------------- Release 2.0.0 represents a major upgrade in the functionality and stability of the original 1.0.0 release. Key changes are as follows: o Two new splitting rules, 'logrankscore' and 'logrankapprox', added. o Expanded output from 'rsf()'. Now out-of-bag objects 'oob.ensemble' and 'oob.mortality' are included in addition to the full ensemble objects 'ensemble' and 'mortality'. o Importance values for predictors can now be calculated (set 'importace = TRUE' in the initial 'rsf()' call). Extended 'plot.error()' to print, as well as plot, such values. o Prediction on test data can now be implemented using 'rsf.predict()' (set 'forest = TRUE' in the initial 'rsf()' call). o Included option 'predictorWt' used for weighted sampling of predictors when growing a tree. o Formula no longer restricted to main effects. Formula for 'rsf' interpreted as in typical R applications. However, users should be aware that including interactions or higher order terms in a formula may not be an optimal way to grow a forest. o Three types of objects are generated in an RSF analysis: '(rsf, grow)', '(rsf, predict)' and '(rsf, forest)'. Wrappers handle each type of object in different ways. o Improved error checking in all wrappers. o Extended 'plot.variable()' wrapper to generate parial plots for predictors. o Improved control over trace output. See the 'do.trace' option in 'rsf()'. o Implements the Predictive Model Markup Language specification for an '(rsf, forest)' forest object. PMML is an XML based language which provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications. More information about PMML and the Data Mining Group can be found at http://www.dmg.org. Our implementation gives the user the ability to save the geometry of a forest as a PMML XML document for export or later retrieval. ---------------------------------------------------------------------------------