The ChangeLog for data.table is the SVN commit log itself :
$ svn log svn://svn.r-forge.r-project.org/svnroot/datatable

This NEWS file summarises the main changes.

**********************************************
**                                          **
**   CHANGES IN DATA.TABLE VERSION 1.8.0    **
**                                          **
**********************************************

NEW FEATURES

    o   character columns are now allowed in keys and are preferred to factor.
        data.table() and setkey() no longer coerce character to factor. Factors
        are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.

    o   setkey() no longer sorts factor levels. This should be more convenient
        and compatible with ordered factors where the levels are 'labels', in
        some order other than alphabetical. The established advice to paste each
        level with an ordinal prefix, or use another table to hold the factor
        labels instead of a factor column, is no longer needed. Solves FR#1420.
        Thanks to Damian Betebenner and Allan Engelhardt for raising this on
        datatable-help; their tests have been added verbatim to the test suite.

    o   unique(DT) and duplicated(DT) are now faster with character columns,
        on unkeyed tables as well as keyed tables, FR#1724.

    o   New function set(DT,i,j,value) allows fast assignment to elements
        of DT. Similar to := but avoids the overhead of [.data.table, so is
        much faster inside a loop. Less flexible than :=, but as flexible
        as matrix subassignment. Similar in spirit to setnames(), setcolorder(),
        setkey() and setattr(); i.e., assigns by reference with no copy at all.

            M = matrix(1,nrow=100000,ncol=100)
            DF = as.data.frame(M)
            DT = as.data.table(M)
            system.time(for (i in 1:1000) DF[i,1L] <- i)   # 591.000s
            system.time(for (i in 1:1000) DT[i,V1:=i])     #   1.158s
            system.time(for (i in 1:1000) M[i,1L] <- i)    #   0.016s
            system.time(for (i in 1:1000) set(DT,i,1L,i))  #   0.027s

    o   New functions chmatch() and %chin%, faster versions of match()
        and %in% for character vectors. R's internal string cache is
        utilised (no hash table is built).
        They are about 4 times faster than match() on the example in ?chmatch.

    o   Internal function sortedmatch() removed and replaced with chmatch()
        when matching i levels to x levels for columns of type 'factor'. This
        preliminary step was causing a (known) significant slowdown when the
        number of levels of a factor column was large (e.g. >10,000).
        Exacerbated in tests of joining four such columns, as demonstrated by
        Wes McKinney (author of Python package Pandas). Matching 1 million
        strings of which 600,000 are unique is now reduced from 16s to 0.5s,
        for example. Background here :
        http://stackoverflow.com/questions/8991709/why-are-pandas-merges-in-python-faster-than-data-table-merges-in-r

    o   rbind.data.table() gains a use.names argument, by default TRUE.
        Set to FALSE to combine columns in order rather than by name. Thanks to
        a question by Zach on Stack Overflow :
        http://stackoverflow.com/questions/9315258/aggregating-sub-totals-and-grand-totals-with-data-table

    o   New argument 'keyby'. An ad hoc by, just like 'by', but with an
        additional setkey() on the by columns of the result, for convenience.
        Not to be confused with a 'keyed by' such as DT[...,by=key(DT)] which
        can be more efficient as explained by FAQ 3.3. Thanks to Yike Lu for
        the suggestion and discussion (FR#1780).

    o   Single by (or keyby) expressions no longer need to be wrapped in
        list(), for convenience, implementing FR#1743; e.g., these now work :
            DT[,sum(v),by=a%%2L]
            DT[,sum(v),by=month(date)]
        instead of needing :
            DT[,sum(v),by=list(a%%2L)]
            DT[,sum(v),by=list(month(date))]

    o   Unnamed 'by' expressions have always been inspected using all.vars()
        to make a guess at a sensible column name for the result. This guess
        now includes function names via all.vars(functions=TRUE), for
        convenience; e.g.,
            DT[,sum(v),by=month(date)]
        now returns a column called 'month' rather than 'date'.
        It is more robust to explicitly name columns, though; e.g.,
            DT[,sum(v),by=list("Guaranteed name"=month(date))]

    o   For a surprising speed boost in some circumstances, default options
        such as 'datatable.verbose' are now set when the package loads (unless
        they are already set, by user's profile for example). The 'default'
        argument of base::getOption() was the culprit and has been removed
        internally from all 11 calls.

BUG FIXES

    o   Fixed a `suffixes` handling bug in merge.data.table that was
        only recently introduced during the recent "fast-merge"-ing reboot.
        Briefly, the bug was only triggered in scenarios where both
        tables had identical column names that were not part of `by` and
        ended with *.1. cf. "merge and auto-increment columns in y[x]"
        test in tests/test-data.frame-like.R for more information.

    o   Adding a column using := on a data.table just loaded from disk was
        correctly detected and over allocated, but incorrectly warning about
        a previous copy. Test 462 tested loading from disk, but suppressed
        warnings (sadly). Fixed.

    o   data.table unaware packages that use DF[i] and DF[i]<-value syntax
        were not compatible with data.table, fixed. Many thanks to Prasad
        Chalasani for providing a reproducible example with base::droplevels(),
        and Helge Liebert for providing a reproducible example (#1794) with
        stats::reshape(). Tests added.

    o   as.data.table(DF) already preserved DF's attributes but not any
        inherited classes such as nlme's groupedData, so nlme was incompatible
        with data.table. Fixed. Thanks to Dieter Menne for providing a
        reproducible example. Test added.

    o   The internal row.names attribute of .SD (which exists for
        compatibility with data.frame only) was not being updated for each
        group. This caused length errors when calling any non-data.table-aware
        package from j, by group, when that package used length of row.names.
        Such as the recent update to ggplot2. Fixed.

    o   When grouped j consists of a print of an object (such as ggplot2),
        the print is now masked to return NULL rather than the object that
        ggplot2 returns since the recent update v0.9.0. Otherwise data.table
        tries to accumulate the (albeit invisible) print object. The print
        mask is local to grouping, not generally.

    o   'by' was failing (bug #1880) when passed character column names
        where one or more included a space. So, this now works :
            DT[,sum(v),by="column 1"]
        and j retains spaces in column names rather than replacing spaces
        with "."; e.g.,
            DT[,list("a b"=1)]
        Thanks to Yang Zhang for reporting. Tests added. As before, column
        names may be back ticked in the usual R way (in i, j and by); e.g.,
            DT[,sum(`nicely named var`+1),by=month(`long name for date column`)]

    o   unique() on an unkeyed table including character columns now works
        correctly, fixing #1725. Thanks to Steven Bagley for reporting. Test
        added.

    o   %like% now returns logical (rather than integer locations) so that
        it can be combined with other i clauses, fixing #1726. Thanks to
        Ivan Zhang for reporting. Test added.

THANKS TO

    o   Joshua Ulrich for spotting a missing PACKAGE="data.table"
        in .Call in setkey.R, and suggesting as.list.default() and
        unique.default() to avoid dispatch for speed, all implemented.

USER-VISIBLE CHANGES

    o   Providing .SDcols when j doesn't use .SD is downgraded from error to
        warning, and verbosity now reports which columns have been detected
        as used by j.

    o   check.names is now FALSE by default, for convenience when working with
        column names with spaces and other special characters, which are now
        fully supported. This difference to data.frame has been added to
        FAQ 2.17.

**********************************************
**                                          **
**   CHANGES IN DATA.TABLE VERSION 1.7.10   **
**                                          **
**********************************************

NEW FEATURES

    o   New function setcolorder() reorders the columns by name
        or by number, by reference with no copy. This is (almost)
        infinitely faster than DT[,neworder,with=FALSE].
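
A minimal sketch of the setcolorder() item above, assuming a current
data.table release is installed. The table and its column names are made up
for illustration only.

```r
library(data.table)

# Hypothetical three-column table, used only for this illustration
DT <- data.table(a = 1:3, b = letters[1:3], c = c(1.5, 2.5, 3.5))

# Reorder by name; DT is modified in place, so nothing is assigned back
setcolorder(DT, c("c", "a", "b"))
print(names(DT))

# Reorder by position; positions refer to the current column order
setcolorder(DT, c(2L, 3L, 1L))
print(names(DT))
```

Because the reordering is done by reference, any other variable bound to the
same table sees the new column order as well; take a copy() first if that is
not wanted.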

    o   The prefix i. can now be used in j to refer to join inherited
        columns of i that are otherwise masked by columns in x with
        the same name.

BUG FIXES

    o   tracemem() in example(setkey) was causing CRAN check errors
        on machines where R is compiled without memory profiling available,
        for efficiency. Notably, R for Windows, Ubuntu and Mac have memory
        profiling enabled which may slow down R on those architectures even
        when memory profiling is not being requested by the user. The call to
        tracemem() is now wrapped with try().

    o   merge of unkeyed tables now works correctly after breaking in 1.7.8
        and 1.7.9. Thanks to Eric and DM for reporting. Tests added.

    o   nomatch=0 was ignored for the first group when j used join inherited
        scope. Fixed and tests added.

USER-VISIBLE CHANGES

    o   Updating an existing column using := after a key<- now works without
        warning or error. This can be useful in interactive use when you forget
        to use setkey() but don't mind about the inefficiency of key<-. Thanks
        to Chris Neff for providing a convincing use case. Adding a new column
        using := after key<- issues a warning, shallow copies and proceeds, as
        before.

    o   The 'datatable.pre.suffixes' option has been removed. It was available
        to obtain deprecated merge() suffixes pre v1.5.4.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.9   **
**                                         **
*********************************************

NEW FEATURES

    o   New function setnames(), referred to in 1.7.8 warning messages.
        It makes no copy of the whole data object, unlike names<- and
        colnames<-. It may be more convenient as well since it allows changing
        a column name, by name; e.g.,

            setnames(DT,"oldcolname","newcolname")   # by name; no match() needed
            setnames(DT,3,"newcolname")              # by position
            setnames(DT,2:3,c("A","B"))              # multiple
            setnames(DT,c("a","b"),c("A","B"))       # multiple by name
            setnames(DT,toupper(names(DT)))          # replace all

        setnames() maintains truelength of the over-allocated names vector.
        This allows := to add columns fully by reference without growing the
        names vector. As before with names<-, if a key column's name is
        changed, the "sorted" attribute is updated with the new column name.

BUG FIXES

    o   Incompatibility with reshape() of 3 column tables fixed
        (introduced by 1.7.8) :
            Error in setkey(ans, NULL) : x is not a data.table
        Thanks to Damian Betebenner for reporting and reproducible example.
        Tests added to catch in future.

    o   setattr(DT,...) still returns DT, but now invisibly. It returns DT
        back again for compound syntax to work; e.g.,
            setattr(DT,...)[i,j,by]
        Again, thanks to Damian Betebenner for reporting.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.8   **
**                                         **
*********************************************

BUG FIXES

    o   unique(DT) now works when DT is keyed and a key column is called
        'x' (an internal scoping conflict introduced in v1.6.1). Thanks to
        Steven Bagley for reporting.

    o   Errors and seg faults could occur in grouping when j contained
        character or list columns. Many thanks to Jim Holtman for providing
        a reproducible example.

    o   Setting a key on a table with over 268 million rows (2^31/8) now
        works (again), #1714. Bug introduced in v1.7.2. setkey works up to
        the regular R vector limit of 2^31 rows (2 billion). Thanks to Leon
        Baum for reporting.

    o   Checks in := are now made up front (before starting to modify the
        data.table) so that the data.table isn't left in an invalid state
        should an error occur, #1711. Thanks to Chris Neff for reporting.

    o   The 'Chris crash' is fixed. The root cause was that key<- always
        copies the whole table. The problem with that copy (other than being
        slower) is that R doesn't maintain the over allocated truelength, but
        it looks as though it has. key<- was used internally, in particular
        in merge(). So, adding a column using := after merge() was a memory
        overwrite, since the over allocated memory wasn't really there after
        key<-'s copy.
        data.tables now have a new attribute '.internal.selfref' to catch
        and warn about such copies in future. All internal use of key<- has
        been replaced with setkey(), or new function setkeyv() which accepts
        a vector; neither copies.
        Many thanks to Chris Neff for extended dialogue, providing a
        reproducible example and his patience. This problem was not just
        in pre 2.14.0, but post 2.14.0 as well. Thanks also to Christoph
        Jäckel, Timothée Carayol and DM for investigations and suggestions,
        which in combination led to the solution.

    o   An example in ?":=" fixed, and j and by descriptions improved in
        ?data.table. Thanks to Joseph Voelkel for reporting.

NEW FEATURES

    o   Multiple new columns can be added by reference using := and
        with=FALSE; e.g.,
            DT[,c("foo","bar"):=1L,with=FALSE]
            DT[,c("foo","bar"):=list(1L,2L),with=FALSE]

    o   := now recycles vectors of non divisible length, with a warning
        (previously an error).

    o   When setkey coerces a numeric or character column, it no longer
        makes a copy of the whole table, FR#1744. Thanks to an investigation
        by DM.

    o   New function setkeyv(DT,v) (v stands for vector) replaces
        key(DT)<-v syntax. Also added setattr(). See ?copy.

    o   merge() now uses (manual) secondary keys, for speed.

USER VISIBLE CHANGES

    o   The loc argument of setkey has been removed. This wasn't very
        useful and didn't warrant a period of deprecation.

    o   datatable.alloccol has been removed. That warning is now
        controlled by datatable.verbose=TRUE. One option is easier.

    o   If i is a keyed data.table, it is no longer an error if its
        key is longer than x's key; the first length(key(x)) columns
        of i's key are used to join.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.7   **
**                                         **
*********************************************

BUG FIXES

    o   Previous bug fix for random crash in R <= 2.13.2 related to
        truelength and over-allocation didn't work, 3rd attempt. Thanks to
        Chris Neff for his patience and testing.
        This has shown up consistently as error status on CRAN old-rel
        checks (windows and mac). So if they pass, this issue is fixed.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.6   **
**                                         **
*********************************************

NEW FEATURES

    o   An empty list column can now be added with :=, and data.table()
        accepts empty list().
            DT[,newcol:=list()]
            data.table(a=1:3,b=list())
        Empty list columns contain NULL for all rows.

BUG FIXES

    o   Adding a column to a data.table loaded from disk could result
        in a memory corruption in R <= 2.13.2, revealed and thanks to CRAN
        checks on windows old-rel.

    o   Adding a factor column with a RHS to be recycled no longer loses
        its factor attribute, #1691. Thanks to Damian Betebenner for
        reporting.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.5   **
**                                         **
*********************************************

BUG FIXES

    o   merge()-ing a data.table where its key is not the first few
        columns in order now works correctly and without warning, fixing
        #1645. Thanks to Timothee Carayol for reporting.

    o   Mixing nomatch=0 and mult="last" (or "first") now works, #1661.
        Thanks to Johann Hibschman for reporting.

    o   Join Inherited Scope now respects nomatch=0, #1663. Thanks to
        Johann Hibschman for reporting.

    o   by= could generate a keyed result table with invalid key; e.g.,
        when by= expressions return NA, #1631. Thanks to Muhammad Waliji for
        reporting.

    o   Adding a column to a data.table loaded from disk resulted in an
        error that truelength(DT)= length() other than just after a table
        has been loaded from disk.

    o   New option 'datatable.nomatch' allows the default for nomatch
        to be changed from NA to 0, as wished for by Branson Owen.

    o   cbind(DT,...) now retains DT's key, as wished for by Chris Neff
        and partly implementing FR#295.

BUG FIXES

    o   Assignment to factor columns (using :=, [<- or $<-) could cause
        'variable not found' errors and a seg fault in some circumstances
        due to a new feature in v1.7.0: "Factor columns on LHS of :=, [<-
        and $<- can now be assigned new levels", fixing #1664. Thanks to
        Daniele Signori for reporting.

    o   DT[i,j]<-value no longer crashes when j is a factor column and
        value is numeric, fixing #1656.

    o   An unnecessarily strict machine tolerance test failed CRAN checks
        on Mac preventing v1.7.2 availability for Mac (only).

USER VISIBLE CHANGES

    o   := now has its own help page in addition to the examples in
        ?data.table, see help(":=").

    o   The error message from X[Y] when X is unkeyed has been
        lengthened to include advice to call setkey first and see ?setkey.
        Thanks to a comment by ilprincipe on Stack Overflow.

    o   Deleting a missing column is now a warning rather than error.
        Thanks to Chris Neff for suggesting, #1642.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.2   **
**                                         **
*********************************************

NEW FEATURES

    o   unique and duplicated methods now work on unkeyed tables (comparing
        all columns in that case) and both now respect machine tolerance for
        double precision columns, implementing FR#1626 and fixing bug #1632.
        Their help page has been updated accordingly with detailed examples.
        Thanks to questions by Iterator and comments by Allan Engelhardt on
        Stack Overflow.

    o   A new method as.data.table.list has been added, since passing a
        (pure) list to data.table() now creates a single list column.

BUG FIXES

    o   Assigning to a column variable using <- or = in j now
        works (creating a local copy within j), rather than
        persisting from group to group and sometimes causing a crash.
        Non column variables still persist from group to group; e.g.,
        a group counter. This fixes the remainder of #1624 thanks to
        Steve Lianoglou for reporting.

    o   A crash bug is fixed when j returns a (strictly) NULL column next
        to a non-empty column, #1633. This case was anticipated and coded
        for but an errant LENGTH() should have been length(). Thanks
        to Dennis Murphy for reporting.

    o   The first column of data.table() can now be a list column, fixing
        #1640. Thanks to Stavros Macrakis for reporting.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.1   **
**                                         **
*********************************************

BUG FIXES

    o   .SD is now locked, partially fixing #1624. It was never
        the intention to allow assignment to .SD. Take a 'copy(.SD)'
        first if needed. Now documented in ?data.table and new FAQ 4.5
        including example. Thanks to Steve Lianoglou for reporting.

    o   := now works with a logical i subset; e.g.,
            DT[x==1,y:=x]
        Thanks to Muhammad Waliji for reporting.

USER VISIBLE CHANGES

    o   Error message "column of i is not internally type integer"
        is now more helpful adding "i doesn't need to be keyed, just
        convert the (likely) character column to factor". Thanks to
        Christoph_J for his SO question.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.7.0   **
**                                         **
*********************************************

NEW FEATURES

    o   data.table() now accepts list columns directly rather than
        needing to add list columns to an existing data.table; e.g.,

            DT = data.table(x=1:3,y=list(4:6,3.14,matrix(1:12,3)))

        Thanks to Branson Owen for reminding. As before, list columns
        can be created via grouping; e.g.,

            DT = data.table(x=c(1,1,2,2,2,3,3),y=1:7)
            DT2 = DT[,list(list(unique(y))),by=x]
            DT2
                 x      V1
            [1,] 1    1, 2
            [2,] 2 3, 4, 5
            [3,] 3    6, 7

        and list columns can be grouped; e.g.,

            DT2[,sum(unlist(V1)),by=list(x%%2)]
                 x V1
            [1,] 1 16
            [2,] 0 12

        Accordingly, one item has been added to FAQ 2.17 (differences
        between data.frame and data.table): data.frame(list(1:2,"k",1:4))
        creates 3 columns, data.table creates one list column.

    o   subset, transform and within now retain keys when the expression
        does not 'touch' key columns, implementing FR #1341.

    o   Recycling list() items on RHS of := now works; e.g.,

            DT[,1:4:=list(1L,NULL),with=FALSE]
            # set columns 1 and 3 to 1L and remove columns 2 and 4

    o   Factor columns on LHS of :=, [<- and $<- can now be assigned
        new levels; e.g.,

            DT = data.table(A=c("a","b"))
            DT[2,"A"] <- "c"     # adds new level automatically
            DT[2,A:="c"]         # same (faster)
            DT$A = "newlevel"    # adds new level and recycles it

        Thanks to Damian Betebenner and Chris Neff for highlighting.
        To change the type of a column, provide a full length RHS (i.e.
        'replace' the column).

BUG FIXES

    o   := with i all FALSE no longer sets the whole column, fixing
        bug #1570. Thanks to Chris Neff for reporting.

    o   0 length by (such as NULL and character(0)) now behave as
        if by is missing, fixing bug #1599. This is useful when by
        is dynamic and a 'dont group' needs to be represented.
        Thanks to Chris Neff for reporting.

    o   NULL j no longer results in 'inconsistent types' error, but
        instead returns no rows for that group, fixing bug #1576.

    o   matrix i is now an error rather than using i as if it were a
        vector and obtaining incorrect results. It was undocumented that
        matrix might have been an acceptable type. matrix i is
        still acceptable in [<-; e.g.,
            DT[is.na(DT)] <- 1L
        and this now works rather than assigning to non-NA items in some
        cases.

    o   Inconsistent [<- behaviour is now fixed (#1593) so these examples
        now work :
            DT[x == "a", ]$y <- 0L
            DT["a", ]$y <- 0L
        But, := is highly encouraged instead for speed; i.e.,
            DT[x == "a", y:=0L]
            DT["a", y:=0L]
        Thanks to Leon Baum for reporting.

    o   unique on an unsorted table now works, fixing bug #1601.
        Thanks to a question by Iterator on Stack Overflow.

    o   Bug fix #1534 in v1.6.5 (see NEWS below) only worked if data.table
        was higher than IRanges on the search() path, despite the item in
        NEWS stating otherwise. Fixed.

    o   Compatibility with package sqldf (which can call do.call("rbind",...)
        on an empty "...") is fixed and test added. data.table was switching
        on list(...)[[1]] rather than ..1. Thanks to RYogi for reporting
        #1623.

USER VISIBLE CHANGES

    o   cbind and rbind are no longer masked. But, please do read FAQ 2.23,
        4.4 and 5.1.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.6.6   **
**                                         **
*********************************************

BUG FIXES

    o   Tests using .Call("Rf_setAttrib",...) passed CRAN acceptance
        checks but failed on many (but not all) platforms. Fixed.
        Thanks to Prof Brian Ripley for investigating the issue.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.6.5   **
**                                         **
*********************************************

NEW FEATURES

    o   The LHS of := may now be column names or positions
        when with=FALSE; e.g.,

            DT[,c("d","e"):=NULL,with=FALSE]
            DT[,4:5:=NULL,with=FALSE]
            newcolname="myname"
            DT[,newcolname:=3.14,with=FALSE]

        This implements FR#1499 'Ability to efficiently remove a
        vector of column names' by Timothee Carayol in addition to
        creating and assigning to multiple columns. We still plan
        to allow multiple := without needing with=FALSE, in future.

    o   setkey(DT,...) now returns DT (invisibly) rather than NULL.
        This is to allow compound statements; e.g.,
            setkey(DT,x)["a"]

    o   setkey (and key<-) are now more efficient when the data happens
        to be already sorted by the key columns; e.g., when data is
        loaded from ordered files.

    o   If DT is already keyed by the columns passed to setkey (or
        key<-), the key is now rebuilt and checked rather than skipping
        for efficiency. This is to save needing to know to drop the key
        first to rebuild an invalid key. Invalid keys can arise by going
        'under the hood'; e.g., attr(DT,"sorted")="z", or somehow ending
        up with unordered factor levels. A warning is issued so the root
        cause can be fixed. Thanks to Timothee Carayol for highlighting.

    o   A new copy() function has been added, FR#1501.
        This copies a data.table (retaining its key, if any) and should
        now be used to copy rather than data.table(). Reminder:
        data.tables are not copied on write by setkey, key<- or :=.

BUG FIXES

    o   DT[,z:=a/b] and DT[a>3,z:=a/b] work again, where a and
        b are columns of DT. Thanks to Chris Neff for reporting,
        and his patience.

    o   Numeric columns with class attributes are now correctly
        coerced to integer by setkey and ad hoc by. The error
        similar to 'fractional data cannot be truncated' should now
        only occur when that really is true. A side effect of
        this is that ad hoc by and setkey now work on IDate columns
        which have somehow become numeric; e.g., via rbind(DF,DF)
        as reported by Chris Neff.

    o   .N is now 0 (rather than 1) when no rows in x match the
        row in i, fixing bug #1532. Thanks to Yang Zhang for
        reporting.

    o   Compatibility with package IRanges has been restored. Both
        data.table and IRanges mask cbind and rbind. When data.table's
        cbind is found first (if it is loaded after IRanges) and the
        first argument is not data.table, it now delegates to the next
        package on the search path (and above that), one or more of which
        may also mask cbind (such as IRanges), rather than skipping
        straight to base::cbind. So, it no longer matters which way around
        data.table and IRanges are loaded, fixing #1534. Thanks to
        Steve Lianoglou for reporting.

USER VISIBLE CHANGES

    o   setkey's verbose messages expanded.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.6.4   **
**                                         **
*********************************************

NEW FEATURES

    o   DT[colA>3,which=TRUE] now returns row numbers rather
        than a logical vector, for consistency.

BUG FIXES

    o   Changing a keyed column name now updates the key, too,
        so an invalid key no longer arises, fixing #1495.
        Thanks to Chris Neff for reporting.

    o   := already warned when a numeric RHS is coerced to
        match an integer column's type. Now it also warns when
        numeric is coerced to logical, and integer is coerced
        to logical, fixing #1500.
        Thanks to Chris Neff for reporting.

    o   The result of DT[,newcol:=3.14] now includes the new
        column correctly, as well as changing DT by reference,
        fixing #1496. Thanks to Chris Neff for reporting.

    o   :=NULL to remove a column (instantly, regardless of table
        size) now works rather than causing a segfault in some
        circumstances, fixing #1497. Thanks to Timothee Carayol
        for reporting.

    o   Previous within() and transform() behaviour restored; e.g.,
        can handle multiple columns again. Thanks to Timothee Carayol
        for reporting.

    o   cbind(DT,DF) now works, as does rbind(DT,DF), fixing #1512.
        Thanks to Chris Neff for reporting. This was tricky to fix due
        to nuances of the .Internal dispatch code in cbind and rbind,
        preventing S3 methods from working in all cases.
        R will now warn that cbind and rbind have been masked when
        the data.table package is loaded. These revert to base::cbind
        and base::rbind when the first argument is not data.table.

    o   Removing multiple columns now works (again) using
        DT[,c("a","b")]=NULL, or within(DT,rm(a,b)), fixing #1510.
        Thanks to Timothee Carayol for reporting.

NOTES

    o   The package uses two features (packageVersion() and \href in Rd)
        added to R 2.12.0 and is therefore dependent on that release.
        A 'spurious warning' when checking a package using \href was
        fixed in R 2.12.2 patched but we believe that warning can safely
        be ignored in versions >= 2.12.0 and < 2.12.2 patched.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.6.3   **
**                                         **
*********************************************

NEW FEATURES

    o   Ad hoc grouping now returns results in the same order each
        group first appears in the table, rather than sorting the
        groups. Thanks to Steve Lianoglou for highlighting. The order
        of the rows within each group always has and always will be
        preserved. For larger datasets a 'keyed by' is still faster;
        e.g., by=key(DT).
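
A minimal sketch of the group ordering described above, assuming a data.table
release recent enough to have the 'keyby' argument (added in 1.8.0, per the
entry further up this file). The table DT and its columns g and v are
hypothetical.

```r
library(data.table)

# Hypothetical data: groups first appear in the order "b", "a", "c"
DT <- data.table(g = c("b", "a", "b", "c", "a"), v = 1:5)

# Ad hoc by: one row per group, in order of first appearance
adhoc <- DT[, sum(v), by = g]
print(adhoc)

# keyby: same aggregation, but the result is sorted and keyed by g
keyed <- DT[, sum(v), keyby = g]
print(keyed)
```

The two results hold the same sums; only the row order and the presence of a
key differ.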

    o   The 'key' argument of data.table() now accepts a vector of
        column names in addition to a single comma separated string
        of column names, for consistency. Thanks to Steve Lianoglou
        for highlighting.

    o   A new argument '.SDcols' has been added to [.data.table. This
        may be character column names or numeric positions and
        specifies the columns of x included in .SD. This is useful
        for speed when applying a function through a subset of
        (possibly very many) columns; e.g.,
            DT[,lapply(.SD,sum),by="x,y",.SDcols=301:350]

    o   as(character, "IDate") and as(character, "ITime") coercion
        functions have been added. Enables the user to declare colClasses
        as "IDate" and "ITime" in the various read.table (and sister)
        functions. Thanks to Chris Neff for the suggestion.

    o   DT[i,j]<-value is now handled by data.table in C rather
        than falling through to data.frame methods, FR#200. Thanks to
        Ivo Welch for raising speed issues on r-devel, to Simon Urbanek
        for the suggestion, and Luke Tierney and Simon for information
        on R internals.

        [<- syntax still incurs one working copy of the whole
        table (as of R 2.13.1) due to R's [<- dispatch mechanism
        copying to `*tmp*`, so, for ultimate speed and brevity,
        the operator := may now be used in j as follows.

    o   := is now available to j and means assign to the column by
        reference; e.g.,

            DT[i,colname:=value]

        This syntax makes no copies of any part of memory at all.

            m = matrix(1,nrow=100000,ncol=100)
            DF = as.data.frame(m)
            DT = as.data.table(m)

            system.time(for (i in 1:1000) DF[i,1] <- i)
                 user  system elapsed
              287.062 302.627 591.984

            system.time(for (i in 1:1000) DT[i,V1:=i])
                 user  system elapsed
                1.148   0.000   1.158     ( 511 times faster )

        := in j can be combined with all types of i, such as binary
        search, and used to add and remove columns efficiently.
        Fast assigning within groups will be implemented in future.

        Reminder that data.frame and data.table both allow columns
        of mixed types, including columns which themselves may be
        type list; matrix may be one (atomic) type only.

        *Please note*, := is new and experimental.

BUG FIXES

    o   merge()ing two data.tables with user-defined `suffixes`
        was getting tripped up when column names in x ended in
        '.1'. This resulted in the `suffixes` parameter being
        ignored.

    o   Mistakenly wrapping a j expression inside quotes; e.g.,
            DT[,list("sum(a),sum(b)"),by=grp]
        was appearing to work, but with wrong column names. This
        now returns a character column (the quotes should not
        be used). Thanks to Joseph Voelkel for reporting.

    o   setkey has been made robust in several ways to fix issues
        introduced in 1.6.2: #1465 ('R crashes after setkey')
        reported by Eugene Tyurin and similar bug #1387 ('paste()
        by group to create long comma separated strings can crash')
        reported by Nicolas Servant and Jean-Francois Rami. This
        bug was not reproducible so we are especially grateful for
        the patience of these people in helping us find, fix and
        test it.

    o   Combining a join, j and by together in one query now works
        rather than giving an error, fixing bug #1468. Discovered
        indirectly thanks to a post from Jelmer Ypma.

    o   Invalid keys no longer arise when a non-data.table-aware
        package reorders the data; e.g.,
            setkey(DT,x,y)
            plyr::arrange(DT,y)    # same as DT[order(y)]
        This now drops the key to avoid incorrect results being
        returned the next time the invalid key is joined to. Thanks
        to Chris Neff for reporting.

USER-VISIBLE CHANGES

    o   The startup banner has been shortened to one line.

    o   data.table does not support POSIXlt. Almost unbelievably
        POSIXlt uses 40 bytes to store a single datetime. If it worked
        before, that was unintentional. Please see ?IDateTime, or any
        other date class that uses a single atomic vector. This is
        regardless of whether the POSIXlt is a key column, or not. This
        resolves bug #1481 by documenting non support in ?data.table.
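
A minimal sketch of the ?IDateTime alternative mentioned above, assuming a
current data.table release. The timestamp is invented for illustration; the
point is only that the date and the time of day are each held in a single
integer vector, which suits a column.

```r
library(data.table)

# Hypothetical timestamp, held as POSIXct rather than POSIXlt
stamp <- as.POSIXct("2012-03-15 10:30:00", tz = "UTC")

d  <- as.IDate(stamp)   # integer-backed date
tm <- as.ITime(stamp)   # integer seconds since midnight

# IDateTime() splits a datetime into one IDate and one ITime column
parts <- IDateTime(stamp)
print(parts)
```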

DEPRECATED & DEFUNCT

    o   Use of the DT() alias in j is no longer caught for backwards
        compatibility and is now fully removed. As warned in NEWS
        for v1.5.3, v1.4, and FAQs 2.6 and 2.7.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.6.2   **
**                                         **
*********************************************

NEW FEATURES

    o   setkey no longer copies the whole table and should be
        faster for large tables. Each column is reordered by reference
        (in C) using one column of working memory, FR#1006. User
        defined attributes on the original table are now also
        retained (thanks to Thell Fowler for reporting).

    o   A new symbol .N is now available to j, containing the
        number of rows in the group. This may be useful when
        the column names are not known in advance, for
        convenience generally, and for efficiency.

*********************************************
**                                         **
**   CHANGES IN DATA.TABLE VERSION 1.6.1   **
**                                         **
*********************************************

NEW FEATURES

    o   j's environment is now consistently reused so
        that local variables may be set which persist
        from group to group; e.g., incrementing a group
        counter :
            DT[,list(z,groupInd<-groupInd+1),by=x]
        Thanks to Andreas Borg for reporting.

    o   A new symbol .BY is now available to j, containing 1 row
        of the current 'by' variables, type list. 'by' variables
        may also be used by name, and are now length 1 too. This
        implements FR#1313. FAQ 2.10 has been updated accordingly.
        Some examples :
            DT[,sum(x)*.BY[[1]],by=eval(byexp)]
            DT[,sum(x)*mylookuptable[J(y),z],by=y]
            DT[,list(sum(unlist(.BY)),sum(z)),by=list(x,y%%2)]

    o   i may now be type list, and works the same as when i
        is type data.table. This saves needing J() in as many
        situations and may be a little more efficient. One
        application is using .BY directly in j to join to a
        relatively small lookup table, once per group, for space
        and time efficiency.
        For example :
            DT[,list(GROUPDATA[.BY]$name,sum(v)),by=grp]


BUG FIXES

    o   A 'by' character vector of column names now works when there are
        fewer rows than columns; e.g., DT[,sum(x),by=key(DT)] where
        nrow(DT)==1. Many thanks to Andreas Borg for the report, proposed
        fix and tests.

    o   Zero length columns in j no longer cause a crash in some
        circumstances. Empty columns are filled with NA to match the length
        of the longest column in j. Thanks to Johann Hibschman for bug
        report #1431.

    o   unique.data.table now calls the same internal code (in C) that
        grouping calls. This fixes a bug when unique is called directly by
        the user and NAs exist in the key (which might be quite rare).
        Thanks to Damian Betebenner for the bug report. unique should also
        now be faster.

    o   Variables in calling scope can now be used in j when i is logical or
        integer, fixing bug #1421. Thanks to Alexander Peterhansl for
        reporting.


USER-VISIBLE CHANGES

    o   ?data.table now documents that logical i is not quite the same as i
        in [.data.frame. NA are treated as FALSE, and DT[NA] returns 1 row
        of NA, unlike [.data.frame. Three points have been added to
        FAQ 2.17. Thanks to Johann Hibschman for highlighting.

    o   Startup banner now uses packageStartupMessage() so the banner can be
        suppressed by those annoyed by banners, whilst still being helpful
        to new users.


        *********************************************
        **                                         **
        **    CHANGES IN DATA.TABLE VERSION 1.6    **
        **                                         **
        *********************************************

NEW FEATURES

    o   data.table now plays nicely with S4 classes. Slots can be defined to
        be S4 objects, S4 classes can inherit from data.table, and S4
        function dispatch works on data.table objects. See the tests in
        inst/tests/test-S4.R, and from the R prompt: ?"data.table-class"

    o   merge.data.table now works more like merge.data.frame:
        (i) suffixes are consistent with merge.data.frame; existing users
        may set options(datatable.pre.suffixes=TRUE) for backwards
        compatibility.
        (ii) support for 'by' argument added (FR #1315).
        However, X[Y] syntax is preferred; some users never use merge.


BUG FIXES

    o   by=key(DT) now works when the number of rows is not divisible by the
        number of groups (#1298, an odd bug). Thanks to Steve Lianoglou for
        reporting.

    o   Combining i and by where i is a logical or integer subset now works,
        fixing bug #1294. Thanks to Johann Hibschman for contributing a new
        test.

    o   Variable scope inside [[...]] now works without a workaround
        required. This can be useful for looking up which function to call
        based on the data e.g. DT[,fns[[fn]](colA),by=ID]. Thanks to Damian
        Betebenner for reporting.

    o   Column names in self joins such as DT[DT] are no longer duplicated,
        fixing bug #1340. Thanks to Andreas Borg for reporting.


USER-VISIBLE CHANGES

    o   Additions and updates to FAQ vignette. Thanks to Dennis Murphy for
        his thorough proofreading.

    o   Welcome to Steve Lianoglou who joins the project contributing
        S4-ization, testing using testthat, and more.

    o   IDateTime is now linked from ?data.table. data.table users unaware
        of IDateTime, please do take a look. Tom added IDateTime in v1.5
        (see below).


        *********************************************
        **                                         **
        **   CHANGES IN DATA.TABLE VERSION 1.5.3   **
        **                                         **
        *********************************************

NEW FEATURES

    o   .SD no longer includes 'by' columns, FR#978. This resolves the long
        standing annoyance of duplicated 'by' columns when the j expression
        returns a subset of rows from .SD. For example, the following query
        no longer contains a redundant 'colA.1' duplicate.
            DT[,.SD[2],by=colA]    # 2nd row of each group
        Any existing code that uses .SD may require simple changes to remove
        workarounds.

    o   'by' may now be a character vector of column names. This allows
        syntax such as DT[,sum(x),by=key(DT)].

    o   X[Y] now includes Y's non-join columns, as most users naturally
        expect, FR#746. Please do use j in one step (i.e. X[Y,j]) since that
        merges just the columns j uses and is much more efficient than
        X[Y][,j] or merge(X,Y)[,j].

    o   The 'Join Inherited Scope' feature is back on, FR#1095. This is
        consistent with X[Y] including Y's non-join columns, enabling
        natural progression from X[Y] to X[Y,j]. j sees columns in X first
        then Y. If the same column name exists in both X and Y, the data in
        Y can be accessed via a prefix "i." (not yet implemented).

    o   Ad hoc by now coerces double to integer (provided they are
        all.equal) and character to factor, FR#1051, as setkey already does.


USER-VISIBLE CHANGES

    o   The default for mult is now "all", as planned, with prior notice
        given in FAQ 2.2.

    o   ?[.data.table has been merged into ?data.table and updated,
        simplified, corrected and formatted.


DEPRECATED & DEFUNCT

    o   The DT() alias is now fully deprecated, as warned in NEWS for v1.4,
        and FAQs 2.6 and 2.7.


        *********************************************
        **                                         **
        **   CHANGES IN DATA.TABLE VERSION 1.5.2   **
        **                                         **
        *********************************************

NEW FEATURES

    o   'by' now works when DT contains list() columns i.e. where each value
        in a column may itself be a vector or where each value is a
        different type. FR#1092.

    o   The result from merge() is now keyed. FR#1244.


BUG FIXES

    o   eval of parse()-ed expressions now works without needing quote() in
        the expression, bug #1243. Thanks to Joseph Voelkel for reporting.

    o   the result from the first group alone may now be bigger than the
        table itself, bug #1245. Thanks to Steve Lianoglou for reporting.

    o   merge on a data.table with a single keyed column only and all=TRUE
        now works, bug #1241. Thanks to Joseph Voelkel for reporting.

    o   merge()-ing by a column called "x" now works, bug #1229 related to
        variable scope. Thanks to Steve Lianoglou for reporting.


        *********************************************
        **                                         **
        **   CHANGES IN DATA.TABLE VERSION 1.5.1   **
        **                                         **
        *********************************************

BUG FIXES

    o   Fixed inheritance for other packages importing or depending on
        data.table, bugs #1093 and #1132. Thanks to Koert Kuipers for
        reporting.

    o   data.table queries can now be used at the debugger() prompt, fixing
        bug #1131 related to inheritance from data.frame.


        *********************************************
        **                                         **
        **    CHANGES IN DATA.TABLE VERSION 1.5    **
        **                                         **
        *********************************************

NEW FEATURES

    o   data.table now *inherits* from data.frame, for functions and
        packages which _only_ accept data.frame, saving time and memory of
        conversion. A data.table is a data.frame too; is.data.frame() now
        returns TRUE.

    o   Integer-based date and time-of-day classes have been introduced.
        This allows dates and times to be used as keys more easily. See
        as.IDate, as.ITime, and IDateTime. Conversions to and from POSIXct,
        Date, and chron are supported.

    o   [<-.data.table and $<-.data.table were revised to check for changes
        to the keyed columns. [<-.data.table also now allows
        data.table-style indexing for i. Both of these changes may introduce
        incompatibilities for existing code.

    o   Logical columns are now allowed in keys. Logical columns (and
        expressions that evaluate to logical) are now allowed in 'by'.
        Thanks to David Winsemius for highlighting.


BUG FIXES

    o   DT[,5] now returns 5 as FAQ 1.1 says, for consistency with
        DT[,c(5)] and DT[,5+0]. DT[,"region"] now returns "region" as
        FAQ 1.2 says. Thanks to Harish V for reporting.

    o   When a quote()-ed expression q is passed to 'by' using by=eval(q),
        the group column names now come from the list in the expression
        rather than the name 'q' (bug #974), and multiple items now work
        (bug #975). Thanks to Harish V for reporting.

    o   quote()-ed i and j expressions receive similar fixes, bugs #977 and
        #1058. Thanks to Harish V and Branson Owen for reporting.

    o   Multiple errors (grammar, format and spelling) in intro.Rnw and
        faqs.Rnw corrected by Dennis Murphy. Thank you.

    o   Memory is now reallocated in rare cases when the up-front allocation
        for the result of grouping is insufficient. Bug #952 raised by
        Georg V, and also reported by Harish. Thank you.

    o   A function call foo(arg=sum(b)) now finds b in DT when foo contains
        DT[,eval(substitute(arg)),by=a], fixing bug #1026. Thanks to
        Harish V for reporting.

    o   If DT contains column 'a' then DT[J(unique(a))] now finds 'a',
        fixing bug #1005. Thanks to Branson Owen for reporting.

    o   'by' on no data (for example when 'i' returns no rows) now works,
        fixing bug #709.

    o   'by without by' now heeds nomatch=NA, fixing bug #1015. Thanks to
        Harish V for reporting.

    o   DT[NA] now returns 1 row of NA rather than the whole table via
        standard NA logical recycling. A single NA logical is a special case
        and is now replaced by NA_integer_. Thanks to Branson Owen for
        highlighting the issue.

    o   NROW removed from data.table, since the is.data.frame() in
        base::NROW now returns TRUE due to inheritance. Fixes bug #1039
        reported by Bradley Buchsbaum. Thank you.

    o   setkey() now coerces character to factor and double to integer
        (provided they are all.equal), fixing bug #953. Thanks to Steve
        Lianoglou for reporting.

    o   'by' now accepts lists from the calling scope without the workaround
        of wrapping with as.list() or {}, fixing bug #1060. Thanks to Johann
        Hibschman for reporting.


NOTES

    o   The package uses the 'default' option of base::getOption, and is
        therefore dependent on R 2.10.0. Updated DESCRIPTION file
        accordingly. Thanks to Christian Hudon for reporting.


        *********************************************
        **                                         **
        **   CHANGES IN DATA.TABLE VERSION 1.4.1   **
        **                                         **
        *********************************************

NEW FEATURES

    o   Vignettes tidied up.


BUG FIXES

    o   Out of order levels in key columns are now sorted by setkey. Thanks
        to Steve Lianoglou for reporting.


        *******************************************
        **                                       **
        **   CHANGES IN DATA.TABLE VERSION 1.4   **
        **                                       **
        *******************************************

NEW FEATURES

    o   'by' faster. Memory is allocated first for the result, then
        populated directly by the result of j for each group. Can be 10 or
        more times faster than tapply() and aggregate(), see timings
        vignette.

    o   j should now be a list(), not DT(), of expressions. Use of
        j=DT(...) is caught internally and replaced with j=list(...).

    o   'by' may be a list() of expressions. A single column name is
        automatically list()-ed for convenience. 'by' may still be a comma
        separated character string, as before.
            DT[,sum(x),by=region]                     # new
            DT[,sum(x),by=list(region,month(date))]   # new
            DT[,sum(x),by="region"]                   # old, ok too
            DT[,sum(x),by="region,month(date)"]       # old, ok too

    o   key() and key<- added. More R-style alternatives to getkey() and
        setkey().

    o   haskey() added. Returns TRUE if a table has a key.

    o   radix sorting is now column by column where possible, was previously
        all or nothing. Helps with keys of many columns.

    o   Added format method.

    o   22 tests added to test.data.table(), now 149.

    o   Three vignettes added : FAQ, Intro & Timings


DEPRECATED & DEFUNCT

    o   The DT alias is removed. Use 'data.table' instead to create objects.
        See 2nd new feature above.

    o   RUnit framework removed. test.data.table() is called from examples
        in .Rd so 'R CMD check' will run it. Simpler. An
        eval(body(test.data.table)) is also in the .Rd, to catch namespace
        issues.

    o   Dependency on package 'ref' removed.

    o   Arguments removed: simplify, incbycols and byretn. Grouping is
        simpler now, these are superfluous.


BUG FIXES

    o   Column classes are now retained by subset and grouping.

    o   tail no longer fails when a column 'x' exists.


KNOWN PROBLEMS

    o   Minor : Join Inherited Scope not working, contrary to the
        documentation.


NOTES

    o   v1.4 was essentially the branch at rev 44, reintegrated at rev 78.


        *******************************************
        **                                       **
        **   CHANGES IN DATA.TABLE VERSION 1.3   **
        **                                       **
        *******************************************

NEW FEATURES

    o   Radix sorting added. Speeds up setkey and ad hoc 'by' by a factor of
        10 or more.

    o   Merge method added, much faster than base::merge method of
        data.frame.

    o   'by' faster. Logic moved from R into C. Memory is allocated for the
        largest group only, then re-used.

    o   The Sub Data is accessible as a whole by j using object .SD. This
        should only be used in rare circumstances. See FAQ.

    o   Methods added : duplicated, unique, transform, within, [<-, t, Math,
        Ops, is.na, na.omit, summary

    o   Column name rules improved e.g. dots now allowed.

    o   as.data.frame.data.table rownames improved.

    o   29 tests added to test.data.table(), now 127.


USER-VISIBLE CHANGES

    o   Default of mb changed, now tables(mb=TRUE)


DEPRECATED & DEFUNCT

    o   ... removed in [.data.table. j may not be a function, so this is now
        superfluous.


BUG FIXES

    o   Incorrect version warning with R 2.10+ fixed.

    o   j enclosure raised one level. This fixes some bugs where the j
        expression previously saw internal variable names. It also speeds up
        grouping a little.


NOTES

    o   v1.3 was not released to CRAN. R-Forge repository only.


        ********************************************
        **                                        **
        **   VERSION 1.2 RELEASED CRAN AUG 2008   **
        **                                        **
        ********************************************