- Vector.as_datetime: Add precision argument
- Vector.sort: Fix sorting object vectors
- DataFrame: Fix column and method name clash errors in certain operations
- dt.replace: Allow vector arguments the same length as x
- DataFrame.count: New method, shorthand for data.group_by(...).aggregate(n=di.count())
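The grouped count that the DataFrame.count entry above describes can be illustrated without Dataiter. This is only a plain-Python sketch of the idea: count_by is a hypothetical helper, not part of the library, and the output order (first appearance of each group) is an assumption.

```python
from collections import Counter

def count_by(rows, *keys):
    """Count rows per unique combination of `keys` in a list of dicts."""
    # Tally each group, identified by the tuple of its key values.
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    # Emit one dict per group with the group values and the count "n".
    return [dict(zip(keys, group), n=n) for group, n in counts.items()]

rows = [{"g": "a"}, {"g": "b"}, {"g": "a"}]
print(count_by(rows, "g"))
```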
- Vector.rank: Handle empty and all-NA vectors
- USE_NUMBA_CACHE: New option, read from environment variable DATAITER_USE_NUMBA_CACHE if it exists, defaults to True
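A boolean option backed by an environment variable, as in the USE_NUMBA_CACHE entry above, might be read along these lines. The changelog does not say how the value is parsed, so read_bool_option and its set of "falsy" spellings are assumptions for illustration only.

```python
import os

def read_bool_option(name, default=True, environ=None):
    """Return a boolean option from an environment variable, or `default` if unset."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return default
    # Hypothetical parsing rule: a few common "falsy" spellings disable the option.
    return value.strip().lower() not in {"", "0", "false", "no", "off"}

# Mirrors the documented default of True when the variable is not set.
USE_NUMBA_CACHE = read_bool_option("DATAITER_USE_NUMBA_CACHE")
```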
- Fix a possible issue with Numba caching
- Use numba.extending.overload instead of the deprecated numba.generated_jit
- DataFrame: Don't try to do joins on NA values in by columns
- DataFrame.drop_na: New method
- DataFrame: Truncate multiline strings when printing
- DataFrame.from_arrow: New method
- DataFrame.read_parquet: New method
- DataFrame.to_arrow: New method
- DataFrame.write_parquet: New method
- read_parquet: New function
- Vector.__init__: Fix type guessing when mixing Python and NumPy floats or integers and missing values
- Allow using a thousand separator when printing numbers, off by default, can be set with dataiter.PRINT_THOUSAND_SEPARATOR
- Fix printing really small numbers
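For the thousand separator entry above, the effect on printed numbers can be sketched with Python's own format specification. format_number is a hypothetical helper, and treating the option as a separator string is an assumption; the changelog does not state the type or accepted values of dataiter.PRINT_THOUSAND_SEPARATOR.

```python
def format_number(value, thousand_separator=""):
    """Format a number, grouping digits with `thousand_separator` (empty = no grouping)."""
    # The "," format spec groups digits by thousands using commas.
    text = f"{value:,}"
    # Swap the commas for the requested separator (or remove them).
    return text.replace(",", thousand_separator)

print(format_number(1234567))       # 1234567
print(format_number(1234567, ","))  # 1,234,567
```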
- DataFrame.modify: Fix grouped modify on unsorted data frame
- Vector.map: Add dtype argument
- ListOfDicts.to_data_frame: Add strings_as_object argument
- read_csv, read_geojson, DataFrame.from_pandas, DataFrame.read_csv, GeoJSON.read: Add strings_as_object argument
- DataFrame.slice_off: New method
- GeoJSON.to_data_frame: New method
- Fix error with new column placeholder attributes in conjunction with pop, popitem and clear
- DataFrame: Add placeholder attributes for columns so that tab completion of columns as attributes at a shell works
- dt.from_string: New function
- dt.to_string: New function
- nrow: Remove deprecated aggregation function
- Don't use Numba for aggregation involving strings due to bad performance
- dt: New module for dealing with dates and datetimes
- DataFrame.from_pandas: Speed up by avoiding unnecessary conversions
- DataFrame.full_join: Fix join and output when by is a tuple
- GeoJSON: Fix printing object
- Vector: Handle timedeltas correctly for NA checks and printing
- Vector.is_timedelta: New method
- DataFrame.sort: Convert object to string for sorting
- Vector.sort: Convert object to string for sorting
- Fix conditional Numba use when importing the numba package works, but caching doesn't
- Add di-open cli command (currently not part of the default install, but can be installed from source using make install-cli)
- DataFrame.modify: Add support for grouped modification (#19)
- DataFrame.split: New method
- ListOfDicts.split: New method
- DataFrame.compare: New experimental method
- Vector.as_string: Add length argument
- Change the documentation to default to the latest release ("stable") instead of the development version ("latest")
- Use keyword-only arguments where appropriate – the general principle is that mandatory arguments are allowed as positional, but optional modifiers are keyword only
- Rename all instances of "missing" to "na", such as Vector.is_missing to Vector.is_na, the only exception being ListOfDicts.fill_missing, which becomes ListOfDicts.fill_missing_keys
- Truncate data frame object and string columns at PRINT_TRUNCATE_WIDTH (default 32) for printing
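The keyword-only principle mentioned above (mandatory arguments may be positional, optional modifiers must be named) looks like this in plain Python. describe_read and its parameters are hypothetical and only illustrate the signature style, not any actual Dataiter function.

```python
def describe_read(path, *, columns=None, encoding="utf-8"):
    """Hypothetical reader signature: `path` is mandatory, modifiers are keyword-only."""
    return {"path": path, "columns": columns, "encoding": encoding}

# Mandatory argument given positionally, modifier given by name: accepted.
describe_read("data.csv", columns=["a", "b"])

# Passing a modifier positionally is rejected by Python itself.
try:
    describe_read("data.csv", ["a", "b"])
except TypeError as error:
    print(f"rejected: {error}")
```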
- Fix aggregation functions to work with all main data types: boolean, integer, float, date, datetime and string
- Fix aggregation functions to handle all missing values (NaN, NaT, blank string) correctly, the same as implemented in Vector
- Rename aggregation functions' dropna arguments to drop_missing
- first, last, nth: Add drop_missing argument
- Vector.drop_missing: New method
- mode: Fix to return first in case of ties (requires Python >= 3.8)
- std, var: Add ddof argument (defaults to 0 on account of Numba limitations)
- Don't try to dropna for non-float vectors in aggregation functions
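The two statistical details above can be illustrated with the standard library. Since Python 3.8, statistics.mode returns the first mode encountered when there are ties, which is presumably why that version is required. The variance helper below is a hypothetical plain-Python sketch of what a ddof argument means, not Dataiter's implementation.

```python
import statistics

def variance(values, ddof=0):
    """Variance with `ddof` delta degrees of freedom (0 = population, 1 = sample)."""
    n = len(values)
    mean = sum(values) / n
    # Divide the sum of squared deviations by n - ddof.
    return sum((x - mean) ** 2 for x in values) / (n - ddof)

# Ties: 2 and 1 both occur twice, the first one encountered wins (Python >= 3.8).
print(statistics.mode([2, 1, 1, 2]))  # 2

print(variance([1, 2, 3, 4]))          # 1.25
print(variance([1, 2, 3, 4], ddof=1))
```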
- Add shorthand helper functions for use with DataFrame.aggregate, optionally using Numba JIT-compiled code for speed
- DataFrame.map: New method
- ncol: Removed
- nrow: Deprecated in favor of dataiter.count
- read_csv: New alias for DataFrame.read_csv
- read_geojson: New alias for GeoJSON.read
- read_json: New alias for ListOfDicts.read_json
- read_npz: New alias for DataFrame.read_npz
- DataFrame: Make object columns work in various operations
- DataFrame.from_json: Add arguments columns and dtypes
- DataFrame.from_pandas: Add argument dtypes
- DataFrame.full_join: Speed up
- DataFrame.read_csv: Add argument dtypes
- DataFrame.read_json: Add arguments columns and dtypes
- GeoJSON.read: Add arguments columns and dtypes
- ListOfDicts.fill_missing: New method
- ListOfDicts.from_json: Add arguments keys and types
- ListOfDicts.full_join: Speed up
- ListOfDicts.read_csv: Add argument types, rename columns to keys
- ListOfDicts.read_json: Add arguments keys and types
- DataFrame: Fix error message when column not found
- DataFrame.aggregate: Speed up
- DataFrame.full_join: Fix to join all possible columns
- DataFrame.read_csv: Try to avoid mixed types
- ListOfDicts.full_join: Fix to join all possible keys
- ListOfDicts.write_csv: Use minimal quoting
- Vector.get_memory_use: New method
- Vector.rank: Rewrite, add method argument
- *.read_*: Rename fname argument to path
- *.write_*: Rename fname argument to path
- Add comparison table dplyr vs. Dataiter vs. Pandas to documentation: https://dataiter.readthedocs.io/en/latest/comparison.html
- DataFrame.read_npz: New method to read NumPy npz format
- DataFrame.write_npz: New method to write NumPy npz format
- *.read_*: Decompress .bz2|.gz|.xz automatically
- *.write_*: Compress .bz2|.gz|.xz automatically
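Automatic compression by file extension, as in the two entries above, can be approximated with the standard library's bz2, gzip and lzma modules. open_by_extension is a hypothetical helper written for illustration; how Dataiter actually dispatches on the extension is not described here.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Map each supported extension to the matching standard-library opener.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open, ".xz": lzma.open}

def open_by_extension(path, mode="rt", encoding="utf-8"):
    """Open `path` as text, transparently (de)compressing based on its extension."""
    # Fall back to the built-in open for any other extension.
    opener = OPENERS.get(Path(path).suffix, open)
    return opener(path, mode, encoding=encoding)
```

With this sketch, writing to "data.csv.gz" in mode "wt" produces a gzip file and reading it back in mode "rt" yields the original text.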
- DataFrame.print_missing_counts: Fix when nothing missing
- Vector.replace_missing: New method
- DataFrame.print_memory_use: New method
- ListOfDicts.write_csv: Use less memory
- Vector.is_*: Change to be methods instead of properties
- Drop deprecated use of np.int
- Drop deprecated comparisons against NaN
- ListOfDicts.map: New method
- DataFrame.read_csv: Add columns argument
- ListOfDicts.read_csv: Add columns argument
- DataFrame.*_join: Handle differing by names via tuple argument
- ListOfDicts.*_join: Handle differing by names via tuple argument
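Joining on keys that are named differently on each side, as in the entries above, could work roughly as follows for lists of dicts. This is a hedged sketch: left_join here is a hypothetical standalone function, and the convention that a tuple means (left_name, right_name) is an assumption about the semantics rather than a statement of the library's API.

```python
def left_join(left, right, *by):
    """Left-join two lists of dicts; each `by` is a key name or a (left_key, right_key) tuple."""
    # Normalize every `by` item to a (left_key, right_key) pair.
    pairs = [(b, b) if isinstance(b, str) else tuple(b) for b in by]
    # Index the right side by its key values, keeping the first match per key.
    index = {}
    for row in right:
        index.setdefault(tuple(row[rk] for _, rk in pairs), row)
    right_keys = {rk for _, rk in pairs}
    joined = []
    for row in left:
        match = index.get(tuple(row[lk] for lk, _ in pairs), {})
        # Add the matched row's values, leaving out its join keys.
        extra = {k: v for k, v in match.items() if k not in right_keys}
        joined.append({**row, **extra})
    return joined

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"user_id": 1, "total": 10}]
print(left_join(users, orders, ("id", "user_id")))
```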
- Use terminal window width as maximum print width
- Vector.__init__: Handle NaN values in non-float vectors
- Vector.__init__: Accept generators/iterators
- Vector.map: New method
- DataFrame.print_missing_counts: New method
- GeoJSON.read: Handle properties differing between features
- ListOfDicts.print_missing_counts: New method
- Vector.as_object: New method
- GeoJSON.read: Use warnings, not errors for ignored excess feature keys
- GeoJSON: New class
- ListOfDicts.sort: Handle descending sort for all types
- ListOfDicts: Make obsoletion a warning instead of an error
- DataFrame: Fix error printing blank strings (#8)
- DataFrame.filter: Add colname_value_pairs argument
- DataFrame.filter_out: Add colname_value_pairs argument
- ListOfDicts.__init__: Remove arguments not intended for external use
- ListOfDicts.rename: Preserve order of keys
- Add documentation: https://dataiter.readthedocs.io/
- Vector.__init__: Speed up by fixing type deduction
- ListOfDicts.select: Fix return value (#7)
- DataFrame.aggregate: Fix UnicodeEncodeError with string columns
- DataFrame.unique: Fix UnicodeEncodeError with string columns
- ListOfDicts.select: Return keys in requested order
- Vector.__repr__: Add custom conversion to string for display
- Vector.__str__: Add custom conversion to string for display
- Vector.to_string: Add custom conversion to string for display
- Vector.to_strings: Add custom conversion to string for display
- Array: Rename to Vector
- Vector.head: New method
- Vector.range: New method
- Vector.sample: New method
- Vector.sort: New method
- Vector.tail: New method
- Vector.unique: New method
- DataFrame: New class
- ListOfDicts.__add__: New method to support the + operator
- ListOfDicts.__init__: Rename, reorder arguments
- ListOfDicts.__mul__: New method to support the * operator
- ListOfDicts.__repr__: New method, format as JSON
- ListOfDicts.__rmul__: New method to support the * operator
- ListOfDicts.__setitem__: New method, coerce to AttributeDict
- ListOfDicts.__str__: New method, format as JSON
- ListOfDicts.aggregate: Speed up
- ListOfDicts.anti_join: New method
- ListOfDicts.append: New method
- ListOfDicts.clear: New method
- ListOfDicts.extend: New method
- ListOfDicts.full_join: New method
- ListOfDicts.head: New method
- ListOfDicts.inner_join: New method
- ListOfDicts.insert: New method
- ListOfDicts.join: Removed in favor of specific join types
- ListOfDicts.left_join: New method
- ListOfDicts.pluck: Add argument "default" to handle missing keys
- ListOfDicts.print_: New method
- ListOfDicts.read_csv: Add explicit arguments
- ListOfDicts.read_json: Relay arguments to json.loads
- ListOfDicts.read_pickle: New method
- ListOfDicts.reverse: New method
- ListOfDicts.sample: New method
- ListOfDicts.semi_join: New method
- ListOfDicts.sort: Change arguments to support sort direction better
- ListOfDicts.tail: New method
- ListOfDicts.to_data_frame: New method
- ListOfDicts.to_pandas: New method
- ListOfDicts.unique: Return unique by all keys if none given
- ListOfDicts.write_csv: Add explicit arguments
- ListOfDicts.write_pickle: New method
- Make sort handle None values, sorted last
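Sorting with None values placed last, as described above, is commonly done in Python with a two-part sort key. sort_none_last is a hypothetical helper for lists of dicts, shown only to illustrate the technique; it is not the library's implementation.

```python
def sort_none_last(rows, key):
    """Sort a list of dicts by `key`, placing rows whose value is None at the end."""
    # The first tuple element is False for real values and True for None,
    # so all None rows sort after all others; real values are then compared.
    return sorted(rows, key=lambda row: (row[key] is None, row[key]))

rows = [{"x": 3}, {"x": None}, {"x": 1}]
print(sort_none_last(rows, "x"))  # [{'x': 1}, {'x': 3}, {'x': None}]
```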
- Fix ObsoleteError after multiple modifying actions
- Add read_csv
- Add read_json
- Add write_csv
- Add write_json
- Fix ObsoleteError with deepcopy
- Define __deepcopy__ so that copy.deepcopy works too
- Add copy (and __copy__ for copy.copy)
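The copy-related entries above rely on Python's copy protocol: copy.copy calls __copy__ and copy.deepcopy calls __deepcopy__ when a class defines them. The Records class below is a hypothetical stand-in for a list-of-dicts container, sketched under that assumption and not taken from the library.

```python
import copy

class Records(list):
    """Hypothetical stand-in for a list-of-dicts container."""

    def copy(self):
        # Shallow copy: a new container that shares the same dict objects.
        return self.__class__(self)

    def __copy__(self):
        # Lets copy.copy() use the same shallow-copy logic.
        return self.copy()

    def __deepcopy__(self, memo):
        # Deep copy: a new container with independent copies of each dict.
        result = self.__class__()
        memo[id(self)] = result
        result.extend(copy.deepcopy(item, memo) for item in self)
        return result

original = Records([{"x": 1}])
shallow = copy.copy(original)
deep = copy.deepcopy(original)
print(shallow[0] is original[0], deep[0] is original[0])  # True False
```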
- Mark ListOfDicts object obsolete thus preventing (accidental) use if a chained successor has modified the shared dicts
- Add modify_if
- Speed up, mostly by avoiding copying (methods that modify dicts now do it in place rather than making a copy)
- Initial release