
NEWS.md


2024-12-15: Dataiter 0.999

  • DataFrame.from_arrow: Remove strings_as_object argument
  • DataFrame.from_pandas: Remove strings_as_object argument
  • DataFrame.read_csv: Remove strings_as_object argument
  • DataFrame.read_parquet: Remove strings_as_object argument
  • GeoJSON.read: Remove strings_as_object argument
  • ListOfDicts.to_data_frame: Remove strings_as_object argument
  • read_csv: Remove strings_as_object argument
  • read_geojson: Remove strings_as_object argument
  • read_parquet: Remove strings_as_object argument
  • Vector.as_string: Remove length argument
  • Vector.is_na: Fix to work in multidimensional cases where the elements of an object vector are arrays/vectors
  • Vector.rank: Change default method to "min"
  • Vector.rank: Remove method "average"

This is a breaking change that switches the string data type from the fixed-width str_ (a.k.a. <U#) to the variable-width StringDType introduced in NumPy 2.0. The main benefit is greatly reduced memory use, so strings can be used without special care and without falling back to object. The stability note under release 0.99 below still applies.

Note that as StringDType is only in NumPy >= 2.0, any NPZ or Pickle files saved cannot be opened using Dataiter < 0.99 and NumPy < 2.0. If you need that kind of interoperability, consider using the Parquet file format.
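
To give a rough sense of the memory difference, here is a back-of-the-envelope sketch using only the standard library. It assumes fixed-width storage pads every element to the longest string at 4 bytes per character, and for the variable-width case it counts only the UTF-8 payload, ignoring the per-element bookkeeping a real StringDType array also needs. Both function names are hypothetical and not part of Dataiter or NumPy.

```python
def fixed_width_bytes(strings):
    """Bytes used by fixed-width <U# storage: every element is padded
    to the longest string, at 4 bytes per character."""
    width = max((len(s) for s in strings), default=0)
    return 4 * width * len(strings)


def variable_width_payload_bytes(strings):
    """Lower bound for variable-width storage: UTF-8 payload only.
    Per-element overhead is deliberately ignored in this sketch."""
    return sum(len(s.encode("utf-8")) for s in strings)


# A single long outlier forces every fixed-width element to its length.
names = ["a"] * 999 + ["x" * 1000]
fixed = fixed_width_bytes(names)
payload = variable_width_payload_bytes(names)
print(fixed, payload)
```

The gap grows with the length of the longest string, which is why a single outlier could previously make a string column impractically large.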

2024-08-17: Dataiter 0.99

  • Adapt to changes in NumPy 2.0
  • Bump NumPy dependency to >= 2.0

This is a minimal change to be NumPy 2.0 compatible. In the 0.99+ releases, we plan to adopt the new NumPy string dtype and fix any regressions that come up, leading to a 1.0 release when everything looks to be working reliably (#26). Anyone looking for extreme stability should consider avoiding the 0.99+ releases and waiting for 1.0.

2024-06-24: Dataiter 0.51

  • Mark NumPy dependency as < 2.0

2024-04-06: Dataiter 0.50

  • ListOfDicts.drop_na: New method
  • ListOfDicts.keys: New method
  • ListOfDicts.print_memory_use: New method
  • Fix tabular display of Unicode characters with width != 1
  • Add dependency on wcwidth: https://pypi.org/project/wcwidth
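
The display issue arises because padding by character count misaligns columns when a character occupies two terminal cells (or none). The sketch below approximates the idea with the standard library's unicodedata; it is not what wcwidth does internally, and display_width and pad are hypothetical helper names.

```python
import unicodedata


def display_width(text):
    """Approximate terminal column width of text.

    Wide and fullwidth characters count as 2 columns, combining marks
    as 0, everything else as 1. wcwidth covers more cases than this.
    """
    width = 0
    for char in text:
        if unicodedata.combining(char):
            continue
        width += 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
    return width


def pad(text, columns):
    """Left-justify text to the given number of terminal columns."""
    return text + " " * max(0, columns - display_width(text))


for value in ["abc", "日本"]:
    print(pad(value, 6) + "|")
```

Padding with str.ljust instead would add four spaces after "日本" rather than two, pushing the column separator out of line.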

2023-11-08: Dataiter 0.49

  • dt: Handle all NaT input
  • Migrate from setup.py to hatch and pyproject.toml

2023-10-08: Dataiter 0.48

  • Vector.as_datetime: Add precision argument
  • Vector.concat: New method
  • Vector.sort: Fix sorting object vectors

2023-09-09: Dataiter 0.47

  • DataFrame: Fix column and method name clash errors in certain operations
  • dt.replace: Allow vector arguments the same length as x

2023-09-05: Dataiter 0.46

  • DataFrame.count: New method, shorthand for data.group_by(...).aggregate(n=di.count())
  • Vector.rank: Handle empty and all-NA vectors
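
Conceptually, a grouped count tallies rows per distinct combination of the grouping keys. The sketch below illustrates that on plain dicts with the standard library only; it is not Dataiter's implementation, and count_by is a hypothetical name.

```python
from collections import Counter


def count_by(rows, *by):
    """Return one dict per distinct combination of the `by` keys,
    with the number of matching rows stored under "n"."""
    tally = Counter(tuple(row[key] for key in by) for row in rows)
    return [dict(zip(by, group), n=n) for group, n in tally.items()]


rows = [
    {"species": "cat", "age": 3},
    {"species": "dog", "age": 5},
    {"species": "cat", "age": 7},
]
print(count_by(rows, "species"))
```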

2023-06-14: Dataiter 0.45

  • USE_NUMBA_CACHE: New option, read from environment variable DATAITER_USE_NUMBA_CACHE if it exists, defaults to True
  • Fix a possible issue with Numba caching
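
A minimal sketch of reading a boolean option from an environment variable with a default of True. The helper name read_bool_env and the set of accepted spellings are assumptions for illustration; this changelog does not say how Dataiter parses the value.

```python
import os


def read_bool_env(name, default=True):
    """Return a boolean option from the environment, or the default if
    the variable is unset. The accepted spellings are an assumption."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


USE_NUMBA_CACHE = read_bool_env("DATAITER_USE_NUMBA_CACHE", default=True)
print(USE_NUMBA_CACHE)
```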

2023-06-13: Dataiter 0.44

  • Use numba.extending.overload instead of the deprecated numba.generated_jit

2023-06-08: Dataiter 0.43

  • DataFrame: Don't try to do joins on NA values in by columns
  • DataFrame.drop_na: New method

2023-05-30: Dataiter 0.42

  • DataFrame: Truncate multiline strings when printing
  • DataFrame.from_arrow: New method
  • DataFrame.read_parquet: New method
  • DataFrame.to_arrow: New method
  • DataFrame.write_parquet: New method
  • read_parquet: New function
  • Vector.__init__: Fix type guessing when mixing Python and NumPy floats or integers and missing values
  • Allow using a thousand separator when printing numbers, off by default, can be set with dataiter.PRINT_THOUSAND_SEPARATOR

2023-03-11: Dataiter 0.41

  • Fix printing really small numbers

2023-02-21: Dataiter 0.40.1

  • DataFrame.modify: Fix grouped modify on unsorted data frame

2023-02-20: Dataiter 0.40

  • Vector.map: Add dtype argument

2023-02-06: Dataiter 0.39.1

  • ListOfDicts.to_data_frame: Add strings_as_object argument

2023-01-21: Dataiter 0.39

  • read_csv, read_geojson, DataFrame.from_pandas, DataFrame.read_csv, GeoJSON.read: Add strings_as_object argument

2022-12-15: Dataiter 0.38

  • DataFrame.slice_off: New method
  • GeoJSON.to_data_frame: New method
  • Fix error with new column placeholder attributes in conjunction with pop, popitem and clear

2022-11-17: Dataiter 0.37

  • DataFrame: Add placeholder attributes for columns so that tab completion of columns as attributes at a shell works
  • dt.from_string: New function
  • dt.to_string: New function
  • nrow: Remove deprecated aggregation function
  • Don't use Numba for aggregation involving strings due to bad performance

2022-10-16: Dataiter 0.36

  • dt: New module for dealing with dates and datetimes

2022-10-03: Dataiter 0.35

  • DataFrame.from_pandas: Speed up by avoiding unnecessary conversions
  • DataFrame.full_join: Fix join and output when by is a tuple
  • GeoJSON: Fix printing object

2022-09-17: Dataiter 0.34

  • Vector: Handle timedeltas correctly for NA checks and printing
  • Vector.is_timedelta: New method

2022-09-03: Dataiter 0.33

  • DataFrame.sort: Convert object to string for sorting
  • Vector.sort: Convert object to string for sorting
  • Fix conditional Numba use when importing the numba package works, but caching doesn't
  • Add di-open cli command (currently not part of the default install, but can be installed from source using make install-cli)

2022-04-02: Dataiter 0.32

  • DataFrame.modify: Add support for grouped modification (#19)
  • DataFrame.split: New method
  • ListOfDicts.split: New method

2022-02-26: Dataiter 0.31

  • DataFrame.compare: New experimental method
  • Vector.as_string: Add length argument
  • Change the documentation to default to the latest release ("stable") instead of the development version ("latest")

2022-02-19: Dataiter 0.30

  • Use keyword-only arguments where appropriate – the general principle is that mandatory arguments are allowed as positional, but optional modifiers are keyword only
  • Rename all instances of "missing" to "na", such as Vector.is_missing to Vector.is_na, the only exception being ListOfDicts.fill_missing, which becomes ListOfDicts.fill_missing_keys
  • Truncate data frame object and string columns at PRINT_TRUNCATE_WIDTH (default 32) for printing
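
The keyword-only principle can be illustrated with plain Python: arguments after a bare * in the signature must be passed by name. The function read_table and its parameters are hypothetical and not part of the Dataiter API.

```python
def read_table(path, *, encoding="utf-8", header=True):
    """Hypothetical reader: `path` is mandatory and may be positional,
    while the optional modifiers after `*` must be passed by keyword."""
    return {"path": path, "encoding": encoding, "header": header}


# Mandatory argument positional, optional modifier by keyword: accepted.
read_table("data.csv", encoding="latin-1")

# Passing an optional modifier positionally raises TypeError.
try:
    read_table("data.csv", "latin-1")
except TypeError as error:
    print(error)
```

This keeps call sites readable and lets optional arguments be reordered or added later without breaking existing callers.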

2022-02-09: Dataiter 0.29.2

  • Fix aggregation functions to work with all main data types: boolean, integer, float, date, datetime and string
  • Fix aggregation functions to handle all missing values (NaN, NaT, blank string) correctly, the same as implemented in Vector
  • Rename aggregation functions' dropna arguments to drop_missing
  • first, last, nth: Add drop_missing argument
  • Vector.drop_missing: New method

2022-01-30: Dataiter 0.29.1

  • mode: Fix to return first in case of ties (requires Python >= 3.8)
  • std, var: Add ddof argument (defaults to 0 on account of Numba limitations)
  • Don't try to dropna for non-float vectors in aggregation functions
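
Two of the behaviours above can be seen with the standard library's statistics module, used here as a stand-in rather than Dataiter's own aggregation functions: mode returns the first value encountered among ties on Python >= 3.8, and ddof=0 corresponds to dividing by n (population variance) while ddof=1 divides by n - 1 (sample variance).

```python
import statistics

values = [2, 1, 1, 2, 3]

# Python >= 3.8: with tied counts, the first value encountered wins.
first_mode = statistics.mode(values)

# ddof=0 divides by n (population), ddof=1 by n - 1 (sample).
var_ddof0 = statistics.pvariance(values)
var_ddof1 = statistics.variance(values)
print(first_mode, var_ddof0, var_ddof1)
```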

2022-01-29: Dataiter 0.29

2022-01-09: Dataiter 0.28

  • DataFrame: Make object columns work in various operations
  • DataFrame.from_json: Add arguments columns and dtypes
  • DataFrame.from_pandas: Add argument dtypes
  • DataFrame.full_join: Speed up
  • DataFrame.read_csv: Add argument dtypes
  • DataFrame.read_json: Add arguments columns and dtypes
  • GeoJSON.read: Add arguments columns and dtypes
  • ListOfDicts.fill_missing: New method
  • ListOfDicts.from_json: Add arguments keys and types
  • ListOfDicts.full_join: Speed up
  • ListOfDicts.read_csv: Add argument types, rename columns to keys
  • ListOfDicts.read_json: Add arguments keys and types

2022-01-01: Dataiter 0.27

  • DataFrame: Fix error message when column not found
  • DataFrame.aggregate: Speed up
  • DataFrame.full_join: Fix to join all possible columns
  • DataFrame.read_csv: Try to avoid mixed types
  • ListOfDicts.full_join: Fix to join all possible keys
  • ListOfDicts.write_csv: Use minimal quoting
  • Vector.get_memory_use: New method
  • Vector.rank: Rewrite, add method argument
  • *.read_*: Rename fname argument to path
  • *.write_*: Rename fname argument to path
  • Add comparison table dplyr vs. Dataiter vs. Pandas to documentation: https://dataiter.readthedocs.io/en/latest/comparison.html

2021-12-02: Dataiter 0.26

  • DataFrame.read_npz: New method to read NumPy npz format
  • DataFrame.write_npz: New method to write NumPy npz format
  • *.read_*: Decompress .bz2|.gz|.xz automatically
  • *.write_*: Compress .bz2|.gz|.xz automatically
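
Extension-based compression of this kind can be sketched with the standard library's bz2, gzip and lzma modules. The helper open_auto is hypothetical and only illustrates the dispatch; it is not taken from Dataiter's source.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Map file extensions to the matching standard-library opener.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open, ".xz": lzma.open}


def open_auto(path, mode="rt", encoding="utf-8"):
    """Open path as text, transparently (de)compressing based on the
    final extension. Falls back to the built-in open otherwise."""
    opener = OPENERS.get(Path(path).suffix, open)
    return opener(path, mode, encoding=encoding)
```

With this, writing to a path such as data.csv.gz via open_auto(path, "wt") produces a gzip file, and reading it back with open_auto(path) yields the original text.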

2021-11-13: Dataiter 0.25

  • DataFrame.print_missing_counts: Fix when nothing missing
  • Vector.replace_missing: New method

2021-10-27: Dataiter 0.24

  • DataFrame.print_memory_use: New method
  • ListOfDicts.write_csv: Use less memory

2021-07-08: Dataiter 0.23

  • Vector.is_*: Change to be methods instead of properties
  • Drop deprecated use of np.int
  • Drop deprecated comparisons against NaN

2021-05-13: Dataiter 0.22

  • ListOfDicts.map: New method

2021-03-08: Dataiter 0.21

  • DataFrame.read_csv: Add columns argument
  • ListOfDicts.read_csv: Add columns argument

2021-03-06: Dataiter 0.20

  • DataFrame.*_join: Handle differing by names via tuple argument
  • ListOfDicts.*_join: Handle differing by names via tuple argument
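
As an illustration of joining on keys that are named differently on each side, the sketch below implements a small left join over lists of dicts. It assumes the tuple is ordered (name on the left side, name on the right side); the function is hypothetical, keeps only the first match per key, and is not Dataiter's implementation.

```python
def left_join(left, right, by):
    """Left join two lists of dicts. `by` is either one key name shared
    by both sides, or a (left_name, right_name) tuple."""
    left_key, right_key = by if isinstance(by, tuple) else (by, by)
    lookup = {}
    for row in right:
        # Keep the first match per key; drop the right-hand join key.
        extras = {k: v for k, v in row.items() if k != right_key}
        lookup.setdefault(row[right_key], extras)
    return [{**row, **lookup.get(row[left_key], {})} for row in left]


orders = [{"customer_id": 1, "total": 10}, {"customer_id": 2, "total": 5}]
customers = [{"id": 1, "name": "Ann"}]
print(left_join(orders, customers, ("customer_id", "id")))
```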

2021-03-04: Dataiter 0.19

  • Use terminal window width as maximum print width
  • Vector.__init__: Handle NaN values in non-float vectors

2021-03-03: Dataiter 0.18

  • Vector.__init__: Accept generators/iterators
  • Vector.map: New method

2021-02-27: Dataiter 0.17

  • DataFrame.print_missing_counts: New method
  • GeoJSON.read: Handle properties differing between features
  • ListOfDicts.print_missing_counts: New method
  • Vector.as_object: New method

2020-10-03: Dataiter 0.16.1

  • GeoJSON.read: Use warnings, not errors for ignored excess feature keys

2020-09-26: Dataiter 0.16

  • GeoJSON: New class

2020-09-12: Dataiter 0.15

  • ListOfDicts.sort: Handle descending sort for all types

2020-08-22: Dataiter 0.14

  • ListOfDicts: Make obsoletion a warning instead of an error

2020-08-15: Dataiter 0.13

  • DataFrame: Fix error printing blank strings (#8)

2020-07-25: Dataiter 0.12

  • DataFrame.filter: Add colname_value_pairs argument
  • DataFrame.filter_out: Add colname_value_pairs argument
  • ListOfDicts.__init__: Remove arguments not intended for external use
  • ListOfDicts.rename: Preserve order of keys
  • Add documentation: https://dataiter.readthedocs.io/

2020-06-02: Dataiter 0.11

  • Vector.__init__: Speed up by fixing type deduction

2020-05-28: Dataiter 0.10.1

  • ListOfDicts.select: Fix return value (#7)

2020-05-21: Dataiter 0.10

  • DataFrame.aggregate: Fix UnicodeEncodeError with string columns
  • DataFrame.unique: Fix UnicodeEncodeError with string columns
  • ListOfDicts.select: Return keys in requested order
  • Vector.__repr__: Add custom conversion to string for display
  • Vector.__str__: Add custom conversion to string for display
  • Vector.to_string: Add custom conversion to string for display
  • Vector.to_strings: Add custom conversion to string for display

2020-05-11: Dataiter 0.9

  • Array: Rename to Vector
  • Vector.head: New method
  • Vector.range: New method
  • Vector.sample: New method
  • Vector.sort: New method
  • Vector.tail: New method
  • Vector.unique: New method

2020-05-10: Dataiter 0.8

  • DataFrame: New class
  • ListOfDicts.__add__: New method to support the + operator
  • ListOfDicts.__init__: Rename, reorder arguments
  • ListOfDicts.__mul__: New method to support the * operator
  • ListOfDicts.__repr__: New method, format as JSON
  • ListOfDicts.__rmul__: New method to support the * operator
  • ListOfDicts.__setitem__: New method, coerce to AttributeDict
  • ListOfDicts.__str__: New method, format as JSON
  • ListOfDicts.aggregate: Speed up
  • ListOfDicts.anti_join: New method
  • ListOfDicts.append: New method
  • ListOfDicts.clear: New method
  • ListOfDicts.extend: New method
  • ListOfDicts.full_join: New method
  • ListOfDicts.head: New method
  • ListOfDicts.inner_join: New method
  • ListOfDicts.insert: New method
  • ListOfDicts.join: Removed in favor of specific join types
  • ListOfDicts.left_join: New method
  • ListOfDicts.pluck: Add argument "default" to handle missing keys
  • ListOfDicts.print_: New method
  • ListOfDicts.read_csv: Add explicit arguments
  • ListOfDicts.read_json: Relay arguments to json.loads
  • ListOfDicts.read_pickle: New method
  • ListOfDicts.reverse: New method
  • ListOfDicts.sample: New method
  • ListOfDicts.semi_join: New method
  • ListOfDicts.sort: Change arguments to support sort direction better
  • ListOfDicts.tail: New method
  • ListOfDicts.to_data_frame: New method
  • ListOfDicts.to_pandas: New method
  • ListOfDicts.unique: Return unique by all keys if none given
  • ListOfDicts.write_csv: Add explicit arguments
  • ListOfDicts.write_pickle: New method

2019-12-03: Dataiter 0.7

  • Make sort handle None values, sorted last
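
Python 3 cannot compare None with other values, so sorting data that contains None needs special handling. A minimal sketch of one way to sort with None placed last follows; the function name is hypothetical, and keeping None last for descending sorts too is an assumption of this example.

```python
def sort_none_last(values, reverse=False):
    """Sort values, always placing None at the end.

    None cannot be compared with other types in Python 3, so it is
    separated out before sorting the remaining values.
    """
    present = sorted((v for v in values if v is not None), reverse=reverse)
    missing = [v for v in values if v is None]
    return present + missing


print(sort_none_last([3, None, 1]))
```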

2019-11-29: Dataiter 0.6

  • Fix ObsoleteError after multiple modifying actions

2019-11-10: Dataiter 0.5

  • Add read_csv
  • Add read_json
  • Add write_csv
  • Add write_json

2019-11-01: Dataiter 0.4

  • Fix ObsoleteError with deepcopy
  • Define __deepcopy__ so that copy.deepcopy works too
  • Add copy (and __copy__ for copy.copy)

2019-11-01: Dataiter 0.3

  • Mark ListOfDicts object obsolete thus preventing (accidental) use if a chained successor has modified the shared dicts
  • Add modify_if

2019-10-31: Dataiter 0.2

  • Speed up, mostly by avoiding copying (methods that modify dicts now do it in place rather than making a copy)

2019-09-29: Dataiter 0.1

  • Initial release