Data selector

When training models, it is common to try out different subsets of features or subpopulations. DataSelector lets you express a subsetting pipeline succinctly as a series of dictionaries, each describing one transformation of your data.


Subset a pandas.DataFrame by passing a series of steps


  • *steps – Steps to apply to the data sequentially (order matters). Each step must be a dictionary with a “kind” key whose value is one of “column_drop”, “row_drop” or “column_keep”. The remaining key-value pairs must match the signature of the corresponding Step object.
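The step-dictionary convention can be illustrated with a minimal sketch in plain pandas. This is not the library's implementation: apply_steps, KINDS and the three helper functions are hypothetical names introduced here only to show how a “kind” key can select an operation while the remaining keys are passed as its arguments.

```python
import pandas as pd

# Hypothetical stand-ins for the three step kinds (not the library's classes).
def column_drop(df, names):
    return df.drop(columns=names)

def row_drop(df, query):
    # Drop the rows that match the query by removing their index labels.
    matched = df.query(query).index
    return df.drop(index=matched)

def column_keep(df, names):
    return df[names]

KINDS = {
    "column_drop": column_drop,
    "row_drop": row_drop,
    "column_keep": column_keep,
}

def apply_steps(df, *steps):
    """Apply each step in order; 'kind' picks the function, the rest are kwargs."""
    for step in steps:
        params = dict(step)        # copy so the caller's dictionary is not mutated
        kind = params.pop("kind")
        df = KINDS[kind](df, **params)
    return df

df = pd.DataFrame({"age": [25, 17, 40], "income": [10, 20, 30], "id": [1, 2, 3]})
result = apply_steps(
    df,
    {"kind": "column_drop", "names": ["id"]},
    {"kind": "row_drop", "query": "age < 18"},
    {"kind": "column_keep", "names": ["income"]},
)
print(result)
```

Order matters in this sketch for the same reason it does in the description above: if the “column_keep” step ran before the “row_drop” step, the age column would already be gone and the query would fail.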

transform(df, return_summary: bool = False)

Apply steps

  • df – Data frame to transform

  • return_summary – If False, the function returns only the output data frame; if True, it also returns a summary table

class Optional[list] = None, prefix: Optional[str] = None, suffix: Optional[str] = None, contains: Optional[str] = None, max_na_prop: Optional[float] = None)

Drop columns

  • names – List of columns to drop

  • prefix – Drop columns with this prefix (or list of)

  • suffix – Drop columns with this suffix (or list of)

  • contains – Drop columns if they contain this substring

  • max_na_prop – Drop columns whose proportion of NAs (a value in [0, 1]) is larger than this
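As a rough illustration of what these criteria select, the following sketch reproduces them with plain pandas operations. It is an assumed equivalent rather than the library's own code; the column names and the 0.5 threshold are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "raw_height": [1.0, 2.0, 3.0, 4.0],
    "weight_tmp": [5.0, 6.0, 7.0, 8.0],
    "debug_flag": [0, 1, 0, 1],
    "sparse": [np.nan, np.nan, np.nan, 1.0],
    "target": [0, 1, 0, 1],
})

cols = df.columns
to_drop = set()
to_drop |= set(cols[cols.str.startswith("raw_")])               # prefix
to_drop |= set(cols[cols.str.endswith("_tmp")])                 # suffix
to_drop |= set(cols[cols.str.contains("debug", regex=False)])   # contains

# Proportion of NAs per column; "sparse" has 3 NAs out of 4 rows (0.75).
na_prop = df.isna().mean()
to_drop |= set(na_prop[na_prop > 0.5].index)                    # max_na_prop = 0.5

out = df.drop(columns=sorted(to_drop))
print(list(out.columns))
```

Only the target column survives here, since every other column matches at least one of the four criteria.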

class bool = False, query: Optional[str] = None)

Drop rows

  • if_nas – If True, deletes all rows where there is at least one NA

  • query – Drops all rows matching the query (passed to pandas.DataFrame.query)
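The two row-dropping behaviours can be approximated in plain pandas as follows. This is a sketch under the assumption that the semantics are exactly as described above (matching rows are removed, not kept); the data and the query string are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 17, np.nan, 40],
    "income": [10.0, 20.0, 30.0, np.nan],
})

# Equivalent of if_nas=True: drop every row that has at least one NA.
no_nas = df.dropna(axis="index", how="any")

# Equivalent of query="age < 18": drop the rows that match the query.
# DataFrame.query returns the matching rows, so their index labels are removed.
matched = df.query("age < 18").index
no_minors = df.drop(index=matched)

print(no_nas.index.tolist())
print(no_minors.index.tolist())
```

Note that the query describes the rows to remove, which is the opposite of calling pandas.DataFrame.query directly, where the matching rows are the ones returned. A row with a missing age does not match "age < 18", so it is kept by the query-based drop.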

class Optional[list] = None, dotted_path: Optional[str] = None)

Subset columns


  • names – List of columns to keep