Dataset

Astrapia datasets are used to unify dataset representations across different models and explainers. Each interface between a model and an explainer needs to support the following dataset format.

class astrapia.Dataset

The Dataset class represents a dataset. It captures the training data and target together with feature and target metadata and, optionally, development and test splits.


__init__(data: DataFrame, feature_names: list, categorical_features: dict, target: DataFrame, target_names: list, target_name: str, target_categorical: bool = True, name: str = None, data_dev: DataFrame = None, target_dev: DataFrame = None, data_test: DataFrame = None, target_test: DataFrame = None) → None

Parameters:

data – training data as pandas DataFrame
feature_names – list of feature names
categorical_features – dictionary of categorical features
target – target feature as pandas DataFrame
target_names – list of target feature values
target_name – name of the target feature
target_categorical – whether the target is categorical (currently only True is supported)
name – name of the dataset
data_dev – development data as pandas DataFrame
target_dev – development target as pandas DataFrame
data_test – test data as pandas DataFrame
target_test – test target as pandas DataFrame
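
The metadata arguments have to agree with the data itself. The following is a minimal sketch, using only the standard library, of deriving feature_names, categorical_features, and target_names from a few in-memory rows. The helper derive_metadata and the toy rows are hypothetical, and the shape of categorical_features (feature name mapped to its list of values) is assumed to match the meta.json description later in this section; it is not confirmed for the constructor.

```python
def derive_metadata(rows, target_name):
    """Derive Dataset-style metadata from a list of row dicts (hypothetical helper)."""
    # Every column except the target is treated as a feature, in first-row order.
    feature_names = [col for col in rows[0] if col != target_name]

    # Assumption: a feature is categorical when all of its values are strings.
    categorical_features = {}
    for name in feature_names:
        values = [row[name] for row in rows]
        if all(isinstance(v, str) for v in values):
            categorical_features[name] = sorted(set(values))

    # Distinct target values, sorted for a stable order.
    target_names = sorted({row[target_name] for row in rows})
    return feature_names, categorical_features, target_names


# Toy rows, made up for illustration.
rows = [
    {"age": 39, "workclass": "State-gov", "income": "<=50K"},
    {"age": 50, "workclass": "Private", "income": ">50K"},
    {"age": 38, "workclass": "Private", "income": "<=50K"},
]

feature_names, categorical_features, target_names = derive_metadata(rows, "income")
print(feature_names)
print(categorical_features)
print(target_names)
```

These values, together with the data and target as pandas DataFrames, would then be passed as the corresponding keyword arguments of __init__.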

To load a dataset from disk into a Dataset object, use the load_csv_data function.

dataset.load_csv_data(dataset_name, root_path='data', seed=0)

Parse a CSV dataset. This function assumes a folder $name under the data directory, containing a file $name.data with a comma-separated training set and a JSON file with the feature names (among other information).

The default data directory (‘data/’) can be overridden through the root_path parameter.

Parameters:

seed – RNG seed for numpy RandomState
dataset_name – name of the dataset, used for path/file names
root_path – path to the root data directory, defaults to ‘data/’

Returns:

data as an astrapia.Dataset

When using your own dataset, either construct the Dataset directly through __init__ or provide the following files so that it can be loaded with load_csv_data:

A datasetname.data file containing the training data in CSV format.
A datasetname.test file containing the test data in CSV format.
A meta.json file containing the following metadata in JSON format:
- target: The name of the target column.
- target_categorical: Whether the target is categorical or not (currently only true is supported).
- target_names: A list of the values of the target categories.
- features_names: A list of the names of the features.
- categorical_features: Dictionary mapping feature names to a list of the values of the respective feature.
- na_values: Token used for missing values.
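
As a rough illustration of the layout described above, the sketch below writes a toy dataset folder using only the standard library. The dataset name toy, the column values, the ? missing-value token, and the helper write_toy_dataset are made up for illustration. Writing the CSV files without a header row, with columns in feature order followed by the target, is an assumption and not confirmed behaviour of load_csv_data.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_toy_dataset(root_path, name="toy"):
    """Create <root_path>/<name>/ with <name>.data, <name>.test and meta.json."""
    folder = Path(root_path) / name
    folder.mkdir(parents=True, exist_ok=True)

    # Columns: age, workclass, income (the target comes last).
    train_rows = [[39, "State-gov", "<=50K"], [50, "Private", ">50K"]]
    test_rows = [[38, "Private", "<=50K"]]

    # Assumption: no header row; column names come from meta.json instead.
    for suffix, rows in ((".data", train_rows), (".test", test_rows)):
        with open(folder / f"{name}{suffix}", "w", newline="") as f:
            csv.writer(f).writerows(rows)

    # Keys are spelled exactly as listed in the documentation above.
    meta = {
        "target": "income",
        "target_categorical": True,
        "target_names": ["<=50K", ">50K"],
        "features_names": ["age", "workclass"],
        "categorical_features": {"workclass": ["State-gov", "Private"]},
        "na_values": "?",
    }
    with open(folder / "meta.json", "w") as f:
        json.dump(meta, f, indent=2)
    return folder


root = tempfile.mkdtemp()
folder = write_toy_dataset(root)
print(sorted(p.name for p in folder.iterdir()))
```

With such a folder in place, the dataset would presumably be loadable by passing toy as dataset_name and the temporary directory as root_path to load_csv_data.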

Off-the-shelf datasets

To make it quick to start benchmarking, Astrapia supplies several ready-to-use datasets. Visit https://github.com/DataManagementLab/Astrapia to download them.

UCI Adult Data Set: A multivariate dataset with 48,842 instances, also known as the Census Income dataset, used to predict whether income exceeds $50K per year based on census data.
UCI Breast Cancer Wisconsin Dataset: A breast cancer dataset for a relatively simple binary classification task: predicting whether a patient's diagnosis is malignant or benign from the recorded features.