Dataset
Astrapia datasets are used to unify dataset representations across different models and explainers. Each interface between a model and an explainer needs to support the following dataset format.
- class astrapia.Dataset
The Dataset class represents a dataset. It should be able to capture diverse attributes of
The Explainer class wraps an explainer and provides a unified interface for it. Initialization depends on the specific explainer. This class should not be used as is but rather extended.
- __init__(data: DataFrame, feature_names: list, categorical_features: dict, target: DataFrame, target_names: list, target_name: str, target_categorical: bool = True, name: str = None, data_dev: DataFrame = None, target_dev: DataFrame = None, data_test: DataFrame = None, target_test: DataFrame = None) → None
- Parameters:
data – training data as pandas DataFrame
feature_names – list of feature names
categorical_features – dictionary of categorical features
target – target feature as pandas DataFrame
target_names – list of target feature values
target_name – name of the target feature
target_categorical – whether the target is categorical (currently only True supported)
name – name of the dataset
data_dev – development data as pandas DataFrame
target_dev – development target as pandas DataFrame
data_test – test data as pandas DataFrame
target_test – test target as pandas DataFrame
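As a rough illustration of how these arguments fit together, the sketch below assembles them for a tiny made-up dataset. The data values, the column names, and the name `toy` are invented for illustration. The structure used for `categorical_features` (feature name mapped to a list of its possible values) is an assumption based on the metadata description later in this section, not something confirmed for the constructor. The final call is guarded so the snippet also runs where astrapia is not installed.

```python
import pandas as pd

# Toy training data; all values are invented for illustration.
data = pd.DataFrame(
    {"age": [25, 47, 33], "workclass": ["private", "public", "private"]}
)
target = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})

dataset_kwargs = dict(
    data=data,
    feature_names=list(data.columns),
    # Assumed structure: feature name -> list of its possible values.
    categorical_features={"workclass": ["private", "public"]},
    target=target,
    target_names=["<=50K", ">50K"],
    target_name="income",
    target_categorical=True,  # documented as the only supported value
    name="toy",
)

try:
    import astrapia

    dataset = astrapia.Dataset(**dataset_kwargs)
except ImportError:
    # astrapia is not installed; the keyword arguments still show the shape.
    dataset = None
```

The optional `data_dev`, `target_dev`, `data_test`, and `target_test` arguments would be passed the same way, as pandas DataFrames with the same columns as `data` and `target`.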
You can easily load a dataset into a dataset object by using the load_csv_data method.
- dataset.load_csv_data(root_path='data', seed=0)
Parse a CSV dataset for use with Astrapia. This function assumes you have a folder $name under data, containing a file $name.data with a comma-separated training set, and a JSON file containing feature names (among other information).
The default data directory ('data/') can be overridden through the root_path parameter.
- Parameters:
seed – RNG seed for the numpy RandomState
dataset_name – name of the dataset, used for path/file names
root_path – path to the root data directory, defaults to 'data/'
- Returns:
data as an astrapia.Dataset
When using your own dataset, either use the initialization function or include the following files:
- A datasetname.data file containing the training data in CSV format.
- A datasetname.test file containing the test data in CSV format.
- A meta.json file containing the following metadata in JSON format:
  - target: The name of the target column.
  - target_categorical: Whether the target is categorical or not (currently only true is supported).
  - target_names: A list of the values of the target categories.
  - features_names: A list of the names of the features.
  - categorical_features: Dictionary mapping feature names to a list of the values of the respective feature.
  - na_values: Token used for missing values.
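The file layout described above can be sketched with the standard library alone. The helper `write_dataset_files` is hypothetical and not part of Astrapia, and the dataset name `toy` and all values are invented. The sketch also makes several assumptions: the test file and meta.json sit in the same $name folder as the training file, the CSV files have no header row, the target is the last column, and the metadata keys are spelled exactly as listed in this section.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_dataset_files(root, name, train_rows, test_rows, meta):
    """Hypothetical helper: create <root>/<name>/ with <name>.data,
    <name>.test and meta.json."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    for suffix, rows in ((".data", train_rows), (".test", test_rows)):
        with open(folder / f"{name}{suffix}", "w", newline="") as fh:
            csv.writer(fh).writerows(rows)
    with open(folder / "meta.json", "w") as fh:
        json.dump(meta, fh, indent=2)
    return folder


# Metadata keys follow the list in this section; the values are made up.
meta = {
    "target": "income",
    "target_categorical": True,
    "target_names": ["<=50K", ">50K"],
    "features_names": ["age", "workclass"],
    "categorical_features": {"workclass": ["private", "public"]},
    "na_values": "?",
}

root = tempfile.mkdtemp()  # stands in for the 'data' root directory
folder = write_dataset_files(
    root,
    "toy",
    train_rows=[[25, "private", "<=50K"], [47, "public", ">50K"]],
    test_rows=[[33, "private", "<=50K"]],
    meta=meta,
)
created = sorted(p.name for p in folder.iterdir())
```

With files laid out this way, the temporary directory would play the role of the root_path argument of load_csv_data.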
Off-the-shelf datasets
To make it quick to start benchmarking, Astrapia supplies several ready-to-use datasets. Visit https://github.com/DataManagementLab/Astrapia to download them.
UCI Adult Data Set: Also known as the Census Income dataset, this multivariate dataset has 48,842 instances and is used to predict whether income exceeds $50K per year based on census data.
UCI Breast Cancer Wisconsin Dataset: This breast cancer dataset is used for a relatively simple binary classification task: predicting from a patient's features whether the diagnosis is malignant or benign.