Data splitting functionality#

nlpsig provides functionality to split data (given as a pair of inputs, x_data, and corresponding labels, y_data) into train/validation/test splits using nlpsig.DataSplits, or into \(K\) folds using nlpsig.Folds. Both classes can return the data as torch.Tensor objects or as torch.utils.data.dataloader.DataLoader objects ready to be used within a PyTorch model.
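For example, a minimal sketch of both classes on toy data (the array shapes, label values and keyword arguments below are purely illustrative):

```python
import numpy as np

from nlpsig import DataSplits, Folds

# toy data: 100 samples with 16 features and binary labels (illustrative only)
x_data = np.random.rand(100, 16)
y_data = np.random.randint(0, 2, size=100)

# a single train/validation/test split
splits = DataSplits(x_data, y_data, train_size=0.8, valid_size=0.33, shuffle=True)
x_train, y_train, x_valid, y_valid, x_test, y_test = splits.get_splits()

# five folds; get_splits returns the data for one fold at a time
folds = Folds(x_data, y_data, n_splits=5, shuffle=True)
x_train, y_train, x_valid, y_valid, x_test, y_test = folds.get_splits(fold_index=0)
```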

class nlpsig.classification_utils.DataSplits(x_data: np.array | torch.Tensor | dict[str, np.array | torch.Tensor], y_data: np.array | torch.Tensor, groups: np.array | torch.Tensor | None = None, train_size: float = 0.8, valid_size: float | None = 0.33, indices: tuple[Iterable[int], Iterable[int] | None, Iterable[int]] | None = None, shuffle: bool = False, random_state: int = 42)#

Bases: object

Class to split the data into train, validation and test sets.

Parameters:
  • x_data (np.array | torch.Tensor | dict[str, np.array | torch.Tensor]) – Features for prediction. This can be a numpy array or torch tensor of shape (n_samples, n_features), or a dictionary where the keys are the names of the features and the values are numpy arrays or torch tensors of shape (n_samples, n_features).

  • y_data (np.array | torch.Tensor) – Variable to predict.

  • groups (np.array | torch.Tensor | None, optional) – Groups to split by, by default None. If groups are passed, then GroupShuffleSplit is used to create a split in which each group falls entirely within one split, i.e. groups do not span the different splits created.

  • train_size (float, optional) – Proportion of data to use as training data, by default 0.8.

  • valid_size (float | None, optional) – Proportion of training data to use as validation data, by default 0.33. If None, will not create a validation set.

  • indices (tuple[Iterable[int], Iterable[int] | None, Iterable[int]] | None, optional) – Train, validation and test indices to use. If passed, the data is split according to these indices rather than using the train_size and valid_size provided. The first item in the tuple should be the indices for the training set, the second item the indices for the validation set (this can be None if no validation set is required), and the third item the indices for the test set.

  • shuffle (bool, optional) – Whether or not to shuffle the dataset, by default False. This is ignored if either groups are passed, or if indices are passed.

  • random_state (int, optional) – Seed number, by default 42. This is ignored if indices are passed.
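A minimal sketch of constructing DataSplits with grouped data, so that all samples sharing a group id end up in the same split (the data shapes and group ids below are illustrative):

```python
import numpy as np

from nlpsig.classification_utils import DataSplits

x_data = np.random.rand(200, 32)
y_data = np.random.randint(0, 3, size=200)
# e.g. one group id per sample (such as a user or document id);
# samples sharing a group id will not be spread across splits
groups = np.random.randint(0, 20, size=200)

splits = DataSplits(
    x_data,
    y_data,
    groups=groups,      # GroupShuffleSplit is used when groups are passed
    train_size=0.8,
    valid_size=0.33,
    random_state=42,
)
```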

get_splits(as_DataLoader: bool = False, data_loader_args: dict | None = None) tuple[DataLoader, DataLoader, DataLoader] | tuple[np.array | torch.Tensor | dict[str, np.array | torch.Tensor]]#

Returns the train, validation and test sets.

Parameters:
  • as_DataLoader (bool, optional) – Whether or not to return torch.utils.data.dataloader.DataLoader objects ready to be passed into a PyTorch model, by default False.

  • data_loader_args (dict | None, optional) – Any keyword arguments to be passed when constructing the torch.utils.data.dataloader.DataLoader objects, by default {"batch_size": 64, "shuffle": True}.

Returns:

If as_DataLoader is True, returns a tuple of torch.utils.data.dataloader.DataLoader objects:
  • First element is the training dataset

  • Second element is the validation dataset

  • Third element is the testing dataset

If as_DataLoader is False, returns a tuple:
  • First element is features (which is either an array/tensor or dictionary) for the training dataset

  • Second element is labels for the training dataset as an array/tensor

  • Third element is features (which is either an array/tensor or dictionary) for the validation dataset

  • Fourth element is labels for the validation dataset as an array/tensor

  • Fifth element is features (which is either an array/tensor or dictionary) for the testing dataset

  • Sixth element is labels for the testing dataset as an array/tensor

Return type:

tuple[DataLoader, DataLoader, DataLoader] | tuple[np.array | torch.Tensor | dict[str, np.array | torch.Tensor]]
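A sketch of obtaining DataLoader objects from an existing DataSplits instance (the batch size is illustrative, and each batch is assumed to yield a (features, labels) pair):

```python
# `splits` is a DataSplits instance as constructed above
train_loader, valid_loader, test_loader = splits.get_splits(
    as_DataLoader=True,
    data_loader_args={"batch_size": 32, "shuffle": True},
)

# assuming each batch yields a (features, labels) pair
for x_batch, y_batch in train_loader:
    ...  # forward/backward pass of a PyTorch model would go here
```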

class nlpsig.classification_utils.Folds(x_data: np.array | torch.Tensor | dict[str, np.array | torch.Tensor], y_data: np.array | torch.Tensor, groups: np.array | torch.Tensor | None = None, n_splits: int = 5, valid_size: float | None = 0.33, indices: tuple[tuple[Iterable[int], Iterable[int] | None, Iterable[int]]] | None = None, shuffle: bool = False, random_state: int = 42)#

Bases: object

Class to split the data into a number of folds, optionally using group information.

Parameters:
  • x_data (np.array | torch.Tensor | dict[str, np.array | torch.Tensor]) – Features for prediction. This can be a numpy array or torch tensor of shape (n_samples, n_features), or a dictionary where the keys are the names of the features and the values are numpy arrays or torch tensors of shape (n_samples, n_features).

  • y_data (np.array | torch.Tensor) – Variable to predict.

  • groups (np.array | torch.Tensor | None, optional) – Groups to split by, by default None. If None, standard KFold is used; otherwise GroupShuffleSplit is used (if shuffle is True) or GroupKFold (if shuffle is False).

  • n_splits (int, optional) – Number of splits / folds, by default 5.

  • valid_size (float | None, optional) – Proportion of training data to use as validation data, by default 0.33. If None, will not create a validation set.

  • indices (tuple[tuple[Iterable[int], Iterable[int] | None, Iterable[int]]] | None, optional) – Tuple of length n_splits where each item is itself a tuple containing the train, validation and test indices to use for each fold. If passed, the data is split according to these indices rather than using the n_splits and valid_size provided. Within each inner tuple, the first item should be the indices for the training set, the second item the indices for the validation set (this can be None if no validation set is required), and the third item the indices for the test set.

  • shuffle (bool, optional) – Whether or not to shuffle the dataset, by default False.

  • random_state (int, optional) – Seed number, by default 42. This is ignored if indices are passed.

Raises:
  • ValueError – if n_splits < 2.

  • ValueError – if x_data and y_data do not have the same number of records (number of rows in x_data should equal the length of y_data).

  • ValueError – if x_data and groups do not have the same number of records (number of rows in x_data should equal the length of groups).
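A minimal sketch of constructing Folds with group information, so that each fold keeps whole groups together (the data shapes and group ids below are illustrative):

```python
import numpy as np

from nlpsig.classification_utils import Folds

x_data = np.random.rand(500, 64)
y_data = np.random.randint(0, 2, size=500)
groups = np.random.randint(0, 50, size=500)  # e.g. one id per user or document

folds = Folds(
    x_data,
    y_data,
    groups=groups,      # GroupKFold when shuffle=False, GroupShuffleSplit when shuffle=True
    n_splits=5,
    valid_size=0.33,
    random_state=42,
)
```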

get_splits(fold_index: int, as_DataLoader: bool = False, data_loader_args: dict | None = None) tuple[DataLoader, DataLoader, DataLoader] | tuple[np.array | torch.Tensor, np.array | torch.Tensor, np.array | torch.Tensor, np.array | torch.Tensor, np.array | torch.Tensor, np.array | torch.Tensor]#

Obtains the data for a particular fold.

Parameters:
  • fold_index (int) – Which fold to obtain data for.

  • as_DataLoader (bool, optional) – Whether or not to return torch.utils.data.dataloader.DataLoader objects ready to be passed into a PyTorch model, by default False.

  • data_loader_args (dict | None, optional) – Any keyword arguments to be passed when constructing the torch.utils.data.dataloader.DataLoader objects, by default {"batch_size": 64, "shuffle": True}.

Returns:

If as_DataLoader is True, returns a tuple of torch.utils.data.dataloader.DataLoader objects:
  • First element is the training dataset

  • Second element is the validation dataset

  • Third element is the testing dataset

If as_DataLoader is False, returns a tuple:
  • First element is features (which is either an array/tensor or dictionary) for the training dataset

  • Second element is labels for the training dataset as an array/tensor

  • Third element is features (which is either an array/tensor or dictionary) for the validation dataset

  • Fourth element is labels for the validation dataset as an array/tensor

  • Fifth element is features (which is either an array/tensor or dictionary) for the testing dataset

  • Sixth element is labels for the testing dataset as an array/tensor

Return type:

tuple[DataLoader, DataLoader, DataLoader] | tuple[np.array | torch.Tensor | dict[str, np.array | torch.Tensor]]

Raises:

ValueError – if the requested fold_index is not valid (out of range).
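A sketch of a typical cross-validation loop over the folds; the training and evaluation step is left as a placeholder:

```python
# `folds` is a Folds instance with n_splits=5 as constructed above
for fold_index in range(5):
    train_loader, valid_loader, test_loader = folds.get_splits(
        fold_index=fold_index,
        as_DataLoader=True,
        data_loader_args={"batch_size": 64, "shuffle": True},
    )
    # train and evaluate a PyTorch model on this fold
    ...
```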

nlpsig.classification_utils.set_seed(seed: int) None#

Helper function for reproducible behavior that sets the seed in random and torch.

Parameters:

seed (int) – Seed number.
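For example, fixing the seed before creating splits so that repeated runs produce the same partitions:

```python
from nlpsig.classification_utils import set_seed

set_seed(42)  # seeds random and torch so that subsequent splits and training are reproducible
```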