Using nlpsig to construct paths of embeddings#
Once embeddings are obtained, nlpsig.PrepareData
allows the user to construct paths of embeddings by padding.
- class nlpsig.data_preparation.PrepareData(original_df: pd.DataFrame, embeddings: np.array, embeddings_reduced: np.array | None = None, pooled_embeddings: np.array | None = None, id_column: str | None = None, label_column: str | None = None, verbose: bool = True)#
Bases:
object
Class to prepare dataset for computing signatures.
- Parameters:
original_df (pd.DataFrame) – Dataset as a pandas dataframe.
embeddings (np.array) – Embeddings for each of the items in original_df.
embeddings_reduced (np.array | None, optional) – Dimension reduced embeddings, by default None.
pooled_embeddings (np.array | None, optional) – Pooled embeddings for each unique id in id_column, by default None.
id_column (str | None, optional) –
Name of the column which identifies each text, e.g.
”text_id” (if each item in original_df is a word or sentence from a particular text)
”user_id” (if each item in original_df is a post from a particular user)
”timeline_id” (if each item in original_df is a post from a particular timeline)
If None, a dummy id_column named “dummy_id” filled with zeros will be created.
label_column (str | None, optional) – Name of the column which corresponds to the labels of the data.
- Raises:
ValueError – if original_df and embeddings do not have the same number of rows.
ValueError – if original_df and embeddings_reduced do not have the same number of rows (if embeddings_reduced is provided).
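A rough sketch of the expected inputs (toy data with hypothetical column names; the commented-out call assumes nlpsig is installed):

```python
import numpy as np
import pandas as pd

# Toy inputs (illustrative only; column names here are hypothetical)
df = pd.DataFrame({
    "text": ["first post", "second post", "another post"],
    "user_id": [0, 0, 1],
    "label": [1, 0, 1],
})
embeddings = np.random.default_rng(0).normal(size=(3, 8))  # one 8-dim embedding per row

# PrepareData raises ValueError unless df and embeddings have the same number of rows
assert len(df) == embeddings.shape[0]

# Hypothetical construction (assumes nlpsig is importable):
# from nlpsig import PrepareData
# prep = PrepareData(original_df=df, embeddings=embeddings,
#                    id_column="user_id", label_column="label")
```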
- check_history_length_for_SeqSigNet(shift: int, window_size: int, n: int) bool #
Helper function to determine whether or not the path created (by .pad()) has a history length (k) long enough to create a tensor for the SeqSigNet network.
In particular, for a given shift, window_size and n, we must have history_length == shift * n + (window_size - shift).
- Parameters:
shift (int) – Amount we are shifting the window.
window_size (int) – Size of the window we use over the texts.
n (int) – Number of units we wish to use in SeqSigNet.
- Returns:
Whether or not the history length in the path created by .pad() satisfies the requested configuration in SeqSigNet.
- Return type:
bool
- Raises:
ValueError – If a path hasn’t been created yet using .pad().
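The required relationship can be checked directly; a minimal stand-alone sketch of the arithmetic (not the library's own implementation):

```python
def valid_history_length(history_length: int, shift: int, window_size: int, n: int) -> bool:
    # SeqSigNet needs: history_length == shift * n + (window_size - shift)
    return history_length == shift * n + (window_size - shift)

# shift=3, window_size=5, n=4 requires a history length of 3*4 + (5-3) = 14
print(valid_history_length(14, shift=3, window_size=5, n=4))  # True
print(valid_history_length(10, shift=3, window_size=5, n=4))  # False
```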
- get_embeddings(reduced_embeddings: bool = False) array #
Returns a np.array object of the embeddings.
- Parameters:
reduced_embeddings (bool, optional) – If True, returns np.array of dimension reduced embeddings, by default False.
- Returns:
Embeddings.
- Return type:
np.array
- get_path(include_features: bool = True) array #
Returns a np.array object of the path. Includes the features by default (if they are present after the padding).
- Parameters:
include_features (bool, optional) – Whether or not to keep the features, by default True.
- Returns:
Path.
- Return type:
np.array
- Raises:
ValueError – if self.array_padded is None. In this case, need to call .pad() first.
- get_time_feature(time_feature: str = 'timeline_index', standardise_method: str | None = None) dict[str, np.array | Callable | None] #
Returns a dictionary containing a np.array of the requested time_feature (the string passed has to be one of the strings in ._feature_list).
- Parameters:
time_feature (str, optional) – Which time feature to obtain np.array for, by default “timeline_index”.
standardise_method (str | None, optional) –
If not None, applies standardisation to the time features, by default None. Options:
”z_score”: transforms by subtracting the mean and dividing by the standard deviation
”sum_divide”: transforms by dividing by the sum
”minmax”: transforms by returning (x-min(x)) / (max(x)-min(x)) where x is the vector to standardise
- Returns:
Dictionary where dict[“time_feature”] stores the np.array of the time feature, and dict[“transform”] is the function to transform new data using the standardisation applied (if standardise_method is not None), or None.
- Return type:
dict[str, np.array | Callable | None]
- Raises:
ValueError – if time_feature is not in the possible time_features (can be found in ._feature_list attribute).
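The three standardisation options amount to the following transforms (a NumPy sketch, not the library's internal code):

```python
import numpy as np

def standardise(x: np.ndarray, method: str) -> np.ndarray:
    # Mirrors the documented standardise_method options
    if method == "z_score":
        return (x - x.mean()) / x.std()
    if method == "sum_divide":
        return x / x.sum()
    if method == "minmax":
        return (x - x.min()) / (x.max() - x.min())
    raise ValueError(f"unknown standardise_method: {method}")

x = np.array([1.0, 2.0, 3.0, 4.0])
print(standardise(x, "sum_divide"))  # [0.1 0.2 0.3 0.4]
print(standardise(x, "minmax"))      # scales into [0, 1]
```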
- get_torch_path_for_SWNUNetwork(include_features_in_path: bool, include_features_in_input: bool, include_embedding_in_input: bool, reduced_embeddings: bool = False, path_indices: list | np.array | None = None) dict[str, dict[str, torch.tensor] | int | None] #
Returns a dictionary of torch.tensor objects that can be passed into the nlpsig_networks.SWNUNetwork model.
- Parameters:
include_features_in_path (bool) – Whether or not to keep the additional features (e.g. time features) within the path.
include_features_in_input (bool) – Whether or not to concatenate the additional features into the feed-forward neural network in the nlpsig_networks.SWNUNetwork model.
include_embedding_in_input (bool) – Whether or not to concatenate the embeddings into the feed-forward neural network in the nlpsig_networks.SWNUNetwork model. If we created a path for each item in the dataset, we will concatenate the embeddings in .embeddings (if reduced_embeddings=False) or the embeddings in .reduced_embeddings (if reduced_embeddings=True). If we created a path for each id in .id_column, then we concatenate the embeddings in .pooled_embeddings.
reduced_embeddings (bool, optional) – Whether or not to concatenate the dimension reduced embeddings, by default False. This is ignored if we created a path for each id in .id_column, i.e. .pad_method=’id’.
path_indices (list | np.array | None, optional) – If not None, will return the path for the indices specified in path_indices. If None, will return the path for all indices in .df (or all ids in .id_column if pad_by=”id”), by default None.
- Returns:
Dictionary where:
- “x_data” is a dictionary where:
”path” is a tensor of the path to be passed into the nlpsig_networks.SWNUNetwork network
”features” is a tensor of the features (e.g. time features or additional features)
”input_channels” is the number of channels in the path
”num_features” is the number of features (e.g. time features or additional features) (this is None if there are no additional features to concatenate)
- Return type:
dict[str, dict[str, torch.tensor] | int | None]
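The shape of the returned dictionary can be sketched with NumPy stand-ins for the tensors (the shapes here are hypothetical):

```python
import numpy as np

# Stand-ins: 10 items, history length 5, 16 path channels, 2 extra features
path = np.zeros((10, 5, 16))
features = np.zeros((10, 2))

out = {
    "x_data": {"path": path, "features": features},
    "input_channels": path.shape[2],    # channels in the path
    "num_features": features.shape[1],  # None if there are no extra features
}
print(out["input_channels"], out["num_features"])  # 16 2
```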
- get_torch_path_for_SeqSigNet(shift: int, window_size: int, n: int, include_features_in_path: bool, include_features_in_input: bool, include_embedding_in_input: bool, reduced_embeddings: bool = False, path_indices: list | np.array | None = None) dict[str, dict[str, torch.tensor] | int | None] #
Returns a dictionary of torch.tensor objects that can be passed into the nlpsig_networks.SeqSigNet model.
- Parameters:
shift (int) – Amount we are shifting the window.
window_size (int) – Size of the window we use over the texts.
n (int) – Number of units we wish to use in SeqSigNet.
include_features_in_path (bool) – Whether or not to keep the additional features (e.g. time features) within the path.
include_features_in_input (bool) – Whether or not to concatenate the additional features into the feed-forward neural network in the nlpsig_networks.SeqSigNet model.
include_embedding_in_input (bool) – Whether or not to concatenate the embeddings into the feed-forward neural network in the nlpsig_networks.SeqSigNet model. If we created a path for each item in the dataset, we will concatenate the embeddings in .embeddings (if reduced_embeddings=False) or the embeddings in .reduced_embeddings (if reduced_embeddings=True). If we created a path for each id in .id_column, then we concatenate the embeddings in .pooled_embeddings.
reduced_embeddings (bool, optional) – Whether or not to concatenate the dimension reduced embeddings, by default False. This is ignored if we created a path for each id in .id_column, i.e. .pad_method=’id’.
path_indices (list | np.array | None, optional) – If not None, will return the path for the indices specified in path_indices. If None, will return the path for all indices in .df (or all ids in .id_column if pad_by=”id”), by default None.
- Returns:
Dictionary where:
- “x_data” is a dictionary where:
”path” is a tensor of the path to be passed into the nlpsig_networks.SeqSigNet network
”features” is a tensor of the features (e.g. time features or additional features)
”input_channels” is the number of channels in the path
”num_features” is the number of features (e.g. time features or additional features) (this is None if there are no additional features to concatenate)
- Return type:
dict[str, dict[str, torch.tensor] | int | None]
- Raises:
ValueError – If a path hasn’t been created yet using .pad().
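The shift/window_size/n convention can be illustrated by splitting a path of the required history length into n overlapping windows (a sketch of the windowing idea, not the library's own code):

```python
import numpy as np

shift, window_size, n = 3, 5, 4
history_length = shift * n + (window_size - shift)  # 14, as .check_history_length_for_SeqSigNet requires

path = np.arange(history_length)  # stand-in for 14 path steps
# n windows of length window_size, each moved along by `shift` steps
windows = np.stack([path[i * shift : i * shift + window_size] for i in range(n)])
print(windows.shape)               # (4, 5)
print(windows[0], windows[1])      # [0 1 2 3 4] [3 4 5 6 7]
```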
- pad(pad_by: str, method: str = 'k_last', zero_padding: bool = True, k: int = 5, features: list[str] | str | None = None, standardise_method: list[str] | str | None = None, embeddings: str = 'full', include_current_embedding: bool = True, pad_from_below: bool = True) np.array #
Creates an array which stores the path. We create a path for each id in id_column if pad_by=”id” (by constructing a path of the embeddings of the texts associated to each id), or for each item in .df if pad_by=”history” (by constructing a path of the embeddings of the previous texts).
We can decide how long our path is by setting method=”k_last” and specifying k. Alternatively, we can set method=”max”, which sets the length of the path by setting k to be the largest number of texts associated to an individual id.
The function “pads” if there aren’t enough texts to fill in (e.g. if requesting the last 5 posts for an id, but fewer than 5 posts are available), by adding empty records (if zero_padding=True) or by repeating the last previous text (if zero_padding=False). This ensures that each path has the same number of data points.
- Parameters:
pad_by (str) –
How to construct the path. Options are:
”id”: constructs a path of the embeddings of the texts associated to each id
”history”: constructs a path by looking at the embeddings of the previous texts for each text
method (str, optional) –
How long the path is, default “k_last”. Options are:
”k_last”: specifying the length of the path through the choice of k (see below)
”max”: the length of the path is chosen by looking at the largest number of texts associated to an individual id in .id_column
zero_padding (bool, optional) – If True, will pad with zeros. Otherwise, pad with the latest text associated to the id.
k (int, optional) – The requested length of the path, default 5. This is ignored if method=”max”.
features (list[str] | str | None, optional) – Which feature(s) to keep. If None, then doesn’t keep any.
standardise_method (list[str] | str | None, optional) –
If not None, applies standardisation to the features, default None. If a list is passed, must be the same length as features. Options:
”z_score”: transforms by subtracting the mean and dividing by standard deviation
”sum_divide”: transforms by dividing by the sum
”minmax”: transforms by returning (x-min(x)) / (max(x)-min(x)) where x is the vector to standardise
embeddings (str, optional) –
Which embeddings to keep, by default “full”. Options:
”dim_reduced”: dimension reduced embeddings
”full”: full embeddings
”both”: keeps both dimension reduced and full embeddings
include_current_embedding (bool, optional) – If pad_by=”history”, this determines whether or not the embedding for the text is included in its history, by default True. If pad_by=”id”, this argument is ignored.
pad_from_below (bool, optional) – If True, will pad the path from below, otherwise pads the path from above, by default True.
- Returns:
- 3-dimensional array of the path:
First dimension is the ids (if pad_by=”id”) or each text (if pad_by=”history”)
Second dimension is the associated texts
Third dimension is the features (e.g. embeddings / dimension reduced embeddings, time features)
- Return type:
np.array
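The padding behaviour can be sketched for a single id with NumPy (an illustration of the documented zero_padding and pad_from_below options, not the library's implementation):

```python
import numpy as np

def pad_history(emb: np.ndarray, k: int,
                zero_padding: bool = True, pad_from_below: bool = True) -> np.ndarray:
    # Keep the last k embeddings; pad if fewer than k are available
    window = emb[-k:]
    n_missing = k - window.shape[0]
    if n_missing <= 0:
        return window
    if zero_padding:
        filler = np.zeros((n_missing, emb.shape[1]))          # empty records
    else:
        filler = np.repeat(window[-1:], n_missing, axis=0)    # repeat the latest text
    return np.vstack([window, filler]) if pad_from_below else np.vstack([filler, window])

emb = np.arange(6.0).reshape(3, 2)  # 3 texts with 2-dim embeddings
padded = pad_history(emb, k=5)
print(padded.shape)             # (5, 2)
print((padded[3:] == 0).all())  # True: zero padding added from below
```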