Using nlpsig to construct paths of embeddings#

Once embeddings are obtained, nlpsig.PrepareData allows the user to construct paths of embeddings by padding.

class nlpsig.data_preparation.PrepareData(original_df: pd.DataFrame, embeddings: np.array, embeddings_reduced: np.array | None = None, pooled_embeddings: np.array | None = None, id_column: str | None = None, label_column: str | None = None, verbose: bool = True)#

Bases: object

Class to prepare a dataset for computing signatures.

Parameters:
  • original_df (pd.DataFrame) – Dataset as a pandas dataframe.

  • embeddings (np.array) – Embeddings for each of the items in original_df.

  • embeddings_reduced (np.array | None, optional) – Dimension reduced embeddings, by default None.

  • pooled_embeddings (np.array | None, optional) – Pooled embeddings for each unique id in id_column, by default None.

  • id_column (str | None, optional) – Name of the column which identifies each text, e.g. “text_id” (if each item in original_df is a word or sentence from a particular text), “user_id” (if each item in original_df is a post from a particular user), or “timeline_id” (if each item in original_df is a post from a particular timeline). If None, a dummy id column named “dummy_id” is created and filled with zeros.

  • label_column (str | None, optional) – Name of the column which corresponds to the labels of the data.

Raises:
  • ValueError – if original_df and embeddings do not have the same number of rows.

  • ValueError – if original_df and embeddings_reduced do not have the same number of rows (if embeddings_reduced is provided).
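
A minimal sketch of constructing the class, assuming a hypothetical dataframe with “user_id” and “label” columns and randomly generated 384-dimensional embeddings standing in for real model output:

    import numpy as np
    import pandas as pd
    from nlpsig import PrepareData

    # hypothetical dataset: one row per post, with an id and a label column
    df = pd.DataFrame(
        {
            "text": ["first post", "second post", "third post"],
            "user_id": [0, 0, 1],
            "label": [1, 0, 1],
        }
    )

    # hypothetical embeddings: one 384-dimensional vector per row of df
    embeddings = np.random.rand(len(df), 384)

    prepare = PrepareData(
        original_df=df,
        embeddings=embeddings,
        id_column="user_id",
        label_column="label",
    )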

check_history_length_for_SeqSigNet(shift: int, window_size: int, n: int) bool#

Helper function to determine whether or not the path created (by .pad()) has a history length (k) long enough to create a tensor for the SeqSigNet network.

In particular, for a given shift, window_size and n, we must have history_length == shift * n + (window_size - shift).

Parameters:
  • shift (int) – Amount we are shifting the window.

  • window_size (int) – Size of the window we use over the texts.

  • n (int) – Number of units we wish to use in SeqSigNet.

Returns:

Whether or not the history length in the path created by .pad() satisfies the requested configuration in SeqSigNet.

Return type:

bool

Raises:

ValueError – If a path hasn’t been created yet using .pad().
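
For example, shift=3, window_size=5 and n=3 require a history length of 3 * 3 + (5 - 3) = 11. A short sketch, continuing the hypothetical prepare object from above:

    # a path must have been created with .pad() first
    prepare.pad(pad_by="history", method="k_last", k=11)

    # True if the history length (k=11 here) equals shift * n + (window_size - shift)
    prepare.check_history_length_for_SeqSigNet(shift=3, window_size=5, n=3)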

get_embeddings(reduced_embeddings: bool = False) array#

Returns a np.array object of the embeddings.

Parameters:

reduced_embeddings (bool, optional) – If True, returns np.array of dimension reduced embeddings, by default False.

Returns:

Embeddings.

Return type:

np.array
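
A brief usage sketch, continuing the hypothetical prepare object; the second call assumes embeddings_reduced was passed to the constructor:

    # full embeddings, one row per item in original_df
    emb = prepare.get_embeddings()

    # dimension reduced embeddings, only available if embeddings_reduced was provided
    emb_reduced = prepare.get_embeddings(reduced_embeddings=True)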

get_path(include_features: bool = True) array#

Returns a np.array object of the path. Includes the features by default (if they are present after the padding).

Parameters:

include_features (bool, optional) – Whether or not to keep the features, by default True.

Returns:

Path.

Return type:

np.array

Raises:

ValueError – if self.array_padded is None. In this case, need to call .pad() first.
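
A brief sketch; note that .pad() must have been called first, otherwise a ValueError is raised:

    # path including any features kept during padding
    path = prepare.get_path()

    # path with only the embeddings, dropping the features
    path_no_features = prepare.get_path(include_features=False)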

get_time_feature(time_feature: str = 'timeline_index', standardise_method: str | None = None) dict[str, np.array | Callable | None]#

Returns a np.array object of the time_feature that is requested (the string passed has to be one of the strings in ._feature_list).

Parameters:
  • time_feature (str, optional) – Which time feature to obtain np.array for, by default “timeline_index”.

  • standardise_method (str | None, optional) –

    If not None, applies standardisation to the time features, by default None. Options:

    • ”z_score”: transforms by subtracting the mean and dividing by the standard deviation

    • ”sum_divide”: transforms by dividing by the sum

    • ”minmax”: transforms by computing (x - min(x)) / (max(x) - min(x)), where x is the vector to standardise

Returns:

Dictionary where dict[“time_feature”] stores the np.array of the time feature, and dict[“transform”] is the function to transform new data using the standardisation applied (if standardise_method is not None), or None.

Return type:

dict[str, np.array | Callable | None]

Raises:

ValueError – if time_feature is not in the possible time_features (can be found in ._feature_list attribute).
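
A brief sketch using the default “timeline_index” time feature; the dictionary keys follow the Returns description above:

    out = prepare.get_time_feature(
        time_feature="timeline_index",
        standardise_method="z_score",
    )

    times = out["time_feature"]   # np.array of the standardised time feature
    transform = out["transform"]  # callable to standardise new data the same way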

get_torch_path_for_SWNUNetwork(include_features_in_path: bool, include_features_in_input: bool, include_embedding_in_input: bool, reduced_embeddings: bool = False, path_indices: list | np.array | None = None) dict[str, dict[str, torch.tensor] | int | None]#

Returns a dictionary containing the torch.tensor objects that can be passed into the nlpsig_networks.SWNUNetwork model.

Parameters:
  • include_features_in_path (bool) – Whether or not to keep the additional features (e.g. time features) within the path.

  • include_features_in_input (bool) – Whether or not to concatenate the additional features into the feed-forward neural network in the nlpsig_networks.SWNUNetwork model.

  • include_embedding_in_input (bool) – Whether or not to concatenate the embeddings into the feed-forward neural network in the nlpsig_networks.SWNUNetwork model. If we created a path for each item in the dataset, we will concatenate the embeddings in .embeddings (if reduced_embeddings=False) or the embeddings in .reduced_embeddings (if reduced_embeddings=True). If we created a path for each id in .id_column, then we concatenate the embeddings in .pooled_embeddings.

  • reduced_embeddings (bool, optional) – Whether or not to concatenate the dimension reduced embeddings, by default False. This is ignored if we created a path for each id in .id_column, i.e. if pad_by=”id” was used in .pad().

  • path_indices (list | np.array | None, optional) – If not None, will return the path for the indices specified in path_indices. If None, will return the path for all indices in .df (or all ids in .id_column if pad_by=”id”), by default None.

Returns:

Dictionary where:
  • ”x_data” is a dictionary where:

    • ”path” is a tensor of the path to be passed into the nlpsig_networks.SWNUNetwork network

    • ”features” is a tensor of the features (e.g. time features or additional features)

  • ”input_channels” is the number of channels in the path

  • ”num_features” is the number of features (e.g. time features or additional features); this is None if there are no additional features to concatenate

Return type:

dict[str, dict[str, torch.tensor] | int | None]
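
A brief sketch, assuming a path has already been created with .pad(); the dictionary keys follow the Returns description above:

    swnu_input = prepare.get_torch_path_for_SWNUNetwork(
        include_features_in_path=True,
        include_features_in_input=False,
        include_embedding_in_input=True,
    )

    path_tensor = swnu_input["x_data"]["path"]     # path tensor for SWNUNetwork
    input_channels = swnu_input["input_channels"]  # number of channels in the path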

get_torch_path_for_SeqSigNet(shift: int, window_size: int, n: int, include_features_in_path: bool, include_features_in_input: bool, include_embedding_in_input: bool, reduced_embeddings: bool = False, path_indices: list | np.array | None = None) dict[str, dict[str, torch.tensor] | int | None]#

Returns a dictionary containing the torch.tensor objects that can be passed into the nlpsig_networks.SeqSigNet model.

Parameters:
  • shift (int) – Amount we are shifting the window.

  • window_size (int) – Size of the window we use over the texts.

  • n (int) – Number of units we wish to use in SeqSigNet.

  • include_features_in_path (bool) – Whether or not to keep the additional features (e.g. time features) within the path.

  • include_features_in_input (bool) – Whether or not to concatenate the additional features into the feed-forward neural network in the nlpsig_networks.SeqSigNet model.

  • include_embedding_in_input (bool) – Whether or not to concatenate the embeddings into the feed-forward neural network in the nlpsig_networks.SeqSigNet model. If we created a path for each item in the dataset, we will concatenate the embeddings in .embeddings (if reduced_embeddings=False) or the embeddings in .reduced_embeddings (if reduced_embeddings=True). If we created a path for each id in .id_column, then we concatenate the embeddings in .pooled_embeddings.

  • reduced_embeddings (bool, optional) – Whether or not to concatenate the dimension reduced embeddings, by default False. This is ignored if we created a path for each id in .id_column, i.e. if pad_by=”id” was used in .pad().

  • path_indices (list | np.array | None, optional) – If not None, will return the path for the indices specified in path_indices. If None, will return the path for all indices in .df (or all ids in .id_column if pad_by=”id”), by default None.

Returns:

Dictionary where:
  • ”x_data” is a dictionary where:

    • ”path” is a tensor of the path to be passed into the nlpsig_networks.SeqSigNet network

    • ”features” is a tensor of the features (e.g. time features or additional features)

  • ”input_channels” is the number of channels in the path

  • ”num_features” is the number of features (e.g. time features or additional features); this is None if there are no additional features to concatenate

Return type:

dict[str, dict[str, torch.tensor] | int | None]

Raises:

ValueError – If a path hasn’t been created yet using .pad().
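
A brief sketch; the values of shift, window_size and n must be consistent with the history length of the padded path (see .check_history_length_for_SeqSigNet() above):

    # shift=3, window_size=5, n=3 requires history length 3 * 3 + (5 - 3) = 11,
    # so the path should have been created with .pad(..., k=11)
    seqsignet_input = prepare.get_torch_path_for_SeqSigNet(
        shift=3,
        window_size=5,
        n=3,
        include_features_in_path=True,
        include_features_in_input=False,
        include_embedding_in_input=True,
    )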

pad(pad_by: str, method: str = 'k_last', zero_padding: bool = True, k: int = 5, features: list[str] | str | None = None, standardise_method: list[str] | str | None = None, embeddings: str = 'full', include_current_embedding: bool = True, pad_from_below: bool = True) np.array#

Creates an array which stores the path. We create a path for each id in id_column if pad_by=”id” (by constructing a path of the embeddings of the texts associated to each id), or for each item in .df if pad_by=”history” (by constructing a path of the embeddings of the previous texts).

We can decide how long our path is by letting method=”k_last” and specifying k. Alternatively, we can set method=”max”, which sets the length of the path by setting k to be the largest number of texts associated to an individual id.

The function “pads” if there aren’t enough texts to fill the path (e.g. if requesting the last 5 posts for an id, but fewer than 5 posts are available), by adding empty records (if zero_padding=True) or by repeating the last previous text (if zero_padding=False). This ensures that each path has the same number of data points.

Parameters:
  • pad_by (str) –

    How to construct the path. Options are:

    • ”id”: constructs a path of the embeddings of the texts associated to each id

    • ”history”: constructs a path by looking at the embeddings of the previous texts for each text

  • method (str, optional) –

    How the length of the path is determined, by default ”k_last”. Options are:

    • ”k_last”: specifying the length of the path through the choice of k (see below)

    • ”max”: the length of the path is chosen by looking at the largest number of texts associated to an individual id in .id_column

  • zero_padding (bool, optional) – If True, pads with zeros, by default True. Otherwise, pads with the latest text associated to the id.

  • k (int, optional) – The requested length of the path, default 5. This is ignored if method=”max”.

  • features (list[str] | str | None, optional) – Which feature(s) to keep, by default None. If None, no features are kept.

  • standardise_method (list[str] | str | None, optional) –

    If not None, applies standardisation to the features, default None. If a list is passed, must be the same length as features. Options:

    • ”z_score”: transforms by subtracting the mean and dividing by standard deviation

    • ”sum_divide”: transforms by dividing by the sum

    • ”minmax”: transforms by computing (x - min(x)) / (max(x) - min(x)), where x is the vector to standardise

  • embeddings (str, optional) –

    Which embeddings to keep, by default “full”. Options:

    • ”dim_reduced”: dimension reduced embeddings

    • ”full”: full embeddings

    • ”both”: keeps both dimension reduced and full embeddings

  • include_current_embedding (bool, optional) – If pad_by=”history”, this determines whether or not the embedding for the text is included in its own history, by default True. If pad_by=”id”, this argument is ignored.

  • pad_from_below (bool, optional) – If True, will pad the path from below, otherwise pads the path from above, by default True.

Returns:

3-dimensional array of the path:
  • First dimension is the ids (if pad_by=”id”) or each text (if pad_by=”history”)

  • Second dimension is the associated texts

  • Third dimension is the features (e.g. embeddings / dimension reduced embeddings, time features)

Return type:

np.array
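
A brief sketch, padding each text's history to the last 5 texts and keeping a standardised time feature alongside the full embeddings (“timeline_index” is the time feature used as the default in .get_time_feature() above):

    path = prepare.pad(
        pad_by="history",
        method="k_last",
        k=5,
        zero_padding=True,
        features="timeline_index",
        standardise_method="z_score",
        embeddings="full",
    )

    # shape: (number of texts, 5, embedding_dim + number of features)
    print(path.shape)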