Obtaining embeddings from transformers#

nlpsig interfaces with transformers and sentence-transformers via nlpsig.TextEncoder and nlpsig.SentenceEncoder respectively.

class nlpsig.encode_text.SentenceEncoder(df: pd.DataFrame, feature_name: str, model_name: str = 'all-MiniLM-L6-v2', model_modules: Iterable[nn.Module] | None = None, model_encoder_args: dict | None = None, model_fit_args: dict | None = None, verbose: bool = True)#

Bases: object

Class to obtain sentence embeddings using SentenceTransformer class in sentence_transformers.

Parameters:
  • df (pd.DataFrame) – Dataset as a pandas dataframe

  • feature_name (str) – Column name which contains the text.

  • model_name (str, optional) –

    Name of model to obtain sentence embeddings, by default “all-MiniLM-L6-v2”.

    If loading a pretrained model using .load_pretrained_model() method, passes this to the model_name_or_path argument when initialising SentenceTransformer object.

    A few alternative options are:
    • ”all-mpnet-base-v2”

    • ”all-distilroberta-v1”

    • ”all-MiniLM-L12-v2”

    See more pre-trained SentenceTransformer models at https://www.sbert.net/docs/pretrained_models.html.

  • model_modules (Iterable[nn.Module] | None, optional) –

    This parameter can be used to create custom SentenceTransformer models from scratch.

    See https://www.sbert.net/docs/training/overview.html#creating-networks-from-scratch for examples.

    If creating a custom model using .load_custom_model() method, passes this into the modules argument when initialising SentenceTransformer object.

  • model_encoder_args (dict | None, optional) – Any keywords to be passed into the model for encoding sentences, by default the following arguments to pass into the .encode() method of SentenceTransformer class: {"batch_size": 64, "show_progress_bar": True, "output_value": "sentence_embedding", "convert_to_numpy": True, "convert_to_tensor": False, "device": None, "normalize_embeddings": False}

  • model_fit_args (dict | None, optional) – Any keywords to be passed into the model to fine-tune sentence transformer, by default None.

Raises:

KeyError – if feature_name is not a column in df.
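
For illustration, a minimal construction might look like the following (the dataframe and its "text" column are hypothetical):

    import pandas as pd
    import nlpsig

    # hypothetical dataframe with a "text" column containing the sentences
    df = pd.DataFrame({"text": ["An example sentence.", "Another example sentence."]})

    sentence_encoder = nlpsig.SentenceEncoder(
        df=df,
        feature_name="text",
        model_name="all-MiniLM-L6-v2",
    )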

fit_transformer(train_objectives: Iterable[tuple[DataLoader, nn.Module]]) None#

Trains / fine-tunes the SentenceTransformer model via its .fit method.

Also passes .model_fit_args into the .fit method.

Parameters:

train_objectives (Iterable[Tuple[DataLoader, nn.Module]]) – Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning. See https://www.sbert.net/docs/training/overview.html for more details.

Raises:

NotImplementedError – if the .model attribute is None, in which case the model needs to be loaded first using either the .load_pretrained_model() or .load_custom_model() method.
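
A sketch of fine-tuning with a single training objective, using the InputExample and CosineSimilarityLoss classes from sentence_transformers (the training pairs and hyperparameters are purely illustrative):

    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, losses

    sentence_encoder.load_pretrained_model()

    # illustrative (sentence pair, similarity score) training examples
    train_examples = [
        InputExample(texts=["A sentence.", "A similar sentence."], label=0.9),
        InputExample(texts=["A sentence.", "An unrelated sentence."], label=0.1),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(sentence_encoder.model)

    # fine-tunes via SentenceTransformer.fit (.model_fit_args is also passed in)
    sentence_encoder.fit_transformer(train_objectives=[(train_dataloader, train_loss)])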

load_custom_model(force_reload: bool = False) None#

Loads a custom model into .model by passing .model_modules to the modules argument when initialising the SentenceTransformer object.

Parameters:

force_reload (bool, optional) – Whether or not to overwrite current loaded model, by default False.
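
A sketch of creating a custom model from modules, following the sentence-transformers "networks from scratch" pattern (the base model and sequence length are illustrative):

    from sentence_transformers import models

    word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

    sentence_encoder = nlpsig.SentenceEncoder(
        df=df,
        feature_name="text",
        model_modules=[word_embedding_model, pooling_model],
    )
    sentence_encoder.load_custom_model()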

load_pre_computed_embeddings(pre_computed_embeddings_file: str) None#

Loads in pre-computed sentence embeddings.

Parameters:

pre_computed_embeddings_file (str) – Path to the pre-computed embeddings file.

Raises:

ValueError – if the loaded embeddings are not an (n x d) array, where n is the number of sentences (in .df) and d is the dimension of the embeddings.
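
A sketch of loading previously saved embeddings (the file name is hypothetical; the loaded object must be an (n x d) array):

    # hypothetical path to embeddings saved earlier
    sentence_encoder.load_pre_computed_embeddings("precomputed_embeddings.pkl")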

load_pretrained_model(force_reload: bool = False) None#

Loads pre-trained model into .model by passing in .model_name to the model_name_or_path argument when initialising SentenceTransformer object.

.model_name can also be a path to a trained model.

Parameters:

force_reload (bool, optional) – Whether or not to overwrite current loaded model, by default False.

Raises:

NotImplementedError – if .model_name cannot be loaded by SentenceTransformer. This might happen if it is not an available pre-trained model. See https://www.sbert.net/docs/pretrained_models.html for examples.

obtain_embeddings() array#

Obtains sentence embeddings via the .encode method, and saves them in the .embeddings_sentence attribute.

Also passes .model_encoder_args into the .encode method.

Raises:

NotImplementedError – if the .model attribute is None, in which case the model needs to be loaded first using either the .load_pretrained_model() or .load_custom_model() method.
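
Putting the pieces together, a minimal sketch (the embedding dimension noted below is that of "all-MiniLM-L6-v2"):

    sentence_encoder.load_pretrained_model()
    embeddings = sentence_encoder.obtain_embeddings()

    # an (n, 384) numpy array for "all-MiniLM-L6-v2",
    # also stored in sentence_encoder.embeddings_sentence
    print(embeddings.shape)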

class nlpsig.encode_text.TextEncoder(feature_name: str, df: pd.DataFrame | None = None, dataset: Dataset | None = None, model_name: str | None = None, model: PreTrainedModel | None = None, config: PretrainedConfig | None = None, tokenizer: PreTrainedTokenizer | None = None, data_collator: DataCollator | None = None, verbose: bool = True)#

Bases: object

Class to obtain token embeddings (and optionally pool them) using Huggingface transformers.

Parameters:
  • feature_name (str) – Column name which contains the text.

  • df (pd.DataFrame | None, optional) – Dataset as a pandas dataframe, by default None. If df is not provided, dataset must be provided. A dataframe will then be created from it.

  • dataset (Dataset | None, optional) – Huggingface Dataset object for the full dataset, by default None. If df is a dataframe, a Dataset will be created from it, even if dataset is provided.

  • model_name (str | None, optional) – Name of transformer encoder model from the Huggingface Hub, by default None. To be used if you want to load in a pretrained model.

  • model (PreTrainedModel | None, optional) – Huggingface transformer model class, by default None.

  • config (PretrainedConfig | None, optional) – Huggingface configuration class, by default None.

  • tokenizer (PreTrainedTokenizer | None, optional) – Huggingface tokenizer class, by default None.

  • data_collator (DataCollator | None, optional) – Data collator to use, by default None. Should work with the tokenizer that is passed in.
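
For illustration, a minimal construction from a dataframe and a model name on the Huggingface Hub (both hypothetical):

    import pandas as pd
    import nlpsig

    df = pd.DataFrame({"text": ["An example sentence.", "Another example sentence."]})

    text_encoder = nlpsig.TextEncoder(
        feature_name="text",
        df=df,
        model_name="bert-base-uncased",
    )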

fit_transformer_with_trainer_api(output_dir: str | None = None, data_collator: DataCollator | None = None, compute_metrics: Callable[[EvalPrediction], dict] | None = None, training_args: dict | None = None, trainer_args: dict | None = None)#

Train / fine-tune transformer model to some task.

If the dataset hasn’t been split up, or the training arguments or trainer haven’t been set up, you can pass in arguments to do that here (a usage sketch follows the parameter list below). Otherwise, the split dataset, training arguments and trainer saved in .dataset_split, .training_args and .trainer, respectively, are used.

Parameters:
  • output_dir (str | None, optional) – The output directory where the model predictions and checkpoints will be written, by default None.

  • data_collator (DataCollator | None, optional) – The function to use to form a batch from a list of elements of train_dataset or eval_dataset, to pass into Trainer(), by default None.

  • compute_metrics (Callable[[EvalPrediction], dict] | None, optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction object and return a dictionary of metric names to values, by default None.

  • training_args (dict | None, optional) – Passed along to TrainingArguments() class, by default None.

  • trainer_args (dict | None, optional) – Passed along to Trainer() class, by default None.
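
A sketch of a fine-tuning call where the setup is done inside the method; it assumes the model (with an appropriate task head) has been loaded and the text tokenized, and the keyword values shown are illustrative:

    text_encoder.fit_transformer_with_trainer_api(
        output_dir="./model_output",
        training_args={"num_train_epochs": 3, "per_device_train_batch_size": 16},
    )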

initialise_transformer(force_reload: bool = False, **config_args) None#

Loads in config and tokenizer. Initialises the transformer with random weights from transformers.

Parameters:
  • force_reload (bool, optional) – Whether or not to overwrite current loaded model, by default False.

  • **config_args – Passed along to AutoConfig.from_pretrained() method.
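
A sketch, where the keyword arguments are forwarded to AutoConfig.from_pretrained() as configuration overrides (the override shown is illustrative):

    # initialise a randomly-weighted transformer, overriding a config attribute
    text_encoder.initialise_transformer(num_hidden_layers=6)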

load_pretrained_model(force_reload: bool = False) None#

Loads in config, tokenizer and pretrained weights from transformers, using AutoConfig, AutoTokenizer, AutoModel.

If another model is required, e.g. a model for masked language modelling, then it is recommended to load the model using the appropriate class, e.g. AutoModelForMaskedLM(), and reset the .model attribute to this object. This is required if you wish to train / pre-train the model on the data later.

Parameters:

force_reload (bool, optional) – Whether or not to overwrite current loaded model, by default False.
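
A sketch of swapping in a task-specific model as described above, here for masked language modelling (the model name is illustrative):

    from transformers import AutoModelForMaskedLM

    text_encoder.load_pretrained_model()
    # replace the plain AutoModel with a task-specific model for later training
    text_encoder.model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")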

obtain_embeddings(method: str = 'hidden_layer', batch_size: int = 100, layers: int | list[int] | tuple[int] | None = None) np.array | list[np.array]#

Once the text has been tokenized (using .tokenize_text), token embeddings can be obtained for each token in .tokenized_df["tokens"].

The method passes the tokens (in .tokens) through the transformer model and obtains token embeddings by combining the hidden layers in some way. See the method argument below for options.

Parameters:
  • method (str, optional) –

    Method for combining the layer hidden states, by default “hidden_layer”. Options are:

    • ”hidden_layer”:
      • if layers is just an integer, the token embeddings will be taken from the hidden state in layer number layers. By default (if layers is not specified), the token embeddings are taken from the second-to-last layer hidden state.

      • if layers is a list of integers, will return the layer hidden states from the specified layers.

    • ”concatenate”:
      • if layers is a list of integers, will return the concatenation of the layer hidden states from the specified layers. By default (if layers is not specified), the token embeddings are computed by concatenating the layer hidden states from the last 4 layers (or all the hidden states if the number of hidden states is less than 4).

      • if layers is just an integer, the token embeddings will be taken from the hidden state in layer number layers (as the concatenation of one layer hidden state is just that layer).

    • ”sum”:
      • if layers is a list of integers, will return the sum of the layer hidden states from the specified layers. By default (if layers is not specified), the token embeddings are computed by summing the layer hidden states from the last 4 layers (or all the hidden states if the number of hidden states is less than 4).

      • if layers is just an integer, the token embeddings will be taken from the hidden state in layer number layers (as the sum of one layer hidden state is just that layer).

    • ”mean”:
      • if layers is a list of integers, will return the mean of the layer hidden states from the specified layers. By default (if layers is not specified), the token embeddings are computed by averaging the layer hidden states from the last 4 layers (or all the hidden states if the number of hidden states is less than 4).

      • if layers is just an integer, the token embeddings will be taken from the hidden state in layer number layers (as the mean of one layer hidden state is just that layer).

  • batch_size (int, optional) – The size of the batches, by default 100.

  • layers (int | list[int] | tuple[int] | None, optional) – The layers to use when combining the hidden states of the transformer, by default None.

Returns:

Unless method=hidden_layer and layers is a list of integers, the method returns a 2 dimensional array with dimensions [token, embedding], i.e. the number of rows is the number of tokens in .tokenized_df, and the number of columns is the dimension of the embeddings.

If method=hidden_layer and layers is a list of integers, the method returns a list of 3 dimensional arrays with dimensions [layer, token, embedding]. Each item in the list is the output of the hidden layers requested for each sentence, i.e. for each item, the first dimension denotes the layers that were requested, the second dimension is the tokens (as found in .tokenized_df) and the third dimension is the embeddings. This option is added so that the user can combine the hidden layers in some custom way.

Return type:

Union[np.array, List[np.array]]

Raises:
  • ValueError – if .tokens is None. Means the text has not been tokenized yet and .tokenize_text() needs to be called first.

  • ValueError – if layers is not an integer or a list of integers.

  • ValueError – if any of the requested layers are out of range, i.e. if any are less than zero, or are larger than the number of hidden layers in the transformer architecture.

  • NotImplementedError – if requested method is not one of “hidden_layer”, “concatenate”, “sum” or “mean”.
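
A usage sketch, assuming .tokenize_text() has already been called (the layer numbers are illustrative and depend on the depth of the model):

    # default: hidden state of the second-to-last layer, shape [n_tokens, dim]
    token_embeddings = text_encoder.obtain_embeddings()

    # concatenate the hidden states of several layers
    token_embeddings_concat = text_encoder.obtain_embeddings(
        method="concatenate", layers=[9, 10, 11, 12]
    )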

pool_token_embeddings(method: str = 'mean') array#

Once token embeddings have been computed (using .obtain_embeddings), .token_embeddings is a 2 dimensional array with dimensions [token, embedding]. These token embeddings can be pooled together to obtain an embedding for each whole sentence. See the method argument below for options.

Note that if .token_embeddings is a 3 dimensional array with dimensions [layer, token, embedding], the hidden layers must be pooled first to obtain a 2 dimensional array with dimensions [token, embedding].

Note that SentenceEncoder, which uses SBERT via the sentence-transformers package, might be more appropriate for obtaining sentence embeddings.

Parameters:

method (str, optional) –

  Method for combining the token embeddings, by default "mean". Options are:

  • "mean": takes the mean of the token embeddings

  • "max": takes the maximum in each dimension of the token embeddings

  • "sum": takes the element-wise sum of the token embeddings

  • "cls": takes the 'cls' embedding (only possible if skip_special_tokens=False was set when tokenizing the text)

Returns:

A 2 dimensional array with dimensions [sentences, embedding], i.e. the number of rows is the number of sentences/texts in .df, and the number of columns is the dimension of the embeddings.

Return type:

np.array

Raises:
  • ValueError – if .token_embeddings is None. Means the token embeddings have not been computed yet and .obtain_embeddings() needs to be called first.

  • ValueError – if .token_embeddings is not a 2 dimensional array. It may instead be a 3 dimensional array with dimensions [layer, token, embedding], in which case the hidden layers must be pooled first to obtain a 2 dimensional array with dimensions [token, embedding].

  • ValueError – if method="cls" but skip_special_tokens=True was set when tokenizing the text. This means that when the token embeddings were obtained, the embedding corresponding to the 'cls' token was not saved.

  • NotImplementedError – if requested method is not one of “mean”, “max”, “sum” or “cls”.
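
A usage sketch, continuing from the token embeddings obtained above:

    # one embedding per sentence/text in .df, shape [n_sentences, dim]
    sentence_embeddings = text_encoder.pool_token_embeddings(method="mean")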

set_up_trainer(data_collator: DataCollator | None = None, compute_metrics: Callable[[EvalPrediction], dict] | None = None, custom_loss: Callable[[float, float], float] | None = None, **kwargs) Trainer#

Set up Trainer object and save to .trainer attribute.

Parameters:
  • data_collator (DataCollator | None, optional) – The function to use to form a batch from a list of elements of train_dataset or eval_dataset, to pass into Trainer(), by default None.

  • compute_metrics (Callable[[EvalPrediction], dict] | None, optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction object and return a dictionary of metric names to values, by default None.

  • custom_loss (Callable[[float, float], float] | None, optional) – A function that computes a custom loss. If passed, will create a subclass of Trainer and override its compute_loss method, by default None.

  • **kwargs – Passed along to Trainer() class, by default None.

Returns:

Trainer object.

Return type:

Trainer

set_up_training_args(output_dir: str, **kwargs) TrainingArguments#

Set up TrainingArguments object and save to .training_args attribute.

Parameters:
  • output_dir (str) – The output directory where the model predictions and checkpoints will be written.

  • **kwargs – Passed along to TrainingArguments() class.

Returns:

TrainingArguments object.

Return type:

TrainingArguments
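
A sketch of setting up the training arguments and trainer explicitly, assuming the text has been tokenized and the dataset split beforehand (the metric function and keyword values are illustrative):

    import numpy as np

    def compute_accuracy(eval_pred):
        # EvalPrediction -> dictionary of metric names to values
        predictions = np.argmax(eval_pred.predictions, axis=-1)
        return {"accuracy": float((predictions == eval_pred.label_ids).mean())}

    text_encoder.set_up_training_args(output_dir="./model_output", num_train_epochs=3)
    text_encoder.set_up_trainer(compute_metrics=compute_accuracy)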

split_dataset(train_size: float = 0.8, valid_size: float | None = 0.33, indices: tuple[Iterable[int], Iterable[int] | None, Iterable[int]] | None = None, shuffle: bool = False, random_state: int = 42) DatasetDict#

Split up dataset into train, validation, test sets for training / fine-tuning.

Parameters:
  • train_size (float, optional) – How to split the initial dataset into train, test/validation, by default 0.8. Ignored if indices are passed.

  • valid_size (float | None, optional) – Proportion of training data to use as validation data, by default 0.33. If None, will not create a validation set. Ignored if indices are passed.

  • indices (tuple[Iterable[int], Iterable[int] | None, Iterable[int]] | None, optional) – Train, validation, test indices to use. If passed, will split the data according to these indices rather than splitting it within the method using the train_size and valid_size provided. First item in the tuple should be the indices for the training set, second item should be the indices for the validation set (this could be None if no validation set is required), and third item should be indices for the test set.

  • shuffle (bool, optional) – Whether or not to shuffle the dataset, by default False.

  • random_state (int, optional) – Seed number, by default 42.

Returns:

A dictionary of Datasets with training (train), validation (valid) (if valid_size is not None), and test (test) Datasets.

Return type:

DatasetDict
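
A usage sketch with the default proportions, and an alternative passing explicit indices (the index ranges are illustrative):

    # 80/20 train/test split, with a third of the training data held out for validation
    dataset_split = text_encoder.split_dataset(train_size=0.8, valid_size=0.33, shuffle=True)

    # alternatively, pass explicit (train, validation, test) indices
    dataset_split = text_encoder.split_dataset(
        indices=(range(0, 600), range(600, 800), range(800, 1000))
    )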

tokenize_text(text_id_col_name: str = 'text_id', skip_special_tokens: bool = True, batched: bool = True, batch_size: int = 1000, **tokenizer_args) Dataset#

Method to tokenize each item in the feature_name column of the dataframe.

Will tokenize the text (the tokens are then saved in the .tokens attribute). The method will also create a new dataframe (and save it in the .tokenized_df attribute) where each item in the tokens column is a token and the text_id_col_name column denotes the text-id to which it belongs (where the text-id is just the index of the original dataframe stored in .df).

Parameters:
  • text_id_col_name (str, optional) – Column name to be used in .tokenized_df to denote the text-id to which the token belongs, by default "text_id".

  • skip_special_tokens (bool, optional) – Whether or not to skip special tokens added by the transformer tokenizer, by default True.

  • batched (bool, optional) – Whether or not to tokenize the text in batches, by default True.

  • batch_size (int, optional) – The size of the batches (if used), by default 1000.

  • **tokenizer_args –

    Passed along to the .tokenizer() method. By default, the following arguments are passed:

    • padding = False (as dynamic padding is used later)

    • truncation = True

    • return_special_tokens_mask = True (this is always used and overrides the user option if passed)

Returns:

The tokenized text as BatchEncoding type.

Return type:

BatchEncoding

Raises:

ValueError – if text_id_col_name is already a column name in the .df dataframe. In this case, a different string will need to be passed in.
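
A usage sketch (the keyword override is illustrative and is passed through to the tokenizer):

    # tokenize the feature_name column; tokens are stored in .tokens and a
    # per-token dataframe in .tokenized_df, with "text_id" linking each token
    # back to its row in .df
    tokens = text_encoder.tokenize_text(
        text_id_col_name="text_id",
        skip_special_tokens=True,
        max_length=512,  # illustrative **tokenizer_args override
    )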