Dimensionality reduction

Embeddings obtained from transformer models are typically high-dimensional, so nlpsig provides an interface to a number of dimensionality reduction techniques. Dimensionality reduction is performed through the nlpsig.DimReduce class, and embeddings can be visualised with the nlpsig.PlotEmbedding class.

class nlpsig.dimensionality_reduction.DimReduce(method: str = 'gaussian_random_projection', n_components: int = 5, dim_reduction_kwargs: dict | None = None)

Bases: object

Class to perform dimension reduction on word or sentence embeddings.

Parameters:
  • method (str, optional) –

    Which dimensionality reduction technique to use, by default “gaussian_random_projection”. Options are:

    • “umap” (UMAP): implemented using the umap-learn package.

    • “pca” (PCA): implemented using scikit-learn.

    • “tsne” (TSNE): implemented using scikit-learn.

    • “gaussian_random_projection” (Gaussian random projection): implemented using scikit-learn.

    • “sparse_random_projection” (sparse random projection): implemented using scikit-learn.

    • “ppapca” (Post Processing Algorithm (PPA) with PCA): see Mu, J., Bhat, S., and Viswanath, P. (2017). All-but-the-top: Simple and effective postprocessing for word representations. arXiv:1702.01417.

    • “ppapacppa” (PPA-PCA-PPA): see Raunak, V., Gupta, V., and Metze, F. (2019). Effective dimensionality reduction for word embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 235-243.

  • n_components (int, optional) – Number of components to keep, by default 5.

  • dim_reduction_kwargs (dict | None) – Any keywords to be passed into the functions which perform the dimensionality reduction, by default None
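
To give a feel for what the default method does, here is a minimal NumPy sketch of a Gaussian random projection. It is an illustration only, not nlpsig's code path (which, per the list above, delegates to scikit-learn); the function name gaussian_random_projection and the synthetic 768-dimensional input are assumptions made for the example.

```python
import numpy as np


def gaussian_random_projection(embeddings, n_components=5, seed=42):
    """Project the rows of `embeddings` onto `n_components` random directions."""
    rng = np.random.default_rng(seed)
    n_features = embeddings.shape[1]
    # Entries are drawn from N(0, 1 / n_components), the scaling that
    # scikit-learn's GaussianRandomProjection documents for its random matrix.
    projection = rng.normal(
        loc=0.0,
        scale=1.0 / np.sqrt(n_components),
        size=(n_features, n_components),
    )
    return embeddings @ projection


# Synthetic stand-in for transformer embeddings: 100 vectors of dimension 768.
embeddings = np.random.default_rng(0).normal(size=(100, 768))
reduced = gaussian_random_projection(embeddings)
print(reduced.shape)  # (100, 5)
```

The projection matrix does not depend on the data, which is why this method is cheap compared with PCA or UMAP; the price is that the retained directions are not chosen to preserve variance.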

fit_transform(embeddings: array, random_state: int = 42) → array

Fit the chosen dimensionality reduction method to the embeddings and return the transformed output.

Parameters:
  • embeddings (np.array) – Word or sentence embeddings which we wish to reduce the dimensions of

  • random_state (int, optional) – Seed number, by default 42

Returns:

Dimension reduced embeddings in transformed space.

Return type:

np.array
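
A typical call might look like the sketch below. It uses only the constructor and fit_transform arguments documented above; the helper name reduce_embeddings is hypothetical, and nlpsig is imported inside the function so that the snippet can be loaded without the package installed.

```python
import numpy as np


def reduce_embeddings(embeddings: np.ndarray, n_components: int = 5) -> np.ndarray:
    """Hypothetical helper: reduce `embeddings` to `n_components` columns with PCA."""
    # Lazy import (an assumption of this sketch): nlpsig is only needed
    # when the helper is actually called.
    import nlpsig

    reducer = nlpsig.DimReduce(method="pca", n_components=n_components)
    # random_state is the seed documented for fit_transform (default 42).
    return reducer.fit_transform(embeddings, random_state=42)
```

Calling reduce_embeddings on an array of shape (n_samples, embedding_dim) would be expected to return an array with n_components columns, assuming nlpsig is installed.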

ppa_pca(embeddings: array, n_components: int = 5, pca_dim: int = 50, dim: int = 3, extra_ppa: bool = False) → array

Post Processing Algorithm with PCA (with option to apply PPA again)

Parameters:
  • embeddings (np.array) – Word or sentence embeddings which we wish to reduce the dimensions of

  • n_components (int, optional) – Number of components to keep, by default 5

  • pca_dim (int, optional) – Number of components for PCA algorithm (must be greater than n_components), by default 50

  • dim (int, optional) – Threshold parameter D in Post Processing Algorithm (must be smaller than n_components), by default 3

  • extra_ppa (bool, optional) – Whether or not to apply PPA again, by default False

Returns:

Dimension reduced embeddings in transformed space.

Return type:

np.array

Raises:

ValueError – if n_components is less than dim, or if n_components is greater than pca_dim
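
The idea behind PPA followed by PCA can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not nlpsig's implementation: the helper names ppa, pca and ppa_pca_sketch are hypothetical, the principal directions are taken from a full SVD rather than from a PCA truncated at pca_dim, and the validation only mirrors the spirit of the constraints listed above.

```python
import numpy as np


def ppa(x, d):
    """Centre the rows of `x` and remove its top-`d` principal directions."""
    centred = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:d]
    # Subtract each row's projection onto the dominant directions.
    return centred - centred @ top.T @ top


def pca(x, k):
    """Project the centred rows of `x` onto its first `k` principal directions."""
    centred = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T


def ppa_pca_sketch(x, n_components=5, d=3, extra_ppa=False):
    """PPA, then PCA, then optionally PPA again on the reduced vectors."""
    if not d < n_components <= x.shape[1]:
        raise ValueError("need d < n_components <= number of input features")
    reduced = pca(ppa(x, d), n_components)
    if extra_ppa:
        reduced = ppa(reduced, d)
    return reduced


# Synthetic stand-in for embeddings: 200 vectors of dimension 64.
embeddings = np.random.default_rng(0).normal(size=(200, 64))
reduced = ppa_pca_sketch(embeddings, n_components=5, d=3)
print(reduced.shape)  # (200, 5)
```

Here d plays the role of the threshold parameter D: the dominant directions are discarded before PCA because, as argued in the papers cited above, they tend to be shared by most embeddings.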

class nlpsig.plot_embedding.PlotEmbedding(x_data: array, y_data: array)

Bases: object

Class to visualise word or sentence embeddings.

Parameters:
  • x_data (np.array) – features

  • y_data (np.array) – y labels

embedding_dim_reduce(method: str = 'pca', dim: int = 3, overwrite: bool = False, random_state: int = 42) → None

Performs dimension reduction on the data and adds the reduced embeddings to .embed.

Parameters:
  • method (str, optional) –

    Which dimensionality reduction technique to use, by default “pca”. Options:

    • “pca” (PCA): implemented using scikit-learn

    • “umap” (UMAP): implemented using the umap-learn package

    • “tsne” (TSNE): implemented using scikit-learn

  • dim (int, optional) – Number of components to keep, by default 3.

  • overwrite (bool, optional) – Whether or not to overwrite the currently stored embedding, by default False.

  • random_state (int, optional) – Seed number, by default 42.

plt_2d(embed_args: dict | None = None, line_args: dict | None = None) → None

Plots the embedding in 2d space after first performing dimension reduction.

Parameters:
  • embed_args (dict | None, optional) – Any keywords to be passed into the functions which perform the dimensionality reduction, by default {“method”: “pca”, “dim”: 2}.

  • line_args (dict | None, optional) – Any keywords to be passed into the function that plots the embeddings (arguments for matplotlib.pyplot.scatter()), by default {“marker”: “o”, “alpha”: 0.3}.
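
Putting the plotting pieces together, a call might look like the sketch below. It passes the documented defaults explicitly so that they are easy to change; the helper name plot_embeddings_2d and the module-level constants are hypothetical, and nlpsig is imported lazily so that the snippet can be loaded without the package installed.

```python
# Documented defaults for plt_2d, spelled out so they can be tweaked.
EMBED_ARGS = {"method": "pca", "dim": 2}
LINE_ARGS = {"marker": "o", "alpha": 0.3}


def plot_embeddings_2d(x_data, y_data):
    """Hypothetical helper: scatter-plot embeddings in 2D using the given labels."""
    # Lazy import (an assumption of this sketch): nlpsig is only needed
    # when the helper is actually called.
    import nlpsig

    plotter = nlpsig.PlotEmbedding(x_data=x_data, y_data=y_data)
    # embed_args controls the dimensionality reduction step;
    # line_args is forwarded to matplotlib.pyplot.scatter().
    plotter.plt_2d(embed_args=EMBED_ARGS, line_args=LINE_ARGS)
```

Swapping "pca" for "umap" or "tsne" in EMBED_ARGS would select one of the other documented methods.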