Dimensionality reduction
Since the embeddings obtained from transformer models are typically high-dimensional, nlpsig
provides an interface to a number of dimensionality reduction techniques.
Dimensionality reduction is performed through the nlpsig.DimReduce class.
Embeddings can also be visualised via the nlpsig.PlotEmbedding class.
- class nlpsig.dimensionality_reduction.DimReduce(method: str = 'gaussian_random_projection', n_components: int = 5, dim_reduction_kwargs: dict | None = None)
Bases: object
Class to perform dimension reduction on word or sentence embeddings.
- Parameters:
method (str, optional) –
Which dimensionality reduction technique to use, by default “gaussian_random_projection”. Options are:
“umap” (UMAP): implemented using the umap-learn package.
“pca” (PCA): implemented using scikit-learn.
“tsne” (TSNE): implemented using scikit-learn.
“gaussian_random_projection” (Gaussian random projection): implemented using scikit-learn.
“sparse_random_projection” (sparse random projection): implemented using scikit-learn.
“ppapca” (Post Processing Algorithm (PPA) with PCA): see Mu, J., Bhat, S., and Viswanath, P. (2017). All-but-the-top: Simple and effective postprocessing for word representations. arXiv:1702.01417.
“ppapacppa” (PPA-PCA-PPA): see Raunak, V., Gupta, V., and Metze, F. (2019). Effective dimensionality reduction for word embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 235-243.
n_components (int, optional) – Number of components to keep, by default 5
dim_reduction_kwargs (dict | None) – Keyword arguments passed to the underlying function that performs the dimensionality reduction, by default None
- fit_transform(embeddings: array, random_state: int = 42) -> array
Fit the chosen dimensionality reduction method to the embeddings and return the transformed output.
- Parameters:
embeddings (np.array) – Word or sentence embeddings whose dimension is to be reduced
random_state (int, optional) – Random seed, by default 42
- Returns:
Dimension reduced embeddings in transformed space.
- Return type:
np.array
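To illustrate what the default “gaussian_random_projection” method does conceptually, here is a minimal NumPy sketch. It is not nlpsig's or scikit-learn's implementation: the function name `gaussian_random_projection` and the synthetic 768-dimensional input are assumptions made for illustration. The projection matrix entries are drawn from a zero-mean Gaussian with variance 1 / n_components, so that squared norms are preserved in expectation.

```python
import numpy as np


def gaussian_random_projection(embeddings, n_components=5, random_state=42):
    """Project embeddings onto n_components random Gaussian directions (illustrative only)."""
    rng = np.random.default_rng(random_state)
    n_features = embeddings.shape[1]
    # Entries ~ N(0, 1 / n_components): squared norms are preserved in expectation.
    projection = rng.normal(
        loc=0.0,
        scale=1.0 / np.sqrt(n_components),
        size=(n_features, n_components),
    )
    return embeddings @ projection


# Synthetic stand-in for sentence embeddings (100 items, 768 features).
data_rng = np.random.default_rng(0)
embeddings = data_rng.normal(size=(100, 768))
reduced = gaussian_random_projection(embeddings, n_components=5, random_state=42)
```

Fixing `random_state` makes the projection reproducible, which is the role the seed plays in `fit_transform` above.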
- ppa_pca(embeddings: array, n_components: int = 5, pca_dim: int = 50, dim: int = 3, extra_ppa: bool = False) -> array
Post Processing Algorithm with PCA (with option to apply PPA again)
- Parameters:
embeddings (np.array) – Word or sentence embeddings whose dimension is to be reduced
n_components (int, optional) – Number of components to keep, by default 5
pca_dim (int, optional) – Number of components for PCA algorithm (must be greater than n_components), by default 50
dim (int, optional) – Threshold parameter D in Post Processing Algorithm (must be smaller than n_components), by default 3
extra_ppa (bool, optional) – Whether or not to apply PPA again, by default False
- Returns:
Dimension reduced embeddings in transformed space.
- Return type:
np.array
- Raises:
ValueError – if n_components is less than dim, or if n_components is greater than pca_dim
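The idea behind PPA followed by PCA can be sketched in plain NumPy as below. This is a simplified reading of the "all-but-the-top" procedure of Mu et al. (centre the data, then remove the projections onto the top `dim` principal directions) followed by a PCA projection. It is not nlpsig's code: the helper names `ppa`, `pca_reduce` and `ppa_pca_sketch` are hypothetical, and the `pca_dim` parameter is deliberately left out because how the library uses it is not described here.

```python
import numpy as np


def ppa(embeddings, dim=3):
    """Centre the data and remove its top `dim` principal directions."""
    centred = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:dim]  # shape (dim, n_features)
    # Subtract each point's projection onto the dominant directions.
    return centred - centred @ top.T @ top


def pca_reduce(embeddings, n_components=5):
    """Project centred data onto its first n_components principal directions."""
    centred = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def ppa_pca_sketch(embeddings, n_components=5, dim=3, extra_ppa=False):
    """PPA, then PCA, then optionally PPA again on the reduced data."""
    reduced = pca_reduce(ppa(embeddings, dim), n_components)
    if extra_ppa:
        reduced = ppa(reduced, dim)
    return reduced


# Synthetic stand-in for embeddings (100 items, 20 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
out = ppa_pca_sketch(X, n_components=5, dim=3)
out_extra = ppa_pca_sketch(X, n_components=5, dim=3, extra_ppa=True)
```

With `extra_ppa=True` the second PPA pass removes `dim` directions from an `n_components`-dimensional space, which suggests why `dim` has to stay below `n_components`.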
- class nlpsig.plot_embedding.PlotEmbedding(x_data: array, y_data: array)
Bases: object
Class to visualise word or sentence embeddings.
- Parameters:
x_data (np.array) – features
y_data (np.array) – y labels
- embedding_dim_reduce(method: str = 'pca', dim: int = 3, overwrite: bool = False, random_state: int = 42) -> None
Performs dimension reduction on the data and stores the reduced embeddings in .embed.
- Parameters:
method (str, optional) –
Which dimensionality reduction technique to use, by default “pca”. Options are:
“pca” (PCA): implemented using scikit-learn.
“umap” (UMAP): implemented using the umap-learn package.
“tsne” (TSNE): implemented using scikit-learn.
dim (int, optional) – Number of components to keep, by default 3.
overwrite (bool, optional) – Whether or not to overwrite the currently stored reduced embedding, by default False.
random_state (int, optional) – Random seed, by default 42.
- plt_2d(embed_args: dict | None = None, line_args: dict | None = None) -> None
Plots the embedding in 2D space after first performing dimension reduction.
- Parameters:
embed_args (dict | None, optional) – Keyword arguments passed to the function which performs the dimensionality reduction, by default {“method”: “pca”, “dim”: 2}.
line_args (dict | None, optional) – Keyword arguments passed to the function which plots the embeddings (arguments for matplotlib.pyplot.scatter()), by default {“marker”: “o”, “alpha”: 0.3}.
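The reduce-then-scatter pattern that plt_2d describes could look roughly like the following sketch, which uses NumPy and matplotlib directly rather than nlpsig. The helper names `pca_2d` and `scatter_by_label` and the synthetic data are assumptions for illustration; only the default scatter arguments ({"marker": "o", "alpha": 0.3}) are taken from the description above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def pca_2d(x_data):
    """Project centred features onto their first two principal directions."""
    centred = x_data - x_data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


def scatter_by_label(x_data, y_data, line_args=None):
    """Scatter the 2D projection, one colour per label (illustrative only)."""
    if line_args is None:
        line_args = {"marker": "o", "alpha": 0.3}
    points = pca_2d(x_data)
    fig, ax = plt.subplots()
    for label in np.unique(y_data):
        mask = y_data == label
        # line_args is forwarded unchanged to matplotlib's scatter.
        ax.scatter(points[mask, 0], points[mask, 1], label=str(label), **line_args)
    ax.legend()
    return fig, ax, points


# Synthetic features and labels standing in for x_data and y_data.
rng = np.random.default_rng(0)
x_data = rng.normal(size=(60, 16))
y_data = np.repeat([0, 1, 2], 20)
fig, ax, points = scatter_by_label(x_data, y_data)
```

Passing a different `line_args` dictionary, for example `{"marker": "x", "alpha": 1.0}`, changes the marker style without touching the reduction step.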