
module unsupervised


function compute_tsne_embedding

compute_tsne_embedding(
    dataset: DataFrame,
    cols: list,
    N_rows: int = 20000,
    n_components=2,
    perplexity=30
) → tuple

Compute a TSNE embedding for a random subset of rows.

Args:

  • dataset: Input data
  • cols: A list of column names to produce the embedding for
  • N_rows: The number of rows to randomly sample for the embedding. Only these rows are embedded.
  • n_components: The number of dimensions to embed the data into.
  • perplexity: The perplexity of the TSNE embedding.

Returns: A tuple containing:

  • A numpy array with the embedding data, computed only for the random subset of rows
  • The rows that were used for the embedding
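The sample-then-embed pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `sketch_tsne_subset`, the `seed` argument, the `dropna` step, and the use of scikit-learn's `TSNE` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE


def sketch_tsne_subset(dataset, cols, n_rows=20000, n_components=2,
                       perplexity=30, seed=0):
    """Hypothetical helper: embed a random subset of rows with t-SNE."""
    # Keep only the requested columns and drop incomplete rows (assumption).
    features = dataset[cols].dropna()
    # Never ask for more rows than are available.
    n = min(n_rows, len(features))
    sampled = features.sample(n=n, random_state=seed)
    embedding = TSNE(n_components=n_components, perplexity=perplexity,
                     random_state=seed).fit_transform(sampled.to_numpy())
    # Return the embedding together with the index labels that were embedded.
    return embedding, list(sampled.index)


# Toy data: 300 rows of 4 random features, of which 100 rows are embedded.
rng = np.random.default_rng(0)
cols = ["f0", "f1", "f2", "f3"]
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=cols)
embedding, rows = sketch_tsne_subset(df, cols, n_rows=100)
```

Returning the sampled row labels alongside the embedding is what lets a caller write the coordinates back into the matching rows of the original table.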


function compute_morlet

compute_morlet(
    data: ndarray,
    dt: float = 0.03333333333333333,
    n_freq: int = 5,
    w: float = 3
) → ndarray

Compute morlet wavelet transform of a time series.

Args:

  • data: A 2D array containing the time series data, with dimensions (n_pts x n_channels)
  • dt: The time step of the time series. The default value is 1/30.
  • n_freq: The number of frequencies to compute
  • w: The width of the morlet wavelet

Returns: A 2D numpy array with the Morlet wavelet transform. The first dimension is the frequency, the second is the time.
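For intuition, a Morlet wavelet transform of a single channel can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the function's actual code: the helper names, the explicit frequency grid `freqs` (the documented function takes only `n_freq`), the wavelet normalisation, and the single-channel input are all choices made for the example.

```python
import numpy as np


def morlet_wavelet(n_samples, scale, w):
    """Complex Morlet wavelet sampled at n_samples points (hypothetical helper)."""
    x = (np.arange(n_samples) - (n_samples - 1) / 2) / scale
    return (np.pi ** -0.25 * np.sqrt(1 / scale)
            * np.exp(1j * w * x) * np.exp(-0.5 * x ** 2))


def sketch_morlet_transform(signal, dt=1 / 30,
                            freqs=(1.0, 2.0, 3.0, 4.0, 5.0), w=3.0):
    """Return complex coefficients with shape (n_freq, n_pts)."""
    fs = 1.0 / dt  # sampling rate
    out = np.empty((len(freqs), len(signal)), dtype=complex)
    for i, f in enumerate(freqs):
        # Scale (in samples) at which the wavelet is centred on frequency f.
        scale = w * fs / (2 * np.pi * f)
        # Truncate the wavelet to about ten scales, never longer than the signal.
        n_wav = min(int(10 * scale), len(signal))
        wavelet = morlet_wavelet(n_wav, scale, w)
        out[i] = np.convolve(signal, np.conj(wavelet)[::-1], mode="same")
    return out


# A 3 Hz sine sampled at 30 Hz for 20 seconds.
t = np.arange(600) / 30
signal = np.sin(2 * np.pi * 3.0 * t)
coeffs = sketch_morlet_transform(signal)
power = np.abs(coeffs)  # first axis: frequency, second axis: time
```

In this toy case the row of `power` belonging to 3 Hz has the largest average magnitude away from the edges, which is the kind of time-frequency feature the transform is meant to expose.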


function compute_density

compute_density(
    dataset: DataFrame,
    embedding_extent: tuple,
    bandwidth: float = 0.5,
    n_pts: int = 300,
    N_sample_rows: int = 50000,
    rows: list = None
) → ndarray

Compute kernel density estimate of embedding.

Args:

  • dataset: pd.DataFrame with embedding data loaded in it. (Must have already populated columns named 'embedding_0', 'embedding_1')
  • embedding_extent: the bounds in which to apply the density estimate. Has the form (xmin, xmax, ymin, ymax)
  • bandwidth: the Gaussian kernel bandwidth. The appropriate value depends on the scale of the embedding. Adjusting it changes the number of clusters that are identified
  • n_pts: number of points over which to evaluate the KDE
  • N_sample_rows: number of rows to randomly sample to generate estimate
  • rows: If provided, use these rows instead of a random sample

Returns: Numpy array with KDE over the specified square region in the embedding space, with dimensions (n_pts x n_pts)
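As a rough picture of what a Gaussian kernel density estimate over a square grid involves, consider the NumPy sketch below. It is an assumption-laden illustration rather than the library's code: the helper name `sketch_grid_kde`, the plain array input (instead of a DataFrame with 'embedding_0' and 'embedding_1' columns), and the convention that the first axis indexes x and the second indexes y are all hypothetical.

```python
import numpy as np


def sketch_grid_kde(points, extent, bandwidth=0.5, n_pts=300):
    """Hypothetical helper: Gaussian KDE of 2D points on an n_pts x n_pts grid."""
    xmin, xmax, ymin, ymax = extent
    gx = np.linspace(xmin, xmax, n_pts)
    gy = np.linspace(ymin, ymax, n_pts)
    # An isotropic Gaussian kernel is separable, so evaluate each axis on its own.
    kx = np.exp(-0.5 * ((gx[:, None] - points[None, :, 0]) / bandwidth) ** 2)
    ky = np.exp(-0.5 * ((gy[:, None] - points[None, :, 1]) / bandwidth) ** 2)
    # dens[i, j] is the estimated density at (gx[i], gy[j]).
    return kx @ ky.T / (len(points) * 2 * np.pi * bandwidth ** 2)


# Two well-separated blobs as a stand-in for a 2D embedding.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(-2.0, 0.0), scale=0.5, size=(500, 2))
blob_b = rng.normal(loc=(2.0, 0.0), scale=0.5, size=(500, 2))
points = np.vstack([blob_a, blob_b])

extent = (-6.0, 6.0, -6.0, 6.0)
dens = sketch_grid_kde(points, extent, bandwidth=0.5, n_pts=100)

# Grid spacing, used to check that the density integrates to roughly one.
dx = (extent[1] - extent[0]) / (100 - 1)
dy = (extent[3] - extent[2]) / (100 - 1)
total_mass = dens.sum() * dx * dy
```

A larger bandwidth smooths the two blobs towards a single peak, while a smaller one splits them into more local maxima, which is why the bandwidth influences how many clusters are later found.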


function compute_watershed

compute_watershed(
    dens_matrix: ndarray,
    positive_only: bool = False,
    cutoff: float = 0
) → tuple

Compute watershed clustering of a density matrix.

Args:

  • dens_matrix: A square 2D numpy array, output from compute_density, containing the kernel density estimate of the embedding.
  • positive_only: Whether to apply a threshold, 'cutoff'. If applied, 'cutoff' is subtracted from dens_matrix, and any value below zero is set to zero. Useful for only focusing on high density clusters.
  • cutoff: The cutoff value to apply if positive_only = True

Returns: A numpy array with the same dimensions as dens_matrix. Each value in the array is the cluster ID for that coordinate.
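The thresholding described for positive_only can be sketched in a few lines of NumPy. Only the cutoff step is shown, not the watershed itself, and the helper name `sketch_apply_cutoff` is hypothetical.

```python
import numpy as np


def sketch_apply_cutoff(dens_matrix, cutoff=0.0):
    """Hypothetical helper: subtract the cutoff and set negative values to zero."""
    return np.clip(dens_matrix - cutoff, 0.0, None)


# A tiny density matrix for illustration.
dens = np.array([
    [0.00, 0.10, 0.40],
    [0.20, 0.60, 0.30],
    [0.05, 0.25, 0.00],
])
thresholded = sketch_apply_cutoff(dens, cutoff=0.2)
```

After this step, low-density regions are exactly zero, so a subsequent watershed transform only has the high-density peaks left to separate.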


function cluster_behaviors

cluster_behaviors(
    dataset: DataFrame,
    feature_cols: list,
    N_rows: int = 200000,
    use_morlet: bool = False,
    use_umap: bool = True,
    n_pts: int = 300,
    bandwidth: float = 0.5,
    **kwargs
) → tuple

Cluster behaviors based on dimensionality reduction, kernel density estimation, and watershed clustering.

Note that this will modify the dataset dataframe in place.

The following columns are added to dataset:

  • 'embedding_index_[0/1]': the indices of each row's embedding coordinates in the returned density matrix
  • 'unsup_behavior_label': the watershed transform label for that row, based on its embedding coordinates. Rows whose embedding coordinate has no watershed cluster, or which fall outside the domain, have value -1.

Args:

  • dataset: the pd.DataFrame with the features of interest
  • feature_cols: list of column names to perform the clustering on
  • N_rows: number of rows to perform the embedding on. If 'None', then all rows are used.
  • use_morlet: Apply Morlet wavelet transform to the feature cols before computing the embedding
  • use_umap: If True will use UMAP dimensionality reduction, if False will use TSNE
  • n_pts: dimension of grid the kernel density estimate is evaluated on.
  • bandwidth: Gaussian kernel bandwidth for kernel estimate
  • **kwargs: All other keyword parameters are sent to dimensionality reduction call (either TSNE or UMAP)

Returns: A tuple with components:

  • dens_matrix: the (n_pts x n_pts) numpy array with the density estimate of the 2D embedding
  • labels: numpy array with the same dimensions as dens_matrix, whose values are the watershed cluster IDs
  • embedding_extent: the coordinates in embedding space over which dens_matrix approximates the density
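To make the relationship between embedding coordinates, grid indices, and per-row labels concrete, here is a small NumPy sketch. It is not the library's implementation: the helper name `sketch_assign_labels`, the [x index, y index] orientation of the label grid, and the use of 0 to mark grid cells with no watershed cluster are assumptions made for illustration.

```python
import numpy as np


def sketch_assign_labels(embedding, labels, extent):
    """Hypothetical helper: look up a cluster label for each embedded point."""
    xmin, xmax, ymin, ymax = extent
    n_pts = labels.shape[0]
    # Convert embedding coordinates to integer grid indices.
    ix = np.floor((embedding[:, 0] - xmin) / (xmax - xmin) * n_pts).astype(int)
    iy = np.floor((embedding[:, 1] - ymin) / (ymax - ymin) * n_pts).astype(int)
    inside = (ix >= 0) & (ix < n_pts) & (iy >= 0) & (iy < n_pts)
    # Points outside the grid keep the value -1.
    out = np.full(len(embedding), -1)
    out[inside] = labels[ix[inside], iy[inside]]
    # Assumption: 0 marks "no cluster" in the label grid; report it as -1 too.
    out[out == 0] = -1
    return ix, iy, out


# A 4 x 4 label grid: cluster 1 for low x, cluster 2 for high x, one empty cell.
labels = np.ones((4, 4), dtype=int)
labels[2:, :] = 2
labels[3, 3] = 0
extent = (0.0, 4.0, 0.0, 4.0)

embedding = np.array([
    [0.5, 0.5],  # falls in cluster 1
    [2.5, 1.5],  # falls in cluster 2
    [3.5, 3.5],  # grid cell with no cluster
    [5.0, 1.0],  # outside the extent
])
ix, iy, point_labels = sketch_assign_labels(embedding, labels, extent)
```

The two index arrays play the role of the 'embedding_index_[0/1]' columns, and `point_labels` plays the role of 'unsup_behavior_label'.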


This file was automatically generated via lazydocs.