flowstab.network_clustering

Module for network clustering and related operations

Copyright (C) 2021 Alexandre Bovet <alexandre.bovet@maths.ox.ac.uk>

Attributes

logger

Classes

BaseClustering

Base class for autocovariance matrix clustering with the Louvain algorithm.

Clustering

Symmetric Clustering.

FlowIntegralClustering

FlowIntegralClustering.

Partition

Node partition as a list of node sets and a node to cluster dict.

SparseClustering

Symmetric Clustering using sparse matrices.

Functions

avg_norm_var_information(clusters_lists_list[, ...])

Returns the average normalized variation of information for the

jaccard_distance(clusters1, clusters2)

Returns the Jaccard distance between two clustering.

n_random_seeds(n)

norm_mutual_information(clusters1, clusters2)

Normalized mutial information between two non-overlapping clustering.

norm_var_information(clusters1, clusters2[, N, ...])

Normalized variation of information between two non-overlapping clustering.

run_multi_louvain(clustering, num_repeat, **kwargs)

Helper function to run multiple (serial) instances of the Louvain algorithm for the

sort_clusters(cluster_list_to_sort, cluster_list_model)

Quick heuristic to sort a list of clusters in order to closely match another list

static_clustering(A[, t, rnd_seed, discrete_time_rw, ...])

Initializes a clustering instance to optimize the continuous time

Module Contents

class flowstab.network_clustering.BaseClustering(T=None, p1=None, p2=None, S=None, source_cluster_list=None, source_node_to_cluster_dict=None, target_cluster_list=None, target_node_to_cluster_dict=None, rnd_state=None, rnd_seed=None, S_threshold=None)[source]

Base class for autocovariance matrix clustering with the Louvain algorithm.

This class provides a base implementation for clustering autocovariance matrices using the Louvain algorithm. It can be used to cluster nodes in a network based on their autocovariance matrices.

Initialize a BaseClustering object.

Parameters:
  • T (numpy.ndarray) – NxN transition matrix, T[i,j] is the probability of going from node i to node j between t1 and t2.

  • p1 (numpy.ndarray) – Nx1 probability density at t1. Default is the uniform probability.

  • p2 (numpy.ndarray) – Nx1 probability density at t2. Default is p1 @ T.

  • S (numpy.ndarray) – NxN covariance matrix. Default is diag(p1) @ T - outer(p1,p2).

  • source_cluster_list (list) – List of sets of nodes describing the source partition. Default is singleton clusters.

  • source_node_to_cluster_dict (dict) – Dictionary with mapping between nodes and source cluster number. Default is singleton clusters.

  • target_cluster_list (list) – List of sets of nodes describing the target partition. Default is singleton clusters.

  • target_node_to_cluster_dict (dict) – Dictionary with mapping between nodes and target cluster number. Default is singleton clusters.

  • rnd_state (np.random.RandomState) – Random state object. Default creates a new one.

  • rnd_seed (int) – Seed for the random object. Default is a random seed.

  • S_threshold (float) – Smallest values of S. Used to trim insignificantly small values.

  • note:: (..) – At least T or S must be provided to initialize the clustering.

  • note:: – Clusters can either be initialized with a cluster_list or a node_to_cluster_dict.

abstract compute_stability(R=None)[source]

Compute the stability of the clustering.

Parameters:

R (numpy.ndarray) – The clustered autocovariance matrix.

Returns:

stability – The stability of the clustering.

Return type:

float

find_louvain_clustering(delta_r_threshold=np.finfo(float).eps, n_meta_iter_max=1000, n_sub_iter_max=1000, verbose=False, rnd_seed=None, save_progress=False, print_num_loops=False)[source]

Returns n_meta_loop

rnd_seed is used to set to state of the random number generator. Default is to keep the current state

S_threshold = None[source]
p1 = None[source]
p2 = None[source]
source_part[source]
target_part[source]
class flowstab.network_clustering.Clustering(p1=None, p2=None, T=None, S=None, cluster_list=None, node_to_cluster_dict=None, rnd_state=None, rnd_seed=None, S_threshold=None)[source]

Bases: BaseClustering

Symmetric Clustering.

Finds the best partition that optimizes the stability between two times defined as the trace of the clustered autocovariance matrix.

At least T must be given to initialize the clustering.

Clusters can either be initilized with a cluster_list or a node_to_cluster_dict.

Parameters:
  • T (numpy.ndarrays) – NxN transition matrix, T[i,j] is the probability of going from node i to node j between t1 and t2.

  • p1 (numpy.ndarrays) – Nx1 probability density at t1. Default is the uniform probability.

  • p2 (numpy.ndarrays) – Nx1 probability density at t2. Default is p1 @ T.

  • S (numpy.ndarrays) – NxN covariance matrix. Default is diag(p1) @ T - outer(p1,p2).

  • cluster_list (list) – list of set of nodes describing the partition. Default is singleton clusters.

  • node_to_cluster_dict (dict) – dictionary with mapping between nodes and cluster number. Default is singleton clusters.

  • rnd_state (np.random.RandomState) – Random state object. Default creates a new one.

  • rnd_seed (int) – Seed for the random object. Default is a random seed.

  • S_threshold (float) – Smallest values of S. Used to trim insignificantly small values.

compute_stability(R=None)[source]

Returns the stability of the clusters given in cluster_list computed between times t1 and t2

Here, for symmetric clustering, we only care about the diagonal of R.

find_louvain_clustering(delta_r_threshold=np.finfo(float).eps, n_meta_iter_max=1000, n_sub_iter_max=1000, verbose=False, rnd_seed=None, save_progress=False, print_num_loops=False)[source]

Returns n_meta_loop

rnd_seed is used to set to state of the random number generator. Default is to keep the current state

partition[source]
target_part = None[source]
class flowstab.network_clustering.FlowIntegralClustering(T_list=None, p1=None, time_list=None, T_inter_list=None, integral_time_grid=None, reverse_time=False, rtol=1e-08, verbose=False)[source]

FlowIntegralClustering.

Class to finds the best partition that optimizes the integral of the autocovariance between two times computed as

int_t1^t2 [diag(p1) @ T(t1,t) @ np.diag(1/pt) @ T(t1,t).T @ np.diag(p1) - outer(p1,p1)] dt

T_list or T_inter_list must be given to initialize the clustering.

Clusters can either be initilized with a cluster_list or a node_to_cluster_dict.

Parameters:
  • T_list (list of scipy.sparse.csr matrices or numpy.ndarray) – list of K succesive NxN transition matrices, Tk[i,j] is the probability of going from node i to node j between t1 and tk+1.

  • T_inter_list (list of scipy.sparse.csr matrices or numpy.ndarray) – list of K succesive NxN inter event transition matrices, T_inter_k[i,j] is the probability of going from node i to node j between tk and tk+1.

  • p1 (numpy.ndarrays) – Nx1 probability density at t1. Default is the uniform probability.

  • time_list (list or numpy.array) – list of K+1 time instants corresponding to the tk of the transition matrices. Default is unit times.

  • integral_time_grid (list) – list of times until which to compute the integral. The final times and indices used are stored in _t_integral_grid and _k_integral_grid. Default is [time_list[0], time_list[-1]]

  • reverse_time (bool) – Whether to reverse time when computing T_list and I_list from T_inter_list, for backward flow stability. Default is False (forward flow stability). If T_list is provided in input, it must have been computed with the corresponding time direction.

  • rtol (float) – Relative tolerance to set I values to zero. Values smaller than I.max()*rtol are set to zero to keep I sparse.

find_louvain_clustering(k=0, delta_r_threshold=np.finfo(float).eps, n_meta_iter_max=1000, n_sub_iter_max=1000, verbose=False, rnd_seed=None, save_progress=False, cluster_list=None, node_to_cluster_dict=None, rnd_state=None, S_threshold=None)[source]

Louvain algorithm to find the best partition.

The best partition is saved in `self.partition[k]

Parameters:
  • k (int) – Index of the covariance integral to cluster. self.I_list[k] will be used.

  • delta_r_threshold (float, optional) – Minimal value for an improvement of the quality function. The default is np.finfo(float).eps.

  • n_meta_iter_max (int, optional) – Maximum number of meta iterations. The default is 1000.

  • n_sub_iter_max (int, optional) – Maximum number of sub iterations. The default is 1000.

  • verbose (bool, int, optional) – Degree of verbosity. The default is False.

  • rnd_seed (int) – Seed for the random object. Default is a random seed.

  • save_progress (bool, optional) – Whether to save the progress in the Clustering.partition_progress and Clustering.stability_progress. The default is False.

  • cluster_list (list of sets, optional) – list of set of nodes describing the partition. Default is singleton clusters.

  • node_to_cluster_dict (dict, optional) – dictionary with mapping between nodes and cluster number. Default is singleton clusters.

  • rnd_state (np.random.RandomState) – Random state object. Default creates a new one.

  • S_threshold (float.) – Smallest values of the covariance. Used to trim insignificantly small values. Default is None (no thresholding)

Returns:

n_loop – Number of meta loops of the louvain algorithm.

Return type:

int

clustering[source]
is_nparray = False[source]
is_sparse = False[source]
is_sparse_stoch = False[source]
num_nodes[source]
p1 = None[source]
partition[source]
reversed_time = False[source]
time_list[source]
class flowstab.network_clustering.Partition(num_nodes: int, cluster_list: Sequence[set[int]] | None = None, node_to_cluster_dict: dict | None = None, check_integrity: bool = False)[source]

Node partition as a list of node sets and a node to cluster dict.

A Partition object represents a node partition that can be described as a list of node sets and a node to cluster dict. It provides methods to manipulate and analyze the partition.

Initialize a Partition object.

Parameters:
  • num_nodes (int) – Number of nodes in the partition.

  • cluster_list (Collection | None, optional) – List of clusters with each cluster being a set of nodes.

  • node_to_cluster_dict (dict | None, optional) – Mapping that maps each node to the index of the corresponding cluster in cluster_list.

  • check_integrity (bool, optional) – If True, check that all nodes are in only one cluster.

  • note:: (..) – If both cluster_list and node_to_cluster_dict are provided, a ValueError is raised. If neither is provided, the default clustering is one node per cluster.

__repr__()[source]

Return a string representation of the Partition object.

Returns:

A string representation of the Partition object.

Return type:

str

__str__()[source]
check_integrity()[source]

Check the integrity of the partition.

This method checks that all nodes are in only one cluster, i.e. non-overlapping clusters whose union is the full set of nodes.

Raises:

ValueError – If the partition is not valid.

get_indicator(sparse=True)[source]

Get the indicator matrix for the partition.

Parameters:

sparse (bool, optional) – If True, return a sparse indicator matrix. Default is True.

Returns:

The indicator matrix for the partition.

Return type:

ndarray or sparse matrix

get_num_clusters(non_empty=False)[source]

Get the number of clusters in the partition.

Parameters:

non_empty (bool, optional) – If True, only count non-empty clusters. Default is False.

Returns:

The number of clusters in the partition.

Return type:

int

iter_cluster_node_index()[source]

Iterate over the node indices for each cluster.

Yields:

list – The node indices for each cluster.

move_node(node, c_f)[source]

Move a node to cluster c_f.

Parameters:
  • node (int) – The node to move.

  • c_f (int) – The index of the cluster to move the node to.

remove_empty_clusters()[source]

Remove empty clusters from cluster_list and node_to_cluster_dict and reindex clusters.

This method removes any empty clusters from the cluster_list and updates the node_to_cluster_dict accordingly. It also reindexes the clusters to ensure that the indices are consecutive.

num_nodes[source]
class flowstab.network_clustering.SparseClustering(p1=None, p2=None, T=None, S=None, cluster_list=None, node_to_cluster_dict=None, rnd_state=None, rnd_seed=None)[source]

Bases: Clustering

Symmetric Clustering using sparse matrices.

TODO: fix formatting; complete if needed

Finds the best partition that optimizes the stability between two times defined as the trace of the clustered autocovariance matrix.

At least T must be given to initialize the clustering.

Clusters can either be initilized with a cluster_list or a node_to_cluster_dict.

Parameters:
  • T (scipy csr_sparse matrix) – NxN transition matrix, T[i,j] is the probability of going from node i to node j between t1 and t2.

  • p1 (numpy.ndarrays) – Nx1 probability density at t1. Default is the uniform probability.

  • p2 (numpy.ndarrays) – Nx1 probability density at t2. Default is p1 @ T.

  • S (SparseAutocovMat) – NxN covariance matrix. Default is diag(p1) @ T - outer(p1,p2).

  • cluster_list (list) – list of set of nodes describing the partition. Default is singleton clusters.

  • node_to_cluster_dict (dict) – dictionary with mapping between nodes and cluster number. Default is singleton clusters.

  • rnd_state (np.random.RandomState) – Random state object. Default creates a new one.

  • rnd_seed (int) – Seed for the random object. Default is a random seed.

  • S_threshold (float) – Smallest values of S. Used to trim insignificantly small values.

TODO: adding docstring

compute_stability(R=None)[source]

Returns the stability of the clusters given in cluster_list computed between times t1 and t2

Here, for symmetric clustering, we only care about the diagonal of R.

p1 = None[source]
p2 = None[source]
partition[source]
source_part[source]
target_part = None[source]
flowstab.network_clustering.avg_norm_var_information(clusters_lists_list, num_samples=None)[source]

Returns the average normalized variation of information for the clusters_list in clusters_lists_list.

By default uses all N*(N-1)/2 combinations possible of cluster_list pairs, where N = len(clusters_lists_list) or uses num_samples pairs if num_samples < N*(N-1)/2.

flowstab.network_clustering.jaccard_distance(clusters1, clusters2)[source]

Returns the Jaccard distance between two clustering.

inputs can be node_to_cluster dictionaries or cluster lists of node sets

flowstab.network_clustering.n_random_seeds(n)[source]
flowstab.network_clustering.norm_mutual_information(clusters1, clusters2)[source]

Normalized mutial information between two non-overlapping clustering.

The mutual information is normalized by the max of the two individual entropies.

\[NMI = (H(C1)+H(C2)-H(C1,C2))/max(H(C1),H(C2))\]

inputs can be node_to_cluster dictionaries, cluster lists of node sets or instances of Partition.

flowstab.network_clustering.norm_var_information(clusters1, clusters2, N=None, use_clust_list=False)[source]

Normalized variation of information between two non-overlapping clustering.

\[\hat{V}(C_1,C_2) = ({H(C1|C2)+H(C2|C1)})/{log_2 N}\]

inputs can be node_to_cluster dictionaries, cluster lists of node sets or instances of Partition.

flowstab.network_clustering.run_multi_louvain(clustering, num_repeat, **kwargs)[source]

Helper function to run multiple (serial) instances of the Louvain algorithm for the clustering given in clutering.

Parameters:
  • clustering (Clustering or SparseClustering instance)

  • num_repeat (int) – Number of repetition of the algorithm.

  • **kwargs (key word args) – Arguments passed when initializing the clusterings.

Returns:

  • n_loops (list) – list with the number of louvain loop for each repetition.

  • cluster_lists (list) – list with the cluster lists of each repetition..

  • stabilities (list) – list with the value of the stability for each repetition.

  • seeds (list) – list with the random seeds used in each repetition.

flowstab.network_clustering.sort_clusters(cluster_list_to_sort, cluster_list_model, thresh_ratio=0.3)[source]

Quick heuristic to sort a list of clusters in order to closely match another list

flowstab.network_clustering.static_clustering(A, t=1, rnd_seed=None, discrete_time_rw=False, linearized=False, directed=False)[source]

Initializes a clustering instance to optimize the continuous time Markov stability from the graph given by the adjacency matrix A.

Parameters:
  • A (scipy sparse csr matrix or numpy ndarray) – Adjacency matrix.

  • t (int or float, optional) – Markov time (resolution parameter). The default is 1.

  • rnd_seed (int, optional) – The default is None.

  • discrete_time_rw (bool, optional) – If true, powers of the transition matrix to the power of t are used for varying resolution. t must be an int. The default is False.

  • linearized (bool, optional) – If true, the linearized version of the Markov Stability is used. if linearized=false and discrete_time_rw==false, the matrix exponential of the random walk Laplacian is used to compute the transition matrix (slow). The default is False.

  • directed (bool, optional) – If true, the network must be strongly connected. The default is False.

Raises:
  • TypeError – ‘A must be numpy array or scipy csr.’.

  • ValueError – ‘sum of degrees equal 0’.

Returns:

Clustering instance initialized for Markov Stability optimization.

Return type:

Instance of `Clustering’ or ‘SparseClustering’

flowstab.network_clustering.logger[source]