flowstab.network_clustering¶
Module for network clustering and related operations
Copyright (C) 2021 Alexandre Bovet <alexandre.bovet@maths.ox.ac.uk>
Attributes¶
Classes¶
Base class for autocovariance matrix clustering with the Louvain algorithm. |
|
Symmetric Clustering. |
|
FlowIntegralClustering. |
|
Node partition as a list of node sets and a node to cluster dict. |
|
Symmetric Clustering using sparse matrices. |
Functions¶
|
Returns the average normalized variation of information for the |
|
Returns the Jaccard distance between two clustering. |
|
Normalized mutial information between two non-overlapping clustering. |
|
Normalized variation of information between two non-overlapping clustering. |
|
Helper function to run multiple (serial) instances of the Louvain algorithm for the |
|
Quick heuristic to sort a list of clusters in order to closely match another list |
|
Initializes a clustering instance to optimize the continuous time |
Module Contents¶
- class flowstab.network_clustering.BaseClustering(T=None, p1=None, p2=None, S=None, source_cluster_list=None, source_node_to_cluster_dict=None, target_cluster_list=None, target_node_to_cluster_dict=None, rnd_state=None, rnd_seed=None, S_threshold=None)[source]¶
Base class for autocovariance matrix clustering with the Louvain algorithm.
This class provides a base implementation for clustering autocovariance matrices using the Louvain algorithm. It can be used to cluster nodes in a network based on their autocovariance matrices.
Initialize a BaseClustering object.
- Parameters:
T (numpy.ndarray) – NxN transition matrix, T[i,j] is the probability of going from node i to node j between t1 and t2.
p1 (numpy.ndarray) – Nx1 probability density at t1. Default is the uniform probability.
p2 (numpy.ndarray) – Nx1 probability density at t2. Default is p1 @ T.
S (numpy.ndarray) – NxN covariance matrix. Default is diag(p1) @ T - outer(p1,p2).
source_cluster_list (list) – List of sets of nodes describing the source partition. Default is singleton clusters.
source_node_to_cluster_dict (dict) – Dictionary with mapping between nodes and source cluster number. Default is singleton clusters.
target_cluster_list (list) – List of sets of nodes describing the target partition. Default is singleton clusters.
target_node_to_cluster_dict (dict) – Dictionary with mapping between nodes and target cluster number. Default is singleton clusters.
rnd_state (np.random.RandomState) – Random state object. Default creates a new one.
rnd_seed (int) – Seed for the random object. Default is a random seed.
S_threshold (float) – Smallest values of S. Used to trim insignificantly small values.
note:: (..) – At least T or S must be provided to initialize the clustering.
note:: – Clusters can either be initialized with a cluster_list or a node_to_cluster_dict.
- abstract compute_stability(R=None)[source]¶
Compute the stability of the clustering.
- Parameters:
R (numpy.ndarray) – The clustered autocovariance matrix.
- Returns:
stability – The stability of the clustering.
- Return type:
float
- find_louvain_clustering(delta_r_threshold=np.finfo(float).eps, n_meta_iter_max=1000, n_sub_iter_max=1000, verbose=False, rnd_seed=None, save_progress=False, print_num_loops=False)[source]¶
Returns n_meta_loop
rnd_seed is used to set to state of the random number generator. Default is to keep the current state
- class flowstab.network_clustering.Clustering(p1=None, p2=None, T=None, S=None, cluster_list=None, node_to_cluster_dict=None, rnd_state=None, rnd_seed=None, S_threshold=None)[source]¶
Bases:
BaseClusteringSymmetric Clustering.
Finds the best partition that optimizes the stability between two times defined as the trace of the clustered autocovariance matrix.
At least T must be given to initialize the clustering.
Clusters can either be initilized with a cluster_list or a node_to_cluster_dict.
- Parameters:
T (numpy.ndarrays) – NxN transition matrix, T[i,j] is the probability of going from node i to node j between t1 and t2.
p1 (numpy.ndarrays) – Nx1 probability density at t1. Default is the uniform probability.
p2 (numpy.ndarrays) – Nx1 probability density at t2. Default is p1 @ T.
S (numpy.ndarrays) – NxN covariance matrix. Default is diag(p1) @ T - outer(p1,p2).
cluster_list (list) – list of set of nodes describing the partition. Default is singleton clusters.
node_to_cluster_dict (dict) – dictionary with mapping between nodes and cluster number. Default is singleton clusters.
rnd_state (np.random.RandomState) – Random state object. Default creates a new one.
rnd_seed (int) – Seed for the random object. Default is a random seed.
S_threshold (float) – Smallest values of S. Used to trim insignificantly small values.
- compute_stability(R=None)[source]¶
Returns the stability of the clusters given in cluster_list computed between times t1 and t2
Here, for symmetric clustering, we only care about the diagonal of R.
- find_louvain_clustering(delta_r_threshold=np.finfo(float).eps, n_meta_iter_max=1000, n_sub_iter_max=1000, verbose=False, rnd_seed=None, save_progress=False, print_num_loops=False)[source]¶
Returns n_meta_loop
rnd_seed is used to set to state of the random number generator. Default is to keep the current state
- class flowstab.network_clustering.FlowIntegralClustering(T_list=None, p1=None, time_list=None, T_inter_list=None, integral_time_grid=None, reverse_time=False, rtol=1e-08, verbose=False)[source]¶
FlowIntegralClustering.
Class to finds the best partition that optimizes the integral of the autocovariance between two times computed as
int_t1^t2 [diag(p1) @ T(t1,t) @ np.diag(1/pt) @ T(t1,t).T @ np.diag(p1) - outer(p1,p1)] dt
T_list or T_inter_list must be given to initialize the clustering.
Clusters can either be initilized with a cluster_list or a node_to_cluster_dict.
- Parameters:
T_list (list of scipy.sparse.csr matrices or numpy.ndarray) – list of K succesive NxN transition matrices, Tk[i,j] is the probability of going from node i to node j between t1 and tk+1.
T_inter_list (list of scipy.sparse.csr matrices or numpy.ndarray) – list of K succesive NxN inter event transition matrices, T_inter_k[i,j] is the probability of going from node i to node j between tk and tk+1.
p1 (numpy.ndarrays) – Nx1 probability density at t1. Default is the uniform probability.
time_list (list or numpy.array) – list of K+1 time instants corresponding to the tk of the transition matrices. Default is unit times.
integral_time_grid (list) – list of times until which to compute the integral. The final times and indices used are stored in _t_integral_grid and _k_integral_grid. Default is [time_list[0], time_list[-1]]
reverse_time (bool) – Whether to reverse time when computing T_list and I_list from T_inter_list, for backward flow stability. Default is False (forward flow stability). If T_list is provided in input, it must have been computed with the corresponding time direction.
rtol (float) – Relative tolerance to set I values to zero. Values smaller than I.max()*rtol are set to zero to keep I sparse.
- find_louvain_clustering(k=0, delta_r_threshold=np.finfo(float).eps, n_meta_iter_max=1000, n_sub_iter_max=1000, verbose=False, rnd_seed=None, save_progress=False, cluster_list=None, node_to_cluster_dict=None, rnd_state=None, S_threshold=None)[source]¶
Louvain algorithm to find the best partition.
The best partition is saved in `self.partition[k]
- Parameters:
k (int) – Index of the covariance integral to cluster. self.I_list[k] will be used.
delta_r_threshold (float, optional) – Minimal value for an improvement of the quality function. The default is np.finfo(float).eps.
n_meta_iter_max (int, optional) – Maximum number of meta iterations. The default is 1000.
n_sub_iter_max (int, optional) – Maximum number of sub iterations. The default is 1000.
verbose (bool, int, optional) – Degree of verbosity. The default is False.
rnd_seed (int) – Seed for the random object. Default is a random seed.
save_progress (bool, optional) – Whether to save the progress in the Clustering.partition_progress and Clustering.stability_progress. The default is False.
cluster_list (list of sets, optional) – list of set of nodes describing the partition. Default is singleton clusters.
node_to_cluster_dict (dict, optional) – dictionary with mapping between nodes and cluster number. Default is singleton clusters.
rnd_state (np.random.RandomState) – Random state object. Default creates a new one.
S_threshold (float.) – Smallest values of the covariance. Used to trim insignificantly small values. Default is None (no thresholding)
- Returns:
n_loop – Number of meta loops of the louvain algorithm.
- Return type:
int
- class flowstab.network_clustering.Partition(num_nodes: int, cluster_list: Sequence[set[int]] | None = None, node_to_cluster_dict: dict | None = None, check_integrity: bool = False)[source]¶
Node partition as a list of node sets and a node to cluster dict.
A Partition object represents a node partition that can be described as a list of node sets and a node to cluster dict. It provides methods to manipulate and analyze the partition.
Initialize a Partition object.
- Parameters:
num_nodes (int) – Number of nodes in the partition.
cluster_list (Collection | None, optional) – List of clusters with each cluster being a set of nodes.
node_to_cluster_dict (dict | None, optional) – Mapping that maps each node to the index of the corresponding cluster in cluster_list.
check_integrity (bool, optional) – If True, check that all nodes are in only one cluster.
note:: (..) – If both cluster_list and node_to_cluster_dict are provided, a ValueError is raised. If neither is provided, the default clustering is one node per cluster.
- __repr__()[source]¶
Return a string representation of the Partition object.
- Returns:
A string representation of the Partition object.
- Return type:
str
- check_integrity()[source]¶
Check the integrity of the partition.
This method checks that all nodes are in only one cluster, i.e. non-overlapping clusters whose union is the full set of nodes.
- Raises:
ValueError – If the partition is not valid.
- get_indicator(sparse=True)[source]¶
Get the indicator matrix for the partition.
- Parameters:
sparse (bool, optional) – If True, return a sparse indicator matrix. Default is True.
- Returns:
The indicator matrix for the partition.
- Return type:
ndarray or sparse matrix
- get_num_clusters(non_empty=False)[source]¶
Get the number of clusters in the partition.
- Parameters:
non_empty (bool, optional) – If True, only count non-empty clusters. Default is False.
- Returns:
The number of clusters in the partition.
- Return type:
int
- iter_cluster_node_index()[source]¶
Iterate over the node indices for each cluster.
- Yields:
list – The node indices for each cluster.
- move_node(node, c_f)[source]¶
Move a node to cluster c_f.
- Parameters:
node (int) – The node to move.
c_f (int) – The index of the cluster to move the node to.
- remove_empty_clusters()[source]¶
Remove empty clusters from cluster_list and node_to_cluster_dict and reindex clusters.
This method removes any empty clusters from the cluster_list and updates the node_to_cluster_dict accordingly. It also reindexes the clusters to ensure that the indices are consecutive.
- class flowstab.network_clustering.SparseClustering(p1=None, p2=None, T=None, S=None, cluster_list=None, node_to_cluster_dict=None, rnd_state=None, rnd_seed=None)[source]¶
Bases:
ClusteringSymmetric Clustering using sparse matrices.
TODO: fix formatting; complete if needed
Finds the best partition that optimizes the stability between two times defined as the trace of the clustered autocovariance matrix.
At least T must be given to initialize the clustering.
Clusters can either be initilized with a cluster_list or a node_to_cluster_dict.
- Parameters:
T (scipy csr_sparse matrix) – NxN transition matrix, T[i,j] is the probability of going from node i to node j between t1 and t2.
p1 (numpy.ndarrays) – Nx1 probability density at t1. Default is the uniform probability.
p2 (numpy.ndarrays) – Nx1 probability density at t2. Default is p1 @ T.
S (SparseAutocovMat) – NxN covariance matrix. Default is diag(p1) @ T - outer(p1,p2).
cluster_list (list) – list of set of nodes describing the partition. Default is singleton clusters.
node_to_cluster_dict (dict) – dictionary with mapping between nodes and cluster number. Default is singleton clusters.
rnd_state (np.random.RandomState) – Random state object. Default creates a new one.
rnd_seed (int) – Seed for the random object. Default is a random seed.
S_threshold (float) – Smallest values of S. Used to trim insignificantly small values.
TODO: adding docstring
- flowstab.network_clustering.avg_norm_var_information(clusters_lists_list, num_samples=None)[source]¶
Returns the average normalized variation of information for the clusters_list in clusters_lists_list.
By default uses all N*(N-1)/2 combinations possible of cluster_list pairs, where N = len(clusters_lists_list) or uses num_samples pairs if num_samples < N*(N-1)/2.
- flowstab.network_clustering.jaccard_distance(clusters1, clusters2)[source]¶
Returns the Jaccard distance between two clustering.
inputs can be node_to_cluster dictionaries or cluster lists of node sets
- flowstab.network_clustering.norm_mutual_information(clusters1, clusters2)[source]¶
Normalized mutial information between two non-overlapping clustering.
The mutual information is normalized by the max of the two individual entropies.
\[NMI = (H(C1)+H(C2)-H(C1,C2))/max(H(C1),H(C2))\]inputs can be node_to_cluster dictionaries, cluster lists of node sets or instances of Partition.
- flowstab.network_clustering.norm_var_information(clusters1, clusters2, N=None, use_clust_list=False)[source]¶
Normalized variation of information between two non-overlapping clustering.
\[\hat{V}(C_1,C_2) = ({H(C1|C2)+H(C2|C1)})/{log_2 N}\]inputs can be node_to_cluster dictionaries, cluster lists of node sets or instances of Partition.
- flowstab.network_clustering.run_multi_louvain(clustering, num_repeat, **kwargs)[source]¶
Helper function to run multiple (serial) instances of the Louvain algorithm for the clustering given in clutering.
- Parameters:
clustering (Clustering or SparseClustering instance)
num_repeat (int) – Number of repetition of the algorithm.
**kwargs (key word args) – Arguments passed when initializing the clusterings.
- Returns:
n_loops (list) – list with the number of louvain loop for each repetition.
cluster_lists (list) – list with the cluster lists of each repetition..
stabilities (list) – list with the value of the stability for each repetition.
seeds (list) – list with the random seeds used in each repetition.
- flowstab.network_clustering.sort_clusters(cluster_list_to_sort, cluster_list_model, thresh_ratio=0.3)[source]¶
Quick heuristic to sort a list of clusters in order to closely match another list
- flowstab.network_clustering.static_clustering(A, t=1, rnd_seed=None, discrete_time_rw=False, linearized=False, directed=False)[source]¶
Initializes a clustering instance to optimize the continuous time Markov stability from the graph given by the adjacency matrix A.
- Parameters:
A (scipy sparse csr matrix or numpy ndarray) – Adjacency matrix.
t (int or float, optional) – Markov time (resolution parameter). The default is 1.
rnd_seed (int, optional) – The default is None.
discrete_time_rw (bool, optional) – If true, powers of the transition matrix to the power of t are used for varying resolution. t must be an int. The default is False.
linearized (bool, optional) – If true, the linearized version of the Markov Stability is used. if linearized=false and discrete_time_rw==false, the matrix exponential of the random walk Laplacian is used to compute the transition matrix (slow). The default is False.
directed (bool, optional) – If true, the network must be strongly connected. The default is False.
- Raises:
TypeError – ‘A must be numpy array or scipy csr.’.
ValueError – ‘sum of degrees equal 0’.
- Returns:
Clustering instance initialized for Markov Stability optimization.
- Return type:
Instance of `Clustering’ or ‘SparseClustering’