from IPython.display import Image
Image(filename="SlidesGudhi/GeneralPipeLine.png")
Image(filename="SlidesGudhi/SimplexTree.png")
In this tutorial we will study three datasets.
For each dataset, we load the data in Python, build various filtrations of simplicial complexes, and plot persistence diagrams.
We need the following Python modules:
import numpy as np
import pandas as pd
import pickle
import gudhi as gd
from pylab import *
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import Image
from sklearn.neighbors import KernelDensity
%matplotlib inline
Download the dataset from this link: data_acc (save the file without opening it).
We load the data with the pickle module:
f = open("data_acc","rb")
data = pickle.load(f)
f.close()
For three walkers A, B and C, we have 100 time series of the acceleration in $\mathbb R^3$, each recorded at 200 consecutive time points.
data_A = data[0]
data_B = data[1]
data_C = data[2]
label = data[3]
print(label)
The object data_A is a list of 100 time series of the 3D acceleration for walker A. Let's have a look at the dimensions of data_A:
print(np.shape(data_A))
We now represent the trajectory of accelerations for the first time series of walker A:
data_A_sample = data_A[0]
ax = plt.figure().add_subplot(projection='3d')
ax.plot(data_A_sample[:,0], data_A_sample[:,1], data_A_sample[:,2])
Image("SlidesGudhi/RipsDataPoint.png")
The RipsComplex()
function creates a one-skeleton graph from a point cloud or from a distance matrix (see below):
Rips_complex_sample = gd.RipsComplex(points = data_A_sample,max_edge_length=0.8 )
When creating a simplicial complex from this one-skeleton graph, the Rips construction first inserts the one-skeleton into the data structure and then expands the simplicial complex up to the requested dimension:
Rips_simplex_tree_sample = Rips_complex_sample.create_simplex_tree(max_dimension=3)
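As a concept check, here is a plain-NumPy sketch (not how GUDHI computes it internally) of the one-skeleton the Rips complex starts from: an edge joins every pair of points whose Euclidean distance is at most max_edge_length.

```python
import numpy as np

def rips_one_skeleton(points, max_edge_length):
    """Edges of the Rips one-skeleton: all pairs of points at
    Euclidean distance <= max_edge_length."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= max_edge_length:
                edges.append((i, j))
    return edges

# Toy example: two close points and one far away.
toy = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
print(rips_one_skeleton(toy, max_edge_length=0.8))  # [(0, 1)]
```

Raising max_edge_length adds more edges, and the expansion step then fills in the higher-dimensional simplices spanned by them.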
Now we can compute persistence on the simplex tree structure using the persistence()
method:
diag_Rips = Rips_simplex_tree_sample.persistence()
diag_Rips
We also have access to the persistence intervals per dimension using the persistence_intervals_in_dimension()
method.
For dimension 0:
Rips_simplex_tree_sample.persistence_intervals_in_dimension(0)
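As an aside, the dimension-0 intervals of a Rips filtration coincide with single-linkage merge heights: every point is born at 0 and a component dies when it merges into an older one, with one component living forever. A minimal union-find sketch on a toy 1D point set (illustration only, not GUDHI's algorithm):

```python
import numpy as np
from itertools import combinations

def dim0_intervals(points):
    """Dimension-0 Rips persistence via single linkage: process edges
    by increasing length; each merge of two components kills one of them."""
    pts = np.asarray(points, dtype=float).reshape(len(points), -1)
    parent = list(range(len(pts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((np.linalg.norm(pts[i] - pts[j]), i, j)
                   for i, j in combinations(range(len(pts)), 2))
    intervals = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # two components merge at distance d
            parent[rj] = ri
            intervals.append((0.0, d))
    intervals.append((0.0, float("inf")))  # the surviving component
    return intervals

print(dim0_intervals([0.0, 0.1, 0.9, 1.0]))
```

For these four points, two short intervals come from the two tight pairs, one longer interval records the merge of the two pairs, and one interval is infinite.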
Finally we can plot the persistence diagram:
gd.plot_persistence_diagram(diag_Rips)
Image("SlidesGudhi/Alpha.png")
We define an alpha complex for data_A_sample
by constructing a simplex tree from the Delaunay triangulation of the points.
Alpha_complex_sample = gd.AlphaComplex(points = data_A_sample)
Alpha_simplex_tree_sample = Alpha_complex_sample.create_simplex_tree(max_alpha_square=0.3)
diag_Alpha = Alpha_simplex_tree_sample.persistence()
gd.plot_persistence_diagram(diag_Alpha)
With this dataset, we study configurations of protein binding. This example is borrowed from "Using persistent homology and dynamical distances to analyze protein binding" by V. Kovacev-Nikolic, P. Bubenik, D. Nikolic and G. Heo, Stat. Appl. Genet. Mol. Biol., 2016.
The paper compares closed and open forms of the maltose-binding protein (MBP), a large biomolecule consisting of 370 amino acid residues. The analysis is not based on geometric distances in $\mathbb R^3$ but on a metric of dynamical distances defined by $$D_{ij} = 1 - |C_{ij}|,$$ where $C$ is the correlation matrix between residues.
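A tiny worked example of this transformation (the correlation values below are made up for illustration):

```python
import numpy as np

# A hypothetical 3x3 correlation matrix between three residues:
C = np.array([[ 1.0,  0.8, -0.3],
              [ 0.8,  1.0,  0.1],
              [-0.3,  0.1,  1.0]])

# Dynamical distance: D_ij = 1 - |C_ij|
D = 1.0 - np.abs(C)
print(D)
```

D is symmetric with a zero diagonal: strongly correlated or anti-correlated residues end up close together, uncorrelated residues end up far apart.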
Correlation matrices between residues can be found at this link (thank you to the authors for sharing data).
The next statements will be useful to load the fourteen correlation matrices in Python:
path_file = "./Peter corr_ProteinBinding/"
files_list = [
'1anf.corr_1.txt',
'1ez9.corr_1.txt',
'1fqa.corr_2.txt',
'1fqb.corr_3.txt',
'1fqc.corr_2.txt',
'1fqd.corr_3.txt',
'1jw4.corr_4.txt',
'1jw5.corr_5.txt',
'1lls.corr_6.txt',
'1mpd.corr_4.txt',
'1omp.corr_7.txt',
'3hpi.corr_5.txt',
'3mbp.corr_6.txt',
'4mbp.corr_7.txt']
len(files_list)
We now load the 14 correlation matrices using the pandas
module:
corr_list = [pd.read_csv(path_file + u,
                         header=None,
                         sep=r"\s+") for u in files_list]
where corr_list
is the list of the 14 correlation matrices.
We now iterate over the list to compute the distance matrix associated with each configuration:
dist_list = [1 - np.abs(c) for c in corr_list]
We can print the first lines of a distance matrix with the head()
method of a pandas
DataFrame:
mat_dist = dist_list[0]
mat_dist.head()
Image("SlidesGudhi/RipsDistanceMatrix.png")
We compute the Rips complex filtration from this distance matrix:
rips_complex_ref = gd.RipsComplex(distance_matrix=mat_dist.values,
max_edge_length=0.8)
and then we can create the simplex tree for the Rips filtration:
simplex_tree_ref = rips_complex_ref.create_simplex_tree(max_dimension=2)
Exercise. Compute and plot the persistence for the Rips filtration.
We use the crater dataset to illustrate how to compute the cubical complex filtration.
f = open("crater_tuto","rb")
crater = pickle.load(f)
f.close()
The point cloud is composed of a central annulus and four clusters:
plt.scatter(crater[:,0],crater[:,1],s=0.1)
We can also visualize the density of the distribution (with a standard 2d-kernel estimator):
sns.kdeplot(x=crater[:,0], y=crater[:,1], fill=True, cmap="PuBu", bw_method=0.3)
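For intuition, a Gaussian kernel density estimate can be written out by hand: the density at a grid point is the average of isotropic Gaussian bumps centred at the sample points. A plain-NumPy sketch (not seaborn's or scikit-learn's implementation):

```python
import numpy as np

def gaussian_kde_2d(sample, grid, bandwidth=0.3):
    """2D Gaussian KDE: average of a Gaussian of width `bandwidth`
    centred on each sample point, evaluated at each grid point."""
    sample = np.asarray(sample, dtype=float)        # (n, 2)
    grid = np.asarray(grid, dtype=float)            # (m, 2)
    diff = grid[:, None, :] - sample[None, :, :]    # (m, n, 2)
    sq = (diff ** 2).sum(axis=2)                    # squared distances
    norm = 2.0 * np.pi * bandwidth ** 2             # 2D Gaussian constant
    return np.exp(-sq / (2.0 * bandwidth ** 2)).sum(axis=1) / (len(sample) * norm)

# Sanity check on a fine grid around a single sample point:
xs = np.linspace(-3, 3, 121)
grid = np.array([[u, v] for u in xs for v in xs])
dens = gaussian_kde_2d([[0.0, 0.0]], grid)
cell = (xs[1] - xs[0]) ** 2
print(dens.sum() * cell)   # the density integrates to (approximately) one
```

The bandwidth plays the same smoothing role as the bw parameter passed to the seaborn and scikit-learn estimators.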
Image("SlidesGudhi/Cubical.png")
We create a grid on [0,10] x [0,10]:
xval = np.arange(0,10,0.05)
yval = np.arange(0,10,0.05)
nx = len(xval)
ny = len(yval)
and we compute a standard kernel density estimator on this grid:
kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(crater)
positions = np.array([[u, v] for u in xval for v in yval])
scores = -np.exp(kde.score_samples(positions))
print(max(scores))
print(min(scores))
Now we can compute the cubical complex filtration:
cc_density_crater = gd.CubicalComplex(dimensions=[nx, ny],
                                      top_dimensional_cells=scores)
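Conceptually, this is the sublevel-set filtration of the scores: at threshold t, exactly the cells whose value is at most t are present, and persistence tracks how their union's topology changes as t grows. A tiny sketch with made-up values:

```python
import numpy as np

# A hypothetical 3x3 grid of filtration values standing in for `scores`:
values = np.array([[0.1, 0.4, 0.2],
                   [0.5, 0.9, 0.3],
                   [0.2, 0.6, 0.1]])

# At each threshold t, count the cells already present in the sublevel set.
for t in (0.2, 0.5, 1.0):
    print(t, int((values <= t).sum()))
```

The minima of the scores (here, minus the density) create components early in the filtration, which is why the density peaks of the crater dataset show up as long intervals.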
Note that the filtration is not stored in a simplex tree in this case.
To compute persistent homology, we directly use the persistence()
method for the CubicalComplex object:
pers_density_crater = cc_density_crater.persistence()
gd.plot_persistence_diagram(pers_density_crater)
It seems that the longest interval is not set at infinity:
pers_density_crater
Exercise. Fix Pawel's code ;).