TDA and Statistics using Gudhi Python Library

Part 1 - Data, Filtrations and Persistence Diagrams

Recap : general pipeline

Python Gudhi documentation

In [5]:
from IPython.display import Image
Image(filename="SlidesGudhi/GeneralPipeLine.png") 
Out[5]:
In [6]:
Image(filename="SlidesGudhi/SimplexTree.png") 
Out[6]:

In this tutorial we will study three datasets :

  • 3d acceleration for three walkers.
  • Protein binding : closed and open forms of the maltose-binding protein (MBP), a large biomolecule consisting of 370 amino acid residues
  • crater data : point cloud in $\mathbb R^2$

For each dataset, we load the data in Python, we create various filtrations of simplicial complexes and we plot persistence diagrams.

Python Modules

We need the following Python modules:

In [ ]:
import numpy as np
import pandas as pd
import pickle as pickle
import gudhi as gd
from pylab import *
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import Image
from sklearn.neighbors.kde import KernelDensity
%matplotlib inline

Dataset 1: Sensor data

Download the dataset from this link data_acc file (save the file without opening it).

We load the data with the pickle module :

In [ ]:
f = open("data_acc","rb")
data = pickle.load(f)
f.close()

For three walkers A, B and C, we have 100 times series of the acceleration recorded in $\mathbb R^3$ at 200 consecutive points.

In [ ]:
data_A = data[0]
data_B = data[1] 
data_C = data[2]
label = data[3]
print(label)

The object data_A is a list of 100 time series of the 3d acceleration for Walker A, let's have a look at the dimensions of data_A:

In [ ]:
print(np.shape(data_A))

We now represent the trajectory of accelerations for the first time serie of walker A :

In [ ]:
data_A_sample = data_A[0]
plt.gca(projection='3d')
plt.plot(data_A_sample [:,0],data_A_sample [:,1],data_A_sample [:,2] )

Rips filtration defined from a point cloud

Documentation

In [7]:
Image("SlidesGudhi/RipsDataPoint.png")
Out[7]:

The RipsComplex() function creates a one-skeleton graph from a point cloud or from a distance matrix(see further):

In [ ]:
Rips_complex_sample = gd.RipsComplex(points = data_A_sample,max_edge_length=0.8 )

When creating a simplicial complex from this one-skeleton graph, Rips inserts the one skeleton graph into the data structure, and then expands the simplicial complex when required:

In [ ]:
Rips_simplex_tree_sample = Rips_complex_sample.create_simplex_tree(max_dimension=3) 

Persistence

Documentation

Now we can compute persistence on the simplex tree structure using the persistence() method:

In [ ]:
diag_Rips = Rips_simplex_tree_sample.persistence()
diag_Rips

We also have access to persistence_intervals per dimension using the persistence_intervals_in_dimension() method.

For dimension 0:

In [ ]:
Rips_simplex_tree_sample.persistence_intervals_in_dimension(0)

Finally we can plot the persistence diagram:

In [ ]:
gd.plot_persistence_diagram(diag_Rips)

Alpha Complex filtration defined from a point cloud in $\mathbb R^3$

Documentation

In [8]:
Image("SlidesGudhi/Alpha.png")
Out[8]:

We define an alpha_complex for data_A_sample by constructing a Simplex_tree using Delaunay Triangulation.

In [ ]:
Alpha_complex_sample = gd.AlphaComplex(points = data_A_sample)
Alpha_simplex_tree_sample = Alpha_complex_sample.create_simplex_tree(max_alpha_square=0.3) 

Persistence Documentation

In [ ]:
diag_Alpha = Alpha_simplex_tree_sample.persistence()
gd.plot_persistence_diagram(diag_Alpha)

Dataset 2: Protein Binding

Annotation Examples

[[From Kovacev-Nikolic 2016](https://arxiv.org/pdf/1412.1394.pdf)]

With this dataset, we study configurations of protein binding. This example is borrowed from Using persistent homology and dynamical distances to analyze protein binding, V. Kovacev-Nikolic, P. Bubenik, D. Nikolic and G. Heo. Stat Appl Genet Mol Biol 2016 .

The paper compares closed and open forms of the maltose-binding protein (MBP), a large biomolecule consisting of 370 amino acid residues. The analysis is not based on geometric distances in $\mathbb R^3$ but on a metric of dynamical distances defined by $$D_{ij} = 1 - |C_{ij}|,$$ where $C$ is the correlation matrix between residues.

Correlation matrices between residues can be found at this link (thank you to the authors for sharing data).

The next statments will be usefull to load the fourteen correlation matrices in Python:

In [ ]:
path_file = "./Peter corr_ProteinBinding/"
files_list = [
'1anf.corr_1.txt',
'1ez9.corr_1.txt',
'1fqa.corr_2.txt',
'1fqb.corr_3.txt',
'1fqc.corr_2.txt',
'1fqd.corr_3.txt',
'1jw4.corr_4.txt',
'1jw5.corr_5.txt',
'1lls.corr_6.txt',
'1mpd.corr_4.txt',
'1omp.corr_7.txt',
'3hpi.corr_5.txt',
'3mbp.corr_6.txt',
'4mbp.corr_7.txt']
len(files_list)

We now load the 14 correlation matrices using the pandas module:

In [ ]:
corr_list = [pd.read_csv(path_file+u,
                         header=None,
                         delim_whitespace=True) for u in files_list]

where corr_list is the list of the 14 correlation matrices.
We now iterate in the list to compute the matrix of distances associated to each configuration:

In [ ]:
dist_list = [1- np.abs(c) for c in corr_list]

We can print the first lines of the distance matrices using the method of distances with the head() method for a pandas object:

In [ ]:
mat_dist = dist_list[0]
mat_dist.head()

Rips filtration defined from a matrix of distances

Documentation

In [ ]:
Image("SlidesGudhi/RipsDistanceMatrix.png")

We compute the Rips complex filtration from this matrix of distance:

In [ ]:
rips_complex_ref = gd.RipsComplex(distance_matrix=mat_dist.values,
                                  max_edge_length=0.8) 

and then we can create the simplex tree for the Rips filtration:

In [ ]:
simplex_tree_ref = rips_complex_ref.create_simplex_tree(max_dimension=2)

Persistence

Exercice. Compute and plot the persistence for the Rips filtration.

In [ ]:
 

Dataset 3: Crater dataset

We use the crater dataset to illustrate how to compute the cubical complex filtration.

In [ ]:
f = open("crater_tuto")
crater = pickle.load(f)
f.close()

The point cloud is composed of a center annulus and four clusters:

In [ ]:
plt.scatter(crater[:,0],crater[:,1],s=0.1)

We can also visualize the density of the distribution (with a standard 2d-kernel estimator):

In [ ]:
sns.kdeplot(crater, shade = True, cmap = "PuBu",bw=.3)

Cubical filtration

Documentation

In [ ]:
Image("SlidesGudhi/Cubical.png")

We create a grid on [0,10] x [0,10]

In [ ]:
xval = np.arange(0,10,0.05)
yval = np.arange(0,10,0.05)
nx = len(xval)
ny = len(yval)

and we compute a standard kernel density estimator on this grid:

In [ ]:
kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(crater)
positions = np.array([[u,v] for u in xval for v in yval ])
scores =  -exp(kde.score_samples(X= positions))
In [ ]:
print(max(scores))
print(min(scores))

Now we can compute the cubical Complex filtration:

In [ ]:
cc_density_crater= gd.CubicalComplex(dimensions= [nx ,ny],
                                    top_dimensional_cells = scores)

Note that the filtration is not stored in simplex tree in this case.

Persistence

To compute persistent homology, we directly use the persistence() method for the CubicalComplex object:

In [ ]:
pers_density_crater = cc_density_crater.persistence()
In [ ]:
gd.plot_persistence_diagram(pers_density_crater)

It seems that the longest interval is not set at infinity:

In [ ]:
pers_density_crater

Exercice. Fix Pawel's code ;).

In [ ]: