The Shape of Data

Data is a large part of our everyday lives. We take in all kinds of data every day: weather forecasts, political polls, meal costs, and much more. Companies collect data as well; medical, consumer, weather, and other kinds of data have been collected for years. Recently, we have started learning how to use this data to model and predict future outcomes. This is a hot area in industry, known as big data and data science. One particular field of study is concerned with the shape of data.

The anthem of Topological Data Analysis (TDA) is that data has shape and that shape matters. We would like to take a data sample and describe the topological space it was sampled from. This helps us predict where new data may land. TDA has been used in many fields, such as medical imaging [1], sensor networks [2], sports analysis [3], disease progression [4], image analysis [5], signal analysis [6], and many others. In this post, we are just going to give the basic idea. Suppose we are given a data set that looks like this.


It seems obvious to the human eye that this data has been sampled from a circular object. This is because we are wired to recognize patterns, especially ones as easy as this one. But how could we get a computer to understand this pattern? This is where TDA comes in. Imagine that we begin growing a ball around each point.


As the balls grow, they will intersect. When two balls intersect, we place a line segment (edge) between their centers. When three balls share a common point, we fill in a triangle. When four balls share a common point, we fill in a tetrahedron, and so on.
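The growing-balls construction can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the function names `circle_sample` and `rips_complex` and the sample points are hypothetical, and balls of radius `r` are taken to intersect when their centers are at most `2 * r` apart. For simplicity it fills in a triangle whenever all three pairs of balls intersect (the Vietoris-Rips rule), which is slightly more permissive than requiring the three balls to share a common point.

```python
import math
from itertools import combinations


def circle_sample(n, radius=1.0):
    """Return n evenly spaced points on a circle (a stand-in for the data set)."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def rips_complex(points, r):
    """Edges and triangles obtained from balls of radius r around each point."""
    n = len(points)
    # Two balls of radius r intersect when their centers are at most 2r apart.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= 2 * r]
    edge_set = set(edges)
    # Vietoris-Rips rule (an assumption of this sketch): add a triangle
    # whenever all three of its edges are already present.
    triangles = [t for t in combinations(range(n), 3)
                 if all(pair in edge_set for pair in combinations(t, 2))]
    return edges, triangles


if __name__ == "__main__":
    pts = circle_sample(8)
    for r in (0.3, 0.4, 0.75):
        edges, triangles = rips_complex(pts, r)
        print(f"r={r}: {len(edges)} edges, {len(triangles)} triangles")
```

For small radii there are no edges at all; as `r` grows, neighbouring points get connected first, and triangles only appear once the balls are large enough to reach past the nearest neighbour.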


Eventually, the balls will have grown enough to bound a gap.


As we continue growing the balls, the gap will eventually close. Beyond this point nothing changes topologically, so we can tell the computer to stop here. What we have done is create what is called a filtration, which is simply an increasing chain of spaces. To capture the topological properties, we use homology, which counts holes, and apply it to each space in the filtration. Then, more or less, we measure how long the holes last. The idea is that the longer-lasting holes are more important to the topological properties of the space the data was sampled from. This process is aptly called persistent homology.

There are, of course, some fine details excluded from this summary, especially the fact that TDA neither begins nor ends with persistent homology. If you would like to know more, please check out some of the references I am leaving at the bottom. I will be making a post (or series of posts) soon that will go a little deeper into the theory.
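To make the hole-counting step described above concrete, here is a rough sketch in Python. It rests on several assumptions: the names `gf2_rank`, `betti_numbers`, and `hole_lifetime` are hypothetical, the complex is built with the Vietoris-Rips rule (a triangle is added once all three of its edges exist), and homology is computed with coefficients mod 2. It also only counts components and holes at each radius on a fixed grid, rather than tracking individual holes through the filtration the way a real persistent homology algorithm does.

```python
import math
from itertools import combinations


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]  # cancel the leading bit and keep reducing
    return len(pivots)


def betti_numbers(points, r):
    """Return (b0, b1): connected components and 1-dimensional holes at radius r."""
    n = len(points)
    edges = [e for e in combinations(range(n), 2)
             if math.dist(points[e[0]], points[e[1]]) <= 2 * r]
    index = {e: k for k, e in enumerate(edges)}
    triangles = [t for t in combinations(range(n), 3)
                 if all(pair in index for pair in combinations(t, 2))]
    # Boundary of an edge is its two endpoints; of a triangle, its three edges.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << index[pair] for pair in combinations(t, 2)) for t in triangles]
    rank1, rank2 = gf2_rank(d1), gf2_rank(d2)
    b0 = n - rank1                 # vertices minus independent edges
    b1 = len(edges) - rank1 - rank2  # cycles that are not filled in by triangles
    return b0, b1


def hole_lifetime(points, radii):
    """First and last radius on the grid at which at least one hole exists."""
    alive = [r for r in radii if betti_numbers(points, r)[1] > 0]
    return (min(alive), max(alive)) if alive else None


if __name__ == "__main__":
    pts = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
           for k in range(8)]
    for r in (0.3, 0.4, 0.75, 0.95):
        print(r, betti_numbers(pts, r))
    print(hole_lifetime(pts, [0.3, 0.4, 0.75, 0.95]))
```

On eight evenly spaced points of the unit circle, the single hole is present over a wide range of radii before it closes, which is the kind of long-lived feature the method is designed to pick out.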



The first six references are applications of persistent homology.

[1] Lee, Hyekyoung, et al. “Persistent brain network homology from the perspective of dendrogram.” Medical Imaging, IEEE Transactions on 31.12 (2012): 2267-2277.

[2] De Silva, Vin, and Robert Ghrist. “Homological sensor networks.” Notices of the American mathematical society 54.1 (2007).

[3] Goldfarb, Daniel. “An Application of Topological Data Analysis to Hockey Analytics.” arXiv preprint arXiv:1409.7635 (2014).

[4] Nicolau, Monica, Arnold J. Levine, and Gunnar Carlsson. “Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival.” Proceedings of the National Academy of Sciences 108.17 (2011): 7265-7270.

[5] Bendich, P., H. Edelsbrunner, and M. Kerber. “Computing Robustness and Persistence for Images.” IEEE Transactions on Visualization and Computer Graphics 16.6 (2010): 1251-1260.

[6] Perea, Jose A., and John Harer. “Sliding Windows and Persistence: An Application of Topological Methods to Signal Analysis.” Foundations of Computational Mathematics 15.3 (2014): 799-838.

The next few are references for anyone who would like to get started studying the subject.

[7] Edelsbrunner, Herbert, and John Harer. Computational topology: an introduction. American Mathematical Soc., 2010.

[8] Bubenik, Peter, and Jonathan A. Scott. “Categorification of persistent homology.” Discrete & Computational Geometry 51.3 (2014): 600-627.

[9] Lesnick, Michael. “The theory of the interleaving distance on multidimensional persistence modules.” Foundations of Computational Mathematics 15.3 (2015): 613-650.

