Topological Data Analysis part 1

Data is a large part of our everyday lives. We personally take in all kinds of data everyday. Things such as weather, political polls, meal costs, and much much more make up most of our daily lives. Companies collect data as well. Medical, consumer, weather, and other kinds of data have been collected for years. Recently, we have started learning how to use this data to model and predict future outcomes. This is a hot field of study in industry known as big data and data science. One particular field of study is concerned with the shape of data.

The anthem of Topological Data Analysis is that data has shape and that shape matters. We would like to take a data sample and describe the topological space it was sampled from. This will help us make predictions to where new data may land. But what tools do we need? As the name suggests we are going to be looking in our topology textbooks. We need an invariant of homeomorphisms that can be described easily by linear algebra. This is so because we wish to be able to actually compute these things.

Homeomorphisms would be ideal, however they won’t work. It can be quite difficult to even come up with a homeomorphism. So we look to the next best thing, homotopy. Again homotopy groups are difficult to even write down, we wouldn’t want to try to code them. Okay, well there is one more ‘H’ word that might help us out, and you may have guessed it! It’s homology. Homology is invariant under homeomorphisms, meaning if two topological spaces are homeomorphic, then they have the same homology. Hence, if two spaces do not have the same homology, then they cannot be topologically the same. Also, homology is easily describable through linear algebra. This makes it incredibly easy to compute! The main problem is that two problems with the same homology may still not be homeomorphic. This problem is handled with the principle of Occam’s Razor and will be explained later.

In this post we will give an introduction to simplicial complexes. We will also define two simplicial complexes, the  Rips and Čech complexes, that are quite popular in practice. So let us begin with the definition of a simplex.

DEFINITION: An m-simplex  is the convex hull of  m+1 points (called vertices) in \mathbb{R}^n. We describe a simplex by its vertices. i.e. \sigma = \{x_0, ..., x_m\} will denote an m-simplex.

Persistent7

From left to right we have a 0-simplex, 1-simplex, 2-simplex, and 3-simplex.

If \sigma is an msimplex, and \tau is an n-simplex for n < m, and if the vertex set of \tau is contained in the vertex set of \sigma we say \tau \le \sigma. We are immediately able to build structures with these simplices.

DEFINITION: simplicial complex  K is a collection of simplices satisfying the following two rules.

  1. If \sigma \in K and \tau \le \sigma then \tau \in K.
  2. If \sigma, \tau \in K then \tau\cap\sigma = \emptyset or \tau\cap\sigma\in K.

The first property is commonly known as downward closure. We refer to the second one as a minimal incidence property.

On the left is a simplicial complex. The structure on the right fails property 2.

DEFINITION: simplicial map between, \phi : K\to L is a map so that whenever \{x_0, ..., x_m\} is a simplex in K, we have \{\phi(x_0), ..., \phi(x_m)\} is a simplex in L.

LEMMA:  Simp, with objects as simplicial complexes and morphisms as simplicial maps, forms a category.

Proof: The reader is encouraged to try to prove this using the definition of categories in Categories and Functors.

While it is great to have an idea for what simplicial complexes look like, this definition is actually slightly too concrete. For this reason, we need to define abstract simplicial complexes. The idea of these objects is that they are abstract enough to be useful in theory and computation, and it is okay to worry about the nice space that they will fit in later.

DEFINITION: An abstract simplicial complex K is a finite collection of sets satisfying, x\in K and y\subset x implies y\in K.

Notice we dropped the minimal incidence property as well as the necessity of having a vertex set of points in Euclidean space. The notion of a simplicial map is exactly the same. Hence we obtain a category AbSimp. What we will see next is a justification for using abstract simplicial complexes. First note, that for every simplicial complex K there is an abstract simplicial complex K^a with the same vertex set of K.

LEMMA: There is a functor u: AbSimp\to Simp so that K \cong u(K^a) for all K\in Simp and L \cong (u(L))^a for all L \in AbSimp.

This lemma simply says that we can associate each abstract simplicial complex with a simplicial complex in a nice way. This allows us to work with these abstract objects and then fit them into a nice space later.

Now, we will take a look at two popular abstract simplicial complexes. One gives an accurate description of the space but is not easily computable, while the other is not as accurate but is easily computable.

DEFINITION: Let X be a finite set of points in some metric space. Let \varepsilon be a positive real number. We define the Čech complex at scale \varepsilon to be the set of all simplices whose vertices lie in X and the intersection of the balls centered at these vertices with radius \varepsilon is nonempty. In symbols

\check{C}_\varepsilon (X) = \{\sigma\neq \emptyset\mid \cap_{x\in\sigma}B_\varepsilon(x) \neq \emptyset\}

Where B_\varepsilon(x) denotes the ball of radius \varepsilon centered at x.

Persistent5

The Nerve Theorem  tells us that our space is properly described by this complex. However, computationally, this complex is taxing. The problem is one is trying to find the intersection of metric balls which is much harder than just simply checking a condition. So we look to the next complex.

DEFINITION: Let X be a finite set of points in some metric space. Let \varepsilon be a positive real number. We define the Rips complex at scale \varepsilon to be the set of all simplices whose vertices are within 2\varepsilon of each other. In symbols,

R_\varepsilon (X) = \{\sigma \mid \forall x,y\in\sigma d(x,y) \le 2\varepsilon\}

Persistent6

Notice the Rips condition is simply one we have to check. Though as previously stated it does not give an accurate description of the space since as we see in the picture we would fill the triangle in well befor the Čech complex would. However! It does it “well enough” as described by the following Lemma.

LEMMA: (RIPS) R_\varepsilon \subset \check{C}_{\sqrt{2} \varepsilon} \subset R_{\sqrt{2} \varepsilon}.

Really this thing says that the Rips complex approximates the Čech complex well enough.

That will do it for part 1. It is meant to be an introduction to simplicial complex. Part 2 will cover enough Homology to understand Persistence. In the future we will look at concrete examples and the categorification of the field.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s