If you haven’t already, I would strongly recommend glancing over part 1 of this post before continuing. In it we talked about a method for detecting a single changepoint through techniques such as maximum log-likelihood estimation. The more difficult question now is: how can we search for multiple changepoints?
The simplest thing to do would be to initially apply our Single Changepoint Detection (SCD) method to the series we want to analyse. If there is no changepoint found we are done, but if there is a changepoint found we end up with two distinct segments.
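To make the idea concrete, here is a minimal sketch in Python. It assumes the natural next step is to re-apply SCD to each of the two segments in turn (an approach usually called binary segmentation), and it assumes we are only looking for changes in the mean. The function names `best_split` and `binary_segmentation` and the `threshold` value are illustrative choices, not part of the method described in part 1.

```python
def best_split(segment):
    """Return (statistic, index) of the best single change in mean.

    The statistic is the drop in squared error from fitting two means
    instead of one. It is an assumed stand-in for the SCD test.
    """
    n = len(segment)
    total = sum(segment)
    best_stat, best_tau = 0.0, None
    left_sum = 0.0
    for tau in range(1, n):
        left_sum += segment[tau - 1]
        mean_left = left_sum / tau
        mean_right = (total - left_sum) / (n - tau)
        stat = tau * (n - tau) / n * (mean_left - mean_right) ** 2
        if stat > best_stat:
            best_stat, best_tau = stat, tau
    return best_stat, best_tau


def binary_segmentation(data, threshold, start=0):
    """Apply the single-changepoint search, then recurse on both segments."""
    if len(data) < 2:
        return []
    stat, tau = best_split(data)
    if tau is None or stat <= threshold:
        return []  # no changepoint found in this segment, so we are done
    left = binary_segmentation(data[:tau], threshold, start)
    right = binary_segmentation(data[tau:], threshold, start + tau)
    return left + [start + tau] + right


# Toy series with changes in mean after the 30th and 60th observations.
series = [0.0] * 30 + [5.0] * 30 + [-2.0] * 30
changepoints = binary_segmentation(series, threshold=10.0)
```

Each returned index marks the first observation of a new segment. On real, noisy data the answer depends heavily on the threshold, so treat the value used here as a placeholder.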
Previously we discussed what changepoints are, how they work and examples of where you might find them. Today, however, we shall go one step further and discuss how to find these changepoints.
Multiple changepoints have been found; they are the red lines.
Let’s begin by trying to figure out a way of detecting if just a single changepoint exists in a set of data. The simplest way to pose this problem is to describe it through a ‘Hypothesis Test’, where we have a null hypothesis ($H_0$) and an alternative hypothesis ($H_1$),
$H_0$: no changepoint, $H_1$: one changepoint.
If you are unfamiliar with hypothesis tests, the idea is to run some kind of test that will give us a value. If that value is above a certain threshold, we can reject the null hypothesis and, in our case, accept the alternative hypothesis. If not, we have failed to disprove the null hypothesis, so we carry on as though it were true. In layman’s terms, it’s the classic innocent until proven guilty. Alternatively check it out yourself!
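As a rough illustration of such a test, the sketch below checks for a single change in mean. It assumes the data are Gaussian with variance 1, in which case twice the gain in log-likelihood from allowing one changepoint equals the drop in squared error. The function name `single_changepoint_test` and the threshold used are assumptions made for this example only.

```python
import math


def single_changepoint_test(data, threshold):
    """Test 'no changepoint' against 'one changepoint' for a change in mean.

    Returns (reject_null, changepoint_index_or_None, test_statistic).
    Assumes Gaussian data with unit variance (a simplifying assumption).
    """
    n = len(data)
    mean_all = sum(data) / n
    # Squared error with a single mean for the whole series (null hypothesis).
    cost_null = sum((x - mean_all) ** 2 for x in data)

    best_stat, best_tau = 0.0, None
    for tau in range(1, n):
        left, right = data[:tau], data[tau:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        # Squared error with separate means either side of tau (alternative).
        cost_alt = (sum((x - mean_left) ** 2 for x in left)
                    + sum((x - mean_right) ** 2 for x in right))
        stat = cost_null - cost_alt
        if stat > best_stat:
            best_stat, best_tau = stat, tau

    reject_null = best_stat > threshold
    return reject_null, (best_tau if reject_null else None), best_stat


# Toy data: the mean jumps from 0 to 5 after the 50th observation.
toy = [0.0] * 50 + [5.0] * 50
# Placeholder threshold that grows with the series length (an assumption).
reject, tau, stat = single_changepoint_test(toy, threshold=3 * math.log(len(toy)))
```

The loop tries every possible changepoint location and keeps the one that improves the fit the most. Only if that improvement beats the threshold do we reject the null hypothesis.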
As you probably know, Statistics is often used to try and analyse data. There are many different types of data, but the data we are interested in studying in this post is time-series data. This is exactly what it sounds like: information collected about some process over a time period. This time period can range from the small scale of seconds for signal processing all the way to larger scales such as years, as seen in financial data.
When we look at time-series data we often see sudden changes in the pattern of the data. We use the term ‘changepoints’ to describe the places where this occurs. In a more mathematical sense, changepoints tend to happen when there is some change in the parameters of the data, i.e. in the mean or variance of the series. Sometimes changepoints even occur when more than one parameter of the data changes.
The time-series data that changepoint analysis is used for crops up in many different disciplines. In finance it is needed to keep track of volatility in the stock market, whilst climatology harnesses it to detect changes in the mean temperature of the planet. Even new fitness technology like activity trackers makes use of it. In fact, just about anything that has some variation over time could have changepoint analysis applied to it, making it a worthwhile and active area of statistical research.
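To see what a change in the mean or the variance looks like in practice, here is a small simulation sketch. The segment lengths, means and standard deviations are made-up values chosen purely for illustration, as is the function name `simulate_series`.

```python
import random


def simulate_series(seed=0):
    """Simulate a series with one change in mean and one change in variance."""
    rng = random.Random(seed)
    # Segment 1: mean 0, standard deviation 1.
    segment_a = [rng.gauss(0.0, 1.0) for _ in range(100)]
    # Segment 2: the mean jumps to 3, the standard deviation stays at 1.
    segment_b = [rng.gauss(3.0, 1.0) for _ in range(100)]
    # Segment 3: the mean stays at 3, the standard deviation jumps to 3.
    segment_c = [rng.gauss(3.0, 3.0) for _ in range(100)]
    true_changepoints = [100, 200]
    return segment_a + segment_b + segment_c, true_changepoints


series, true_changepoints = simulate_series()
```

Plotting `series` against its index should show a level shift at the first changepoint and a visibly noisier stretch after the second.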
Stéphane Robin from AgroParisTech gave a presentation at the recent STOR-i Conference where he talked about the role of changepoints in genomics. This is a sub-area of genetics to do with measuring data about DNA and RNA which is found along the chromosomes. It turns out that collecting this data along the genome is akin to a time-series, since there is so much of it! Particular experiments often aim to find regions of the genome where a specific event occurs, which makes changepoint analysis an incredibly useful asset to genomics.
The slides above illustrate some of the data that changepoint methods frequently have to deal with in genomics.