<- c("neuroblastoma", "ggplot2", "changepoint", "binsegRcpp", "data.table",
packages "devtools", "patchwork", "binsegRcpp", "kableExtra", "gfpop")
<- packages %in% installed.packages()
found install.packages(packages[!found], repos='http://cran.us.r-project.org')
for(pack in packages) library(pack, character.only=TRUE)
Test 1: Binary Segmentation using the binsegRcpp package.
Step 1: Filtering and plotting the data
The objective of this test is to generate binary segmentation models with up to five segments from the neuroblastoma profiles data, for profile.id=4 and chromosome=2. The independent variable is the position in base pairs, while the dependent variable is the normalized logratio.
First, we need to filter the profiles dataframe of the neuroblastoma dataset in order to keep only the data points corresponding to profile.id=4 and chromosome=2. The resulting information is stored as a dataframe in nb.profiles. A plot is generated in order to better visualize the data.
```r
data(neuroblastoma) # Load the dataset
# Select the profiles dataframe
nb.profiles <- neuroblastoma[['profiles']]
# Logical vector indicating if the particular probe is going to be included
filter <- nb.profiles[['profile.id']] == 4 & nb.profiles[['chromosome']] == 2
# Filter the dataset according to the logical vector
nb.profiles <- data.frame(nb.profiles[filter,])
# Generate the plot
plot(nb.profiles[['position']], nb.profiles[['logratio']], type="l", col="blue",
     xlab="Position in base pairs", ylab="Normalized logratio")
grid()
mtext(line=2, cex=1.2, adj=0.5,
      expression(bold("Neuroblastoma: Probe logratio vs Position")))
mtext(line=1, cex=1, adj=0.5, "profile.id = 4, chromosome = 2")
```
Step 2: Generating the segmentation models with binsegRcpp.
It is evident that there are disruptions in the distribution of the data. In particular, the mean seems to shift at certain positions (assuming a normal distribution). Therefore, it is ideal to use a changepoint detection algorithm. Such algorithms partition a signal into a number of segments, so that the distribution parameters fit the measured data better within each segment. As a result, the signal is modeled by several distributions, one per segment, instead of only one (Truong et al., 2020).
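As a toy illustration of the idea before applying it to the real data (added here for illustration, assuming the changepoint package loaded above), consider a simulated signal with a single shift in mean:

```r
# Illustrative only: a simulated signal whose mean shifts once at index 50
set.seed(1)
sim <- c(rnorm(50, mean=0), rnorm(50, mean=2))
cpts(cpt.mean(sim))  # should report a changepoint at (or very near) index 50
```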
There are two main packages which implement the Binary Segmentation algorithm. In this first test, the binsegRcpp package will be used. It generates segmentation models for normally distributed data with changes in mean. It is only necessary to specify the data and the maximum number of segments (in this case 5). After computing the model, the coef method creates a data table containing the start, end, and mean for every model from one segment (the whole signal) up to the specified maximum. In this case, it computes the parameters for models with 1, 2, 3, 4, and 5 segments.
```r
# Generate the changepoint model
models.binsegRcpp <- binsegRcpp::binseg_normal(nb.profiles[['logratio']], 5)
# Obtaining the table with parameters for models from 1 to 5 segments.
models.binsegRcpp.coef <- coef(models.binsegRcpp)
models.binsegRcpp.coef
```

```
##     segments start end         mean
##  1:        1     1 234 -0.020921530
##  2:        2     1  41  0.351231083
##  3:        2    42 234 -0.099979858
##  4:        3     1  41  0.351231083
##  5:        3    42 157 -0.168360880
##  6:        3   158 234  0.003035709
##  7:        4     1  41  0.351231083
##  8:        4    42 113  0.005885206
##  9:        4   114 157 -0.453490839
## 10:        4   158 234  0.003035709
## 11:        5     1  41  0.351231083
## 12:        5    42 113  0.005885206
## 13:        5   114 152 -0.426212837
## 14:        5   153 157 -0.666259257
## 15:        5   158 234  0.003035709
```
Step 3: Creating graphs for models from one to five segments.
Finally, the generated segments will be plotted on top of the neuroblastoma data. The orange horizontal lines are the means of the segments, and the blue vertical lines are the changepoints. A facet grid is added so that each segmentation model appears in a different graph. The number on the right indicates the number of segments in each particular model.
```r
# Function to graph a table of segments with start, end, and mean
graph_segments <- function(segments_data, title){
  gg <- ggplot() +
    geom_line(data=nb.profiles, # First graph the line
              mapping=aes(x=position, y=logratio), color="#636363") +
    geom_segment(data=segments_data, # Then make horizontal segments from start to end
                 mapping=aes(x=nb.profiles[['position']][start], y=mean,
                             xend=nb.profiles[['position']][end], yend=mean),
                 color="#fd8d3c", size=0.8) +
    geom_vline(data=segments_data[start > 1], # Add vertical lines for each changepoint
               mapping=aes(xintercept=nb.profiles[['position']][start]),
               color="#56B4E9", size=0.9) +
    facet_grid(segments ~ .) + # Separate into one panel per segmentation model
    theme_bw() +
    ggtitle(title, subtitle="profile.id = 4, chromosome = 2") +
    labs(x="Position in base pairs", y="Normalized logratio") +
    theme(plot.title=element_text(face="bold", margin=margin(t=10,b=5), size=13),
          plot.subtitle=element_text(margin=margin(t=0,b=5)),
          axis.title.x=element_text(margin=margin(t=8,b=8), size=10),
          axis.title.y=element_text(margin=margin(r=8,l=8), size=10))
  return(gg)
}
# Calling the function with the binsegRcpp model.
graph_segments(models.binsegRcpp.coef, "Neuroblastoma segmentation models with binsegRcpp")
```
An interesting detail is the order in which the segments are created: in each iteration of the binary segmentation algorithm, the split which produces the largest decrease in the overall cost of the model is selected. The negative log-likelihood is usually used as the cost function.
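The following sketch (my own illustration, not the binsegRcpp implementation) shows how one iteration of the algorithm can evaluate every candidate split in constant time per candidate by using cumulative sums, and then pick the split with the best decrease:

```r
# Squared-error cost of y[start..end], computed from cumulative sums
square.loss <- function(csum, csum2, start, end){
  s <- csum[end+1] - csum[start]    # sum of the segment
  s2 <- csum2[end+1] - csum2[start] # sum of squares of the segment
  s2 - s^2/(end - start + 1)        # equals sum((y - mean(y))^2)
}
# One binary segmentation iteration: best split of y[start..end]
best.split <- function(y, start, end){
  csum <- c(0, cumsum(y))
  csum2 <- c(0, cumsum(y^2))
  whole <- square.loss(csum, csum2, start, end)
  splits <- start:(end-1)           # candidate last index of the left segment
  costs <- sapply(splits, function(k)
    square.loss(csum, csum2, start, k) + square.loss(csum, csum2, k+1, end))
  list(split=splits[which.min(costs)], decrease=whole - min(costs))
}
# For this data, the first split should match the model above (end of segment 1 at 41)
best.split(nb.profiles[['logratio']], 1, nrow(nb.profiles))
```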
Test 2: Binary Segmentation using the changepoint package.
Step 1: Computing the segmentation models with changepoint
In order to create segmentation models with the changepoint package, the first thing to do is call the cpt.mean function. It has more parameters than binseg_normal from binsegRcpp. The reason is that this package is more flexible: the user is able to choose from several changepoint detection algorithms, penalty functions, and statistical distributions.
In order to graph the data, it is convenient to create a data table similar to the one generated with the coef method in binsegRcpp. This requires more work here, since the cpt.range object in the output only contains the means for the selected model (the one with five segments).
```r
# Using the changepoint package to compute the model
models.changepoint <- cpt.mean(data=nb.profiles[['logratio']], penalty="None",
                               pen.value=0, method="BinSeg", Q=4, test.stat="Normal",
                               class=TRUE, param.estimates=TRUE)
# Creating the coef data table
models.changepoint.coef <- data.table(segments=integer(0), start=integer(0),
                                      end=integer(0), mean=double(0))
# For each model, compute and organize the mean, start and end of each segment.
for (i in 1:4){
  start <- sort(na.omit(c(1, cpts.full(models.changepoint)[i, ]+1)))
  end <- sort(na.omit(c(cpts.full(models.changepoint)[i, ],
                        length(data.set(models.changepoint)))))
  for(j in 1:(i+1)){
    tmp.mean <- mean(nb.profiles[['logratio']][start[j]:end[j]])
    # Add to the coef data table
    models.changepoint.coef <- rbind(models.changepoint.coef,
                                     list(i+1, start[j], end[j], tmp.mean))
  }
}
# Setting the names
names(models.changepoint.coef) <- c("segments", "start", "end", "mean")
```
Step 2: Creating graphs for models from two to five segments.
As the segments data was already organized in the data table, creating the graphs only requires calling the previously defined function graph_segments. The generated graph is very similar to the previous one, except that cpt.mean does not create the model with only one segment.
```r
graph_segments(models.changepoint.coef, "Neuroblastoma segmentation models with changepoint")
```
Step 3: Comparison of binsegRcpp and changepoint models.
The following ggplot shows the changepoints computed by the binsegRcpp and changepoint packages. Both of them produce exactly the same changepoints for this dataset (for the five-segment model); in order to visualize both lines properly, one of them is drawn thicker. The only difference is that the binsegRcpp package includes the last data index as a changepoint, since it considers the trivial one-segment model. A shared style object (graph_style) is created, since the same formatting will be reused later.
```r
binseg.cpts <- data.frame(changepoint=1:5, index=sort(models.binsegRcpp[['end']]),
                          package=rep("binsegRcpp", 5))
changepoint.cpts <- data.frame(changepoint=1:4, index=cpts(models.changepoint),
                               package=rep("changepoint", 4))

graph_style <- theme_bw() +
  theme(plot.title=element_text(face="bold", margin=margin(t=10,b=15), size=16),
        axis.title.x=element_text(margin=margin(t=10,b=10), size=14),
        axis.title.y=element_text(margin=margin(r=10,l=15), size=14))

ggplot(data=rbind(changepoint.cpts, binseg.cpts)) +
  geom_line(mapping=aes(x=changepoint, y=index, color=package), size=2) +
  geom_text(aes(x=changepoint-0.1, y=index+5, label=index)) +
  labs(x="Changepoint Number", y="Changepoint index", colour="Package") +
  ggtitle(label="Predicted changepoints per package") + graph_style
```
Another ggplot is created in order to compare the loss values. For binsegRcpp, these can be found in the loss column of the output data table. For the changepoint package, the equivalent values can be obtained with pen.value.full. However, when these values are plotted, it is evident that they differ greatly.
```r
binseg.loss <- data.frame(changepoint=1:5, loss=models.binsegRcpp$loss,
                          package=rep("binsegRcpp", 5))
changepoint.loss <- data.frame(changepoint=2:5, loss=pen.value.full(models.changepoint),
                               package=rep("changepoint", 4))

ggplot(data=rbind(changepoint.loss, binseg.loss)) +
  geom_line(mapping=aes(x=changepoint, y=loss, color=package), size=2) +
  geom_text(aes(x=changepoint+0.1, y=loss+0.5, label=round(loss, 2))) +
  labs(x="Segment number", y="Loss value", colour="Package") +
  ggtitle(label="Loss Value per segment") + graph_style
```
Both packages use the negative log-likelihood, which in this case reduces to the squared loss, as the cost function. This function is used to determine the next optimal changepoint in each iteration of the Binary Segmentation algorithm: for each possible changepoint in a segment, the algorithm selects the one which produces the lowest sum of left and right costs. The resulting reduction in cost is referred to as the best decrease.
The changepoint package returns the best decrease obtained in each iteration in the pen.value.full vector. However, it is important to note that this value is not the overall cost of the model, just the optimal decrease in the cost of the partitioned segment. On the other hand, binsegRcpp does return the overall cost (the sum of the costs of all segments).
This is reflected in the graph. The cost of the trivial model (one segment) is \(16.52\), as reported by binsegRcpp. Then, for the model of two segments the cost reported by binsegRcpp is \(9.64\), while the one reported by changepoint is \(6.88\). Note that \(9.64 = 16.52 - 6.88\). This shows that changepoint indeed returns the best decrease, while binsegRcpp does return the overall cost.
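If this interpretation is right, the successive differences of the binsegRcpp losses should reproduce the changepoint values. A quick sanity check (added here, using the objects computed above):

```r
# Each best decrease should equal the drop between consecutive overall costs
best.decrease <- -diff(models.binsegRcpp$loss)      # drops from k to k+1 segments
round(best.decrease, 2)                             # first value should be 6.88
all.equal(best.decrease, pen.value.full(models.changepoint))  # expected TRUE
```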
There is a logLik function in changepoint, which does calculate the overall cost, but only for the model with \(Q\) segments, not for every model from \(1\) to \(Q\). To obtain the overall cost of every model up to \(Q\) segments, it is necessary to run cpt.mean \(Q\) times.
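A sketch of that workaround (same settings as above; added here for illustration):

```r
# Overall cost of each model from 2 to 5 segments: one cpt.mean call per Q
overall.cost <- sapply(1:4, function(q){
  m <- cpt.mean(data=nb.profiles[['logratio']], penalty="None", pen.value=0,
                method="BinSeg", Q=q, test.stat="Normal", class=TRUE)
  unlist(logLik(m))[1]  # overall cost of the selected (q+1)-segment model
})
```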
Side note
The binsegRcpp package manages to produce the overall cost of the model without the use of other functions, since it considers the one-segment model at the beginning of the algorithm. If the cost of the whole signal is known (without segmentation), then in each iteration the cost of the current model can be obtained by subtracting the best decrease. This is equivalent to subtracting the cost of the whole segment (the one which is going to be partitioned), and adding the costs of the two newly created segments.
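In symbols, writing \(c(\cdot)\) for the cost of a segment \(s\) that is split into \(s_L\) and \(s_R\), the update performed in each iteration is

\[ \text{cost}_{k+1} = \text{cost}_k - \big( c(s) - c(s_L) - c(s_R) \big) \]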
Test 3: Negative Binomial Cost Function
Step 1: Obtain the mathematical expression for the cost function.
According to the article, the negative binomial distribution models the probability of observing \(k\) failures before \(r\) successes in a series of Bernoulli experiments with a success probability of \(p\) (the convention used in the Wikipedia reference). In many changepoint detection algorithms, the cost function for a distribution is equal to the log likelihood (Killick and Eckley, 2014). For the negative binomial distribution, the likelihood of parameters \(r\) and \(p\) given \(k\) failures is
\[L(r, p \mid k) = \binom{k + r -1}{r-1}(1-p)^kp^r\]
Then, for a signal \(y\) with observations \(y_0, \dots, y_t\), the likelihood of parameters \(r\) and \(p\) given the whole signal is equal to the product of the likelihoods of the individual observations \(y_i\):
\[L(r, p \mid y) = \prod_{i=0}^t \binom{y_i + r -1}{r-1}(1-p)^{y_i}p^r\]
When the natural logarithm is applied to this equation, it becomes the log likelihood.
\[\ln L(r, p \mid y) = \sum_{i=0}^t \ln \left( \binom{y_i + r -1}{r-1}(1-p)^{y_i}p^r \right)\]
The binomial factor can be expressed in terms of the gamma function
\[\ln L(r, p \mid y) = \sum_{i=0}^t \ln \left( \frac{\Gamma \left( y_i + r \right)}{ \Gamma \left( r \right) \Gamma \left( y_i + 1 \right)}(1-p)^{y_i}p^r \right)\]
Then, by applying the laws of logarithms,
\[ \ln L(r, p \mid y) = \sum_{i=0}^t ( \ln \Gamma ( y_i + r ) - \ln \Gamma ( r ) - \ln \Gamma ( y_i + 1) + y_i \ln (1-p) + r \ln p ) \]
Simplifying the sigma notation,
\[ \ln L(r, p \mid y) = \sum_{i=0}^t ( \ln \Gamma ( y_i + r ) - \ln \Gamma ( y_i + 1)) + (t+1)( r \ln p - \ln \Gamma ( r )) + \ln (1-p) \sum_{i=0}^t y_i \]
This expression represents the cost of a segment of \(y\), assuming \(y\) follows a negative binomial distribution. However, the values of the parameters \(r\) and \(p\) are unknown when computing a changepoint model, so they must be estimated for each considered segment. They can be expressed in terms of the mean and the variance of the data, according to the following equations (Lindén and Mäntyniemi, 2011):
\[ r = \frac{\mu^2}{\sigma^2 - \mu} \]
\[ p = \frac{\mu}{\sigma^2} \]
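As a numerical sanity check (added here; it is not part of the original derivation), both the log-likelihood expression and these estimates can be verified in R, since dnbinom and rnbinom use the same convention (counts of failures, size equal to \(r\), prob equal to \(p\)):

```r
# Check the derived log-likelihood against R's built-in density
set.seed(1)
r.true <- 5; p.true <- 0.3
y <- rnbinom(10000, size=r.true, prob=p.true)
closed.form <- sum(lgamma(y + r.true) - lgamma(y + 1)) +
  length(y)*(r.true*log(p.true) - lgamma(r.true)) + log(1 - p.true)*sum(y)
all.equal(closed.form, sum(dnbinom(y, size=r.true, prob=p.true, log=TRUE)))  # TRUE
# Moment-based estimates of r and p; should be close to (5, 0.3)
mu <- mean(y); s2 <- mean((y - mu)^2)
c(r.hat=mu^2/(s2 - mu), p.hat=mu/s2)
```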
Step 2: Implement the function in C
The following C function implements this cost, using the same arguments as the other cost functions in the changepoint package. The function lgamma() is used to compute the logarithms of the gamma terms.
```c
void negbin_loss(double *SS, int *size, int *n, int *p, int *minorder,
                 int *optimalorder, int *maxorder, int *start, int *end, double *cost,
                 double *tol, int *error, double *shape, int *MBIC){
  double l = *end - *start; // Length of the segment
  double sum = SS[*end] - SS[*start]; // Sum of the segment data points.
  double sum_squared = SS[*n + *end] - SS[*n + *start]; // Sum of squared data points.
  double mean = sum/l;
  double var = (sum_squared - sum*sum/l)/l;
  double ps = mean/var; // Probability of success (parameterized).
  double r = fabs(mean*mean/(var-mean)); // Dispersion parameter r (parameterized).
  *cost = sum*log(1-ps) + l*(r*log(ps) - lgamma(r)); // Calculated in constant time
  for(int i = *start; i <= *end; i++){
    // Adding the log gamma terms for each element (SS[i] - SS[i-1] recovers y_i)
    *cost += lgamma(SS[i] - SS[i-1] + r) - lgamma(SS[i] - SS[i-1] + 1);
  }
}
```
Step 3: Testing and correction
Even though this function is based on the obtained equation, and is mathematically correct, it did not produce the expected results when I tested it using the changepoint package with data generated by the gfpop package. I spent a few days debugging the function with lldb, and found some bugs. The first one occurred in log(1-ps): for some of the possible changepoints, the probability ps resulted in a value greater than one, so the logarithm produced a NaN value. Also, in the for loop, defining i=*start is wrong, because in the first iteration of the algorithm *start is 0. This means that sometimes a negative value is obtained from SS[i] - SS[i-1] + r, since SS[i-1] is out of bounds. When the lgamma was computed, this resulted in infinity, so index 0 was selected as a changepoint. This was corrected by setting i=*start+1.
Even after correcting the errors, the predicted changepoints were not accurate enough. Therefore, the cost function was instead implemented as described here, as suggested by Dr. Toby Hocking. However, the value of r was calculated for every segment, instead of using a constant overdispersion parameter for the whole signal. The resulting cost function is very similar to the previous one, but all the lgamma terms are omitted. This function also performs better, since it does not have the for loop.
```c
void negbin_loss(double *SS, int *size, int *n, int *p, int *minorder, int *optimalorder,
                 int *maxorder, int *start, int *end, double *cost, double *tol,
                 int *error, double *shape, int *MBIC){
  double l = *end - *start; // Length of the segment
  double sum = SS[*end] - SS[*start]; // Sum of the segment data points.
  double sum_squared = SS[*n + *end] - SS[*n + *start]; // Sum of squared data points.
  double mean = sum/l;
  double var = (sum_squared - sum*sum/l)/l;
  if(var == 0){var = 0.00000000001;} // Avoid division by zero
  double r = ceil(fabs(mean*mean/(var-mean))); // Dispersion parameter r (estimated per segment).
  double ps = mean/var; // Probability of success (parameterized).
  *cost = sum*log(1-ps) + (l+1)*r*log(ps);
}
```
This was tested by generating data with the gfpop package, using the negative binomial distribution, and checking whether the computed changepoints matched the actual changepoints. I forked the original package and added the cost function, so it can be easily tested. In this first test, a vector with three changepoints is generated. Each segment has a different success probability (specified in parameters), but all of them share the same dispersion parameter (size). The computed changepoints almost always match the actual changepoints (100, 200, 300, 400) exactly.
```r
# Installing and loading the fork
install_github("diego-urgell/changepointNegbin")
library("changepointNegbin")

data <- dataGenerator(n=400, changepoints=c(0.25, 0.50, 0.75, 1),
                      parameters=c(0.2, 0.35, 0.5, 0.65), type="negbin", size=50)
sumstat <- cbind(c(0, cumsum(coredata(data))), c(0, cumsum(coredata(data)^2)))
model <- changepointNegbin::BINSEG(sumstat=sumstat, pen=0, cost_func="negbin", Q=3)$cpts
print(model)
```

```
## [1] 100 200 300 400
```
Then, I created a second test, in which the probability of success is the same for all the segments, but the dispersion parameter (r) differs between segments. In this case, the function predicts the actual changepoints most of the time, since the parameter r was included in the cost function. However, occasionally there are very significant errors (for example, two changepoints placed very near each other).
```r
piece1 <- dataGenerator(n=100, changepoints=1, parameters=0.5, type="negbin", size=10)
piece2 <- dataGenerator(n=100, changepoints=1, parameters=0.5, type="negbin", size=30)
piece3 <- dataGenerator(n=100, changepoints=1, parameters=0.5, type="negbin", size=50)
piece4 <- dataGenerator(n=100, changepoints=1, parameters=0.5, type="negbin", size=70)

data <- c(piece1, piece2, piece3, piece4)

sumstat <- cbind(c(0, cumsum(coredata(data))), c(0, cumsum(coredata(data)^2)))
model <- changepointNegbin::BINSEG(sumstat=sumstat, pen=0, cost_func="negbin", Q=3)$cpts
print(model)
```

```
## [1] 100 200 300 400
```
Overall, the cost function produces accurate changepoints, particularly for changes in probability with constant r. However, it could be further modified to handle changes in r more effectively.
Test 4: Change in mean and variance with binsegRcpp
Step 1: Find the cost function
The cost function is the log likelihood of the signal or segment \(y\) of length \(N\) given a mean \(\mu\) and a variance \(\sigma^2\). Considering a normal distribution, this can be expressed as
\[\begin{align*} \ln L (\sigma ^2 , \mu \mid y) &= \sum_{i = 0}^{t} \ln \left( \frac{1}{\sigma \sqrt{2 \pi}} {\rm e}^{ - \frac{1}{2} (\frac{y_i - \mu}{\sigma})^2} \right) \\ &= \sum_{i = 0}^{t} \left( - \frac{1}{2}\ln \sigma^2 - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2} \right) \\ &= - \frac{1}{2}\ln \sigma^2 \sum_{i = 0}^{t} 1 \: \: - \frac{1}{2} \ln (2 \pi) \sum_{i = 0}^{t} 1 \: \: - \frac{1}{2 \sigma^2} \sum_{i = 0}^{t} (y_i - \mu)^2 \\ &= - \frac{1}{2} N \ln \sigma^2 \: \: - \frac{1}{2} N \ln (2 \pi) \: \: - \frac{1}{2 \sigma^2} (\sigma^2 \cdot N) \\ &= - \frac{1}{2} N \left ( \ln \sigma^2 \: \: + \ln (2 \pi) \: \: + 1 \right) \end{align*}\]
This equation is very similar to the cost function implemented in the changepoint package, the only difference being that the changepoint implementation does not take into account the \(-\frac{1}{2}\) factor.
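The closed form can be verified numerically (a quick check added here, not in the original text):

```r
# The simplified cost should equal the log likelihood at the MLE parameters
set.seed(1)
y <- rnorm(100, mean=1, sd=2)
N <- length(y)
mu <- mean(y); s2 <- mean((y - mu)^2)  # MLE variance divides by N, not N-1
closed <- -0.5*N*(log(s2) + log(2*pi) + 1)
all.equal(closed, sum(dnorm(y, mean=mu, sd=sqrt(s2), log=TRUE)))  # TRUE
```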
Step 2: Understand the code structure and make the changes
In order to make the modifications to the binsegRcpp package, it is necessary to first completely understand its structure. After creating a class diagram, it was evident that only small changes were required in order to compute the change in mean and variance. The following diagram shows the structure, including the changes.
The implementation of the changes is in my fork of the binsegRcpp package (I changed the name to binsegRcppMeanVar to avoid version conflicts). The files contain comments which explain the modifications.
Step 3: Test the modified package
In order to verify that the modifications produce accurate results, the binsegRcppMeanVar package can be tested against the changepoint package, using the cpt.meanvar function. Models were computed for the same neuroblastoma data, and both packages indeed produce the same results for segment end, mean, and variance. This can be observed in the following table.
```r
# Installing and loading the modified version of binsegRcpp
install_github("diego-urgell/binsegRcppMeanVar")
library("binsegRcppMeanVar")

# Computing the model with the modified version of binsegRcpp
models.binsegRcpp.mv <- binsegRcppMeanVar::binseg_normal(nb.profiles[['logratio']], 5)
models.binsegRcpp.mv.coef <- coef(models.binsegRcpp.mv, 5:5)

# Computing the model with the changepoint package using change in mean and variance
models.changepoint.mv <- cpt.meanvar(data=nb.profiles[['logratio']], penalty="None",
                                     pen.value=0, method="BinSeg", Q=4,
                                     test.stat="Normal", class=TRUE, param.estimates=TRUE)

# Adjusting the data to display the table
changepoints <- data.table(number=1:5,
                           'end(binseg)'=models.binsegRcpp.mv.coef$end,
                           'end(cpts)'=c(cpts(models.changepoint.mv), NA),
                           'mean(binseg)'=models.binsegRcpp.mv.coef$mean,
                           'mean(cpts)'=param.est(models.changepoint.mv)$mean,
                           'var(binseg)'=models.binsegRcpp.mv.coef$var,
                           'var(cpts)'=param.est(models.changepoint.mv)$variance)
kable(changepoints, caption="End of segment, mean and variance computed by packages
      binsegRcppMeanVar (binseg) and changepoint (cpts) per segment number.") %>%
  column_spec(column=c(1, 3, 5, 7), border_right=T) %>%
  kable_styling(position="center")
```
| number | end(binseg) | end(cpts) | mean(binseg) | mean(cpts) | var(binseg) | var(cpts) |
|---|---|---|---|---|---|---|
| 1 | 41 | 41 | 0.3512311 | 0.3512311 | 0.0103211 | 0.0103211 |
| 2 | 113 | 113 | 0.0058852 | 0.0058852 | 0.0069499 | 0.0069499 |
| 3 | 152 | 152 | -0.4262128 | -0.4262128 | 0.0195413 | 0.0195413 |
| 4 | 157 | 157 | -0.6662593 | -0.6662593 | 0.0098378 | 0.0098378 |
| 5 | 234 | NA | 0.0030357 | 0.0030357 | 0.0068361 | 0.0068361 |
The only difference arises when obtaining the cost of the model. The following table shows the loss for the binsegRcppMeanVar model and the changepoint one.
```r
loss <- data.table(binsegRcppMeanVar=models.binsegRcpp.mv$loss[5],
                   changepoint=unlist(logLik(models.changepoint.mv))[1])
kable(loss, caption="Loss for the five segments model produced by each package") %>%
  kable_styling(position="center", full_width=F)
```
| binsegRcppMeanVar | changepoint |
|---|---|
| -220.8453 | -441.6906 |
The loss produced by binsegRcppMeanVar is exactly half the cost produced by changepoint, since the latter does not take into account the \(-1/2\) factor that arises in the cost function.
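This factor of two can be checked directly from the objects computed above:

```r
# The changepoint cost should be exactly twice the binsegRcppMeanVar loss
all.equal(2*models.binsegRcpp.mv$loss[5], unlist(logLik(models.changepoint.mv))[1])
```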
References
Killick, R., & Eckley, I. A. (2014). changepoint: An R package for changepoint analysis. Journal of Statistical Software, 58(3). doi:10.18637/jss.v058.i03
Lindén, A., & Mäntyniemi, S. (2011). Using the negative binomial distribution to model overdispersion in ecological count data. Ecology, 92(7), 1414-1421. doi:10.1890/10-1831.1
Truong, C., Oudre, L., & Vayatis, N. (2020). Selective review of offline change point detection methods. Signal Processing, 167, 107299. doi:10.1016/j.sigpro.2019.107299
Wikipedia. (2021, March 02). Negative binomial distribution. Retrieved March 08, 2021, from https://en.wikipedia.org/wiki/Negative_binomial_distribution#Maximum_likelihood_estimation