Here's an explanation of the mysterious survival probability calculation from "redundancy.py":

- Code: Select all
`survival_probability = sum([nCk(R, i) * bp**(L*i) * (-1)**(i-1) for i in range(1,R+1)])`

To understand how we arrive at this calculation, let's look at a 2 bit example with a redundant copy, giving us 4 bits. For simplicity, 0 will be a good bit and 1 will be a corrupt bit. The bit corruption probability is 0.1.

Now, lets calculate the probability of each 4 bit bitstring, and mark with a * those that survived (at least one copy with no corruptions).

0000 0.6561 *

1000 0.0729 *

0100 0.0729 *

1100 0.0081 *

0010 0.0729 *

1010 0.0081

0110 0.0081

1110 0.0009

0001 0.0729 *

1001 0.0081

0101 0.0081

1101 0.0009

0011 0.0081 *

1011 0.0009

0111 0.0009

1111 0.0001

If we add together all the * entries, we get 0.9639.

The code that generated the list follows.

- Code: Select all
`def dec2bin(dec, length):`

bin = []

while dec > 0:

if dec % 2 == 1:

bin += [1]

dec -= 1

dec /= 2

else:

bin += [0]

dec /= 2

bin += [0] * (length - len(bin))

return bin

def prob(perm, p):

pr = 1

for d in perm:

if d == 0:

pr *= (1-p)

else:

pr *= p

return pr

L = 2

r = 2

p = 0.1

perms = []

for i in range(2**(L*r)):

perms += [dec2bin(i, (L*r))]

c = 0

for perm in perms:

bs = "".join([str(d) for d in perm])

if all([any(perm[L*i:L*(i+1)]) for i in range(r)]):

print bs, prob(perm, p)

c += prob(perm, p)

else:

print bs, prob(perm, p), "*"

print "Probability of survival", 1 - c

This approach becomes intractable pretty quickly. A gene of 4 bits with 5 copies requires checking 2

20 bitstrings.

We can simplify the calculation with combinatoric math. A bitstring schema has _ in the bitstring, which stands for either 0 or 1. So, the schema 000_ stands for both 0000 and 0001. A schema's probability is the sum of all probabilities of bitstrings that match the schema. We denote the probability of a bitstring and a schema with p(000_). So, p(000_) = p(0000)+p(0001) = 0.6561+0.0729 = 0.729.

We can see from the bitstring probability list that the survival probability is calculated by p(00__)+p(__00)-p(0000). The reason why we subtract p(0000) is because both p(00__) and p(__00) contain p(0000), and it gets double counted.

With our notation, let's expand our example to 3 copies, but only with 1 bit per gene, so we can still easily list all the possibilities.

000 0.729 *

100 0.081 *

010 0.081 *

110 0.009 *

001 0.081 *

101 0.009 *

011 0.009 *

111 0.001

The survival probability for this example is 0.999.

It might be tempting to repeat our previous example and say the total probability is calculated with p(00_)+p(0_0)+p(_00)-p(000). However, this gives us 0.729+0.081+0.729+0.081+0.729+0.081-0.729=1.701.

One problem is p(000) is counted more than twice, unlike the last example. It is counted three times. However, even if we subtract it twice, to make up for the additional count, we still don't get the right answer. Now, we get 0.927, too small.

Referring back to our list, we see that we missed the 0.009 probabilities. We could represent these with the schemas p(1__)+p(_1_)+p(__1). However, if we stick with using 0s, p(0__)+p(_0_)+p(__0), we will end up counting more bitstrings too frequently. Something to notice, though, is that both 0__ and _0_ contain 00_.

So, perhaps we can add the double underscore schemas, 0__, and subtract the single underscore schemas, 00_. How many single underscore schemas need to be subtracted? Since both 0__ and _0_ contain 00_, then one p(00_) has to be subtracted. Including all the single and double underscore schemas, we get:

p(0__)+p(_0_)+p(__0)-p(00_)-p(0_0)-p(_00).

Since each of these schemas also includes p(000), all the p(000) instances represented cancel out. Thus, all that remains is to add a p(000) to the above summation:

p(0__)+p(_0_)+p(__0)-p(00_)-p(0_0)-p(_00)+p(000).

Now we can see the general form, it remains to calculate the probability. The schemas with a single underscore (S[1]) all have the same probability, as do the double underscore (S[2]) and no underscore (S[0]) schemas. So, we can merely multiply the probability of each schema type by its number of members.

3S[2]-3S[1]+S[0].

You may notice these multipliers look a bit like Pascal's triangle:

- Code: Select all
` 1`

1 1

1 2 1

1 3 3 1

The elements can be calculated using the n-choose-k function, e.g. the fourth line can be calculated with [nCk(k,3) for k in range(4)]. This turns our previous summation into:

nCk(1,3)S[2]-nCk(2,3)S[1]+nCk(3,3)S[0].

The above pattern is known as the inclusion-exclusion principle.

https://en.wikipedia.org/wiki/Inclusion ... _principleThe final piece of our puzzle is calculating the probabilities for the schemas. What we are calculating is how many uncorrupted genes are in the schema, and summing over all the other genes. An uncorrupted gene's probability is (1-p)

L, where L is the length of the gene. Two uncorrupted genes is (1-p)

2L. Since we are summing over all the other genes, we move the uncorrupted genes out of the summation. The remaining genes, since we are including all their corrupted and uncorrupted variants, sum to 1. Thus, the probability of a schema, where B is the number of uncorrupted genes out of R copies, is S[B]=(1-p)

BL, and we can generalize our summation using q=1-p as:

(-1)

0nCk(1,3)q

L+(-1)

1nCk(2,3)q

2L+(-1)

2nCk(3,3)q

3L.

From this example we can see the pattern for R copies is:

(-1)

0nCk(1,R)q

L+(-1)

1nCk(2,R)q

2L+...+(-1)

R-1nCk(R,R)q

RL.

The Python source for this expression is, where bp is used instead of q:

- Code: Select all
`survival_probability = sum([nCk(R, i) * bp**(L*i) * (-1)**(i-1) for i in range(1,R+1)])`