I've had an idea rolling around in my head for awhile based on Douglas Axe's research on a 150 residue section of beta-glutamase. He used experiments to come up with a number for the probability of finding a functional fold by mutating it. It was approximately 10^-77. (I'm not sure of the exact number but it doesn't matter for the illustration.) Ever since becoming aware of that research, I have wondered if it would be possible to theoretically predict such results by simply restricting the type of amino acid substitutions which are allowed.

For instance, a protein with 150 residues would have 20^150 or 10^195 possible mutations. But the experimental number Axe found was much lower, suggesting a very large number of mutations can occur and still retain function but not anywhere close to 10^195. How many would that be? Very simply:

x/10^195 = 10^-77

x = 10^118

In other words, Axe's work suggests that there are 10^118 possible unique sequences in the 150-residue amino acid sequence that can result in a functional protein fold out of 10^195 theoretically possible unique sequences. There ought to be some way to get a handle on which unique sequences work and which don't.

To do this, I looked at a few similar cytochrome p450 and Na+K+-ATPase sequences and catalogued only single residue polymorphisms to come up with a list of single substitutions that appear to not affect protein structure and function. I ignored longer stretches of mutations since those types of mutations compensate for each other and are not easily predictable. I came up with the following code:

prim.-----sec.-----tert.

EDHN---GSC----AVIT

FYLM--------------AVIT

KRW

QP

Each amino acid in a group is assumed to be able to freely substitute for each other. The secondary and tertiary groups can substitute for each other or for their primary or secondary groups as well. The purpose of this exercise would be to write a simple program which could calculate the total number of possible sequences based on any particular proteomic code you like.

So for 150 amino acids the total possible would be 20^150 or 10^195. If we reduced the code to 6 then the total possible would be 6^150 or about 10^117. 10^117/10^195 = 10^78, which is very close to Axe's number. So very simply, the goal would be to have an average of 6 possibilities for each of the 20 amino acids. So for instance, this could be the allowed substitutions for each of the 20 amino acids:

E: DHNGSCTIVA

D: EHNGSCTIVA

H: DENGSCTIVA

N: DHEGSCTIVA

G: SCTIVA

S: GCTIVA

C: GSTIVA

A: VIT

V: AIT

I: VAT

T: AVI

F: YLMTIVA

Y: FLMTIVA

L: FYMTIVA

M: FYLTIVA

K: RW

R: KW

W: KR

Q: P

P: Q

(This corresponds to the substitution pattern I posted earlier.)

The program would simply take whatever amino acid exists at each location in the input sequence, assign the total number of amino acids allowed at that location based on the above rules, and multiply those numbers out for every location on the chain. So if position 1 had a E then the program would assign the number "11". If position 2 had a F the program would assign the number "8". If position 3 had a Q then it would be "2". And so on and so forth until it assigned a number to each position. Then it would simply multiply all the numbers at every position together to get a final result of the total number of allowed sequences. The average for this particular substitution code would be 6.15, so presumably the total number of possible sequences for a length of 150 would be 6.15^150 or pretty close to 10^118. The program just needs to calculate it for a real sequence rather than use the average.

The user would need to be able to manually enter numbers or codes for all positions or for ranges of positions so there could be a different substitution code for different positions along the sequence.

Another functionality would be a mutation limit. So using the above substitution code and a given sequence, this is what the mutation limit would do:

SEQUENCE: EGAF

MUTATION LIMIT: 1, 2, 3, 4

PROGRAM OUTPUT:

1: 10 + 6 + 3 + 7 + 1 = 27

2: 10*6 + 10*3 + 10*7 + 6*3 + 6*7 + 3*7 + 27 = 268

3: 10*6*3 + 10*6*7 + 10*3*7 + 6*3*7 + 268 = 1204

4: 10*6*3*7 + 1204 = 11*7*4*8 = 2464

If the mutation limit is 1, then the output would be "27" and the probability of a working protein would be 27/20^4 = 1.69e-4. If up to four mutations are allowed than there are 2,464 possible sequences out of 20^4 total possibilities, giving a 1.54e-2 probability of finding a working sequence at random.

The goal would be to construct a code that gives results which agree with experiments like Axe's. This program could also be written using probabilities calculated according to the genetic code which would be more complicated, but hopefully I've explained how it would work. This sort of approach might be useful to protein structure prediction.

So, is there anybody who could help me write such a program? I'm not a programmer. Presumably we would start simple and build complexities into the code, such as secondary and tertiary substitution groups, distance between point mutations, mutation limits (as shown) and perhaps even build in allowable block substitutions or different codes at different positions based on some criteria.

EDIT: Changed "mutations" to "unique sequences". The goal here is to find the number of sequences that work, not so much to catalog mutations.

EDIT: Improved explanation.