Protein structure verification: Introduction

Proteins are very simple molecules. They are linear chains of residues, and there are only 20 different residue types.

So much for the theory...

The practice is different:

  • We definitely do not understand the mechanism by which protein structures fold. Hence, the "only" forces used in the refinement of a new structure are the measured data and some facts that are well known for ALL molecules. No protein-specific knowledge is used. Generally, this information is not sufficient to find a unique structure. A large fraction of the refinement procedure consists of an expert looking at the structure and making manual adjustments.
  • Proteins contain thousands of atoms, and it is impossible to keep an overview of everything that is going on by hand.

These two points combined are the origin of "incorrect structures" and "weak points in generally correct structures".

An everyday situation in a biocomputing lab

A molecular biologist, working on a big project to elucidate the function of a certain protein, turns to somebody in a biocomputing group with a very specific question. They initiate a database search to find out what is known about the structure, and a very similar protein turns up in the Protein Data Bank. A dream come true? Can important conclusions be deduced from this structure?

It is important to realize that any model based on the structure they found will be at least as bad as the structure!

Should they use the structure?

It is obvious that structures that are very old might be less accurate, because refinement techniques 15 years ago were not so advanced. So, as a first indication, the year of publication can be used to see whether you might want to use the structure. But this criterion is not very convincing...

Somewhat more sophisticated is to look at the resolution of the X-ray structure. The lower the number, the more data the crystallographer had available, and the greater the chance that any mistake would have been detected. But the resolution only gives an approximate indication of how good the structure can theoretically be. It doesn't tell you anything about which parts of the structure are good and which parts cannot be trusted. It also doesn't warn you if there is a serious mistake somewhere.

A lot more sophisticated is to use a protein structure verification program like WHAT IF's "CHECK" menu. It gives a detailed analysis of the protein, providing not only overall numbers but also pointing out possibly problematic residues, and it uses a graphical representation of the results to give a quick overview of the structure.

An everyday situation in a crystallography lab

Crystallographers are refining a structure. They would like to know how they're doing. Is the overall trace reasonable, or is there a shift in this helix? Do all the geometries look as expected, or are some side-chains in strange conformations? Is there a good alternative conformation for the backbone in this loop?

Should they deposit the structure already?

The crystallographer can use the same CHECK menu in WHAT IF to check whether there are weak points in the structure, using the results to improve it before deposition. The big advantage is that they still have the original X-ray data available and are still working on the refinement. Correcting problems at a later stage would be much more difficult.

How it all started

In 1963, G. N. Ramachandran, C. Ramakrishnan and V. Sasisekharan wrote a paper about a graphical representation of the two most important backbone torsion angles (phi and psi) (J. Mol. Biol. 7:95-99 (1963)). They presented a simple theoretical model, showing that in a phi/psi plot all residues are fairly restricted in their possibilities.

This phi/psi plot, later called "Ramachandran plot", was the first serious verification tool for protein structures. Structures that were solved before 1963 were solved without knowledge of Ramachandran's work, and thus the Ramachandran plot can be used as an independent judgment of these structures. In later structures (after 1963), the crystallographers could be aware of the work, and could use it to help during the refinement. So, in principle, there should not be any structures with bad Ramachandran plots that have been deposited after 1963.

However... There are two reasons why there are still structures with strange-looking Ramachandran plots deposited in the Protein Data Bank much later than 1963:

  • There was, until 1994, no automated check of incoming structures at the PDB, so bad structures occasionally slipped through.
  • There is no good way of using the Ramachandran analysis during a refinement. If the original X-ray data are insufficient to determine the backbone conformation accurately, the final Ramachandran plot will always look strange.
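
To make the phi/psi idea concrete, the sketch below (not WHAT IF code; the coordinates are made up for illustration) shows how a single backbone torsion angle can be computed from four consecutive atom positions. Plotting phi against psi for every residue of a structure gives the Ramachandran plot.

import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle (degrees) defined by four points.

    For phi the points are C(i-1), N(i), CA(i), C(i);
    for psi they are N(i), CA(i), C(i), N(i+1).
    """
    b0 = -1.0 * (p1 - p0)
    b1 = p2 - p1
    b2 = p3 - p2
    b1 /= np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# Hypothetical coordinates (Angstrom) of four consecutive backbone atoms
c_prev = np.array([1.50, 0.00, 0.00])
n      = np.array([2.40, 1.00, 0.20])
ca     = np.array([3.80, 0.80, 0.70])
c      = np.array([4.60, 2.10, 0.60])
print(f"phi = {dihedral(c_prev, n, ca, c):.1f} degrees")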

The need for different kinds of verification tools

The last item in that list brings up a nice point: of course it is good if a verification tool can be used during refinement. If such a tool is widely used, it will improve the quality of new structures, and probably make structure solution easier (more automatic) at the same time.

But it is also nice if some of the verification tools cannot be used during the refinement. These tools can later be used for an "independent" judgement of the structure. A number of such checks are part of WHAT IF's protein structure verification menu.

Where does the knowledge come from?

Protein verification checks work by comparing structural parameters tostandard values. These standard values can be obtained in different ways:
  • The absolutely undisputed best way of getting knowledge about protein structures is by studying unbiased protein structures. However, there are only very few highly accurate atomic-resolution structures (better than approximately 1.2 Angstrom); definitely insufficient to calibrate many verification tools. And even the atomic-resolution structures that are available are not completely unbiased. So, unfortunately, this will remain something for the distant future.
  • In many cases, WHAT IF's checks compare to standard values as they are obtained from the 300 best structures in the Protein Data Bank. This set is regularly updated, and all checks are recalibrated.
  • For some parameters (especially simple molecular geometry like bond lengths) protein fragments are no different from small molecules. For these parameters a calibration can be done using the Cambridge Structural Database (CSD): a database of small-molecule crystal structures. This is preferred over the "300 best proteins" approach, as small-molecule structures can be determined much more accurately than proteins, and in an (almost) unbiased way.
  • Some parameters can be obtained from theoretical calculations. That is what Ramachandran et al. did to obtain their "standard" Ramachandran plot. In general, however, these kinds of calculations do not discriminate optimally between good and bad protein structures, because it is extremely difficult to design a good model.

Understanding variations in numbers and standards

Almost any number one can use to describe a structure has some uncertainty. For instance, a single-bond distance between two carbon atoms in a small-molecule structure is about 1.53 Angstroms, but some are a bit longer, and some are a bit shorter. Say: most C-C bonds in small-molecule structures are between 1.50 and 1.56 Angstroms long.

Where does this variation come from?

  1. Obviously, a bond between two C-Br3 groups (Br3C-CBr3) will be different from a bond between two C-H3 groups (H3C-CH3). This is natural variation, due to the fact that "C-C single bond" does not completely specify the situation. This first reason for variation is completely harmless; understanding it will aid in understanding the underlying principles of structures.
  2. The values can be inaccurately determined. In a small-molecule crystal structure, bond distances are generally very accurate, so if the "real" bond distance for a certain C-C bond is 1.55 Angstrom, the "measured" distance will be between 1.548 and 1.552 Angstrom. This second reason for variation is not harmless.

When considering the variation observed in a certain variable, it is of utmost importance to judge the influence of the two mentioned effects on the variability. Looking at the example of a small-molecule C-C bond distance, the natural variation is much larger than the inaccuracy, and thus the variation observed ("a normal C-C bond length is between 1.50 and 1.56 Angstrom") is relevant.

For a C-C bond in an unrestrained protein structure, the opposite would hold. The natural variation will be lower than for small molecules, because the variability in local connectivity is much smaller. The inaccuracy in the determination of the bond length, however, is much larger than for small-molecule structures. The combination of these two differences makes the second effect much larger than the first, making the observed variation ("a C-C bond length in a protein is between 1.1 and 2.0 Angstrom") absolutely irrelevant for the understanding of protein structures.
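
A small simulation (with purely illustrative numbers, not measured data) makes the point: the observed spread is roughly the natural spread and the measurement error combined, so whichever of the two is larger dominates what you see.

import numpy as np

rng = np.random.default_rng(0)
N = 10000

def observed_spread(natural_sd, measurement_sd, mean=1.53):
    """Simulate 'true' bond lengths plus measurement error and
    return the standard deviation of the observed values."""
    true_lengths = rng.normal(mean, natural_sd, N)
    measured = true_lengths + rng.normal(0.0, measurement_sd, N)
    return measured.std()

# Small-molecule-like case: natural variation dominates (illustrative values)
print("small molecule:", round(observed_spread(0.02, 0.002), 3))
# Unrestrained-protein-like case: measurement error dominates (illustrative values)
print("protein:       ", round(observed_spread(0.01, 0.2), 3))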

About "Average" and "Standard Deviation".

Let's assume we have found a parameter that is worth studying: we have 1000 C-C distances from reliable small-molecule structures. Any standard statistics package will tell you something like "the average is 1.532, the standard deviation of the population is 0.020".
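
In Python, for example, the same two numbers could be obtained like this (the distance list is of course hypothetical, and far shorter than 1000 values):

import statistics

# Hypothetical C-C bond lengths (Angstrom); a real calibration set would hold ~1000 of them
cc_distances = [1.53, 1.51, 1.56, 1.52, 1.54, 1.55, 1.50, 1.53]

mean = statistics.fmean(cc_distances)
sigma = statistics.pstdev(cc_distances)   # standard deviation of the population
print(f"average = {mean:.3f}, population standard deviation = {sigma:.3f}")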

What does this tell us about a structure with a 1.66A C-C bond?

Let's take a look at a simulated normal distribution:

[Figure: a simulated normal (Gaussian) distribution]

From this picture it is clear that:

  • Any value that is within 2 standard deviations from the mean is "completely normal".
  • If we get further away from the mean, it is increasingly unlikely to find points.
  • Less than 1 in 10000 points are more than 4 standard deviations away from the mean. These ones we can call "outliers".
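
These fractions follow directly from the normal distribution; a quick check (assuming a standard normal, using scipy) reproduces them:

from scipy.stats import norm

for k in (2, 3, 4):
    # Two-sided tail: fraction of points more than k standard deviations from the mean
    p = 2 * norm.sf(k)
    print(f"|Z| > {k}: {p:.2e}  (about 1 in {round(1 / p):,})")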

Coming back to our 1.66A bond, what can we say? The distance to the mean is (1.66 - 1.532) = 0.128 Angstrom, which is 6.4 standard deviations. This is highly unlikely, and would definitely warrant further study!

Now let's do something similar for a protein. Let's say we have a 400-residue protein. For each of the bonds in this protein (approximately 4000) we do an analysis like the one above. Now we find one that is 4.5 standard deviations away from the mean; all others are less than 4.0. Is this one bond length deviation an error? Not really, because we expect about 1 in 10000 to be more than 4.0 standard deviations away from the mean, and we studied 4000 numbers. One deviation seems to be allowed here. On the other hand, what makes this one bond so special that it wants to deviate more than all the other ones? This indicates a fundamental feature of protein structure verification: it is completely normal to find a few outliers, but it is always worth investigating them. But if outliers are not exceedingly rare, there is something strange going on...

Z-score?

You might have noticed that we need the phrase "standard deviations away from the mean" quite a lot. Mathematicians hate repeating long phrases, and they have given this a name: the number of "standard deviations away from the mean" is called "Z". Formally, Z is the measured value minus the "mean", divided by the "standard deviation of the population", or:

 Z = (X - mu) / sigma

So Z is negative if the value "X" is less than the mean, and Z is positive if the value is greater than the mean. "Outliers" now are all values with Z < -4 or Z > 4. WHAT IF uses this criterion a lot to decide which values need to be listed.
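
As a small sketch of this criterion, using the C-C numbers from above (the 4000 simulated bond lengths are random stand-ins for the 400-residue protein example, not real data):

import numpy as np

mean, sigma = 1.532, 0.020

# The single suspicious bond from the small-molecule example
z = (1.66 - mean) / sigma
print(f"Z = {z:.1f}")   # 6.4 -- far outside the normal range

# Stand-in for the ~4000 bonds of a 400-residue protein: if the bonds
# really follow the reference distribution, how many exceed |Z| > 4 by chance?
rng = np.random.default_rng(1)
bonds = rng.normal(mean, sigma, 4000)
z_scores = (bonds - mean) / sigma
print("outliers with |Z| > 4:", int(np.sum(np.abs(z_scores) > 4)))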

Something else: Z has a very nice property for doing statistics. This can help us to judge whether outliers are indeed rare, or whether there are more (or fewer!) outliers than expected. This property is: the "root mean square" of a population of Z values should be 1.0. So for our hypothetical 400-residue protein:

 RMS-Z = sqrt( sum(Z^2) / number of bonds )
should be approximately 1.0. WHAT IF contains a number of these tests, and will complain if any of these values deviates from 1.0 in an "abnormal" way. This is normally a very sensitive indicator!
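
A minimal sketch of such a test (not the WHAT IF implementation): compute the Z-score of every bond against its reference mean and standard deviation, then take the root mean square of those Z-scores and compare it to 1.0.

import numpy as np

def rms_z(values, ref_mean, ref_sigma):
    """Root-mean-square Z-score of a set of measurements
    against reference (mean, sigma) values."""
    z = (np.asarray(values) - ref_mean) / ref_sigma
    return np.sqrt(np.mean(z ** 2))

# Hypothetical C-C bond lengths from a refined structure
bonds = [1.52, 1.54, 1.53, 1.51, 1.55, 1.50]
print(f"RMS-Z = {rms_z(bonds, 1.532, 0.020):.2f}")
# Close to 1.0: the spread matches the reference distribution.
# Much larger: distorted geometry; much smaller: geometry restrained too tightly.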

Summarizing: there is a "Z-score" and an "RMS Z-score". A Z-score should be around 0; most of the time a negative value means worse than average and a positive value better than average. An RMS Z-score should be close to 1.0. Sometimes any deviation from 1.0 is "bad" (e.g. bond distances); in other cases one direction is "good" and the other is "bad". WHAT IF will give a "subjective" annotation to indicate whether a value is "good" or "bad".

Unfortunately, not all RMS Z-scores are clearly indicated as such in the check report. This will change ASAP. The text does indicate in all cases what the good/bad values are.

Next section: The WHAT IF Check report