Some data are easy to come by. Some other data require considerable effort to gather. And acquiring a deep understanding of one’s subject of study usually requires a great deal of work, dedication, and perseverance.
Collecting open reading frames and putative proteins is easy. Indeed, as of March 10, 2014, there were more than 37,818,139 “protein” sequences in public databases. Why the quotation marks? Well, these are not ordinary quotation marks; they are “scare quotes” placed around the word “protein” to indicate that for the majority of these “protein sequences” we have no evidence that they are indeed proteins.
Curating such sequences, i.e., establishing that they are indeed real proteins and that these proteins have a function, is a much more difficult proposition. In fact, only 542,782 curated protein sequences, i.e., ~1.4% of the putative “protein” sequences, are known.
Determining the three-dimensional structure of a protein is a notoriously cumbersome, expensive, and time-consuming process. However, without a 3D structure, our knowledge of a protein—any protein—is superficial, incomplete, even meaningless. If we are indeed serious about fighting cancer and conquering disease, we must determine the 3D structures of the proteins involved in the etiology of cancer and other diseases. Unfortunately, only 92,126 proteins (~0.2%) with experimentally determined 3D structures have been deposited in databases.
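For readers who want to check the arithmetic, here is a minimal sketch in Python that uses only the three counts quoted above. The constant and function names are illustrative and do not come from any database API; the denominator is taken as exactly 37,818,139, although the text says "more than", so the true fractions would be slightly smaller.

```python
# Counts quoted in the text (as of March 10, 2014).
PUTATIVE_SEQUENCES = 37_818_139  # all "protein" sequences in public databases
CURATED_SEQUENCES = 542_782      # curated protein sequences
SOLVED_STRUCTURES = 92_126       # proteins with experimentally determined 3D structures


def percent_of_putative(count: int, total: int = PUTATIVE_SEQUENCES) -> float:
    """Return `count` as a percentage of `total` (hypothetical helper)."""
    return 100.0 * count / total


curated_pct = percent_of_putative(CURATED_SEQUENCES)
structure_pct = percent_of_putative(SOLVED_STRUCTURES)

print(f"Curated: {curated_pct:.1f}% of putative sequences")
print(f"With 3D structure: {structure_pct:.1f}% of putative sequences")
```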
The situation is simple: On the one hand, we have enormous “protein” databases that are replete with errors, wishful thinking, phantoms, and uncertainties. On the other, we have a tiny fraction of real proteins that have been studied in any depth. One would think that the priority of NIH and other agencies would be to close this gap, which I call the data-knowledge gap. One would think that NIH would be determined to increase the fraction of proteins with experimentally determined 3D structures from 0.2% to something a little more respectable.
Oh, well… one would be wrong.