Perhaps you’ve read in a biology textbook that humans have the same number of genes as C. elegans, a worm used in scientific research. Perhaps you’d also like to believe that humans are more complex than a worm that is used precisely for its simplicity. For decades, scientists have been trying to understand what drives organismal complexity, and researchers may have just found out where that complexity arises from.
Biologists often define an organism's complexity as the number of cell types it possesses. David Alvarez-Ponce, an associate professor and bioinformatician at the University of Nevada, Reno, and his coauthor and former graduate student Krishnamurthy Subramanian, currently at the Cancer Institute of New Jersey, propose that complexity is driven by the number of protein families and domains rather than by genome size or the number of genes. This new finding, which has the potential to rewrite biology textbooks, was published last month in the Proceedings of the National Academy of Sciences.
In the 1950s, scientists began to size up the genetic material of various species, hypothesizing that the amount of genetic material would be representative of the complexity of an organism. The researchers weighed the DNA of various species (these techniques would eventually be updated to counting the number of base pairs), including humans, and determined that genome size was not correlated with the number of cell types.
“When the human genome was sequenced for the first time, people expected that we would have a lot more genes in comparison to C. elegans,” Alvarez-Ponce said.
Researchers dubbed this the “C-value paradox” and found that it is mostly explained by how little of the genome is made up of protein-coding genes. Much of the genome is made up of non-coding material (for years, this was cast off as “junk DNA,” which is now known to have more function in the cell than previously thought). The C-value of humans is over 3 billion base pairs. The C-value of Triticum aestivum, bread wheat, is much bigger: 17 billion base pairs. If C-values were representative of organismal complexity, we would expect wheat to be more complex than humans, a conclusion that scientists (and likely most people) generally reject.
As more genomes were sequenced from the 1970s to the early 2000s, researchers were able to count the number of protein-coding genes in each (the G-value). However, they were once again surprised by the results.
“There isn’t a good correlation between organisms’ complexity and how many genes they have,” Alvarez-Ponce said.
This became known as the G-value paradox. The G-value of humans is about 24,200 genes, and the G-value of wheat is about 124,207 genes. Once again, scientists don’t believe that wheat is more complex than humans, so they kept searching for a genetic factor that drives complexity.
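The two paradoxes can be put side by side with a little arithmetic. The short Python sketch below uses only the approximate figures quoted above (the human genome size is rounded down to 3 billion base pairs, so the first ratio is a rough upper estimate); the variable and function names are illustrative and not taken from the study.

```python
# Approximate figures quoted in the article.
genome_size_bp = {"human": 3_000_000_000, "bread wheat": 17_000_000_000}  # C-values
gene_count = {"human": 24_200, "bread wheat": 124_207}                    # G-values


def ratio(values, numerator, denominator):
    """Return how many times larger one species' value is than another's."""
    return values[numerator] / values[denominator]


c_ratio = ratio(genome_size_bp, "bread wheat", "human")
g_ratio = ratio(gene_count, "bread wheat", "human")

print(f"Wheat genome is about {c_ratio:.1f} times the size of the human genome")
print(f"Wheat has about {g_ratio:.1f} times as many protein-coding genes as humans")
```

By either measure, wheat comes out more than five times "larger" than humans, which is why neither genome size nor gene count works as a yardstick for complexity.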
Instead of using genome size or the number of protein-coding genes as predictors of organismal complexity, Alvarez-Ponce and Subramanian decided to investigate the number of protein families and domains.
“Families are groups of genes that are related to each other via duplication,” Alvarez-Ponce said.
Alvarez-Ponce offers an analogy using a toolbox. The toolbox has several types of tools. The tools themselves are the protein-coding genes, and the tool types are the families of genes. In the C. elegans toolbox, there may be many varieties of the same tool. For example, there may be many screwdrivers, like a Phillips-head and a flathead, and there might be various sizes of each.
“They are not exactly identical, but they are similar, and they carry out similar functions,” Alvarez-Ponce explained.
In the human toolbox, there might be many different types of tools, including a saw, a file, a screwdriver, a level, pliers, a wrench and so on. When comparing the human and worm toolboxes, even though both boxes contain a similar number of items, the human toolbox is equipped to do more complex types of work.
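The toolbox analogy amounts to counting two different things: the total number of items (genes) and the number of distinct item types (families). The Python sketch below makes that distinction concrete. The toolbox contents are invented purely for illustration and do not represent real gene or family counts for either species.

```python
from collections import Counter

# Hypothetical toolboxes: each entry is (tool, tool_type).
# A tool stands in for a gene; a tool type stands in for a gene family.
worm_toolbox = [
    ("small Phillips screwdriver", "screwdriver"),
    ("large Phillips screwdriver", "screwdriver"),
    ("small flathead screwdriver", "screwdriver"),
    ("large flathead screwdriver", "screwdriver"),
    ("pliers", "pliers"),
    ("wrench", "wrench"),
]
human_toolbox = [
    ("saw", "saw"),
    ("file", "file"),
    ("screwdriver", "screwdriver"),
    ("level", "level"),
    ("pliers", "pliers"),
    ("wrench", "wrench"),
]


def summarize(toolbox):
    """Return (number of tools, number of distinct tool types)."""
    types = Counter(tool_type for _, tool_type in toolbox)
    return len(toolbox), len(types)


for name, box in [("worm", worm_toolbox), ("human", human_toolbox)]:
    tools, types = summarize(box)
    print(f"{name}: {tools} tools, {types} tool types")
```

Both toy toolboxes hold six tools, but the worm's covers three tool types while the human's covers six. Counting types rather than items is the shift in perspective the researchers propose.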
The researchers also looked at protein domains, which are parts of protein structures that have similar functions (like binding to the cell membrane, for example). Using the Pfam database, hosted by the European Molecular Biology Laboratory, the researchers classified the protein families and domains of 16,929 species, divided into viruses (7,784 species), archaea (316 species), bacteria (7,236 species), protists (181 species), fungi (771 species), land plants (155 species) and animals (486 species). The researchers ran algorithms to compare the number of families and domains of proteins in progressively more complicated species. Starting with the viruses, which are the simplest organisms and have the fewest families and domains, they moved upward, identifying a positive relationship between complexity and the number of protein families and domains.
The researchers ran the same algorithm on data from multicellular organisms from another database in case there were biases introduced by human curation. The Ensembl Compara database’s gene families are defined automatically and are assumed to be free from human bias. Alvarez-Ponce and Subramanian found the same results when looking at the 884 species in the Ensembl Compara database as they found looking at the Pfam database.
“When we look at these quantities, they do correlate very well with organismal complexity,” Alvarez-Ponce said.
This approach for estimating genome complexity is more closely aligned with how cells function and how species have evolved over time.
“Protein families and domains give us a more nuanced understanding of how many functions a genome can encode and carry out,” Alvarez-Ponce added.
Alvarez-Ponce hopes this research can help resolve some of the questions about what drives the complexity of organisms and explain how life became more complex during evolution.