The Human Genome Project (HGP) is an international scientific research project with the goal of determining
the sequence of chemical base
pairs which make up human DNA, and of identifying and mapping
all of the genes of the human genome from both a physical and functional
standpoint. It remains the
world's largest collaborative biological project. The project was proposed and funded by
the US government; planning started in 1984, and the project got underway in 1990
and was declared complete in 2003. A parallel project was conducted outside of
government by the Celera
Corporation, or Celera Genomics, which was formally launched in 1998. Most of
the government-sponsored sequencing was performed in twenty universities and research centers in the United States,
the United Kingdom, Japan, France, Germany, and China.
The Human Genome Project originally
aimed to map the nucleotides contained in a human haploid reference genome (more than three billion). The
"genome" of any given individual is unique; mapping "the human
genome" involves sequencing multiple variations of each gene. The project did not study the entire
DNA found in human cells; some heterochromatic areas (about 8% of the total genome)
remain unsequenced.
History
In May 1985, Robert Sinsheimer organized a workshop
to discuss sequencing the human genome, but for a number of reasons the
NIH was uninterested in pursuing the proposal. The following March, the Santa
Fe Workshop was organized by Charles DeLisi and David Smith of the
Department of Energy's Office of Health and Environmental Research (OHER). At
the same time, Renato Dulbecco proposed whole genome sequencing in an essay in
Science. James Watson followed two months later with a workshop held at
the Cold Spring Harbor Laboratory.
Dr. Alvin Trivelpiece sought
and obtained the approval of DeLisi's proposal by Deputy Secretary William
Flynn Martin. In the spring of 1986, Trivelpiece, then Director of the Office
of Energy Research in the Department of Energy, briefed Martin and Under
Secretary Joseph Salgado on his intention to reprogram $4 million to initiate
the project with the approval of Secretary Herrington. This reprogramming
was followed by a line item budget of $16 million in the Reagan
Administration’s 1987 budget submission to Congress. It subsequently
passed both Houses. The Project was planned for 15 years.
Candidate technologies were already
being considered for the proposed undertaking at least as early as 1985.
In 1990, the two major funding
agencies, DOE and NIH, developed a memorandum of understanding in order to
coordinate plans and set the clock for the initiation of the Project to
1990. At that time, David Galas was Director of the renamed “Office of
Biological and Environmental Research” in the U.S. Department of Energy’s
Office of Science and James Watson headed the NIH
Genome Program. In 1993, Aristides Patrinos succeeded Galas and Francis Collins succeeded James Watson, assuming the role
of overall Project Head as Director of the U.S. National Institutes of Health (NIH) National
Center for Human Genome Research (which would later become the National Human Genome Research
Institute).
A working draft of the genome was announced in 2000 and the papers describing
it were published in February 2001. A more complete draft was published in
2003, and genome "finishing" work continued for more than a decade.
The Human Genome Project was declared
complete in April 2003. An initial rough draft of the human genome was
available in June 2000, and by February 2001 a working draft had been completed
and published, followed by the final sequence mapping of the human genome on
April 14, 2003. Although this was reported to cover 99% of the human genome with
99.99% accuracy, a major quality assessment of the human genome sequence
published on May 27, 2004 indicated that over 92% of sampling exceeded 99.99%
accuracy, which is within the intended goal. Further analyses and papers on the HGP
continue to appear.
Applications
The sequencing of the human genome
holds benefits for many fields, from molecular medicine to human evolution. The
Human Genome Project, through its sequencing of the DNA, can help us understand
diseases and supports applications including: genotyping of specific viruses to direct appropriate
treatment; identification of oncogenes and mutations linked to different forms
of cancer; the design of medications and more accurate prediction of their
effects; advancement in forensic applied sciences; biofuels and other energy
applications; agriculture, livestock breeding, and bioprocessing; risk assessment;
and bioarcheology, anthropology, and evolution. Another proposed benefit is the
commercial development of genomics research related to DNA-based products, a
multibillion-dollar industry.
The sequence of the DNA is stored in databases available to
anyone on the Internet. The
U.S. National Center for
Biotechnology Information (and sister
organizations in Europe and Japan) house the gene sequence in a database known
as GenBank, along
with sequences of known and hypothetical genes and proteins. Other
organizations, such as the UCSC Genome Browser at the
University of California, Santa Cruz, and Ensembl, present
additional data and annotation and powerful tools for visualizing and searching
it. Computer
programs have been
developed to analyze the data, because the data itself is difficult to
interpret without such programs. Generally speaking, advances in genome
sequencing technology have followed Moore’s Law, a concept from computer
science which states that integrated circuits can increase in complexity at an
exponential rate. This means
that the speeds at which whole genomes can be sequenced can increase at a
similar rate, as was seen during the development of the above-mentioned Human
Genome Project.
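As a rough illustration of what exponential improvement means in practice, the following Python sketch projects sequencing throughput under a fixed doubling time. The function name, the starting throughput, and the two-year doubling time are illustrative assumptions, not figures from the project.

```python
def projected_throughput(initial: float, years: float, doubling_time_years: float) -> float:
    """Throughput after `years`, assuming it doubles every `doubling_time_years`."""
    if doubling_time_years <= 0:
        raise ValueError("doubling_time_years must be positive")
    return initial * 2 ** (years / doubling_time_years)


# Hypothetical numbers: 1 unit of throughput at year 0, doubling every 2 years.
for year in (0, 2, 4, 10):
    print(year, projected_throughput(1.0, year, 2.0))
```

Under these assumed numbers, throughput grows 32-fold in ten years, which is why even a modest doubling time changes what is feasible over the course of a 15-year project.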
Genome donors
In the IHGSC international public-sector Human Genome
Project (HGP), researchers collected blood (female) or sperm (male) samples
from a large number of donors. Only a few of many collected samples were
processed as DNA resources. Thus the donor identities were protected so neither
donors nor scientists could know whose DNA was sequenced. DNA clones from many
different libraries were used in
the overall project, with most of those libraries being created by Dr. Pieter J. de Jong's lab.
Much of the sequence (>70%) of the reference
genome produced by
the public HGP came from a single anonymous male donor from Buffalo,
New York (code name RP11).
HGP scientists used white
blood cells from the
blood of two male and two female donors (randomly selected from 20 of each) –
each donor yielding a separate DNA library. One of these libraries (RP11) was
used considerably more than others, due to quality considerations. One minor
technical issue is that male samples contain just over half as much DNA from
the sex chromosomes (one X
chromosome and one Y
chromosome) compared to female samples (which contain two X
chromosomes). The other 22 chromosomes (the autosomes) are the same for
both sexes.
Although the main sequencing phase
of the HGP has been completed, studies of DNA variation continue in the International HapMap Project, whose
goal is to identify patterns of single-nucleotide polymorphism (SNP) groups
(called haplotypes,
or “haps”). The DNA samples for the HapMap came from a total of 270
individuals: Yoruba
people in Ibadan, Nigeria; Japanese
people in Tokyo; Han Chinese in Beijing; and the
French Centre d’Etude du
Polymorphisme Humain (CEPH)
resource, which consisted of residents of the United States having ancestry
from Western and Northern
Europe.
In the Celera Genomics private-sector project, DNA
from five different individuals was used for sequencing. The lead scientist of
Celera Genomics at that time, Craig Venter, later acknowledged (in a public
letter to the journal Science)
that his DNA was one of 21 samples in the pool, five of which were selected for
use.
In 2007, a team led by Jonathan Rothberg published James
Watson's entire genome, unveiling the six-billion-nucleotide genome of a
single individual for the first time.
Benefits
The work on interpretation of genome
data is still in its initial stages. It is anticipated that detailed knowledge
of the human genome will provide new avenues for advances in medicine and biotechnology.
Clear practical results of the project emerged even before the work was
finished. For example, a number of companies, such as Myriad
Genetics, started offering easy ways to administer genetic tests that can
show predisposition to a variety of illnesses, including breast
cancer, hemostasis
disorders, cystic
fibrosis, liver diseases and
many others. Also, research into the etiologies of cancers, Alzheimer's disease and other
areas of clinical interest is considered likely to benefit from genome
information, which may lead in the long term to significant advances in
their management.
There are also many tangible
benefits for biologists. For example, a researcher investigating a certain form
of cancer may have
narrowed down his/her search to a particular gene. By visiting the human genome
database on the World
Wide Web, this researcher can examine what other scientists have written
about this gene, including (potentially) the three-dimensional structure of its
product, its function(s), its evolutionary relationships to other human genes,
or to genes in mice or yeast or fruit flies, possible detrimental mutations,
interactions with other genes, body tissues in which this gene is activated,
and diseases associated with this gene or other datatypes. Further, deeper
understanding of the disease processes at the level of molecular biology may
determine new therapeutic procedures. Given the established importance of DNA
in molecular biology and its central role in determining the fundamental
operation of cellular processes, it is likely that expanded
knowledge in this area will facilitate medical advances in numerous areas of
clinical interest that may not have been possible without it.
The project has inspired and paved
the way for genomic work in other fields, such as agriculture. For example, by
studying the genetic composition of Triticum aestivum, the world’s most commonly
used bread wheat, great insight has been gained into the ways that
domestication has impacted the evolution of the plant. Which loci
are most susceptible to manipulation, and how does this play out in
evolutionary terms? Genetic sequencing has allowed these questions to be
addressed for the first time, as specific loci can be compared in wild and
domesticated strains of the plant. This will allow for advances in genetic
modification in the future which could yield healthier, more disease-resistant
wheat crops.
THE PROCESS OF DECIPHERING THE HUMAN GENOME
1. Experimental Procedures
a) Since DNA varies from one individual to another at roughly 1 nucleotide in
500, cutting DNA with restriction enzymes produces a polymorphic pattern of
fragments, which can be employed in genetic mapping by finding restriction
fragment length polymorphisms (RFLPs) that are inherited together with a trait
and so serve as markers.
b) Pulsed-field gel electrophoresis (PFGE) enables separation of large DNA
fragments of up to 10 million bp (base pairs).
c) Polymerase chain reaction (PCR) enables manifold amplification of a DNA
sequence, providing a working means of analyzing minute amounts of DNA.
d) Yeast artificial chromosomes (YACs) enable cloning of large DNA segments of up
to 1 million bp.
e) A sequence-tagged site (STS), the common mapping language, is a short (100-1000
bp) DNA segment, unique in the genome, defined by a pair of PCR primers.
The Genomatron [7] is an automated system that can screen hundreds of
STSs in hours.
f) The "positional candidate" strategy is predicted to become the major
technique for identifying disease genes. The approach is based on an efficient
three-step process: i) localizing a disease gene to a chromosomal subregion
(using traditional linkage analysis); ii) searching databases for an
attractive candidate gene within that subregion; and iii) testing the candidate
gene for disease-causing mutations. It is believed that by the first quarter of
1995, it had helped identify more than fifty disease genes.
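Item (a) above can be sketched in code. The short Python example below is a toy illustration, not the project's mapping software: it cuts two made-up DNA sequences at the EcoRI recognition site (GAATTC, cut after the first G) and shows that a single-nucleotide difference inside one site changes the fragment lengths. The function name `digest` and both sequences are assumptions introduced for illustration.

```python
import re


def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """Return fragment lengths after cutting `sequence` at every `site`.

    `cut_offset` is where the cut falls inside the site (1 = after the
    first base, as for EcoRI, which cuts G^AATTC).
    """
    # A lookahead is used so that overlapping sites would also be found.
    cuts = [m.start() + cut_offset
            for m in re.finditer(f"(?={re.escape(site)})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


# Two hypothetical alleles differing by one nucleotide (A -> G) in the second site.
allele_a = "AAACGAATTCTTGGCCGAATTCCATG"
allele_b = "AAACGAATTCTTGGCCGAGTTCCATG"

print(digest(allele_a))  # [5, 12, 9]
print(digest(allele_b))  # [5, 21]
```

On a gel, the two alleles would give different banding patterns (three fragments versus two), which is what makes such a site usable as a marker.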
2. The Undertaking of the Human Genome Project
Following the founding of the HGP in 1984, the effort to sequence the entire human genome
began. It was advocated by several scientists, including Robert Sinsheimer
(then chancellor of the University of California, Santa Cruz), Charles DeLisi
(DOE) and Renato Dulbecco (then president of the Salk Institute).
In September of 1986, a National Research Council committee was asked to determine
whether the HGP should be advanced. In February of 1988, the committee
recommended its implementation, with the NIH playing a central role. Two months
later, another committee, appointed by the U.S. Congress Office of Technology
Assessment, released a report supporting the recommendation of the National
Research Council committee. That same year, Congress appropriated $17.3 million
to the NIH and $11.8 million to the DOE for genome research. An NIH office, the
Center for Human Genome Research, was created. It was later renamed the
National Center for Human Genome Research (NCHGR).
In early 1990, the NIH and DOE, partners in managing the HGP, presented to
Congress a five-year term program, coordinated by the joint Subcommittee on the
Human Genome, with seven major goals:
a) To develop maps of human chromosomes.
b) To improve technology for DNA sequencing.
c) To map and sequence the DNA of selected model organisms (mouse, Caenorhabditis
elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Escherichia coli).
d) To collect, manage and distribute data (bioinformatics).
e) To study the legal, social and ethical issues involved and to develop policy
options.
f) To develop and improve technology.
g) To facilitate the transfer of technology.
A number of bioinformatics databases have been created, such as the Genome Data
Base, which specializes in human genetic maps and is supported by the NIH and DOE at
the Johns Hopkins University Welch Medical Library in Baltimore [8].
A program addressing the ethical, legal, and social implications (ELSI) has been
launched.
Progress for the first five-year period was right on schedule, especially in genetic
mapping and sequencing of model organisms, while sequencing techniques were
being progressively improved. The results of the linkage map were published in
the "Index Marker Catalog" of the NCHGR, and mapping at
10-15 cM (centimorgan) resolution was completed in 1993.