Last updated: March 07, 2012
Large-Scale Sequencing Program: Concept Clearance
September 12-13, 2005
National Advisory Council for Human Genome Research
Purpose and Background
Staff seeks Council clearance for the renewal of the NHGRI's large-scale sequencing program with a significantly modified structure compared to the current program. Over the past several years, the large amount of genomic sequence data that has been produced has proven to be a major stimulus for new experimental approaches to important biomedical research questions and for the development of important new biological insights. The proposed modifications to the sequencing program are designed to continue that very productive effort and to bring the benefits of genomic sequencing to critical new research areas. The proposed format takes into account the experience that NHGRI and the sequencing community have gained over the past several years; the technological advances that resulted in a greater than two-fold cost decrease during the last award period, and those that appear to have the potential to reduce costs even more radically in the next; NHGRI's experience in managing the program so far; the Institute's budget constraints; and community advice. Taking all of these considerations into account, Staff believes that the scientific value of sequencing and the advances expected over the next three years are enormous and comprise the major justification for the renewal of the NHGRI's sequencing program.
Over the last year, one or more of these issues has been discussed at five workshops. Most of the main points discussed in this clearance request derive from a major workshop NHGRI specifically organized in June 2005 to discuss the future of the large-scale sequencing program (please see the summary of that workshop provided with the Council materials). In addition, NHGRI received important input from two workshops on sequencing tumor genomes, one on the application of novel sequencing technologies, and another on human sequencing targets. Another primary source of information and opinion that went into the development of the proposed format for the renewal of the sequencing program has been the on-going discussions in the working groups (now three) that NHGRI established to develop recommendations for new sequencing targets.
Scientific opportunities
There is little question within the genomics community about the value of the genomic sequence information produced so far. Indeed, many at the June 13, 2005 workshop expressed the opinion that sequence information is still undervalued by some in the scientific community who have not yet become familiar with its uses, and that once they do become familiar with it, demand will grow even more. One factor that helps justify renewal of the sequencing program is its past success and the utility and impact of the results so far. Going forward, there are compelling reasons to believe that additional large-scale sequence information will continue to be at least as valuable; these reasons are discussed under each of the categories below.
- Medical sequencing. Medical sequencing seeks to apply large-scale, inexpensive genomic sequencing (both random and targeted) towards gaining a deeper understanding of the genetic contribution to specific diseases. In spite of important advances in the past two decades, there is no question that our current understanding of the genetic basis of Mendelian and complex disease is still woefully incomplete. Sequence information will identify variants that confer disease risk, and will ultimately help inform diagnosis, prognosis, and treatment. By driving technological improvement and contributing to an increase in the understanding of disease, the NHGRI's large-scale sequencing program will help to drive the state of the art towards a time when genomic sequencing becomes part of the standard of patient care.
Under the "medical sequencing" umbrella, one project is of particularly high significance in its own right, and is thus addressed separately. In collaboration with the National Cancer Institute, NHGRI is planning a program of comprehensive genomic analysis of major tumor types. The purpose of this program, the Human Cancer Genome Project (See: Toward a Comprehensive Genomic Analysis of Cancer), is to generate a comprehensive description of all genomic variants that are associated at significant frequency with major tumor types. To this end, genomic sequencing of tumor DNA will be a major discovery tool, along with other technologies for genomic structural analysis. The HCGP will also include components for gene expression analysis and epigenetic analysis. As with medical sequencing in general, this effort is intended to inform cancer biology in a way that will have direct clinical relevance, including much more highly refined stratification of tumors and the identification of new targets, including pathways, for development of new therapeutic interventions.
- Normal human variation. Although the HapMap project has provided an unprecedented level of information about common human genetic variation, our knowledge of normal human genetic variation is still incomplete. We will need to learn more about the generality of the conclusions that are being drawn on the basis of the analysis of the first four HapMap populations, and we will need much deeper knowledge about the occurrence and population distribution of rare alleles to interpret certain results from medical sequencing studies. In addition, our lack of systematic knowledge about non-SNP variation is particularly in need of attention.
- Comparative genome sequencing. Comparative sequencing has been shown to be one of the most efficient and powerful means to delineate important functional features in genomes (especially our own) and to understand major biological innovations. Much of the sequencing capacity of the last few years has been devoted to projects in this area. While much has been accomplished, a great deal remains to be learned. For example, it would be very attractive to sequence the primate lineage in more detail, in order to identify primate-specific and human-specific elements, highlighting the genomic differences between ourselves and our closest relatives. To maximize the value and cost efficiency of comparative sequencing, we will need to pay close attention to the required quality of the product. For some purposes, light coverage (2X) has been shown to be sufficient; for other purposes, higher coverage, or even "comparative grade" or finished sequence, may be desirable.
- Parasite and vector genomes. The genomes of fewer than 20 important eukaryotic pathogens of humans, and six vectors of human disease, have been sequenced or are in process, primarily with support provided by other institutes and agencies. Combining opportunities in both the medical sequencing and comparative sequencing areas, NHGRI aims to take on a more significant role in the sequencing of additional pathogens and vectors. We are planning to co-sponsor and co-organize a workshop with the National Institute of Allergy and Infectious Diseases that will recommend important new targets in this area. Fully pursuing this area may require that the genomes of multiple, related organisms be obtained, for example when multiple species are vectors for the same pathogen, or where non-vector sequence from related species will aid comparative genomics approaches.
- Model systems. While many of the major model systems have been sequenced, we are still approached by model organism communities (most recently, Aplysia and zebra finch) about sequencing the genomes of the organisms that they are studying. This is understandable, since sequence is becoming ever more essential as a research tool, and investigators in many new areas are adopting genomic approaches to answer significant biomedical questions in their model of choice. This is likely to continue in the future, especially as sequencing costs fall further.
- Other areas that may arise. It has been our programmatic experience that new opportunities for the use of genomic sequence are not always predictable but continue to arise with some regularity. We believe that the NHGRI large-scale sequencing program should remain flexible enough to respond to such opportunities. Many factors contribute to these new opportunities, including the dissemination and growth of the field of genomics, which leads to awareness within new communities of the utility of sequence data; advances within the field that lead to insights about new questions and analyses; and changes in technology that make it easier to justify projects that previously seemed too expensive. As one example, there is increasing interest in the area of human metagenomics, and the use of sequencing to characterize our commensal microbial communities. As another, the availability of very cheap short reads may lead to a re-invigoration of SAGE-like transcriptome analyses, and to the use of sequencing instead of hybridization approaches to look at products of chromatin immunoprecipitation of transcription factors.
Among all of these opportunities, NHGRI believes that our highest priority should be given to those projects that relate to medical sequencing, and we expect that this portion of the program will come to dominate the use of capacity over the next several years. However, we also believe that pursuing the other, more biological, opportunities will yield results of very high significance, and we aim to include them as well.
Based on our experience, we expect considerable interplay between these areas of opportunity, in ways that are not entirely foreseeable. For example, the annotation of conserved mammalian sequence elements could provide critical insight into how best to direct efforts in the area of medical sequencing. Thus, although these opportunities are listed above as separate items, the discussions about them and the results must be kept integrated, with decisions being made on an ongoing basis.
Technological horizons
Recent workshops and discussions with the sequencing community indicate that sequencing technology is on the verge of achieving a significant gain in efficiency. All of the current large-scale centers are testing new technologies, and all expect them to be implemented in production in the near future. However, there remain significant challenges to full replacement of the existing ABI platform. The new technologies yield shorter read lengths, paired-end reads are not yet practical for some of these approaches, and centers are just beginning to understand how to use the data in large-scale projects. Thus, we are still some time away from being able to use these data to make a high-quality de novo assembly of a complex genome. A likely scenario is that the new technologies will first be implemented as a complement to the longer reads afforded by ABI sequencers.
All in all, it seems reasonable to expect that, within the next three-year period, sequencing costs will decrease by five-fold or more. If realized, such cost decreases will undoubtedly change our view of the scientific opportunities afforded by sequence data. Projects once seen as important but too expensive will seem reasonable, and entirely new kinds of highly significant projects (e.g., cancer genome sequencing) will become possible. Anticipated cost decreases have also been considered carefully in planning the funding and structure of the next phase of the large-scale sequencing program. One clear recommendation from the June 13, 2005 workshop was that NHGRI take care not to cut the funding for the program so much at this critical stage that the incipient adoption of new technologies would be jeopardized. Thus, while significant funding reductions in the sequencing program may be needed to make scarce funds available for other activities of high priority for NHGRI, we must ensure that total sequencing capacity will increase significantly over the next three years.
Additional recommendations from the June Workshop
The June 13, 2005 workshop advised that the conception of sequencing centers as solely providing a commodity was not useful. While each center must have a highly developed core competency in state-of-the-art, efficient production of sequence information, a center should also be flexible enough to provide a range of sequencing products beyond whole genome shotgun data. This could include directed sequencing reads for medical sequencing, various kinds of genome finishing, EST or SAGE tag sequencing for initial annotation of certain genomes, or other sequence products that may represent the most useful way to approach a significant problem. Not every center would need to have every capability, but these capabilities should be available over the program as a whole.
Two additional conclusions of the workshop were that sequencing centers should have more autonomy in choosing targets, and also should be encouraged to help the rest of the scientific community learn how to use sequence data. These conclusions were based on the observations that 1) the centers often are among the best 'users' of the sequence data; 2) as capacity increases, we will need better and more rapid ways to select targets; and 3) we will need to do a better job of ensuring that the recipients of this increased amount of data know how to use it.
Finally, the workshop participants urged NHGRI to ensure that the program remains flexible to changing biological, technological, and other opportunities.
Proposed structure of the NHGRI sequencing program, 2006-2008
Core production
In this concept statement, NHGRI proposes to seek centers that have competencies in providing a range of sequencing products, including whole genome shotgun data and assembled genomes, directed sequencing products, refinement or finishing of draft sequence, and others. Such flexibility will be needed to address a spectrum of projects and "inputs" into sequencing center pipelines. This spectrum is likely to change over the grant period. For example, most of the current capacity is devoted to sequencing organismal genomes, for which a whole genome shotgun is appropriate. To this, NHGRI is likely to add medical sequencing projects, for which a directed approach will be more suitable, at least in the near term. In particular, the sequencing program will provide sequence data for the Human Cancer Genome Project. Medical sequencing in general is expected to occupy an increasing proportion of the overall NHGRI sequencing capacity. Flexibility to produce multiple types of sequence data thus would be considered an advantage in any individual application, though not an absolute requirement. Ultimately, such flexibility within the program as a whole would be sought in making funding decisions.
The program will continue to place the strongest emphasis on improving the current state of the art in all areas of production sequencing with respect to cost, throughput, and quality. These parameters will be measured on a regular basis throughout the life of the program, and will need to be addressed by all applicants. NHGRI has developed a rigorous process for monitoring cost and throughput and has, in collaboration with the sequencing community, established quality metrics for base calls, assemblies, and finished genomes. Monitoring is essential for good management of a large program (stewardship of public funds), but it also goes beyond that. We believe that monitoring cost, throughput and quality has been one of the major reasons for the success of the program as a whole in improving the state of the art in large-scale sequencing. As the range of sequencing products changes, NHGRI will have to develop new ways to measure throughput, cost, and quality.
In addition to sequence production, NHGRI proposes that the "core production" aspect of the centers should include funds for technology development and the other ancillary activities currently undertaken at the existing centers, such as bioinformatics related to production. As in the current project period, these activities will be defined as those closely related to the support of production and its incremental improvement, and adoption of new technologies that have been developed elsewhere or with other sources of funds. While we propose no explicit limit on the proportion of funds that can be used for these activities, the amount would be included in overall calculations of cost, and should be no more than is needed to support the production goals of the center. More radical, long-term technology or bioinformatics development will not be appropriate for funding by this specific program. The RFA will contain detailed guidelines about production-related activities.
Dissemination and outreach
In addition to a sequence production 'core', the proposed program would include incentives for centers to disseminate knowledge through collaborations, education, and other means. Past performance in this regard could be a review criterion. This should help address the growing mismatch between the amount of sequence information available and the ability of investigators and whole scientific communities to use it. This may become more acute as the program increasingly emphasizes medical sequencing, and brings high throughput data production to new user communities. (NHGRI may have to address this through other programs, as well.) Another aspect of dissemination is the idea that sequencing centers should deliver a high-quality product to user communities. This would include automated annotation. Outreach was already an explicit component during the current project period. However, it would receive additional emphasis in the next solicitation, according to this concept proposal.
As before, all sequencing centers would also be required to have a separate Minority Action Plan to increase the representation of under-represented minorities in genome science.
Target selection process
NHGRI proposes that target selection remain largely (although not completely; see next section) separate from the funding of sequencing capacity. Currently, targets are selected through the deliberations of three working groups: 'Annotating the Human Genome,' 'Comparative Genome Evolution,' and 'Medical Sequencing.' We anticipate adding target choices from the Human Cancer Genome Project and also from a joint NHGRI/NIAID workshop on vectors and parasites. Additionally, we propose below a mechanism for the sequencing centers to initiate their own projects. In all cases, proposed projects (including new center-initiated projects) would be vetted by a Coordinating Committee and approved by Council.
Participants at the June 13, 2005 workshop urged NHGRI to modify the target selection process so that groups of targets would be proposed in the context of major biological problems. Ideally this would have the effect of taking better advantage of the most significant opportunities, and helping to lend more coherence to target selection. It would also tend to ameliorate the current circumstance where targets move through the approval process one at a time. However, considering that we expect capacity to increase significantly, this aspect of the target selection process may not scale well to match throughput. Therefore, NHGRI intends to examine and adjust the working group/coordinating committee model on an ongoing basis, with the intention of addressing the most important opportunities and making the process as efficient as possible.
Center-selected projects
NHGRI proposes that the new solicitation should contain a provision for applicants (if they choose) to propose biologically or medically significant projects that can be addressed by genomic sequence data, together with an ongoing process to identify additional projects of interest to them, which could be selected as the initial projects ended. Ideally, these self-identified projects would provide an opportunity for a center to do something that could not readily be accomplished through the Coordinating Committee process. The intent would be for these projects to demonstrate ways to solve problems that require the large-scale production of sequence information that the centers are uniquely capable of approaching. There would be a strict limit on the amount of funds that could be spent on these activities per year; initially, we propose limiting this to a maximum of 10 percent of the center's annual total costs. In general, we expect that these projects would be well-defined enough to be completed within a finite period of time. As these projects are completed, new ones could be proposed, subject to approval by review through the Coordinating Committee process. If included in the application, the initial center-proposed projects (and the process for identifying new ones) would be reviewed and included in the overall priority score.
Funding, mechanism, and management
The current planned budget for the large-scale sequencing program in FY2006 is $133M. This is a decrease of 10 percent from the previous year; in total, the funds in the sequencing program have been reduced by more than $30M from their maximum of approximately $165M in FY2003. NHGRI proposes to continue this trend of budget decreases over the life of the new sequencing program. The actual budget will be guided by NHGRI's estimation of the amount needed to sustain the simultaneous pathways towards (a) continuing to obtain important biomedical information, (b) achieving significant cost reductions through technological improvements, and (c) providing the increase in capacity needed to address and expand the opportunities to apply sequencing as a tool for biomedical research. NHGRI will review the amount of funding applied to the large-scale sequencing program each year to assess whether it continues to be appropriate, depending on the scientific opportunities, technological change, and overall budget considerations.
The award mechanism will remain the U54 cooperative agreement, to provide for the program structure required to manage the performance of these large community resource projects, and to provide for the overall flexibility to respond to new opportunities over the life of the program.
The management structure would be essentially the same as in the previous solicitation. NHGRI staff will monitor all aspects of production and other center activities on a quarterly basis with the aid of a panel of scientific advisors. Significant failure to meet goals could result in redistribution of funds. Target selection will be organized by NHGRI staff.