The genealogy of Arabidopsis thaliana
Official page for NSF 2010 grant DEB-0115062 (2001-2004)
Introduction
The goal of this project (a collaboration between the Bergelson, Kreitman, and Nordborg labs, as well as the [now-defunct] Genaissance Pharmaceuticals, Inc.) was to sequence roughly 1,500 short fragments in a panel of 96 lines using standard PCR-based dideoxy sequencing. The main rationale for the project was to investigate the feasibility of genome-wide association studies by describing population structure and linkage disequilibrium in A. thaliana. The project was thus analogous to the first phase of the International HapMap Project. This page summarizes the main results from the study.
The data
The panel of 96 accessions is available from the stock centers (CS22564-CS22659), and the 1,214 annotated sequence alignments generated by the project can be downloaded here. The data are still subject to change as we add more alignments (from various sources), and clean up the existing data through comparison with other data. The old website generated for this project is still available, but it is not robust and was never very useful. We have plans to develop a much improved database that would ultimately contain all the polymorphism data currently being generated plus data for related species like A. lyrata. The bulk of the SNPs have also been submitted to TAIR. The data have proved extremely useful in many ways, some anticipated, some not.
High-quality sequence data
Our manually curated, high-quality dideoxy sequencing data played a very important role as quality control in analyzing the (much noisier) Perlegen re-sequencing data. Because a subset of the 96 accessions was used in the Perlegen study, it was possible to calibrate the base-calling algorithms very accurately (Clark et al., 2007).
High-quality SNPs
Even though the number of SNPs generated by this project is dwarfed by the recently generated Perlegen data, the SNPs are nonetheless sufficiently dense for many uses, in particular linkage mapping. Several studies from multiple labs (Borevitz, Koornneef, Weigel, etc.) utilizing markers from this study are underway, and several groups have developed software to select markers for particular crosses (e.g., MSQT, MarkerTracker).
Population structure
Contrary to early studies, the data revealed clear population structure and isolation by distance on a global scale (Nordborg et al., 2005). There was tremendous variation among regions in the amount of local population structure. For example, while populations in northern Sweden seemed to be quite distinct, populations in most other regions appeared to be much more freely mixing. North American populations showed all signs of having been recently introduced from Europe via a small number of founders. Finally, in spite of being highly selfing, A. thaliana is far from a collection of isolated lineages. Most alleles were shared world-wide, and there was often considerable variation even within local patches (Bakker et al., 2006). Recombination was evident on all scales.
Detecting selection
The data have proven to be valuable as a form of genomic control when testing for selection. By comparing the pattern of polymorphism at particular loci suspected of having been subject to selection with our genome-wide data, it is possible to establish rigorously that the former are, in a genomic sense, unusual. Using this approach, Toomajian et al. (2006) established that early-flowering alleles of the vernalization response locus FRI have been affected by a recent selective sweep, and Bakker et al. (2006) demonstrated that R genes, as a class, have been affected by some form of balancing selection.
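The comparison described above can be sketched roughly as follows. This is only a minimal illustration of ranking a candidate locus against an empirical genome-wide distribution, not the procedure used in the cited papers. The function name, the choice of summary statistic, and all numbers are hypothetical.

```python
def empirical_tail_fraction(candidate_value, genome_wide_values):
    """Return the fraction of genome-wide loci whose summary statistic
    is at least as large as the candidate locus's value."""
    if not genome_wide_values:
        raise ValueError("need at least one genome-wide value")
    n_extreme = sum(1 for v in genome_wide_values if v >= candidate_value)
    return n_extreme / len(genome_wide_values)


# Hypothetical values of some polymorphism statistic at 10 background loci.
background = [0.1, 0.3, 0.2, 0.5, 0.4, 0.9, 0.25, 0.35, 0.15, 0.45]

# Hypothetical value of the same statistic at a candidate locus.
candidate = 0.8

fraction = empirical_tail_fraction(candidate, background)
print(fraction)  # 0.1: only one background locus is as extreme
```

A small fraction indicates that the candidate is unusual relative to the rest of the genome. With real data the background would be the full set of sequenced fragments rather than ten made-up values.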
Association mapping
These successes notwithstanding, the project was somewhat disappointing in that we found that linkage disequilibrium decayed much faster than previous results (Nordborg et al., 2002) had led us to believe, within 25 kb rather than within 250 kb (Nordborg et al., 2005). This meant that the marker density in the study, roughly one sequenced locus every 100 kb, was not sufficient to describe the genome-wide structure of linkage disequilibrium or carry out genome-wide association mapping (Aranzana et al., 2005). The data were sufficient for exploring the feasibility of genome-wide association mapping, however. We have in particular focused on the problem of confounding by population structure: spurious genotype-phenotype correlations that arise simply because both genotypes and phenotypes are correlated with underlying structure. In a series of papers, we have demonstrated that this problem can be very serious, but that reasonably effective statistical remedies exist (Aranzana et al., 2005; Zhao et al., 2007).
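A toy example may help show how confounding by population structure arises. The sketch below uses entirely made-up data for two hypothetical subpopulations ("north" and "south") that differ in both allele frequency and mean phenotype. The marker has no effect within either subpopulation, yet a naive pooled comparison suggests one. It illustrates the problem only, not the statistical remedies developed in the cited papers.

```python
from statistics import mean


def carrier_effect(genotypes, phenotypes):
    """Mean phenotype of allele carriers (genotype 1) minus the mean
    phenotype of non-carriers (genotype 0)."""
    carriers = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    others = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    return mean(carriers) - mean(others)


# Hypothetical inbred lines: the allele is common in the north and rare
# in the south, and northern lines have higher phenotype values overall.
north_geno = [1, 1, 1, 0]
north_pheno = [10, 12, 11, 11]
south_geno = [0, 0, 0, 1]
south_pheno = [5, 7, 6, 6]

# Naive analysis that ignores structure: pool all lines together.
pooled = carrier_effect(north_geno + south_geno, north_pheno + south_pheno)

# Stratified analysis: compare carriers and non-carriers within each region.
within_north = carrier_effect(north_geno, north_pheno)
within_south = carrier_effect(south_geno, south_pheno)

print(pooled)        # 2.5: apparent effect of the allele
print(within_north)  # 0: no effect within the north
print(within_south)  # 0: no effect within the south
```

The pooled difference of 2.5 is produced entirely by the regional difference in allele frequency and phenotype, which is the kind of spurious association described above.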
In the long run, the fact that linkage disequilibrium decays more rapidly than originally believed is of course excellent news, because it means that association mapping will have higher resolution. The continuation of this study was designed to have a much higher marker density (250,000 SNPs, or one SNP every 500 bp), and also a much larger sample (over 1,000 lines) that includes more homogeneous regional samples to help overcome confounding by population structure. Meanwhile, the original sample of 96 lines is being phenotyped by a large number of labs, if for no other reason than to establish a baseline for variability in a given trait.
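The marker-density argument can be made concrete with a back-of-the-envelope calculation. The sketch below rests on two simplifying assumptions that are not part of the study: markers are evenly spaced, and a marker "covers" any site closer than the distance over which linkage disequilibrium decays. The function name is hypothetical. The spacings and decay distances are those quoted on this page.

```python
def fraction_within_ld(marker_spacing_bp, ld_extent_bp):
    """Fraction of the genome lying within ld_extent_bp of the nearest
    marker, assuming evenly spaced markers (a simplification)."""
    return min(1.0, 2 * ld_extent_bp / marker_spacing_bp)


# Original expectation: LD extends about 250 kb, one locus every 100 kb.
print(fraction_within_ld(100_000, 250_000))  # 1.0

# Observed: LD decays within about 25 kb at the same marker spacing.
print(fraction_within_ld(100_000, 25_000))   # 0.5

# Follow-up design: one SNP every 500 bp.
print(fraction_within_ld(500, 25_000))       # 1.0
```

Under these assumptions the original design leaves roughly half of the genome farther than 25 kb from any sequenced locus, while the denser follow-up design places many markers within every 25 kb window.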
