How to format data for a conStruct analysis

Gideon Bradburd

January 03, 2019

Format data

This document describes the format of the data used in a conStruct analysis.

For information on how to run a conStruct analysis after you’ve formatted your data, see the companion vignette on how to run conStruct.

Throughout the document, I’ll be referring to the example dataset included with the package:

library(conStruct)
#> Loading required package: Rcpp
data(conStruct.data)

conStruct data

There are 3 data objects you need to run a conStruct analysis:

  1. allele frequency data

  2. geographic sampling coordinates

  3. geographic distance matrix

In the sections below, I walk through the specific format required for each.

Allele frequency data

You must specify a matrix of allele frequency data for your samples. (Make sure the data are of class matrix, and that it’s not a data.frame.) I assume that the data consist of bi-allelic SNPs. At each locus, you pick an allele to count across all samples (it doesn’t matter whether it’s randomly chosen or whether it’s always the major or minor allele). The frequency of the counted allele at a locus in a sample is the number of times the counted allele is observed at a locus divided by the number of chromosomes genotyped in that sample. A sample can consist of a single individual or multiple individuals lumped together. So, a successfully genotyped diploid individual heterozygous at a particular locus would have an allele frequency of 0.5. If the sample is a population of 13 haploids, of which 12 have the counted allele at a given locus, the frequency in that sample at that locus would be 12/13.

The matrix of allele frequencies should have one row per sample and one column per locus. Missing data should be denoted with the value NA. An small example allele frequency data matrix is shown below:

Sample Locus1 Locus2 Locus3 Locus4 Locus5 Locus6 Locus7 Locus8 Locus9 Locus10
Pop1 0 1 NA 0.8 0.7 0 0 0.6 0 1
Pop2 0 1 1 0.9 1 1 0.1 0.6 0 0.9
Pop3 0.2 0.75 0 1 1 NA 1 1 0.1 1
Pop4 0.1 0.9 1 1 0.8 1 0.2 0.7 0.1 0.3
Pop5 0 1 1 1 1 1 0.3 0.9 0.3 NA

An full example allele frequency data matrix is included in the conStruct.data object included with the package.

Geographic sampling coordinates

You must specify a matrix of geographic sampling coordinates, which will be used for plotting the results of the analysis. This should be a matrix with two columns that give the sample x-coordinates (longitude) and y-coordinates (latitude), respectively. The order of rows of the matrix should be the same as the order of the rows of the allele frequencies matrix. If you specify longitude and latitude, they should be in decimal degrees.

A full example sampling coordinate data matrix is included in the conStruct.data object included with the package.

Geographic distance matrix

If you choose to run the spatial model implemented in conStruct, you must specify a matrix of pairwise geographic distance between all samples. If the coordinates of the samples are real locations on Earth (as opposed to simulated coordinates), I recommend calculating pairwise great-circle distance between sampling coordinates (using, e.g., fields::rdist.earth).

The order of the samples in the geographic distance matrix should match both that of the geographic coordinates and that of the allele frequency data matrix, and all three matrices should have the same number of rows.

The geographic distance matrix you specify should be the full matrix (that is, not the upper- or lower-triangles), with values of 0 on the diagonal entries.

A full example geographic distance matrix between all samples is in the conStruct.data object included with the package.

Other formats to conStruct

For convenience, I’ve written a function to convert population genetic data in STRUCTURE format to that used in conStruct.

STRUCTURE to conStruct

The program STRUCTURE is one of the most widely used methods for model-based clustering in population genetics. Many existing programs, including plink (v1.9 and above) and PgdSpider, convert data from diverse formats (including .vcf files) into STRUCTURE format. In this section of the vignette, I walk through an example of converting a STRUCTURE format data file into a conStruct format data file.

STRUCTURE data format

More extensive documentation on STRUCTURE’s data format can be found in the STRUCTURE manual. An example STRUCTURE-formatted dataset is shown below:

Loc1 Loc1 Loc2 Loc2 Loc3 Loc3 Loc4 Loc4
Ind1 1 1 1 2 2 1 2 -9 -9
Ind2 1 1 2 2 2 2 2 2 2
Ind3 1 1 1 2 2 2 2 2 2
Ind4 2 -9 -9 1 2 1 1 1 1
Ind5 2 2 1 2 2 1 1 1 2
Example STRUCTURE format dataset, with one row per individual 
and two columns per locus. The first column gives sample names, the 
second refers to the sample locations, and the last 8 columns give 
genotype data for four loci. The numbers in the genotype data refer to 
the allele present at that locus: A1 = `1`, A2 = `2`, missing = `-9`.

To convert a STRUCTURE format file to conStruct format, you can use the function structure2conStruct, included in the conStruct package.

Below, I give an example of the usage of this function, assuming that the file containing the STRUCTURE format data is called “myStructureData.str”, and that it’s on the “desktop” directory on the computer. I also assume that the data are formatted as in the table above, with the genotype data starting at the 3rd column of the data matrix, and missing data denoted with a value of -9.

Note that the STRUCTURE-format data must be a text file and there can be no lines of text before the data table begins. If your file is in an Excel spreadsheet, it can be converted to a text file using Save As > File Format = Tab delimited Text (.txt). If there are lines of text at the top of the document before the data matrix begins, they must be deleted. In addition, your data can only contain bi-allelic data. If you have loci with more than two alleles, they should be not be included in the dataset. For more information on multi-allelic datasets, see the section on Microsatellites below.

An alternate STRUCTURE data format has two rows and one column per diploid genotype:

Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7 Loc8
Ind1 1 1 1 2 2 1 2 0 1
Ind1 1 1 2 2 2 2 2 0 1
Ind2 1 0 1 2 2 2 2 2 2
Ind2 1 0 2 1 2 1 1 1 1
Ind3 2 2 1 2 1 1 1 1 2
Ind3 2 2 1 2 1 1 1 1 2
Ind4 2 2 0 2 2 2 1 0 2
Ind4 2 2 0 2 2 1 1 0 2
Example STRUCTURE format dataset, with two rows per individual 
and one column per locus. The first column gives sample names, the 
second refers to the sample locations, and the last 8 columns give 
genotype data for 8 loci. The numbers in the genotype data refer to 
the allele present at that locus: A1 = `1`, A2 = `2`, missing = `0`.

Data in this format can be converted to conStruct format using the command below:

Further documentation for this function is in its help page, which you can go to using the command help(structure2conStruct).

If you wish to group multiple individuals together into a single sample for analysis you can collapse rows of the conStruct format data file. For example, if you have 12 individuals from 4 locations (3 individuals from each location), and you wish to analyze the data treating populations at a sampling location as the unit of analysis, you can do something like the following:

where pop.index is a vector that gives the population of origin for each of the individuals sampled. In the example above, with 12 individuals sampled from 4 locations (3 from each), pop.index would be c(1,1,1,2,2,2,3,3,3,4,4,4).

Microsatellites

This method is designed to run on large datasets consisting of bi-allelic SNPs. If you have a microsatellite dataset and you wish to run conStruct, the first consideration is whether you have sufficient data. You should have more loci than samples in your data matrix (i.e., your data matrix should have more columns than rows).

If that’s the case, the second consideration is how to format your microsat data so that you can run conStruct. There are two standard ways of “SNP-ifying” a microsat dataset.

The first is to lump all microsatellite alleles present at a locus into two categories: “major” and “other”. The “major” allele is the allele that occurs most frequently at a particular locus; all other alleles are put in the “other” bin. You then can create a dataset in which you only report the frequency of the major allele, effectively reducing the number of alleles per locus to 2. This method has the disadvantage of throwing out data, but acknowledges the simplex relationships between alleles at a locus (the sum of the frequencies of all alleles at a locus must be 1).

The second approach, introduced by Cavalli-Sforza, is to split out each allele at a locus into a separate pseudo-locus consisting of only that allele. That is, if you had 4 alleles present in the genotyped sample at a particular locus, at frequencies {0.4,0.3,0.1,0.2}, you would split those out into 4 separate columns in your data matrix (pseudo-loci), with frequencies in the sampled population of {0.4,0.3,0.1,0.2}. This approach has the advantage of not throwing data away, but does not acknowledge the inter-allele dependence structure in frequencies, and therefore introduces some pseudoreplication into the dataset. This pseudoreplication may make you overconfident in your results, as the credible intervals on parameter estimates may be artificially narrow.

I would recommend trying both approaches, and comparing the estimates of pairwise relatedness you get from each to those derived from the raw microsatellite data to see which best recovers the patterns of relatedness in the data. I also recommend running conStruct on datasets SNP-ified using both approaches, and comparing the results.