(Version 1.0) The data files generated according to this description
were used for the paper "Population differentiation and migration:
coalescence times in a two-sex island model for autosomal and X-linked
loci" by S Ramachandran, NA Rosenberg, MW Feldman, and J Wakeley
(Theor Pop Biol, 2008, Vol. 74:  291-301).

Created by S Ramachandran, Aug 12, 2008
---------------------------------------

HGDP.Xaut.datafiles.tar is an archive containing:

1. combined_aut_1048.stru (HGDP autosomal data - 1048 individuals, 783
microsatellites)

2. combined_aut_952.stru (HGDP autosomal data - 952 individuals, 783
microsatellites)

3. combined_X_1048.stru (HGDP X-chromosomal data - 1048 individuals,
36 microsatellites)

4. combined_X_952.stru (HGDP X-chromosomal data - 952 individuals, 36
microsatellites)

Common traits among files 1-4:

The files are in structure format; the first line contains locus
names.  All other rows represent individual data.

Each individual row begins with six label columns, which precede the
genotype data:
Col I:	 sex of individual (1 is male, 2 is female)
Col II:	 HGDP individual ID number
Col III: numeric code for population
Col IV:	 name of population
Col V:	 country of origin
Col VI:	 geographic region of origin

The remaining columns contain individual genotype data.  Each
individual is represented by two rows of data, with one allele per
locus in each row; -9 indicates missing data.  At X-linked loci (files
3 and 4), males are coded as hemizygous: the second row of genotype
data for each male is entirely missing.
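
As an illustration of the layout described above, the following is a
minimal Python sketch of a reader for these files.  It is not the code
used to produce the files.  It assumes that fields are separated by
whitespace and that the genotype columns are the last fields of each
row (as many as there are locus names on the first line); the function
name parse_stru and the record keys are hypothetical.

```python
def parse_stru(text):
    """Parse structure-format text into (locus names, records by individual ID).

    Assumptions (not stated in the file description): fields are
    whitespace-delimited, and the genotype columns are the last
    len(loci) fields of every individual row.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    loci = lines[0].split()          # first line: locus names
    n = len(loci)
    records = {}
    for ln in lines[1:]:
        fields = ln.split()
        labels, alleles = fields[:-n], fields[-n:]
        ind_id = labels[1]           # Col II: HGDP individual ID number
        rec = records.setdefault(ind_id, {
            "sex": int(labels[0]),   # Col I: 1 is male, 2 is female
            "pop_code": labels[2],   # Col III: numeric population code
            "labels": labels,        # all label columns, as read
            "rows": [],              # two allele rows per individual
        })
        rec["rows"].append(alleles)
    return loci, records
```

Each record collects the two allele rows of one individual, so a male
in an X-chromosomal file would have a second row consisting only of
"-9" entries.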

Note that all Bantu individuals (Bantus from Kenya and from southern
Africa) are grouped into a single Bantu population in these files,
with population code 999.

-------------------------------------------------

Inference of sex:

We took the Screening Set 10 and 52 Diversity Genotype STRP files from
Marshfield's website
(http://research.marshfieldclinic.org/genetics/GenotypingData_Statistics/humanDiversityPanel.asp).
In these files, diploid genotypes are given for all individuals,
including males at X-linked loci.  Using the 36 X-linked loci across
the two screening sets, we computed, for each individual, the fraction
of scored loci at which that individual was heterozygous.  No
individual had missing data at more than 15 loci, so at least 21 loci
were included in the calculation for each individual.
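
The heterozygosity calculation described above can be sketched in
Python as follows.  This is an illustrative reconstruction rather than
the original code; it assumes that a locus counts as scored only when
neither allele is missing, and the function name het_fraction is
hypothetical.

```python
MISSING = -9

def het_fraction(row1, row2, missing=MISSING):
    """Return (fraction of scored loci that are heterozygous, loci scored).

    row1 and row2 are the two allele rows of one individual.  A locus
    with a missing allele in either row is skipped (an assumption).
    The fraction is None when no locus is scored.
    """
    scored = 0
    het = 0
    for a, b in zip(row1, row2):
        if a == missing or b == missing:
            continue
        scored += 1
        if a != b:
            het += 1
    fraction = het / scored if scored else None
    return fraction, scored
```

For example, het_fraction([100, 102, -9, 110], [100, 104, 108, 110])
scores three loci, one of which is heterozygous.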

Sex labels are given by Marshfield in both files.  The Screening Set
52 sex labels are an update of the Set 10 sex labels that correct most
earlier labeling errors.  All individuals labeled as male (denoted by a
1) in the Set 52 file, except two (#139 and #920), were heterozygous
at <15% of the loci at which they had scores.  All individuals labeled
as female in the Set 52 file, except one (#1239), were heterozygous at
>19% of loci at which they had scores.

We examined Y-chromosomal data for these same individuals from the
Marshfield screening sets, also available at the same website.
Individual #139 had data for all nine Y-chromosome markers across Set
10 and Set 52, and individual #1239 amplified eight Y-chromosome
markers.  We concluded these individuals were male.  Individual #920
amplified only one of the Y-chromosome markers.  We concluded this
individual was female.
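
The X-heterozygosity screen that singled out these three individuals
can be sketched as below.  The 15% and 19% cutoffs are the empirical
bounds reported above, used here as if they were a decision rule,
which is an assumption; the function names and category strings are
hypothetical, and the Y-chromosomal follow-up is not modeled.

```python
def x_het_pattern(frac, male_max=0.15, female_min=0.19):
    """Classify a fraction of heterozygous X-linked loci.

    The cutoffs are the observed bounds from the text, not a rule
    stated by the authors.
    """
    if frac < male_max:
        return "male-like"
    if frac > female_min:
        return "female-like"
    return "ambiguous"

def flag_discordant(sex_labels, het_fracs):
    """Return sorted IDs whose sex label disagrees with the X pattern.

    sex_labels maps ID -> 1 (male) or 2 (female);
    het_fracs maps ID -> fraction of heterozygous X-linked loci.
    """
    expected = {1: "male-like", 2: "female-like"}
    return sorted(
        ind for ind, sex in sex_labels.items()
        if x_het_pattern(het_fracs[ind]) != expected[sex]
    )
```

Individuals returned by flag_discordant would be candidates for the
kind of Y-chromosomal check described above.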

After altering the sex labels of #920 and #1239, all Marshfield Set 52
sex labels agree with the sex inferred on the basis of X-chromosomal
genotypes at 294 loci in the study of Conrad et al. (2006), for the
1039 HGDP individuals genotyped for that study.

We note that Marshfield's Set 52 sex labels are incorrect for
individuals #920 and #1239, but these errors are corrected in the
files at http://www.people.fas.harvard.edu/~sramach/datasets.html.  A
plot of the proportion of X-linked loci where male and female
individuals are heterozygous can also be found there; individual #139
is labeled to show that it has a much higher proportion of
heterozygous loci on the X chromosome than other males.

-------------------------------------------------

The generation of files 1 and 2 (autosomal data):

To generate file 1, we used the data from Rosenberg et al. (2005),
combinedmicrosats-1048.stru, available at
http://rosenberglab.bioinformatics.med.umich.edu/diversity.html.

File 2 contains individuals in set H952 from Rosenberg (2006).  The
individuals in file 1 that are not contained in file 2 are those in
Supplementary Table 23 from Rosenberg (2006).

The generation of files 3 and 4 (X-chromosomal data):

File 3 was generated by combining the X-linked genotype data from
Screening Set 10 and 52 Diversity Genotype STRP files on Marshfield's
website
(http://research.marshfieldclinic.org/genetics/GenotypingData_Statistics/humanDiversityPanel.asp).
Homozygous males were made hemizygous, and loci where males were
scored as heterozygous were coded as missing data.
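
The recoding of male X-linked genotypes can be sketched as follows.
This is a hedged illustration, not the original procedure: the
function name code_male_x is hypothetical, and treating a locus with
one missing allele as fully missing is an assumption that the
description above does not address.

```python
def code_male_x(row1, row2, missing=-9):
    """Recode a male's diploid X-linked genotype as hemizygous.

    Homozygous loci keep their allele in the first row.  Loci scored
    as heterozygous become missing.  Loci with any missing allele
    also become missing (an assumption).  The second row is entirely
    missing.
    """
    out1 = []
    out2 = []
    for a, b in zip(row1, row2):
        if a == missing or b == missing or a != b:
            out1.append(missing)
        else:
            out1.append(a)
        out2.append(missing)
    return out1, out2
```

Female genotypes would be left unchanged; only rows belonging to
individuals with sex label 1 would pass through this function.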

File 4 contains individuals in set H952 from Rosenberg (2006); the
individuals listed in Supplementary Table 23 from Rosenberg (2006)
are excluded from this file.
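
Deriving a 952-individual file from a 1048-individual file amounts to
dropping every row that belongs to an excluded individual.  A minimal
Python sketch follows; the function name drop_individuals is
hypothetical, the list of excluded IDs must be taken from Rosenberg
(2006) and is not reproduced here, and whitespace-delimited fields
with the ID in the second column are assumed.

```python
def drop_individuals(lines, excluded_ids, id_col=1):
    """Return the header line plus all rows whose ID is not excluded.

    lines is a list of text lines from a structure-format file, with
    the locus-name line first.  excluded_ids is a set of ID strings.
    id_col is the zero-based position of the ID column (Col II).
    Blank lines are dropped.
    """
    header = lines[0]
    kept = [
        ln for ln in lines[1:]
        if ln.strip() and ln.split()[id_col] not in excluded_ids
    ]
    return [header] + kept
```

Because each individual occupies two rows, both rows are removed
together when their shared ID appears in excluded_ids.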

-------------------------------------------------

References:

Conrad, D.F., Jakobsson, M., Coop, G., Wen, X., Wall, J.D., Rosenberg,
N.A., Pritchard, J.K., 2006. A worldwide survey of haplotype variation
and linkage disequilibrium in the human genome. Nature Genet. 38,
1251-1260.

Ramachandran, S., Rosenberg, N.A., Feldman, M.W., Wakeley, J.,
2008. Population differentiation and migration: coalescence times in a
two-sex island model for autosomal and X-linked loci. Theor. Pop.
Biol. 74, 291-301.

Rosenberg, N.A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard,
J.K., Feldman, M.W., 2005. Clines, clusters, and the effect of study
design on the inference of human population structure. PLoS Genet. 1,
e70.

Rosenberg, N.A., 2006. Standardized subsets of the HGDP-CEPH Human
Genome Diversity Cell Line Panel, accounting for atypical and
duplicated samples and pairs of close relatives. Ann. Hum. Genet. 70,
841-847.
