Pedigree Help Manual

 

(This document is also available in PDF format)

 


 

Chapter one. Overview and elements of strategies

 

 

1.  Pedigree 2.2 and its capabilities

 

On this site, you will find a set of algorithms and routines to reconstruct the full pedigree of a group of individuals from their genotype data, in the complete absence of parental information. In a nutshell, these programs allow you to potentially reconstruct the set of single-generation relationships among your individuals, i.e. which individuals are most likely full-sibs, half-sibs or unrelated. In addition, they allow you to generate the genotypes of the unknown parents. Note that this program reconstructs the full pedigree, as opposed to other approaches that estimate relatedness or relationship on a pairwise basis among all pairs of individuals in your data set.

 

The main program (a Markov Chain Monte Carlo or MCMC engine) uses co-dominant marker data, typically microsatellite data, to generate various partitions of individuals, and is at present limited to the following:

Maximum number of Loci: 30

Maximum number of Alleles per Locus: 110

Maximum number of Individuals: 1000

 

Missing values are acceptable, and the routines generally allow subsequent identification of potential genotyping errors (mutations, human errors, null alleles). A version with automated genotyping error detection is presently under development.

 

The development and improvement of this program and its capabilities is an on-going process, primarily driven by Christophe Herbinger, in the Departments of Biology and Mathematics & Statistics at Dalhousie University, Halifax, NS. Access to this site is free. Only a portion of the set of programs, tools and strategies implemented here has been published so far. Several of the recent improvements are being written up for publication. Users will be kept updated on this aspect. The initial MCMC engine approach for full-sib reconstruction and some applications are described in:

 

Smith, B.R., Herbinger, C.M. and Merry, H.R., 2001. Accurate partition of individuals into full-sib families from genetic data without parental information.  Genetics, 158: 1329-1338.

  

Wilson, A.J., McDonald, G., Moghadam, H.K., Herbinger, C.M. and Ferguson, M.M., 2003. Marker-assisted estimation of quantitative genetic parameters in rainbow trout. Genet. Res. (Camb.) 81: 145-156.

 

Butler, K., Field, C., Herbinger, C.M. and Smith, B.R., 2004. Accuracy, efficiency and robustness of four algorithms allowing full sibship reconstruction from DNA marker data. Mol. Ecol. 13: 1589-1600.

 

 

2. Organisation of the help manual

 

This manual is organized as described in the following table of contents. Users should at least read Chapter One to get a feel for the program and for the type of problems and approaches that Pedigree entails. Chapter Two is an in-depth examination of all the pedigree reconstruction and exploration tools. It is organized along the 6 basic operations: entering Pedigree, submitting a job, consulting the results (partitions) of a job, comparing different partitions to get a finer resolution of the problem at hand, reconstructing the genotypes of the unknown parents, and lastly inferring the statistical significance of the uncovered pedigree structure.

The second chapter is quite complete but not particularly easy to “digest”. It may be a good idea to go back and forth between playing with data sets and reading some sections of the manual rather than attempting to read all of Chapter Two at once. We have tried to make access to this manual somewhat context sensitive, so clicking on the help button in specific pages should take you to the relevant section of the manual.

 

Chapter One. Overview and elements of strategies

 

1. Pedigree 2.2 and its capabilities

2. Organisation of the help manual

3. A quick overview of the process involved

4. Complexity and accuracy of pedigree reconstruction

 

Chapter Two. Detailed Usage of Pedigree 2.2

 

1. Getting in Pedigree 2.2. WELCOME page, LOGIN page and JOBS page.

2. Submitting a New Job

2.1 Control file

2.1.1 Number of iterations

2.1.2 Full-sib Constraint

2.1.3 Temperature

2.1.4 Weight

2.1.5 Seed

2.1.6 Optional parameters

2.2 Data file

2.3  Job name, Start-up partition, and submitting the job

3. Viewing detailed results associated with a completed Job

3.1 Summary files on the control, data, and execution of the various runs

                                                Sibling likelihood ratio

3.2 Graphical representation of the evolution of the partition score according to the iteration number

3.3 Consulting the partition associated with each run

3.3.1 “Groups” text files

3.3.2 “Partition” html formatted files

                        Cohesion and repulsion

                        Group-group repulsion

3.4 Archiving a completed Job

4. Comparing partitions within a job

4.1 Comparing partitions obtained with the same weight

4.2 Comparing partitions obtained with different weights

4.2.1 Full-sib partition

4.2.2 Kin partition

4.3 Nesting a full-sib partition within a kin partition

4.3.1 Detecting genotype errors

4.3.2 Resolving half-sib structure

5. Reconstructing parental genotypes

5.1 Reconstructing parental genotype from a Partition window

5.2 Reconstructing parental genotype from a Compare window

6. Statistical inference on partitions

6.1 Global significance of a partition

6.2 Significance of specific groups within a partition

6.3 Implementation of significance testing in Pedigree 2.2

 

 

3. A quick overview of the process involved

 

In this manual, the term partition will come up repeatedly. A partition is, simply stated, an allocation of every individual in your data set into putative groups of related individuals. These groups can be strict full-sib families or other types of grouping, depending upon a number of parameters that you provide. At one extreme, a group can be very small, containing only one individual; at the other, it can be very large, containing all individuals.

 

To generate a partition, Pedigree first computes the pairwise likelihood ratios (LR) of being full-sibs versus being unrelated for every pair of individuals in the data set given their genotypes. It then uses these pairwise ratios to calculate the Score associated with a given partition. The Score is of the form:

 

Score = Sum {Log (W * LR)} - Sum {Log (W * LR)}

where the first sum is taken over all pairs that belong to the same proposed group, and the second sum over all pairs that belong to different proposed groups.

 

This partition score is constructed so that it should be maximal for the true configuration. Hence a proposed partition that maximises our Score should be a reasonable estimate of the true underlying configuration. W is the weight, a parameter provided by the user (more on this in section 2.1.4).
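As a rough illustration of how such a Score behaves, the following Python sketch computes it for a toy data set. It assumes that the sum over pairs in different groups is subtracted from the sum over pairs in the same group, which is consistent with the Score being maximal for the true configuration. The function name, the dictionary of pairwise ratios and the example values are hypothetical and are not part of Pedigree.

```python
import math
from itertools import combinations

def partition_score(lr, groups, weight=1.0):
    """Score of a proposed partition (illustrative sketch only).

    lr     -- dict mapping frozenset({i, j}) to the pairwise likelihood ratio
    groups -- dict mapping each individual to its proposed group label
    weight -- the weight W
    Assumption: between-group terms are subtracted from within-group terms.
    """
    score = 0.0
    for i, j in combinations(sorted(groups), 2):
        term = math.log(weight * lr[frozenset((i, j))])
        score += term if groups[i] == groups[j] else -term
    return score

# Toy example: individuals 1 and 2 look like sibs, 3 looks unrelated to both.
lr = {frozenset((1, 2)): 8.0, frozenset((1, 3)): 0.25, frozenset((2, 3)): 0.5}
good = partition_score(lr, {1: "a", 2: "a", 3: "b"})
bad = partition_score(lr, {1: "a", 2: "b", 3: "b"})
print(good > bad)
```

In this toy case the partition that keeps the high-ratio pair together and separates the low-ratio pairs receives the higher score.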

 

A new trial (run) involves sampling the space of possible partitions (possible pedigree configurations) with a Markov Chain Monte Carlo (MCMC) process and finding the partition with maximal score in that trial. The partition associated with that run is thus simply the one with the highest score found in that run. It is not necessarily the highest-scoring partition that could possibly have been found, because the space of possible partitions is gigantic and some MCMC variation is expected between runs.

 

Reconstruction of a pedigree involves the following steps:

 

1. Generating a “best” full-sib partition, i.e., an optimal partition of the individuals of the data set into putative full-sib families, where the collection of genotypes of each full-sib family at each locus is consistent with Mendelian inheritance of alleles from 2 parents (i.e., 1 male crossed with 1 female). The main engine of the site does this, and it takes several trials (runs) with appropriate modifications of the parameters of the MCMC process, until the user is satisfied that the best full-sib partition is credible and seems to have the highest attainable score, given the parameters used. The user will find a detailed description of the control parameters in section 2.1. Section 3.2 provides some insights into the number of iterations to use. Section 3.3.2 describes the partition file and the tools available to explore it. Sections 4.1 and 4.2 provide elements of strategy for varying some control parameters and comparing different partitions to identify the “best” full-sib partition.

 

2. Generating a “best” kin group partition, i.e., an optimal partition of the individuals of the data set into putative kin groups, where the individuals in each kin group seem related to one another in an unspecified way. Typically, kin groups contain a mixture of full-sibs and half-sibs. The main engine of the site again does this, and it likewise takes several trials (runs) with appropriate modifications of the parameters of the MCMC process, until the user is satisfied that the best kin partition is credible and seems to have the highest attainable score, given the parameters used. The user will find a detailed description of the control parameters in section 2.1. Section 3.2 provides some insights into the number of iterations to use. Section 3.3.2 describes the partition file and the tools available to explore it. Sections 4.1 and 4.2 provide elements of strategy for varying some control parameters and comparing different partitions to identify the “best” kin partition.

 

 

3. “Nesting” the best full-sib partition within the best kin group partition. This allows the various constituent full-sib families within a kin group to be identified. Section 4.3 describes the tools implemented in the “Compare” function that perform this nesting and thus allow identification of potential half-sib relationships among the full-sib families, as well as potential genotype errors.

 

4. Using the “Parent?” function (see section 5) to reconstruct putative parental genotypes of large full-sib families and to identify common parents across kin groups, hence refining the reconstructed pedigree by identifying potential partial or full factorial crosses.

 

5. Finally, using the strategies described in section 6 to test whether the best full-sib partition or the best kin partition has identified potentially true pedigree relationships rather than a collection of artefactual relationships. These strategies can also help identify which particular full-sib group or kin group may be real rather than an artefactual grouping of unrelated individuals.

 

 

From this very brief overview of the process, it should be clear that Pedigree is not a program that will produce one single pedigree solution to a given data set. Generating a plausible, detailed pedigree is not a straightforward “Send the Job, push the button and get the final result” operation. The number of possible ways to allocate even a small number of individuals into simple genealogical groups is quite large and grows more than exponentially with increasing number of individuals. When noise in the data (genotype errors) and other types of genealogical relationships (i.e. half-sibs) are allowed, the number of possible solutions to a problem with several hundred individuals is truly astronomical.
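To give a feel for this growth: if only the grouping itself matters, the number of ways to partition n individuals into groups is the Bell number B(n). The short Python sketch below computes it with the Bell triangle. The function name is ours, and the count ignores any genetic constraint on the groups.

```python
def bell_number(n):
    """Number of ways to partition n labelled individuals into groups."""
    row = [1]                      # first row of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]        # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

for n in (5, 10, 20):
    print(n, bell_number(n))
```

With only 10 individuals there are already 115,975 possible partitions, and with 20 individuals more than 5 x 10^13.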

Pedigree’s main MCMC engine will easily and quickly generate full-sib and kin partitions that should uncover a large part of the sought-after pedigree information. Pedigree can also assess the statistical significance of these partitions. However, the last part of the operation, identifying possible genotype errors and finely reconstructing the half-sib relationships among the various groups, is very much a user-driven process. Pedigree offers various powerful exploratory tools to facilitate this process, but it is still a relatively time-consuming operation that involves several steps, often in an iterative way. The main program and the routines implemented in this site should considerably facilitate that task for you, the user. They do not completely replace the need for you to make decisions and exercise common sense, given Mendelian rules and given your intimate knowledge of the quality of your data and the potential complexity of your particular study system.

 

 

4. Complexity and accuracy of pedigree reconstruction

 

Ultimately, the accuracy of the pedigree reconstruction depends upon the complexity of the problem and the informativeness of the genotype data. This last aspect is intuitively easy to comprehend. Informativeness increases with the number of loci and the number of alleles per locus. It decreases with increasing levels of background genotyping error and, secondarily, with increasing levels of missing values. It is noteworthy that increasing the number of loci, or the number of alleles per locus, does not noticeably increase the computational burden of the main program. It does however increase the amount of work that you, the user, will have to expend working with the various routines on this site to generate a final detailed pedigree.

 

The second aspect, the complexity of the problem is not as immediately intuitive. If your data set comprises few large full-sib families (for example sizes 8-10 and up), the main program will generally find these families accurately and with good reliability (you will find the same or almost the same solution in different trials). Smaller families are more complex to deal with and the interpretation of families of size 2 and 3 for example is difficult, as some of these “families” could be artefacts (i.e., unrelated individuals regrouped by chance).

If in addition, your collection comprises not only full-sibs but also half-sibs, the complexity increases substantially. If the full-sib families are nested within half-sib families (for example if the mating among the unknown parents consisted of several males mating with each female but no male mated with more than one female) then you will be able to reconstruct this sort of pedigree with good accuracy if again the half-sib and full-sib families are reasonably large. If, on the other hand, the mating among the unknown parents was more of a factorial type (many males mating with many females), then you will only be able to reconstruct this pedigree if each (or at least most) of the individual full-sib families within these factorial crosses is quite large.

 

Genotyping errors, null alleles (or variation of these) and mutations are a fact of life with large data sets and will create an additional layer of complexity. If an individual seems to be related to a full-sib group but cannot be a member of that full-sib family because its genotype at some loci would offend Mendelian rules, then that individual could be a half-sib of that family or maybe it is truly a full-sib member and the offending genotypes are due to genotype errors or mutations. The routines in this site will allow potential identification of genotype errors if again the affected full-sib groups are fairly large. However, you should be aware that there often will be several potential ways to allocate a collection of genotypes into a pedigree and you will ultimately have to decide what is the most plausible way.

 

To sum up this very quick and superficial look at the “complexity” of pedigree reconstruction, the best way to think is in terms of the number of unknown parents and the number of unknown mating relationships among these parents, given the size of your collection of progeny. It should be intuitively obvious that reconstructing a pedigree involving few parents and few relationships is much easier than reconstructing one with many parents and many types of relationship among these parents. What may be less obvious is that it is easier to reconstruct accurately the pedigree of 500 individuals that were created by a given mating structure among, say, 20 parents, than to reconstruct the pedigree of 30 individuals taken from among these 500 offspring. The first problem is “bigger”, and so Pedigree will probably require more iterations to get to a solution than for the second problem. However, that first solution will most likely be more accurate, potential genotype errors will be better identified, and the genotypes of the unknown parents will be reconstructed with greater precision than in the second case. This simply reflects the fact that with 500 offspring we will have “sampled” the same set of unknown parents and unknown parental relationships much more intensely than with 30 offspring.

 

 

 

Good luck with your problem!!

Chapter Two. Detailed Usage of Pedigree 2.2

 

 

1.  Getting in Pedigree 2.2. WELCOME page, LOGIN page and JOBS page.

 

Pedigree 2.2 can be found at http://herbinger.biology.dal.ca:5080/Pedigree/

 

 

 

 

 

New users should click the “REGISTER” button and will be taken to a new window. There, they will have to enter a name for their organization and a user name for themselves. To avoid the common problem of users forgetting which part of their organization name and user name was entered in upper and lower cases, Pedigree imposes that the first letter is capitalized and the rest of the name is in lower case.

 

Users will then be asked to enter a password (3-30 characters) and to confirm it. The password is case sensitive. It would be a good idea to save this information in a safe place.

Note for users registered before March 2005. If your original organization name and user name did not follow the convention of the first letter capitalized and the rest of the name in lower case, you will have to retype the information in the appropriate boxes in the window to access your account.

 

We are also asking you to provide us with some information on yourself and your organization, to facilitate our task of tracking down users, should we need to. This information is kept here at Dalhousie University and is not accessible to anybody, apart from the administrators and developers of Pedigree.

 

Users who are already registered can click the “ENTER” button and will be redirected to the LOGIN page.

 

On this page, a small window “Breaking news” alerts the users to recent developments that may be of importance. Users can also enter the help manual by clicking the appropriate button, and from there, it is possible to download a pdf version of the Help manual. Users can login by entering the appropriate organization, user name and password. They will then be taken to their own JOBS page.

 

 

 

The user is presented with a list of their jobs hosted on the server. From there, they can choose to examine any one of them, delete it, or submit a new job.

The status of all jobs submitted by the user is displayed on this page. Completed jobs show a green “100% done” status and for these jobs it is then possible to access detailed results by clicking on the “View details” button (see section 3).

 

Recently submitted jobs will first be queued, with their Status showing “Waiting”. After a maximum of 5 minutes, these jobs will start and the status will then show in red an approximate completion percentage of the estimated workload. Note that the page is not dynamically updated by default. It needs to be refreshed manually by clicking on the “Update this page” button. Alternatively, you can tick the “Every minute” check box, in which case the page will automatically update itself every minute.

 

Jobs with improper control or data files may show a red status stating “Unknown” or may show an “x% done” status that appears frozen.

 

It is possible and indeed recommended to delete a completed job that is not useful anymore, by clicking on the “Delete” button. These jobs and all associated files take space on the server. For the moment, all jobs are kept active on the server for up to 365 days. However, if too many inactive files occupy a large amount of space, then the period of grace will be reduced. Note however that this “Delete” action should only be performed on completed jobs, i.e., jobs showing a status of “100% done”.  It should not be done on jobs that are presently running. Clicking on the “Delete” button on an on-going job does not interrupt it. It deletes some files and creates a number of other orphaned files. Let the job reach its end and then delete it. It is OK to delete jobs that seem to be frozen at some percentage of the work done. Section 3.4 explains how to archive jobs for future replay.

 

                       

 


2. Submitting a New Job

 

 

To submit a new job, the user has to provide:

  1. a control file and a data file, and optionally a start-up partition file, and
  2. a job name for later retrieving the results.

The control, data and start-up files are text files (ASCII) and they should be located on the user's local computer. They can have any names as long as they do not contain exotic characters.

 

2.1 Control file

 

The control file contains a number of parameters necessary for the main MCMC engine to run on a given data file. There are 5 essential parameters and 3 optional ones.

 

 

Essential parameters

 

The user must provide the following 5 essential parameters in the control file for each run of the MCMC on the data set. These parameters are entered on one line of the control file after the keyword “RUN_PARAM”, with the parameters separated by tabs or spaces, as in the following example. Note that the keyword is “RUN_PARAM”, with an underscore separating the two words, not a hyphen.

 

RUN_PARAM 500000 1 10 1 123 

 

Several lines of run parameters are generally contained in the same control file, corresponding to one job. When submitted, these different runs with different parameters will take place sequentially, and the job will contain all results associated with each of the runs. For example, the following control file will have Pedigree doing 3 separate MCMC runs on the same data set with different combinations of parameters. It will produce 3 partitions.

 

RUN_PARAM 500000 1 10 1 123 

RUN_PARAM 1000000 1 10 1 -1 

RUN_PARAM 1000000 0 10 1 -1 

 

The 5 essential parameters are described in the following 5 sections.

 

2.1.1 Number of iterations

500000, in RUN_PARAM 500000 1 10 1 123.

 

The first parameter (iter) is the number of iterations of the Markov Chain. In general, it should be in the range of 500,000 to 5,000,000, unless the problem is small and easy (e.g. a few dozen individuals and a simple pedigree, where 50,000-100,000 iterations may be enough) or, on the contrary, very large (over 500 individuals and a potentially complicated pedigree, which will probably need several million iterations). Pedigree offers a visual way to assess whether enough iterations have been used (see section 3.2). Some indication can also be found in the execution log (section 3.1).

 

 

2.1.2 Full-sib Constraint

 1, in RUN_PARAM 500000 1 10 1 123.

 

The second parameter (FSC) is the “constraint” imposed on grouping the genotypes during pedigree reconstruction. For the moment, this parameter has two possible values: FSC = 0 or FSC = 1.

 

·        A value of FSC=0 indicates that no constraint is imposed. In this case the MCMC algorithm samples the space of possible partitions (groupings) and finds the partition that maximises the overall partition score (see Chapter One, section 3), but it does not impose any sort of constraint on the collection of genotypes at any locus for each of the proposed groupings of individuals. In this case, the MCMC generates a Kin partition, where the individuals in each group appear related to one another but in an unspecified way. The individuals in a kin group are generally a mixture of full-sibs and half-sibs. In some rare circumstances (very highly informative data set) a kin group could also include more distantly related individuals. Alternatively, it may also include some unrelated individuals that are just regrouped by chance (see section 6).

 

·        A value of FSC=1 indicates that a full-sib constraint is imposed. In this case the MCMC algorithm samples the space of possible partitions (groupings) and finds the partition that maximises the overall partition score, but it requires that, for each proposed group at each locus, the collection of genotypes is strictly compatible with Mendelian inheritance from one cross of two parents. In other words, there should exist at least one pair of genotypes (the genotypes of the unobserved parents) that, when crossed, could have generated the collection of genotypes observed in the proposed full-sib family at that locus, and this must be true for each proposed full-sib family at every locus. In this case, the MCMC generates a Full-sib partition, where the individuals in each group appear to form a full-sib family.

 

Note. Imposing the full-sib constraint simply ensures that a collection of individuals could be derived from one pair of parents. Pedigree does not check how likely it is that a particular collection of genotypes is indeed a full-sib family. It simply checks that a pair of parents could have generated the purported group of progeny. In other words, a proposed grouping of 30 individuals where, at one locus, 16 individuals have genotype (A B) while 14 have (A C) is just as acceptable as another grouping where 16 individuals have genotype (A B), 13 have (A C) and 1 has (B D). In the first case, the parents' genotypes could be (A A) crossed with (B C), and the observed collection of “offspring” genotypes is fairly likely given the proposed parental genotypes. In the second case the parents' genotypes would have to be (A D) crossed with (B C), but the observed collection of “offspring” genotypes has a pretty low likelihood given the proposed parental genotypes. This aspect is covered in more detail in section 5.
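The compatibility check described in this note can be sketched in Python as a brute-force search over candidate parental genotypes at one locus. This is only an illustration under our own assumptions: the function name is hypothetical, missing values are not handled, and nothing here is taken from Pedigree's actual implementation. Candidate parental alleles are restricted to the alleles observed in the group, which is sufficient because an unobserved parental allele can always be swapped for an observed one without losing any observed offspring genotype.

```python
from itertools import combinations_with_replacement

def fullsib_compatible(genotypes):
    """True if one pair of parents could have produced every genotype at this locus.

    genotypes -- iterable of (allele1, allele2) tuples, one per individual
    """
    observed = {tuple(sorted(g)) for g in genotypes}
    if not observed:
        return True                # an empty group is trivially compatible
    alleles = sorted({a for g in observed for a in g})
    candidates = list(combinations_with_replacement(alleles, 2))
    for p1 in candidates:
        for p2 in candidates:
            # all genotypes obtainable by taking one allele from each parent
            offspring = {tuple(sorted((x, y))) for x in p1 for y in p2}
            if observed <= offspring:
                return True
    return False

first = [("A", "B")] * 16 + [("A", "C")] * 14
second = [("A", "B")] * 16 + [("A", "C")] * 13 + [("B", "D")]
print(fullsib_compatible(first), fullsib_compatible(second))
print(fullsib_compatible(second + [("A", "A")]))
```

Both groupings from the note pass the check, although the second is far less likely. Adding a single (A A) individual to the second grouping makes it incompatible with any one pair of parents.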

 

Note. We are presently working on implementing a relaxed full-sib constraint that will group putative full-sibs without requiring the constraint to hold at every locus. For example, an individual that is compatible with a proposed full-sib grouping for 7 out of 8 loci would probably end up grouped in that family under the relaxed constraint but cannot under the strict constraint. This would allow for a quick identification of potential genotyping errors. Such identification of genotyping errors is already feasible (albeit in a slightly more cumbersome way) with the present version of Pedigree and is described in more detail in section 4.3.1.

 

 

2.1.3 Temperature

10, in RUN_PARAM 500000 1 10 1 123.

 

The third parameter (Temp) is the initial “temperature” of the Markov chain. The higher the temperature, the faster the algorithm samples the space of partitions, but it may “jump all over” and may not be very stable. In other words, at higher temperature, the algorithm has a higher chance of not getting stuck on a local maximum (a partition that has a higher score than other similar partitions but is not the true maximal partition), but it also has a higher chance of not climbing to the top of the true global maximum (the true maximal partition). This is a consequence of the algorithm having a higher probability in the Markov chain of accepting new partitions that actually decrease the partition score. At lower temperature, the algorithm is quite stable, but it may require a very large number of iterations before it finds the optimum. It may actually never find the global maximum and could get stuck on a local maximum. In our experience, a temperature of 10 is quite adequate over a fairly large range of problems, and you should start there. Once you have found a solution that seems to be stable over several trials (see section 4.1), you can try a longer run with a higher temperature (for example 30) to see if you still get the same or nearly the same solution.
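The effect of temperature on the acceptance of score-decreasing moves can be illustrated with a standard Metropolis-style rule. The exact acceptance rule used by Pedigree is not given in this manual, so the exponential form and the function names below are assumptions chosen only to show the qualitative behaviour.

```python
import math
import random

def acceptance_probability(delta_score, temperature):
    """Probability of accepting a proposed partition (assumed Metropolis form).

    delta_score -- score of the proposed partition minus score of the current one
    """
    if delta_score >= 0:
        return 1.0                 # improvements are always accepted
    if temperature <= 0:
        return 0.0                 # at zero temperature, worse partitions are never accepted
    return math.exp(delta_score / temperature)

def accept_move(delta_score, temperature, rng):
    """Draw a random number and decide whether to accept the proposed partition."""
    return rng.random() < acceptance_probability(delta_score, temperature)

# The same score decrease is accepted more often as the temperature rises.
for temperature in (1, 10, 30):
    print(temperature, round(acceptance_probability(-5.0, temperature), 3))

rng = random.Random(123)
print(accept_move(2.0, 10, rng))
```

Under this assumed rule, a low temperature makes the chain behave almost like a pure hill-climber, while a high temperature lets it leave local maxima more easily.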

 

Note. The MCMC implemented on the web-served version of Pedigree 2.2 uses a simplified simulated annealing temperature schedule to optimize its performance. Simply stated, the total number of iterations is divided into 10 slices of 10% each. The MCMC starts running with the stated initial temperature. Each time the Markov chain moves into the next iteration slice, it decreases its temperature by 10% of the initial temperature, so that the temperature reaches zero for the last iteration. This can be seen in the execution log (section 3.1).
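One literal reading of this schedule can be sketched as follows. The function name and the handling of slice boundaries are our assumptions. Under this reading, the last slice runs at 10% of the initial temperature and zero would only be reached once the run is over, so the values actually used should be checked against the execution log.

```python
def temperature_at(iteration, total_iterations, initial_temperature, slices=10):
    """Temperature used at a given iteration (0-based) under a stepwise schedule."""
    # index of the slice this iteration falls in, from 0 to slices - 1
    slice_index = min(iteration * slices // total_iterations, slices - 1)
    # each new slice removes one tenth of the initial temperature
    return initial_temperature * (slices - slice_index) / slices

total = 500000
schedule = [temperature_at(k * total // 10, total, 10) for k in range(10)]
print(schedule)
```

For RUN_PARAM 500000 1 10 1 123, this gives a temperature of 10 for the first 50,000 iterations, 9 for the next 50,000, and so on.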

 

 

2.1.4 Weight

1, in RUN_PARAM 500000 1 10 1 123.

 

The fourth parameter (W) is the weight that is used in computing the partition score. Recall that this partition score is of the form:

 

Score = Sum {Log (W * LR)} - Sum {Log (W * LR)}

where the first sum is taken over all pairs that belong to the same proposed group, and the second sum over all pairs that belong to different proposed groups.

 

A weight of 1 is generally neutral and you should always start with it. In this case, the partition with highest score found in a run will generally have most pairs of individuals with pairwise likelihood ratios higher than one within the same groups, and pairwise likelihood ratios below one across different groups. This is often, but not always, optimal.

For example, when the data set actually contains some proportionally large full-sib families, the pairwise likelihood ratios estimated with the raw data are downwardly biased, and an optimal partition constructed with a weight of one will often have these large full-sib families split into several sub-groups. Similarly, when the data set contains a mixture of full-sibs and half-sibs, a weight of one is often too low to ensure that all related individuals end up in the same kin group. Increasing the weight is a way to correct for these problems, as it will force the coalescence of groups that seem related. However, it is a tool to use with much caution, particularly if you are operating without the full-sib constraint (more details in section 4.2).

 

Note. The scores of partitions obtained with different weights cannot be compared directly, because the weight changes the scaling of the partition score.

 

2.1.5 Seed

-1, in RUN_PARAM 500000 1 10 1 -1.

 

The last parameter (seed) is the seed to be used by the Markov Chain for this specific run. It can be entered by the user, in which case it has to be an integer between 0 and 65535. Alternatively, the seed can be randomly generated by Pedigree itself, in which case the parameter has to have the value -1 in the control file.

           

In general, it is best to use -1 for this parameter. The seed is then generated at random, and the Markov Chain follows a different path with each new run that has the parameter set to -1. This is important when we want to see how stable the results are. For example, if we run the algorithm three times on the same data set with exactly the same parameters (iterations, constraint, temperature and weight), how conserved are the 3 partitions that we obtain? This may help us determine whether we need to increase the number of iterations. Note that when we use this option, the seeds are randomly generated by Pedigree, but their actual values are provided with the outputs.

 

On the other hand, there are many situations when we may want to recreate exactly an earlier run and its associated results. Entering the exact seed will then force the Markov Chain to take the same deterministic path. If the same data set and the same other parameters are used in this run, Pedigree will then generate the same solution. This is quite useful in many circumstances. For example, the “Compare” utility is a powerful tool to compare different partitions and to nest a partition within another (see section 4). However, it is only possible to use that tool to compare partitions obtained within a single Job. Hence if we want to compare partitions obtained in different jobs, we simply have to use the appropriate seeds and parameters to re-create the partitions of interest in one new single job.  
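The principle behind replaying a run from its seed can be shown with a toy example in Python. It uses Python's own random number generator as a stand-in, which is certainly not the generator used by Pedigree, and both function names are hypothetical.

```python
import random

def resolve_seed(seed):
    """Mimic the -1 convention: draw a seed in 0..65535, otherwise keep the given one."""
    return random.randrange(65536) if seed == -1 else seed

def toy_run(seed, steps=5):
    """Stand-in for an MCMC run: returns the random draws made with this seed."""
    rng = random.Random(seed)
    return [rng.randrange(1000) for _ in range(steps)]

seed = resolve_seed(-1)            # randomly generated, but reported so it can be reused
print("seed used:", seed)
print(toy_run(seed) == toy_run(seed))
```

Because the generated seed is reported, a run started with -1 can later be repeated exactly by entering that seed explicitly, together with the same data set and the same other parameters.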

 

 

2.1.6 Optional parameters

 

The following 3 parameters have default values and probably should be left as such by relatively inexperienced users. The easiest way to achieve this is then simply not to have lines in the control file with these parameter names, and the program will choose the default values.

 

MISSING_VALUE any numerical value        ; default = 0
RATIO_METHOD FS or HS                     ; default = FS
SCORE_INTERVAL any numerical value        ; default = 10000

 

The missing value is simply the code that the program will use to denote an entry with missing value. Note that the program does not allow an individual to have partial missing genotype information. Either the individual has complete genotype information at that locus (e.g. genotype 123 153 at locus gmo13) or it has completely missing genotype information (e.g. genotype 0 0 at locus gmo13). The individual cannot have genotype 123 0 at locus gmo13, even if the user has confidence in one allele but not the other one for a particular individual. These cases have to be treated as completely missing genotype data.

 

The ratio method is at the heart of the MCMC and should not be changed except under special circumstances. The Full-sib (FS) ratio is the one that is used to reconstruct mixtures of both full-sibs and half-sibs. The half-sib ratio method (HS) is used mostly for developmental reasons and may be useful when a data set contains only half-sibs and no full-sibs.

 

The score interval is simply the number of iterations that separate the program’s partial outputs in the execution log (more in section 3.1). Unless you use a very small or a very large number of iterations (e.g. < 50,000 or > 10,000,000), there is no particular reason to change this parameter.

 

Final Note on the control file. Any line that doesn’t start with a recognized keyword is ignored. This allows for comment lines. Blank lines are permitted.
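
The rule above (unrecognized lines are ignored, optional parameters fall back to defaults) can be sketched as a small reader. This is only an illustration under stated assumptions: it knows just the keywords that appear in this section (RUN_PARAM and the three optional parameters), treats keywords as case-sensitive, and the function name parse_control is hypothetical. Pedigree itself may recognize other keywords.

```python
DEFAULTS = {"MISSING_VALUE": "0", "RATIO_METHOD": "FS", "SCORE_INTERVAL": "10000"}

def parse_control(text):
    """Return (options, runs) read from the text of a control file.

    Lines that do not start with one of the keywords known here are
    skipped, which is how comment lines and blank lines are tolerated.
    """
    options = dict(DEFAULTS)
    runs = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword, values = fields[0], fields[1:]
        if keyword == "RUN_PARAM" and len(values) == 5:
            runs.append(values)
        elif keyword in DEFAULTS and len(values) == 1:
            options[keyword] = values[0]
    return options, runs

example = """\
Comment: two replicate runs, defaults kept for the other options

RUN_PARAM 500000 1 10 1 -1
RUN_PARAM 500000 1 10 1 -1
SCORE_INTERVAL 50000
"""
options, runs = parse_control(example)
print(options)
print(runs)
```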

 

 

2.2 Data file

 

This is a text file that contains all the co-dominant genotype information as well as, optionally, a name for the data set itself and names for the individuals. The first line is optional and allows you to give a name to your data set that will be forwarded to the results. This is achieved by entering a name after the key word DATASET in the first line as in the following example.

 

DATASET mesDonnees

 

The real data is then entered after this optional line, with data from different individuals entered on different lines (e.g. 100 individuals = 100 lines). The first column (field) may contain the name of the various individuals. Alphanumeric values are acceptable for names. Maximum size is 24 characters. Special characters like hyphens or quotes are permitted, but spaces or tabs are not allowed. The program detects that names have been provided if the number of data columns is odd. It then assumes that the first column contains the sample names. If the number of data columns is even, then the program allocates a name in the form Sample0001, Sample0002, etc. to each individual line of data.

 

After the optional sample name (1st column), every other column contains the co-dominant genetic data, with genotype information entered in 2 separate columns per locus. For example, in a data set with individual names, column (field) 2 contains the first allele of the first locus and column 3 contains the second allele of the first locus, and so forth. Allele values have to be numerical (alphanumerical values are not acceptable) and can be simple ranks (e.g., allele 1, 2, 3 etc.), allele sizes in base pairs (allele 294, 298, 302), or any other type of numerical value. The two alleles in a genotype do not have to be ranked; the program will do this automatically, with the first allele value smaller than or equal to the second. However, it is strongly recommended that you enter your original data set with the same convention, as all the outputs from the Pedigree program follow this convention.

 

At each locus, missing data are acceptable but the missing entries have to be coded with a special missing value character. A blank field is not acceptable. The program uses 0 (zero) by default and you are encouraged to use the same convention. However, you can use another integer value if you specify it in the control file (see section 2.1.6). Note that the program does not allow an individual to have partial missing genotype information. Either the individual has complete genotype information at that locus (e.g. genotype 123 153 at locus gmo13) or it has completely missing genotype information (e.g. genotype 0 0 at locus gmo13). The individual cannot have genotype 123 0 at locus gmo13, even if the user has confidence in one allele but not the other one for a particular individual. These cases have to be treated as completely missing genotype data.

 

Final Note on the data file: A line may be commented out by placing an exclamation mark at the beginning. Data items may be separated by spaces, tabs, commas, or semicolons. Blank lines are permitted.
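
The data file rules of this section can be checked before submitting a job. The sketch below is a hypothetical pre-check, not the reader used by Pedigree: the function name parse_data_line is invented, the numbering used for default names (the index argument) is an assumption, and runs of separators are collapsed for simplicity.

```python
import re

def parse_data_line(line, index, missing=0):
    """Parse one data line into (name, [(allele1, allele2), ...]).

    Returns None for blank lines, '!' comment lines and the DATASET line.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("!"):
        return None
    fields = [f for f in re.split(r"[ \t,;]+", stripped) if f]
    if fields[0] == "DATASET":
        return None
    if len(fields) % 2 == 1:      # odd count: first field is the sample name
        name, alleles = fields[0], fields[1:]
    else:                         # even count: generate a default name
        name, alleles = f"Sample{index:04d}", fields
    genotypes = []
    for a, b in zip(alleles[0::2], alleles[1::2]):
        a, b = int(a), int(b)     # non-numerical alleles raise ValueError
        if (a == missing) != (b == missing):
            raise ValueError(f"{name}: partially missing genotype {a} {b}")
        genotypes.append((min(a, b), max(a, b)))  # smaller allele first
    return name, genotypes

named = parse_data_line("fish-01 153 123 0 0", 1)
unnamed = parse_data_line("294;298;302;302", 7)
print(named, unnamed)
```

A genotype such as 123 0 raises an error here, which mirrors the rule that a locus must be either fully typed or fully missing.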

 

 

2.3  Job name, Start-up partition, and submitting the job

 

After supplying a control file and a data file, the third step when submitting a job to Pedigree is to give it a job name. This name has no formal length limit. The characters allowed are alphanumerical characters, spaces, hyphens and underscores. The name makes it easier to identify this particular job later among the many others that you may have run. If you don’t supply a job name, the job is assigned by default a name consisting of the current date and time (Dalhousie University time).

 

The next optional step is to submit a start-up partition file. By default, Pedigree starts an MCMC run with the “atomic partition”, i.e. the partition where every individual is initially assigned to its own unique group (e.g., 303 individuals in 303 different groups of one), and then tries to maximize the partition score by moving individuals among groups. This is a good strategy that will work in most circumstances.

It is however sometimes useful to start with a partition other than the atomic partition, and Pedigree offers that option. For example, this may be used to see if Pedigree can find a better partition (one with a higher score), possibly with different parameters for temperature and number of iterations, than a particular partition found in an earlier run. This is achieved by saving on your local computer a particular output text file generated by Pedigree during that earlier run (the Group file, see section 3.3.1). You can then rename that text file to any name of your choice. Later you can specify this file as a start-up partition. Pedigree will then upload this file, reconstruct the partition from the data it contains, and start the MCMC run from that particular partition. If no partition with a higher score is encountered during a long MCMC run, then this new run will produce the same final partition as the original one, a good indication that this start-up partition may have as high a score as can be found. Remember however that this comparison of scores is only valid if the same weight was used. The scores of partitions obtained with different weights cannot be compared directly, because changing the weight changes the scaling of the partition score.

 

Caution! You should pay utmost attention to the value of the full-sib constraint in both your start up partition and your new MCMC run. The following table indicates what is allowed and what is not.

 

 

 

 

Start-up partition      New MCMC run      Allowed?
FSC = 0                 FSC = 0           OK
FSC = 0                 FSC = 1           NO
FSC = 1                 FSC = 0           OK
FSC = 1                 FSC = 1           OK

 

You cannot do an MCMC run with the full-sib constraint (FSC = 1) from a start-up partition that was obtained without the full-sib constraint (FSC = 0). Such a start-up partition most likely violates the full-sib constraint. The partitions proposed by the MCMC from that starting point would also most probably violate the constraint and would be rejected outright, so the MCMC would not be able to move away from the start-up partition. However, Pedigree does not check that the start-up partition itself satisfies the constraint. The new run with the full-sib constraint will then wrongly output the start-up partition as its optimal solution.
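
Because Pedigree does not perform this check itself, it can be worth doing on the user side before submitting. The following is a minimal sketch with a hypothetical function name; it encodes only the single forbidden combination described above.

```python
def startup_partition_allowed(startup_fsc, new_run_fsc):
    """Return False only for the forbidden case: a start-up partition obtained
    with FSC = 0 used in a new run with FSC = 1."""
    return not (startup_fsc == 0 and new_run_fsc == 1)

for startup in (0, 1):
    for new_run in (0, 1):
        verdict = "OK" if startup_partition_allowed(startup, new_run) else "NO"
        print(f"start-up FSC = {startup}, new run FSC = {new_run}: {verdict}")
```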

 

Clicking the “Submit” button will then upload the control and data files, along with the optional job name and start-up partition file. A window will open to let you know if the file uploads were successful. The job will be placed in a queue, and will start to be processed after a maximum of 5 minutes. You should click “back” on your browser to come back to your JOBS page, where you can check the status of your new job(s) by clicking on the “Update” button at regular intervals (see section 1). Alternatively, you can tick the “Every minute” check box, in which case Pedigree will automatically update the page every minute.

 


3. Viewing detailed results associated with a completed Job

 

On your job page, you can consult the detailed results associated with any job that is complete, i.e. that shows a green “100% done” in its Status line. Clicking on the “View details” button will take you to a JOB DETAIL page, where you will find your organization and user name as well as the name of the specific job that you are presently consulting in the upper right portion of the page.

 

Note. Many of the functions and tools described in sections 3, 4, 5 and 6 involve opening very large html formatted files. This can sometimes take a while, and the user is well advised to exercise patience. Some browsers will also sometimes display a previously stored local copy of a page if loading the new page takes too long. The user then needs to force the browser to “reload” the page, by pressing F5 with Internet Explorer or the Reload button with Netscape.

 

3.1 Summary files on the control, data, and execution of the various runs

 

 

 

 

Below your name and job name, you are given access to a number of raw (text) files (Controls, Input, Data status, Execution log, Allele frequencies) and one large html formatted file (Sibling likelihood ratios). These files are generated once per job. Except for the Execution log, they contain summary information about the genetic data and the control parameters used. They should be consulted in detail the first time you run analyses on a new data set, as they can point to possible problems in data acquisition.

 

Clicking on “Control” will transfer you to a text file window where the control file that was uploaded and used by Pedigree is displayed. This text file window also shows the date the job was submitted, the name and location of the original control and data files on the user’s personal computer, the name and location of the start-up partition file if one was used and, most importantly, the specific seeds that were used by the computer for each run. All this information is needed to archive the job for future replay (section 3.4). Clicking on the “Back” button of your browser will take you back to the Job Detail page.

Clicking on “Input” will do the same for the genetic data file that was uploaded and used by Pedigree. The “Data Status” file contains useful summary information about the data file, such as the name of the dataset, the date it was submitted, the number of individuals and the number of loci in the data set. In addition, for each locus, the number of alleles observed and the number and names of individuals with missing data are displayed. The “Allele frequencies” file has the same structure and display. In addition to the same summary information, it gives the name and relative frequency of each allele at each locus. It is a good idea to inspect these files the first few times you work on a particular data set. This may help you detect some potential problems, for example by comparing them with similar summary output generated by other programs.

 

The “Execution log” is also a very important file for troubleshooting. It is a very large text file that records, in a fairly self-explanatory way, all the operations that Pedigree performed during this job. In particular, it displays information on how Pedigree read the information from the input and control files (and optionally the start-up partition file). This can help detect whether inappropriate characters in these text files prevented correct or complete reading of the necessary information. The Execution log also displays, for each run, the seed that was used and provides a detailed output of the run. It also shows the change of temperature at the appropriate iteration (see section 2.1.3). This allows a detailed examination of the change of the scores according to iteration number, which can be useful to estimate whether the total number of iterations needs to be changed. The same information is also displayed graphically (section 3.2) in a more user-friendly way.

 

Sibling likelihood ratio

 

The last summary file accessible in this section of the Job Detail page is a formatted (html) page that stores the pairwise likelihood ratios for all pairs of individuals in the data file.

 

 

 

Note. This is a very large html formatted file. Loading this page can sometimes take a while, and the user is well advised to exercise patience. Some browsers will also sometimes display a previously stored local copy of a page if loading the new page takes too long. The user then needs to force the browser to “reload” the page, by pressing F5 with Internet Explorer or the Reload button with Netscape.

 

The upper graph represents the distribution (on a decimal log scale) of the calculated full-sib to unrelated likelihood ratios for every pair of individuals in the data set (full-sibling likelihood ratios in the figure legend). These ratios are the ones used by default in the calculation of the partition score (see section 2.1.6). The distribution of these ratios gives some visual indication as to the presence of potentially related individuals in the data set. Most truly related pairs of individuals should have pairwise ratios greater than 1, while most truly unrelated pairs should have pairwise ratios lower than 1. A distribution skewed to the right (toward large pairwise full-sib to unrelated ratios) is an indication that there may be related individuals in the data set. Of course, this depends as well upon the number of loci used, and the separation between potentially related pairs and potentially unrelated pairs is clearer in data sets with a large number of loci.

 

The lower graph is similar, but presents the distribution of calculated half-sib to unrelated ratios for every pair of individuals in the data set (Half-sibling likelihood ratios in the figure legend). As explained in section 2.1.6, this is not the ratio normally used for building the partition score, although this can be “forced” by providing the appropriate parameter in the control file. The half-sib ratio method (HS) is used mostly for developmental reasons and may be useful when a data set contains only half-sibs and no full-sibs. A distribution skewed to the right (toward large half-sib to unrelated pairwise ratios) is again an indication that there may be related individuals in the data set.

 

The lower part of the window allows opening frames with Sample to Sample ratios, so that the user can potentially find the actual value of both full-sibling and half-sibling ratios for any specific pair of individuals in the data set.

 

 

3.2 Graphical representation of the evolution of the partition score according to the iteration number

 

On the lower part of the JOB DETAIL window, the user will find a graph depicting the evolution of the partition score according to the iteration number for the various runs.

 

This graphic interface allows, at a quick glance, an evaluation of the change of the partition score over the prescribed number of iterations, for each of the different runs in the job. You can zoom in by selecting an area you want to enlarge and zoom out of a particular area by clicking on the [-] button in the graph, or by resizing and moving the right and bottom sliders. This is a versatile and intuitive tool to get a feel for the number of iterations that should be used. The partition score should increase very fast initially but then plateau, with few changes in score toward the last third of the run. If you still see large changes in score toward the end of the run, then you should increase the run length. Remember however that changing the weight changes the score (see section 2.1.4), so you cannot compare directly different runs obtained with different weights.

 

3.3 Consulting the partition associated with each run

 

 

The following summary table in the JOB DETAIL page allows consulting detailed results associated with the partition with the highest score found in each run. Remember that there is one run for each line of essential parameters entered after the RUN_PARAM keyword in the control file (see section 2.1). For these different runs, the summary table indicates the value of the five essential control parameters, the number of groups (full-sib groups or kin groups, according to the FSC constraint value in the parameters, see section 2.1.2) into which the individuals were allocated in the partition with the highest score in that run, as well as the actual score of this partition.

 

The scores of the partitions obtained in different runs allow these partitions to be compared, subject to certain rules. It is OK to compare directly the scores of different partitions as long as the runs were generated with the same weight and the same full-sib constraint. It does not matter if different temperatures and numbers of iterations were used. On the contrary, given the enormous number of potential partitions, and the presence of MCMC variance across runs, it is generally a good idea to try various combinations of number of iterations and possibly temperature to try to find the best possible partition (the partition with the highest score for a given weight and full-sib constraint). Of course the seed number should differ among runs, otherwise Pedigree will follow a deterministic path and generate the same solution (see section 2.1.5).

 

Comparing the scores of partitions obtained with the same weight but with different full-sib constraints is not inherently wrong, but it is not very informative either. By its very nature, a kin partition (FSC=0) will practically always reach a higher score than a full-sib partition (FSC=1) because it can explore groupings that are not allowed for the full-sib partition. Comparing the scores of partitions obtained with different weights is not allowable, as changing the weight changes the scale of the score. The comparison (and choice) of partitions obtained with different weights is covered in more detail in section 4.2.
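
The comparison rules above can be sketched as follows, working from values copied by hand out of the summary table. The Run record and the function name are hypothetical, and the scores in the example are made up for illustration except for the one taken from the Groups file example in section 3.3.1.

```python
from collections import namedtuple

# One record per run: the five control parameters plus the best score found.
Run = namedtuple("Run", "iterations fsc temperature weight seed score")

def best_per_setting(runs):
    """Return the highest-scoring run for each (weight, fsc) combination.

    Scores are only compared between runs that share the same weight and the
    same full-sib constraint; temperature and iterations may differ freely.
    """
    best = {}
    for run in runs:
        key = (run.weight, run.fsc)
        if key not in best or run.score > best[key].score:
            best[key] = run
    return best

runs = [
    Run(1000000, 1, 10, 1, 25790, 468430.59),  # from the section 3.3.1 example
    Run(500000, 1, 30, 1, 111, 468000.00),     # made-up score
    Run(500000, 0, 10, 1, 222, 470000.00),     # made-up score, kin partition
    Run(500000, 1, 10, 2, 333, 900000.00),     # made-up score, other weight
]
best = best_per_setting(runs)
for (weight, fsc), run in sorted(best.items()):
    print(f"weight={weight} FSC={fsc}: seed {run.seed}, score {run.score}")
```

Grouping on both weight and constraint keeps the kin partition and the run with a different weight out of the ranking of the two comparable full-sib runs.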

 

The summary table also allows opening files to consult in detail the partitions obtained in each run. Under the “View” category, a text file (Groups) can be opened and, if needed, downloaded to the user’s local computer (see section 3.3.1). Clicking on the Partition button will open a new (large) html window and allow in-depth consultation of the partition (see section 3.3.2). Lastly, under the “Compare with run” category, a number of html windows will open and allow in-depth comparisons of various partitions. This is a full topic in itself and is covered in section 4.

 

 

3.3.1 “Groups” text file

 

Clicking on the “Groups” button will transfer you to a text file window containing a condensed version of the partition with the highest score found in this run. A partial example is found below:

 

 

--------------------------------------- Pedigree v2.2  ---
DataSet     :                                2/14/2005
# samples   :  292
# loci      :    9
----------------------------------------------------------
run #       :  2                      Seed :  25790
Temperature : 10.0    Full-sibbling Const. :  ON
Weight      :  1.0    Stopped at iteration :  1000000
----------------------------------------------------------
Number of groups  :  44     Partition score :        468430.59
Computation time  :     21.160
Number of related :   2804  Number of moves :    248
group sizes      :    38   37   26   25   22   18   13   10    9    8
group sizes      :     7    6    5    5    4    4    4    3    3    3
group sizes      :     3    3    2    2    2    2    2    2    2    2
group sizes      :     2    2    2    2    2    2    1    1    1    1
group sizes      :     1    1    1    1
----------------------------------------------------------
Sample           Group GroupSize
 1                 30     2
 2                  3    26
 3                  1    38
 4                 19     3
 5                  1    38
 6                  1    38
 7                 11     7
 8                 13     5
 9                  1    38
 10                 3    26
 11                 7    13
 12                 5    22
 13                34     2
 14                12     6
 15                14     5
 16                 1    38
 17                31     2
 18                 1    38
 19                 4    25
 20                 1    38

etc, until #292

 

The header part of the file contains, in a fairly self-explanatory way, essential information on that partition: the date the run took place, the Dataset used (if provided by the user in the data file, see section 2.2), the number of samples and the number of loci in the data set, and all relevant control parameters. It then provides the number of groups (44 here), the partition score, and the computation time needed (in seconds). “Number of related” and “Number of moves” are two overall measures of the amount of relatedness detected in the partition that can be useful to test for the significance of the proposed partition (see section 6.1). Next comes the distribution of the group sizes. In the above example, the biggest group had 38 individuals, followed by another group with 37 individuals and so forth until the 44th group of size 1. It should be noted that the groups are numbered and ranked from the biggest to the smallest group in every partition. The lower part of the file contains the group allocation and the group size for every individual in the data set, with 1 individual per line.
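
For users who want to move the group allocations into a spreadsheet or database, the lower part of the file can be read with a few lines of code. The sketch below is a hypothetical reader that assumes the layout of the example above (a “Sample Group GroupSize” header line followed by three whitespace-separated fields per individual); the function name is invented and the input is an excerpt of that example.

```python
def parse_groups(text):
    """Return {sample: group number} from the lower part of a Groups file."""
    assignments = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:3] == ["Sample", "Group", "GroupSize"]:
            in_table = True          # allocation lines start after this header
            continue
        if in_table and len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            assignments[fields[0]] = int(fields[1])
    return assignments

excerpt = """\
Number of groups  :  44     Partition score :        468430.59
----------------------------------------------------------
Sample           Group GroupSize
 1                 30     2
 2                  3    26
 3                  1    38
 4                 19     3
 5                  1    38
"""
assignments = parse_groups(excerpt)

# Invert the mapping to list the members of each group.
members = {}
for sample, group in assignments.items():
    members.setdefault(group, []).append(sample)
print(members)
```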

 

This “groups” text file is fairly small and contains all the relevant group information. Hence it can be easily downloaded and the information entered in other data management tools such as a database or a spreadsheet for example. This file can also be saved as such and then up-loaded to Pedigree as a Start-up partition file (see section 2.3).

 

 

3.3.2 “Partition” html formatted files

 

Clicking on the “Partition” button will transfer you to an html file window containing the same group information as the “Groups” text file described in the previous section, but also containing all the individual and group genotype information. Exploratory tools in this page allow an in-depth exploration of the partition with the highest score found in this run.

 

Note. This is a large html formatted file. Loading this page can sometimes take a while, and the user is well advised to exercise patience. Some browsers will also sometimes display a previously stored local copy of a page if loading the new page takes too long. The user then needs to force the browser to “reload” the page, by pressing F5 with Internet Explorer or the Reload button with Netscape.

 

 

 

The header part of the file is presented in a similar way to the text-formatted Groups file (section 3.3.1) and contains, in a fairly self-explanatory way, essential information on that partition: the date the run took place, the Dataset used (if provided by the user in the data file, see section 2.2), the number of samples in the data set and all relevant control parameters. It then provides the number of groups (44 here), the partition score and the computation time needed (in seconds). “Number of related” and “Number of moves” are two overall measures of the amount of relatedness detected in the partition that can be useful to test for the significance of the proposed partition (see section 6.1). Next comes the distribution of the group sizes. In the above example, the biggest group had 38 individuals, followed by another group with 37 individuals and so forth until the 44th group of size 1. It should be noted that the groups are numbered and ranked from the biggest to the smallest group in every partition.

 

For each group, the list of individuals is provided first, followed, locus by locus, by the number and distribution of genotypes found in that group. In the example above, the proposed group 1 (38 individuals) shows 4 genotypes at locus 1. Eight individuals in the group have genotype [228_248], out of a total of 44 individuals in the data set with that genotype at that locus; another 8 individuals have genotype [228_236], out of 12 such occurrences in the data set, and so forth.

The first major exploratory tool available in this page is that both individual sample numbers and genotypes can be highlighted. Clicking on a particular sample number in any group will highlight in red the specific genotypes of that individual at every locus, as well as all other occurrences of the same genotypes in other groups. Alternatively, clicking on a particular genotype at a particular locus will highlight every individual in the data set that shares this genotype at this locus. These capabilities allow a user to easily scan the partition and find individuals and genotypes of interest. Please note however that highlighting individuals and genotypes can take a while if the html file is large.

The second exploratory tool is that, for each group, a “Parents?” button opens a new html window containing all putative parental genotypes for this group. This aspect is described in detail in section 5.
 
 
               Cohesion and repulsion 
 
The file also provides two measures associated with each group, the cohesion score and the repulsion score. The cohesion score of the group is the average, over all pairs of individuals in that group, of the decimal logarithm of the pairwise likelihood ratios used for pedigree reconstruction. A high cohesion score indicates that, on average, the individuals appear strongly related within the group. The repulsion score of the group is a similar measure. It is the average, over all pairs of individuals in the data set with one individual in that group and the other individual in another group, of the negative of the decimal logarithm of the pairwise likelihood ratios. A high repulsion score indicates that, on average, the individuals in that group appear strongly unrelated to the individuals in other groups. These two measures, in particular the cohesion score, are useful when assessing the statistical significance of specific groups through data randomization trials (see section 6.2).
 
In the example above, group #1 comprises n = 38 individuals out of a total of N = 292 individuals in the data set. There are n*(n-1)/2 pairwise comparisons in a group of size n, so here we have 38*37/2 = 703 pairwise likelihood ratios within group #1. The cohesion score of 1.49 is the arithmetic mean of the decimal logarithms of the pairwise likelihood ratios over these 703 pairs. A more intuitive way to think about this measure is that 10^1.49 (= 30.9) is the geometric mean of the 703 pairwise likelihood ratios, so pairs of individuals within group #1 are, on average (a geometric average), about 31 times more likely to be full-sibs than to be unrelated.

We also have n*(N-n) pairwise comparisons across group #1, i.e. with 1 individual in group #1 (size n) and the other individual not in group #1, so here we have 38*(292-38) = 9652 pairwise likelihood ratios “across” group #1. The repulsion score of 2.3496 is the negative of the arithmetic mean of the decimal logarithms of the pairwise likelihood ratios over these 9652 pairs. A more intuitive way to think about this measure is that 10^2.3496 (= 223.66) is the geometric mean of the inverse of the 9652 pairwise likelihood ratios, so pairs of individuals across group #1 are, on average (a geometric average), about 224 times more likely to be unrelated than to be full-sibs.
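
The two definitions and the arithmetic of the worked example can be sketched in a few lines of Python. This is an illustration of the definitions as stated in this section, not Pedigree's own code: the function names, the dictionary of pairwise ratios keyed by pairs of sample names, and the toy ratios are all assumptions made for the example.

```python
import math
from itertools import combinations

def cohesion(group, ratios):
    """Mean of log10(ratio) over all pairs of individuals inside `group`."""
    logs = [math.log10(ratios[frozenset(pair)]) for pair in combinations(group, 2)]
    return sum(logs) / len(logs)

def repulsion(group, others, ratios):
    """Negative mean of log10(ratio) over pairs with exactly one member in `group`."""
    logs = [math.log10(ratios[frozenset((a, b))]) for a in group for b in others]
    return -sum(logs) / len(logs)

# Toy data: three individuals in the group, two outside it.
group, others = ["a", "b", "c"], ["d", "e"]
ratios = {frozenset(p): r for p, r in [
    (("a", "b"), 10.0), (("a", "c"), 100.0), (("b", "c"), 1000.0)]}
ratios.update({frozenset((g, o)): 0.01 for g in group for o in others})

toy_cohesion = cohesion(group, ratios)            # mean of 1, 2 and 3
toy_repulsion = repulsion(group, others, ratios)  # minus the mean of six values near -2

# Pair counts and back-transformed scores for the example in the text.
n, N = 38, 292
pairs_within = n * (n - 1) // 2   # 703
pairs_across = n * (N - n)        # 9652
geo_mean_within = 10 ** 1.49      # geometric mean of the within-group ratios
geo_mean_across = 10 ** 2.3496    # geometric mean of the inverse across-group ratios
print(toy_cohesion, toy_repulsion, pairs_within, pairs_across)
```

Note that the cohesion of a group of size 1 is undefined in this sketch, since such a group contains no pairs.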
 
Note that the pairwise likelihood ratio used for pedigree reconstruction, and therefore used to compile the cohesion and the repulsion scores, is by default the pairwise likelihood ratio of being full-sib versus being unrelated, although this can be changed with the appropriate parameter in the control file (see section 2.1.6).
 
Lastly, two buttons are located towards the beginning of the page, just before the group information. “Definitions” will open a window with several definitions of the partition score as well as the definitions of the various distances, of the likelihood ratios, and of the repulsion and cohesion scores used by the program. “Group/Group repulsion” moves you towards the end of the html partition page, where a color matrix allows a quick evaluation of the “Repulsion/Attraction” between the different groups in this partition. This Group/Group repulsion measure is similar to the group repulsion score presented above. It is the average, over all pairs of individuals in the data set with one individual in the first group and the other individual in the second group, of the negative of the decimal logarithm of the pairwise likelihood ratios. A high repulsion score indicates that, on average, the individuals in the first group appear strongly unrelated to individuals in the second group, while a low repulsion (below zero) indicates on the contrary some degree of attraction between the two groups.
 

 

 

This color matrix allows a quick inspection and identification of the various groups that seem to be related. This is very useful in several contexts. As mentioned earlier (section 2.1.4), large full-sib groups or kin groups are sometimes split into subunits. These split groups should show a low degree of repulsion (color toward red) toward one another in this matrix. Another common situation is the case of full-sib families nested within a half-sibship, e.g. progenies of one male crossed with several females. If the partition was generated with the full-sib constraint on (see section 2.1.2), the different full-sib families will generally be separated but will also show a low degree of repulsion toward one another, as they are related at the half-sib level. This matrix thus allows a quick inspection of the groups in the partition and complements the more formal approach (“Compare Partition”) described in section 4.

 

 

3.4 Archiving a completed Job

 

As mentioned at the end of section 1, it is a good idea to delete jobs that are no longer useful. If, however, the user feels that he/she may want to consult these results at a later time, it is very easy to archive a job. On his/her personal computer, the user simply saves in a safe location the control file, the data file and, optionally, the start-up partition file if one was used in that particular job. If the seeds used in the different runs were randomly generated by Pedigree, i.e. they had a -1 value in the original control file (e.g. RUN_PARAM 500000 1 10 1 -1), the user then needs to identify the specific values of the seeds that were used by the program for each of the different runs in that job. These seeds can easily be found in the “Groups” and “Partition” files and can be located as well in the execution log. However, the easiest way is simply to open the control file associated with the job of interest from the JOB DETAIL window (see section 3.1) and save this file back on the user’s personal computer. All the necessary information is located in that file. The user can also use comment lines (see section 2.1.6) in that control file to add other useful information, such as the job name that was used, the name of the start-up partition if one was used, or any other relevant information.

 

To re-create a previous job, the user simply creates a new copy of the control file and replaces the -1 values with the true seed values that were actually used by Pedigree. These files, i.e. the updated control file, the original data file and optionally the original start-up partition file, can then be used to submit a new job that will produce results identical to those of the original job.
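The seed replacement described above can also be scripted. The following is a minimal Python sketch and is not part of Pedigree. It assumes that the seed is the last field of each RUN_PARAM line (as in the example RUN_PARAM 500000 1 10 1 -1) and that every other line of the control file can be left untouched. The function name fill_in_seeds and the seed values shown are hypothetical.

```python
def fill_in_seeds(control_text, seeds):
    """Replace each -1 seed on a RUN_PARAM line with the seed actually used.

    Assumption: the seed is the LAST field of a RUN_PARAM line.
    `seeds` lists the true seeds in the same order as the runs.
    """
    remaining = iter(seeds)
    updated = []
    for line in control_text.splitlines():
        fields = line.split()
        # "\u20131" covers a -1 typed with an en dash by a word processor.
        if fields and fields[0] == "RUN_PARAM" and fields[-1] in ("-1", "\u20131"):
            fields[-1] = str(next(remaining))
            line = " ".join(fields)
        updated.append(line)
    return "\n".join(updated)


# Two runs whose seeds were drawn at random by the program.
original = "RUN_PARAM 500000 1 10 1 -1\nRUN_PARAM 500000 1 30 1 -1"
# Hypothetical seed values, as they would be read from the archived job.
print(fill_in_seeds(original, [48213, 90177]))
```

Lines that do not start with RUN_PARAM, including any comment lines, pass through unchanged, so notes added to the archived control file are preserved.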


4. Comparing partitions within a job

 

 


 

 

Clicking on the appropriate link within the “Compare with Run” right portion of the table allows comparing the partitions obtained in any two runs within a Job. This is a central tool to reconstruct pedigrees.

 

This tool is used for two purposes:

1. It can be used to compare similar partitions to see how conserved particular partitions are across different MCMC runs, to compare partitions obtained with different weights, and to ultimately help identify the optimal full-sib partition and kin partition associated with this data set.

2. It is also used to nest the optimal full-sib partition within the optimal kin partition. This helps identify genotype errors and identify half-sib relationships among the various family groups.

 

4.1 Comparing partitions obtained with the same weight

 

As mentioned earlier, the space of possible partitions is gigantic. Different MCMC runs will generally generate different or slightly different solutions. The user then needs to try several runs with the same parameters and several runs with different parameters (temperature and number of iterations) to explore and compare the various solutions and ultimately decide upon the optimal full-sib and kin group partition that will be used for the next steps (sections 4.2 and 4.3). For example, once you have found a solution that seems to be stable over several trials with a temperature of 10, you can try a longer run with higher temperature (for example 30) to see if you still get the same or nearly the same solution.

 

We start with the simple case of comparing different partitions obtained with the same weight. This will generally be a weight of 1, although in some cases a higher weight will be used (see section 4.2). For example, in the job detail figure presented above, the results of eight different runs are presented. The first four runs used the same set of parameters (number of iterations, weight and temperature) to generate four full-sib partitions, while the last four runs again used a common set of parameters to generate four kin partitions.

 

Among the four kin partitions, it can be seen that the same solution with the highest score was attained twice, in runs 05 and 08. This would be a good candidate for our best kin partition. The strategy would then be to see if we can get a higher score (still with a weight of 1) by extending the number of iterations and varying the temperature. Ultimately, if no higher-scoring partition is obtained, we would use this partition as a start-up partition in a few trials with fairly long runs and different temperatures to check that we indeed cannot do better. If this is the case, then this solution would be our best kin partition.

The four full-sib partitions also appeared to generate similar solutions with fairly similar scores and they identified 44 or 45 full-sib groups. Run 02 and Run 01 had the two highest scores (468430.59 and 467812.25 respectively). Since these partitions were generated with the same weight (weight  = 1 here), the scores of these can be directly compared and the best partition among the four appears to be partition 02. 

 

Using the compare function allows the user to detect the differences between partitions 02 and 01. This can be done by clicking the “01” in the second line of the “Compare with run” window. You should generally first choose the line corresponding to what seems to be the best partition, i.e. the more compact partition or, if you used the same weight, the one with the higher score. Here it is partition 02. This will be your reference partition; you then click on the specific comparative partition (here partition 01) in that line. This convention is most important when comparing fairly different partitions, for example partitions obtained with different weights (see section 4.2). The partition obtained with the higher weight should be more compact (fewer groups) and should be used as your reference. The line corresponding to this partition should be chosen first and then the less compact partition (obtained with a lower weight) should be chosen within that line. The same is true when comparing a more compact kin partition to a presumably less compact full-sib partition (see section 4.3). Here our two partitions obtained in run 02 and run 01 are presumably fairly similar. If you did not follow the convention and chose partition 01 as your reference run and compared it to run 02 (by clicking the “02” in the first line of that window), you would still be provided with fairly similar functionalities in the “Compare” window.

 

Clicking the “01” in the second line of the “Compare with run” window will open a new window:

 

 

 

 

 

The upper left and right panels are fully scrollable. The left panel lists all the groups identified in the reference partition (here partition 02) with all the individual samples and shows where these same individuals were distributed in the comparative partition (here partition 01). The right panel represents the converse: it lists all the groups identified in the comparative partition (i.e. 01 here) with all the individual samples and shows where these same individuals were distributed in the reference partition (02 here). The two panels also provide the cohesion and repulsion scores associated with each group in the two partitions.

 

It should also be noted that every individual sample can be clicked on. Clicking on individual sample # 19 in the left panel (this individual was found in group 4, of size 25, in reference partition 02) scrolls the right panel so that the group where the same individual # 19 was found in partition 01 is placed at the top of the window (it was found in group 6, of size 17, in comparative partition 01, as can be seen in the right panel of the corresponding image). You can click on a sample number in either the left or the right panel and it will have the same effect on the opposite panel (moving the group containing that individual to the top of the window).

 

Hence these two upper windows allow a quick and intuitive exploration of the similarities and differences between the two partitions being compared. In the particular example of partition 02 and partition 01 here, most large groups were completely conserved across the two partitions except for one group that was somewhat reshuffled. Individuals in very small groups of size 3 and less were more commonly reshuffled. This is a fairly typical observation. Small groups that appear unrelated to other larger groups are often not very stable across different trials, simply indicating that there is often not enough information to allocate these individuals with much certainty into stable, small groupings. 

 

The large group that was somewhat reshuffled between partitions 02 and 01 is group # 4 from partition 02 (see image above). It comprised 25 individuals. These 25 individuals were allocated to two different groups in partition 01. Sixteen of these individuals were found in group #6 in run 01, a group of total size 17. This can be seen in the image above, where these 16 individuals are listed in group #6 (16/17) of partition 01. The remaining 9 individuals were found in another group, #8, of size 12. In this example, we would probably want to have a look at these 25 individuals that were somewhat reshuffled between partitions 01 and 02 to decide if partition 02 (with the higher score) indeed appears more credible than partition 01. We would probably also want to look at the few extra individuals that were added to the 2 groups in partition 01 (1 individual in group #6 and 3 individuals in group #8) but that were not added to group # 4 of 25 individuals in our better partition 02. This can be done with the tools available in the lower panel of the page, which allow a more in-depth look at specific individual genotypes and at collections of genotypes in groups of individuals of interest.
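The bookkeeping behind this kind of comparison can be illustrated outside of Pedigree. The sketch below is only an illustration, under the assumption that each partition is available as a mapping from individual sample number to group number; the function name cross_tabulate, the sample numbers and the reference group 99 are invented for the example.

```python
from collections import Counter, defaultdict


def cross_tabulate(reference, comparative):
    """Count how the members of each reference group spread over comparative groups.

    Both arguments map an individual sample number to a group number.
    Returns (table, sizes): table[ref_group][comp_group] is the number of shared
    individuals and sizes[comp_group] is the full size of the comparative group.
    """
    table = defaultdict(Counter)
    for individual, ref_group in reference.items():
        table[ref_group][comparative[individual]] += 1
    sizes = Counter(comparative.values())
    return {g: dict(c) for g, c in table.items()}, dict(sizes)


# Toy data mimicking the example: group 4 (25 individuals) in partition 02.
# Samples 26-29 and their reference group 99 are invented.
partition_02 = {i: 4 for i in range(1, 26)}
partition_02.update({i: 99 for i in range(26, 30)})
partition_01 = {i: 8 for i in range(1, 10)}          # 9 individuals
partition_01.update({i: 6 for i in range(10, 26)})   # 16 individuals
partition_01.update({26: 6, 27: 8, 28: 8, 29: 8})    # the extra 1 + 3

table, sizes = cross_tabulate(partition_02, partition_01)
for comp_group, shared in sorted(table[4].items()):
    print(f"group {comp_group} ({shared}/{sizes[comp_group]})")
```

With these toy data the loop prints group 6 (16/17) and group 8 (9/12), i.e. the same shared/total notation that is used in the panels.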

 

Clicking on any individual sample number in either the left or the right panel will load the individual sample number and its genotype into the lowest box of the lower panel. Note that hovering the mouse pointer over an individual sample number in either panel shows the individual’s genotype in the bottom part of the compare window, but you need to click on the sample number to load its genotype into the lower box of the lower panel.

 

Clicking in the upper left panel over the group number in the reference partition, or on the group number of the comparative partition (in the upper left panel again) will load the appropriate collection of genotypes in the upper box of the lower panel. This will list the various genotypes as well as the number of times each genotype was observed in the group or subgroup of interest. For example, clicking on the “4” in “Group in partition 02” will load the genotypes and their distribution observed in the 25 individuals of group # 4 in partition 02. This is a group of interest, as it was not conserved in partition 01. In partition 01, the 25 individuals were in two different groups, 16 of them in group # 6 (of size 17) and 9 in group # 8 (of size 12). If you were to click on the “6” in “Group in partition 01”, this would load the genotypes and their distribution observed in the specific 16 individuals, out of the 17 in group # 6 of partition 01, that came from group # 4. This allows a quick comparison of the list of genotypes observed in the large full-sib group (#4, size 25 in partition 02) with that observed in the smaller subset of these same individuals (the 16 individuals seen in group #6 of partition 01).

 

 

 

As described above you can load a specific individual genotype in the lower box in that lower panel and you can load a list of genotypes of a group in the upper box in that lower panel as well. When you do this, Pedigree compares at each locus the genotype of the specific individual with the list of genotypes in the group above. Each locus-specific genotype of the individual of interest that is not seen in the group list is then highlighted by placing *** *** below it.  This is useful to identify reasons why an individual is not placed in a specific group, and in particular it allows identifying possible genotype errors. This is described in section 4.3.
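The locus-by-locus comparison described above can be sketched as follows. This is an illustration only, not Pedigree’s own code: the function name flag_unseen_loci and the data layout are assumptions, missing-value genotypes are not treated specially, and all allele sizes except 307_355 and 303_355 (taken from the example in section 4.3.1) are invented.

```python
def flag_unseen_loci(individual, group_genotypes):
    """Return the (1-based) loci where the individual's genotype is absent from the group list.

    `individual` is a list of (allele, allele) tuples, one per locus.
    `group_genotypes` is a list, one entry per locus, of the genotypes seen in the group.
    Allele order within a genotype is ignored.
    """
    flagged = []
    for number, (genotype, seen) in enumerate(zip(individual, group_genotypes), start=1):
        seen_sorted = {tuple(sorted(g)) for g in seen}
        if tuple(sorted(genotype)) not in seen_sorted:
            flagged.append(number)   # this locus would be marked *** ***
    return flagged


# Two loci only; locus 1 values are invented.
individual = [(120, 124), (307, 355)]
group = [
    [(120, 124), (120, 128)],   # genotypes seen in the group at locus 1
    [(303, 355), (303, 311)],   # genotypes seen in the group at locus 2
]
print(flag_unseen_loci(individual, group))
```

Here only locus 2 is flagged, because (307, 355) does not appear in the group list at that locus.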

 

At each locus, there can be a maximum of five genotypes observed for a given full-sib group (4 real genotypes, since a single cross of two parents can produce at most four distinct genotypes at a locus, plus the missing value genotype, generally 000 000). The upper box in the lower panel has a window of 5 visible lines and thus, at each locus, all the genotypes of any full-sib group will be visible. If, however, you are looking at the genotype collection of a kin group, there could be many more than 5 genotypes. The box will only show 5 lines of genotypes, but they will be shown in RED, alerting you that you should scroll that box to see the complete list of genotypes in that kin group.

 

Lastly, when a group or a subgroup number has been clicked on and a list of genotypes loaded in the upper box of the lower panel, clicking on the “Parents?” button will open a new window where all possible parental genotypes for this group will be listed. Again, this is a powerful tool to investigate and compare partitions and individuals. This is reviewed in detail in section 5.

 

 

4.2 Comparing partitions obtained with different weights

 

As explained in section 2.1.4, the weight is an important parameter that allows correcting some problems that are commonly seen. For example, when the data set actually contains some proportionally large full-sib families, partitions constructed with a weight of one will often have these large full-sib families split into several groups. Similarly, when the data set contains a mixture of full-sibs and half-sibs, a weight of one is often too low to ensure that every related individual ends up in the same kin group. Increasing the weight forces the coalescence of initially separate groups and allows overcoming the problems described above. However, this is a tool to use with caution. The compare function is very useful to analyze the changes that occur when increasing the weight and to decide what to accept. The situation is somewhat different for full-sib partitions and for kin partitions, so this problem is explored in the following two separate sections.

 

 

                                                4.2.1 Full-sib partition

 

Increasing the weight parameter when generating a full-sib partition is mostly useful to detect instances where full-sib families are split in subgroups. This is a common occurrence when a few large full-sib families represent a substantial portion of the total sample number.

To deal with this potential problem, the following strategy generally works well. You should first try to generate a full-sib partition with the highest possible score with a weight of 1, by running several trials and varying the other parameters. Ultimately, use your best full-sib partition so far as a start-up partition to verify that you indeed cannot further improve the score. You then do the same with a weight of 5, and the same with a weight of 10, for example. Finally, using the same specific parameters (in particular the seeds) that were used to generate your “best” full-sib partition with weight 1, “best” full-sib partition with weight 5 and “best” full-sib partition with weight 10, you recreate these three best full-sib partitions in a single Job. You can now use the “Compare” function to analyze the changes among these partitions as the weight increases. Remember that you should first choose the line corresponding to the higher weight, more compact full-sib partition, and then click on the specific comparative full-sib partition (lower weight, less compact) in that line. The expected changes are as follows. As the weight increases, the number of detected groups will decrease as some families coalesce into larger families. However, the full-sib constraint in place will counteract this “forcing” by preventing the coalescence of groups that are not full-sib compatible.

 

If large families were indeed split into smaller sub-units, it will be relatively easy to detect these fairly large groups coalescing neatly. Examination of the genotype distribution (see end of section 4.1) and the reconstructed parental genotypes (see section 5) in both the split groups (with lower weight) and in the coalesced groups (with higher weight) should not reveal anything strange and the reconstructed parental genotypes of the coalesced groups should be credible.

 

There could be cases where only one or a few individuals are added to an already large group. If the data set only consists of a few loci with only a few alleles each, it is often possible for half-sib(s) or even more distantly related individuals to satisfy the full-sib constraint and thus to be added to the full-sib group when a higher weight promotes increased coalescence. Again, a careful examination of the list of genotypes in the original full-sib family (with lower weight) and the genotype(s) of the new recruit(s) will give hints as to the plausibility of this addition. Ultimately, the user must make a decision and exercise common sense, given Mendelian rules and given his/her intimate knowledge of the quality of the data and the potential complexity of the particular study system.

 

Lastly, it will be observed that many very small groups (mostly singletons and doubletons) commonly coalesce into slightly larger groups (size 3 and 4) when increasing the weight. This mostly reflects the fact that the full-sib constraint is less able to prevent the artefactual coalescence of such small groups into slightly larger groups. This does not necessarily indicate that all these coalescence events of very small groups are artefacts, but there is generally not enough information in the genotype list to distinguish a plausible from an artefactual coalescence. The user is then advised to take a cautious approach in the interpretation of these small group coalescence events. Fortunately, for most biological problems where pedigree reconstruction is of interest (e.g. estimating mating success, mean kinship calculation, or quantitative genetic parameter estimation, etc.), these smaller full-sib groups are generally less informative than the bigger full-sib groups. It should also be noted that when estimating the significance of full-sib groups through randomization trials (see section 6.2), these smaller full-sib groups of size 3 and below can rarely be distinguished from artefactual groupings of unrelated individuals.

 

To sum up this short discussion, the main interest of comparing full-sib partitions obtained with different weights is to detect the presence of large full-sib families that are split in the partition with lower weight. The user should also note that the group-group repulsion graph (see the end of section 3.3.2) for the lower weight partition should also offer some indication of which full-sib family groups may be split. The coalescence of very small groups is much less informative and should be interpreted with caution.

 

 

                                                4.2.2 Kin partition

Increasing the weight parameter when generating a kin partition is mostly useful to force the grouping of full-sibs and half-sibs into kin groups. Subsequent comparison of kin partition to full-sib partition can then be used to reconstruct the half-sib structure and detect genotyping errors (see section 4.3). This is however a tool to use with caution. In the case of kin partition, there are no counteracting forces to prevent the indiscriminate lumping of unrelated individuals. In the previous case of full-sib partition, the full-sib constraint was acting as a counteracting force (see previous section 4.2.1).

 

A strategy similar to the previous one (section 4.2.1) generally works well if you exercise caution. You should first try to generate a kin partition with the highest possible score with a weight of 1, by running several trials and varying the other parameters. Ultimately, use your best kin partition so far as a start-up partition to verify that you indeed cannot further improve the score. You then do the same with a weight of 5 (and possibly with a weight of 10). You also need to identify a best full-sib partition as described in the previous section 4.2.1. If you decided in that previous full-sib analysis that a weight of 5 or 10 was justified because it forced the coalescence of some large full-sib groups that were obviously split, then you use the “best” full-sib partition with a weight of 5 (or 10). If there was no evidence of large full-sib group splitting, then you are probably better off using the more conservative best full-sib partition obtained with a weight of 1. Finally, using the same specific parameters (in particular the seeds) that were used to generate your “best” kin partition with weight 1 and “best” kin partition with weight 5 (and maybe weight 10), as well as your best full-sib partition, you recreate these best kin partitions and best full-sib partition in a single Job. You can now use the “Compare” function to analyze the changes among these kin partitions as the weight increases. Remember that you should first choose the line corresponding to the higher weight, more compact kin partition, and then click on the specific comparative kin partition (lower weight, less compact) in that line.

 

As in the previous case of full-sib groups, the number of detected kin groups should decrease when the weight increases, as some kin groups coalesce into larger kin groups. The coalescence of different kin groups may be rather well behaved, for example if you see whole or nearly whole large groups amalgamating neatly. If this is the case, you should note that the group-group repulsion graph (see the end of section 3.3.2) for the lower weight kin partition should also offer some indication of which kin groups may be prone to coalesce with increasing weight. You can then examine the list of genotypes in the kin groups pre-amalgamation (lower weight) and post-amalgamation (higher weight) to get a sense of what may be going on. Another example of “reasonably well-behaved” amalgamation is when individuals that were initially found in very small groups join bigger groups. This is often indicative of isolated half-sibs joining an already formed kin group.

Lastly, as in the case of full-sib partitions, there will be many instances where small or very small kin groups coalesce together with increasing weight. As mentioned previously, many of these coalescence events are artefacts. The user is then well advised to exercise caution in interpreting these coalescence events of very small groups, as there is generally not enough information in the genotype list to distinguish a plausible from an artefactual coalescence. It should also be noted that when estimating the significance of kin groups through randomization trials (see section 6.2), these smaller kin groups of size 6-8 and below can rarely be distinguished from artefactual groupings of unrelated individuals.

 

In many instances, however, the process of coalescence of large groups may not be so neat, and you may observe fairly large kin groups being split and added to different other kin groups, with a considerable amount of reshuffling of individuals among groups. The tools contained here allow examining the various lists of genotypes with different weights and the user should try to do so, but it may be quite difficult to discern a clear pattern of what is going on. When it appears that amalgamation of your kin groups is not straightforward, the next option is to proceed to “nesting” your best full-sib partition within the two best kin partitions that you have obtained, for example with weights 1 and 5 (see section 4.3). The comparison of the two “nesting” results can sometimes help sort out the pedigree relationships among the various individuals, if you are dealing with a relatively “clean” pedigree consisting mostly of full-sib families nested within half-sib families (as in examples 2 and 3 at the end of section 4.3.2). However, the sort of situation described above (a considerable amount of reshuffling of individuals among groups) can also be indicative of a complex pedigree with the presence of many factorial crosses among the unobserved male and female parents. These are very difficult pedigrees to resolve unless the detected individual full-sib families are quite large, in which case an indirect route through full-sib parental genotype reconstruction can sometimes work better (see example 4 at the end of section 4.3.2 and section 5.2).

 

 

Final thoughts on the use of weight: Increasing the weight indiscriminately in the case of a kin partition will simply result in fewer, bigger groups but this may have little to do with the true pedigree. As a general guideline, using a weight similar to the one used to generate the best full-sib partition should result in kin groups containing most full-sibs and many of the half-sibs as well. Increasing the weight a little more should result in the kin groups picking a few more half-sibs, but artificial grouping of unrelated individuals and groups will increase as well. You should rarely have to use a weight of 10 or above in a kin partition. The interpretation of large kin groups and their amalgamation can be time consuming but should be possible if the coalescence of these groups is fairly well behaved. The coalescence of many small kin groups is much more complicated and may be impossible to distinguish from artefactual grouping of unrelated individuals.

 

You can also think of the weight as a parameter under your control that lets you choose, to some degree, the type of mistakes that Pedigree will make. Using a weight of 1 will generally ensure that unrelated individuals are rarely grouped together. Individuals that are grouped are generally truly related, but many truly related individuals are probably missed and groups may be split. Increasing the weight increases the probability that all truly related individuals are grouped together, but an increasing proportion of unrelated individuals will be added to the groups as well. Depending upon your particular problem, you may prefer one type of error to the other, and this should also guide your choice of a weight value.

 

 

4.3 Nesting a full-sib partition within a kin partition

 

Nesting the best full-sib partition within the best kin partition is key for identifying the presence of genotype errors and to establish potential half-sib relationships among the identified full-sib families.

 

The basic idea here is to identify the constituent full-sib families inside a kin group. A kin group is a group of individuals that appear related in an unspecified way, and generally comprises a mixture of full-sibs and half-sibs. In contrast, the individuals in a full-sib group appear to form a full-sib family: at each locus, the collection of genotypes is strictly compatible with Mendelian inheritance from one cross of two parents. In other words, there should exist at least one pair of genotypes (the genotypes of the unobserved parents) that, when crossed, could have generated the collection of genotypes observed in the proposed full-sib family at that locus.
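The compatibility requirement described above can be checked by brute force. The following Python sketch illustrates the idea and is not the algorithm used by Pedigree: the function names and the data layout are assumptions, the allele sizes are invented, and 000 000 is treated as the missing-value genotype. Only alleles seen in the progeny are tried for the parents, which is sufficient for a yes/no answer because a parental allele that was never transmitted can be swapped for a transmitted one without losing any observed genotype.

```python
from itertools import combinations_with_replacement

MISSING = ("000", "000")  # missing-value genotype, skipped in the check


def locus_is_full_sib_compatible(genotypes):
    """True if one cross of two parents could yield every genotype seen at this locus."""
    observed = {tuple(sorted(g)) for g in genotypes if tuple(g) != MISSING}
    alleles = sorted({a for g in observed for a in g})
    if len(alleles) > 4:          # two parents carry at most four alleles
        return False
    parents = list(combinations_with_replacement(alleles, 2))
    return not observed or any(
        observed <= {tuple(sorted((a, b))) for a in p1 for b in p2}
        for p1, p2 in combinations_with_replacement(parents, 2)
    )


def group_is_full_sib_compatible(group):
    """`group` is a list of individuals, each a list of per-locus genotypes."""
    return all(locus_is_full_sib_compatible(locus) for locus in zip(*group))


# Three individuals typed at two loci (invented allele sizes).
family = [
    [("303", "355"), ("120", "124")],
    [("303", "311"), ("120", "128")],
    [("000", "000"), ("124", "128")],
]
print(group_is_full_sib_compatible(family))
# A fourth individual brings two new alleles at locus 2: five alleles in total.
print(group_is_full_sib_compatible(family + [[("303", "355"), ("132", "136")]]))
```

The first group passes the check at both loci, while the enlarged group fails at the second locus because five different alleles cannot come from two parents.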

 

 


 

As seen in the example above, the comparison of the best kin partition (partition 08) to the best full-sib partition (partition 02) shows that kin group #1 (50 individuals, in partition 08) comprises the large full-sib family # 2 (37 individuals, in partition 02), another smaller full-sib family (# 14, of size 5) and 5 other very small full-sib groups of size 1 or 2. The seven full-sib groups that had been identified in the best full-sib partition 02 are conserved intact, nested in kin group #1 in the best kin partition. The same is true of nearly all the full-sib families nested in the different kin groups in the example above (only a portion of the nested comparison is shown). Out of the 45 full-sib families identified in the best full-sib partition (partition 02), only three very small full-sib families of size 2 were reshuffled into different kin groups. This indicates that Pedigree is able to identify a clear pedigree structure, and further analyses to determine the relationships between the various related full-sib groups within a kin group should be successful.
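The nesting check described above amounts to asking whether all members of each full-sib family fall in a single kin group. A minimal sketch follows, assuming that both partitions are available as mappings from individual sample number to group number; the function name nest_full_sib_in_kin and the small data set are invented for illustration.

```python
from collections import defaultdict


def nest_full_sib_in_kin(kin, full_sib):
    """Nest full-sib families within kin groups.

    Both arguments map an individual sample number to a group number.
    Returns (nested, split): nested[kin_group][family] is the number of shared
    individuals, and split lists the families spread over more than one kin group.
    """
    nested = defaultdict(lambda: defaultdict(int))
    homes = defaultdict(set)
    for individual, family in full_sib.items():
        kin_group = kin[individual]
        nested[kin_group][family] += 1
        homes[family].add(kin_group)
    split = sorted(f for f, groups in homes.items() if len(groups) > 1)
    return {k: dict(v) for k, v in nested.items()}, split


# Invented data: nine individuals, kin groups 1 and 3, families 2, 14, 30 and 41.
kin = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 3, 9: 3}
full_sib = {1: 2, 2: 2, 3: 2, 4: 2, 5: 14, 6: 14, 7: 30, 8: 30, 9: 41}

nested, split = nest_full_sib_in_kin(kin, full_sib)
print(nested[1])   # families found inside kin group 1, with member counts
print(split)       # families that are not conserved intact
```

In this toy case families 2 and 14 are nested intact in kin group 1, while the doubleton family 30 is reshuffled between kin groups 1 and 3.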

 

The presence of different full-sib families within a kin group may result from two different phenomena:

1. This could be due to the presence of genotype “errors” that will result in true full-sibs being expelled from their true groups. Hence the kin group may really be a single full-sib family, but the genotype errors force this full-sib family to be split into subgroups when the full-sib constraint is on.

2. This could be due to the presence of half-sib relationships among distinct full-sib groups. Hence in this case the kin group is truly a mixture of half-sibs and full-sibs.

 

In real data sets, the two phenomena are often at work simultaneously, but the tools available in the “Compare” window and in the “Parents?” window generally allow distinguishing between them and allocating the individuals to the correct genealogical groups.

 

 

                                                4.3.1 Detecting genotype errors

 

Some of the individuals may truly be full-sibs of an identified full-sib group, but genotype errors, mutations or the presence of null alleles at some loci may prevent these individuals from joining their proper full-sib family group because their genotypes do not respect the full-sib constraint. When this takes place, a true full-sib family will often be split in a full-sib partition, with most of the full-sibs being properly regrouped in a full-sib group, and the sibs that have been “affected” by mutations, errors or null alleles generally placed in small full-sib groups on their own. If there are several “affected” individuals from the same full-sib family, they will often be regrouped together in smaller satellite (splinter) full-sib families, separate from the larger full-sib group containing the unaffected sibs in the full-sib partition. When the full-sib constraint is removed to generate a kin group partition, these affected individuals naturally join the rest of their sibs in a kin group.

 

As mentioned at the end of section 4.1, the lower panel of the “Compare” page allows a careful examination of an individual genotype and its comparison to a list of genotypes in a group or subgroup. This is particularly handy to detect individuals that may have been affected by genotype errors. In the previous section, kin group # 1 of size 50 was shown to comprise a large full-sib family (size 37) and several much smaller full-sib families. Each of the individual genotypes in the small families can be compared to the list of genotypes in the large full-sib family to see how different this particular individual is from the putative full-sib group.

 

 

For example, here individual # 17 has 8 out of 9 loci perfectly compatible with the genotype list in the large full-sib family # 2. However, at locus 2, its genotype (307_355) is highlighted since it is not in the list of genotypes of family # 2, although this genotype is very similar to one of the 4 genotypes (303_355) seen in this putative full-sib family. This individual is almost certainly a true member of this family, with a small error affecting one allele at one locus preventing it from joining its proper group when the full-sib constraint is on.

 

Caution. A locus-specific genotype is highlighted with *** *** when this genotype is not seen in the list of genotypes above. This does not necessarily mean that this genotype is actually incompatible with a proposed full-sib family. If the comparative family is small, it may be that not all possible genotypes at that locus have been seen, and the single-locus genotype of the individual of interest may actually be compatible. For example, at a hypothetical locus, an individual with genotype (A B) is highlighted (*** ***) when compared to the list of genotypes of a small full-sib family comprising 3 genotypes (A D), 2 genotypes (B C) and 1 genotype (C D). The parents’ genotypes must have been (A C) by (B D), and this individual genotype (A B) is indeed compatible with this proposed parental cross. It is highlighted simply because it is not seen in this small list. The user should also see section 5 for additional information on this subject.
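The hypothetical example above can be verified by enumerating the candidate parental crosses. The sketch below is an illustration only (the function name crosses_for is an assumption, and this is not how the “Parents?” window is implemented): it lists every cross, built from the alleles observed in the family, that can produce all the listed genotypes, together with the genotypes that this cross can produce but that have not been seen yet.

```python
from itertools import combinations_with_replacement


def crosses_for(offspring):
    """Parental crosses (over observed alleles) able to produce all listed genotypes.

    Returns a list of (parent1, parent2, unseen), where `unseen` holds the progeny
    genotypes the cross can produce but that are absent from the list.
    """
    observed = {tuple(sorted(g)) for g in offspring}
    alleles = sorted({a for g in observed for a in g})
    parents = list(combinations_with_replacement(alleles, 2))
    found = []
    for p1, p2 in combinations_with_replacement(parents, 2):
        progeny = {tuple(sorted((a, b))) for a in p1 for b in p2}
        if observed <= progeny:
            found.append((p1, p2, sorted(progeny - observed)))
    return found


family = [("A", "D"), ("B", "C"), ("C", "D")]
for p1, p2, unseen in crosses_for(family):
    print(p1, "x", p2, "could also produce", unseen)
```

For this family the only cross found is (A C) by (B D), and the only unseen genotype it can produce is (A B), which is exactly the highlighted genotype of the example.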

 

The general rule of thumb is thus to accept as a true full-sib affected by a genotype error an individual that has one, or at most two, “offending” loci preventing its integration into a larger full-sib family. The basic logic is as follows: if the sort of genotype error described above affects an individual single-locus genotype at random with probability p (most likely on the order of 1%-5%), then in a family of size n we can expect about n*p individuals with one genotype error (most of whom will probably be expelled from the full-sib group), n*p^2 individuals with 2 errors, n*p^3 individuals with 3 errors, and so forth. Unless we are dealing with very large groups and very high error rates, we should rarely see individuals affected by two errors and almost never individuals with more than two errors. Note, however, that there could be conditions where this logic is undermined. For example, particularly low sample quality may result in a higher probability of multiple single-locus genotype errors affecting some specific individuals. There could also be systematic laboratory errors resulting in multiple single-locus genotype errors in some specific individuals. One example that comes to mind is that of improper sample placement in the genotyping chain.
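A quick numerical illustration of this rule of thumb may help. The first function below simply evaluates n times p to the power k, as in the text. The second function is an alternative that is not taken from this manual: it assumes that errors strike each locus independently with probability p, so that the number of errors per individual follows a binomial distribution over the typed loci. The values n = 50, p = 0.02 and 9 loci are chosen for illustration only.

```python
from math import comb


def rule_of_thumb(n, p, k):
    """Rule of thumb from the text: about n * p**k individuals carry k errors."""
    return n * p ** k


def binomial_expectation(n, p, loci, k):
    """Assumption, not the manual's formula: each of `loci` loci is hit
    independently with probability p, giving a binomial count of errors."""
    return n * comb(loci, k) * p ** k * (1 - p) ** (loci - k)


n, p, loci = 50, 0.02, 9   # illustrative values only
for k in (1, 2, 3):
    print(k, round(rule_of_thumb(n, p, k), 4), round(binomial_expectation(n, p, loci, k), 4))
```

Under either calculation the expected number of individuals drops sharply with each additional error, which is the point of the rule. If p is read as a per-locus probability, the binomial version gives noticeably larger counts when many loci are typed, so the threshold of two offending loci is best treated as a guideline.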

 

Null alleles are a fairly common occurrence with the microsatellite DNA markers that are typically used for DNA marker-based pedigree reconstruction. The presence of a null allele at one locus in one of the unobserved parents results in a typical situation that is instructive to analyze. Let’s examine the following hypothetical example, where a number of progenies were produced by the cross of a parent with genotype (A Null) by another parent with genotype (B C) at a particular locus. The list of genotypes in the progenies should then be (A B), (A C), (Null B) and (Null C). Since the null allele is not detectable, the progenies with genotype (Null B) will be erroneously scored as (B B) and those with genotype (Null C) as (C C). This list of apparent genotypes, (A B), (A C), (B B) and (C C), does not respect the full-sib constraint. When generating a kin partition, all the progenies should be found in the same kin group, but when generating a full-sib partition, they will not. What will generally happen is that all progenies with genotypes (A B) and (A C) and one of the two erroneous classes, (B B) or (C C), will be regrouped in a full-sib family, while the individuals with the other erroneous genotype class will be regrouped in a separate splinter full-sib family. The split generally occurs with the less numerous of the two classes carrying a null allele being forced out as the splinter group. For example, let’s assume that there are quite a few more progenies with apparent genotype (B B) than with apparent genotype (C C). The progenies with genotype (C C) will then generally be forced out and constitute a separate splinter family in the full-sib partition.

This situation is easily detectable with the tools described here. Every individual in that splinter family should have its genotype (C C) at that particular locus highlighted as not present in the genotype list of the bigger full-sib family, but its genotypes at other loci should be compatible with the bigger full-sib family (apart from possibly a few unlucky individuals that may have been affected by another genotype error in addition to the null allele). In addition, looking at the parental genotype reconstruction of the bigger family should reveal that the genotype list (A B), (A C) and (B B) is not very likely. Such a list could be generated by the parental cross (A B) by (B C), but about ¼ of the progenies should then exhibit the genotype (B C), and this genotype is not seen in the progeny list (see additional information at the end of section 5.1).
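
The null-allele scenario can be reproduced with a small simulation of the cross. This is a hedged sketch: the helper names are invented for illustration, and the full-sib constraint is checked by brute force over observed alleles rather than by the method Pedigree 2.2 actually uses.

```python
from itertools import combinations_with_replacement, product

NULL = "Null"

def apparent_genotype(a, b):
    """Genotype as scored in the lab when a null allele is invisible."""
    visible = [x for x in (a, b) if x != NULL]
    if len(visible) == 1:      # one null allele: scored as a homozygote
        visible = visible * 2
    return tuple(sorted(visible))

def apparent_offspring(parent1, parent2):
    """Set of apparent genotype classes produced by a cross."""
    return {apparent_genotype(a, b) for a, b in product(parent1, parent2)}

def respects_full_sib_constraint(observed):
    """True if at least one cross of two parents could produce every genotype."""
    observed = set(observed)
    alleles = sorted({a for g in observed for a in g})
    genotypes = list(combinations_with_replacement(alleles, 2))
    for p1, p2 in combinations_with_replacement(genotypes, 2):
        produced = {tuple(sorted(pair)) for pair in product(p1, p2)}
        if observed <= produced:
            return True
    return False

classes = apparent_offspring(("A", NULL), ("B", "C"))
print(sorted(classes))
print("all four classes:", respects_full_sib_constraint(classes))
print("without (C C):  ", respects_full_sib_constraint(classes - {("C", "C")}))
```

Dropping one of the two erroneous homozygote classes restores the constraint, which mirrors the splinter family described in the text.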

 

                                                4.3.2 Resolving half-sib structure

 

The second reason why a kin group may comprise several full-sib families is that these full-sib families may be half-sibs to one another. When there is at least one large full-sib group (4 individuals or more) among these constitutive full-sib families, comparing the genotypes of individuals in the smaller constitutive full-sib groups to the list of genotypes in this larger full-sib family generally allows differentiating this situation from the situation due to genotyping errors described in the previous section (4.3.1).

 

As mentioned at the end of section 4.1, the lower panel of the “Compare” page allows a careful examination of an individual genotype and its comparison to the list of genotypes in a group or subgroup. This is particularly handy to detect an individual that was not allocated to the bigger full-sib group because it is a half-sib rather than because of genotype errors. The key information is that this individual will generally exhibit several loci that depart from the list of genotypes of the bigger full-sib family, as opposed to 1 locus (or rarely 2 loci) that departs from that list in the case of true full-sibs with genotyping errors. In addition, the putative parental genotypes of the larger full-sib group can be reconstructed by clicking on the button “Parents?” when the correct group or subgroup is loaded into the upper box of the lower panel. It is then possible to check that the genotype of the putative half-sib could indeed be an offspring from one of the two reconstructed parent genotypes at each locus.
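
The distinction between "one or two offending loci" and "several offending loci" can be sketched as a simple count. The data layout, the function names and the allele sizes below are all hypothetical; the threshold of 2 comes from the rule of thumb in section 4.3.1, and the result should be treated as a prompt for closer inspection rather than a verdict.

```python
def offending_loci(individual, family_genotypes):
    """Loci where the individual's genotype is absent from the family list.

    individual: dict locus -> genotype tuple
    family_genotypes: dict locus -> set of genotype tuples seen in the family
    """
    return [locus for locus, genotype in individual.items()
            if tuple(sorted(genotype)) not in family_genotypes[locus]]

def rough_diagnosis(n_offending, max_errors=2):
    """Apply the rule of thumb: 1-2 loci suggest errors, more suggest a half-sib."""
    if n_offending == 0:
        return "consistent with the family list"
    if n_offending <= max_errors:
        return "possible full-sib with genotyping error(s)"
    return "possible half-sib"

# Hypothetical genotype lists of a large full-sib family at four loci.
family = {
    "L1": {(100, 104), (100, 108), (104, 112), (108, 112)},
    "L2": {(150, 150), (150, 154)},
    "L3": {(200, 204), (204, 208)},
    "L4": {(120, 124), (120, 128), (124, 126), (126, 128)},
}
ind_a = {"L1": (100, 104), "L2": (150, 154), "L3": (200, 204), "L4": (120, 120)}
ind_b = {"L1": (100, 116), "L2": (150, 158), "L3": (204, 212), "L4": (120, 124)}

for name, ind in (("ind_a", ind_a), ("ind_b", ind_b)):
    bad = offending_loci(ind, family)
    print(name, bad, "->", rough_diagnosis(len(bad)))
```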

 

If several large constitutive full-sib families are found within one kin group, another powerful examination of putative half-sib relationships is possible by reconstructing and comparing the parental genotypes for each of these large constitutive full-sib groups. If these are indeed half-sibs to one another, then there will be one common parental genotype at each locus for each pair of these large constitutive full-sib groups. It should be noted however that this type of analysis is powerful and robust mostly when the sizes of the constitutive full-sib families are reasonably large (more than 7 or 8 individuals). Parental genotype reconstruction is feasible for smaller full-sib families, but a few genotype errors among the progenies may lead to improper parental genotype reconstruction (see the end of section 5.1).
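
The comparison of reconstructed parental genotypes can be sketched as a per-locus intersection. The data structure and the allele sizes are hypothetical, and the check is only a necessary condition: it does not verify that the shared genotype belongs to the same parent across loci, which is something the user still has to judge.

```python
def shared_parent_genotypes(cross_a, cross_b):
    """For each locus, the parental genotypes present in both reconstructed crosses.

    cross_a, cross_b: dict locus -> pair of parental genotype tuples
    """
    shared = {}
    for locus in cross_a:
        parents_a = {tuple(sorted(g)) for g in cross_a[locus]}
        parents_b = {tuple(sorted(g)) for g in cross_b[locus]}
        shared[locus] = parents_a & parents_b
    return shared

def could_be_half_sib_families(cross_a, cross_b):
    """True if the two crosses share a parental genotype at every locus."""
    return all(shared_parent_genotypes(cross_a, cross_b).values())

# Hypothetical reconstructed crosses for three large full-sib families.
family_1 = {"L1": ((100, 104), (108, 112)), "L2": ((150, 154), (150, 158))}
family_2 = {"L1": ((100, 104), (116, 120)), "L2": ((150, 154), (162, 166))}
family_3 = {"L1": ((100, 104), (116, 120)), "L2": ((170, 174), (162, 166))}

print(could_be_half_sib_families(family_1, family_2))
print(could_be_half_sib_families(family_1, family_3))
```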

 

To examine more finely the possibilities and limitations of resolving half-sib structure with Pedigree 2.2, it is now necessary to distinguish among the various types of full-sib and half-sib structure that could exist in a data set. Let’s imagine that we have a collection of progenies that originated from the mating of 10 females and 10 males. The various colors in the following tables indicate which full-sib groups should be regrouped in kin groups.

 

Example 1

The simplest case is exemplified in the following table, and it corresponds to 10 full-sib families. Pedigree 2.2 should be able to reconstruct that pedigree easily. Both best full-sib partition and best kin partition should get close to 10 groups, and the comparisons between these two partitions should permit identifying individuals that may have been affected by genotype errors.

 

 

 

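       M1   M2   M3   M4   M5   M6   M7   M8   M9   M10
F1     X    -    -    -    -    -    -    -    -    -
F2     -    X    -    -    -    -    -    -    -    -
F3     -    -    X    -    -    -    -    -    -    -
F4     -    -    -    X    -    -    -    -    -    -
F5     -    -    -    -    X    -    -    -    -    -
F6     -    -    -    -    -    X    -    -    -    -
F7     -    -    -    -    -    -    X    -    -    -
F8     -    -    -    -    -    -    -    X    -    -
F9     -    -    -    -    -    -    -    -    X    -
F10    -    -    -    -    -    -    -    -    -    X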

 

 

Example 2

The following case (full-sib families nested within half-sib families) is a bit more complicated but should still work quite well with a good informative DNA marker data set. The best full-sib partition should get close to 14 groups and the best kin group partition should identify 6 kin groups: (M1 by F1, F2, F3, F4); (F5 by M2, M3, M4); (F6 by M5, M6); (M7 by F7, F8); (F9 by M8, M9) and (M10 by F10). Nesting the best full-sib partition within the best kin group partition should allow resolving the half-sib structure and identifying the common parent within each of the first five kin groups: M1, F5, F6, M7 and F9 respectively.

Pedigree 2.2 will work well with this type of situation because individuals within each kin group are related (e.g. they all have M1 as a common parent in kin group 1), so they will group together easily without requiring a large weight. At the same time, the kin groups share no common parents, so they should separate easily from one another. In other words, there should be enough attraction within groups and repulsion across groups to resolve this clean nested structure.

 

 

 

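       M1   M2   M3   M4   M5   M6   M7   M8   M9   M10
F1     X    -    -    -    -    -    -    -    -    -
F2     X    -    -    -    -    -    -    -    -    -
F3     X    -    -    -    -    -    -    -    -    -
F4     X    -    -    -    -    -    -    -    -    -
F5     -    X    X    X    -    -    -    -    -    -
F6     -    -    -    -    X    X    -    -    -    -
F7     -    -    -    -    -    -    X    -    -    -
F8     -    -    -    -    -    -    X    -    -    -
F9     -    -    -    -    -    -    -    X    X    -
F10    -    -    -    -    -    -    -    -    -    X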

 

 

 

Example 3

The following situation (full-sib families nested within half-sib families with some degree of factorial crossing) is getting a bit more complicated. If the data set is informative and the individual full-sib families are fairly large, then the best full-sib partition should point to about 17 full-sib groups. The best kin group partition should point to 5 kin groups: (M1, M2 by F1, F2); (M3, M4, M5 by F3, F4, F5); (M6, M7, M8 by F6, F7, F8); (M9 by F9) and (M10 by F10). However, you may have to increase the weight a fair bit over 1 to get this kin partition. The basic reason is that the first 3 kin groups are less cohesive than in the previous example, and you will probably have to promote coalescence within kin groups by increasing the weight. For example, in the first kin group, the progenies from M1 by F1 should want to join their half-sibs from M1 by F2, and similarly the progenies from M2 by F2 should also want to join their half-sibs from M1 by F2. However, the progenies from M1 by F1 are unrelated to the progenies from M2 by F2, and this will counteract the tendency to group these three full-sib family groups together in a single kin group, particularly if the “bridging” full-sib group M1 by F2 is small compared to the 2 other full-sib groups. As you increase the weight, these various groups should nevertheless coalesce as predicted towards 5 kin groups, because these 5 kin groups are well separated (there are no common parents among them) and should exhibit little tendency to merge across kin groups.

 

 

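       M1   M2   M3   M4   M5   M6   M7   M8   M9   M10
F1     X    -    -    -    -    -    -    -    -    -
F2     X    X    -    -    -    -    -    -    -    -
F3     -    -    X    X    X    -    -    -    -    -
F4     -    -    X    -    -    -    -    -    -    -
F5     -    -    X    -    X    -    -    -    -    -
F6     -    -    -    -    -    X    -    -    -    -
F7     -    -    -    -    -    X    X    X    -    -
F8     -    -    -    -    -    -    X    X    -    -
F9     -    -    -    -    -    -    -    -    X    -
F10    -    -    -    -    -    -    -    -    -    X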
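
The numbers of full-sib families and kin groups quoted for Examples 2 and 3 follow directly from the mating designs: each (female, male) cross is one full-sib family, and a kin group is a set of families chained together through shared parents. The sketch below derives these expected counts from the known crosses with a small union-find. It is not how Pedigree 2.2 builds kin groups (the program works from genotype data through the MCMC engine); it only shows what a perfect reconstruction should look like.

```python
def kin_groups(crosses):
    """Group full-sib families (female, male) that are linked by shared parents."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for female, male in crosses:
        parent[find(female)] = find(male)   # both parents join the same kin group

    groups = {}
    for female, male in crosses:
        groups.setdefault(find(female), []).append((female, male))
    return list(groups.values())

# Crosses read from the Example 2 and Example 3 tables.
example_2 = [("F1", "M1"), ("F2", "M1"), ("F3", "M1"), ("F4", "M1"),
             ("F5", "M2"), ("F5", "M3"), ("F5", "M4"), ("F6", "M5"),
             ("F6", "M6"), ("F7", "M7"), ("F8", "M7"), ("F9", "M8"),
             ("F9", "M9"), ("F10", "M10")]
example_3 = [("F1", "M1"), ("F2", "M1"), ("F2", "M2"), ("F3", "M3"),
             ("F3", "M4"), ("F3", "M5"), ("F4", "M3"), ("F5", "M3"),
             ("F5", "M5"), ("F6", "M6"), ("F7", "M6"), ("F7", "M7"),
             ("F7", "M8"), ("F8", "M7"), ("F8", "M8"), ("F9", "M9"),
             ("F10", "M10")]

for name, design in (("Example 2", example_2), ("Example 3", example_3)):
    print(name, "families:", len(design), "kin groups:", len(kin_groups(design)))
```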

 

 

Example 4

The last and most difficult case corresponds to a fully factorial or partial factorial crossing design where all or most of the males have mated with all or most of the females. This is a very difficult situation for Pedigree 2.2 to resolve because of two basic problems. First, there is no clean way to separate the progenies into clearly distinct kin groups (which explains why no color is used in this table). Every possible kin group partition will have kin groups that include unrelated individuals (as was the case in the previous example, but probably worse) and will also have related individuals allocated to different kin groups. Pedigree will generate a best kin group partition, but its interpretation will be quite difficult, and that kin partition could even have some full-sib groups split across different kin groups. In other words, the “best” kin partition is not going to be very good or informative. Second, it will be difficult to generate a good full-sib partition because each full-sib family has many half-sibs in the data set. Many of these half-sibs could “join” a full-sib group, generating an erroneous full-sib partition and leading to erroneous parental genotype reconstruction, particularly if the data set is not very informative and does not contain many different loci.

 

 

 

 

       M1   M2   M3   M4   M5   M6   M7   M8   M9   M10
F1     X    X    -    X    -    -    X    -    X    X
F2     X    X    -    X    -    X    -    X    -    -
F3     -    -    X    X    X    -    X    -    X    X
F4     X    X    X    -    X    X    -    X    -    X
F5     -    -    X    -    X    X    X    -    X    -
F6     X    -    -    X    -    -    -    X    -    X
F7     -    X    X    X    -    X    X    X    X    -
F8     X    X    X    -    X    -    X    X    -    X
F9     X    X    -    X    -    -    -    -    X    -
F10    X    -    X    -    X    X    X    -    -    X

 

In this situation, the best strategy is 1) to try generating a good full-sib partition, 2) followed by parental genotype reconstruction in each purported full-sib group, and 3) followed by comparison of the parental genotypes to identify instances of factorial mating. This will work best under two conditions:

·        First, you need to have a highly informative data set without too many genotyping errors, so that the full-sib constraint allows a relatively clean separation of full-sibs from half-sibs.

·        Second, most of the full-sib families must be large so that parental genotype reconstruction is fairly accurate. This then allows identifying the common parents that mated with many partners, hence recognizing the fully factorial or partial factorial crossing design.

 

 


5. Reconstructing parental genotypes

 

As mentioned in several sections of this manual, Pedigree 2.2 can reconstruct the putative genotypes of the unobserved parents. Note that we are talking here about the genotypes of the 2 unobserved parents of a putative full-sib family. Again, this is a handy and powerful tool to complement the in-depth examination and comparison of partitions. Clicking on the button “Parents?” will open a new window with the potential parental genotypes. The button “Parents?” can be found in two different places: in the HTML-formatted “Partition” window (section 3.3.2) and in the “Compare” window (see the end of section 4.1). The second route (from a “Compare” window) offers additional exploratory tools that are not available from the first route (from a “Partition” window). We start here with the simplest case, from an HTML-formatted “Partition” window.

 

5.1 Reconstructing parental genotype from a Partition window

 

 

 

 

Clicking on the “Parents?” button associated with one of the identified groups within the partition will open a new window with all possible parental genotypes at every locus for this group. Please note that loading the partition file can take some time and that the “Parents?” button is only operational when the partition file is fully loaded.

 

The following figure shows the parental genotype reconstruction for group 1, a large putative full-sib family with 38 individuals in run 02. The information in the new window is quite straightforward. Locus by locus, every possible parental cross that could have generated the collection of offspring genotypes seen in the group is listed. For each proposed parental genotype cross, the list of offspring genotypes that should have been observed is shown, as well as the expected and observed number of offspring with the corresponding genotypes. Lastly, a chi-square value summarizing the goodness of fit between the observed and expected number of offspring in the various genotype classes is shown. Please note that this chi-square is presented simply as a way to alert the user to a poor fit between the potential parental genotype cross and the collection of observed offspring genotypes. A poor fit is characterized by a high chi-square value. No attempt is made to present probability values associated with these chi-square values.

 

 

 

As can be seen in this large full-sib group, for 8 out of 9 loci there is really only one possible parental genotype cross that could have generated the observed collection of offspring genotypes. This is not surprising given the large size of this putative full-sib family (38 individuals). For locus 7, there are actually 5 possible parental genotype crosses that could have generated the observed offspring genotype list. Among these, the fifth parental genotype possibility listed is (178_XXX by 194_198). The XXX simply indicates that any allele value would be acceptable and that this allele was unobserved in the offspring genotype collection. Note that if both reconstructed parental genotypes show an XXX allele, it does not mean that the unobserved allele value has to be the same for both parents. For example, an acceptable parental genotype cross listed as (178_XXX by 194_XXX) could represent (178_198 by 194_198) but also (174_182 by 194_198). A quick examination of the 5 possible parental genotype crosses at locus 7 reveals that the first proposed one has a good fit to the observed offspring list (chi-square of 1.684). The other four proposed parental genotype crosses could indeed generate the collection of genotypes observed in the offspring, but each of them has a poor fit (very high chi-square value of 41.368) with the list of observed offspring genotypes. These last four proposed parental genotype crosses are thus possible but unlikely.

 

Only one parental genotype cross is listed for locus 8 (175_179 by 171_179), but the chi-square of that lone possibility is quite high as well (38.412), thus indicating a probable problem. Comparing the observed numbers of offspring genotypes to the expected numbers points to an obvious problem. Many offspring are observed with genotypes (175_179) and (171_175), but none with genotype (171_179) and only one with genotype (179_179). Yet these last two offspring genotypes should be just as common as the first two if the proposed parental cross is correct. If this lone offspring with genotype (179_179) was not in the group, the most probable parental genotype cross would be (175_175) by (171_179) and it would probably have a good fit (low chi-square). There are thus reasonable grounds to believe that this specific offspring genotype does not really belong to the offspring genotype list. The user should then return to the Partition page and click on this genotype at this locus in this group. This will highlight the sample number of this problematic member of the full-sib group. Clicking back on the correct sample number will then highlight the complete multilocus genotype of that individual and will allow a first examination of the global fit of that individual in the full-sib family. If this individual seems a bona fide member of the family group based on the other 8 loci, then the particular genotype of this individual at locus 8 is probably a genotype error. The user can then go back to the original data and check for evidence of genotype scoring error or data entry error. Alternatively, if this individual’s genotypes at many other loci seem a little bit “off” compared to the rest of the full-sib family, then this individual may not be a full-sib but rather a half-sib of that group.
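
The effect of a single outlier on the goodness of fit can be illustrated with a standard Pearson chi-square against Mendelian expectations. This is a sketch under stated assumptions: the manual does not spell out the exact formula Pedigree uses, the function names are invented, and the offspring counts below are hypothetical (they are not the counts behind the values reported for group 1).

```python
from collections import Counter
from itertools import product

def expected_proportions(parent1, parent2):
    """Mendelian proportions of offspring genotype classes for one cross."""
    classes = Counter(tuple(sorted(pair)) for pair in product(parent1, parent2))
    return {genotype: count / 4 for genotype, count in classes.items()}

def chi_square(observed, parent1, parent2):
    """Pearson goodness of fit between observed counts and a proposed cross."""
    proportions = expected_proportions(parent1, parent2)
    if any(g not in proportions for g in observed):
        raise ValueError("an observed genotype cannot come from this cross")
    n = sum(observed.values())
    return sum((observed.get(g, 0) - n * p) ** 2 / (n * p)
               for g, p in proportions.items())

# Hypothetical counts at a locus resembling locus 8 in the text.
with_outlier = {(175, 179): 18, (171, 175): 19, (179, 179): 1}
without_outlier = {(175, 179): 18, (171, 175): 19}

chi_with = chi_square(with_outlier, (175, 179), (171, 179))
chi_without = chi_square(without_outlier, (175, 175), (171, 179))
print(f"with the (179_179) offspring:    {chi_with:.3f}")
print(f"without the (179_179) offspring: {chi_without:.3f}")
```

Under these assumed counts, removing one individual turns a very poor fit into a very good one, which is the pattern that should prompt a closer look at that individual.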

 

A second, more powerful examination of this type of situation is possible when the parental genotype reconstruction is invoked from the “Compare” window rather than from the Partition page. This aspect is examined in detail in the following section.

 

5.2 Reconstructing parental genotype from a Compare window

 

 

 

The image above corresponds to the same example that we have seen in section 4.3.1, i.e., nesting of our best full-sib partition (Partition 02, 44 full-sib groups) within our best kin partition (Partition 08, 17 kin groups). As mentioned in section 4.1, clicking in the upper left panel on the group number in the reference partition, or on the group number of the comparative partition (in the upper left panel again), will load the appropriate collection of genotypes in the upper box of the lower panel. For example, clicking on the 1 of group 1 in the comparative Partition 02 will load the genotype list observed in the 38 individuals of this full-sib family. This full-sib family #1 in Partition 02 is nested within kin group #2 (42 individuals) in Partition 08. This is the same family that was looked at in the previous section. Clicking on the “Parents?” button will then open another window showing all the possible reconstructed parental genotypes for each locus. This window has exactly the same structure as the one produced when clicking the “Parents?” button from a Partition window. However, doing this operation from a “Compare” window offers 2 main additional advantages.

 

1.      First, it is possible to quickly compare the potential parental genotype crosses of groups that were reshuffled when generating various full-sib partitions. For example, in section 4.1, two full-sib partitions were compared, where most full-sib groups were conserved across the two partitions, but two large full-sib groups were somewhat reshuffled. From the appropriate “Compare” window, it would be easy to examine the putative parental genotypes of the various reshuffled groups and subgroups and to see if one solution may seem more plausible than another. A similar case is possible when investigating the coalescence of full-sib family groups with increasing weight (see section 4.2.1). It is possible to examine the putative parental genotypes before and after coalescence, and this may be informative as to the plausibility of the merging of these groups.

2.      Second, in a “Compare” window, the reconstruction of parental genotype crosses is actually performed based on the list of genotypes that has been loaded in the upper box of the lower panel. This box can be fully edited by the user and the parental genotype reconstruction will be based on the edited data. For example, it was noted above that the parental genotype cross reconstructed for this group 1 at locus 8 was unlikely (high chi-square = 38.412) and this seemed to be driven by a single progeny with genotype (179_179). It is quite easy to explore this possibility by editing the box of offspring genotypes.

 

 

For example, the user could erase the single observation of this genotype at locus 8 and then click on the “Parents?” button.

 

 

 

The new reconstructed parental genotypes will now show 5 different possibilities for locus 8, with one of them (175_175 by 171_179) having a good fit as expected while the four others are quite unlikely.

 

 

 

 

 

 

 

 

 


 

This possibility of editing the list of offspring genotypes and instantly seeing the resulting change in the parental genotype reconstruction is thus a very handy tool. However, users have to make sure that any change they make does not interfere with the alignment of the rest of the genotypes in the box. It should also be noted that any edits in this box have only temporary effects and are thus quite safe. The raw genotype data as well as the partition data generated by the MCMC are not modified by these user-driven inputs. Modifying the list of genotypes in the box will affect the parental genotype reconstruction, but clicking again on the correct group or subgroup number in the upper left panel will reload the original list of genotypes.

 

A last aspect of parental genotype reconstruction that is worth mentioning is that this operation can be done for any group in any type of partition, even if the group is a kin group that probably violates the full-sib constraint. In other words, the program will attempt to reconstruct the genotypes of the two unobserved parents of a putative full-sib family, even when the group is not a full-sib family. This can sometimes be informative. For example, clicking on group #2 in Partition 08 will load in the box the list of genotypes observed among the 42 individuals of that kin group. As mentioned in section 4.1, the box will show 5 lines of genotypes in RED, alerting you that you should scroll that box to see the complete list of genotypes in that kin group. Clicking on the “Parents?” button will again open a window with the list of putative parental genotype crosses that could have generated the offspring list.

 

 

As can be seen, no possible parental cross is listed for loci 2, 3 and 8, indicating that for these loci the collection of observed offspring genotypes does not respect the full-sib constraint. However, for the other 6 loci, at least one possible parental cross is listed. The putative parental crosses, not very surprisingly, are the same ones that were identified for the large constitutive full-sib group of 38 individuals (family 1 in run 02). This is another indication that this kin group of 42 individuals is almost a full-sib family. Indeed, comparing the genotypes of the 4 extra individuals to the large constitutive full-sib family of size 38, as explained in section 4.3.1, revealed that these extra individuals were true full-sibs that had each been affected by 1 genotype error.

 

 


 

6.  Statistical inference on partition

 

Pedigree 2.2 will sometimes group unrelated individuals by chance into artefactual full-sib families or kin groups. This is a common occurrence with most pedigree reconstruction algorithms. Natural populations of interest for pedigree reconstruction may often consist of a collection of unrelated individuals and individuals in small but real full-sib families or kin groups. For the biological interpretation of a reconstructed pedigree in such natural populations, it thus becomes very important to be able to assess whether a particular partition has uncovered something “real” and to potentially distinguish a true group from an artefactual one.

 

Through intensive data re-sampling, Pedigree offers a powerful way to assess the overall significance of a specific partition and to look in more detail at the significance of specific full-sib or kin groups found in partitions. The partitions of interest here are the best full-sib partition and the best kin partition found by the user and used afterward for detailed pedigree reconstruction as described in sections 4 and 5. The approach consists in assessing how often the observed overall degree of pedigree structure, as well as the observed cohesion of specific groups found in these best partitions, is seen in a large number of artificial data sets of unrelated individuals. The basic idea is to randomize the original genotype data set by intensively permuting the collection of genotypes locus by locus. Each randomization trial results in a data set with exactly the same number of individuals and with the same allelic and genotypic frequencies as the original data set. However, every individual in the randomized data set is now “unrelated” to every other individual. In other words, the genotype similarities of any pair of individuals in the randomized data set reflect random sampling from the pool of available genotypes (identity by chance) and cannot reflect inheritance of similar genotypes from common ancestry (identity by descent).

Pedigree generates a large number of such randomized data sets (typically somewhere between 100 and 1000), the exact number being set by the user. Pedigree then runs the MCMC algorithm on each randomized data set with exactly the same parameters (see section 2.1) that were used for the best full-sib partition and the best kin partition. For example, if the best full-sib partition was generated with 2,000,000 iterations, a temperature of 10, a weight of 5 (and obviously a full-sib constraint of 1), then Pedigree will run with these parameters on each of the 100 to 1000 randomized data sets. This allows Pedigree to generate the distribution of partition attributes under the null hypothesis that every individual in the data set is unrelated. The attributes are metrics that describe the overall degree of relatedness in the data set, or they can be metrics associated with a particular group within the partition. We can then do statistical inference by comparing the metrics of the real partition to the distribution of these metrics in the 100 to 1000 randomized data sets.
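
The locus-by-locus randomization can be sketched as an independent shuffle of each locus column. This is an illustrative reading of the description above, not Pedigree's own code: the data layout (one list of single-locus genotypes per individual) and the function name are assumptions, and the handling of missing values is not addressed.

```python
import random

def randomize_by_locus(genotypes, rng):
    """Permute genotypes among individuals independently at each locus.

    genotypes: list of individuals, each a list of single-locus genotypes.
    Returns a new data set of the same size with the same genotypes per locus.
    """
    n_loci = len(genotypes[0])
    columns = []
    for locus in range(n_loci):
        column = [individual[locus] for individual in genotypes]
        rng.shuffle(column)              # breaks associations across loci
        columns.append(column)
    return [[columns[locus][i] for locus in range(n_loci)]
            for i in range(len(genotypes))]

# Hypothetical data set: 4 individuals typed at 3 loci.
data = [
    [(100, 104), (150, 154), (200, 204)],
    [(100, 108), (150, 150), (204, 208)],
    [(104, 112), (154, 158), (200, 200)],
    [(108, 112), (150, 158), (204, 204)],
]
rng = random.Random(1)                   # fixed seed so the sketch is repeatable
trials = [randomize_by_locus(data, rng) for _ in range(100)]
print(trials[0])
```

Because each column is only reordered, allelic and genotypic frequencies at every locus are identical to those of the original data set, as the text requires.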

 

 

         6.1 Global significance of a partition

 

The first question of interest is whether the particular full-sib partition or kin partition found by the user actually reveals any real pedigree information, as opposed to being a collection of artefactual small groups of truly unrelated individuals. Two metrics of overall significance can be used for this sort of inference: the partition score and the number of related pairs in the partition.

 

The simpler and more powerful of these two metrics is the partition score itself. If Pedigree created 1000 randomized data sets, we can count T, the number of times a partition score as high as or higher than the real partition score was observed among these 1000 trials, and estimate the overall statistical significance p as T/1000. Preliminary analyses seem to indicate that this will allow detecting a real “signal” in the data, even when only a small portion of the individuals are actually related. However, there is no easy biological interpretation of the partition score.

The second measure that can be used is the number of related pairs in the partition of interest. This metric can vary between 0, when all the individuals are found in separate singleton groups (the all-unrelated or atomic partition), and N(N-1)/2, when every individual is found in one single group of size N. Again, if Pedigree created 1000 randomized data sets, we can count T, the number of times a number of related pairs as high as or higher than the number of related pairs in the real partition was observed among these 1000 trials, and estimate the overall statistical significance p as T/1000. This metric is generally less powerful than the partition score at detecting small amounts of true relatedness in the real data set, but it offers a more intuitive explanation of what was found. For example, if the maximum number of related pairs observed in the 1000 randomized trials was 500 and the actual number of related pairs in the real partition was 1500, then a very rough approximation may be that about 1000 of these pairs are “real” rather than artefactual.
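
Both global tests reduce to the same counting exercise, sketched below. The function names and all numbers are hypothetical; in practice the randomized values would come from the partitions Pedigree produces on the randomized data sets.

```python
def related_pairs(group_sizes):
    """Number of within-group pairs in a partition, given the size of each group."""
    return sum(size * (size - 1) // 2 for size in group_sizes)

def empirical_p(real_value, randomized_values):
    """T / number of trials, where T counts trials at least as high as the real value."""
    t = sum(1 for value in randomized_values if value >= real_value)
    return t / len(randomized_values)

# Bounds quoted in the text, for N = 10 individuals.
print(related_pairs([1] * 10))   # atomic partition
print(related_pairs([10]))       # one single group of size N

# Hypothetical real partition and a handful of hypothetical randomized trials.
real = related_pairs([12, 8, 5, 3, 1, 1])
randomized = [9, 14, 120, 7, 11]
print(real, empirical_p(real, randomized))
```

The same `empirical_p` helper applies unchanged to partition scores, since only the ranking of the real value among the randomized values matters.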

 

The number of moves from the all-unrelated partition is another metric that could be used, and this is reported in the Groups text file (section 3.3.1) and Partition html file (section 3.3.2) associated with a particular partition. The number of moves between any two partitions is simply the number of individuals that should be removed to get perfect agreement between the two partitions. Here we are interested in the number of moves from a specific partition of interest to the all-unrelated “atomic” partition, where every individual is in its own separate group, to provide some idea of the amount of pedigree structure present in the partition of interest. This metric can vary between 0, when all the individuals are found in separate singleton groups (the all-unrelated, atomic partition), and (N-1), when every individual is found in one single group of size N. This metric is less powerful than the two previous ones at detecting small amounts of true relatedness, and the interpretation of small differences in the number of moves between the partition based on the real data and those based on the randomized data is not immediately obvious. This metric is presented here mostly because the number of moves has been used in the literature to characterize the degree of similarity between two partitions. However, the two metrics described above are preferable.
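
For the special case of the atomic partition, the number of moves can be written down directly. The sketch below is one reading of the definition above (removing all but one individual from each group leaves only singletons); it is an assumption about the metric, not a transcription of Pedigree's code, and it does not cover the general case of two arbitrary partitions.

```python
def moves_from_atomic(group_sizes):
    """Individuals to remove so that every remaining group is a singleton."""
    return sum(size - 1 for size in group_sizes)

n = 10
print(moves_from_atomic([1] * n))        # atomic partition: lower bound
print(moves_from_atomic([n]))            # one single group: upper bound, N - 1
print(moves_from_atomic([4, 3, 2, 1]))   # hypothetical intermediate partition
```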

 

 

         6.2 Significance of specific groups within a partition

 

If it appears from the sort of analyses described above that a specific kin partition or full-sib partition is globally significant, the second question of interest is to identify which particular groups may be real and which may be artefactual.

The group-specific metric of interest is then the group cohesion score. This information is provided in the Partition file (see cohesion and repulsion scores in section 3.3.2) and in the Compare window (middle of section 4.1). The group cohesion score is the average, over every pair of individuals in the group, of the decimal logarithm of the likelihood ratio of being full-sibs over being unrelated. This gives information on the cohesion of the group, i.e. the strength of the “attraction” of the different individuals toward one another. We can then compare the cohesion score of a particular group of interest in the real partition to the distribution of the cohesion scores of groups of the same size in the randomized trials. If Pedigree created 1000 randomized data sets, we can count T, the number of randomized trials that contained at least one group of the same size with a cohesion score as high as or higher than the cohesion score of the group of interest in the real partition, and estimate the statistical significance of this group as T/1000. The logic underlying this inference strategy is based on two observations.

 

·        First, the pairwise likelihood ratio of being full-sibs versus unrelated can be substantially higher in a truly related pair (full-sibs or half-sibs) than in a pair of truly unrelated individuals that happen to have similar genotypes by chance. This is increasingly true as more loci are used to calculate the ratio.

·        Second, in a real group (e.g. a full-sib family), most of the pairwise ratios should be high because every pair of individuals is truly a pair of full-sibs. A pair of true full-sibs can sometimes have a low pairwise likelihood ratio when they happen to have systematically inherited different alleles from their parents by chance, but this is a rare occurrence, and it becomes rarer as more loci are used to calculate the ratio. In contrast, in an artefactual group of unrelated individuals, only some of the pairwise likelihood ratios will be higher than one while many others will be lower than one, because the probability that several individuals have overall similar genotypes simply by chance is quite small. Again, this probability of seeing several individuals with similar genotypes by chance decreases as the number of loci increases.
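
The cohesion score defined above can be sketched as follows. This is an illustration of the definition only, not Pedigree's implementation: the function name, the dictionary of pairwise likelihood ratios and the numerical values are all hypothetical.

```python
from itertools import combinations
from math import log10


def cohesion_score(group, pair_lr):
    """Average log10 likelihood ratio (full-sibs over unrelated) over all pairs.

    `group` is a list of individual labels; `pair_lr` maps a sorted pair of
    labels to its pairwise likelihood ratio (hypothetical input format).
    """
    pairs = list(combinations(sorted(group), 2))
    if not pairs:
        raise ValueError("a cohesion score needs at least two individuals")
    return sum(log10(pair_lr[pair]) for pair in pairs) / len(pairs)


# Made-up ratios: in a real family every pair tends to have a ratio above one.
real_family = {("a", "b"): 100.0, ("a", "c"): 1000.0, ("b", "c"): 10.0}
# In an artefactual group only some ratios exceed one.
artefact = {("x", "y"): 10.0, ("x", "z"): 0.1, ("y", "z"): 0.1}

print(cohesion_score(["a", "b", "c"], real_family))
print(cohesion_score(["x", "y", "z"], artefact))
```

Because the score is an average of logarithms, pairs with a ratio below one contribute negative terms and pull the score of an artefactual group down, which is the second observation above.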

 

Hence, for both of these reasons, we expect the cohesion score of a real group to be considerably higher than the cohesion score of an artefactual group. However, the comparison of cohesion scores must be done for a given group size, because the cohesion score of artefactual groups increases with decreasing group size. In the randomized trials, we can logically expect to see quite a few cases where 3 individuals have similar genotypes by chance (hence being joined by Pedigree in an artefactual group of size 3 with a high cohesion score), but we should expect few cases where 9 individuals happen to have similar genotypes by chance. We should thus expect to see far fewer occurrences of groups of size 9, and when they do occur by chance they should have a low cohesion score. In contrast, the cohesion of true groups is much less sensitive to group size and will be quite high for both small and large groups.
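
The size-matched comparison described above can be sketched as a simple count. The function name and the representation of the randomized trials (one list of (group size, cohesion score) pairs per trial) are assumptions made for this example and do not reflect Pedigree's output format.

```python
def group_significance(observed_score, group_size, randomized_trials):
    """Estimate the significance of a group as T / (number of trials).

    T is the number of randomized trials containing at least one group of the
    same size whose cohesion score is as high as or higher than the observed
    score. `randomized_trials` is a list of trials, each a list of
    (size, cohesion_score) tuples (hypothetical input format).
    """
    t = sum(
        1
        for trial in randomized_trials
        if any(size == group_size and score >= observed_score
               for size, score in trial)
    )
    return t / len(randomized_trials)


# Four made-up randomized trials.
trials = [
    [(3, 1.5), (9, 0.2)],
    [(3, 0.4)],
    [(2, 2.0)],
    [(3, 2.5), (3, 0.1)],
]
print(group_significance(1.0, 3, trials))  # size 3: two of four trials qualify
print(group_significance(1.0, 9, trials))  # size 9: no trial qualifies
```

Each trial is counted at most once, however many qualifying groups it contains, so with 1000 trials the result is the T/1000 estimate described in the text.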

 

The practical implication of these observations is that this data re-sampling approach will generally be quite powerful at detecting that medium to large groups are indeed significant, but this may not be possible for the many smaller groups that are often seen in a real partition. It should also be noted that this is less of a problem for full-sib partitions than for kin partitions. If individuals are truly unrelated, they will rarely be regrouped in artefactual full-sib groups of size 5 and above, because the probability that 5 or more individuals would by chance have a collection of genotypes that respects the full-sib constraint at every locus is low, depending upon the number of loci and the number of alleles per locus.
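
To illustrate why the full-sib constraint is hard to satisfy by chance, here is a minimal sketch of a per-locus compatibility check. It assumes plain Mendelian inheritance with no missing data, mutations, null alleles or genotyping errors, and it is not the routine used by Pedigree. The function names and the genotype representation are hypothetical.

```python
from itertools import combinations_with_replacement


def fullsib_compatible_at_locus(genotypes):
    """True if some pair of parental genotypes could produce every genotype.

    `genotypes` is a list of (allele, allele) tuples, one per individual.
    """
    observed = sorted({allele for g in genotypes for allele in g})
    if len(observed) > 4:  # two parents carry at most four distinct alleles
        return False
    # None stands for a parental allele that no offspring inherited.
    candidates = observed + [None]
    parents = list(combinations_with_replacement(candidates, 2))
    for p1 in parents:
        for p2 in parents:
            # Each offspring takes one allele from each parent.
            if all((a in p1 and b in p2) or (b in p1 and a in p2)
                   for a, b in genotypes):
                return True
    return False


def fullsib_compatible(individuals):
    """Check the constraint at every locus.

    `individuals` is a list of multi-locus genotypes, each a list of
    (allele, allele) tuples in the same locus order.
    """
    n_loci = len(individuals[0])
    return all(
        fullsib_compatible_at_locus([ind[locus] for ind in individuals])
        for locus in range(n_loci)
    )


print(fullsib_compatible_at_locus([(1, 2), (1, 3), (2, 3)]))  # parents 1/2 x 2/3 work
print(fullsib_compatible_at_locus([(1, 1), (2, 2), (3, 3)]))  # no parent pair works
```

Every additional individual and every additional locus adds a condition that must hold simultaneously, which is why large artefactual full-sib groups are rare among unrelated individuals.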

 

 

         6.3 Implementation of significance testing in Pedigree 2.2

 

The tools necessary for implementing the significance testing described above are already operational in Pedigree 2.2. However, these capabilities are not directly accessible to the user from the web interface, because running significance testing requires a considerable amount of computer CPU time.

 

If you have analyzed a particular data set and have generated an optimal full-sib partition and kin partition that you wish to test with the approaches described above, please contact Christophe Herbinger (Christophe.Herbinger@dal.ca) and briefly explain the situation. We will then give you access to the significance testing tools.

 

 

 
