Y-DNA Family Grouping App FAQ

Why is grouping by probable descent from a common male ancestor within the genealogical time frame or surname era important?
What is the genealogical time frame?
What is the surname era?
How does the app determine groups?
Are the groups limited to kits that all probably have a common ancestor within the genealogical time frame as defined by FTDNA?
Why does the app merge together overlapping groups of "big kits"?
Why are "small kits" assigned to multiple groups rather than merging the groups as is done with "big kits"?
What is genetic distance?
How does the app calculate genetic distance between two kits?
How does the app use genetic distance to determine whether two kits "match"?
What are FTDNA's guidelines for interpreting genetic distance between two kits?
Does the app use haplotypes or SNP results? If not, why not?
What are the requirements, advantages and limitations of the input options?
How can I get an .html file of a project's results page to submit as input?
How can I get a .csv file of the project data to submit as input?
Does using the app with private information compromise the confidentiality of that information?
Why does the app assign some 12- and 25-marker kits to multiple groups?
In the Group Assignments table, why do some higher group numbers appear above lower group numbers?
Why are there a lot of kits that are not assigned to any group by the app?
What is relational distance?
What are the limitations of the Reorganized Table?
How can I avoid having date formats in downloaded tables?
How should a surname group administrator use the results of the app to organize kits into groups?
How can an administrator change the grouping of project kits?

Why is grouping by probable descent from a common male ancestor within the genealogical time frame or surname era important?

The primary reason that people join surname projects is to find other people who have their surname (or the surname of one of their ancestors) and who may be related to them through a common male ancestor. This information can be vital to help break through genealogical "brick walls" and to find a more distant ancestor with that surname. Identifying all people in a surname project who are probably related within the genealogical time frame or surname era and organizing them into groups is therefore one of the most important and useful benefits of a surname project.

What is the genealogical time frame?

The genealogical time frame is generally defined as the period of time during which it is possible to establish genealogical relationships based on documentary evidence. See ISOGG wiki definition of "genealogical time frame". FTDNA, however, uses a more restrictive and precise definition of genealogical time frame. FTDNA defines "genealogical time frame" as the most recent 15 generations. See FTDNA's definition. Based on a new generation starting, on average, 30 years after the start of the prior generation (i.e., most children being born when, on average, their father was 30 years old), FTDNA's definition of genealogical time frame translates to approximately the last 450 years, which means back to about 1550. Since FTDNA's definition of genealogical time frame is limited to the most recent 15 generations, whenever this FAQ or other parts of the app refer to kits having a genetic distance which, under FTDNA's genetic distance interpretation guidelines, indicates that they probably have a common ancestor within the genealogical time frame, it means within the most recent 15 generations.

What is the surname era?

The surname era is the period of time in a particular region or country that the use of a family name has been hereditary. See ISOGG wiki definition of "surname era". As used in the context of a particular surname project, however, this app uses "surname era" to mean the period of time that that particular surname has been in use as a hereditary family name.

How does the app determine groups?

At a general level, the app identifies groups of kits that have a genetic distance from each other which indicates that they probably share a common male ancestor within the genealogical time frame and then merges any overlapping groups unless the overlap is only because of a shared 12- or 25-marker kit (a "small kit").

More specifically, the app first goes down the rows in the results table and, for each 37-, 67- and 111-marker kit (a "big kit"), checks which other kits in the project have a genetic distance from that big kit which, under FTDNA's genetic distance interpretation guidelines, indicates that they probably have a common ancestor with it within the genealogical time frame (aka "matching kits"). If a big kit has any matching kits, the app forms a group consisting of that big kit and all of the kits (including small kits) that match with it. After groups based on each of the big kits are formed, all overlapping groups that share a common big kit are merged. In other words, if big kit A matches with big kit B and they are put together in Group 1, and big kit B matches with big kit C and they are put together in Group 2, Group 1 and Group 2 will be merged because they have big kit B in common, even though the genetic distance between big kit A and big kit C is too large for them to be deemed "matching kits" and to have been put together in the same initial group.

Small kits are included in the initial groupings based on matches with each big kit, but cannot cause the merger of any of those groups. So, for example, if big kit A matches with small kit B and they are put together in Group 1, and big kit C also matches with small kit B and they are put together in Group 2, Group 1 and Group 2 will not be merged because the only kit they have in common is small kit B. Instead, the app will indicate that small kit B has been assigned to both Group 1 and Group 2.

Finally, the app goes down the rows in the results table and, for each small kit, checks to see what other small kits match with it. Matching small kits are assigned to a new group and then overlapping small-kit-only groups are merged together, so that if small kit A matches with small kit B in Group 18 and small kit B also matches with small kit C in Group 19, the two small-kit groups are merged together.
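
The grouping steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the app's actual code: the function name form_groups, the is_big lookup, and the precomputed set of matching pairs are all hypothetical and were chosen for the example.

```python
def form_groups(kits, is_big, matches):
    """Sketch of the grouping logic (hypothetical names and inputs).

    kits    -- kit ids in results-table order
    is_big  -- dict: kit id -> True for 37/67/111-marker kits
    matches -- set of frozenset({a, b}) pairs already deemed "matching"
    """
    def match(a, b):
        return frozenset((a, b)) in matches

    def merge_overlapping(groups, can_trigger_merge):
        # Merge groups that share at least one kit allowed to trigger a merge.
        merged = []
        for group in groups:
            group = set(group)
            kept = []
            for other in merged:
                if any(can_trigger_merge(k) for k in group & other):
                    group |= other
                else:
                    kept.append(other)
            kept.append(group)
            merged = kept
        return merged

    # Step 1: one initial group per big kit that has any matching kit.
    big_initial = []
    for k in kits:
        if is_big[k]:
            members = {k} | {o for o in kits if o != k and match(k, o)}
            if len(members) > 1:
                big_initial.append(members)

    # Step 2: merge only when the shared kit is a big kit.
    big_groups = merge_overlapping(big_initial, lambda k: is_big[k])

    # Step 3: groups made up solely of small kits that match each other.
    small_initial = []
    for k in kits:
        if not is_big[k]:
            members = {k} | {o for o in kits
                             if o != k and not is_big[o] and match(k, o)}
            if len(members) > 1:
                small_initial.append(members)
    small_groups = merge_overlapping(small_initial, lambda k: True)

    return big_groups + small_groups


# Example: A-B and B-C match (big kits), small kit S matches big kits A and D,
# and small kits T and U match each other.
kits = ["A", "B", "C", "D", "S", "T", "U"]
is_big = {"A": True, "B": True, "C": True, "D": True,
          "S": False, "T": False, "U": False}
pairs = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "S"),
                                ("D", "S"), ("T", "U")]}
groups = form_groups(kits, is_big, pairs)
```

In this example, A, B, C and S end up in one group, D and S in a second group (S is listed in both because a small kit cannot trigger a merge), and T and U in a small-kit-only group.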

Why does the app merge together overlapping groups of "big kits"?

In testing, it was found that defining a group as only containing kits that, under FTDNA guidelines, probably shared a common male ancestor within the genealogical time frame with every other kit in the group often resulted in a large number of overlapping groups that experienced project administrators had concluded should be placed in one group. Merging the groups with overlapping big kits, on the other hand, was found to result in groups that were generally very consistent with those created by experienced project administrators.

Merging groups with overlapping big kits means that some kits in a group may have a genetic distance from some other kits in the group which, based on FTDNA guidelines, indicates that they probably do not share a common male ancestor in the genealogical time frame (i.e., 15 generations).

Why are "small kits" assigned to multiple groups rather than merging the groups as is done with "big kits"?

It is quite common for a small kit to match with big kits in different groups. In testing, it was found that merging groups containing big kits that overlap only because they had one or more small kits in common resulted in merging groups that experienced project administrators thought should be kept separate and that probably represented unrelated lineages. Therefore, rather than merging the two groups based on the assumption that the small kit represents some connecting link between two parts of one related group, the app assumes that the small kit just needs additional testing to determine which of the two big kit groups it belongs in, and shows the small kit as matching both groups.

Are the groups limited to kits that all probably have a common ancestor within the genealogical time frame as defined by FTDNA?

While each of the groups initially calculated by the app is limited to kits that, under FTDNA's guidelines, all probably have a common male ancestor within the genealogical time frame as defined by FTDNA (i.e., 15 generations), because groups with overlapping big kits are then merged, a final group may be broader and may contain some kits that, under FTDNA's guidelines, probably do not share a common male ancestor within that genealogical time frame, but instead probably share a common male ancestor within a somewhat longer time frame. Assuming the kits in the project are limited to those with the same surname, all the kits in a group should probably have a common male ancestor within the surname era for that particular surname. If, however, men who shared a common male ancestor within the near pre-surname period adopted the same surname (probably most likely in the case of clan-based surnames), the app's groups could include kits whose most recent common male ancestor dates from the near pre-surname period.

What is genetic distance?

ISOGG defines genetic distance as the number of mutations between two sets of DNA test results. All genetic distance numbers are just estimates of the number of mutations because, while the test results show the differences in STR results, they do not show whether the differences were the result of one or multiple mutations and do not show any mutations where a second mutation reversed a first mutation.

How does the app calculate genetic distance between two kits?

The genetic distance between two kits is determined based on the maximum standard panel size that both kits have results for. For example, if one kit tested for 111 markers and the other kit tested for 37 markers, their genetic distance is calculated based on a 37-marker comparison.
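
As a rough illustration of the panel-size rule above, the following Python sketch picks the comparison panel for two kits. It assumes each kit is described simply by the number of markers tested and that the standard panels are nested (12, 25, 37, 67, 111); the name comparison_panel is hypothetical.

```python
STANDARD_PANELS = (12, 25, 37, 67, 111)

def comparison_panel(markers_a, markers_b):
    """Largest standard panel that both kits have results for."""
    shared = min(markers_a, markers_b)
    eligible = [p for p in STANDARD_PANELS if p <= shared]
    if not eligible:
        raise ValueError("kits need at least 12 markers in common")
    return max(eligible)

print(comparison_panel(111, 37))  # 37
```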

Genetic distance can be calculated using a variety of different algorithms. The simplest algorithm is the "infinite alleles model" which treats each difference between two markers as the result of a single mutation. Using this model, the genetic distance between two kits is simply the number of markers that differ between the kits.

The other basic algorithm for calculating genetic distance is the "step-wise mutation model." In this model, if a marker value differs by two between two kits, the difference is deemed to have been the result of two separate mutations. Using this model, the genetic distance between two kits is the sum of the differences between each of the individual markers.
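
The difference between the two basic models can be seen in a short Python sketch. The marker values below are made up for illustration, and the function names are hypothetical.

```python
def infinite_alleles_distance(kit_a, kit_b):
    """Count the markers whose values differ (one mutation per differing marker)."""
    return sum(1 for a, b in zip(kit_a, kit_b) if a != b)

def stepwise_distance(kit_a, kit_b):
    """Sum the absolute differences (one mutation per repeat of difference)."""
    return sum(abs(a - b) for a, b in zip(kit_a, kit_b))

# Illustrative values for four markers; the second marker differs by 2
# and the fourth by 1.
kit_a = [13, 24, 14, 11]
kit_b = [13, 26, 14, 10]

print(infinite_alleles_distance(kit_a, kit_b))  # 2
print(stepwise_distance(kit_a, kit_b))          # 3
```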

The app uses FTDNA's genetic distance algorithm, which is a hybrid of the two more basic models. In most cases, the step-wise model is used; however, for certain markers and in certain situations, a variant of the infinite alleles model is used. Specifically, genetic distance between two kits is calculated as the sum of differences between the values for each separate marker (see FTDNA 'Genetic Distance'), except that (i) for null markers and multi-value markers, the amount (if any) added to the genetic distance calculation is determined in accordance with the methods adopted by FTDNA in 2016, as described in this article by Roberta Estes, and (ii) for DYS389ii, the value for DYS389i is first subtracted from DYS389ii before calculating the difference, as described in this FTDNA forum discussion and the "DYS389I&II" section of this article by John Barrett Robb.
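
The DYS389ii adjustment in item (ii) can be illustrated with a small Python sketch. It covers only that adjustment, not the null-marker or multi-value-marker rules, and the function name and dictionary layout are assumptions made for the example rather than the app's actual code.

```python
def dys389_contribution(kit_a, kit_b):
    """Step-wise contribution of DYS389i and DYS389ii to genetic distance.

    DYS389i is subtracted from DYS389ii first, so a single change in
    DYS389i is not counted a second time through DYS389ii.
    """
    diff_i = abs(kit_a["DYS389i"] - kit_b["DYS389i"])
    adjusted_a = kit_a["DYS389ii"] - kit_a["DYS389i"]
    adjusted_b = kit_b["DYS389ii"] - kit_b["DYS389i"]
    return diff_i + abs(adjusted_a - adjusted_b)

# Illustrative values: both markers differ by 1, but the DYS389ii difference
# is entirely explained by the DYS389i difference.
kit_a = {"DYS389i": 13, "DYS389ii": 29}
kit_b = {"DYS389i": 14, "DYS389ii": 30}
print(dys389_contribution(kit_a, kit_b))  # 1 (a plain sum of differences would give 2)
```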

Based on a comparison of the app's genetic distance results with those given by FTDNA, the app's calculations appear to be fully consistent with those used by FTDNA, with one narrow exception. Consistent with Roberta Estes' description (referenced above) of FTDNA's 2016 changes to its methods of calculating genetic distance, the app treats every null marker as a single mutation. However, an FTDNA forum post indicates that, as of August 2017, FTDNA's internal computer program only treated "some" null markers as a single mutation. FTDNA's response, which described treating only certain null markers as a single mutation as "unfortunate" and as the situation "at this time," suggests that they agree that all null markers should be treated as a single mutation and that they hoped to correct their program in the future.

How does the app use genetic distance to determine whether two kits "match"?

For purposes of creating the groups, the app treats two kits as "matching" and appropriate for inclusion in the same group, if the genetic distance between the two kits, as determined based on the maximum standard panel size that both kits tested for (i.e., 12, 25, 37, 67 or 111), falls within the range necessary for the two kits to be deemed to "probably" have a common male ancestor within the genealogical time frame, as set forth in FTDNA's genetic distance interpretation guidelines. Under those guidelines, two kits are deemed to "probably" have a common male ancestor within the genealogical time frame if they have a genetic distance of not more than 1 in the case of a 12-marker comparison, 2 in the case of a 25-marker comparison, 4 in the case of a 37-marker comparison, 6 in the case of a 67-marker comparison, and 7 in the case of a 111-marker comparison. Note that, with respect to 67- and 111-marker comparisons, these guidelines are more restrictive than the criteria FTDNA uses in deciding which kits to show as "matches" on a person's Y-DNA Matches page. On that page, in addition to showing kits that "probably" have a common male ancestor within the genealogical time frame, FTDNA also includes 67- and 111-marker kits that "only possibly" have a common male ancestor within that time frame. See "Y-DNA - Matches Page, are only exact matches shown?" on the Y-DNA - Matches Page.
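
The thresholds listed above amount to a simple lookup, sketched here in Python. The names MAX_MATCH_DISTANCE and is_match are hypothetical; the numbers are the ones stated in the paragraph above.

```python
# Largest genetic distance treated as a "match" for each comparison panel.
MAX_MATCH_DISTANCE = {12: 1, 25: 2, 37: 4, 67: 6, 111: 7}

def is_match(panel, genetic_distance):
    """True if the distance is within the 'probably related' range for the panel."""
    return genetic_distance <= MAX_MATCH_DISTANCE[panel]

print(is_match(37, 4))   # True
print(is_match(67, 7))   # False
print(is_match(111, 7))  # True
```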

What are FTDNA's guidelines for interpreting genetic distance between two kits?

FTDNA's guidelines for interpreting genetic distance between two kits are summarized in its article Expected Relationships with Y-DNA STR Matches and set forth separately for each kit size in more detail in the articles listed below:

It should be noted that the interpretation guidelines for 12- and 67-marker comparisons in the summary article differ slightly from those in the separate articles for those comparisons. Specifically, the summary article indicates that a genetic distance of 1 on a 12-marker comparison means that the test takers are "probably related," while the separate article for 12-marker comparisons indicates that a genetic distance of 1 means they are only "possibly related." In the case of a 67-marker comparison, the summary article indicates that a genetic distance of 5-6 means that the test takers are "probably related" and that a genetic distance of 7 means that they are only "possibly related," while the separate article for 67-marker comparisons indicates that a genetic distance of 5-6 means that the test takers are "related" and that a genetic distance of 7 means that they are only "probably related." The app uses the summary article's guidelines in both cases: in the case of a 12-marker comparison, because it is very common for kits that have a genetic distance of 1 on a 12-marker comparison to be solid matches when compared at a greater number of markers (and because the app does not allow 12-marker kits to merge groups, in any event) and, in the case of a 67-marker comparison, because counting a genetic distance of 7 as a match seems inconsistent with the guidelines for 111-marker comparisons which also treat a genetic distance of 7 as the largest genetic distance for a match.

Does the app use haplotypes or SNP results? If not, why not?

No. The app determines kit groupings solely on the basis of STR test results. While appropriate SNP testing is more definitive than STR testing in determining descent from a common ancestor, since most test takers in most surname projects have not taken the necessary SNP tests to make that determination, SNP tests cannot currently be used as the primary basis for forming groups for most surname projects. To the extent appropriate SNP results exist, they should, however, be reviewed to confirm (or disconfirm) the groupings determined by STR results.

What are the requirements, advantages and limitations of the input options?

The app uses Y-DNA STR results for multiple kits (e.g., the kits in a surname project) as input data. The data must be in the format used by FTDNA in its surname projects.

Data can be input by submitting either (1) an .html file of the project's FTDNA public or GAP Y-DNA Classic or Colorized results page or (2) a .csv file of the project's Y-DNA STR results in FTDNA format (i.e., with the same column order and headings as used in FTDNA results tables). For instructions on how to obtain an appropriate .html file, see How to get an .html file of a project's results; for instructions on how to obtain an appropriate .csv file, see How to get a .csv file containing project data.

Each input option has advantages and disadvantages:

html input option
- Advantages
- Disadvantages
csv input option
- Advantages
- Disadvantages

How can I get an .html file of a project's results page to submit as input?

You can get an .html file of a project's results page to submit as input in the following manner:

Make sure you are logged into the FTDNA website under your account as a project administrator or a project member (if you are one) so that you will be able to see non-public results on the project's results page.
Go to the project's public or, if you are a project administrator, GAP Y-DNA Classic or Colorized results page.
Use the "Save Page As" function of your web browser to save the page as a "webpage, html only" file. The "Save Page As" function can generally be found in the File menu of the browser or in the menu that appears when you right click on the webpage.

How can I get a .csv file of the project data to submit as input?

If you are a project administrator of the FTDNA project whose results you want to analyze, the easiest way to obtain a .csv file with the project's results is by using the "Download Files" function from "Project Administration" on the GAP menu bar or the "Export to Spreadsheet" button on the GAP results page.

Alternatively, or if you are a member of the project whose results you want to analyze but not an administrator of the project, you can create a .csv file that contains all of the project's results in the following manner:

Either (1) go to a Y-DNA results page (either Classic or Colorized) on your project's regular site (not a GAP results page), use your web browser's copy function to copy the entire page, and paste it into an Excel spreadsheet, or (2) get an .html file of the project's results (see instructions in the section above) and open it in Excel.
Delete all rows in the spreadsheet that are above the table's column headers row or below the last kit row in the table.
If you need to prepare a .csv file for a very large project whose kits do not all fit on one page of FTDNA's Y-DNA Results chart, use one of the methods described above to copy all the results into Excel and combine them in a single spreadsheet, omitting all column heading rows other than the one at the top of the spreadsheet.
Save the spreadsheet as a .csv file.

Regardless of whether you are a project administrator and are able to download the project results as a .csv file or create your own .csv file in Excel, you may want to open the file in Excel and remove the data for any kits that do not claim male lineal descent from a man with the project surname, since including the data for those kits may alter the grouping results.

Note that although Excel may automatically change certain multi-value markers into dates, the app will automatically convert them back to the proper values; you do not need to reconvert them yourself.

Does using the app compromise the confidentiality of private information?

No. The app does not store, save or disclose any submitted data. The data is only used in the internal processing of the app, and is only retained in the app during the user's current session.

Why does the app assign some 12- and 25-marker kits to multiple groups?

If a small kit matches with big kits in separate groups, rather than merging the two groups, the small kit is assigned to both groups. Additional testing would probably show which of the groups the small kit properly belongs to. Small kits are also assigned to groups which consist only of other small kits that they match with.

In the Group Assignments table, why do some higher group numbers appear above lower group numbers?

The app first goes down the rows and assigns group numbers to groups formed on the basis of matches between 37-, 67- and 111-marker kits ("big kits"). After these "big kit" groups are formed, the app goes back to the top and assigns group numbers to groups consisting solely of matching 12- and 25-marker kits ("small kits"). Therefore, if a group with a higher number appears above lower group numbers, that means that the group with the higher number consists only of small kits.

Why are there a lot of kits that are not assigned to any group by the app?

If the Group column for a kit is blank, that means that its genetic distance from every other kit in the project is too large for it to be considered to probably have a common male ancestor within the genealogical time frame with any other kit in the project, as determined by FTDNA's genetic distance interpretation guidelines. You can view the genetic distances for a particular kit from every other kit in the project by clicking on the kit's number. It is common for a substantial portion of kits in a surname project to not match with any other kit in the project. The most likely reason for this is that men who have taken Y-DNA tests are still a small percentage of the total population and, therefore, there are many paternal lines for which no representative, or only one representative, has taken a Y-DNA test. As the number of men who have taken Y-DNA tests grows, the percentage of unmatched kits should go down.

What is relational distance?

"Relational distance" is the term used by the app for the genetic closeness of two kits (or a kit and the modal values of a group) as determined by FTDNA's genetic distance interpretation guidelines. A relational distance of 0 means the two kits (or a kit and the modal values of the group, as the case may be) are "very tightly related" under those guidelines, 1 means they are "tightly related," 2 means "related," 3 means "probably related," and 4 means "only possibly related." If there is no relational distance number, the kits are "not related" under FTDNA's genetic distance interpretation guidelines. Per FTDNA's guidelines, being "related" means having a common male ancestor within the genealogical time frame.

What are the limitations of the Reorganized Table?

The Reorganized Table organizes the kits in a project based on the groups the app has assigned the kits to. The organization suggested by the Reorganized Table should, however, always be reviewed and, if necessary, modified.

Set forth below are some reasons why the organization proposed by the Reorganized Table should be modified:

Ideally, before inputting project data into the app, all kits that do not claim male lineal descent from a male ancestor with the project surname should be removed. If those kits have not been removed prior to inputting the project data into the app, those kits may be inappropriately included in matching surname groups. They should instead be in a separate group for all non-surname kits.
SNP results may show that a kit or group of kits should be grouped differently. SNP results are more definitive than STR results and should trump the app's suggested groupings based on STR results.
Genealogical information may indicate that a kit that is assigned to multiple groups belongs in one of the groups and not in the other or indicate that a kit that has a genetic distance that is slightly too great to have been included in a group by the app has a common male ancestor with kits in the group and should be placed in the group.
A close analysis of the STR results may show that two distinct groups that are probably not related in the genealogical time frame or surname era have been merged. In this case, the two distinct groups should be separated. This type of merger between unrelated groups would most likely occur because of convergent mutations which caused one or more kits to have STR results that are sufficiently close to kits in the other group to have caused the kit(s) to be placed in both groups.

How can I avoid having date formats in downloaded tables?

When tables are downloaded from the app, they are downloaded into a csv file. A csv file does not contain any information that tells Excel what format to apply to the data in the file. Therefore, when a csv file is opened in Excel or imported into an Excel spreadsheet, Excel makes a guess as to what type of format to apply to the data. In the case of hyphenated numbers that could be dates, Excel by default applies date formatting. To prevent Excel from turning the data into dates, (1) do not open the csv file in Excel but instead import it into an Excel spreadsheet (how to do that varies by what version of Excel you are using; sometimes there is an import button on the toolbar and sometimes the import function is in the Data menu), (2) when asked how the data is delimited, select commas, and (3) when asked if you want to apply formatting to specific columns, separately highlight the columns for DYS385 and DYS459 and select Text for each. If this process is followed, the imported data should not contain any date-formatted data.

If you open or import an FTDNA downloaded csv file in Excel, the same problem of hyphenated values being converted into dates will occur. The steps described above should prevent that from happening. However, the app contains code that will automatically reconvert dates to the correct hyphenated values, so csv files containing date formats should work fine as input files.
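
To give a sense of what reconverting dates can involve, here is a rough Python sketch. It is not the app's code. It assumes that Excel has rewritten a two-value marker such as 11-14 as text in the form 14-Nov or Nov-14, and that the month number was originally the first value; other date layouts are left untouched. The name undo_excel_date is hypothetical.

```python
import re

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# Assumed patterns for date-formatted cells, e.g. "14-Nov" and "Nov-14".
NUM_THEN_MONTH = re.compile(r"^(\d{1,2})-([A-Za-z]{3})$")
MONTH_THEN_NUM = re.compile(r"^([A-Za-z]{3})-(\d{1,2})$")

def undo_excel_date(value):
    """Turn an assumed date-formatted cell back into a hyphenated marker value."""
    m = NUM_THEN_MONTH.match(value)
    if m and m.group(2).title() in MONTHS:
        return f"{MONTHS[m.group(2).title()]}-{int(m.group(1))}"
    m = MONTH_THEN_NUM.match(value)
    if m and m.group(1).title() in MONTHS:
        return f"{MONTHS[m.group(1).title()]}-{int(m.group(2))}"
    return value  # not recognised as a date: leave unchanged

print(undo_excel_date("14-Nov"))  # 11-14
print(undo_excel_date("11-14"))   # 11-14 (already a marker value)
```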

How should a surname group administrator use the results of the app to organize kits into groups?

If a surname project's kits have already been put into groups based on relational distance to other kits, the app will largely just confirm the existing groupings, but the Group Assignments page of the app may suggest a number of kits that should be reviewed to see if they have been grouped properly. Clicking on the kit number and looking at the Relational Distances page for those kits may help determine whether they are properly placed.

If a surname project's kits are currently only grouped by haplotype or have otherwise not been fully organized into groups based on relational distance, the app's Reorganized Table page will provide a good starting point for an appropriate organization of the project's kits.

The app's results should be reviewed with available SNP test results and genealogical information in order to make a final determination as to which kits probably share a common male ancestor within the genealogical time frame or the surname era.

It is suggested that kits in a surname project be organized in the following manner:

First, matched groups, each of which should consist only of kits for men that may have male lineal descent from a male with the project surname and that (based on STR results, as supplemented by SNP results and genealogy) probably share a common male ancestor within the genealogical time frame or surname era. For ease of reference and to make sure they appear in order, each matched group's name should begin "Matched Group 01," "Group 01," "Lineage I" or something similar. Either arabic or roman numerals may be used for the numbers. If arabic numerals are used and there are more than 9 matched groups in the project, it is important to precede the numbers less than 10 with a 0 so that they will be listed first. It is common and useful to follow the number with a few words that describe what the kits in the group have in common. Since there is no other place on the results page to explain how groups were formed, perhaps the best description is "- kits in the group probably share a common male ancestor within the genealogical time frame or surname era." Other descriptions, such as haplotype or terminal SNP, common ancestor, or geographic location, are less desirable because (1) they may not be correct for all current or future members of the group, (2) they may describe some kits that are not included in the group, (3) they do not describe the basis on which the group was formed, and (4) they may be duplicative of information that is listed for individual kits (e.g., ancestry, geography and haplotype). If a project has a large number of kits that probably have a common male ancestor in the genealogical time frame or surname era, the project administrator may wish to break the kits up and put them into separate groups. The app's Subgroups Table can be used to help identify meaningful subgroups.
If the project administrator decides to break related kits into separate groups, it is important that the names for these subgroups indicate both (1) that all the subgroups probably share a common male ancestor in the genealogical time frame or surname era and (2) the basis on which each separate group was formed. An example of a heading that does this is "Group 01A - Descendants of John Brown; probably share a common male ancestor with other 01 groups."

Second, one or more groups of 12- and 25-marker kits that need additional testing before they can be properly assigned to a matched group. Projects can choose between two reasonable approaches to 12- and 25-marker kits. They can either deem all such kits as needing additional testing or only those that match with "big kits" in more than one matched group (i.e., those in the "Multi-Group Small Kits" group in the app's Reorganized Table). Small kits deemed to need additional testing can either all be put in a single group labelled so that it comes after all the matched groups (such as "Kits needing additional testing - kits in this group need to upgrade to at least 37 STRs in order to be properly placed in a matched group") or, if a project elects to deem all small kits as needing additional testing, small kits that match with big kits in a particular group could be put in a separate additional testing group following the applicable matching group (such as "Lineage I possible - kits in this group need to upgrade to at least 37 STRs in order to be properly placed in a matched group").

Third, a group of unmatched surname kits, consisting of all kits for men that may have male lineal descent from a male with the project surname but that (based on STR results, as supplemented by SNP results and genealogy) probably do not share a common male ancestor within the genealogical time frame with any other kit in the project (other than non-surname kits). Note that unmatched kits would include any kits that only match with small kits that the project administrator decides should be shown as needing additional testing (e.g., probably all kits in the "Conditionally Matched Kits" group of the app's Reorganized Table). Some projects set up multiple groups of unmatched kits based on haplotype; however, it is unclear what benefit there is to doing so. Any group for unmatched kits should be named so that it comes after all of the groups for matched kits and all of the groups for kits needing additional testing. Naming the group starting with "Unmatched" is usually sufficient for that purpose. An example of an appropriate full group name would be "Unmatched Ashleys - kits in this group probably do not share a common ancestor with any other Ashley kit in the genealogical time frame."

Lastly, a group for non-surname kits, consisting of all kits for men that do not have male lineal descent from a male with the project surname. It is common for people who do not have male lineal descent from a male with the project surname to join a project because they have an ancestor with that surname and are interested in his ancestry. Unfortunately, their Y-DNA results become part of the project's results even though they are not relevant to the project. All kits of this type should be put in a single group for non-surname kits which is named so that it appears at the bottom of the results table. In order to make sure that the group appears last, it may be necessary to start the group name with a meaningless letter such as x or z. An example of an appropriate full group name would be "xNon-Ashleys - kits in this group are not believed to have male lineal descent from an Ashley."
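
The advice above about putting a 0 in front of arabic group numbers below 10 can be illustrated with a short Python sketch. It assumes that group names are ordered by plain character-by-character comparison, which is how Python sorts strings; the helper name padded_group_name is hypothetical.

```python
def padded_group_name(number, width=2):
    """Build a group name whose number is zero-padded to a fixed width."""
    return f"Group {number:0{width}d}"

unpadded = ["Group 1", "Group 2", "Group 10"]
padded = [padded_group_name(n) for n in (1, 2, 10)]

print(sorted(unpadded))  # ['Group 1', 'Group 10', 'Group 2']
print(sorted(padded))    # ['Group 01', 'Group 02', 'Group 10']
```

Without the leading zero, "Group 10" sorts ahead of "Group 2" because the comparison stops at the first differing character.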

How can an administrator change the grouping of project kits?

To form subgroups or edit subgroups in a project, a project administrator should go to their GAP home page and, under the "Project Administration" menu, select "Member Subgrouping." Under the Manage column, clicking the edit icon lets you change the group name, description, and color, while clicking the arrow icon lets you add kits to, or remove kits from, the group. In order to assign a kit to a new group, it must first be removed from its existing group and then, after going into the arrow/select page for the new group, added to the new group.