Y-DNA Family Grouping App FAQ
Why is grouping by probable descent from a male common
ancestor within the genealogical time frame or surname era important?
The primary reason that people join surname projects is to find other people who have
their surname (or the surname of one of their ancestors) and who may be
related to them through a common male ancestor. This
information can be vital to help break through genealogical "brick
walls" and to find a more distant ancestor with that surname.
Identifying all people in a surname project who are probably
related within the genealogical time frame or surname era and organizing them into
groups is therefore one of the most important and useful benefits of a
surname project.
What is the genealogical time frame?
The genealogical time frame is generally defined as the period of time during which it
is possible to establish genealogical relationships based on
documentary evidence. See
ISOGG wiki definition of "genealogical time frame". FTDNA, however, uses a more restrictive and precise definition of
genealogical time frame. FTDNA defines "genealogical time frame" as the most recent 15 generations. See FTDNA's definition. Based on
a new generation starting, on average, 30 years after the
start of the prior generation (i.e., most children being born when, on
average, their father was 30 years old), FTDNA's definition of genealogical time
frame translates to approximately the last 450 years, which means back to about 1550. Since FTDNA's definition of genealogical
time frame is limited to the most recent 15 generations, whenever this FAQ or other parts of the app refer to kits
having a genetic distance which, under FTDNA's genetic distance interpretation guidelines,
indicates that they probably have a common ancestor within the genealogical time frame, it means within the most recent
15 generations.
What is the surname era?
The surname era is the period of time in a particular region or country that the use of a family name has been
hereditary. See
ISOGG wiki definition of "surname era". As used in the context of a particular surname project, however, this
app uses "surname era" to mean the period of time that that particular surname has been in use as a hereditary family name.
How does the app determine groups?
At a general level, the app identifies groups of kits that have a
genetic distance from each other
which indicates that they probably share a common male ancestor within the
genealogical time frame
and then merges any overlapping groups unless the overlap is only because of a shared
12- or 25-marker kit (a "small kit").
More specifically, the app first goes down the rows in the results table and, for each 37-, 67- and 111-marker
kit (a "big kit"), the app checks to see what other kits in the project have
a genetic distance which, under
FTDNA's genetic distance interpretation guidelines,
indicates that they probably have a common ancestor within the genealogical time frame
with that big kit (aka "matching
kits"). If a big kit has any matching kits, the app forms a group consisting of that big kit and
all of the kits (including small kits) that match with it. After groups based on each of the big kits are formed, all overlapping groups that share a
common big kit are merged. In other words, if big kit A matches big kit B and they are put together in Group 1, and
big kit B matches big kit C and they are put together in Group 2, Group 1 and Group 2 will be merged because they have big kit
B in common, even if the genetic distance between big kit
A and big kit C is too large for them to be deemed "matching kits" and to have been put together in the same initial group.
Small kits are included in the initial groupings based on matches with each big kit, but cannot cause the merger of any of
those groups. So, for example, if big kit A matches small kit B and they are put together in Group 1, and big kit C also matches
small kit B and they are put together in Group 2, Group 1 and Group 2 will not be merged because the only kit they have in common is
small kit B. Instead, the app will indicate that small kit B has been assigned to both Group 1 and Group 2.
Finally, the app goes down the rows in the results table and, for each small kit, checks to see what other small kits match
with it. Matching small kits are assigned to a new group and then overlapping small-kit-only groups are
merged together, so that if small kit A matches with small kit B in
Group 18 and small kit B also matches with small kit C in Group 19, the
two small-kit groups are merged together.
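The grouping steps described above can be sketched in Python as follows. This is only an illustration of the described logic, not the app's actual code: the function names, the `panel_sizes` dictionary and the `matches` predicate are all hypothetical, and the sketch assumes that the small-kit pass considers every small kit, including those already assigned to a big-kit group.

```python
BIG_PANELS = {37, 67, 111}  # "big kit" sizes; 12- and 25-marker kits are "small kits"


def merge_overlapping(groups, mergers):
    """Merge groups (sets of kit ids) that share at least one kit in `mergers`."""
    merged = []
    for group in groups:
        group = set(group)
        untouched = []
        for existing in merged:
            if group & existing & mergers:
                group |= existing  # absorb the overlapping group
            else:
                untouched.append(existing)
        merged = untouched + [group]
    return merged


def initial_groups(anchors, candidates, matches):
    """Form one group per anchor kit that matches at least one candidate kit."""
    groups = []
    for anchor in anchors:
        matched = {k for k in candidates if k != anchor and matches(anchor, k)}
        if matched:
            groups.append({anchor} | matched)
    return groups


def form_groups(panel_sizes, matches):
    """panel_sizes maps kit id -> markers tested; matches(a, b) returns True or False."""
    big = [k for k, n in panel_sizes.items() if n in BIG_PANELS]
    small = [k for k, n in panel_sizes.items() if n not in BIG_PANELS]
    # Groups built around big kits are merged only when they share a big kit.
    big_groups = merge_overlapping(
        initial_groups(big, list(panel_sizes), matches), set(big))
    # Small-kit-only groups are then formed and merged among themselves.
    small_groups = merge_overlapping(
        initial_groups(small, small, matches), set(small))
    return big_groups + small_groups


# Hypothetical sample data: small kit "s" matches big kits in two lineages.
panel_sizes = {"A": 37, "B": 67, "C": 37, "D": 111, "E": 67, "s": 12}
pairs = {frozenset(p) for p in
         [("A", "B"), ("B", "C"), ("A", "s"), ("D", "s"), ("D", "E")]}
groups = form_groups(panel_sizes, lambda a, b: frozenset((a, b)) in pairs)
for number, group in enumerate(groups, start=1):
    print(number, sorted(group))
```

With this sample data the sketch produces two groups, one containing A, B, C and s and one containing D, E and s. The two groups stay separate because the only kit they share is the small kit s, which is therefore listed in both.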
Why does the app merge together overlapping groups of "big kits"?
In testing, it was found that defining a group as only containing kits that,
under FTDNA guidelines, probably shared a common male ancestor within the
genealogical time frame with every other kit in the group, often resulted in a large number of
overlapping groups that experienced project administrators had concluded should be placed
in one group. Merging the groups with overlapping big kits, on the other hand, was found to
result in groups that were generally very consistent with those created by experienced
project administrators.
Merging groups with overlapping big kits means that some kits in a group
may have a genetic distance from some other kits in the group which, based on FTDNA guidelines, indicates
that they probably do not share a common male ancestor in the genealogical time frame
(i.e., 15 generations).
Why are "small kits" assigned to multiple groups rather than merging the
groups as is done with "big kits"?
It is quite common for a small kit to match with big kits in different groups.
In testing, it was found that merging
groups containing big kits that overlap only because they had one or more small kits in common resulted in merging groups that
experienced project administrators thought should be kept separate and that
probably represented unrelated lineages. Therefore, rather than merging the two groups based on
the assumption that the small kit represents some connecting
link between two parts of one related group, the app
assumes that the small kit just needs additional testing to determine
which of the two big kit groups it belongs in, and shows the small kit as matching both groups.
While each of the groups initially calculated by the app is limited to kits that, under FTDNA's guidelines,
all probably have a common male ancestor within the
genealogical time frame as defined by FTDNA (i.e., 15 generations),
because groups with overlapping big kits are then merged, a final group may be broader and may contain some kits that, under FTDNA's guidelines,
probably do not share a common male ancestor within that genealogical time frame, but instead probably share a common
male ancestor within a somewhat longer time frame. Assuming the kits in the project are limited to those with the
same surname, all the kits in a group should probably have a common male ancestor within the
surname era for that particular
surname. If, however, men who shared a common male ancestor within the near pre-surname period adopted the
same surname (probably most likely in the case of clan-based surnames), the app's groups could include kits whose most recent
common male ancestor dates from the near pre-surname period.
What is genetic distance?
ISOGG defines genetic distance as the
number of mutations between two sets of DNA test results. All
genetic distance numbers are just estimates of the number of mutations because, while the test results show
the differences in STR results, they do not show whether the differences were the
result of one or multiple mutations and do not show any mutations where a second mutation
reversed a first mutation.
How does the app calculate genetic distance between two kits?
The genetic distance between two kits is determined based on the maximum standard
panel size that both kits have results for. For example, if one kit tested
for 111 markers and the other kit tested for 37 markers, their genetic
distance is calculated based on a 37-marker comparison.
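As a rough illustration, choosing the comparison panel could look like the sketch below. The function name is hypothetical, and the sketch assumes each kit has complete results for every standard panel up to the number of markers it tested.

```python
STANDARD_PANELS = (12, 25, 37, 67, 111)


def comparison_panel(markers_a, markers_b):
    """Return the largest standard panel size that both kits have results for."""
    common = min(markers_a, markers_b)
    eligible = [p for p in STANDARD_PANELS if p <= common]
    if not eligible:
        raise ValueError("kits must each have at least 12 markers")
    return max(eligible)


print(comparison_panel(111, 37))  # 111-marker kit compared with a 37-marker kit
```

For the example in the text (a 111-marker kit and a 37-marker kit), the function returns 37.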
Genetic distance can be calculated using a variety of different algorithms. The simplest
algorithm is the
"infinite alleles model"
which treats each difference between two
markers as the result of a single mutation. Using this model, the genetic distance
between two kits is simply the number of markers that differ between the kits.
The other basic algorithm for calculating genetic distance is the
"step-wise mutation model."
In this model, if a marker value differs by two between two kits, the difference is deemed to have been the result of
two separate mutations. Using this model, the genetic distance between two kits is the
sum of the differences between each of the individual markers.
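The difference between the two basic models can be seen in a short sketch. The function names and the four-marker sample values are made up for illustration, and the sketch assumes that each kit is given as a list of single integer marker values in the same marker order.

```python
def infinite_alleles_distance(kit_a, kit_b):
    """Count the markers whose values differ (each difference = one mutation)."""
    return sum(1 for a, b in zip(kit_a, kit_b) if a != b)


def stepwise_distance(kit_a, kit_b):
    """Sum the absolute differences between corresponding marker values."""
    return sum(abs(a - b) for a, b in zip(kit_a, kit_b))


kit_a = [13, 24, 14, 11]
kit_b = [13, 22, 15, 11]
print(infinite_alleles_distance(kit_a, kit_b))  # two markers differ
print(stepwise_distance(kit_a, kit_b))          # differences of 2 and 1
```

Here two markers differ, so the infinite alleles distance is 2, while the step-wise distance is 3 because one of the markers differs by two.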
The app uses FTDNA's genetic distance algorithm, which is a hybrid of the two more
basic models. In most cases, the step-wise model is used; however, for certain markers and in
certain situations, a variant of the infinite alleles model is used. Specifically, genetic distance between
two kits is calculated as
the sum of differences between the values for each separate marker
(see
FTDNA 'Genetic Distance'), except that (i) for null markers and multi-value markers, the amount
(if any) added to the genetic distance calculation is determined in
accordance with the methods adopted by FTDNA in 2016, as described in
this article by Roberta Estes, and (ii) for
DYS389ii, the value for DYS389i is first subtracted from
DYS389ii before calculating the difference, as described in this FTDNA forum discussion and the "DYS389I&II" section of
this article by John Barrett Robb.
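The DYS389 adjustment in item (ii) can be illustrated with the sketch below. It covers only that one adjustment (null and multi-value markers are not handled), and the function name, dictionary keys and sample values are hypothetical.

```python
def dys389_contribution(kit_a, kit_b):
    """Genetic distance contributed by DYS389i and DYS389ii together.

    DYS389i is subtracted from DYS389ii before the DYS389ii difference is
    taken, so a single change in DYS389i is not counted twice.
    """
    diff_i = abs(kit_a["DYS389i"] - kit_b["DYS389i"])
    remainder_a = kit_a["DYS389ii"] - kit_a["DYS389i"]
    remainder_b = kit_b["DYS389ii"] - kit_b["DYS389i"]
    return diff_i + abs(remainder_a - remainder_b)


kit_a = {"DYS389i": 13, "DYS389ii": 29}
kit_b = {"DYS389i": 14, "DYS389ii": 30}
print(dys389_contribution(kit_a, kit_b))
```

In this example both reported values differ by one, so an unadjusted step-wise count would add 2 to the genetic distance. After the subtraction, the DYS389ii remainders are equal (16 and 16), so the two markers together add only 1.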
Based on a comparison of the app's genetic distance results with
those given by FTDNA, the app's calculations appear to be fully consistent with
those used by FTDNA, with one narrow exception. Consistent with Roberta Estes' description of FTDNA's 2016 changes in their
methods of calculating genetic distance referenced above, the app treats every null marker as a single mutation. However,
an FTDNA forum post indicates that, as of August 2017, FTDNA's internal computer program only treated "some"
null markers as a single mutation. FTDNA's response, which described treating only certain null markers as a single
mutation as "unfortunate" and as the situation "at this time," suggests that they agree that all null markers should be treated as a
single mutation and that they hoped to correct their program in the future.
How does the app use genetic distance to determine
whether two kits "match"?
For purposes of creating the groups, the app treats two kits as "matching" and
appropriate for inclusion in the same group, if the genetic distance
between the two kits, as determined based on the maximum standard panel
size that both kits tested for (i.e., 12, 25, 37, 67 or 111), falls
within the range necessary for the two kits to be deemed to "probably" have a common male
ancestor within the genealogical time frame, as set
forth in FTDNA's genetic distance interpretation guidelines. Under those guidelines, two
kits are deemed to "probably" have a common male ancestor within the genealogical time frame if they have a
genetic distance of not more than 1 in the case of a 12-marker
comparison, 2 in the case of a 25-marker comparison, 4 in the case of a
37-marker comparison, 6 in the case of a 67-marker comparison, and 7 in
the case of a 111-marker comparison. Note that, with respect to 67- and 111-marker comparisons, these guidelines
are more restrictive than the criteria FTDNA uses in deciding which kits to show as "matches" on a person's
Y-DNA Matches page. On that page, in addition to showing kits that "probably" have a common male ancestor within the
genealogical time frame, FTDNA also includes 67- and 111-marker kits that "only possibly" have a common male
ancestor within that time frame. See "Y-DNA - Matches Page, are only exact matches shown?" on the Y-DNA - Matches Page.
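The thresholds listed above amount to a simple lookup, sketched below. The names are hypothetical; the numbers are the ones stated in this answer.

```python
# Largest genetic distance treated as a "match" for each comparison size.
MAX_MATCH_DISTANCE = {12: 1, 25: 2, 37: 4, 67: 6, 111: 7}


def is_match(genetic_distance, panel):
    """True if the distance is within the 'probably related' range for the panel."""
    return genetic_distance <= MAX_MATCH_DISTANCE[panel]


print(is_match(4, 37))  # within the 37-marker limit
print(is_match(7, 67))  # over the 67-marker limit of 6
```

Under these thresholds a genetic distance of 4 on a 37-marker comparison is a match, while a genetic distance of 7 on a 67-marker comparison is not.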
What are FTDNA's guidelines for interpreting genetic distance between two kits?
FTDNA's guidelines for interpreting genetic distance between two kits are summarized in its article
Expected Relationships with Y-DNA STR Matches and set forth separately for each kit size in more detail in
the articles listed below:
It should be noted that the interpretation guidelines for 12- and 67-marker
comparisons in the summary article differ slightly from those in the separate articles for those comparisons.
Specifically, the summary article indicates that a genetic distance of 1 on a 12-marker comparison means
that the test takers are "probably related," while the separate article for 12-marker comparisons indicates
that a genetic distance of 1 means they are only "possibly related." In the case of a 67-marker comparison,
the summary article indicates that a genetic distance of 5-6 means that the test takers are "probably related" and that
a genetic distance of 7 means that they are only "possibly related," while the separate
article for 67-marker comparisons indicates that a genetic distance of 5-6 means that the test takers are
"related" and that a genetic distance of 7 means that they are only "probably related." The app uses the summary
article's guidelines in both cases: in the case of a 12-marker comparison, because it is very common for kits that have a
genetic distance of 1 on a 12-marker comparison to be solid matches when compared at a greater number of markers (and because
the app does not allow 12-marker kits to merge groups, in any event) and, in the case of a 67-marker comparison, because
counting a genetic distance of 7 as a
match seems inconsistent with the guidelines for 111-marker comparisons which also treat a genetic distance of 7 as the
largest genetic distance for a match.
Does the app use haplotypes or SNP results? If not, why not?
No. The app determines kit groupings solely on the basis of STR test results. While
appropriate SNP testing is more definitive than STR testing in determining descent from
a common ancestor, since most test takers in most surname projects have
not taken the necessary SNP tests to make that determination, SNP tests
cannot currently be used as the primary basis for forming groups for
most surname projects. To the extent appropriate SNP results exist, they
should, however, be reviewed to confirm (or disconfirm) the groupings
determined by STR results.
The app uses Y-DNA STR results for multiple kits (e.g., the kits in a surname project) as input data. The data must be in the format
used by FTDNA in its surname projects.
Data can be input by submitting either (1) an .html file of the project's FTDNA public or GAP Y-DNA
Classic or Colorized results page or (2) a .csv file of the project's Y-DNA STR results in FTDNA format (i.e., with the same column order
and headings as used in FTDNA results tables). For instructions on how to obtain an appropriate .html file, see
How to get an .html file of a project's results; for instructions on how to obtain an appropriate .csv file, see
How to get a .csv file
containing project data.
Each input option has advantages and disadvantages:
- html input option
  - Advantages
    - Most convenient option if (1) you are not a project administrator, (2) the data for all the project kits are on a single
      webpage and (3) you don't need to edit the data to add or delete kits.
  - Disadvantages
    - Only captures data that is visible on a single webpage.
    - Includes all the data on the captured webpage, which may include data you do not want to include (e.g., data for
      kits that do not claim male lineal descent from a male with the project surname).
- csv input option
  - Advantages
    - If you are a project administrator, can easily capture data for all kits in the project, even if the data spans
      multiple webpages.
    - Can be edited to exclude data for certain kits (e.g., data for kits that do not claim male lineal descent from a male
      with the project surname).
  - Disadvantages
    - Less convenient than the html input option unless you are a project administrator.
You can get an .html file of a project's results page to submit as input in the following manner:
- Make sure you are logged into the FTDNA website under your account as a project administrator or a project member (if you are one)
so that you will be able to see non-public results on the project's results page.
- Go to the project's public or, if you are a project administrator, GAP Y-DNA Classic or Colorized results page.
- Use the "Save Page As" function of your web browser to save the page as a "webpage, html only" file. The "Save Page As"
function can generally be found in the File menu of the browser or in the menu
that appears when you right click on the webpage.
If you are a project administrator of the FTDNA project whose results you want to analyze, the easiest way to obtain a
.csv file with the project's results is by using the
"Download Files" function from "Project Administration" on the GAP menu bar or the "Export to
Spreadsheet" button on the GAP results page.
Alternatively, or if you are a member of the project whose results you want to analyze but not an administrator of the project,
you can create a .csv file that contains all of the project's results in the following manner:
- Either (1) go to a Y-DNA results page (either Classic or Colorized) on your project's regular site (not a GAP
results page), use your web browser's copy function to copy the entire page,
and paste it into an Excel spreadsheet, or (2) get an .html file of the project's results (see instructions in the section above)
and open it in Excel.
- Delete all rows in the spreadsheet that are above the table's column headers row or below the last kit row in the table.
- If you need to prepare a .csv file for a very large project whose kits do
not all fit on one page of FTDNA's Y-DNA Results chart,
use one of the methods described above to copy all the results into Excel and combine them in a single spreadsheet,
omitting all column heading rows
other than the one at the top of the spreadsheet.
- Save the spreadsheet as a csv file.
Regardless of whether you are a project administrator and are able to download the project results as a .csv file or create your own .csv file
in Excel, you may want to open the file in Excel and remove the data for any kits that do not claim male lineal descent
from a man with the project surname, since including the data for those kits may alter the grouping results.
Note that although Excel may automatically change certain multi-value markers
into dates, the app will automatically convert them back to the proper values; you do not need to reconvert
them manually.
Does using the app compromise the confidentiality of private information?
No. The app does not store, save or disclose any submitted data. The data is only used in the internal processing of
the app, and is only retained in the app during the user's current session.
Why does the app assign some 12- and 25-marker kits to multiple groups?
If a small kit matches with big kits in separate groups, rather than merging the two
groups, the small kit is assigned to both groups. Additional testing
would probably show which of the groups the small kit properly belongs
to. Small kits are also assigned to groups which consist only of other
small kits that they match with.
In the Group Assignments table, why do some higher group numbers appear above lower group numbers?
The app first goes down the rows and assigns group numbers to groups formed on
the basis of matches between 37-, 67- and 111-marker kits ("big
kits"). After these "big kit" groups are formed, the app goes back
to the top and assigns group numbers to groups consisting solely of
matching 12- and 25-marker kits ("small kits"). Therefore, if a group with a higher number appears above
lower group numbers, that means that the group with the higher number consists only of small kits.
Why are there a lot of kits that are not assigned to any group by the app?
If the Group column for a kit is blank that means that its genetic distance from
every other kit in the project is too large for it to be considered to
probably have a common male ancestor within the genealogical time frame
with any other kit in the project, as determined by
FTDNA's genetic distance interpretation guidelines. You can view the genetic distances for a
particular kit from every other kit in the project by clicking on the
kit's number. It is common for a substantial portion of kits in a surname project to
not match with any other kit in the project. The most likely reason
for this is that men who have taken Y-DNA tests are still a small
percentage of the total population and, therefore, there are many
paternal lines for which no representative, or only one representative, has taken
a Y-DNA test. As the number of men who have taken Y-DNA tests grows, the percentage
of unmatched kits should go down.
What is relational distance?
"Relational distance" is the term used by the app for the genetic closeness of two kits (or a kit and the
modal values of a group) as determined by
FTDNA's genetic distance interpretation guidelines. A relational distance
of 0 means the two kits (or a kit and the modal values of the group, as the case may be) are "very tightly
related" under those guidelines, 1 means they are "tightly related,"
2 means "related," 3 means "probably related," and 4 means "only possibly related." If there is
no relational distance number, the kits are "not related"
under FTDNA's genetic guidelines.
Per FTDNA's guidelines, being "related" means having a common male ancestor within the
genealogical time frame.
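The meaning of each relational distance number can be expressed as a small lookup table. This sketch only restates the labels given above; the names are hypothetical, and it uses None to stand for a missing relational distance number.

```python
RELATIONAL_LABELS = {
    0: "very tightly related",
    1: "tightly related",
    2: "related",
    3: "probably related",
    4: "only possibly related",
}


def relational_label(relational_distance):
    """Return the guideline label for a relational distance (None = no number)."""
    if relational_distance is None:
        return "not related"
    return RELATIONAL_LABELS[relational_distance]


print(relational_label(3))
print(relational_label(None))
```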
What are the limitations of the Reorganized Table?
The Reorganized Table organizes the kits in a project based on the groups the app has assigned the kits to. The organization
suggested by the Reorganized Table should, however, always be reviewed and, if necessary, modified.
Set forth below are some reasons why the organization proposed by the Reorganized Table should be modified:
- Ideally, before inputting project data into the app, all kits that do not claim male lineal descent from a male ancestor with the project surname
should be removed. If those kits have not been removed prior to inputting the project data into the app,
those kits may be inappropriately included in matching surname groups. They should instead be in a separate group for all
non-surname kits.
- SNP results may show that a kit or group of kits should be grouped differently. SNP results are more definitive
than STR results and should trump the app's suggested
groupings based on STR results.
- Genealogical information may indicate that a kit that is assigned to multiple groups belongs in one of the groups
and not in the other or
indicate that a kit that has a genetic distance that is slightly too great to have been included in a group by the app has a common
male ancestor with kits in the group and should be placed in the
group.
- A close analysis of the STR results may show that two distinct groups that are probably not related in the
genealogical time frame or surname era have been merged. In this case, the two distinct groups
should be separated. This type of merger between unrelated groups would most likely occur because of
convergent mutations which
caused one or more kits to have STR results that are sufficiently close to kits in the other group to have caused the kit(s)
to be placed
in both groups.
When tables are downloaded from the app, they are downloaded into a csv file. A csv file does not contain any
information that tells Excel what format to apply to the data in the file. Therefore, when a csv file is opened in Excel
or imported into an Excel spreadsheet, Excel makes a guess as to what type of format to apply to the data. In the case of
hyphenated numbers that could be dates, Excel by default applies date formatting. To prevent Excel from turning the data
into dates, (1) do not open the csv file in Excel but instead import it into an Excel spreadsheet (how to do that varies
by what version of Excel you are using; sometimes there is an import button on the toolbar and sometimes the import function
is in the Data menu), (2) when asked how the data is delimited, select commas, and (3) when asked if you want to apply
formatting to specific columns, separately highlight the columns for DYS385 and DYS459 and select Text for each. If this process
is followed, the imported data should not contain any date-formatted data.
If you open or import an FTDNA downloaded csv file in Excel, the same problem of hyphenated values being converted into
dates will occur. The steps described above should prevent that from happening. However, the app contains code that will
automatically reconvert dates to the correct hyphenated values, so csv files containing date formats should work fine as input
files.
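For readers curious about what such a reconversion involves, here is a minimal sketch. It is not the app's actual code. It assumes that a date-converted value appears in the csv file as day-month text such as "14-Nov" (for an original value of 11-14); the real text depends on Excel's version and regional settings, so the pattern and the function name are assumptions.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Assumed form of a date-converted cell: day, hyphen, three-letter month.
DATE_TEXT = re.compile(r"^(\d{1,2})-([A-Za-z]{3})$")


def restore_marker_value(cell):
    """Turn text such as '14-Nov' back into '11-14'; leave other cells unchanged."""
    found = DATE_TEXT.match(cell.strip())
    if not found:
        return cell
    month = MONTHS.get(found.group(2).lower())
    if month is None:
        return cell
    day = int(found.group(1))
    return f"{month}-{day}"


print(restore_marker_value("14-Nov"))
print(restore_marker_value("11-14"))
```

Under that assumption, "14-Nov" is restored to "11-14", and a value that was never converted, such as "11-14", passes through unchanged.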
How should a surname group administrator use the results of the app to organize kits into groups?
If a surname project's kits have already been put into groups based on relational distance to other kits, the
app will largely just confirm the existing
groupings, but the Group Assignments page of the app may suggest a number of kits that should
be reviewed to see if they have been grouped properly. Clicking on the kit number and looking at the
Relational Distances page for those kits may help determine whether they are properly placed.
If a surname project's kits are currently only grouped by haplotype or have otherwise
not been fully organized into groups based on relational distance, the app's Reorganized Table page will provide
a good starting point for an appropriate
organization of the project's kits.
The app's results should be reviewed with available SNP test results and genealogical information in order to make a
final determination as to which kits probably share a common male ancestor within the genealogical time frame or the
surname era.
It is suggested that kits in a surname project be organized in the following manner:
- First, matched groups, each of which should consist only of kits for men that may have male
lineal descent from a male with the project surname and that (based on STR results, as supplemented by SNP results
and genealogy) probably share a common male ancestor within the genealogical time frame or surname era. For ease of reference and to
make sure they appear in order, each matched group's name should begin "Matched Group 01," "Group 01," "Lineage I" or something
similar. Either Arabic or Roman numerals may be used for the numbers. If Arabic numerals are used and there are more than 9
matched groups in the project, it is important to precede the numbers less than 10 with a 0 so that they will
be listed first. It is common and useful to follow the number with a few words that describe what the kits in the group have in
common. Since there is no other place on the results page to explain how groups were formed, perhaps the best description is
"- kits in the group probably share a common male ancestor within the genealogical time frame or surname era." Other descriptions, such
as haplotype or terminal SNP, common ancestor, or geographic location, are less desirable because (1) they may not be correct
for all current or future members of the group, (2) they may describe some kits that are not included in the group, (3)
they do not describe the basis on which the group was formed, and (4) they may be duplicative of information that is listed for
individual kits (e.g., ancestry, geography and haplotype). If a project has a large number of kits that probably have a
common male ancestor in the genealogical time frame or surname era, the project administrator may wish to break the kits up and put
them into separate groups. The app's Subgroups Table can be used to help identify meaningful subgroups. If the project administrator
decides to break related kits into separate groups, it is important that the names for these subgroups indicate both
(1) that all the subgroups probably share a common male ancestor in the genealogical time frame or surname era and (2) the basis on which each
separate group was formed. An example of a heading that does this is "Group 01A - Descendants of John Brown; probably share a common
male ancestor with other 01 groups."
- Second, one or more groups of 12- and 25-marker kits that need additional testing before they can be properly assigned to a matched group.
Projects can choose between two reasonable approaches to 12- and 25-marker kits. They can either deem all such kits as needing
additional testing or only those that match with "big kits" in more than one matched group (i.e., those in the "Multi-Group Small
Kits" group in the app's Reorganized Table). Small kits deemed to need additional
testing can either all be put in a single group labelled so that it comes after all the matched groups (such as "Kits
needing additional testing - kits in this group need to upgrade to at least 37 STRs in order to be properly placed in
a matched group") or, if a project elects to deem all small kits as needing additional testing, small kits that
match with big kits in a particular group could be put in a separate additional testing group following the applicable
matching group (such as "Lineage I possible - kits in this group need to upgrade to at least 37 STRs in order to be properly
placed in a matched group").
- Third, a group of unmatched surname kits, consisting of all kits for men that may have male
lineal descent from a male with the project surname but that (based on STR results, as supplemented by SNP results
and genealogy) probably do not share a common male ancestor with any other kit in the project (other than non-surname kits)
within the genealogical time frame. Note that unmatched kits would include any kits that only
match with small kits that the project administrator decides should be shown as needing additional testing (e.g. probably
all kits in the "Conditionally Matched Kits" group of the app's Reorganized Table). Some projects set up multiple groups
of unmatched kits based on haplotype; however, it is
unclear what benefit there is to doing so. Any group for unmatched kits should be named so that it comes after all of the
groups for matched kits and all of the groups for kits needing additional testing. Naming the group starting with "Unmatched" is
usually sufficient for that purpose. An example of an appropriate full group name would be "Unmatched Ashleys - kits
in this group probably do not share a common ancestor with any other Ashley kit in the genealogical time
frame."
- Lastly, a group for non-surname kits, consisting of all kits for men that do not have male
lineal descent from a male with the project surname. It is common for people who do not have
male lineal descent from a male with the project surname to join a project because they have an ancestor with that surname and
are interested in his ancestry. Unfortunately, their Y-DNA results become part of the project's results even though they are not
relevant to the project. All kits of this type should be put in a single group for non-surname kits which is named so that it
appears at the bottom of the results table. In order to make sure that the group appears last, it may be necessary to start
the group name with a meaningless letter such as x or z. An example of an appropriate full group name would be "xNon-Ashleys -
kits in this group are not believed to have male lineal descent from an Ashley."
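The advice above about preceding group numbers below 10 with a 0 follows from the way names sort as text. The sketch below shows the effect. It assumes that the results page orders groups by plain text comparison of their names, and the `group_name` helper is hypothetical.

```python
def group_name(number, total_groups, prefix="Group"):
    """Build a group name whose number is zero-padded to a common width."""
    width = len(str(total_groups))
    return f"{prefix} {number:0{width}d}"


unpadded = [f"Group {n}" for n in (1, 2, 10)]
padded = [group_name(n, 10) for n in (1, 2, 10)]

print(sorted(unpadded))  # "Group 10" sorts ahead of "Group 2"
print(sorted(padded))    # zero padding keeps the numeric order
```

Sorted as text, the unpadded names come out as Group 1, Group 10, Group 2, while the padded names come out as Group 01, Group 02, Group 10.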
How can an administrator change the grouping of project kits?
To form subgroups or edit subgroups in a project, a project administrator should go to
their GAP home page and, under the "Project Administration" menu, select "Member Subgrouping."
Under the Manage column, clicking the edit icon lets you change the group name,
description, and color, while clicking the arrow icon lets you add kits to, or remove kits from, the group.
In order to assign a kit to a new group, it must
first be removed from its existing group and then, after going into the arrow/select
page for the new group, added to the new group.