# Materials & Methods

**Table of contents:**

- Automatic Predictive Model Constructor APMC
- Hardy-Weinberg Equilibrium
- Chi-square
- PIC & H
- Genetic distance
- Sequences analysis tools
- References

**APMC**

**Hardy–Weinberg equilibrium**

The input values (the numbers of homozygous, heterozygous and rare homozygous individuals), given as integers, are used to calculate the allele frequencies "p" and "q". From these frequencies, the Hardy-Weinberg principle gives the expected numbers of homozygous, heterozygous and rare homozygous individuals. The observed and expected values are then compared with the chi-square goodness of fit test.

If the observed number of individuals in any cell is smaller than 5, the application additionally calculates the chi-square statistic and p-value with Yates's correction for continuity; in that case the status of the result is determined from the corrected values.

If the differences are statistically significant at the chosen level of significance, the application calculates the fixation index, Fis. Positive values may indicate a reduction of heterozygosity, caused for example by the Wahlund effect, inbreeding or genetic drift. Negative values indicate an excess of heterozygotes, caused for example by breeding selection or a bottleneck effect.

**FIS = 1 - (observed heterozygosity/expected heterozygosity)**
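The steps described in this section can be sketched in a few lines of Python. This is only an illustration, not the application's code: the function name `hardy_weinberg_test` and its return keys are hypothetical, the p-value assumes 1 degree of freedom (the usual choice for a biallelic locus), and both alleles are assumed to be present so that no expected value is zero.

```python
import math

def hardy_weinberg_test(n_hom_common, n_het, n_hom_rare):
    """Compare observed genotype counts with Hardy-Weinberg expectations."""
    n = n_hom_common + n_het + n_hom_rare
    # Allele frequencies counted from the genotypes (two alleles per individual).
    p = (2 * n_hom_common + n_het) / (2 * n)
    q = 1.0 - p
    observed = [n_hom_common, n_het, n_hom_rare]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    # Chi-square goodness of fit statistic (no continuity correction here).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Assumption: 1 degree of freedom, for which p-value = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Fis = 1 - (observed heterozygosity / expected heterozygosity).
    f_is = 1.0 - n_het / expected[1]
    return {"p": p, "q": q, "expected": expected,
            "chi2": chi2, "p_value": p_value, "f_is": f_is}

result = hardy_weinberg_test(50, 30, 20)
print(result["expected"], round(result["chi2"], 3), round(result["f_is"], 3))
```

In this example the positive Fis reflects fewer heterozygotes than expected (30 observed against 45.5 expected).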

**Chi-square tests**

The chi-square family of tests is a statistical tool for examining the relationship between variables. The chi-square tests rely on the following assumptions:

- the test group must be sufficiently large.
- the data should be drawn at random from the population.
- the values in cells are adequate only when: (1) no more than 20% of the expected values are smaller than five; (2) no cell contains zero.

**Independence (Associations) chi-square**

Research hypothesis:

H0: There is no association between variables.

Ha : There is an association between variables.

The expected values and the chi-square value for the independence test are calculated with the following formulas:

$$E_{ij} = \frac{T_i \, T_j}{N}$$

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

- Eij = expected value for the cell in the ith row and jth column
- Ti = sum of values in the ith row
- Tj = sum of values in the jth column
- N = sum of values in all rows or columns
- Oi – observed value
- Ei – expected value
- n – number of cells (the sum runs over all cells)

If the values in the cells do not fulfil the third condition (see the chi-square assumptions above), we recommend using the chi-square test with Yates's correction for continuity, calculated with the following formula:

$$\chi^2_{Yates} = \sum_{i=1}^{n} \frac{(|O_i - E_i| - 0.5)^2}{E_i}$$

Note that this correction is appropriate for 2x2 contingency tables; for larger tables, more accurate results may be obtained by combining columns or rows. Degrees of freedom are calculated as dof = (r − 1) * (c − 1), where:

r – number of rows (variables)

c – number of columns (groups)
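As an illustration of the quantities defined above, the following sketch computes the expected values, the chi-square statistic (optionally with Yates's correction in its textbook form) and the degrees of freedom for a contingency table. The function name `chi_square_independence` and the example table are hypothetical, and the sketch assumes that no expected value is zero.

```python
def chi_square_independence(table, yates=False):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n_total = sum(row_totals)
    # Expected value of each cell: row total * column total / grand total.
    expected = [[ti * tj / n_total for tj in col_totals] for ti in row_totals]
    chi2 = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Textbook continuity correction, intended for 2x2 tables.
                diff -= 0.5
            chi2 += diff ** 2 / e
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

chi2, dof, expected = chi_square_independence([[10, 20], [30, 40]])
chi2_yates, _, _ = chi_square_independence([[10, 20], [30, 40]], yates=True)
print(round(chi2, 4), round(chi2_yates, 4), dof)
```

The p-value is left out on purpose; it can be obtained from the chi-square distribution with `dof` degrees of freedom.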

**Coefficients of contingency** are measures of association; they can be used to estimate the strength of a relationship between variables.

The coefficient of contingency is calculated as follows:

- for 2x2 (1 dof) tables by means of the Phi coefficient:

$$\phi = \sqrt{\frac{\chi^2}{N}}$$

- for tables with dof > 1 using Cramér's V coefficient:

$$V = \sqrt{\frac{\chi^2}{n \, (m - 1)}}$$

N, n - total number of observations (the sum of values in all rows or columns)

m - the smaller of the number of columns or rows

The value of the coefficient of contingency ranges from 0.0 to 1.0.
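A minimal sketch of both coefficients, assuming the chi-square value and the total number of observations are already known. The function names `phi_coefficient` and `cramers_v` are hypothetical and the input numbers are made up for illustration.

```python
import math

def phi_coefficient(chi2, n_total):
    """Phi coefficient for a 2x2 table (1 degree of freedom)."""
    return math.sqrt(chi2 / n_total)

def cramers_v(chi2, n_total, n_rows, n_cols):
    """Cramér's V for tables with more than 1 degree of freedom."""
    m = min(n_rows, n_cols)  # smaller of the number of rows or columns
    return math.sqrt(chi2 / (n_total * (m - 1)))

print(round(phi_coefficient(0.7936508, 100), 4))
print(round(cramers_v(18.0, 100, 3, 3), 4))
```

For a 2x2 table m − 1 = 1, so Cramér's V reduces to the Phi coefficient.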

**Chi-square goodness of fit test:**

Allows examining the agreement between the observed and expected values.

Research hypothesis:

H0: The observed values are equal to theoretical values (expected).

Ha : The observed values are not equal to theoretical values (expected).

Degrees of freedom are calculated as:

dof = k − 1

k - number of columns (groups)

The totals must be the same for the observed and expected frequencies; expected frequencies less than 1 are not allowed.
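The goodness of fit test and its input checks can be sketched as follows. The function name `chi_square_goodness_of_fit`, the tolerance used when comparing totals and the error messages are assumptions made for this illustration only.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Return the chi-square statistic and degrees of freedom (k - 1)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Totals must agree (small tolerance for floating point input).
    if abs(sum(observed) - sum(expected)) > 1e-8:
        raise ValueError("observed and expected totals differ")
    if any(e < 1 for e in expected):
        raise ValueError("expected frequencies below 1 are not allowed")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1
    return chi2, dof

chi2, dof = chi_square_goodness_of_fit([30, 20, 50], [25, 25, 50])
print(chi2, dof)
```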

Part of the calculations in this module (as well as in the Hardy-Weinberg module) is performed by a modified version of the "scipy.stats.chi2_contingency" function from the SciPy Python library.


**PIC and H**

Heterozygosity (H) is a parameter indicating the average frequency of heterozygous individuals. Polymorphic information content (PIC), like heterozygosity, is a measure of locus polymorphism calculated for markers used in linkage analysis.

PIC and H are calculated for bi- or multi-allelic loci with the following formulas:

$$H = 1 - \sum_{i=1}^{l} P_i^2$$

$$PIC = 1 - \sum_{i=1}^{l} P_i^2 - \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} 2 P_i^2 P_j^2$$

Pi, Pj are the frequencies of the ith and jth allele (float format),

l – number of alleles
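A short sketch of H and PIC for a codominant marker, assuming the commonly used definitions H = 1 − ΣPi² and PIC = 1 − ΣPi² − ΣΣ 2Pi²Pj² (summed over all allele pairs i < j). The function names are hypothetical and the allele frequencies are expected to sum to 1.

```python
from itertools import combinations

def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """Polymorphic information content for a codominant marker."""
    homozygosity = sum(p * p for p in freqs)
    # Correction term summed over every pair of different alleles (i < j).
    pair_term = sum(2 * pi ** 2 * pj ** 2 for pi, pj in combinations(freqs, 2))
    return 1.0 - homozygosity - pair_term

print(heterozygosity([0.5, 0.5]), pic([0.5, 0.5]))
print(round(heterozygosity([0.2, 0.3, 0.5]), 4), round(pic([0.2, 0.3, 0.5]), 4))
```

For two equally frequent alleles this gives H = 0.5 and PIC = 0.375, which shows that PIC never exceeds H.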

PIC for a dominant marker (inputs are dichotomous variables, 1/0) can be calculated using the formula:

$$PIC = 1 - [f_i^2 + (1 - f_i)^2]$$

where fi is the frequency of the amplified band (1) and (1 − fi) is the frequency of band absence (0) in the study group.

For a codominant marker the PIC value ranges from 0.0 to 1.0.

For a dominant marker the PIC value ranges from 0.0 to 0.5.

Important note! For a dominant marker it is possible to estimate the approximate number of heterozygous individuals based on the Hardy-Weinberg law:

If q² is the frequency of the homozygous genotype (the band absence frequency), then (q²)^(1/2) is the q allele frequency. For a biallelic locus p + q = 1, so p = 1 – q. From these allele frequencies the expected numbers of genotypes can be estimated with the Hardy-Weinberg law. These calculations allow a more accurate PIC value to be obtained from results produced with dominant markers (assuming that the study group is in genetic equilibrium).
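The estimation described in this note can be sketched as below. The function name and the dictionary keys are hypothetical. The last value, `pic_from_alleles`, is only one possible reading of the "more accurate PIC": it applies the codominant PIC definition to the estimated p and q, which is an assumption of this sketch and not a documented behaviour of the application.

```python
import math

def dominant_marker_estimates(n_band_present, n_band_absent):
    """Estimate allele frequencies and genotype counts from band presence/absence."""
    n = n_band_present + n_band_absent
    f = n_band_present / n
    # PIC for a dominant marker, computed from band frequencies.
    pic_band = 1.0 - (f ** 2 + (1.0 - f) ** 2)
    # Band absence frequency is taken as q^2 (Hardy-Weinberg equilibrium assumed).
    q = math.sqrt(n_band_absent / n)
    p = 1.0 - q
    expected = {"hom_dominant": p * p * n, "het": 2 * p * q * n, "hom_recessive": q * q * n}
    # Assumption: codominant PIC definition applied to the estimated p and q.
    pic_from_alleles = 1.0 - (p * p + q * q) - 2 * p * p * q * q
    return {"p": p, "q": q, "expected": expected,
            "pic_band": pic_band, "pic_from_alleles": pic_from_alleles}

print(dominant_marker_estimates(84, 16))
```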

**Genetic distance**

Genetic distance is a measure of the differences between pairs of populations. Several methods estimate these differences from the allele frequencies at the examined loci, such as:

- Standard genetic distance (Nei, 1972).
- Geometric genetic distance (Nei, 1978).
- Genetic distance based on STR loci polymorphism (Nei and Takazaki, 1983).

1) Standard genetic distance is used to determine the normalized differences between populations when the individual codon changes are independent and follow a Poisson distribution. It is calculated with the formula:

$$D = -\ln\left(\frac{J_{XY}}{\sqrt{J_X J_Y}}\right)$$

2) Geometric genetic distance allows determining the differences between populations when loci differ in the rate of codon changes (in this case the standard genetic distance may be underestimated). It is calculated with the formula:

$$D' = -\ln\left(\frac{J'_{XY}}{\sqrt{J'_X J'_Y}}\right)$$

Caution! When some of the allele frequencies are equal to 0, the geometric distance may be impossible to calculate, because J'X, J'Y and J'XY are geometric means of the jX, jY and jXY values. If any of the jX, jY or jXY values equals 0, the corresponding J'X, J'Y or J'XY also equals 0, and the application would have to divide by 0 or calculate -ln(0). In this case the application returns the error: "The quantity or quality of the data is inappropriate!".

3) Genetic distance can also be estimated from short tandem repeat (STR) polymorphism (Nei and Takazaki, 1983). This model of genetic distance estimation was considered more effective than other methods. It is calculated with the formula:

$$D_A = 1 - \frac{1}{r} \sum_{j=1}^{r} \sum_{i} \sqrt{X_{ij} Y_{ij}}$$

- D - standard genetic distance
- D' - geometric genetic distance
- DA - genetic distance based on STR polymorphism
- jX - the sum of squared allele frequencies at locus "j" in population X.
- jY - the sum of squared allele frequencies at locus "j" in population Y.
- jXY - the sum of products of allele frequencies between populations X and Y at locus "j".
- JX - the arithmetic or geometric (J') mean (depending on the type of distance) of the jX values for the population pair.
- JY - the arithmetic or geometric (J') mean (depending on the type of distance) of the jY values for the population pair.
- JXY - the arithmetic or geometric (J') mean (depending on the type of distance) of the jXY values for the population pair.
- Xij & Yij - frequencies of the "i" STR allele at locus "j" in the population pair X, Y
- r - the number of analyzed STR loci
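Using the symbols defined above, the three distances can be sketched as follows. The sketch assumes the usual forms D = −ln(JXY / √(JX·JY)) with arithmetic means, the same expression with geometric means for D', and DA = 1 − (1/r)·ΣΣ√(Xij·Yij). The data layout (one list of allele frequencies per locus, alleles in the same order in both populations) and all function names are hypothetical.

```python
import math

def _locus_sums(pop_x, pop_y):
    """Per-locus jX, jY and jXY values for two populations."""
    jx = [sum(x * x for x in lx) for lx in pop_x]
    jy = [sum(y * y for y in ly) for ly in pop_y]
    jxy = [sum(x * y for x, y in zip(lx, ly)) for lx, ly in zip(pop_x, pop_y)]
    return jx, jy, jxy

def standard_distance(pop_x, pop_y):
    """Standard distance D, using arithmetic means over loci."""
    jx, jy, jxy = (sum(v) / len(v) for v in _locus_sums(pop_x, pop_y))
    return -math.log(jxy / math.sqrt(jx * jy))

def geometric_distance(pop_x, pop_y):
    """Geometric distance D', using geometric means over loci."""
    sums = _locus_sums(pop_x, pop_y)
    if any(value == 0 for values in sums for value in values):
        # Mirrors the caution above: a zero j value makes -ln undefined.
        raise ValueError("The quantity or quality of the data is inappropriate!")
    jx, jy, jxy = (math.prod(v) ** (1.0 / len(v)) for v in sums)
    return -math.log(jxy / math.sqrt(jx * jy))

def da_distance(pop_x, pop_y):
    """DA distance: 1 minus the mean over loci of sum(sqrt(Xij * Yij))."""
    r = len(pop_x)
    total = sum(math.sqrt(x * y)
                for lx, ly in zip(pop_x, pop_y) for x, y in zip(lx, ly))
    return 1.0 - total / r

pop_x = [[1.0, 0.0]]  # one locus, two alleles
pop_y = [[0.5, 0.5]]
print(round(standard_distance(pop_x, pop_y), 4), round(da_distance(pop_x, pop_y), 4))
```

`math.prod` requires Python 3.8 or newer.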

The results of genetic distance estimation are presented in the form of a distance matrix and a dendrogram. The application offers several methods for constructing dendrograms: unweighted pair group method with arithmetic mean (UPGMA), weighted pair group method with arithmetic mean (WPGMA), centroid linkage clustering (UPGMC), weighted pair-group centroid clustering (WPGMC), single-linkage clustering and complete-linkage clustering. The dendrograms are constructed with the "scipy.cluster.hierarchy.linkage" function from the SciPy Python library; its documentation and the theoretical background are available in the SciPy documentation.
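A minimal sketch of how a distance matrix can be turned into a dendrogram with `scipy.cluster.hierarchy.linkage`. The population labels and distances are made up, and the mapping from the method names used above to SciPy's `method` argument is our reading of the SciPy documentation, not necessarily the exact configuration of the application.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D"]  # hypothetical populations
distances = np.array([
    [0.00, 0.10, 0.40, 0.50],
    [0.10, 0.00, 0.35, 0.45],
    [0.40, 0.35, 0.00, 0.20],
    [0.50, 0.45, 0.20, 0.00],
])

# SciPy names for the clustering methods listed above.
METHODS = {
    "UPGMA": "average", "WPGMA": "weighted",
    "UPGMC": "centroid", "WPGMC": "median",
    "single": "single", "complete": "complete",
}

# linkage expects the condensed (upper triangle) form of the matrix.
condensed = squareform(distances)
tree = linkage(condensed, method=METHODS["UPGMA"])

# no_plot=True returns the layout without requiring matplotlib.
layout = dendrogram(tree, labels=labels, no_plot=True)
print(layout["ivl"])  # leaf labels in dendrogram order
```

According to the SciPy documentation, the centroid and median methods are correctly defined only for Euclidean distances, so results for genetic distances should be interpreted with care.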

**Sequences analysis tools**

This module contains a few tools for working with nucleotide or amino acid sequences. It is based on the **Biopython** library and **MUSCLE**.

**1)** **Dot plot** allows comparing two sequences. Input values: names and raw sequences (only the characters A, C, T/U, G and N are accepted) or **GenBank** IDs, e.g. KY783590.3

Output values are: a classic alignment, a dot plot, and the parameters:

- Score: one point for every match; the maximum score equals the length of the shorter sequence.

- Coverage: percentage of sequence coverage.

- Average identity: number of matches divided by the average sequence length x 100.

- Fragmental identity: number of matches in the covered fragment of the sequences x 100.

**Important note:** for short sequences (up to 20 bp) the dot plot will not be generated.
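The idea behind a dot plot can be illustrated with a small match matrix: a cell is marked when the fragments starting at the two positions are identical. This is a generic sketch, not the application's implementation; the function name `dot_matrix` and the `window` parameter are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Boolean matrix: True where the windows starting at (i, j) are identical."""
    a, b = seq_a.upper(), seq_b.upper()
    return [
        [a[i:i + window] == b[j:j + window] for j in range(len(b) - window + 1)]
        for i in range(len(a) - window + 1)
    ]

# Text rendering: rows follow the first sequence, columns the second.
for row in dot_matrix("ACGTAC", "ACGGAC"):
    print("".join("*" if match else "." for match in row))
```

Similar regions show up as diagonal runs of marked cells; a larger window suppresses isolated chance matches.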

**2) Consensus Sequence** allows determining a consensus sequence from multiple sequences.

Acceptable input formats:

a) sequences in **FASTA** format, e.g.

>XYZ

gtcgtg

>ZYX

actga

b) **GenBank IDs**, each ID on a separate line, e.g.

Z78532.1

Z78533.1

Z78529.1

...

c) **FASTA** file (example)

When a FASTA file is uploaded or sequences are downloaded from GenBank, the application returns a file containing the consensus sequence in FASTA format.

The result is a consensus sequence. If more than the threshold value (frequency) of the input sequences have the same element at position **n**, that element is taken as the consensus element at position **n**. If no nucleotide appears more often than the threshold value, the consensus sequence has an "N" character at that position. If the sequences differ in length and fewer than the threshold value of them have any nucleotide at position **n**, the consensus sequence has a "-" character at that position.
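The threshold rule can be sketched as below. This is a simplified reading of the description: the sequences are assumed to be already aligned from their first position, frequencies are computed relative to the total number of input sequences, and the function name `consensus` and the default threshold of 0.7 are hypothetical.

```python
from collections import Counter

def consensus(sequences, threshold=0.7):
    """Build a consensus string from sequences aligned at their first position."""
    n_seqs = len(sequences)
    length = max(len(s) for s in sequences)
    result = []
    for i in range(length):
        # Elements present at position i (shorter sequences contribute nothing).
        column = [s[i].upper() for s in sequences if i < len(s)]
        if len(column) / n_seqs < threshold:
            result.append("-")  # too few sequences reach this position
            continue
        element, count = Counter(column).most_common(1)[0]
        # The most frequent element must exceed the threshold, otherwise "N".
        result.append(element if count / n_seqs > threshold else "N")
    return "".join(result)

print(consensus(["ACGT", "ACGA", "ACCT", "ACG"]))
print(consensus(["ACGTA", "ACG", "ACG"], threshold=0.5))
```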

**3) Sequences Tools**

Allows transforming (reverse, complement, reverse-complement, transcription, translation) one or more sequences in **FASTA** format.

**We recommend MEGA for handling FASTA files.**

**References:**

- Scikit-learn: Machine Learning in Python, Pedregosa *et al.*, JMLR 12, pp. 2825-2830, 2011.
- Sándor Nagy, Péter Poczai, István Cernák, Ahmad Mousapour Gorji, Géza Hegedűs, János Taller (2012): PICcalc: An Online Program to Calculate Polymorphic Information Content for Molecular Genetic Studies, Biochemical Genetics 50: 670–672.
- Yu.V. Chesnokov, A.M. Artemyeva (2015): Bioinformatics and math statistics, Agricultural Biology, 50 (5): 571-578.
- Pilar Soengas, Pablo Velasco, Guillermo Padilla, Amando Ordas, Maria Elena Cartea (2006). Genetic relationships among Brassica napus crops based on SSR markers, HORT SCIENCE 41, 1195–1199.
- Sorana D. Bolboacă , Lorentz Jäntschi, Adriana F. Sestraş , Radu E. Sestraş and Doru C. Pamfil, (2011): Pearson-Fisher Chi-Square Statistic Revisited, Information, 2, 528-545.
- Yates, F. (1934) Contingency tables involving small numbers and the χ2 test. Journal of the Royal Statistical Society, Suppl.1, 217–235.
- Zeynel Dalkılıç , H. Osman Mestav, Gonca Günver-Dalkılıç and Hilmi Kocataş (2011): Genetic diversity of male fig (Ficus carica caprificus L.) genotypes with random amplified polymorphic DNA (RAPD) markers, African Journal of Biotechnology, 10 (4): 519-526.
- Antony Ugoni, Bruce F. Walker (1995): The Chi square test: an introduction, Chiropractors and Osteopaths Musculo-Skeletal Interest Group. 4. 61-4.
- Quinnipiac University. Meaning of Pearson’s. Retrieved Jun 20, 2016 from: http://faculty.quinnipiac.edu/libarts/polsci/Statistics.html
- Yule, G. U. (1912). J. R. Statist. Soc., 75, 76–642.
- Chi Square Based Measures, from http://uregina.ca/~gingrich/ch11a.pdf
- Robin S. Waples (2015): Testing for Hardy–Weinberg Proportions: Have We Lost the Plot ?, Journal of Heredity, 106 (1): 1–19.
- M. Nei, R. K. Chesser (1983): Estimation of fixation indices and gene diversities, Annals of Human Genetics, 47, 253-259.
- R. K. Chesser (1991): Influence of gene flow and breeding tactics on gene diversity within populations. Genetics, 129 (2): 573-583.
- Bolesław Żuk, Heliodor Wierzbicki, Magdalena Zatoń-Dobrowolska, Zofia Kulisiewicz. Genetyka Populacji i Metody Hodowlane, Powszechne Wydawnictwo Rolnicze i Leśne, Warszawa, 2011.
- Roldán-Ruiz, I., van Eeuwijk, F., Gilliland, T. et al. (2001): A comparative study of molecular and morphological methods of describing relationships between perennial ryegrass (Lolium perenne L.) varieties, Theoretical and Applied Genetics, 103(8): 1138–1150.
- Nei M.,Tajima F. and Tateno Y. (1983): Accuracy of estimated phylogenetic trees from molecular data. II. Gene frequency data. Journal of Molecular Evolution, 19, 153-170.
- Carl E. Hildebrand, David C. Torney, and Robert P. Wagner (1992): Informativeness of Polymorphic DNA Markers, Los Alamos Science, 20, 100-102.
- Li Jin and Ranajit Chakraborty (1994): Estimation of Genetic Distance and Coefficient of Gene Diversity from SingleProbe Multilocus DNA Fingerprinting Data, Molecular Biology and Evolution, 11(1): 120-127.
- Masatoshi Nei (1978): The Theory of Genetic Distance and Evolution of Human Races, Japan Journal of Human Genetics, 23, 341-369
- Masatoshi Nei and Naoko Takazaki (1996): The Root of the Phylogenetic Tree of Human Populations, 13(1): 170-177.
- Masatoshi Nei (1972): Genetic distance between populations, The American Naturalist 106, 283-292.
- J. De Riek, E. Calsyn, I. Everaert, E. Van Bockstaele, M. De Loose (2001): AFLP based alternatives for the assessment of Distinctness, Uniformity and Stability of sugar beet varieties, Theor Appl Genet, 103: 1254–126.
- Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. *Bioinformatics*, 25, 1422-1423.
- Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, *Nucleic Acids Research* **32**(5), 1792-97.