Title: | Genetics and Independence Testing of Mixed Genetic Panels |
---|---|
Description: | Developed to handle multi-locus genotype data, this package is designed especially for panels that include different types of markers. Basic genetic parameters such as allele frequency, genotype frequency, heterozygosity, and the Hardy-Weinberg test can be obtained for mixed genetic data. In addition, the package provides a new test for mutual independence that is compatible with mixed genetic data. |
Authors: | Bing Song |
Maintainer: | Bing Song <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.0 |
Built: | 2025-01-22 04:13:11 UTC |
Source: | https://github.com/ice4prince/mixindependr |
Calculate Allele Frequency
AlleleFreq(x,sep)
x |
a dataset of genotypes. Each row denotes an individual; each column contains a marker. |
sep |
the allele separator in the imported genotype data. Note: when using a special character such as "|", remember to escape it as "\\|" (default). |
This function calculates the allele frequencies of one dataset.
a matrix of allele frequencies. Each row denotes an allele; each column denotes a marker. The order of markers follows x.
require(mixIndependR)
x <- data.frame(STR1=c("12|12","13|14","13|13","14|15"),
                SNP1=c("A|A","T|T","A|T","A|T"))
AlleleFreq(x,"\\|")
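To illustrate what an allele frequency table contains, here is a minimal base-R sketch for the same toy data. The helper `allele_freq_one` is hypothetical and is not part of mixIndependR; it only shows the idea of pooling both alleles of every genotype and taking relative frequencies, and the package's own implementation may differ.

```r
# Hypothetical helper (not part of mixIndependR): allele frequencies for one marker.
allele_freq_one <- function(genotypes, sep = "\\|") {
  # Split every genotype string into its two alleles and pool them.
  alleles <- unlist(strsplit(as.character(genotypes), sep))
  # Relative frequency of each distinct allele.
  table(alleles) / length(alleles)
}

x <- data.frame(STR1 = c("12|12", "13|14", "13|13", "14|15"),
                SNP1 = c("A|A", "T|T", "A|T", "A|T"))

# One frequency table per marker, in the column order of x.
freqs <- lapply(x, allele_freq_one)
freqs$SNP1  # A and T each have frequency 0.5
```

Note that the separator is passed as a regular expression, which is why "|" has to be written as "\\|" in R source code.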
Generate Comparison Observed and Expected No. of Heterozygous Loci.
ComposPare_K(h,Ex,trans)
h |
a numeric matrix of 0s and 1s, where 1 means heterozygous and 0 means homozygous; the outcome of function "Heterozygous". Each column denotes a locus and each row denotes an individual. |
Ex |
a dataframe of expected density, outcome of function "DistHetero", on each possible total number of heterozygous loci. |
trans |
a logical value. If TRUE, the outcome is an n x 2 dataframe, where n is the number of individuals in the original imported dataset; the first column is the observed No. of heterozygous loci and the second is the expected one. If FALSE, the dataframe is 2n x 2; the first column is a categorical variable denoting whether the value is observed or expected, and the second column is the frequency of the No. of heterozygous loci. |
This function generates a dataframe containing the observed and expected numbers of heterozygous loci for each sample. The observed values are calculated from the original dataset, whereas the expected values are simulated from the expected probabilities with the same sample size as the observed sample.
a dataframe of observed and expected No. of heterozygous loci for each individual.
h <- matrix(rbinom(20,1,0.5),nrow=5)
Ex <- data.frame(K=c(0:5),Density=rnorm(6,mean = 0.5,sd=0.05))
ComposPare_K(h,Ex,trans = TRUE)
Generate Comparison Observed and Expected No. of Shared Alleles.
ComposPare_X(AS,Ex,trans=TRUE)
AS |
a numeric matrix of 0s, 1s and 2s denoting the number of shared alleles; the outcome of function "AlleleShare_Table". Each column denotes a locus and each row denotes a pair of individuals. |
Ex |
a dataframe of expected density, outcome of function "DistAlleleShare", on each possible total number of shared Alleles. |
trans |
a logical value. If TRUE, the outcome is an n x 2 dataframe, where n is the number of pairs of individuals (rows of AS); the first column is the observed No. of shared alleles and the second is the expected one. If FALSE, the dataframe is 2n x 2; the first column is a categorical variable denoting whether the value is observed or expected, and the second column is the frequency of the No. of shared alleles. |
This function generates a dataframe containing the observed and expected numbers of shared alleles for each pair of individuals. The observed values are calculated from the original dataset through "AlleleShare_Table", whereas the expected values are simulated from the expected probabilities with the same sample size as the observed sample.
a dataframe of observed and expected No. of shared alleles for each pair of individuals.
AS <- matrix(sample(c(0:2),20,replace=TRUE,prob=c(0.3,0.3,0.4)),nrow=5)
Ex <- data.frame(X=c(0:8),Density=rnorm(9,mean = 0.5,sd=0.05))
ComposPare_X(AS,Ex,trans = TRUE)
Simple count including zero
counta(z, y)
z |
a vector you would like to check |
y |
an element you would like to count (even if it does not occur in z). |
This function counts how many times the given element occurs in a vector.
the number of times that y appears in z
z <- rbinom(20,1,0.5)
counta(z,0)
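The phrase "including zero" can be illustrated with a one-line base-R equivalent. The name `count_with_zero` is hypothetical; the sketch only shows why a direct comparison is convenient when the element may be absent, and is not claimed to be how `counta` is written.

```r
# Hypothetical stand-in: count occurrences of y in z, returning 0 when y is absent.
count_with_zero <- function(z, y) sum(z == y, na.rm = TRUE)

z <- c(1, 0, 1, 1)
count_with_zero(z, 0)  # 1
count_with_zero(z, 2)  # 0, although 2 never occurs in z

# By contrast, indexing a table by a value that never occurs yields NA, not 0.
table(z)["2"]
```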
Build a simulated distribution for Chi-Square
Dist_SimuChisq(s,prob,b)
s |
a matrix of frequencies for each simulated sample. Each row for each sample. |
prob |
a vector of expected probability for each simulated sample. |
b |
the number of bootstrap replicates. |
This function builds the distribution of Chi-square statistics for simulated samples.
a vector of Chi-square statistics; its length is the number of samplings.
require(mixIndependR)
h <- runif(10)
s <- Simulate_DistK(h,500,100)
Exp <- DistHetero(h)
Dist_SimuChisq(s,Exp$Density,10)
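As a rough illustration of the statistic involved, the sketch below computes the Pearson Chi-square statistic for each row of a frequency matrix against a vector of expected probabilities. The helper `chisq_stats_sketch` is hypothetical, and the bootstrapping controlled by `b` is deliberately left out, so this is not a reimplementation of `Dist_SimuChisq`.

```r
# Hypothetical helper: Pearson Chi-square statistic for each simulated sample (row of s).
chisq_stats_sketch <- function(s, prob) {
  apply(s, 1, function(obs) {
    expected <- sum(obs) * prob          # expected counts for this sample
    sum((obs - expected)^2 / expected)   # sum over categories of (O - E)^2 / E
  })
}

s <- rbind(c(25, 50, 25),
           c(30, 40, 30))
prob <- c(0.25, 0.5, 0.25)
stats <- chisq_stats_sketch(s, prob)
stats  # 0 for the first sample, 4 for the second
```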
Build Expected Distribution of Numbers of Heterozygous Loci
DistHetero(H)
H |
a vector of average heterozygosity of each locus |
This function builds the expected distribution of the number of heterozygous loci, given the heterozygosity of each locus.
a dataframe of expected density on each possible total number of heterozygous loci.
Chakraborty, R. (1981, ISSN:0016-6731)
DistHetero(runif(10))
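One way to picture the expected distribution is as the distribution of a sum of independent 0/1 variables, one per locus. The sketch below builds it by convolving one locus at a time. It assumes mutually independent loci, the helper name `dist_k_sketch` is hypothetical, and whether `DistHetero` computes its result this way is not confirmed by this manual; only the output column names `K` and `Density` are borrowed from the examples here.

```r
# Hypothetical helper: distribution of K, the number of heterozygous loci,
# assuming loci are mutually independent with heterozygosities H.
dist_k_sketch <- function(H) {
  dens <- 1  # P(K = 0) before any locus is considered
  for (h in H) {
    # Adding a locus: K is unchanged with probability 1 - h,
    # or increases by one with probability h.
    dens <- c(dens * (1 - h), 0) + c(0, dens * h)
  }
  data.frame(K = seq_along(dens) - 1, Density = dens)
}

ex <- dist_k_sketch(c(0.5, 0.5))
ex  # K = 0, 1, 2 with densities 0.25, 0.50, 0.25
```

The densities always sum to 1, and the table has one row for each value from 0 to the number of loci.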
Build Observed Distribution of No. of Heterozygous loci
FreqHetero(h)
h |
a dataframe of heterozygosity, made up of 0 and 1; the outcome of function "Heterozygous". Rows are individuals and columns are markers. |
This function builds the observed distribution from an observed heterozygosity table made up of 0 and 1.
a dataframe of frequencies of each number of heterozygous loci (from 0 to the No. of loci)
h <- matrix(rbinom(20,1,0.5),nrow=5)
FreqHetero(h)
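The observed distribution can be sketched in base R by summing each row of the 0/1 table and tabulating the sums over every possible value. This is an illustrative sketch with hypothetical variable names, not the package's code.

```r
# A small 0/1 heterozygosity table: 3 individuals (rows) by 4 loci (columns).
h <- matrix(c(0, 1, 1, 0,
              1, 1, 0, 0,
              1, 1, 1, 1), nrow = 3, byrow = TRUE)

# Number of heterozygous loci per individual.
k <- rowSums(h)

# Tabulate over all possible values 0..ncol(h), so unobserved values get a count of 0.
obs <- table(factor(k, levels = 0:ncol(h)))
obs
```

Using a factor with explicit levels keeps the zero counts, which a plain `table(k)` would drop.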
Calculate Genotype Frequency
GenotypeFreq(x,sep,expect=TRUE)
x |
a dataframe of genotype data with rownames of sample ID and column names of markers. |
sep |
allele separator in the imported genotype data. Note: when using a special character such as "|", remember to escape it as "\\|" (default). |
expect |
a logical value. If TRUE, the function calculates the expected genotype probabilities; if FALSE, it calculates the observed genotype frequencies. |
This function calculates the observed or expected genotype frequencies from the dataset and allele frequencies.
a dataframe of genotype frequencies. Each row denotes a genotype; each column denotes a locus. The order of markers follows x; the genotypes are ordered from homozygous to heterozygous.
Chakraborty, R., Srinivasan, M. R., & Daiger, S. P. (1993, ISSN:0002-9297).
require(mixIndependR)
x <- data.frame(SNP1=c("A|A","T|T","A|T","A|T"),
                STR1=c("12|12","13|14","13|13","14|15"))
GenotypeFreq(x,"\\|",expect=TRUE)
Test heterozygosity at each locus
Heterozygous(x,sep)
x |
a dataset of genotypes with rownames of sample ID and column names of markers. |
sep |
allele separator in the imported genotype data. Note: when using a special character such as "|", remember to escape it as "\\|" (default). |
This function tests the heterozygosity of each individual at each locus. It outputs a table and is usually followed by write.csv(as.data.frame(y),file = "~/*.csv") to export the results.
a dataframe of heterozygosity: 0 is homozygous; 1 is heterozygous. Each row denotes an individual; each column denotes a locus.
x <- data.frame(STR1=c("12|12","13|14","13|13","14|15"),
                SNP1=c("A|A","T|T","A|T","A|T"))
Heterozygous(x,"\\|")
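The 0/1 coding can be reproduced in base R by splitting each genotype and comparing its two alleles. The helper `is_het` below is hypothetical and only mirrors the coding described above (0 is homozygous, 1 is heterozygous); it is a sketch rather than the package's implementation.

```r
# Hypothetical helper: 1 if the two alleles of a genotype differ, 0 otherwise.
is_het <- function(genotypes, sep = "\\|") {
  parts <- strsplit(as.character(genotypes), sep)
  vapply(parts, function(a) as.integer(a[1] != a[2]), integer(1))
}

x <- data.frame(STR1 = c("12|12", "13|14", "13|13", "14|15"),
                SNP1 = c("A|A", "T|T", "A|T", "A|T"))

# Apply to every marker: rows are individuals, columns are loci.
h <- sapply(x, is_het)
h
```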
Test the Hardy-Weinberg Equilibrium with Chi-square test
HWE.Chisq(G,G0,rescale.p=FALSE,simulate.p.value=TRUE,B=2000)
G |
a dataframe of observed genotype frequencies. Each row denotes a genotype; each column denotes a marker. The order of markers follows x. Genotypes are ordered as follows: the first l genotypes are homozygous, in the order p1p1, p2p2, p3p3, ..., plpl; the (l+1)-th to u-th genotypes are the choose(l,2) heterozygous ones, in the order p1p2, p1p3, ..., p1pl, p2p3, p2p4, ..., p2pl, ..., p(l-1)pl. |
G0 |
a dataframe of expected genotype probabilities. Each row denotes a genotype; each column denotes a locus. The order of markers follows x. Genotypes are ordered as follows: the first l genotypes are homozygous, in the order p1p1, p2p2, p3p3, ..., plpl; the (l+1)-th to u-th genotypes are the choose(l,2) heterozygous ones, in the order p1p2, p1p3, ..., p1pl, p2p3, p2p4, ..., p2pl, ..., p(l-1)pl. |
rescale.p |
a logical scalar; if TRUE then p is rescaled (if necessary) to sum to 1. If rescale.p is FALSE, and p does not sum to 1, an error is given. |
simulate.p.value |
a logical indicating whether to compute p-values by Monte Carlo simulation. |
B |
an integer specifying the number of replicates used in the Monte Carlo test. |
This function checks the Hardy-Weinberg Equilibrium by comparing the observed and expected genotype distributions with a Chi-square test.
a vector of p-values from the Chi-square test; the order of markers follows x.
require(mixIndependR)
x <- data.frame(STR1=c("11|12","12|13","11|13","13|15"),
                STR2=c("12|12","13|14","13|13","14|15"),
                SNP1=c("A|T","A|A","T|A","A|T"),
                SNP2=c("A|A","T|T","A|T","T|A"))
G <- GenotypeFreq(x,expect = FALSE)
G0 <- GenotypeFreq(x,expect = TRUE)
HWE.Chisq(G,G0,rescale.p=FALSE,simulate.p.value=TRUE,B=2000)
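For a single biallelic marker, the comparison can be sketched with base R's `stats::chisq.test`, whose `rescale.p`, `simulate.p.value` and `B` arguments have the same names as those documented above. Whether `HWE.Chisq` calls `chisq.test` internally is an assumption; the genotype counts below are made up for illustration.

```r
set.seed(1)

# Made-up observed genotype counts, homozygotes first, then the heterozygote.
obs <- c(AA = 30, TT = 20, AT = 50)
n <- sum(obs)

# Allele frequencies estimated from the genotype counts.
p_A <- (2 * obs[["AA"]] + obs[["AT"]]) / (2 * n)
p_T <- 1 - p_A

# Expected genotype probabilities under Hardy-Weinberg Equilibrium.
expected <- c(AA = p_A^2, TT = p_T^2, AT = 2 * p_A * p_T)

# Goodness-of-fit test with a Monte Carlo p-value based on B replicates.
res <- chisq.test(obs, p = expected, rescale.p = FALSE,
                  simulate.p.value = TRUE, B = 2000)
res$p.value
```

With multi-allelic markers such as STRs, many genotype classes have small expected counts, which is one reason a simulated p-value can be preferable to the asymptotic one.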
This dataset contains the phased genotypes for a mixed panel of 100 variants. These variants are selected from the reference haplotype data of Gymrek's lab (see Reference). The sample contains 2504 individuals.
data(mixexample)
A dataframe with 2504 observations on 100 variables, containing the phased genotypes of 100 variants (including SNPs and STRs) for 2504 individuals.
1000 Genomes SNP-STR Haplotype Panel <https://gymreklab.com/2018/03/05/snpstr_imputation.html>
The genotypes of the panel after selection <https://github.com/ice4prince/mixIndependR/tree/main/data>
Saini et al. (2018). A reference haplotype panel for genome-wide imputation of short tandem repeats. Nat Commun 9(1): 4397. <https://pubmed.ncbi.nlm.nih.gov/30353011/>
data(mixexample)
Quick p-value of total number of heterozygous loci
mixIndependK(x,sep,t,B)
x |
a dataset of alleles. Each row denotes an individual, with one allele per cell. The (2r-1)-th and 2r-th columns hold the two alleles of the same locus. Note: there is no column for ID; use row.names=1 when importing. |
sep |
allele separator in the imported genotype data. Note: when using a special character such as "|", remember to escape it as "\\|". |
t |
number of simulations in "Simulate_DistK" and "Simulate_DistX". |
B |
number of bootstrap replicates in the Chi-square test. |
This function wraps the whole pipeline for the number of heterozygous loci (K) and generates the p-value of K for the target dataset.
p-value (1 - cumulative probability) for the number of heterozygous loci (K)
x <- data.frame(SNP1=c("A|A","T|T","A|T","A|T"),
                STR1=c("12|12","13|14","13|13","14|15"))
mixIndependK(x,sep ="\\|",10,10)
Quick p-value of total number of shared alleles
mixIndependX(x,sep,t,B)
x |
a dataset of alleles. Each row denotes an individual, with one allele per cell. The (2r-1)-th and 2r-th columns hold the two alleles of the same locus. Note: there is no column for ID; use row.names=1 when importing. |
sep |
allele separator in the imported genotype data. Note: when using a special character such as "|", remember to escape it as "\\|". |
t |
number of simulations in "Simulate_DistK" and "Simulate_DistX". |
B |
number of bootstrap replicates in the Chi-square test. |
This function wraps the whole pipeline for the number of shared alleles (X) and generates the p-value of X for the target dataset.
p-value (1 - cumulative probability) for the number of shared alleles (X)
x <- data.frame(SNP1=c("A|A","T|T","A|T","A|T"),
                STR1=c("12|12","13|14","13|13","14|15"))
mixIndependX(x,sep="\\|",10,10)
Import genotype data from vcf files
read_vcf_gt(x)
x |
The path to the vcf file, including its directory |
This function extracts the genotypes and allele status from a vcf file.
a list containing the genotypes and allele status.
## Not run:
df <- read_vcf_gt("~/x.vcf")
## End(Not run)
Calculate Real or Expected Average Heterozygosity at each locus
RxpHetero(h,p,HWE)
h |
a dataset of heterozygosity, made up of 0 and 1; the output of function "Heterozygous". Each row denotes an individual. Each column denotes a locus. |
p |
a dataset of allele frequency, Output of function "AlleleFreq". Each row denotes each allele, and each column denotes each locus. |
HWE |
a logical value. If TRUE, this function calculates the expected heterozygosity under Hardy-Weinberg Equilibrium: H = 1-sum(q_i^2), where q_i is the frequency of the i-th allele. If FALSE, this function calculates the average heterozygosity from the real heterozygosity table. |
This function calculates the average heterozygosity at each locus. It outputs a vector whose length is the number of loci.
a vector of average heterozygosity at each locus.
Chakraborty, R., & Jin, L. (1992, ISSN:1432-1203) <doi:10.1007/BF00197257>
x <- data.frame(STR1=c(12,13,13,14,15,13,14,12,14,15),
                STR1_1=c(12,14,13,15,13,14,13,12,14,15),
                SNP1=c("A","T","A","A","T","A","A","T","T","A"),
                SNP1_1=c("A","T","T","T","A","T","A","A","T","T"))
require(mixIndependR)
h <- Heterozygous(x)
p <- AlleleFreq(x)
RxpHetero(h,p,HWE=TRUE)
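The two modes can be sketched directly from the formula H = 1-sum(q_i^2) and from column means of the 0/1 table. The input values below are made up, the layout (alleles in rows, loci in columns, 0 for alleles absent at a locus) is an assumption about what an allele frequency table looks like, and the sketch is not the package's implementation.

```r
# Assumed layout: rows are alleles, columns are loci; 0 marks alleles absent at a locus.
p <- cbind(STR1 = c(0.25, 0.375, 0.25, 0.125),
           SNP1 = c(0.5, 0.5, 0, 0))

# Expected heterozygosity under Hardy-Weinberg Equilibrium: H = 1 - sum(q_i^2).
H_expected <- 1 - colSums(p^2)
H_expected  # STR1 = 0.71875, SNP1 = 0.5

# Observed average heterozygosity: mean of the 0/1 table per locus.
h <- cbind(STR1 = c(0, 1, 0, 1),
           SNP1 = c(0, 0, 1, 1))
H_observed <- colMeans(h)
H_observed  # 0.5 for both loci
```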
Generate a Bundle of Simulated Distributions for No. of Heterozygous Loci with Known Heterozygosities
Simulate_DistK(H,m,t)
H |
a vector of average heterozygosity of each locus. The length of H is the number of loci. |
m |
the sample size you want, usually similar to the real sample size. |
t |
the number of samples you want to build |
This function generates multinomial distributions for loci with known heterozygosity and builds the simulated distribution of the No. of heterozygous loci.
a matrix of frequencies of the No. of heterozygous loci. Each row denotes a simulated sample; each column denotes a No. of heterozygous loci, from 0 to the length of H.
Simulate_DistK(runif(10),500,100)
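The shape of the output can be illustrated with a base-R simulation. The sketch below draws each locus independently as a 0/1 variable, which is a simplification chosen for illustration and may not match the multinomial sampling the package describes; `simulate_k_sketch` and `n_samples` are hypothetical names.

```r
# Hypothetical helper: simulate n_samples samples of m individuals each and
# tabulate the number of heterozygous loci per individual.
simulate_k_sketch <- function(H, m, n_samples) {
  n_loci <- length(H)
  sims <- replicate(n_samples, {
    # m x n_loci matrix of 0/1 draws; column j uses heterozygosity H[j].
    het <- matrix(rbinom(m * n_loci, 1, rep(H, each = m)), nrow = m)
    # Frequencies of K = 0..n_loci, keeping zero counts.
    as.integer(table(factor(rowSums(het), levels = 0:n_loci)))
  })
  t(sims)  # one row per simulated sample
}

set.seed(42)
sim <- simulate_k_sketch(H = c(0.2, 0.5, 0.8), m = 100, n_samples = 5)
dim(sim)  # 5 rows (samples) by 4 columns (K = 0, 1, 2, 3)
```

Every row sums to `m`, because each simulated individual falls into exactly one value of K.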
Build a simulated distribution for No. of Shared Alleles
Simulate_DistX(e,m,t)
e |
a matrix of probabilities of sharing 0, 1 or 2 alleles at each locus. Each row denotes a locus; the three columns denote sharing 0, 1 or 2 alleles. |
m |
the sample size you want, usually similar to the real sample size. |
t |
the number of samples to build, i.e. the number of times a sample is generated. |
This function generates multinomial distributions for loci from the known allele frequencies and the expected probabilities of sharing 2, 1 or 0 alleles.
a matrix of frequencies of the No. of shared alleles. Each row denotes a simulated sample; each column denotes a No. of shared alleles, from 0 to twice the number of loci (rows of e).
e0 <- data.frame("P0"=runif(5,min = 0,max = 0.5),"P1"=runif(5,0,0.5))
e <- data.frame(e0,"P2"=1-rowSums(e0))
Simulate_DistX(e,500,10)
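A base-R sketch of the same idea: at each locus the number of shared alleles (0, 1 or 2) is drawn with the probabilities in the corresponding row of `e`, and the draws are summed across loci. The helper `simulate_x_sketch` and the argument name `n_samples` are hypothetical, loci are assumed independent, and the package's own sampling scheme may differ.

```r
# Hypothetical helper: simulate the total number of shared alleles across loci.
simulate_x_sketch <- function(e, m, n_samples) {
  e <- as.matrix(e)
  n_loci <- nrow(e)
  sims <- replicate(n_samples, {
    # For each locus, draw 0, 1 or 2 shared alleles for m draws.
    shared <- sapply(seq_len(n_loci), function(j) {
      sample(0:2, m, replace = TRUE, prob = e[j, ])
    })
    shared <- matrix(shared, nrow = m)
    # Frequencies of X = 0..2 * n_loci, keeping zero counts.
    as.integer(table(factor(rowSums(shared), levels = 0:(2 * n_loci))))
  })
  t(sims)  # one row per simulated sample
}

set.seed(1)
e <- rbind(c(0.2, 0.5, 0.3),
           c(0.1, 0.4, 0.5),
           c(0.3, 0.3, 0.4))
sim_x <- simulate_x_sketch(e, m = 50, n_samples = 4)
dim(sim_x)  # 4 rows (samples) by 7 columns (X = 0..6)
```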
Split each column to two columns for a table of genotypes
splitGenotype(df,sep,dif,rowbind)
df |
a dataframe of genotype data with rownames of sample ID and column names of markers. |
sep |
allele separator in the imported genotype data. Note: when using a special character such as "|", remember to escape it as "\\|" (default). |
dif |
a suffix that differentiates the two allele columns of the same marker. |
rowbind |
a logical variable. If rowbind is TRUE, the output has twice as many rows and the same columns: the table of second alleles is appended below the table of first alleles, with the individual IDs repeated in the same order. |
This function converts genotype data to allele data with doubled columns or doubled rows. The rownames are the sample IDs in the same order (repeated twice if the rows are doubled), and the column names are in the same order, or in alphabetical order by pairs if the columns are doubled.
The parameter "sep" is the symbol of allele separator in the imported genotype data.
The parameter "dif" is the suffix that distinguishes the second appearance of a marker from the first. For example, if dif = "_1" and the original column names are "marker1", "marker2", the output column names will be "marker1", "marker1_1", "marker2", "marker2_1".
a dataframe with doubled columns of the imported data and alleles in different columns
## Not run:
df <- data.frame(SNP1=c("A|A","T|T","A|T","A|T"),
                 STR1=c("12|12","13|14","13|13","14|15"))
splitGenotype(df)
## End(Not run)
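The doubled-column layout can be sketched in base R as follows. The helper `split_genotype_sketch` is hypothetical, covers only the column-wise case (no `rowbind`), and assumes a suffix of "_1"; it illustrates the naming scheme described above rather than the package's code.

```r
# Hypothetical helper: split each genotype column into two allele columns.
split_genotype_sketch <- function(df, sep = "\\|", dif = "_1") {
  out <- list()
  for (marker in names(df)) {
    parts <- strsplit(as.character(df[[marker]]), sep)
    # First allele keeps the marker name; second allele gets the suffix.
    out[[marker]] <- vapply(parts, `[`, character(1), 1)
    out[[paste0(marker, dif)]] <- vapply(parts, `[`, character(1), 2)
  }
  res <- data.frame(out, check.names = FALSE, stringsAsFactors = FALSE)
  rownames(res) <- rownames(df)  # keep sample IDs in the same order
  res
}

df <- data.frame(SNP1 = c("A|A", "T|T", "A|T", "A|T"),
                 STR1 = c("12|12", "13|14", "13|13", "14|15"))
alleles <- split_genotype_sketch(df)
names(alleles)  # "SNP1" "SNP1_1" "STR1" "STR1_1"
```

All alleles are returned as character strings in this sketch, so numeric STR alleles would need converting before arithmetic.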