Title: | Find Matches to Canonical SiRNA Seeds in Genomic Features |
---|---|
Description: | On-target gene knockdown using siRNA ideally results from binding fully complementary regions in mRNA transcripts to induce cleavage. Off-target siRNA gene knockdown can occur through several modes, one being a seed-mediated mechanism mimicking miRNA gene regulation. Seed-mediated off-target effects occur when the ~8 nucleotides at the 5’ end of the guide strand, called a seed region, bind the 3’ untranslated regions of mRNA, causing reduced translation. Experiments using siRNA knockdown paired with RNA-seq can be used to detect siRNA sequences with potential off-target effects driven by the seed region. 'SeedMatchR' provides tools for exploring and detecting potential seed-mediated off-target effects of siRNA in RNA-seq experiments. 'SeedMatchR' is designed to extend current differential expression analysis tools, such as 'DESeq2', by annotating results with predicted seed matches. Using publicly available data, we demonstrate the ability of 'SeedMatchR' to detect cumulative changes in differential gene expression attributed to siRNA seed regions. |
Authors: | Tareian Cazares [aut, cre] , Gulcin Ozer [aut] , Jibo Wang [aut], Rick Higgs [aut], Eli Lilly and Company [cph] |
Maintainer: | Tareian Cazares <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.1 |
Built: | 2024-11-09 05:55:41 UTC |
Source: | https://github.com/tacazares/seedmatchr |
Check if input gene lists overlap
check_gene_list_overlap(gene.lists)
check_gene_list_overlap(gene.lists)
gene.lists |
A list of gene lists. example: list(c("gene1", "gene2"), c("gene1")) |
Warning if gene sets overlap
# Overlap check_gene_list_overlap(list(c("gene1", "gene2"), c("gene1"))) #No overlap check_gene_list_overlap(list(c("gene1", "gene2"), c("gene3")))
# Overlap check_gene_list_overlap(list(c("gene1", "gene2"), c("gene1"))) #No overlap check_gene_list_overlap(list(c("gene1", "gene2"), c("gene3")))
This functions will take differential expression results, such as from DESEQ2, as a data.frame
and
plot the ecdf for the input gene.lists
.
The gene sets to plot should be provided as a list of lists.
Example:
gene.lists = list("Background" = c("gene1", "gene2"), "Target" = c("gene2", "gene3"), "Overlap" = c("gene2"))
This function will also perform statistical testing if plot.hist
is TRUE.
The output will be saved to a PDF if an output.filename
is provided.
Users can define the groups that are to be compared in the statistical test
using the null.name
and target.name
arguments. The names must be found
in gene.lists
. The factor.order
is used to order the groups in the
analysis.
This functions returns:
$plot
: The ECDF plot
$stats
: The stats results object
de_fc_ecdf( res, gene.lists, title = "ECDF", output.filename = NULL, palette = SeedMatchR.palette, factor.order = NULL, x.lims = c(-1, 1), stats.test = c("KS", "Wilcoxen"), alternative = c("greater", "less", "two.sided"), null.name = 1, target.name = 2, height = 5, width = 5, dpi = 320, l2fc.col = "log2FoldChange" )
de_fc_ecdf( res, gene.lists, title = "ECDF", output.filename = NULL, palette = SeedMatchR.palette, factor.order = NULL, x.lims = c(-1, 1), stats.test = c("KS", "Wilcoxen"), alternative = c("greater", "less", "two.sided"), null.name = 1, target.name = 2, height = 5, width = 5, dpi = 320, l2fc.col = "log2FoldChange" )
res |
The DESeq2 results dataframe |
gene.lists |
A nest list of gene names. Example: gene.lists = list("Background" = gene.list2, "Target" = gene.list1, "Overlap" = gene.list3) |
title |
The tile of the plot |
output.filename |
If the output filename is provided, then the plot is saved. |
palette |
The color palette to use for your curves |
factor.order |
The order to use for the legends |
x.lims |
The xlimits range |
stats.test |
The statistic test to use. Options: KS, Kuiper, DTS, CVM, AD, Wass |
alternative |
The alternative hypothesis to test. Options: greater, less, two.sided |
null.name |
The name in the gene.list to use as the null for ecdf plots |
target.name |
The name in the gene.list to use as the target for ecdf plots |
height |
Plot height in inches |
width |
Plot width in inches |
dpi |
The dpi resolution for the figure |
l2fc.col |
The name of the column containing log2FoldChange values. Based on DESEQ2 names as default. |
A ggplot object for the ECDF plot
library(dplyr) guide.seq = "UUAUAGAGCAAGAACACUGUUUU" anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna) # Load test data get_example_data("sirna") sirna.data = load_example_data("sirna") res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE) res = SeedMatchR(res, anno.db$gtf, features$seqs, guide.seq, "mer7m8") # Gene set 1 mer7m8.list = res$gene_id[res$mer7m8 >= 1] # Gene set 2 background.list = res$gene_id[!(res$mer7m8 %in% mer7m8.list)] ecdf.results = de_fc_ecdf(res, list("Background" = background.list, "mer7m8" = mer7m8.list), stats.test = "KS", factor.order = c("Background", "mer7m8"), null.name = "Background", target.name = "mer7m8")
library(dplyr) guide.seq = "UUAUAGAGCAAGAACACUGUUUU" anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna) # Load test data get_example_data("sirna") sirna.data = load_example_data("sirna") res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE) res = SeedMatchR(res, anno.db$gtf, features$seqs, guide.seq, "mer7m8") # Gene set 1 mer7m8.list = res$gene_id[res$mer7m8 >= 1] # Gene set 2 background.list = res$gene_id[!(res$mer7m8 %in% mer7m8.list)] ecdf.results = de_fc_ecdf(res, list("Background" = background.list, "mer7m8" = mer7m8.list), stats.test = "KS", factor.order = c("Background", "mer7m8"), null.name = "Background", target.name = "mer7m8")
Download and parse DESeq2 output from GSE184929
download_parse_file(download.path, output.path)
download_parse_file(download.path, output.path)
download.path |
File url to be downloaded |
output.path |
Filename used for saving downloaded file |
DESeq2 results as a data.frame.
download_parse_file()
download_parse_file()
stats
packageThis function uses the stats
package to test the ECDF
of log2(Fold Changes) between two groups based on DESeq2 analysis.
The inputs of this function are a differential expression results data.frame
and two sets of
gene IDs called gene.list1
and gene.list2
. The functions will look for a
column called log2FoldChange
in the dataframe.
ecdf_stat_test( res, gene.list1, gene.list2, stats.test = c("KS", "Wilcoxen"), alternative = "greater", l2fc.col = "log2FoldChange" )
ecdf_stat_test( res, gene.list1, gene.list2, stats.test = c("KS", "Wilcoxen"), alternative = "greater", l2fc.col = "log2FoldChange" )
res |
Input results file data frame |
gene.list1 |
Gene list 1: Usually null distribution |
gene.list2 |
Gene list 2: Target set of genes |
stats.test |
Stats test to use. Options: KS or Wilcoxen |
alternative |
The alternative hypothesis to test. Options: greater, less, two.sided |
l2fc.col |
The name of the column containing log2FoldChange values. Based on DESEQ2 names as default. |
A vector containing the dstat and pvalue
library(dplyr) guide.seq = "UUAUAGAGCAAGAACACUGUUUU" anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna) # Load test data get_example_data("sirna") sirna.data = load_example_data("sirna") res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE) res = SeedMatchR(res, anno.db$gtf, features$seqs, guide.seq, "mer7m8") # Gene set 1 mer7m8.list = res$gene_id[res$mer7m8 >= 1] # Gene set 2 background.list = res$gene_id[!(res$mer7m8 %in% mer7m8.list)] ecdf.res = ecdf_stat_test(res, mer7m8.list, background.list)
library(dplyr) guide.seq = "UUAUAGAGCAAGAACACUGUUUU" anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna) # Load test data get_example_data("sirna") sirna.data = load_example_data("sirna") res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE) res = SeedMatchR(res, anno.db$gtf, features$seqs, guide.seq, "mer7m8") # Gene set 1 mer7m8.list = res$gene_id[res$mer7m8 >= 1] # Gene set 2 background.list = res$gene_id[!(res$mer7m8 %in% mer7m8.list)] ecdf.res = ecdf_stat_test(res, mer7m8.list, background.list)
Filter DESeqDataSet
results for use with seed matching
and counting functions.
The filtering criteria are:
Filter out genes that are not expressed or counted at all: baseMean = 0 & pvalue = NA & log2FoldChange = NA
Filter out genes that are expressed, but there is not difference across groups: log2FoldChange = 0
Filter out genes with extreme outliers: pvalue = NA and padj = NA
Filter out genes that have been excluded by independent filtering. padj = NA
Filter results by the fdr.cutoff
Filter the results by the log2FoldChange
Filter the results by the baseMean
Remove NA gene_ids and log2FoldChange values
filter_deseq( res, fdr.cutoff = 1, fc.cutoff = 0, rm.na.name = FALSE, rm.na.log2fc = FALSE, baseMean.cutoff = 0 )
filter_deseq( res, fdr.cutoff = 1, fc.cutoff = 0, rm.na.name = FALSE, rm.na.log2fc = FALSE, baseMean.cutoff = 0 )
res |
The DESEQ2 results as a data frame |
fdr.cutoff |
The false discovery rate cutoff to use. |
fc.cutoff |
The fold change cutoff to use. The absolute value will be used as the cutoff and values greater-than-or-equal-to will be kept. |
rm.na.name |
Remove na values from the gene_name column |
rm.na.log2fc |
Remove na values from the log2FoldChange column |
baseMean.cutoff |
The minimum baseMean expression cutoff |
A modified DESEQ2 results table that has been filtered
# Load test data get_example_data("sirna") sirna.data = load_example_data("sirna") res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE)
# Load test data get_example_data("sirna") sirna.data = load_example_data("sirna") res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE)
This function will download data that can be used for SeedMatchR. Choosing 'sirna' will download 3 DESeq2 results files from GSE184929. Choosing 'mirna' will download the miRDB database as a tsv.
get_example_data(example.type)
get_example_data(example.type)
example.type |
Name of the example to load. Options: sirna, mirna |
None?
get_example_data()
get_example_data()
This function is used to get the genomic features of interest
and the DNA sequences associated with them. This function takes advantage of
the GenomicFeatures
package functions threeUTRsByTranscript
,
fiveUTRsByTranscript
, exonsBy
, intronsByTranscript
, and cdsBy
. These
functions are used to generate the features given an input tx.db
object. A
2bit dna
input is also required for extracting features sequences.
The output of the this function is:
$db
: the feature GRanges
object
$seqs
: DNAStringSet
of sequences associated to those features
get_feature_seqs(tx.db, dna, feature.type = "3UTR")
get_feature_seqs(tx.db, dna, feature.type = "3UTR")
tx.db |
A tx.db object |
dna |
A 2bit dna sequence |
feature.type |
The type of feature to return. Options: 3UTR, 5UTR, exons, introns, cds |
list containing the feature db object and the feature sequences
anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna)
anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna)
Given a sequence greater than 8 bp oriented 5' -> 3' and a seed
definition, this function will return an object containing seed-specific sequence
information. Users can input a custom seed name, but must provide the start
position (start.pos
) and stop position (stop.pos
) that define the
range of the seed sequence.
Built-in options: mer8
, mer7A1
, mer7m8
, mer6
Note: The seed definitions mer8
and mer7A1
force a U at position g1.
This results in an A in the target sequence being searched.
get_seed(guide.seq, seed.name = "mer7m8", start.pos = 1, stop.pos = 8)
get_seed(guide.seq, seed.name = "mer7m8", start.pos = 1, stop.pos = 8)
guide.seq |
A character string greater than 8 bp and oriented 5'-> 3'. |
seed.name |
The seed name of interest. Options: mer8, mer7A1, mer7m8, mer6. If not in the default list, the start.pos and stop.pos arguments will be used to define the seed. |
start.pos |
The start position for a custom seed definition |
stop.pos |
The stop position for a custom seed definition |
An object with the entries:
Guide
: Input guide sequence. Input is expected to be RNA.
Seed.Name
: The seed name.
Seed.Seq.RNA
: The seed sequence as a RNAString
Seed.Seq.DNA
: The seed sequence as a DNAString
Target.Seq
: The target DNA sequence based on the reverse complement of
the seed as a DNAString
# Example Ttr from Schlegel et al. 2022 guide.seq = "UUAUAGAGCAAGAACACUGUUUU" # Get seed match seed.seq = get_seed(guide.seq, "mer7m8")
# Example Ttr from Schlegel et al. 2022 guide.seq = "UUAUAGAGCAAGAACACUGUUUU" # Get seed match seed.seq = get_seed(guide.seq, "mer7m8")
Load example DESeq2 data into the environment
load_example_data(example.type)
load_example_data(example.type)
example.type |
Name of the example to load. Options: sirna, mirna |
Loads either the Schlegel 2022 RNAseq data or miRDB into the environment.
load_example_data()
load_example_data()
AnnotationDb
Use AnnotationHub
to load species-specific GTF and 2bit DNA
sequences. This function currently works for human, rat, and mouse.
The function will return:
$gtf
: A GRanges
object containing the GTF information
$tx.db
: A tx.db
object made from the GTF
$dna
: The 2bit DNA sequence as a DNAStringSet
load_species_anno_db(species.name, remove.na.rows = TRUE)
load_species_anno_db(species.name, remove.na.rows = TRUE)
species.name |
Species name. Options: human, rat, mouse |
remove.na.rows |
Remove rows with NA in the gene_id column |
Species specific AnnotationDb
anno.db = load_species_anno_db("human")
anno.db = load_species_anno_db("human")
Plot the Guide Strand with different optional seeds
plot_seeds(guide.seq)
plot_seeds(guide.seq)
guide.seq |
Guide a.k.a anti-sense sequence oriented 5' > 3'. Sequence must be greater than 8 bp. |
A msaggplot of the guide sequence in addition to the available seed sequences
library(msa) # Ttr siRNA sequence guide.seq = "UUAUAGAGCAAGAACACUGUUUU" # generate seed plot plotted.seeds = plot_seeds(guide.seq)
library(msa) # Ttr siRNA sequence guide.seq = "UUAUAGAGCAAGAACACUGUUUU" # generate seed plot plotted.seeds = plot_seeds(guide.seq)
Find seed matches in a DNAStringSet
object of
sequences. This function will use get.seed
extract the seed sequence
from the guide sequence. The seed is then searched across all rows of the
DNAStringSet
object using vpatterncount
.
This function returns the input DESeq2 results data.frame
with an
additional column that contains the counts for the input seed.name
.
SeedMatchR( res, gtf, seqs, sequence, seed.name = "mer7m8", col.name = NULL, mismatches = 0, indels = FALSE, tx.id.col = TRUE )
SeedMatchR( res, gtf, seqs, sequence, seed.name = "mer7m8", col.name = NULL, mismatches = 0, indels = FALSE, tx.id.col = TRUE )
res |
A DESeq2 results |
gtf |
GTF file used to map features to genes. The object must have columns transcript_id and gene_id |
seqs |
The |
sequence |
The |
seed.name |
The name of specific seed to extract. Options are: mer8, mer7A1, mer7m8, mer6 |
col.name |
The string to use for the column name. Defaults to seed name |
mismatches |
The number of mismatches to allow in search |
indels |
Whether to allow indels in search |
tx.id.col |
Use the transcript_id column instead of gene_id |
A modified DESeq2 results dataframe that has column named after the seed of choice representing the number of match counts.
library(dplyr) seq = "UUAUAGAGCAAGAACACUGUUUU" anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna) # Load test data res <- Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE) res = SeedMatchR(res, anno.db$gtf, features$seqs, seq, "mer7m8")
library(dplyr) seq = "UUAUAGAGCAAGAACACUGUUUU" anno.db = load_species_anno_db("human") features = get_feature_seqs(anno.db$tx.db, anno.db$dna) # Load test data res <- Schlegel_2022_Ttr_D1_30mkg # Filter DESeq2 results for SeedMatchR res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE) res = SeedMatchR(res, anno.db$gtf, features$seqs, seq, "mer7m8")