Title: | A Pipeline to Define Gene Families in Legumes and Beyond |
---|---|
Description: | A pipeline with high specificity and sensitivity in extracting proteins from the RefSeq database (National Center for Biotechnology Information). Manual identification of gene families is highly time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify members of a given family. The pipeline implements an automatic approach for the identification of gene families based on the conserved domains that specifically define that family. See Die et al. (2018) <doi:10.1101/436659> for more information and examples. |
Authors: | Jose V. Die [aut, cre] , Moamen M. Elmassry [ctb], Kimberly H. LeBlanc [ctb], Olaitan I. Awe [ctb], Allissa Dillman [ctb], Ben Busby [aut] |
Maintainer: | Jose V. Die <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.11 |
Built: | 2024-11-12 05:26:49 UTC |
Source: | https://github.com/ncbi-hackathons/genehummus |
Summarizes a data frame of protein ids and returns the total number of accessions per organism.
accessions_by_spp(my_accessions)
my_accessions |
A data frame with accession protein ids and organisms |
A data.frame of summarized results including columns:
organism, taxonomic species
N.seqs, total number of sequences
Jose V. Die
getAccessions to create the data frame with accession id and organism for each protein identifier.
my_prots <- data.frame(
  accession = c("XP_014620925", "XP_003546066", "XP_025640041",
                "XP_019453956", "XP_006584791", "XP_020212415",
                "XP_017436622", "XP_004503803", "XP_019463844"),
  organism = c("Glycine max", "Glycine max", "Arachis hypogaea",
               "Lupinus angustifolius", "Glycine max", "Cajanus cajan",
               "Vigna angularis", "Cicer arietinum", "Lupinus angustifolius")
)
accessions_by_spp(my_prots)
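For intuition, the per-organism count can be reproduced with base R alone. The following is a minimal sketch, not the package's implementation: the helper name `count_by_organism` is hypothetical, and its output columns simply mirror the ones documented above.

```r
# Hypothetical helper (not part of the package): count accessions per
# organism using only base R.
count_by_organism <- function(my_accessions) {
  counts <- table(my_accessions$organism)
  data.frame(
    organism = names(counts),
    N.seqs = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

my_prots <- data.frame(
  accession = c("XP_014620925", "XP_003546066",
                "XP_025640041", "XP_019453956"),
  organism = c("Glycine max", "Glycine max",
               "Arachis hypogaea", "Lupinus angustifolius")
)

res <- count_by_organism(my_prots)
print(res)
```

Note that `table()` orders the organisms alphabetically, which may differ from the ordering returned by accessions_by_spp.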
Filter a dataframe of protein ids and return the accessions for a given species or organism.
accessions_from_spp(my_accessions, spp)
my_accessions |
A data frame with accession protein ids and organisms |
spp |
A string with the scientific name of the species or organism. |
A string vector with protein accessions (XP accessions, RefSeq database).
Jose V. Die
getAccessions to create the data frame with accession id and organism for each protein identifier.
my_prots <- data.frame(
  accession = c("XP_014620925", "XP_003546066", "XP_025640041",
                "XP_019453956", "XP_006584791", "XP_020212415",
                "XP_017436622", "XP_004503803", "XP_019463844"),
  organism = c("Glycine max", "Glycine max", "Arachis hypogaea",
               "Lupinus angustifolius", "Glycine max", "Cajanus cajan",
               "Vigna angularis", "Cicer arietinum", "Lupinus angustifolius")
)
accessions_from_spp(my_prots, "Glycine max")
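The filtering step amounts to a logical subset on the organism column. As a rough base R sketch (the helper `filter_by_organism` is a hypothetical name, and this is not claimed to be how the package does it internally):

```r
# Hypothetical helper (not part of the package): keep only the accessions
# whose organism matches the requested scientific name exactly.
filter_by_organism <- function(my_accessions, spp) {
  as.character(my_accessions$accession[my_accessions$organism == spp])
}

my_prots <- data.frame(
  accession = c("XP_014620925", "XP_003546066",
                "XP_025640041", "XP_019453956"),
  organism = c("Glycine max", "Glycine max",
               "Arachis hypogaea", "Lupinus angustifolius")
)

gmax <- filter_by_organism(my_prots, "Glycine max")
print(gmax)
```

The comparison is an exact string match, so in this sketch the species name has to be spelled exactly as it appears in the organism column.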
Core function used by getAccessions.
accessions_warning(protein_ids)
protein_ids |
A string vector containing protein identifiers. |
Jose V. Die
Parses the SPARCLE database (NCBI) and extracts the electronic identifiers for each conserved domain.
archids_warning(gene_family)
gene_family |
A string with conserved domain(s). |
Jose V. Die
Extracts the protein identifiers, hosted by the RefSeq database (NCBI), for the given taxonomic species.
extract_proteins(targets, taxonIds)
targets |
A string with the electronic links for the SPARCLE architecture. |
taxonIds |
A string with the taxonomic species identifiers; legume species (by default). |
First, get the protein ids from RefSeq database. Then, extract only the ids for the selected taxonomic species (by default, legume species).
Jose V. Die
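The two-step logic described above (collect candidate protein ids, then keep only those from the selected taxa) can be illustrated offline with a toy table. This is only a sketch under stated assumptions: the table `candidates`, its protein ids, and the helper `keep_taxa` are all hypothetical, and the real function queries NCBI rather than a local data frame. The taxonomy ids are real NCBI ids (3880 Medicago truncatula, 3847 Glycine max, 3702 Arabidopsis thaliana).

```r
# Step 1 (mocked): a hypothetical table of candidate proteins and the
# taxonomy id of the organism each one belongs to.
candidates <- data.frame(
  protein_id = c("prot_A", "prot_B", "prot_C", "prot_D"),
  tax_id = c(3880, 3847, 3702, 3880),
  stringsAsFactors = FALSE
)

# Step 2: keep only the proteins whose taxonomy id is in the selection.
# `keep_taxa` is a hypothetical helper, not a package function.
keep_taxa <- function(candidates, taxonIds) {
  candidates$protein_id[candidates$tax_id %in% taxonIds]
}

legume_hits <- keep_taxa(candidates, c(3880, 3847))
print(legume_hits)
```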
Parse the architecture identifiers and extract those that contain at least the conserved domains selected in the filter.
filterArch_ids(archs_ids, filter)
archs_ids |
A string with the architecture identifiers that contain, at least, one of the conserved domains defining the gene family. |
filter |
A string with the domains (in order) that the proteins are required to contain, at a minimum. |
The architecture identifiers, from all the potential protein architectures defined by getArch_ids, that contain at least the conserved domains explicitly given in the filter.
Jose V. Die
## Not run:
archs_ids <- getArch_ids("pfam02362")
my_filter <- c("B3_DNA", "Auxin_resp")
filterArch_ids(archs_ids, my_filter)
## End(Not run)
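To make the "contains at least these domains, in this order" idea concrete, here is a small offline sketch. It assumes each architecture is described by a text label listing its domains; the labels, the ids `arch_1` to `arch_3`, and the helper `has_domains_in_order` are all hypothetical, and the package may apply the filter differently.

```r
# Hypothetical architecture labels keyed by made-up architecture ids.
archs <- c(
  arch_1 = "B3_DNA, Auxin_resp, AUX_IAA",
  arch_2 = "B3_DNA",
  arch_3 = "Auxin_resp, B3_DNA"
)

# Hypothetical helper: TRUE when every domain in `filter` occurs in the
# label and the first occurrences appear in the same order as in `filter`.
has_domains_in_order <- function(label, filter) {
  pos <- vapply(
    filter,
    function(d) as.integer(regexpr(d, label, fixed = TRUE)),
    integer(1)
  )
  all(pos > 0) && !is.unsorted(pos, strictly = TRUE)
}

my_filter <- c("B3_DNA", "Auxin_resp")
keep <- vapply(archs, has_domains_in_order, logical(1), filter = my_filter)
selected <- names(archs)[keep]
print(selected)
```

Extra domains (such as `AUX_IAA` in `arch_1`) do not disqualify an architecture, which is what "at least" means here.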
Parse the architecture identifiers and extract those that contain at least the conserved domains selected as the filter.
filterarchids_warning(archs_ids, filter)
archs_ids |
A string with the architecture identifiers. |
filter |
A string with the domains as filter. |
Jose V. Die
genehummus is a pipeline with high specificity and sensitivity in extracting proteins from the RefSeq database (NCBI).
Jose V. Die [email protected], Moamen M. Elmassry, Kimberly H. LeBlanc, Olaitan I. Awe, Allissa Dillman, Ben Busby
See the preprint in bioRxiv.
Parse a string vector with sequence descriptions (title and species) and extract the species name.
get_spp(description)
description |
A string vector with the sequence description (title and species). |
The species name extracted from each sequence description.
Jose V. Die
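As an illustration of this kind of parsing, the sketch below assumes the description follows the usual NCBI protein title layout, with the organism in square brackets at the end. The helper `extract_species` and the example titles are hypothetical, and the package's own parsing may differ.

```r
# Hypothetical helper: return the text inside the trailing square brackets
# of each description, or NA when no bracketed organism is present.
extract_species <- function(description) {
  pattern <- "^.*\\[([^]]+)\\][[:space:]]*$"
  ifelse(
    grepl(pattern, description),
    sub(pattern, "\\1", description),
    NA_character_
  )
}

# Made-up titles in the assumed "title [Genus species]" layout.
titles <- c(
  "auxin response factor 2 [Glycine max]",
  "auxin response factor 18-like [Cicer arietinum]",
  "description without an organism"
)

spp <- extract_species(titles)
print(spp)
```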
The getAccessions function parses the protein page for each identifier and extracts the accession id (usually referred to as the XP accession in the RefSeq database) and the organism, given by its scientific name.
The accessions_by_spp and accessions_from_spp functions are convenient filters for further cleaning of the getAccessions output: they give the total number of XP accessions per species or extract the XP accessions for a given species, respectively.
getAccessions(protein_ids)
protein_ids |
A string vector containing protein identifiers. |
A data.frame of protein ids including columns:
accession
organism
Jose V. Die
accessions_by_spp to summarize the total number of protein accessions per species.
accessions_from_spp to filter the accession ids for a given species.
## Not run:
prot_ids <- c("593705262", "1379669790", "1150156484")
getAccessions(prot_ids)
## End(Not run)
Parses the SPARCLE database (NCBI) and extracts the electronic identifiers for each conserved domain.
getArch_ids(gene_family)
gene_family |
A string with the conserved domain(s) defining the gene family. The domains have to be shown in the same order appearing in the sequences. |
The architecture identifiers for each of the conserved domains.
Jose V. Die
arf <- c("pfam06507")
getArch_ids(arf)
Parses the architecture identifiers and extracts their corresponding labels.
getArch_labels(arch_ids)
arch_ids |
A string with the architecture electronic identifiers. |
Prints the description labels for the candidate architectures that contain the proteins we are looking for.
Jose V. Die
filterArch_ids
filtered_archids <- c("12034188", "12034184")
getArch_labels(filtered_archids)
Parse the RefSeq database using protein architecture identifiers (SPARCLE database) and extract the protein ids for the selected taxonomic species.
getProteins_from_tax_ids(arch_ids, taxonIds)
arch_ids |
A string with the electronic links for the SPARCLE architectures. |
taxonIds |
A vector with taxonomy ids; legume species available in RefSeq, by default. |
RefSeq protein identifiers for selected species.
Jose V. Die
filtered_archids <- c("12034184")
medicago <- c(3880)
getProteins_from_tax_ids(filtered_archids, medicago)
Parse the RefSeq database and extract all the protein identifiers that have a given architecture.
getProtlinks(archs_ids)
archs_ids |
A string with the architecture identifiers (SPARCLE database, NCBI) |
Jose V. Die
Parses the SPARCLE database (NCBI) and extracts the electronic links for a given conserved domain.
getSparcleArchs(CD)
CD |
A string with the conserved domain(s) |
Jose V. Die
Parses the architecture identifiers and extracts the corresponding labels.
labels_warning(arch_ids)
arch_ids |
A string with the architecture electronic identifiers. |
Jose V. Die
Taxonomy identifiers for about 10,000 legume species.
data(legumesIds)
A numeric vector with 10,009 elements.
Taxonomy identifiers for Fabaceae species (Taxonomy database, NCBI).
XP accessions for the Auxin Response Factor gene family in the legume taxonomy (NCBI RefSeq database, as of September 2018).
data(my_legumes)
a list of length 10 with 563 elements.
1. chickpea
2. medicago
3. soybean
4. arachis_duranensis
5. arachis_ipaensis
6. cajanus
7. vigna_angulata
8. vigna_radiata
9. phaseolus
10. lupinus
Protein identifiers for Fabaceae species (RefSeq database, NCBI).
Parse the RefSeq database using protein architecture identifiers and extract the protein ids for the selected taxonomic species. Core function used by getProteins_from_tax_ids.
proteins_warning(arch_ids, taxonIds)
arch_ids |
A string with the electronic links for the SPARCLE architectures. |
taxonIds |
A vector with taxonomy ids. |
Jose V. Die
Splits a vector into N elements so that each element has a given length.
sizeIds
An object of class numeric of length 1.
Jose V. Die
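Splitting a long vector of identifiers into fixed-size pieces is a common way to keep each remote query small. The sketch below shows one base R way to do it, on the assumption that sizeIds plays the role of the chunk size; the helper `chunk_vector` and the size 4 are illustrative and not taken from the package.

```r
# Hypothetical helper: split `x` into consecutive chunks holding at most
# `size` elements each; the last chunk may be shorter.
chunk_vector <- function(x, size) {
  split(x, ceiling(seq_along(x) / size))
}

ids <- 1:10
chunks <- chunk_vector(ids, 4)
print(chunks)
```

Concatenating the chunks with `unlist()` recovers the original order, so no identifier is lost or duplicated by the split.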