Package 'geneHummus' reference manual

Title:	A Pipeline to Define Gene Families in Legumes and Beyond
Description:	A pipeline with high specificity and sensitivity in extracting proteins from the RefSeq database (National Center for Biotechnology Information). Manual identification of gene families is highly time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify members of a given family. The pipeline implements an automatic approach for the identification of gene families based on the conserved domains that specifically define that family. See Die et al. (2018) <doi:10.1101/436659> for more information and examples.
Authors:	Jose V. Die [aut, cre], Moamen M. Elmassry [ctb], Kimberly H. LeBlanc [ctb], Olaitan I. Awe [ctb], Allissa Dillman [ctb], Ben Busby [aut]
Maintainer:	Jose V. Die <[email protected]>
License:	MIT + file LICENSE
Version:	1.0.11
Built:	2025-03-12 05:01:13 UTC
Source:	https://github.com/ncbi-hackathons/genehummus

Compute the total number of accession proteins per species

Description

Summarizes a data frame of protein ids and returns the total number of accessions per organism.

Usage

accessions_by_spp(my_accessions)

Arguments

my_accessions

A data frame with accession protein ids and organisms

Value

A data.frame of summarized results including columns:

organism, taxonomic species
N.seqs, total number of sequences
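As a rough illustration of the kind of summary described above, the following base R sketch counts accessions per organism with table(). It is not the package's implementation: the helper name count_by_organism is hypothetical, and the output column names simply follow the Value section.

```r
# Hypothetical helper (not part of geneHummus): count accessions per organism.
count_by_organism <- function(my_accessions) {
  # table() tallies how many rows share each organism name
  counts <- table(my_accessions$organism)
  data.frame(organism = names(counts),
             N.seqs = as.integer(counts),
             stringsAsFactors = FALSE)
}

my_prots <- data.frame(
  accession = c("XP_014620925", "XP_003546066", "XP_025640041"),
  organism  = c("Glycine max", "Glycine max", "Arachis hypogaea"))

res <- count_by_organism(my_prots)
res
```

Rows come back sorted alphabetically by organism, which is a side effect of table() and not necessarily the ordering accessions_by_spp uses.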

Author(s)

Jose V. Die

Examples

my_prots = data.frame(accession = c("XP_014620925", "XP_003546066", 
   "XP_025640041", "XP_019453956", "XP_006584791", "XP_020212415", 
   "XP_017436622", "XP_004503803", "XP_019463844"),
   organism =  c("Glycine max", "Glycine max", "Arachis hypogaea",
   "Lupinus angustifolius", "Glycine max", "Cajanus cajan", 
   "Vigna angularis", "Cicer arietinum", "Lupinus angustifolius"))
   
accessions_by_spp(my_prots)
 

Extract the accession ids (XP accession) for a given organism

Description

Filter a data frame of protein ids and return the accessions for a given species or organism.

Usage

accessions_from_spp(my_accessions, spp)

Arguments

`my_accessions`	A data frame with accession protein ids and organisms
`spp`	A string with the scientific name of the species or organism.

Value

A string vector with the protein accessions (XP accessions, RefSeq database) for the given organism.
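A minimal sketch of this kind of filter in base R, assuming the input has the accession and organism columns used throughout this manual. The helper name filter_by_organism is hypothetical and is not the package's own code.

```r
# Hypothetical helper (not part of geneHummus): keep accessions for one organism.
filter_by_organism <- function(my_accessions, spp) {
  # Logical index of the rows whose organism matches the requested name exactly
  keep <- my_accessions$organism == spp
  # as.character() guards against factor columns in older R versions
  as.character(my_accessions$accession[keep])
}

my_prots <- data.frame(
  accession = c("XP_014620925", "XP_003546066", "XP_025640041"),
  organism  = c("Glycine max", "Glycine max", "Arachis hypogaea"))

soy <- filter_by_organism(my_prots, "Glycine max")
soy
```

The comparison is an exact, case-sensitive match on the scientific name, so "glycine max" would return an empty vector in this sketch.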

Author(s)

Jose V. Die

Examples

my_prots = data.frame(accession = c("XP_014620925", "XP_003546066", 
   "XP_025640041", "XP_019453956", "XP_006584791", "XP_020212415", 
   "XP_017436622", "XP_004503803", "XP_019463844"),
   organism =  c("Glycine max", "Glycine max", "Arachis hypogaea",
   "Lupinus angustifolius", "Glycine max", "Cajanus cajan", 
   "Vigna angularis", "Cicer arietinum", "Lupinus angustifolius"))
   
accessions_from_spp(my_prots, "Glycine max")
 

Get accessions and organism for each protein identifier

Description

Core function used by getAccessions.

Usage

accessions_warning(protein_ids)

Arguments

protein_ids

A string vector containing protein identifiers.

Author(s)

Jose V. Die

Get architecture identifiers for the conserved domains

Description

Parses the SPARCLE database (NCBI) and extracts the electronic identifiers for each conserved domain.

Usage

archids_warning(gene_family)

Arguments

gene_family

A string with conserved domain(s).

Author(s)

Jose V. Die

Get the protein identifiers

Description

Extract the protein identifiers hosted by the RefSeq database (NCBI) for the given taxonomic species.

Usage

extract_proteins(targets, taxonIds)

Arguments

`targets`	A string with the electronic links for the SPARCLE architecture.
`taxonIds`	A string with the taxonomic species identifiers; legume species (by default).

Details

First, get the protein ids from the RefSeq database. Then, extract only the ids for the selected taxonomic species (by default, legume species).

Author(s)

Jose V. Die

Filter the protein architectures based on conserved domains

Description

Parse the architecture identifiers and extract those that contain, at least, the conserved domains selected in the filter.

Usage

filterArch_ids(archs_ids, filter)

Arguments

`archs_ids`	A string with the architecture identifiers that contain, at least, one of the conserved domains defining the gene family.
`filter`	A string with the domains, in order, that the proteins are required to contain (at least).

Value

The architecture identifiers, from all the potential protein architectures defined by getArch_ids, that contain at least the conserved domains explicitly given in the filter.

Author(s)

Jose V. Die

Examples

## Not run: 
archs_ids <- getArch_ids("pfam02362")
my_filter <- c("B3_DNA", "Auxin_resp")

filterArch_ids(archs_ids, my_filter) 

## End(Not run)



Filter protein architectures based on conserved domains

Description

Parse the architecture identifiers and extract those that contain, at least, the conserved domains selected as filter.

Usage

filterarchids_warning(archs_ids, filter)

Arguments

`archs_ids`	A string with the architecture identifiers.
`filter`	A string with the domains as filter.

Author(s)

Jose V. Die

genehummus: A pipeline to define gene families in Legumes and beyond

Description

genehummus is a pipeline with high specificity and sensitivity in extracting proteins from the RefSeq database (NCBI).

Author(s)

Jose V. Die [email protected], Moamen M. Elmassry, Kimberly H. LeBlanc, Olaitan I. Awe, Allissa Dillman, Ben Busby

Get the species name from the description sequence

Description

Parse a string vector with sequence descriptions (title and species) and extract the species name.

Usage

get_spp(description)

Arguments

description

A string vector with the sequence description (title and species).

Value

The species name extracted from each sequence description.
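The sketch below shows one way such parsing could work, assuming each description ends with the organism name in square brackets, as RefSeq protein titles commonly do. The helper name extract_species and the sample descriptions are illustrative assumptions and do not come from the package.

```r
# Hypothetical helper (not part of geneHummus): pull the species name out of
# a description, assuming the name sits in the final pair of square brackets.
extract_species <- function(description) {
  # Greedy ".*" moves to the last "[", then the group captures up to "]".
  # Descriptions without a trailing "[...]" are returned unchanged.
  sub("^.*\\[([^]]+)\\][[:space:]]*$", "\\1", description)
}

# Illustrative descriptions, made up for this example
descriptions <- c("auxin response factor 2 [Cicer arietinum]",
                  "auxin response factor 18 isoform X1 [Glycine max]")

species <- extract_species(descriptions)
species
```

Because sub() is vectorized, the helper accepts a string vector and returns one species name per element.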

Author(s)

Jose V. Die

Get the accession ids and the organism for each protein identifier

Description

The getAccessions function parses the protein page for each identifier and extracts the accession id (usually referred to as the XP accession in the RefSeq database) and the organism, given by its scientific name.

The accessions_by_spp and accessions_from_spp functions are convenient filters for further cleaning of the getAccessions output: they give the total number of XP accessions per species and extract the XP accessions for a given species, respectively.

Usage

getAccessions(protein_ids)

Arguments

protein_ids

A string vector containing protein identifiers.

Value

A data.frame of protein ids including columns:

accession
organism

Author(s)

Jose V. Die

Examples

## Not run: 
prot_ids <- c("593705262", "1379669790", "1150156484")
getAccessions(prot_ids)
## End(Not run)
 

Get the potential architecture identifiers for the conserved domains

Description

Parses the SPARCLE database (NCBI) and extracts the electronic identifiers for each conserved domain.

Usage

getArch_ids(gene_family)

Arguments

gene_family

A string with the conserved domain(s) defining the gene family. The domains must be listed in the order in which they appear in the sequences.

Value

The architecture identifiers for each of the conserved domains.

Author(s)

Jose V. Die

Examples

arf <- c("pfam06507")
getArch_ids(arf)


Get the description label for a protein architecture identifier

Description

Parses the architecture identifiers and extracts their corresponding labels.

Usage

getArch_labels(arch_ids)

Arguments

arch_ids

A string with the architecture electronic identifiers.

Value

Prints the description label for each candidate architecture that contains the proteins of interest.

Author(s)

Jose V. Die

Examples

filtered_archids <- c("12034188", "12034184")
getArch_labels(filtered_archids)


Get the RefSeq protein identifiers for the given taxonomic species

Description

Parse the RefSeq database using protein architecture identifiers (SPARCLE database) and extract the protein ids for the selected taxonomic species.

Usage

getProteins_from_tax_ids(arch_ids, taxonIds)

Arguments

`arch_ids`	A string with the electronic links for the SPARCLE architectures.
`taxonIds`	A vector with taxonomy ids; by default, the legume species available in RefSeq.

Value

RefSeq protein identifiers for selected species.

Author(s)

Jose V. Die

Examples

filtered_archids <- c("12034184")
medicago <- c(3880)
getProteins_from_tax_ids(filtered_archids, medicago)


Get the protein identifiers for a given architecture

Description

Parse the RefSeq database and extract all the protein identifiers that have a given architecture.

Usage

getProtlinks(archs_ids)

Arguments

archs_ids

A string with the architecture identifiers (SPARCLE database, NCBI)

Author(s)

Jose V. Die

Get the electronic architecture for a conserved domain

Description

Parses the SPARCLE database (NCBI) and extracts the electronic links for a given conserved domain.

Usage

getSparcleArchs(CD)

Arguments

`CD`	A string with the conserved domain(s)

Author(s)

Jose V. Die

Get description label for a protein architecture identifier

Description

Parses the architecture identifiers and extracts the corresponding labels.

Usage

labels_warning(arch_ids)

Arguments

arch_ids

A string with the architecture electronic identifiers.

Author(s)

Jose V. Die

NCBI taxonomy ids for the legume family

Description

Taxonomy identifiers for about 10,000 legume species.

Usage

data(legumesIds)

Format

A numeric vector with 10,009 elements.

Source

Taxonomy identifiers for Fabaceae species (Taxonomy database, NCBI).

ARF proteins per legume species

Description

XP accessions for the Auxin Response Factor gene family in the legume taxonomy (NCBI RefSeq database, as of September 2018).

Usage

data(my_legumes)

Format

a list of length 10 with 563 elements.

1. chickpea
2. medicago
3. soybean
4. arachis_duranensis
5. arachis_ipaensis
6. cajanus
7. vigna_angulata
8. vigna_radiata
9. phaseolus
10. lupinus

Source

Protein identifiers for Fabaceae species (RefSeq database, NCBI).

Get RefSeq protein identifiers for the given taxonomic species

Description

Parse the RefSeq database using protein architecture identifiers and extract the protein ids for the selected taxonomic species. Core function used by getProteins_from_tax_ids.

Usage

proteins_warning(arch_ids, taxonIds)

Arguments

`arch_ids`	A string with the electronic links for the SPARCLE architectures.
`taxonIds`	A vector with taxonomy ids.

Author(s)

Jose V. Die

Build a list containing N elements per list element

Description

Split a vector into a list whose elements each contain a given number of items.

Usage

sizeIds

Format

An object of class numeric of length 1.
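To illustrate the splitting idea described above, here is a small base R sketch that breaks a vector into a list of fixed-size chunks. The helper name chunk_ids and its size argument are hypothetical; how the package itself uses sizeIds is not shown in this manual.

```r
# Hypothetical helper (not part of geneHummus): split a vector into chunks
# of at most `size` items each.
chunk_ids <- function(ids, size) {
  # seq_along(ids) / size, rounded up, gives each position its chunk number
  groups <- ceiling(seq_along(ids) / size)
  split(ids, groups)
}

chunks <- chunk_ids(1:7, 3)
chunks
```

The last chunk holds whatever remains when the vector length is not a multiple of the chunk size.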

Author(s)

Jose V. Die
