Accessing the Ensembl Genome Database with biomaRt in R

Posted on 2017-08-16 Edited on 2024-02-18

Why We Hate Cilantro

I want to analyze SNPs in the coding regions of olfactory receptors. In 2012, researchers studied SNPs correlated with cilantro preference, and they found that one SNP, rs72921001, influences how people perceive cilantro. They said rs72921001 is a frequently occurring SNP (an A -> C change in the DNA sequence) in OR6A2, an olfactory receptor gene.

Reasonable, right? My first thought was that this could cause a missense mutation, changing an amino acid. However, when I searched for this SNP in Ensembl, I found that it is located in the upstream flanking region of OR10A2, and OR6A2 is another gene… Now I think maybe the name was changed in the meantime, and rs72921001 might influence the expression of OR10A2 at the gene level… Umm.

Can I get more information about SNPs in olfactory receptors? Let’s try with R!

Ensembl Database

Speaking of genome databases, Ensembl is the one I prefer and use. I’ll copy the introduction from their main page…

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

How can I access large datasets from Ensembl? Officially, they provide ready-made tools, and BioMart is the most powerful of them. What I really want to use, though, is scripts, so I found this: biomaRt. It can access large datasets in BioMart databases, including Ensembl, COSMIC, Uniprot, HGNC, Gramene, Wormbase and dbSNP mapped to Ensembl.

Access Ensembl Database by biomaRt

First, install it from Bioconductor:


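## biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("biomaRt", "Biostrings"))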

Then load biomaRt and Biostrings (for sequence processing):

library(biomaRt)
library(Biostrings)


setwd("C:\\Users\\Birdlet\\Desktop\\OR\\OR-Ensembl")

# Connect to the Ensembl variation mart (human SNPs) through the Asia mirror
ensembl <- useEnsembl(biomart = "snp", dataset = "hsapiens_snp", mirror = "asia")

# FUN: Fetch missense SNP data for one olfactory receptor gene and save it as CSV
#   OR:      Ensembl gene ID to query
#   OR_name: label used for the output file name (the 'data' folder must exist)
fetch_OR <- function(OR, OR_name) {
    filename <- file.path("data", paste0(OR_name, ".csv"))
    fetched <- getBM(attributes = c('ensembl_gene_stable_id', 'allele', 'refsnp_id',
                                    'refsnp_source', 'consequence_type_tv', 'minor_allele_freq',
                                    'sift_score', 'polyphen_score',
                                    'ensembl_peptide_allele', 'translation_start'),
                     filters = c('ensembl_gene', 'minor_allele_freq_second', 'so_mini_parent_name'),
                     # values are matched to filters by position
                     values = list(OR, '0.001', 'missense_variant'),
                     mart = ensembl)
    # Write the rows without a header line
    write.table(fetched, file = filename, quote = FALSE, row.names = FALSE,
                col.names = FALSE, sep = ',', fileEncoding = 'utf-8')
}

First, I use the useEnsembl function to connect to the Ensembl server, with parameters that select the “SNP” mart for “Homo sapiens”. To speed up the connection, I also use the mirror in Asia rather than the default.

Next, I define a function for later use. The getBM function fetches data from Ensembl: I request attributes such as ‘ensembl_gene_stable_id’, ‘allele’, and ‘minor_allele_freq’, set the ‘minor_allele_freq_second’ filter to ‘0.001’, and keep only missense variants. That gives me each gene’s missense SNPs with a MAF larger than 0.1%.

For more details, please see the biomaRt manual. ;P
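The attribute and filter names passed to getBM are not obvious, so here is a minimal sketch of how one might look them up. listAttributes and listFilters are real biomaRt functions that return data frames with name and description columns; match_fields is a hypothetical helper of mine, written in base R so it can be tried without a connection. The live queries at the bottom need network access and only run in an interactive session.

```r
# Hypothetical helper (not part of biomaRt): keep the rows of a
# listAttributes()/listFilters() table whose name or description
# matches a regular expression, ignoring case.
match_fields <- function(tbl, pattern) {
    hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
           grepl(pattern, tbl$description, ignore.case = TRUE)
    tbl[hit, c("name", "description"), drop = FALSE]
}

# The live lookups need a network connection and the biomaRt package,
# so they are skipped outside an interactive session.
if (interactive() && requireNamespace("biomaRt", quietly = TRUE)) {
    snp_mart <- biomaRt::useEnsembl(biomart = "snp", dataset = "hsapiens_snp")
    print(match_fields(biomaRt::listAttributes(snp_mart), "sift|polyphen"))
    print(match_fields(biomaRt::listFilters(snp_mart), "minor_allele"))
}
```

biomaRt also ships searchAttributes() and searchFilters(), which do a similar pattern search directly on the mart.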

Access each SNP record from Ensembl

After the preparation, I can feed in a file of gene IDs and wait for the data to download!
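The batch step could be sketched roughly like this, assuming a plain text file with one Ensembl gene ID per line. fetch_all, its arguments, and the file name in the usage line are all hypothetical names of mine; the only thing taken from the post is fetch_OR, which is used as the default fetch function and is expected to write into the 'data' folder.

```r
# Hypothetical helper (not from the original post): call a fetch function
# for every gene ID listed in a text file, one ID per line.
fetch_all <- function(id_file, fetch = fetch_OR, pause = 1, out_dir = "data") {
    ids <- trimws(readLines(id_file, warn = FALSE))
    ids <- unique(ids[nzchar(ids)])            # drop blank lines and duplicates
    dir.create(out_dir, showWarnings = FALSE)  # make sure the output folder exists
    failed <- character(0)
    for (id in ids) {
        ok <- tryCatch({
            fetch(id, id)                      # reuse the gene ID as the file label
            TRUE
        }, error = function(e) {
            message("Failed for ", id, ": ", conditionMessage(e))
            FALSE
        })
        if (!ok) failed <- c(failed, id)
        Sys.sleep(pause)                       # short pause between requests
    }
    invisible(failed)                          # IDs that could not be fetched
}
```

A call such as failed <- fetch_all("or_gene_ids.txt") would then leave one CSV per gene in the data folder and return the IDs worth retrying. Catching errors per gene is a design choice: a single timeout should not throw away the rest of the batch.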

ENSG00000221858,G/A,rs9655672,dbSNP,missense_variant,0.00858626,0.13,0.361,A/T,223
ENSG00000221858,C/A/T,rs80128486,dbSNP,missense_variant,0.0081869,0.07,0.1,L/I,279
ENSG00000221858,C/A/T,rs80128486,dbSNP,missense_variant,0.0081869,0.21,0.06,L/F,279
ENSG00000221858,G/A,rs142231743,dbSNP,missense_variant,0.00419329,0.08,0.376,V/I,183
ENSG00000221858,T/A,rs143394460,dbSNP,missense_variant,0.00159744,0,0.986,I/N,91
ENSG00000221858,G/A,rs34947817,dbSNP,missense_variant,0.114217,0.3,0.007,S/N,26
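Because the files are written without a header line, the columns have to be named when reading them back. Here is a small sketch of that, using three of the rows above inlined as text so it runs without a download. The column names simply follow the order of the attributes in fetch_OR; the SIFT cutoff of 0.05 is the conventional threshold for "deleterious" and is my assumption here, not something the query itself applies.

```r
# Column names in the same order as the 'attributes' argument of fetch_OR()
cols <- c("ensembl_gene_stable_id", "allele", "refsnp_id", "refsnp_source",
          "consequence_type_tv", "minor_allele_freq", "sift_score",
          "polyphen_score", "ensembl_peptide_allele", "translation_start")

# Three of the rows shown above, inlined for the example; for a real file,
# pass its path to read.csv() instead of the 'text' argument.
sample_rows <- "ENSG00000221858,G/A,rs9655672,dbSNP,missense_variant,0.00858626,0.13,0.361,A/T,223
ENSG00000221858,T/A,rs143394460,dbSNP,missense_variant,0.00159744,0,0.986,I/N,91
ENSG00000221858,G/A,rs34947817,dbSNP,missense_variant,0.114217,0.3,0.007,S/N,26"

snps <- read.csv(text = sample_rows, header = FALSE, col.names = cols,
                 stringsAsFactors = FALSE)

# Sort by minor allele frequency, most common variant first
snps <- snps[order(-snps$minor_allele_freq), ]

# Assumed cutoff: SIFT scores below 0.05 are usually read as "deleterious"
deleterious <- snps[snps$sift_score < 0.05, ]
print(deleterious[, c("refsnp_id", "ensembl_peptide_allele",
                      "translation_start", "sift_score")])
```

With these three rows, only rs143394460 (I/N at position 91) falls under the cutoff.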

Update Later… Updated!

You can find more details about how to use it at https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html. You can also simply use the built-in help in R, which is really convenient in RStudio.

BTW, I dislike cilantro but now I can accept it… Genes and environments together determine what we are!