For this project, we are considering switching to NCBI's Entrez Gene and would like feedback. "The primary goals of Entrez Gene are to provide tracked, unique identifiers for genes and to report information associated with those identifiers for unrestricted public use. The identifier that is assigned (GeneID) is an integer, and is species specific [4]."
The main advantages of Entrez versus HGNC for gene identification include:
more species than human
GeneIDs, which are less error-prone than ambiguous gene symbols
integration with many other NCBI services such as HomoloGene, which can relate orthologous genes across species
The main disadvantage is unfamiliarity, as most biologists think of human genes in terms of their HGNC symbols. Although symbol information is available in Entrez Gene, there is no guarantee that each Entrez Gene record has a single corresponding, current HGNC symbol.
I am interested in:
the best way to retrieve, store, parse, and map to Entrez Gene records
how stable Entrez Gene identifiers are for protein-coding genes in humans
the difficulty of updating to new versions of the Entrez Gene database
We currently use Entrez Gene internally in our lab. We store the gene records in a SQL database along with their associated features. We've started using Elasticsearch to map from sources that use multiple alternative identifiers or that include aliases, but in general we try to convert everything to Entrez GeneIDs.
There are some problems with the Entrez <-> HGNC mapping, but they are relatively minor, especially for protein-coding genes. IMP, GIANT, Tribe, and our other servers use Entrez internally and map to symbols for display purposes.
I have talked to people who are discussing more sophisticated systems to generate identifiers that unify the databases through an automated process capable of resolving ambiguities, and I am hopeful that some of those projects will come to fruition.
We have proceeded with Entrez Gene for gene identification.
@caseygreene, do you have any advice or information on how to build the SQL database? I found this site which provides instructions and Perl scripts. Do you use the same schema?
We have done it a couple of ways. Currently we have an EntrezID field that's an index, a systematic name that's an index (for human, this is the HGNC identifier), a standard name (you won't need this for human only), the gene description, a foreign key to the organism (again, not needed if you only work with human), the aliases (a space-separated list of previous/alternative names, used only for full text search), whether or not the gene is now obsolete (used during updates), and a few other things that we use for search.
For other identifiers, we have a table of cross reference databases, which has a name (index) and a URL. The URL contains placeholder characters that mark where the ID for that database should be substituted.
We then have a table of cross references, which has foreign keys to both the cross reference database and gene, as well as a cross reference id (also db index).
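A minimal sketch of the three tables described above, assuming SQLite and Python's standard library rather than the Django ORM the lab actually uses. All table and column names, the `{id}` placeholder convention, and the sample row values are hypothetical; the organism foreign key and standard name are left out because they are not needed for a human-only database.

```python
import sqlite3

# Hypothetical names throughout; this only mirrors the description above.
SCHEMA = """
CREATE TABLE gene (
    id INTEGER PRIMARY KEY,
    entrez_id INTEGER NOT NULL UNIQUE,   -- UNIQUE also creates an index
    systematic_name TEXT,                -- HGNC identifier for human
    description TEXT,
    aliases TEXT,                        -- space-separated, full text search only
    obsolete INTEGER NOT NULL DEFAULT 0  -- flipped during updates
);
CREATE INDEX gene_systematic_name ON gene (systematic_name);

CREATE TABLE xref_db (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    url TEXT NOT NULL                    -- template; '{id}' marks where the ID goes
);

CREATE TABLE xref (
    id INTEGER PRIMARY KEY,
    gene_id INTEGER NOT NULL REFERENCES gene (id),
    xref_db_id INTEGER NOT NULL REFERENCES xref_db (id),
    xref_id TEXT NOT NULL
);
CREATE INDEX xref_xref_id ON xref (xref_id);
"""


def xref_urls(conn, entrez_id):
    """Return (database name, filled-in URL) pairs for one gene."""
    rows = conn.execute(
        """
        SELECT xref_db.name, xref_db.url, xref.xref_id
        FROM xref
        JOIN gene ON gene.id = xref.gene_id
        JOIN xref_db ON xref_db.id = xref.xref_db_id
        WHERE gene.entrez_id = ?
        ORDER BY xref_db.name
        """,
        (entrez_id,),
    ).fetchall()
    return [(name, url.replace("{id}", xid)) for name, url, xid in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows: the names, URL and cross-reference ID are made up.
conn.execute(
    "INSERT INTO gene (id, entrez_id, systematic_name, description) "
    "VALUES (1, 2867, 'placeholder-hgnc-id', 'placeholder description')"
)
conn.execute(
    "INSERT INTO xref_db (id, name, url) "
    "VALUES (1, 'ExampleDB', 'https://db.example.org/record/{id}')"
)
conn.execute("INSERT INTO xref (gene_id, xref_db_id, xref_id) VALUES (1, 1, 'ABC123')")
print(xref_urls(conn, 2867))
```

Keeping the URL as a template on the cross-reference database row means a link can be built for any cross-reference without storing a full URL per record.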
If you want Python code to generate this and/or load identifiers using the Django ORM, we can supply it. We might also be able to open source it as part of a pip-installable Django app on PyPI. This is on our to-do list, so we could potentially reprioritize if this is particularly useful to you.
Casey Greene: I should add that the full text search is basically a last-ditch effort to find genes without any matching identifiers in the structured databases. We use this on the description and alias fields, as well as all of the cross reference IDs. We use Elasticsearch via django-haystack to power this. It's pretty easy to set this up on AWS or locally. We previously used Solr, but the benefits didn't outweigh the configuration headache.
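To illustrate the lookup order described above (structured identifiers first, free text only as a fallback), here is a rough sketch in plain Python. It is a stand-in for the idea, not the Elasticsearch/django-haystack setup itself: the function name, the in-memory dictionaries, and the toy records are all hypothetical, and the "full text" step is reduced to a case-insensitive whole-token match.

```python
def find_genes(query, genes, xrefs):
    """Resolve a query string to GeneIDs, trying structured matches first.

    genes: dict mapping GeneID -> {"description": str, "aliases": str}
    xrefs: dict mapping cross-reference ID -> GeneID
    """
    # 1. Exact GeneID match.
    if query.isdigit() and int(query) in genes:
        return [int(query)]
    # 2. Exact cross-reference match.
    if query in xrefs:
        return [xrefs[query]]
    # 3. Last resort: token match on the alias and description fields.
    token = query.lower()
    hits = []
    for gene_id, record in genes.items():
        text = f"{record['aliases']} {record['description']}".lower()
        if token in text.split():
            hits.append(gene_id)
    return sorted(hits)


# Toy records with made-up IDs and names.
genes = {
    900001: {"description": "toy gene one", "aliases": "FOO1 FOOA"},
    900002: {"description": "toy gene two", "aliases": "BAR2"},
}
xrefs = {"XDB:0001": 900001}

print(find_genes("900002", genes, xrefs))    # exact GeneID
print(find_genes("XDB:0001", genes, xrefs))  # cross-reference
print(find_genes("fooa", genes, xrefs))      # alias fallback
```

A real search index would add ranking and fuzzy matching; the point here is only the ordering of the three steps.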
@caseygreene, thanks for the description of your database setup. In the immediate term, I don't need any of the advanced features that your design accommodates, such as Elasticsearch and efficient lookup, so I just did a simple parsing of the human subset and exported three TSV files (genes, symbols, cross-references).
Therefore, don't reprioritize for me, but I think the PyPI package is a great idea. It looks like your Tribe API already supports Entrez gene lookup. However, I'm confused about the usage, since the demo code is equivalent to:
How do you specify the query string (the gene symbol/name for which you want the GeneID)? It may be the case that your API already provides most of the functionality a user may need. In that case, a local Entrez Gene database may not be needed at all.
Summary: my vote is for a powerful, well-documented API as a primary resource, with an open source codebase.
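For reference, the "simple parsing" into three tables (genes, symbols, cross-references) mentioned earlier in this post could look roughly like the sketch below. It assumes a tab-delimited input with column names modeled on NCBI's gene_info file, with pipe-delimited `Synonyms` and `dbXrefs` fields and `-` for missing values; the header here is simplified and the rows are invented, so treat the layout as an assumption to check against the real file.

```python
import csv
import io


def split_gene_info(handle, tax_id="9606"):
    """Split gene_info-style rows into genes, symbols and cross-references."""
    genes, symbols, xrefs = [], [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["tax_id"] != tax_id:
            continue  # keep only the requested species
        gene_id = int(row["GeneID"])
        genes.append((gene_id, row["Symbol"], row["type_of_gene"], row["description"]))
        symbols.append((gene_id, "symbol", row["Symbol"]))
        if row["Synonyms"] != "-":
            for synonym in row["Synonyms"].split("|"):
                symbols.append((gene_id, "synonym", synonym))
        if row["dbXrefs"] != "-":
            for xref in row["dbXrefs"].split("|"):
                # Split on the first colon only; some IDs contain colons themselves.
                resource, _, identifier = xref.partition(":")
                xrefs.append((gene_id, resource, identifier))
    return genes, symbols, xrefs


# Invented rows; GeneIDs, symbols and resources are placeholders.
toy = (
    "tax_id\tGeneID\tSymbol\tSynonyms\tdbXrefs\tdescription\ttype_of_gene\n"
    "9606\t900001\tFOO1\tFOOA|FOOB\tDBX:123|DBY:DBY:45\ttoy gene one\tprotein-coding\n"
    "9606\t900002\tBAR2\t-\t-\ttoy gene two\tncRNA\n"
    "10090\t900003\tFoo1\t-\tDBX:999\ttoy mouse gene\tprotein-coding\n"
)
genes, symbols, xrefs = split_gene_info(io.StringIO(toy))
for name, table in [("genes", genes), ("symbols", symbols), ("xrefs", xrefs)]:
    print(name, table)
```

Each returned list can be written out with `csv.writer(..., delimiter="\t")` to produce the three TSV files.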
For identifier mapping, you might want to check out BridgeDb, which provides both mapping databases and libraries to add identifier mapping functionality to any project. It's 100% free and open source.
There are many ways to integrate BridgeDb into your own tool or resource. The easiest is simply to make web service calls, like:
If performance is an issue, e.g., you want to query 10,000 times a day, then you can install the databases locally and integrate the libraries provided by the project into your tool. That gives you complete control over database versions, etc.
@alexanderpico, thanks for the BridgeDb suggestion. It looks like several transcript/gene/protein resources are integrated, including HGNC, Entrez Gene, Affy, Illumina, WikiGenes, UniGene, UCSC Genome Browser, UniProt, RefSeq, miRBase, and Ensembl. That's great, although we may or may not need these mappings at this point.
One worry I have is that the resource is outdated. The build date for human gene products is 2013-07-01. However, on 2014-11-21 version 2.0.0 was released. Does this mean the database was also rebuilt? In either case, I would like more frequent updates. Do you know the status of the project and whether it is actively maintained?
One final note is that @caseygreene's Tribe service allows free-text gene lookup through Elasticsearch. Currently, we do not need this feature. However, perhaps @alexanderpico's Pathways4Life project [1] does. Also, perhaps Tribe (a gene set wiki with a private option) would like to autopopulate WikiPathways.
Right. The database build system was recently updated from using Ensembl's Perl API to using BioMart. This will allow frequent updates; probably quarterly.
We only want genes for non-extinct Homo sapiens (tax_id = 9606). We've updated our Entrez Gene processing to filter for a 9606 tax_id.
The downstream effects of this update should be minimal, since only 73 genes were removed (all mitochondrial). However, we may rebuild some of our resources if necessary. The inclusion of these genes should only present problems when matching by symbol rather than GeneID. We avoid matching by symbol whenever possible.
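To make the symbol-matching concern concrete, the sketch below groups GeneIDs by symbol and reports symbols that map to more than one gene, before and after restricting to tax_id 9606. The function names and records are hypothetical, and 99999 is a placeholder tax_id standing in for any non-9606 entry.

```python
from collections import defaultdict


def human_only(records, tax_id=9606):
    """Keep only records for the given tax_id."""
    return [record for record in records if record[0] == tax_id]


def ambiguous_symbols(records):
    """Return {symbol: sorted GeneIDs} for symbols shared by several genes.

    records: iterable of (tax_id, gene_id, symbol) tuples.
    """
    by_symbol = defaultdict(set)
    for _tax_id, gene_id, symbol in records:
        by_symbol[symbol].add(gene_id)
    return {s: sorted(ids) for s, ids in by_symbol.items() if len(ids) > 1}


# Invented records; 99999 is a placeholder for a non-9606 tax_id.
records = [
    (9606, 900001, "FOO1"),
    (99999, 900002, "FOO1"),
    (9606, 900003, "BAR2"),
    (9606, 900004, "BAR2"),
    (9606, 900005, "BAZ3"),
]
print(ambiguous_symbols(records))
print(ambiguous_symbols(human_only(records)))
```

In this toy data the tax_id filter removes the FOO1 collision but not the BAR2 one, which is why matching by GeneID remains the safer default.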
Mike Murphy, RefSeq Curator at NCBI/NLM/NIH, responded to our inquiry regarding the duplicate symbols. With permission, we've copied his response below:
Thank you for your notification of two cases where the same symbol is used to represent different human GeneIDs. In each case, one of the symbols is "official" (as determined by the Human Gene Nomenclature Committee) and the other is "unofficial". We consistently use official nomenclature for the gene feature, when available. Unfortunately, situations do arise where the same symbol is used in an official and unofficial capacity on different loci. It is our general policy to retain shared symbols and names on different loci for query and retrieval purposes by various users of our database. However, in both of the cases you pointed out, the two genes with the same symbol really represent the same gene. Therefore, I merged GeneID 105369145 into GeneID 266553, and I merged GeneID 105372382 into GeneID 2867. These updates should be publicly visible within a couple of days.
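Merges like these matter for anyone storing GeneIDs locally, since a stored ID can become outdated. A small sketch of how a local resource might map old IDs to current ones, using the two merges quoted above; the dictionary layout and function name are assumptions, and how the merge records are obtained is left open.

```python
# Old GeneID -> GeneID it was merged into (the two merges quoted above).
MERGED = {105369145: 266553, 105372382: 2867}


def current_gene_id(gene_id, merged):
    """Follow merge records until reaching an ID that has not been merged."""
    seen = set()
    while gene_id in merged:
        if gene_id in seen:
            raise ValueError(f"cycle in merge records at GeneID {gene_id}")
        seen.add(gene_id)
        gene_id = merged[gene_id]
    return gene_id


print(current_gene_id(105369145, MERGED))
print(current_gene_id(2867, MERGED))  # already current, returned unchanged
```

The loop handles chains (an ID merged into another ID that is itself merged later), and the `seen` set guards against malformed records that would otherwise loop forever.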
Sorry that I'm a bit late to posting this. Our final list of 56,352 human Entrez Genes is available in genes-human.tsv[1]. The dhimmel/entrez-gene repository also contains several convenient files for doing analyses using Entrez Genes.
In Hetionet v1.0, we only included the 20,945 protein-coding genes.
HGNC, the authority in charge of human gene nomenclature, curates a resource of gene families[1]. Families are described as follows [2]:
Genes are grouped into families based on homology or a shared characteristic such as a common function and/or phenotype, or membership of a complex.
These gene families are a helpful resource for grouping genes together. For example, GRIA2 (glutamate ionotropic receptor AMPA type subunit 2) and GRIN2C (glutamate ionotropic receptor NMDA type subunit 2C) both belong to the "Glutamate receptors" family.
I created the dhimmel/hgnc repository to annotate Entrez Genes with their corresponding HGNC gene families (versioned repo link).
Gene families in HGNC are arranged into a hierarchy. I'd venture to call it an ontology, but I haven't seen HGNC use that term. The network of families is a directed acyclic graph, so families can have multiple superfamilies. For example, "Glutamate ionotropic receptors" is a subfamily of both "Glutamate receptors" and "Ligand gated ion channels". Note that there is no single root family. Instead there are many disconnected family hierarchies; for example, few families are more general than "Ion channels".
Genes can be directly annotated to multiple families. For example, AAAS (aladin WD repeat nucleoporin) belongs to the "Nucleoporins" and "WD repeat domain containing" families. In addition to direct family annotations, dhimmel/hgnc propagates annotations so that genes belonging to "Glutamate metabotropic receptors" also belong to "Glutamate receptors".
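The propagation step can be sketched as a walk up the family graph. This is an illustration under stated assumptions, not the actual dhimmel/hgnc code: the function names are made up, the family relationships are only the ones mentioned above, and `gene_a` and `gene_b` are placeholder genes.

```python
def ancestors(family, parents):
    """Return every superfamily reachable from `family` in the DAG."""
    found = set()
    stack = [family]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(direct, parents):
    """Map each gene to its direct families plus all of their superfamilies."""
    return {
        gene: set(families).union(*(ancestors(f, parents) for f in families))
        for gene, families in direct.items()
    }


# Subfamily -> superfamilies, limited to the relationships named in the text.
parents = {
    "Glutamate ionotropic receptors": ["Glutamate receptors", "Ligand gated ion channels"],
    "Glutamate metabotropic receptors": ["Glutamate receptors"],
}
# Placeholder genes with one direct annotation each.
direct = {
    "gene_a": ["Glutamate ionotropic receptors"],
    "gene_b": ["Glutamate metabotropic receptors"],
}
propagated = propagate(direct, parents)
for gene, families in propagated.items():
    print(gene, sorted(families))
```

Because a family can have several superfamilies, the walk tracks visited nodes in a set so that shared ancestors are only added once.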