Common Databases - HGNC

Official sites for common gene IDs

Several gene identifier standards are in common use: Ensembl ID, HGNC ID, Entrez ID (NCBI), and RefSeq ID.

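| Database | Link | Note |
| --- | --- | --- |
| Ensembl | https://asia.ensembl.org/index.html | |
| HGNC | https://www.genenames.org/ | |
| Entrez | https://www.ncbi.nlm.nih.gov/gene/672 | Example record |
| RefSeq | https://www.ncbi.nlm.nih.gov/nuccore/NM_031991.4 | Covers all species; rarely used |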

HGNC nomenclature guidelines: https://www.genenames.org/about/guidelines/

Building the gene-name mapping

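Gene names used in clinical production follow the relevant guidelines and quality-assessment specifications, so HGNC nomenclature is adopted as the single reference.
The HGNC website is updated continuously and has no fixed release cycle.
However, a snapshot is archived once a month. Routine updates should therefore use a snapshot version, so that results can be traced back to a specific release later.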

Snapshot archive index link

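Each snapshot contains the name mappings for all genes (including pseudogenes, non-coding RNAs, and so on). Filter it as needed to speed up downstream processing.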
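The filtering step can be sketched as follows. This is a minimal illustration, not part of the original workflow: the function name `filter_by_locus_group` is hypothetical, and the `locus_group` values used here (for example `protein-coding gene`) should be checked against the snapshot actually downloaded.

```python
import csv
import io


def filter_by_locus_group(tsv_text, keep_groups=("protein-coding gene",)):
    """Return the rows (as dicts) whose locus_group is in keep_groups.

    tsv_text is the content of a tab-separated HGNC snapshot including its
    header line. The default group name is an assumption; verify it against
    the real file before relying on it.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["locus_group"] in keep_groups]


if __name__ == "__main__":
    demo = (
        "hgnc_id\tsymbol\tlocus_group\n"
        "HGNC:5\tA1BG\tprotein-coding gene\n"
        "HGNC:37133\tA1BG-AS1\tnon-coding RNA\n"
    )
    for row in filter_by_locus_group(demo):
        print(row["hgnc_id"], row["symbol"])
```

Selecting by the `locus_group` column name rather than by position keeps the filter working if columns are added in later snapshots.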

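Obtaining gene IDs for the genes of interest

For the transcripts and genes of interest, obtain the correspondence between gene ID and annotated gene name under the current annotation system.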

# Using the genes and transcripts reported for the current version, pull the full gene ID / gene name list from the annotation file
grep -w -f /jdfstj1/B2C_COM_P1/PipeAdmin/04.Pipeline/aio.v2.Dev.liubo/chip_info/PanCancer_IDT_v1/PanCancer.v1.Trans.list /jdfstj1/B2C_COM_P1/PipeAdmin/04.Pipeline/aio.v2.Dev.liubo/database/Ref104/ncbi_anno_rel104.dbref > ncbi_ref104.geneSymble2ID

# In the NCBI-derived file ncbi_ref104.geneSymble2ID, columns 9 and 10 are the ones used for the later mapping
# Column 9: gene symbol as annotated by NCBI
# Column 10: entrez_id
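Reading those two columns into a lookup table might look like the sketch below. The function name `load_ncbi_entrez_to_symbol` is hypothetical, and the file is assumed to be tab-separated, which the original does not state; adjust `sep` if the real file differs.

```python
def load_ncbi_entrez_to_symbol(lines, symbol_col=9, entrez_col=10, sep="\t"):
    """Build {entrez_id: ncbi_symbol} from the lines of ncbi_ref104.geneSymble2ID.

    Column numbers are 1-based, matching the description in the text.
    The tab separator is an assumption about the file layout.
    """
    mapping = {}
    needed = max(symbol_col, entrez_col)
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < needed:
            continue  # skip short or malformed lines
        symbol = fields[symbol_col - 1].strip()
        entrez_id = fields[entrez_col - 1].strip()
        if symbol and entrez_id:
            mapping[entrez_id] = symbol
    return mapping


if __name__ == "__main__":
    # Made-up line: eight placeholder fields, then symbol and entrez_id.
    demo_line = "\t".join(["x"] * 8 + ["BRCA1", "672"]) + "\n"
    print(load_ncbi_entrez_to_symbol([demo_line]))
```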

Preparing the HGNC data

# First download the official HGNC name file; adjust the date in the file name as needed

wget http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/hgnc_complete_set_2021-06-01.txt
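Rather than editing the date by hand, the download URL can be generated. The sketch below assumes that every monthly snapshot follows the naming pattern of the single example above (dated the first day of the month); that pattern is inferred, not confirmed, and the function name `monthly_snapshot_url` is hypothetical.

```python
from datetime import date

# Base path taken from the wget command in the text.
BASE_URL = "http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/tsv/"


def monthly_snapshot_url(year, month):
    """Return the snapshot URL for a given month.

    Assumes files are named hgnc_complete_set_YYYY-MM-01.txt, as in the
    one example shown; check the archive index for the real file names.
    """
    first_day = date(year, month, 1)  # raises ValueError for an invalid month
    return f"{BASE_URL}hgnc_complete_set_{first_day.isoformat()}.txt"


if __name__ == "__main__":
    print(monthly_snapshot_url(2021, 6))
```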

The important columns are 1, 2, 9, 19, and 24.
# Extract the key mapping fields, for example:
awk -F '\t' '{print $1"\t"$19"\t"$2"\t"$9"\t"$24}' /jdfstj1/B2C_COM_P1/Research_and_Development/Database/HGNC/hgnc_complete_set_2021-06-01.txt |head
head hgnc_complete_set_2021-06-01.fit.tsv
hgnc_id entrez_id symbol alias_symbol refseq_accession
HGNC:5 1 A1BG NM_130786
HGNC:37133 503538 A1BG-AS1 FLJ23569 NR_015380
HGNC:24086 29974 A1CF "ACF|ASP|ACF64|ACF65|APOBEC1CF" NM_014576

# The main fields are:
# hgnc_id (column 1): HGNC gene ID
# entrez_id: NCBI gene ID
# symbol: approved gene symbol
# alias_symbol: alias symbols the gene has been known by
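The same extraction can be done by column name instead of column number, which avoids miscounting positions. This is a sketch under the assumption that the file is tab-separated with a header line and that multiple aliases are joined with `|`, as in the sample output above; `extract_mapping_fields` is a hypothetical name.

```python
import csv
import io

# The five fields pulled out by the awk command in the text.
FIELDS = ["hgnc_id", "entrez_id", "symbol", "alias_symbol", "refseq_accession"]


def extract_mapping_fields(tsv_text):
    """Return one dict per gene with the five mapping fields.

    alias_symbol is split on '|' into a list; an empty field gives [].
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        record = {name: (row.get(name) or "") for name in FIELDS}
        record["alias_symbol"] = [a for a in record["alias_symbol"].split("|") if a]
        records.append(record)
    return records


if __name__ == "__main__":
    demo = (
        "hgnc_id\tsymbol\talias_symbol\tentrez_id\trefseq_accession\n"
        "HGNC:5\tA1BG\t\t1\tNM_130786\n"
        'HGNC:24086\tA1CF\t"ACF|ASP|ACF64|ACF65|APOBEC1CF"\t29974\tNM_014576\n'
    )
    for rec in extract_mapping_fields(demo):
        print(rec)
```

The `csv` module strips the double quotes that wrap multi-valued fields, so the aliases come out clean.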
| Column | Field | Official description | Note |
| --- | --- | --- | --- |
| 1 | hgnc_id | HGNC ID. A unique ID created by the HGNC for every approved symbol. | HGNC gene ID |
| 2 | symbol | The HGNC approved gene symbol. Equates to the "APPROVED SYMBOL" field within the gene symbol report. | Approved gene symbol |
| 3 | name | HGNC approved name for the gene. Equates to the "APPROVED NAME" field within the gene symbol report. | |
| 4 | locus_group | A group name for a set of related locus types as defined by the HGNC (e.g. non-coding RNA). | |
| 5 | locus_type | The locus type as defined by the HGNC (e.g. RNA, transfer). | |
| 6 | status | Status of the symbol report, which can be either "Approved" or "Entry Withdrawn". | |
| 7 | location | Cytogenetic location of the gene (e.g. 2q34). | |
| 8 | location_sortable | Same as "location" but single digit chromosomes are prefixed with a 0 enabling them to be sorted in correct numerical order (e.g. 02q34). | |
| 9 | alias_symbol | Other symbols used to refer to this gene as seen in the "SYNONYMS" field in the symbol report. | Alias symbols for the gene |
| 10 | alias_name | Other names used to refer to this gene as seen in the "SYNONYMS" field in the gene symbol report. | |
| 11 | prev_symbol | Symbols previously approved by the HGNC for this gene. Equates to the "PREVIOUS SYMBOLS & NAMES" field within the gene symbol report. | |
| 12 | prev_name | Gene names previously approved by the HGNC for this gene. Equates to the "PREVIOUS SYMBOLS & NAMES" field within the gene symbol report. | |
| 13 | gene_group | Name given to a gene family or group the gene has been assigned to. Equates to the "GENE FAMILY" field within the gene symbol report. | |
| 14 | gene_group_id | ID used to designate a gene family or group the gene has been assigned to. | |
| 15 | date_approved_reserved | The date the entry was first approved. | |
| 16 | date_symbol_changed | The date the gene symbol was last changed. | |
| 17 | date_name_changed | The date the gene name was last changed. | |
| 18 | date_modified | Date the entry was last modified. | |
| 19 | entrez_id | Entrez gene ID. Found within the "GENE RESOURCES" section of the gene symbol report. | NCBI gene ID |
| 20 | ensembl_gene_id | Ensembl gene ID. Found within the "GENE RESOURCES" section of the gene symbol report. | |
| 21 | vega_id | Vega gene ID. | |
| 22 | ucsc_id | UCSC gene ID. Found within the "GENE RESOURCES" section of the gene symbol report. | |
| 23 | ena | International Nucleotide Sequence Database Collaboration (GenBank, ENA and DDBJ) accession number(s). Found within the "NUCLEOTIDE SEQUENCES" section of the gene symbol report. | |
| 24 | refseq_accession | RefSeq nucleotide accession(s). Found within the "NUCLEOTIDE SEQUENCES" section of the gene symbol report. | Transcript ID provided by HGNC |
| 25 | ccds_id | Consensus CDS ID. Found within the "NUCLEOTIDE SEQUENCES" section of the gene symbol report. | |
| 26 | uniprot_ids | UniProt protein accession. Found within the "PROTEIN RESOURCES" section of the gene symbol report. | |
| 27 | pubmed_id | Pubmed and Europe Pubmed Central PMID(s). | |
| 28 | mgd_id | Mouse genome informatics database ID. Found within the "HOMOLOGS" section of the gene symbol report. | |
| 29 | rgd_id | Rat genome database gene ID. Found within the "HOMOLOGS" section of the gene symbol report. | |
| 30 | lsdb | The name of the Locus Specific Mutation Database and URL for the gene, separated by a pipe character. | |
| 31 | cosmic | Symbol used within the Catalogue of somatic mutations in cancer for the gene. | |
| 32 | omim_id | Online Mendelian Inheritance in Man (OMIM) ID | |
| 33 | mirbase | miRBase ID | |
| 34 | homeodb | Homeobox Database ID | |
| 35 | snornabase | snoRNABase ID | |
| 36 | bioparadigms_slc | Symbol used to link to the SLC tables database at bioparadigms.org for the gene | |
| 37 | orphanet | Orphanet ID | |
| 38 | pseudogene.org | Pseudogene.org | |
| 39 | horde_id | Symbol used within HORDE for the gene | |
| 40 | merops | ID used to link to the MEROPS peptidase database | |
| 41 | imgt | Symbol used within international ImMunoGeneTics information system | |
| 42 | iuphar | The objectId used to link to the IUPHAR/BPS Guide to PHARMACOLOGY database. To link to IUPHAR/BPS Guide to PHARMACOLOGY database only use the number (only use 1 from the result objectId:1) | |
| 43 | kznf_gene_catalog | ID used to link to the Human KZNF Gene Catalog | |
| 44 | mamit-trnadb | ID to link to the Mamit-tRNA database | |
| 45 | cd | Symbol used within the Human Cell Differentiation Molecule database for the gene | |
| 46 | lncrnadb | lncRNA Database ID | |
| 47 | enzyme_id | ENZYME EC accession number | |
| 48 | intermediate_filament_db | ID used to link to the Human Intermediate Filament Database | |
| 49 | rna_central_ids | RNAcentral ID(s) | |
| 50 | lncipedia | The gene symbol used to link to LNCipedia - a comprehensive compendium of human long non-coding RNAs. | |
| 51 | gtrnadb | GtRNAdb ID | |
| 52 | agr | The HGNC ID that the Alliance of Genome Resources (AGR) have linked to their record of the gene. Use the HGNC ID to link to the AGR. | |
| 53 | mane_select | NCBI and Ensembl transcript IDs/accessions including the version number for one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene. The IDs are delimited by a pipe character. | |
| 54 | gencc | The HGNC ID used within the GenCC database as the unique identifier of their gene reports within the GenCC database. | |

Source of the field descriptions

Updating the HGNC mapping file

# Generate the HGNC gene-name conversion file
perl /jdfstj1/B2C_COM_P1/Research_and_Development/Database/HGNC/creat.change_genelist.pl \
  -hgnc hgnc_complete_set_2021-06-01.txt \
  -ncbi ncbi_ref104.geneSymble2ID \
  -o change_gene.list.tmp

Example of the result file (change_gene.list.tmp):

entrez_id HGNC_symble NCBI_symble Tag
1 A1BG A1BG Match
29974 A1CF A1CF Match
2 A2M A2M Match
144568 A2ML1 A2ML1 Match
53947 A4GALT A4GALT Match
51146 A4GNT A4GNT Match
8086 AAAS AAAS Match
65985 AACS AACS Match
13 AADAC AADAC Match

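The first three columns hold the gene information. The fourth column, Tag, indicates whether the two sources agree:

Match means the HGNC and NCBI records for that gene ID can be matched.

MisMatch means the gene IDs from the two sources cannot be matched.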
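The Perl script itself is not shown, so the sketch below is only a guess at tagging logic that would be consistent with the sample output: join the two sources on `entrez_id` and compare the symbols. The function name `tag_symbols` and the handling of IDs missing from HGNC (tagged MisMatch with an empty HGNC symbol) are assumptions.

```python
def tag_symbols(hgnc_by_entrez, ncbi_by_entrez):
    """Return (entrez_id, hgnc_symbol, ncbi_symbol, tag) rows.

    Both arguments map entrez_id -> gene symbol. A row is tagged "Match"
    when the two symbols are identical and "MisMatch" otherwise, including
    when the entrez_id is absent from the HGNC mapping. This mirrors the
    layout of change_gene.list.tmp but is not the original script's code.
    """
    rows = []
    for entrez_id, ncbi_symbol in ncbi_by_entrez.items():
        hgnc_symbol = hgnc_by_entrez.get(entrez_id, "")
        tag = "Match" if hgnc_symbol == ncbi_symbol else "MisMatch"
        rows.append((entrez_id, hgnc_symbol, ncbi_symbol, tag))
    return rows


if __name__ == "__main__":
    hgnc = {"1": "A1BG", "29974": "A1CF"}
    ncbi = {"1": "A1BG", "29974": "A1CF"}
    print("entrez_id\tHGNC_symble\tNCBI_symble\tTag")
    for row in tag_symbols(hgnc, ncbi):
        print("\t".join(row))
```

Rows tagged MisMatch are the ones worth reviewing by hand, for instance by checking whether the NCBI symbol appears among the HGNC alias or previous symbols.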

After the update, record the database update date and the operator in the readme in that directory.

eg:
# liubo4 @ 20210616
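To keep readme entries in a consistent format, the line can be generated. This is a small sketch that reproduces the `# operator @ YYYYMMDD` format of the example above; the function name `readme_entry` is hypothetical.

```python
from datetime import date


def readme_entry(operator, when):
    """Format a readme line as '# <operator> @ <YYYYMMDD>'."""
    return f"# {operator} @ {when.strftime('%Y%m%d')}"


if __name__ == "__main__":
    print(readme_entry("liubo4", date(2021, 6, 16)))
```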