Official sites for common gene ID systems
There are several gene ID standards in use: Ensembl ID, HGNC ID, Entrez ID (NCBI), and RefSeq ID.
Database | Link |
---|---|
Ensembl | https://asia.ensembl.org/index.html |
HGNC | https://www.genenames.org/ |
Entrez | https://www.ncbi.nlm.nih.gov/gene/672 (example) |
RefSeq | https://www.ncbi.nlm.nih.gov/nuccore/NM_031991.4 (all species; rarely used) |
https://www.genenames.org/about/guidelines/
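Building the gene-name mapping
Because these gene names are used in clinical production workflows, and in line with the relevant guidelines and quality-assessment standards, HGNC nomenclature is used as the single reference.
The HGNC site is updated continuously, with no fixed release cycle.
It does, however, keep a monthly snapshot of the data. Routine updates should therefore use a snapshot version so that results can be traced back later.
A snapshot contains the name mappings for all genes (including pseudogenes, non-coding RNAs, and so on); filter it as needed to speed up downstream steps.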
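As a rough illustration of that filtering step, the sketch below keeps only the rows whose `locus_group` is in a chosen set. It assumes the snapshot is a tab-separated file with a header row; the function name `filter_by_locus_group`, the default group value `"protein-coding gene"`, and the three-row sample are assumptions made for this example, not part of the original workflow.

```python
import csv
import io


def filter_by_locus_group(tsv_text, keep_groups=("protein-coding gene",)):
    """Return the rows (as dicts) whose locus_group is in keep_groups."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["locus_group"] in keep_groups]


# Illustrative excerpt containing only the columns used here (assumed values).
sample = (
    "hgnc_id\tsymbol\tlocus_group\n"
    "HGNC:5\tA1BG\tprotein-coding gene\n"
    "HGNC:37133\tA1BG-AS1\tnon-coding RNA\n"
    "HGNC:8\tA2MP1\tpseudogene\n"
)

kept = filter_by_locus_group(sample)
print([row["symbol"] for row in kept])
```

In practice the text would come from the downloaded snapshot file, and `keep_groups` would be set to whatever categories the downstream analysis needs.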
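Obtaining gene IDs for the genes of interest
For the transcripts and genes of interest, obtain the correspondence between gene IDs and annotated gene names under the current annotation system.
# Based on the gene and transcript information submitted for the current version, use the annotation config file to produce the full list file mapping gene IDs to gene names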
Preparing the HGNC data
# First download the official HGNC name data; adjust the file date as needed
Column | Field | Official description | Notes |
---|---|---|---|
1 | hgnc_id | HGNC ID. A unique ID created by the HGNC for every approved symbol. | HGNC gene ID |
2 | symbol | The HGNC approved gene symbol. Equates to the “APPROVED SYMBOL” field within the gene symbol report. | Gene symbol |
3 | name | HGNC approved name for the gene. Equates to the “APPROVED NAME” field within the gene symbol report. | |
4 | locus_group | A group name for a set of related locus types as defined by the HGNC (e.g. non-coding RNA). | |
5 | locus_type | The locus type as defined by the HGNC (e.g. RNA, transfer). | |
6 | status | Status of the symbol report, which can be either “Approved” or “Entry Withdrawn”. | |
7 | location | Cytogenetic location of the gene (e.g. 2q34). | |
8 | location_sortable | Same as “location” but single digit chromosomes are prefixed with a 0 enabling them to be sorted in correct numerical order (e.g. 02q34). | |
9 | alias_symbol | Other symbols used to refer to this gene as seen in the “SYNONYMS” field in the symbol report. | Alternative names (aliases) for the gene |
10 | alias_name | Other names used to refer to this gene as seen in the “SYNONYMS” field in the gene symbol report. | |
11 | prev_symbol | Symbols previously approved by the HGNC for this gene. Equates to the “PREVIOUS SYMBOLS & NAMES” field within the gene symbol report. | |
12 | prev_name | Gene names previously approved by the HGNC for this gene. Equates to the “PREVIOUS SYMBOLS & NAMES” field within the gene symbol report. | |
13 | gene_group | Name given to a gene family or group the gene has been assigned to. Equates to the “GENE FAMILY” field within the gene symbol report. | |
14 | gene_group_id | ID used to designate a gene family or group the gene has been assigned to. | |
15 | date_approved_reserved | The date the entry was first approved. | |
16 | date_symbol_changed | The date the gene symbol was last changed. | |
17 | date_name_changed | The date the gene name was last changed. | |
18 | date_modified | Date the entry was last modified. | |
19 | entrez_id | Entrez gene ID. Found within the “GENE RESOURCES” section of the gene symbol report. | NCBI gene ID |
20 | ensembl_gene_id | Ensembl gene ID. Found within the “GENE RESOURCES” section of the gene symbol report. | |
21 | vega_id | | |
22 | ucsc_id | UCSC gene ID. Found within the “GENE RESOURCES” section of the gene symbol report. | |
23 | ena | International Nucleotide Sequence Database Collaboration (GenBank, ENA and DDBJ) accession number(s). Found within the “NUCLEOTIDE SEQUENCES” section of the gene symbol report. | |
24 | refseq_accession | RefSeq nucleotide accession(s). Found within the “NUCLEOTIDE SEQUENCES” section of the gene symbol report. | Transcript ID provided by HGNC |
25 | ccds_id | Consensus CDS ID. Found within the “NUCLEOTIDE SEQUENCES” section of the gene symbol report. | |
26 | uniprot_ids | UniProt protein accession. Found within the “PROTEIN RESOURCES” section of the gene symbol report. | |
27 | pubmed_id | Pubmed and Europe Pubmed Central PMID(s). | |
28 | mgd_id | Mouse genome informatics database ID. Found within the “HOMOLOGS” section of the gene symbol report. | |
29 | rgd_id | Rat genome database gene ID. Found within the “HOMOLOGS” section of the gene symbol report. | |
30 | lsdb | The name of the Locus Specific Mutation Database and URL for the gene separated by a \| character | |
31 | cosmic | Symbol used within the Catalogue of somatic mutations in cancer for the gene. | |
32 | omim_id | Online Mendelian Inheritance in Man (OMIM) ID | |
33 | mirbase | miRBase ID | |
34 | homeodb | Homeobox Database ID | |
35 | snornabase | snoRNABase ID | |
36 | bioparadigms_slc | Symbol used to link to the SLC tables database at bioparadigms.org for the gene | |
37 | orphanet | Orphanet ID | |
38 | pseudogene.org | Pseudogene.org | |
39 | horde_id | Symbol used within HORDE for the gene | |
40 | merops | ID used to link to the MEROPS peptidase database | |
41 | imgt | Symbol used within international ImMunoGeneTics information system | |
42 | iuphar | The objectId used to link to the IUPHAR/BPS Guide to PHARMACOLOGY database. To link to IUPHAR/BPS Guide to PHARMACOLOGY database only use the number (only use 1 from the result objectId:1) | |
43 | kznf_gene_catalog | ID used to link to the Human KZNF Gene Catalog | |
44 | mamit-trnadb | ID to link to the Mamit-tRNA database | |
45 | cd | Symbol used within the Human Cell Differentiation Molecule database for the gene | |
46 | lncrnadb | lncRNA Database ID | |
47 | enzyme_id | ENZYME EC accession number | |
48 | intermediate_filament_db | ID used to link to the Human Intermediate Filament Database | |
49 | rna_central_ids | | |
50 | lncipedia | The gene symbol used to link to LNCipedia - a comprehensive compendium of human long non-coding RNAs. | |
51 | gtrnadb | | |
52 | agr | The HGNC ID that the Alliance of Genome Resources (AGR) have linked to their record of the gene. Use the HGNC ID to link to the AGR. | |
53 | mane_select | NCBI and Ensembl transcript IDs/accessions including the version number for one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene. The IDs are delimited by \|. | |
54 | gencc | The HGNC ID used within the GenCC database as the unique identifier of their gene reports within the GenCC database. | |
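
Column positions like those in the table may shift between snapshots, so it can be safer to look them up from the file's header line than to hard-code them. The following is a minimal sketch under the assumption that the first line of the file is a tab-separated header; `column_numbers` is a hypothetical helper name, and the header string is a shortened example.

```python
def column_numbers(header_line):
    """Map each field name to its 1-based column number."""
    fields = header_line.rstrip("\n").split("\t")
    return {name: idx for idx, name in enumerate(fields, start=1)}


# Shortened header for illustration; a real snapshot has many more fields.
header = "hgnc_id\tsymbol\tname\tlocus_group\tlocus_type\tstatus\n"
cols = column_numbers(header)
print(cols["symbol"], cols["status"])
```

The resulting numbers can be passed to tools such as `cut` or `awk`, or used to check that a new snapshot still matches the table.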
Updating the HGNC mapping file
# Generate the HGNC gene-name conversion file
An example of the result file (change_gene.list.tmp):
entrez_id | HGNC_symbol | NCBI_symbol | Tag |
---|---|---|---|
1 | A1BG | A1BG | Match |
29974 | A1CF | A1CF | Match |
2 | A2M | A2M | Match |
144568 | A2ML1 | A2ML1 | Match |
53947 | A4GALT | A4GALT | Match |
51146 | A4GNT | A4GNT | Match |
8086 | AAAS | AAAS | Match |
65985 | AACS | AACS | Match |
13 | AADAC | AADAC | Match |
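The first three columns hold the gene information; the fourth column, Tag, indicates whether the records could be matched:
Match means that the corresponding gene IDs in HGNC and NCBI can be matched.
MisMatch means that the gene IDs in the two data sets could not be matched.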
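One way such a tag could be computed is sketched below. The comparison rule is an assumption, since the original generation script is not shown: a row is tagged Match when the NCBI symbol recorded for the same `entrez_id` equals the HGNC symbol, and MisMatch otherwise (including when NCBI has no entry for that ID). The function name `tag_rows` and the last sample row are hypothetical.

```python
def tag_rows(hgnc_rows, ncbi_symbols):
    """Build (entrez_id, hgnc_symbol, ncbi_symbol, tag) tuples.

    hgnc_rows:    iterable of (entrez_id, hgnc_symbol) pairs
    ncbi_symbols: dict mapping entrez_id -> NCBI symbol
    Assumed rule: Match when both symbols are equal, otherwise MisMatch.
    """
    result = []
    for entrez_id, hgnc_symbol in hgnc_rows:
        ncbi_symbol = ncbi_symbols.get(entrez_id, "")
        tag = "Match" if ncbi_symbol == hgnc_symbol else "MisMatch"
        result.append((entrez_id, hgnc_symbol, ncbi_symbol, tag))
    return result


# The first two rows follow the example table; the third is made up
# purely to show a MisMatch.
hgnc_rows = [("1", "A1BG"), ("2", "A2M"), ("99999999", "GENEX")]
ncbi_symbols = {"1": "A1BG", "2": "A2M", "99999999": "GENEY"}

tagged = tag_rows(hgnc_rows, ncbi_symbols)
for row in tagged:
    print("\t".join(row))
```

Rows tagged MisMatch are the ones worth reviewing by hand before the mapping file is put into use.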
After the update, record the database update date and the operator in the readme in that directory.
e.g.: