File Format Notes - BED

Definition

Browser Extensible Data (BED) is a whitespace-delimited file format in which each file consists of one or more lines. Each line describes a discrete genomic feature by its physical start and end position on a linear chromosome. The file extension for the BED format is .bed.

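Official documentation

Format description

Format overview

A complete BED file contains the following 12 columns (a file labelled BEDn uses only the first n columns).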

| Col | Field | Type | Regex or range | Brief description |
|---|---|---|---|---|
| 1 | chrom | String | [[:alnum:]_]{1,255} | Chromosome name |
| 2 | chromStart | Int | [0, $2^{64}$ − 1] | Feature start position |
| 3 | chromEnd | Int | [0, $2^{64}$ − 1] | Feature end position |
| 4 | name | String | [^\t]{0,255} | Feature description |
| 5 | score | Int | [0, 1000] | A numerical value |
| 6 | strand | String | [-+.] | Feature strand |
| 7 | thickStart | Int | [0, $2^{64}$ − 1] | Thick start position |
| 8 | thickEnd | Int | [0, $2^{64}$ − 1] | Thick end position |
| 9 | itemRgb | Int,Int,Int | ([0, 255], [0, 255], [0, 255]) or 0 | Display color |
| 10 | blockCount | Int | [0, chromEnd − chromStart] | Number of blocks |
| 11 | blockSizes | List[Int] | ([[:digit:]]+,){blockCount−1}[[:digit:]]+,? | Block sizes |
| 12 | blockStarts | List[Int] | ([[:digit:]]+,){blockCount−1}[[:digit:]]+,? | Block start positions |
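
To make the column layout concrete, here is a minimal Python sketch that splits one data line into named fields. It assumes the line is tab-delimited and has between 3 and 12 fields; both are assumptions of this sketch, and the names `BED_FIELDS`, `INT_FIELDS`, and `parse_bed_line` are hypothetical helpers, not part of any library.

```python
# Field names in column order, as listed in the table above.
BED_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
)
# Columns whose type is Int in the table.
INT_FIELDS = {"chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount"}


def parse_bed_line(line):
    """Split one tab-delimited data line into a dict keyed by field name.

    Assumption of this sketch: 3 to 12 fields, no custom trailing columns.
    """
    values = line.rstrip("\r\n").split("\t")
    if not 3 <= len(values) <= len(BED_FIELDS):
        raise ValueError(f"expected 3 to 12 fields, got {len(values)}")
    record = {}
    for field, value in zip(BED_FIELDS, values):
        # Integer columns are converted; everything else stays a string.
        record[field] = int(value) if field in INT_FIELDS else value
    return record


record = parse_bed_line("chr1\t100\t200\tfeat\t0\t+\n")
print(record["chrom"], record["chromStart"], record["chromEnd"])
```

List-valued columns such as blockSizes are left as strings here; they need their own parsing step.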

The columns are described in detail below:

Coordinates

  1. chrom: The name of the chromosome or scaffold where the feature is present. Limiting
    names to word characters, instead of all non-whitespace characters, makes BED files
    more portable to varying environments which may make different assumptions about allowed
    characters. The name must be between 1 and 255 characters long, inclusive.
  2. chromStart: Start position of the feature on the chromosome or scaffold. chromStart must be
    an integer greater than or equal to 0 and less than the total number of bases of the
    chromosome to which it belongs. If the size of the chromosome is unknown, then chromStart must
    be less than or equal to $2^{64}$ − 1, which is the maximum size of an unsigned 64-bit integer.
  3. chromEnd: End position of the feature on the chromosome or scaffold. chromEnd must be
    an integer greater than or equal to the value of chromStart and less than or equal to the total
    number of bases in the chromosome to which it belongs. If the size of the chromosome
    is unknown, then chromEnd must be less than or equal to $2^{64}$ − 1, the maximum size of an
    unsigned 64-bit integer.
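
The coordinate rules above can be sketched as a small checker. This is an illustrative Python sketch, not a reference validator: the function name `check_coordinates` and the optional `chrom_size` argument are hypothetical, and `[A-Za-z0-9_]` is used as the ASCII equivalent of `[[:alnum:]_]`.

```python
import re

MAX_POSITION = 2**64 - 1  # largest unsigned 64-bit integer
CHROM_PATTERN = re.compile(r"[A-Za-z0-9_]{1,255}")


def check_coordinates(chrom, chrom_start, chrom_end, chrom_size=None):
    """Return a list of rule violations (empty if the coordinates look valid)."""
    problems = []
    if not CHROM_PATTERN.fullmatch(chrom):
        problems.append("chrom must be 1 to 255 word characters")
    if chrom_size is None:
        # Unknown chromosome size: fall back to the 64-bit limit.
        start_ok = 0 <= chrom_start <= MAX_POSITION
        end_limit = MAX_POSITION
    else:
        # Known size: chromStart must be strictly less than the size.
        start_ok = 0 <= chrom_start < chrom_size
        end_limit = chrom_size
    if not start_ok:
        problems.append("chromStart out of range")
    if not chrom_start <= chrom_end <= end_limit:
        problems.append("chromEnd out of range")
    return problems


print(check_coordinates("chr1", 0, 10))
print(check_coordinates("chr1", 10, 5))
```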

Simple attributes

  1. name: String that describes the feature. The name must be 0 to 255 non-tab characters. The
    name must not be empty or contain whitespace, unless all fields in the file are delimited exclusively
    using single tab characters. A visual representation of the BED format may display the name
    next to the feature.
  2. score: Integer between 0 and 1000, inclusive. If the feature has no score information, then 0
    should be used as a default value. A visual representation of the BED format may shade
    features differently depending on their score.
  3. strand: Strand that the feature appears on. The strand may either refer to the + (sense or
    coding) strand or the - (antisense or complementary) strand. If the feature has no strand
    information or unknown strand, then a dot (.) must be used.
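
As a rough illustration of the three rules above, the following Python sketch checks name, score, and strand for a file assumed to be strictly tab-delimited (so an empty name and spaces inside the name are tolerated). The function name `check_simple_attributes` is hypothetical.

```python
import re

NAME_PATTERN = re.compile(r"[^\t]{0,255}")  # 0 to 255 non-tab characters
STRANDS = {"+", "-", "."}


def check_simple_attributes(name, score, strand):
    """Return a list of rule violations for the name, score and strand fields."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must be 0 to 255 non-tab characters")
    if not (isinstance(score, int) and 0 <= score <= 1000):
        problems.append("score must be an integer in [0, 1000]")
    if strand not in STRANDS:
        problems.append("strand must be '+', '-' or '.'")
    return problems


print(check_simple_attributes("gene1", 0, "+"))
print(check_simple_attributes("gene1", 1001, "?"))
```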

Display attributes

  1. thickStart: Start position at which the feature is visualized with a thicker or accented display.
    This value must be an integer between chromStart and chromEnd, inclusive. There is no
    specified default value for thickStart.
  2. thickEnd: End position at which the feature is visualized with a thicker or accented display.
    This value must be an integer greater than or equal to thickStart and less than or equal
    to chromEnd. In BED files with fewer than 7 fields, the whole feature has thick
    display. In BED7+ files, to achieve the same effect, set thickStart equal to chromStart
    and thickEnd equal to chromEnd. If this field is not specified but thickStart is, then the
    entire feature has thick display. There is no specified default value for thickEnd.
  3. itemRgb: A triple of integers that determines the color of this feature when visualized. The
    triple is three integers separated by commas. Each integer is between 0 and 255, inclusive. To
    make a feature black, itemRgb may be a single 0, which is visualized identically to a feature
    with itemRgb of 0,0,0.
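
The itemRgb rule, including the single-0 shorthand for black, can be sketched in Python as follows. The helper name `parse_item_rgb` is hypothetical, and rejecting components with more than three digits is a simplification of this sketch.

```python
import re

COMPONENT = re.compile(r"[0-9]{1,3}")


def parse_item_rgb(text):
    """Parse an itemRgb field into an (r, g, b) tuple of integers."""
    if text == "0":
        # A single 0 is shorthand for black.
        return (0, 0, 0)
    parts = text.split(",")
    if len(parts) != 3 or not all(COMPONENT.fullmatch(p) for p in parts):
        raise ValueError(f"itemRgb must be 0 or three integers: {text!r}")
    rgb = tuple(int(p) for p in parts)
    if any(component > 255 for component in rgb):
        raise ValueError(f"itemRgb components must be in [0, 255]: {text!r}")
    return rgb


print(parse_item_rgb("0"))
print(parse_item_rgb("255,0,128"))
```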

Blocks

  1. blockCount: Number of blocks in the feature. blockCount must be an integer greater than 0.
    blockCount is mandatory in BED12+ files. Null or empty blockCount are not allowed,
    because blockSizes and blockStarts rely on blockCount. A visual representation of the BED
    format may have blocks appear thicker than the rest of the feature.
  2. blockSizes: Comma-separated list of length blockCount containing the size of each block. There
    must be no spaces before or after commas. There may be a trailing comma after the last
    element of the list. blockSizes is mandatory in BED12+ files. Null or empty blockSizes is not
    allowed, because blockStarts cannot be verified without blockSizes.
  3. blockStarts: Comma-separated list of length blockCount containing each block’s start position,
    relative to chromStart. There must not be spaces before or after the commas. There may
    be a trailing comma after the last element of the list. Each element in blockStarts is paired
    with the corresponding element in blockSizes. Each blockStarts element must be an integer
    between 0 and chromEnd − chromStart, inclusive. For each pair ($blockStarts_i$, $blockSizes_i$),
    the quantity chromStart + $blockStarts_i$ + $blockSizes_i$ must be less than or equal to chromEnd. These
    conditions enforce that each block is contained within the feature. The first block must
    start at chromStart and the last block must end at chromEnd. Moreover, the blocks must not
    overlap. The list must be sorted in ascending order. blockStarts is mandatory in BED12+
    files. Null or empty blockStarts is not allowed.
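
The block rules above are easier to follow with a worked example. The Python sketch below converts blockSizes and blockStarts into absolute half-open intervals and checks the constraints described in this section. The helper names `parse_int_list` and `block_intervals` are hypothetical, and the sample values are made up for illustration.

```python
def parse_int_list(text):
    """Parse a comma-separated integer list, allowing one trailing comma."""
    if text.endswith(","):
        text = text[:-1]
    return [int(item) for item in text.split(",")]


def block_intervals(chrom_start, chrom_end, block_count, block_sizes, block_starts):
    """Return absolute (start, end) intervals for each block, or raise ValueError."""
    sizes = parse_int_list(block_sizes)
    starts = parse_int_list(block_starts)
    if block_count < 1 or len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockCount must be > 0 and match both list lengths")
    # blockStarts are relative to chromStart.
    intervals = [(chrom_start + s, chrom_start + s + z) for s, z in zip(starts, sizes)]
    if any(start < chrom_start or end > chrom_end for start, end in intervals):
        raise ValueError("every block must lie within the feature")
    if intervals[0][0] != chrom_start or intervals[-1][1] != chrom_end:
        raise ValueError("first block must start at chromStart, last must end at chromEnd")
    for (_, previous_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start < previous_end:
            raise ValueError("blocks must be sorted and must not overlap")
    return intervals


# A feature spanning [1000, 5000) with two blocks.
print(block_intervals(1000, 5000, 2, "567,488,", "0,3512"))
```

With these inputs the second block starts at 1000 + 3512 = 4512 and ends at 4512 + 488 = 5000, which equals chromEnd as required.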

Terminology and concepts

0-start, half-open coordinate system: A coordinate system where the first base starts at position 0, and the start of the interval is included but the end is not, so BED intervals are closed on the left and open on the right. For example, for a sequence of bases ACTGCG, the bases given by the interval [2, 4) are TG.
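
Python slices use the same 0-start, half-open convention, so the example above can be reproduced directly. The function names below are hypothetical helpers; the conversion to 1-based closed coordinates is included only as an illustration of how the two conventions relate.

```python
def feature_bases(sequence, chrom_start, chrom_end):
    """Return the bases covered by the half-open interval [chrom_start, chrom_end)."""
    return sequence[chrom_start:chrom_end]


def feature_length(chrom_start, chrom_end):
    """With half-open intervals the length is simply end minus start."""
    return chrom_end - chrom_start


def to_one_based_closed(chrom_start, chrom_end):
    """Convert [start, end) in 0-based terms to [start, end] in 1-based terms."""
    return (chrom_start + 1, chrom_end)


print(feature_bases("ACTGCG", 2, 4))   # TG
print(feature_length(2, 4))            # 2
print(to_one_based_closed(2, 4))       # (3, 4)
```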

BEDn: A file with the first n fields of the BED format (12 fields in total). For example, BED3 means a file with only the first 3 fields; BED12 means a file with all 12 fields.

BEDn+: A file that has the first n fields of the BED format, followed by any number of fields of custom data defined by a user.

BEDn+m: A file that has a custom tab-delimited format starting with the first n fields of the BED format, followed by m fields of custom data defined by a user. For example, BED6+4 means a file with the first 6 fields of the BED format, followed by 4 user-defined fields.
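
The naming scheme can be illustrated with a tiny Python helper that builds the label from a total field count and the number of standard fields. The function `bed_label` is a hypothetical name; the caller is assumed to know how many leading columns are standard BED fields, since that cannot be inferred from the count alone.

```python
def bed_label(total_fields, standard_fields):
    """Build a BEDn or BEDn+m label from field counts."""
    if not 1 <= standard_fields <= 12:
        raise ValueError("standard_fields must be between 1 and 12")
    extra = total_fields - standard_fields
    if extra < 0:
        raise ValueError("total_fields cannot be smaller than standard_fields")
    if extra == 0:
        return f"BED{standard_fields}"
    return f"BED{standard_fields}+{extra}"


print(bed_label(10, 6))   # BED6+4
print(bed_label(12, 12))  # BED12
```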

block: Linear subfeatures within a feature. Usually used to designate exons.

chromosome: A sequence of nucleobases with a name. In this specification, “chromosome” may also describe a named scaffold that does not fit the biological definition of a chromosome. Often, chromosomes are numbered starting from 1. There are also often sex chromosomes such as W, X, Y, and Z, mitochondrial chromosomes such as M, and possibly scaffolds from an unknown chromosome, often labeled Un. The name of each chromosome is often prefixed with chr. Examples of chromosome names include chr1, 21, chrX, chrM, chrUn, chr19_KI270914v1_alt, and chrUn_KI270435v1.

feature: A linear region of a chromosome with specified properties. For example, a file’s features might all be peaks called from ChIP-seq data, or transcripts.

field: Data stored as non-tab text. All fields are 7-bit US ASCII.

file: Sequence of one or more lines.

line: String terminated by a line separator. A line is either a data line, a comment line, or a blank line. This is discussed more fully in subsection 1.3 of the specification.

line separator: Either carriage return, line feed, or carriage return followed by line feed. The same line separator must be used throughout the file.
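
The requirement that one line separator be used throughout the file can be checked on the raw bytes. The Python sketch below is a minimal illustration; the function names are hypothetical, and the input is assumed to be the file content read in binary mode so that no newline translation has taken place.

```python
import re

# Match CRLF first so it is not counted as a separate CR and LF.
SEPARATOR = re.compile(rb"\r\n|\r|\n")


def line_separators(raw):
    """Return the set of distinct line separators found in the raw bytes."""
    return set(SEPARATOR.findall(raw))


def has_consistent_separator(raw):
    """True if the content uses at most one kind of line separator."""
    return len(line_separators(raw)) <= 1


print(has_consistent_separator(b"chr1\t0\t10\nchr1\t10\t20\n"))
print(has_consistent_separator(b"chr1\t0\t10\r\nchr1\t10\t20\n"))
```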


