比对软件STAR的使用—高通量测序数据处理学习记录（一）

发布时间：2023-06-21

浏览次数：0

新媒体经理

dnastar序列比对_dnaman序列比对_dnastar序列比对说明

在所有是非的场景中，我最喜欢你。

大写字母

这次给大家带来的是官方的对比软件STAR。该项目是法国国家人类基因组研究所（NHGRI）于2003年9月发起的公共联合研究项目，致力于找出人类基因组中的所有功能。成分[。这是人类基因组计划完成后，国家人类基因组研究所发起的最重要的项目之一。本项目形成的所有数据将及时发布在公共数据库中。

在我之前的RNA-seq数据分析——方法论文章的实践文章中，STAR在对比软件的对比中也表现出了不错的表现。所以在进行比较时，我也考虑过将其与STAR一起使用来检查它们的性能并选择合适的比较工具。

明星安装

cd biosoft && mkdir STAR && cd STAR
wget https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz
tar -xzf 2.5.3a.tar.gz
cd STAR-2.5.3a
# for easy use, add bin/ to your PATH

下载需要参考基因组和索引创建

# downloading dna index fasta file
nohup wget -r -np -nH -nd -R index.html -L ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna_index/ &
# download gft annotation file
nohup wget ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.chr_patch_hapl_scaff.gtf.gz &
mkdir STAR_index && cd STAR_index
STAR --runMode genomeGenerate --genomeDir ~/reference/STAR_index/ --genomeFastaFiles ~/reference/genome/hg38/Homo_sapiens.GRCh38.dna.toplevel.fa --sjdbGTFfile ~/reference/genome/hg38/Homo_sapiens.GRCh38.90.chr_patch_hapl_scaff.gtf --sjdbOverhang 199
# --sjdbOverhang 数值为reads长度-1
# Mode 为generate
# --genomeFastaFiles --sjdbGTFfile 分别对应fasta文件和GTF文件

STAR的使用

# STAR的manual里面给了最基本的比对参数示例
STAR
--runThreadN NumberOfThreads
--genomeDir /path/to/genomeDir
--readFilesIn /path/to/read1 [/path/to/read2 ]
# 基本示例，
针对fastq.gz文件增加--readFilesCommand gunzip -c 参数/--readFilesCommand zcat参数，针对bzip2文件使用--readFilesCommand bunzip2 -c参数
STAR --runThreadN 20 --genomeDir ~/reference/STAR_index/ --readFilesCommand zcat --readFilesIn ~/RNA-seq/LiuPing_data/RNA-seq/SC_w2q20m35_N_1.fq.gz ~/RNA-seq/LiuPing_data/RNA-seq/SC_w2q20m35_N_2.fq.gz
# 输出unsorted or sorted bam file
--outSAMtype BAM Unsorted 实际上就是-name 的sort，下游可以直接接HTSeq
--outSAMtype BAM SortedByCoordinate
--outSAMtype BAM Unsorted SortedByCoordinate 两者都输出

附加参数说明

# 单独指定注释文件，而不用在构建的时候使用
--sjdbGTFfile /path/to/ann.gtf
--sjdbFileChrStartEnd /path/to/sj.tab
# ENCODE参数
# 减少伪junction的几率
--outFilterType BySJout
# 最多允许一个reads被匹配到多少个地方
--outFilterMultimapNmax 20
# 在未有注释的junction区域，最低允许突出多少个bp的单链序列
--alignSJoverhangMin 8
# 在有注释的junction区域，最低允许突出多少个bp的单链序列
--alignSJDBoverhangMin 1
# 过滤掉每个paired read mismatch数目超过N的数据，999代表着忽略这个过滤
--outFilterMismatchNmax 999
# 相对paired read长度可以允许的mismatch数目，如果read长度为100，数值设定为0.04，则会过滤掉100*2*0.04=8个以上的数据
--outFilterMismatchNoverReadLmax 0.04
# 最小的intro长度
--alignIntronMin 20
# 最大的intro长度
--alignIntronMax 1000000
# maximum genomic distance between mates，翻译不出来，自行理解
--alignMatesGapMax 1000000

星输出

STAR可以根据您的参数设置输出多个结果文件，包括各类信息。下面是带有默认参数的输出文件的详细显示。对于一些比较难的翻译dnastar序列比对，我选择使用原文

E00516:168:H37WKCCXY:8:1101:6400:59130    99    1    92836373    255    20M1063N129M    =    92837548    4244    GGCTTGTCTATCCCTCACAGTACCAAACGATTCCCTGGTTATGATTCTGAAAGCAAGGAATTTAATGCAGAAGTACATCGGAAGCACATCATGGGCCAGAATGTTGCAGATTACATGCGCTACTTAATGGAAGAAGATGAAGATGCTTA    AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ    NH:i:1    HI:i:1    AS:i:289    nM:i:0
# 我截取了一条比对信息
我们来看一下最后面的 NH:i:1  HI:i:1  AS:i:289    nM:i:0
NH:i:后面的数值代表着此条read比对到几个loci，1代表着unique map，数值大于1代表着multi-mappers
HI:i:后面的数值attrbiutes enumerates multiple
alignments of a read starting with 1，下游分析接cufflinks or stringtie的时候需要使用参数--outSAMattrIHstart 0
AS:i:的数值代表着local alignment score (paired for paired-edn reads)
nM:i:的数值代表着the number of mismatches per (paired) alignment, not to be confused with NM, which is the number of mismatches in each mate
关于下游处理工具的兼容性还需要使用者自己仔细参考manual

tail Log.out
Joined thread # 12
Completed: thread #13
Joined thread # 13
Joined thread # 14
Joined thread # 15
Joined thread # 16
Joined thread # 17
Joined thread # 18
Joined thread # 19
ALL DONE!

tail Log.progress.out
Sep 08 17:57:52     33.1    23115987      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:58:53     34.0    24349711      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:00:23     33.5    24789186      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:01:51     33.3    25493588      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:02:58     33.5    26284824      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:04:23     33.7    27163519      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:05:36     33.1    27428080      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:06:54     33.8    28659661      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:08:00     34.3    29741743      285    94.1%    283.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
ALL DONE!

head Log.progress.out 
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Sep 08 17:17:47      2.9       88583      288    94.2%    287.4     0.1%     4.0%     0.1%     0.0%     1.7%     0.0%
Sep 08 17:18:53     14.5      711158      282    94.1%    281.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:08:00     34.3    29741743      285    94.1%    283.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
ALL DONE!

head SJ.out.tab 
1    14830    14969    2    2    0    1    9    69
1    14844    14969    2    2    0    0    2    30
1    15039    15795    2    2    1    2    7    53
1    15948    16606    2    2    1    1    1    41
1    16028    16606    2    2    0    0    1    57
1    16311    16606    2    2    0    2    0    67
1    16766    16853    2    2    0    2    0    43
1    16766    16857    2    2    1    17    108    73
1    16766    16875    2    2    0    0    1    61
1    16789    16875    2    2    0    0    1    53
# 参数释义
column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undened, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang

写在最后

也许我探索STAR的最终目的是借助STAR的and来实现的。我自己处理的数据上有一个-，其余的对比软件还没有发现这个功能。

使用-参数时，STAR可以将reads拆分成两部分，分别进行比较

STAR-是一个，可以进行STARdnastar序列比对，点我看代码

其实STAR也可以做2-pass，比较新颖

使用-参数也可以达到HTSeq的疗效。它可以帮助您生成计数，从而节省您 HTSeq 的精力。有时间再回去做一下比较，看看HTSeq和HTSeq的效率如何。

参考：

dnaman序列比对_dnastar序列比对说明_dnastar序列比对

日常鲍勃联排别墅

如有侵权请联系删除！

TAGS：高通量测序

上一篇：Sublime Text3 插件编写教程_第一课

下一篇：我的设计成长笔记—第21篇（插画）

返回列表

诚信、勤奋、创新、卓越

13262879759

行业动态

比对软件STAR的使用—高通量测序数据处理学习记录（一）

联系我们