RNA-seq Pipeline for Known Transcripts: Difference between revisions

Pasted in Instructions from Richard McEachin
 
added section for aligning reads with bowtie
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
RNA-Seq pipeline – known transcripts
==Reference Pages==
1. Run FASTQ to assess quality of reads from sequencer and:
* http://seqanswers.com/wiki/How-to/RNASeq_analysis
a. Filter for quality, if applicable
b. Trim, if applicable
c. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/


2. Run bowtie-build to generate Burroughs Wheeler transformed reference genome (.ebwt format). 
==Sequence Quality and Trimming==
a. http://bowtie-bio.sourceforge.net/index.shtml (bowtie, tophat, and cufflinks are here).
# Run FASTQC to assess quality of reads from sequencer and:
b. [Optional input and parameter settings are in square brackets.]  
# FASTQC available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
c. <Required parameters are in greater than/less than brackets.
## Open run_fastqc on a windows machine. Individually open each sequence file and allow it to analyseSave this report.
d. This BW transformed reference genome can be created once then used repeatedly in the future.  
## Check this report to decide if sequences need to be trimmed or discarded. See http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf for sample results and an explanation of the quality report
e. $ is the command prompt. 


$ bowtie-build [-f specifies reference genome is in fasta format] <path to input reference genome (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa)> <base name for reference genome output .ebwt files (e.g hg19)>
===Filter Sequences Using FastX-Toolkit===
# If samples are barcoded use Fastx barcode splitter (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage for more details):
<pre>
/usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]
</pre>
* This requires a barcode file in the format where BC# is the barcode number and the nucleotide names are the barcodes:
<pre>
#This line is a comment (starts with a 'number' sign)
BC1 GATCT
BC2 ATCGT
BC3 GTGAT
BC4 TGTCT
</pre>
* This file is the FILE for the --bcfile option
* Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt:
<pre>
cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt"
</pre>
# Filter for quality, if applicable
# Trim, if applicable using Fast-x.  The following keeps 100% of the reads with a quality of 25 or greater:
<pre>
fastq_quality_filter -v -q 25 -p 100  -i  control-reads.txt -o control-reads-quality25.txt
</pre>


3. Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.
==Align Reads==
This can be done with either TopHat or Bowtie, so choose one of the following.  The reference genomes are located in the following locations:
<pre>
/database/davebrid/RNAseq/reference-genomes/hg19
/database/davebrid/RNAseq/reference-genomes/mm9
</pre>
These reference alignments are pre-built UCSC genomes and downloaded from ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/
 
===Align Reads to Reference Genome with Bowtie===
Run bowtie to align reads to reference genomes. The following generates a sam formatted alignment using the best quality flag for reads aligned to hg19
<pre>
bowtie --sam --best /database/davebrid/RNAseq/reference-genomes/hg19/hg19 control-reads-quality25.txt control-aligned-quality25.sam
</pre>
 
===Align Reads to Reference Genome with Tophat===
Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.
<pre>
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)>
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)>
$ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq
$ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq
 
</pre>
4. Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.
==Use Cuffcompare to Generate .gtf Reference==
 
Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.
<pre>
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )>
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )>


$ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf
$ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf
</pre>


 
==Use Cuffdiff to Identify Differentially Expressed Transcripts==
5. Run cuffdiff to identify differentially abundant transcripts.
Run cuffdiff to identify differentially abundant transcripts.
 
<pre>
$ cuffdiff  [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc>  
$ cuffdiff  [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc>  


$ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam
$ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam
</pre>
[[Category: Bioinformatics]]
[[Category: Transcription]]
[[Category: RNA-seq]]
[[Category: Next Generation Sequencing]]