Difference between revisions of "RNA-seq Pipeline for Known Transcripts"
Davebridges (Talk | contribs) (Pasted in Instructions from Richard McEachin) |
Davebridges (Talk | contribs) (added section for aligning reads with bowtie) |
||
(7 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | + | ==Reference Pages== | |
− | + | * http://seqanswers.com/wiki/How-to/RNASeq_analysis | |
− | + | ||
− | + | ||
− | + | ||
− | + | ==Sequence Quality and Trimming== | |
− | + | # Run FASTQC to assess quality of reads from sequencer and: | |
− | + | # FASTQC available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ | |
− | + | ## Open run_fastqc on a windows machine. Individually open each sequence file and allow it to analyse. Save this report. | |
− | + | ## Check this report to decide if sequences need to be trimmed or discarded. See http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf for sample results and an explanation of the quality report | |
− | + | ||
− | + | ===Filter Sequences Using FastX-Toolkit=== | |
+ | # If samples are barcoded use Fastx barcode splitter (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage for more details): | ||
+ | <pre> | ||
+ | /usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug] | ||
+ | </pre> | ||
+ | * This requires a barcode file in the format where BC# is the barcode number and the nucleotide names are the barcodes: | ||
+ | <pre> | ||
+ | #This line is a comment (starts with a 'number' sign) | ||
+ | BC1 GATCT | ||
+ | BC2 ATCGT | ||
+ | BC3 GTGAT | ||
+ | BC4 TGTCT | ||
+ | </pre> | ||
+ | * This file is the FILE for the --bcfile option | ||
+ | * Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt: | ||
+ | <pre> | ||
+ | cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt" | ||
+ | </pre> | ||
+ | # Filter for quality, if applicable | ||
+ | # Trim, if applicable using Fast-x. The following keeps 100% of the reads with a quality of 25 or greater: | ||
+ | <pre> | ||
+ | fastq_quality_filter -v -q 25 -p 100 -i control-reads.txt -o control-reads-quality25.txt | ||
+ | </pre> | ||
− | + | ==Align Reads== | |
+ | This can be done with either TopHat or Bowtie, so choose one of the following. The reference genomes are located in the following locations: | ||
+ | <pre> | ||
+ | /database/davebrid/RNAseq/reference-genomes/hg19 | ||
+ | /database/davebrid/RNAseq/reference-genomes/mm9 | ||
+ | </pre> | ||
+ | These reference alignments are pre-built UCSC genomes and downloaded from ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/ | ||
+ | |||
+ | ===Align Reads to Reference Genome with Bowtie=== | ||
+ | Run bowtie to align reads to reference genomes. The following generates a sam formatted alignment using the best quality flag for reads aligned to hg19 | ||
+ | <pre> | ||
+ | bowtie --sam --best /database/davebrid/RNAseq/reference-genomes/hg19/hg19 control-reads-quality25.txt control-aligned-quality25.sam | ||
+ | </pre> | ||
+ | |||
+ | ===Align Reads to Reference Genome with Tophat=== | ||
+ | Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line. | ||
+ | <pre> | ||
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)> | $ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)> | ||
$ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq | $ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq | ||
− | + | </pre> | |
− | + | ==Use Cuffcompare to Generate .gtf Reference== | |
− | + | Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future. | |
+ | <pre> | ||
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )> | $ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )> | ||
$ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf | $ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf | ||
+ | </pre> | ||
− | + | ==Use Cuffdiff to Identify Differentially Expressed Transcripts== | |
− | + | Run cuffdiff to identify differentially abundant transcripts. | |
− | + | <pre> | |
$ cuffdiff [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc> | $ cuffdiff [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc> | ||
$ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam | $ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam | ||
+ | </pre> | ||
+ | [[Category: Bioinformatics]] | ||
+ | [[Category: Transcription]] | ||
+ | [[Category: RNA-seq]] | ||
+ | [[Category: Next Generation Sequencing]] |
Latest revision as of 16:08, 8 November 2011
Contents
Reference Pages
Sequence Quality and Trimming
- Run FASTQC to assess quality of reads from sequencer and:
- FASTQC available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
- Open run_fastqc on a windows machine. Individually open each sequence file and allow it to analyse. Save this report.
- Check this report to decide if sequences need to be trimmed or discarded. See http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf for sample results and an explanation of the quality report
Filter Sequences Using FastX-Toolkit
- If samples are barcoded use Fastx barcode splitter (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage for more details):
/usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]
- This requires a barcode file in the format where BC# is the barcode number and the nucleotide names are the barcodes:
#This line is a comment (starts with a 'number' sign) BC1 GATCT BC2 ATCGT BC3 GTGAT BC4 TGTCT
- This file is the FILE for the --bcfile option
- Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt:
cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt"
- Filter for quality, if applicable
- Trim, if applicable using Fast-x. The following keeps 100% of the reads with a quality of 25 or greater:
fastq_quality_filter -v -q 25 -p 100 -i control-reads.txt -o control-reads-quality25.txt
Align Reads
This can be done with either TopHat or Bowtie, so choose one of the following. The reference genomes are located in the following locations:
/database/davebrid/RNAseq/reference-genomes/hg19 /database/davebrid/RNAseq/reference-genomes/mm9
These reference alignments are pre-built UCSC genomes and downloaded from ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/
Align Reads to Reference Genome with Bowtie
Run bowtie to align reads to reference genomes. The following generates a sam formatted alignment using the best quality flag for reads aligned to hg19
bowtie --sam --best /database/davebrid/RNAseq/reference-genomes/hg19/hg19 control-reads-quality25.txt control-aligned-quality25.sam
Align Reads to Reference Genome with Tophat
Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)> $ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq
Use Cuffcompare to Generate .gtf Reference
Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )> $ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf
Use Cuffdiff to Identify Differentially Expressed Transcripts
Run cuffdiff to identify differentially abundant transcripts.
$ cuffdiff [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc> $ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam