ABSTRACT:Powdery Mildew is a huge problem that exists in the wine grape industry, destroying crops and profits.
A wine grape farm that suffers from this issue in gave their information to a lab in Geneva, NY from which the fastq reads for this project were obtained. Using bioinformatics tools and methods, these fastq reads were analyzed through the Tuxedo Suite to get the heatmap, volcano plot, and density plot output. First the fastq reads were put through a Quality Control pipeline, then compressed in Bowtie.
Next, this compressed data was aligned to a reference genome from ENSEMBL in Tophat, and Cufflinks was used to create an annotation file containing information on the prevalence of specific reads. Finally, Cuffdiff was used to compare specific variants generated from the pipeline against the annotation file generated from cufflinks and the data from this was visualized in a heat map using CummeRbund. It was found that the gene variants were generally similar in comparison while the density of read counts was fairly similar resulting in insufficient evidence to promote solutions to suppress powdery mildew.
INTRODUCTION: In the modern era of bioinformatics tools, we see that RNA analyzation is becoming a more useful way of determining relationships between subsets of DNA, RNA, or proteins. This process has evolved over years of computational perfection, and bioinformaticians have been instrumental in that process. Using RNA-Seq data, it is plausible to predict of similar functionality and structures in other molecules. The Tuxedo suite utilizes an important mechanism spanning computer science, mathematics, and bioinformatics alike: dynamic programming. Much like the divide and conquer approach, dynamic programming is the computational methodology for dividing a problem into multiple subsets. By aligning RNA-seq reads to a genome using Tophat, assembling transcripts from read alignments using Cufflinks, performing differential expression analysis using Cuffdiff and final expression analysis using CummeRbund, we can create solutions to minimizing the effect of powdery mildew on grapevines. METHODS:1.
Fastq ReadsFastq is a file format for storing both a biological sequence and its corresponding quality scores. Our fastq reads used thee sourced from Geneva, NY. We needed these reads to stream them through the Quality Control Pipeline, referenced hereinafter as the QC Pipeline. For this, we needed to obtain the following: Reads of mRNA to cDNA in order to measure gene expression, and Fastq reads of fasta with ASCII character quality scores for each nucleotide. Be sure to note that sequences that are either too short or too long may not result in high quality reads.A fastq file may appear in the same format as the following:@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!”*((((***+))%%%++)(%%%%).1***-+*”))**55CCF>>>>>>CCCCCCC65Here is a portion of the reference genome:2.
QC PipelineWe needed to run the fastq reads mentioned in step one through the QC Pipeline. A pipeline is basically a set of data processing elements connected in a series, where the output of one element is the input of the next one, ours is written in python. What we also needed to do was to make a script for this pipeline. We made a script that creates a path for the fastx output. Using fastx will trim the fastq reads and remove the low-quality parts of the file. #!/bin/bash (Path to fastx) In the script, run python with three arguments including the QC pipeline file, the path to the fastq file, and a new path to where we want the output to be placed. Python (QC pipeline file) (arguments for QC pipeline) When the script has completed running, it would be useful to include an email receipt to give we a completion time and notification that we can move on to the next step.
Mail -s “Job Done” (the email here)Below is a complete example of the script we used in computation: #!/bin/bashPATH=$PATH:/usr/local/bin/fastx:/usr/local/binpython qcpipe3.py /Users/Princessrowana/Desktop/RIT/1139_1570_4565_N_20130206Ex5MloErin_R1.fastq /Users/Princessrowana/Desktop/RIT/run2 group_R2 pmr6_barcodes.tsv R2mail -s “Job Done” [email protected]
edu3. BowtieFor the third step, we needed to write another script in order to use Bowtie, a Burrows-Wheeler Transformation tool, to compress our genome into a useable format that will be able to run through Tophat and will take up less space. #!/bin/bashThen we needed to include a path that will point to the directory where Bowtie is located. This takes two program arguments: the path to the location of the genome, and the path to the location where we want the output file to be placed. (path to bowtie program)(path to genome)(path to output location)If we would like the email notification, copy the final line from the last script and place it at the end of this one as well.
mail -s “Job Done” (the email here)Below is a complete example of the script we used in computation:#!/bin/bash/usr/local/bin/bowtie/bowtie2-build /Users/Princessrowana/Desktop/RITmail -s “Job Done” [email protected] TophatOur next step included utilizing Tophat. Tophat is a package that allows the reads from Bowtie to be aligned to a reference genome, in our case, Vitis vinifera. Our genome was obtained from ENSEMBL, a popular and free genome browser.
In Tophat, the gene expression information is generated by the amount of reads aligned to the reference genome. If there are multiple reads aligned in one place, that means that the mRNA at that gene location is more heavily transcribed. With this information, we can run a script with more specific variants that contain more information.To do this, we once again wrote another script. #!/bin/bashThen we created two paths, one for the location of Samtools, a suite of programs used in high-throughput sequencing data, and another of the location to the Tophat program on the system. We used Tophat version 2.
1.1. For our script we wrote ours using: PATH=$PATH:/usr/local/bin/samtools PATH=$PATH:/usr/local/bin/tophat2.1.1Following this, we needed to execute tophat. Tophat takes in a number of arguments which include: the name of the directory that will be created to store the files, the path to the reference genome, the path to the read orientation of the first variant, and the read orientation of the second variant.
The extension “.all” is used to locate all relevant tophat file types within the directory. Tophat -p 4 name of directory to be created/store output /path/to/reference/genome/file-prefix.all /path/to/read/orientation_one.fq /path/to/read/orientation_two.fq mail -s “Job Done” (the email)Note: as per the last scripts, we may include the email notification line at the end.Below is a complete example of the script we used in computation: #!/bin/bashPATH=$PATH:/usr/local/bin/samtools PATH=$PATH:/usr/local/bin/tophat2.1.
1Tophat -p 4 -o tophatoutput /Users/Princessrowana/Desktop/RIT/Vitis_vinifera.IGGP_12x.30.cdn$ /Users/Princessrowana/Desktop/RIT/group_R1_LN381.
cut.fq /Users/Princessrowana/Desktop/RIT/group_R2_LN381.cut.fq mail -s “Job Done” [email protected] CufflinksWe used Cufflinks to generate an annotation file with which we continued our analyses.
Cufflinks assembles transcripts while estimating their abundances and tests for differential expression and regulation in RNA-Seq samples. It takes in aligned RNA-Seq reads and assembles the alignments into a set of transcripts. It then estimates the relative abundances of these transcripts based on how many reads support each one.
Cufflinks is generated from the TopHat output.6. CuffdiffCuffdiff takes the output from Cufflinks and compares the variants generated against the annotation file generated by Cufflinks. After we picked the variants we wanted, the output was ran through Tophat once more. For this, we wrote another script. #!/bin/bashThen, we wrote three paths. One for a path to Samtools on the system, one path to Tophat, and one to Bowtie.
PATH=$PATH:/usr/local/bin/samtools PATH=$PATH:/usr/local/bin/tophat2.0.13/tophat-2.0.13.Linux_86_64 PATH=$PATH:/usr/local/bin/bowtieThen, we wrote a script to run Tophat that created a variant-specific alignment of reads compared to the transcriptome. /usr/local/bin/tophat/tophat -G /path/to/transcriptome/file.gtfNote: ‘-G’ is a label for the transcriptome.
Additional note: This transcriptome file was generated from an assembly of all the reads performed by Tophat. Next, we set a place for a directory to be made to store files. -o name of directory to store filesCuffdiff then used the annotation generated by cufflinks and the specific variant generated from tophat to determine the differences between the variant and the annotation. We needed to include the paths to the read orientations we used in the analyses here as well. /path/to/reference/genome/file-prefix.
all /path/to/read/orientation_one.fq /path/to/read/orientation_two.fqFurthermore, Cufflinks shows areas of high or low expression in the variant as compared to the rest of the reads, for this we wrote another script. To begin, we wrote the paths to Samtools and Cufflinks on the server.
#!/bin/bash PATH=$PATH:/usr/local/bin/samtools PATH=$PATH:/usr/local/bin/cufflinks We needed to use the output directory from Tophat and generate the files within this directory. Cuffdiff -o cuffdiff_out /path/to/transcriptome/file.gtf /path/to/accepted_hits.bam /path/to/unmapped.
bamBelow is a complete example of the script we used in computation: #!/bin/bashPATH=$PATH:/usr/local/bin/samtoolsPATH=$PATH:/usr/local/bin/tophat2.0.13/tophat-2.
0.13.Linux_86_64PATH=$PATH:/usr/local/bin/bowtie/usr/local/bin/tophat/tophat -G /Users/Princessrowana/Desktop/RIT/transcriptome_file.gtf-o tophatoutput /Users/Princessrowana/Desktop/RIT/Vitis_vinifera.IGGP_12x.30.cdn$ /Users/Princessrowana/Desktop/RIT/group_R1_LN381.cut.
fq /Users/Princessrowana/Desktop/RIT/group_R2_LN381.cut.fqBelow is another complete example of the script we used in computation:#!/bin/bash PATH=$PATH:/usr/local/bin/samtools PATH=$PATH:/usr/local/bin/cufflinksCuffdiff -o cuffdiff_out /Users/Princessrowana/Desktop/RIT/transcriptome_file.
gtf /Users/Princessrowana/Desktop/RIT/accepted_hits.bam /Users/Princessrowana/Desktop/RIT/unmapped.bamNote: ‘bam’ files are the binary version of a SAM file and a SAM file (.sam) is a tab-delimited text file that contains sequence alignment data that is human-readable. These paths should be the output of Tophat.7.
CummeRbundFinally, we used the output of Cuffdiff to visualize our data using R. R is a programming language and software environment for statistical computing and graphics. We used a package in R called CummeRbund. After installing R on the system, we ran R.
After running R, we imported the CummeRbund package by prompting the following commands: > source(“http://www.bioconductor.org/biocLite.R”)> biocLite(“cummeRbund”)> library(cummeRbund)We then read in the cuffdiff files using the following command:>cuff_data<-readCufflinks("/Users/Princessrowana/Desktop/RIT/cuffdiff_out")If we read a file that was already generated by cummeRbund we can use this command:> cuff_data<-readCufflinks(dbFile="/Users/Princessrowana/Desktop/RIT/cuffdiff_out.db")To generate different visualizations of the data, cummeRbund has built-in functions such as csScatter(), csHeatmap(), csDensity(), or csVolcanoMatrix().
We used the following commands as a template:> myGeneIds<-("GeneId1", "GeneId2", "GeneId3")> myGenes<-getGenes(cuff_data,myGeneIds) > csHeatmap(myGenes)RESULTS: Note: For all results, q1 and q2 are different read orientations of the same sequence. They were generated by the QC pipeline from Illumina reads in the fastq file in relation to the reference genome. FROM QC PIPELINE: cutadapt for rcbcrcprBC in R1 and rcbcrcprAC in R2 in paired-end reads# – trim for reverse complement of barcode plus reverse complement of prBC if R1# – trim for reverse complement of barcode plus reverse complement of prAC if R2Figure 1: R Graphics Volcano PlotThis figure is the result of the CummeRbund visualization of the data. Our variants q1 and q2 showed variable differences with small percentages of significant plot points. As can also be seen, most of the points on the plot are deemed “not significant”. It can also be seen that the read orientations are the inverse of each other, flipped across the Y axis. This plot was unable to give significant insight into treating the wine grapes for the powdery mildew issue. Figure 2: R Graphics Density PlotThis figure is the result of the CummeRbund visualization of the data.
Our variants q1 and q2 showed similar bell curves where q2 was contained within the bounds of q1. It can be seen that the read orientations, although seemed almost as an inverse of each other in the volcano plot, they are nearly identical in this density plot. This tells us that something in the pipeline must have gone wrong because they should not be nearly identical here.Figure 3: Heat Map This figure is the result of the CummeRbund visualization of the data. Q1 on the right shows differentially expressed genes, whereas Q2 the genes are not very expressed at all. For Q1 the CUFF.
10 is the most highly expressed, followed by CUFF.1 then CUFF.100. This shows that something did definitely go wrong in our pipeline because q2 should not have an expression level of zero. We are unsure of the error that caused this, but from this we are unable make any real conclusions as we cannot compare q1 and q2.
CONCLUSION:After using various bioinformatics tools and common programming/scripting languages, we created a pipeline to visualize differential RNA-seq expression data. In Figure 1 above, an example of the csVolcanoPlot that shows the gene variants to be somewhat different from each other and perhaps insignificant. The density plot in Figure 2 demonstrates the distribution of the RNA-seq read counts. It is noteworthy to see how the density of q1 is slightly greater than q2 with a similar bell curve.
The heat map in Figure 3 demonstrates that there was differential expression in q1 but not q2 meaning that the read orientation of q1 would be better for promoting solutions to suppress powdery mildew. Overall analysis of the data showed that there is insufficient evidence to promote solutions to suppress powdery mildew.REFERENCES:Alberts, B. et al. (1994) Molecular Biology of the Cell, 3rd ed.
, New York: Garland Publishing, Inc.Xiang. C and Brownstein MJ. (2003) Methods in Molecular Biology, Vol 224: Functional Genomics: Methods and Protocols.
Totowa, NJ: Humana Press, Inc.