RB-TnSeq Library Production Analysis

Description

This document describes data processing and plotting steps for analysis of a newly created RB-TnSeq (Deutschbauer 2015) library. It requires barcode pool files produced by the TnSeq shell pipeline TnSeqPipe.sh from fastq reads and an organized species gene list for the species in which the library was created, in this case, Cupriavidus necator H16 and Oligotropha carboxidovorans OM5.

The first part of this document (link) parses the pooled mutants to visualize transposon insertion using the package “BioCircos”, a derivative of Circos. The second part, (link), analyzes systemic transposon insertion bias.

# Load packages

library(rmarkdown)
library(BioCircos)
library(RColorBrewer)
library(tidyverse)
library(ggpubr)
library(tidyr)
library(zoo)
library(data.table)

#Step 1: Loading processed data Read the data from TnSeq Pipeline parsing output. Pool_file contains loci of insertion and number of reads associated with the corresponding barcode. Genes.GC contains a short description and the GC content of all genes in the organism. Pool_file.hit lists genes that received at least one insertion. It will be used later.

#Move to the working directory of the TnSeq pipeline output
setwd("~/Kyle/RalstoniaTnSeq")
poolfile <- read.csv(file="pool_file",header=TRUE,sep="\t",stringsAsFactors=FALSE)
genesfile <- read.csv(file="genes.GC",header=TRUE,sep="\t",stringsAsFactors=FALSE)
hitfile <- read.csv(file="pool_file.hit",header=TRUE,sep="\t",stringsAsFactors=FALSE)

#Step 2: Information about the pool

## [1] "70875 unique barcodes in the pool"

## [1] "6814 genes in this organism"

## [1] "5919 genes with one transposon within central 10-90% of bp"

## [1] "895 genes without central insertions"

#Step 3: Reformat annotations

# Subset pool data to reduce memory usage
poolsubset <- poolfile %>% select(scaffold,pos,nTot)
rm("poolfile")

# Reformat genes.GC annotation to reduce algorithmic replacement load
map2 <- data.frame(scaffold=c("NC_008313","NC_008314","AY305378"),id=c("Ch1","Ch2","pHG1"),stringsAsFactors=FALSE)
genessubset<-genesfile %>% select(scaffoldId,begin,end,strand,desc,GC)
genessubset$length <- abs(genessubset$begin - genessubset$end)
hitssubset<-hitfile %>% select(scaffoldId,locusId,desc,nStrains,nReads)

Next, tables are reformatted and reannotated to allow joining.

# Rename chromosomes in pool and annotation file

genessubset<-
genessubset %>%
  rename(scaffold=scaffoldId)

genessubset <-
genessubset %>%
  left_join(map2,by="scaffold") %>%
  mutate(scaffold = id) %>%
  select(-id)

poolsubset <-
poolsubset %>%
  left_join(map2,by="scaffold") %>%
  mutate(scaffold = id) %>%
  select(-id)

hitssubset <-
hitssubset %>%
  rename(scaffold=scaffoldId) %>%
  left_join(map2,by="scaffold") %>%
  mutate(scaffold = id) %>%
  select(-id)

# Reformat and sort the pool file by position
poolsubset <- poolsubset[order(poolsubset[,1],poolsubset[,2],decreasing=FALSE),]

Pool and gene datasets are joined by placing insertion loci within gene start and end sites. All insertions in unannotated genes and intergenic regions are called “NA” for now.

poolsubset$pos <- as.numeric(as.character(poolsubset$pos))
x = data.table(poolsubset)
y = data.table(genessubset)
#dummy begin/end columns are created in the pool file to allow foverlap function
x$begin <- x$pos
x$end <- x$pos
setkey(y,scaffold,begin,end)
annotatedpool <- data.frame(foverlaps(x,y,by.x=c("scaffold","begin","end"),type="within"))

#Interactive Circos Plot

#Subset chromosomes for circos plotting
Ch1df = annotatedpool[grep("Ch1", annotatedpool[,1]),]
Ch2df = annotatedpool[grep("Ch2", annotatedpool[,1]),]
pHG1df = annotatedpool[grep("pHG1", annotatedpool[,1]),]
Ch1Genes = genessubset[grep("Ch1", genessubset[,1]),]
Ch2Genes = genessubset[grep("Ch2", genessubset[,1]),]
pHG1Genes = genessubset[grep("pHG1", genessubset[,1]),]

#Define the genome
Genome = list("Ch1"=4052032,"Ch2"=2912490,"pHG1"=452156)

#Format circos plot
tracks = BioCircosTracklist()
barcolor = c('#e41a1c','#4daf4a','#377eb8')

dflist <- list(Ch1df,Ch2df,pHG1df)
geneslist <- list(Ch1Genes,Ch2Genes,pHG1Genes)

#instantiate lists and values (bin width, graph range) to be used in the plotting loop
binwid=20000
startlist <- list(seq.int(0, unlist(Genome[[1]]), by = binwid))
endlist <- list(seq.int(binwid, unlist(Genome[[1]])+binwid, by = binwid))
genesbreaks <- rbind(geneslist[[1]]$begin,geneslist[[1]]$end)



ReadsMax = 200000
InsertionBinMax= 750
GeneInsertionMax = 100

tracks = BioCircosTracklist()
barcolor = c('#e41a1c','#4daf4a','#377eb8')

#Loop through chromosomes, producing each plot for each chromosome.
for (i in 1:length(Genome)){
#For preset color palettes
#barColor = colorRampPalette(brewer.pal(3,"Pastel2"))(length(Genome))[i];
#Using predecided color hash
bcolor = barcolor[i]

#create arbitrary bins
startlist[i] <- list(seq.int(0, unlist(Genome[[i]]), by = binwid))
endlist[i] <- list(seq.int(binwid, unlist(Genome[[i]])+binwid, by = binwid))
#create gene-specific bins, accounting for the "0" edge case. intergenic regions are given their own bins, called "intergenic"
genesbreaks <- c(0,rbind(geneslist[[i]]$begin,geneslist[[i]]$end))
#account for "edge cases" where genes overlap by attributing insertions to the first 'genological' gene
for(z in 2:length(genesbreaks)){
  if(genesbreaks[z] < genesbreaks[z-1]){
    genesbreaks[z]=genesbreaks[z-1]
  }
  if(genesbreaks[z] == genesbreaks[z-1]){
    genesbreaks[z]=genesbreaks[z-1]+1
  }
}
dflist[[i]]$genesbins <- cut(dflist[[i]]$pos, breaks = unlist(genesbreaks))
dflist[[i]]$bins <- cut(dflist[[i]]$pos, breaks = unlist(startlist[i]))

dflist[[i]]$length = dflist[[i]]$end - dflist[[i]]$begin

#Circos bar graph of binned insertions
tracks = tracks + BioCircosBarTrack(paste0("bars",i), chromosome = names(Genome)[i], starts =startlist[[i]], ends=endlist[[i]], values=as.data.frame(table(dflist[[i]]$bins))$Freq, labels=rep("Tn5 insertions",length(unique(dflist[[i]]$bins))), range=c(0,InsertionBinMax), color = bcolor)

#Inner circos heat map of binned Illumina reads
tracks = tracks + BioCircosHeatmapTrack("heatmap1", chromosome = names(Genome)[i], starts =startlist[[i]], ends=endlist[[i]], values=as.data.frame(aggregate(nTot ~ bins, data = dflist[[i]],sum))$nTot, labels=rep("Illumina reads",length(unique(dflist[[i]]$bins))), range=c(0,ReadsMax), minRadius = 0.3, maxRadius = 0.45)

#Outer Circos heat map (by read number) binning data by genes - high computational load. Displays reads
#tracks = tracks + BioCircosHeatmapTrack("heatmap2", chromosome = names(Genome)[i], starts = head(genesbreaks,-1), ends = tail(genesbreaks,-1), values = as.data.frame(aggregate(nTot ~ genesbins, data = dflist[[i]],sum))$nTot, range=c(0,5000), labels = c(unlist(rbind("intergenic", geneslist[[i]]$desc)),"intergenic"), minRadius = 1.2, maxRadius = 1.5)

#Circos heat map of insertions per gene/intergenic region, displayed on the outmost ring... requires a higher graph resolution to distinguish bins.
tracks = tracks + BioCircosHeatmapTrack("heatmap2", chromosome = names(Genome)[i], starts = head(genesbreaks,-1), ends = tail(genesbreaks,-1), values=as.data.frame(table(dflist[[i]]$genesbins))$Freq, range=c(0,50), labels = c(unlist(rbind("intergenic", geneslist[[i]]$desc)),"intergenic"), minRadius = 1.2, maxRadius = 1.5)
}

#Color the background of the bar graph
tracks = tracks + BioCircosBackgroundTrack("bars_background",colors="#B3E6FF")

#Cupriavidus necator / Ralstonia eutropha RB-TnSeq Library

#Plot with circos
BioCircos(tracks,genomeFillColor = barcolor, genome = Genome, genomeTicksDisplay=F,genomeLabelDy=0, genomeLabelTextSize="12pt", width=850,height=850)

The Circos graph displays a clear bias for both transposon insertion likelihood and Illumina read density for the termini of the chromosomes, a bias towards insertion in pHG1 and against insertion in Ch2. Sources of bias in RB-TnSeq library production include transposon insertion site basepair preferences, and the copy number of chromosomes and plasmids within the organism. These biases must be described and accounted for in downstream analyses. The next step in the process is statistical treatment of essentiality.

One method to visualize the reliability of the library is to determine the likelihood of transposon insertion into the chromosomes taking into account each of these biases by calculating the insertion density along each chromosome in rolling windows. Next, several randomly sampled pool files can be created to compare against the original. Rolling window averages decrease the chance of overfitting, depending on window size. Some scientists produce an HMM with essential and non-essential states, I think the best method would be a combination of both simulations and the construction of an HMM that calculates a scalar (insertion density) by incorporating sliding windows of various sizes and local sequence content of insertion sites.

insdf = list()
SimWithBiasdf = data.frame()
SimRandomdf = data.frame()
testing2 = data.frame()
statsStated = list()
stats = list()
#loop through the chromosomes
for(i in 1:length(Genome)){
  insertionchr <- as.data.frame(table(factor(dflist[[i]]$pos, levels = 0:as.numeric(Genome[i])))) %>% rename(pos=Var1) %>% mutate(pos=as.numeric(pos))
  insertionchr$chr <- i
  #calculate insertion density in "rolling windows" of 200bp, 1 bp at a time.
  insertionchr$ExpRollInsertions = rollapply(insertionchr$Freq,width=200,FUN=mean,by=1,partial=TRUE)

  samplingdf <- insertionchr
  #Resample the dataset one chromosome at a time by producing simulated insertions based on the rolling insertion density calculation. With 200bp sliding windows, this produces simulated datasets overfit to the experimental data, accounting for local and distant contributors to bias, and random biases. Do so for simulated libraries of 10k - 100k insertions within each chromosome.
  for(k in seq.int(10000, 100000, by = 10000)){
    kk <- paste0(k," samples")
    simnum <- paste0(k," simulated insertions")
    simulatedf <- data.frame()
    simulatedf <- sample(insertionchr$pos, k, replace=TRUE, prob=insertionchr$ExpRollInsertions) %>% table %>% enframe %>% rename(pos=name) %>% mutate(pos=as.numeric(pos)) %>% right_join(insertionchr, by="pos")

    simulatedf <- simulatedf %>% mutate(value=replace_na(simulatedf$value,0))

    #calculate insertion density in gene-size chunks for the simulation
    rollingdensity <- rollapply(simulatedf$value,width=385,FUN=mean,by=192,partial=TRUE)
    #produce an empirical cumulative distribution function of the insertion density across this particular chromosome
    percent0 <- ecdf(rollingdensity)
    #save the percent chance of 0 insertion density in a gene sized chunk with k simulated insertions
    stats <- c(stats,percent0(0))
    #as string
    statsStated <- c(statsStated,paste("Simulation using bias shows", percent0(0)*100, "% essential in", simnum, "in chromosome", i))

    simulatedf$rolling = rollapply(simulatedf$value,width=200,FUN=mean,by=1,partial=TRUE)

    simulatedf <- simulatedf %>% mutate(value=replace_na(simulatedf$value,0)) %>% rename(!!simnum := rolling)

    samplingdf <- merge(samplingdf,as.data.frame(simulatedf))
    }

  #Resample the dataset a second time, with a true random insertion rate, should this be done including overall positional bias by normalizing (as in TraDIS toolkit? Parkhill, Bioinformatics 2016). Do so for simulated libraries of 10k - 100k insertions in each chromosome.
  samplingdf2 <- insertionchr
  for(k in seq.int(10000, 100000, by = 10000)){
    kk <- paste0(k," samples")
    simnum <- paste0(k," simulated insertions")
    simulatedf <- data.frame()
    simulatedf <- sample(insertionchr$pos, k, replace=TRUE) %>% table %>% enframe %>% rename(pos=name) %>% mutate(pos=as.numeric(pos)) %>% right_join(insertionchr, by="pos")

    simulatedf <- simulatedf %>% mutate(value=replace_na(simulatedf$value,0))

    #calculate 0 insertion density tail of rolling insertion density distribution in gene-sized windows of 385bp
    rollingdensity <- rollapply(simulatedf$value,width=385,FUN=mean,by=192,partial=TRUE)
    percent0 <- ecdf(rollingdensity)
    statsStated <- c(statsStated,paste("Random simulation shows", percent0(0)*100, "% essential in", simnum, "in chromosome", i))

    #calculate insertion density in rolling windows for the simulated dataset
    simulatedf$rolling = rollapply(simulatedf$value,width=200,FUN=mean,by=1,partial=TRUE)

    #simulatedf <- simulatedf %>% mutate(value=replace_na(simulatedf$value,0)) %>% rename(!!kk := value, !!simnum := rolling)

    simulatedf <- simulatedf %>% mutate(value=replace_na(simulatedf$value,0)) %>% rename(!!simnum := rolling)

    samplingdf2 <- merge(samplingdf2,as.data.frame(simulatedf))
      }

  samplingdfgathd <- gather(samplingdf, samplenum, ins, -pos, -chr)
  samplingdfgathd2 <- gather(samplingdf2, samplenum, ins, -pos, -chr)

  SimWithBiasdf <- rbind(SimWithBiasdf,samplingdfgathd)
  SimRandomdf <- rbind(SimRandomdf,samplingdfgathd2)
  testing2 <- rbind(testing2, insertionchr)
}

SimWithBiasdf <- SimWithBiasdf %>% filter(samplenum!="Freq")

SimRandomdf <- SimRandomdf %>% filter(samplenum!="Freq")

write.csv(stats,file=paste0("Simulation0TailStats",Sys.time()))
write.csv(SimWithBiasdf, file=paste0("SimWithBias",Sys.time()))
write.csv(SimRandomdf, file=paste0("RandomSimulation",Sys.time()))

Plotting simulations against experimental data. Let’s look specifically at Ch2 and simulations around the experimental dataset and the ideal dataset.

#Simulations using bias across chromosomes, displayed as a line graph of insertion density in 200bp windows.
#Fig <- ggplot(data=SimWithBiasdf,aes(x=pos,y=ins, color=chr))
#Fig <- Fig + geom_line() + facet_grid(chr ~ samplenum, scales="free_x") + scale_y_continuous() + scale_color_manual(values=c("coral", "seagreen", "midnightblue"))

#This doesn't look great, so we will bin the line graph into hex

#Visualizing simulations using bias across chromosomes, displayed as insertion density in 200bp windows, line graph binned into hexagons for easier viewing

Figframe <- SimWithBiasdf %>% filter(chr==2) %>% filter(samplenum==c('20000 simulated insertions','80000 simulated insertions'))
Figframe <- Figframe %>% mutate(graphwidth=max(pos))
Fig <- ggplot(data=Figframe, aes(x=pos, y=ins)) + geom_hex() + geom_blank(aes(x=graphwidth)) + facet_wrap(. ~ samplenum) + ylab("Insertion Density (385bp windows)") + xlab("chromosome position (bp)")
Fig

#Hex graphs are high computational load. We need a better way of visualizing this data. Will try in a StreamLit app.

#Visualizing random simulations across chromosomes, displayed as rolling density in 200bp windows, line graph binned into hexagons for easier viewing. Difficult to see much but noise.
Figframe <- SimRandomdf %>% filter(chr==2) %>% filter(samplenum==c('20000 simulated insertions','80000 simulated insertions'))
Figframe <- Figframe %>% mutate(graphwidth=max(pos))
Fig <- ggplot(data=Figframe, aes(x=pos, y=ins)) + geom_hex() + geom_blank(aes(x=graphwidth)) + facet_wrap(. ~ samplenum) + ylab("Insertion Density (385bp windows)") + xlab("chromosome position (bp)")
Fig

#Whole dataset
#Figframe <- SimRandomdf %>% filter(samplenum!="value")
#Fig <- ggplot(data=Figframe, aes(x=pos, y=ins, color=chr)) + geom_hex() + facet_wrap(chr ~ samplenum, scales="free") + scale_y_continuous(trans='log10') + scale_color_manual(values=c("coral", "seagreen", "midnightblue"))

# Next let's plot the distribution of insertion densities across chromosome 2. Then we can integrate the area under the curve for the areas with 0 insertion density to find the chance of an "essential" call first with bias:
Figframe <- SimWithBiasdf %>% filter(chr==2) %>% filter(samplenum==c('20000 simulated insertions','80000 simulated insertions'))
Fig <- ggplot(data=Figframe, aes(x=ins))
Fig <- Fig + geom_density(aes(y=..scaled..),bw = .003) + facet_grid(. ~ samplenum) + scale_y_continuous() + xlim(c(0,0.2)) + xlab("insertion density (385bp windows)") + ylab("Normalized frequency of densities") + ggtitle("Distribution of Insertions simulated with 200bp rolling bias averaged as rolling windows")
Fig

#Entire dataset. Slow - too much data.
#Fig <- ggplot(data=SimWithBiasdf, aes(x=ins,color=chr))
#Fig <- Fig + geom_density(aes(y=..scaled..)) + facet_grid(chr ~ samplenum) + scale_y_continuous() + scale_color_manual(values=c("coral", "seagreen", "midnightblue")) + xlim(c(0,0.2))

#And with randomly simulated insertions:
Figframe <- SimRandomdf %>% filter(chr==2) %>% filter(samplenum==c('20000 simulated insertions','80000 simulated insertions'))
Fig <- ggplot(data=Figframe, aes(x=ins))
Fig <- Fig + geom_density(aes(y=..scaled..),bw = .003) + facet_grid(. ~ samplenum) + scale_y_continuous() + xlim(c(0,0.2)) + xlab("insertion density (385bp windows)") + ylab("Normalized frequency of densities") + ggtitle("Distribution of Randomly simulated Insertions averaged as rolling windows")
Fig

#Entire dataset. Slow - too much data
#Fig <- ggplot(data=SimRandomdf, aes(x=ins,color=chr))
#Fig <- Fig + geom_density(aes(y=..scaled..)) + facet_grid(chr ~ samplenum) + scale_y_continuous() + scale_color_manual(values=c("coral", "seagreen", "midnightblue")) + xlim(c(0,0.2))

We can see and calculate that completely random transposon insertion creates a distribution with a 5% likelihood of 0 insertions within a gene-sized chunk at ~30k insertions in the large chromosomes 1 and 2, while simulated insertion biased by empirical data levels off at around 70k insertions per chromosome.

As displayed above, our pool in the Hudson lab contains 70875 unique barcodes. Of these,

#Number in chromosome 1
paste(poolsubset %>% filter(scaffold=='Ch1') %>% nrow)

## [1] "41733"

#Number in chromosome 2
paste(poolsubset %>% filter(scaffold=='Ch2') %>% nrow)

## [1] "22811"

#Number in pHG1
paste(poolsubset %>% filter(scaffold=='pHG1') %>% nrow)

## [1] "6331"

## [[1]]
## [1] "Simulation using bias shows 22.0705993840322 % essential in 40000 simulated insertions in chromosome 1"
## 
## [[2]]
## [1] "Simulation using bias shows 19.5972518360578 % essential in 50000 simulated insertions in chromosome 1"
## 
## [[3]]
## [1] "Random simulation shows 2.17484008528785 % essential in 40000 simulated insertions in chromosome 1"
## 
## [[4]]
## [1] "Random simulation shows 0.857616678512201 % essential in 50000 simulated insertions in chromosome 1"
## 
## [[5]]
## [1] "Simulation using bias shows 26.044825313118 % essential in 20000 simulated insertions in chromosome 2"
## 
## [[6]]
## [1] "Random simulation shows 7.37640079103494 % essential in 20000 simulated insertions in chromosome 2"
## 
## [[7]]
## [1] "Simulation using bias shows 12.0594479830149 % essential in 10000 simulated insertions in chromosome 3"
## 
## [[8]]
## [1] "Random simulation shows 0 % essential in 10000 simulated insertions in chromosome 3"

All of this code can easily be rewritten for other RB-TnSeq libraries. For example, see below the creation of a circos visualization graph for

Oligotropha carboxidovorans.

#Information about the pool

## [1] "89643 unique barcodes in the pool"

## [1] "3593 genes in this organism"

## [1] "3048 genes with one transposon within central 10-90% of bp"

## [1] "545 genes without central insertions"

#new version

Oligotropha carboxidovorans RB-TnSeq Library

#Plot with circos
BioCircos(tracks,genomeFillColor = barcolor, genome = Genome, genomeTicksDisplay=F,genomeLabelDy=0, genomeLabelTextSize="12pt", width=850,height=850)

The Oligotropha dataset displays evidence of contamination with Ralstonia. The sequencing data was performed in parallel with the Ralstonia library and the Illumina multiplexing indices were not properly ligated. In fishing further Oligotropha data out of the unindexed file, Ralstonia reads mapped to some regions of Oligotropha’s genome. Its greater presence in the sequencing data meant that wherever Ralstonia regions mapped to Oligotropha genome, an artificial spike is introduced to TnSeq data.

Next, we will create a genome browser to examine this data at the gene level in Python.