Fluent’s novel PIPseq™ technology provides a simple, flexible, and cost-effective solution for any researcher, without requiring complex instrumentation or expensive consumables. When a PIPseq sample is sequenced, the result is a paired-end dataset in FASTQ format. PIPseeker, Fluent’s data analysis platform, takes these inputs and converts them to a counts table of barcodes and genes. It also performs clustering and differential expression analysis, which can be used to identify cell types in the sample.
For a detailed description, please see the PIPseeker User Guide.
What’s in a PIP?
Each PIP contains a hydrogel bead with millions of capture moieties. Each capture moiety contains a poly-T sequence that hybridizes with the poly-A tails of mRNA molecules. Each PIP is labeled by a specific barcode sequence, which enables it to be uniquely identified. Within each PIP, different captured mRNA molecules are identified using a molecular index sequence. The barcode sequences and molecular indexes are sequenced from the 5’ end and make up the read 1 (R1) FASTQ file. The cDNA fragment is sequenced from the 3’ end and makes up read 2 (R2).
Extracting PIP Identity
The first step in processing the FASTQ inputs is to identify the PIP each read came from. Sequencing errors can lead to an imperfect match between the sequencing results and the actual barcode on the bead. The barcode space is designed so that barcodes are sufficiently different from each other to tolerate some level of sequencing errors. Reads that could not be matched to a known barcode sequence are excluded from further analysis.
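As an illustration of error-tolerant barcode matching, here is a minimal Python sketch. It is not PIPseeker's implementation: the barcode length, the whitelist contents, the one-mismatch tolerance, and the function names are all assumptions chosen for the example. A read is assigned to a barcode only if exactly one known barcode lies within the allowed Hamming distance. Otherwise it is rejected.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed: str,
                  whitelist: Iterable[str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the single known barcode within max_mismatches, else None."""
    known = list(whitelist)
    if observed in known:  # an exact match always wins
        return observed
    hits = [bc for bc in known
            if len(bc) == len(observed)
            and hamming(observed, bc) <= max_mismatches]
    # Ambiguous (several hits) or unmatched (no hits) reads are excluded.
    return hits[0] if len(hits) == 1 else None


# Hypothetical 8-nt barcodes for illustration only, not real PIPseq barcodes.
WHITELIST = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]

exact = match_barcode("AAAACCCC", WHITELIST)      # perfect match
corrected = match_barcode("AAAACCCG", WHITELIST)  # one sequencing error
rejected = match_barcode("TTTTTTTT", WHITELIST)   # too far from any barcode
```

Because the hypothetical barcodes differ from each other at many positions, a single error still points to only one candidate. This is the property the barcode space design relies on.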
Removing Non-cDNA Sequences
The sequenced cDNA fragment derived from the construct described above is flanked by the template switch oligo (TSO) on the 5’ end and a poly-A sequence on the 3’ end. In some cases, particularly when the captured mRNA sequence is short, the TSO and poly-A sequences may be present in the R2 file. Since those sequences are not genomic in origin, PIPseeker removes them prior to genome alignment.
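The trimming step can be sketched in Python as follows. This is a simplified illustration under several assumptions, not PIPseeker's actual procedure. The TSO sequence is a made-up placeholder, matching is exact (no mismatches tolerated), the orientation in which the TSO and poly-A appear in the read is assumed for simplicity, and the minimum poly-A run length of 10 is arbitrary.

```python
import re

# Placeholder sequence for illustration only; NOT the real TSO sequence.
HYPOTHETICAL_TSO = "ACGTTGCAACGT"


def trim_read(seq: str,
              tso: str = HYPOTHETICAL_TSO,
              min_poly_a: int = 10) -> str:
    """Remove non-cDNA sequence from a read (simplified, exact matching)."""
    # Drop everything up to and including the TSO, if it is present.
    idx = seq.find(tso)
    if idx != -1:
        seq = seq[idx + len(tso):]
    # Truncate at the first run of at least min_poly_a consecutive A bases.
    match = re.search("A{%d,}" % min_poly_a, seq)
    if match:
        seq = seq[:match.start()]
    return seq


insert = "GATTACAGATTACC"
read = HYPOTHETICAL_TSO + insert + "A" * 12 + "CC"
trimmed = trim_read(read)        # only the cDNA insert remains
untouched = trim_read("GATTACC")  # reads without TSO or poly-A are unchanged
```

Note that a cDNA insert ending in one or more A bases would lose those bases under this simple rule, since they are indistinguishable from the start of the poly-A run.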
Aligning to a Reference Genome
Alignment is the process of mapping reads to a reference genome. PIPseeker uses STAR, a popular tool for splice-aware RNA alignment. The alignment process results in a BAM file output, containing all the mapped reads and their corresponding genomic locations. Those locations can be a gene (exon or intron) or a non-gene location in the genome. Only transcriptomic reads, i.e., reads aligned to a gene, are considered when constructing the count matrix.
Constructing the Transcript Count Matrix
The count matrix, also known as the feature-barcode matrix, is a sparse representation of the count of all barcode-gene combinations in the sample. Each element in the matrix represents the number of transcripts for a single gene and barcode combination. PIPseeker derives the count matrix by parsing the alignment results from STAR. The matrix containing all the resolved barcodes and their associated gene counts is referred to as the “raw matrix”.
Due to amplification during library preparation, multiple copies of each molecule are fed into the sequencer, so they need to be merged into a single transcript source. This process is known as deduplication, and results in an accurate count of the transcripts that were originally captured on the bead. For example, the image below shows three molecules on the same bead (same barcode). Capture moieties with molecular indexes 1 and 3 happened to capture transcripts from the same gene. After amplification and sequencing, five and three reads were sequenced from indexes 1 and 2, respectively, whereas only one copy was sequenced from index 3. The reads from indexes 1 and 2 are collapsed into a single transcript count from each, resulting in a count of 2 for Gene 1 and a count of 1 for Gene 2.
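The worked example above can be reproduced with a short Python sketch. It assumes each aligned read has already been reduced to a (barcode, molecular index, gene) triple and that molecular indexes are compared exactly. The source does not describe how PIPseeker handles sequencing errors within molecular indexes, so none are modeled here. The names `deduplicate`, `BC1`, and `MI1` to `MI3` are illustrative.

```python
from collections import defaultdict


def deduplicate(reads):
    """Count distinct molecular indexes per (barcode, gene) pair.

    The returned dict stores only non-zero entries, so it also acts as a
    simple sparse representation of the count matrix.
    """
    indexes = defaultdict(set)
    for barcode, molecular_index, gene in reads:
        indexes[(barcode, gene)].add(molecular_index)
    return {key: len(seen) for key, seen in indexes.items()}


# Reads from the example: one bead (barcode BC1), three captured molecules.
reads = (
    [("BC1", "MI1", "Gene1")] * 5    # five amplified copies of molecule 1
    + [("BC1", "MI2", "Gene2")] * 3  # three amplified copies of molecule 2
    + [("BC1", "MI3", "Gene1")]      # a single copy of molecule 3
)

counts = deduplicate(reads)
```

Nine reads collapse to three transcripts: two for Gene 1 (indexes 1 and 3) and one for Gene 2 (index 2).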
After the raw matrix is created, the next step is to determine which barcodes represent cells captured in a PIP. The overall number of PIPs exceeds the number of input cells, leaving some PIPs empty. Empty PIPs capture ambient RNA at different steps in the workflow and are represented in the raw matrix with lower numbers of transcripts. Variability during cell lysis and library preparation can also lead to barcodes with low RNA content.
The key to cell calling is the barcode rank plot, sometimes called the “knee plot.” It shows barcodes ranked in descending order by (deduplicated) transcript count. Typically, the top mode of the plot includes barcodes from cell-containing PIPs. The rest of the plot is thought of as “background”.
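The data behind a barcode rank plot is simply the per-barcode transcript totals sorted in descending order. A minimal Python sketch, with made-up barcode names and counts:

```python
def barcode_rank(counts_per_barcode):
    """Return (rank, barcode, transcript_count) tuples, highest count first."""
    ranked = sorted(counts_per_barcode.items(),
                    key=lambda item: item[1],
                    reverse=True)
    return [(rank, barcode, count)
            for rank, (barcode, count) in enumerate(ranked, start=1)]


# Hypothetical deduplicated transcript totals per barcode.
totals = {"BC_A": 5000, "BC_B": 12, "BC_C": 4200, "BC_D": 3}

ranks = barcode_rank(totals)
```

Rank plots are commonly drawn with both rank and transcript count on logarithmic axes. In this toy data, the large gap between the second and third entries is the kind of drop-off that separates cells from background.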
The example above, taken from a mixed sample of human and mouse cells (HEK / 3T3), shows a straightforward separation of cells and background. The cell portion is relatively flat and drops off sharply. In this case an automated algorithm can easily find the inflection point that separates the two barcode populations. However, the shape of the rank plot can vary substantially between different sample types, because some cell types naturally have lower RNA expression levels than others. Such low-RNA content cells will be concentrated in the middle area between the two main modes, and the rank plot for such samples may show less distinction between cells and background. The example below shows a rank plot from a PBMC (human blood isolate) sample. The high-RNA portion of the plot has a much wider range of transcript counts, and may in fact be multimodal, making it difficult to automatically determine an appropriate stopping point for cell calling.
Some algorithms attempt to refine cell calling and separate the barcodes in the ambiguous “knee” area of the plot into cells and background. Our experience suggests that this approach does not always correctly separate the two populations. PIPseeker employs a different cell calling method that allows the user more control over the number of called cells. It selects cells at five different sensitivity levels, corresponding to five different stopping points on the rank plot. This allows the user to select the most appropriate threshold. What is “most appropriate” can vary between experiments. The user may wish to prioritize high-RNA content cells for careful characterization, or include low-RNA content cell types at the expense of increased noise due to background RNA. The example below shows cell calling for the same PBMC sample, with green indicating the cell barcodes. Please see the PIPseeker User Guide for detailed information on the cell calling algorithm.
PIPseeker outputs count matrices including only the cell barcodes for each sensitivity level. Those are known as filtered matrices, and can be used as input to other downstream analysis tools.
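Producing a filtered matrix amounts to keeping only the entries of the raw matrix whose barcode was called as a cell. The Python sketch below assumes the raw matrix is held as a dict keyed by (barcode, gene). As a stand-in for cell calling, it takes the top N barcodes by total transcript count. That cut-off rule and the function names are assumptions for illustration only. PIPseeker's actual cell calling algorithm is described in the User Guide.

```python
def top_n_barcodes(raw, n):
    """Return the n barcodes with the highest total transcript count."""
    totals = {}
    for (barcode, _gene), count in raw.items():
        totals[barcode] = totals.get(barcode, 0) + count
    return sorted(totals, key=totals.get, reverse=True)[:n]


def filter_matrix(raw, cell_barcodes):
    """Keep only the matrix entries that belong to called cell barcodes."""
    keep = set(cell_barcodes)
    return {(barcode, gene): count
            for (barcode, gene), count in raw.items()
            if barcode in keep}


# Hypothetical raw matrix: (barcode, gene) -> deduplicated transcript count.
raw = {
    ("BC_A", "Gene1"): 40,
    ("BC_A", "Gene2"): 10,
    ("BC_B", "Gene1"): 1,   # low-count barcode, likely background
    ("BC_C", "Gene2"): 30,
}

cells = top_n_barcodes(raw, 2)
filtered = filter_matrix(raw, cells)
```

Calling `filter_matrix` once per stopping point yields one filtered matrix per sensitivity level.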
After selecting cell barcodes, it is common to conduct a clustering analysis in order to quantify and visualize heterogeneity within the cell population and identify different cell types. Clustering is the basis of many downstream analysis tools for cell type identification, such as Seurat. PIPseeker implements its own clustering algorithm, which is described in detail in the User Guide. The results include a visualization of the clusters in UMAP space, as for the PBMC sample below, and a table of the top expressing genes for each cluster, which can be used to identify cell types.
PIPseeker is an easy-to-use solution, included in the purchase of your Fluent kit(s) or service, that streamlines the analysis of your data in an efficient and scalable manner. For further questions on PIPseq or PIPseeker, contact us at firstname.lastname@example.org.