
SINGuLAR™ Analysis Toolset

User Guide
PN 100-5066 B1

Copyright 2013 Fluidigm Corporation. All rights reserved.

Limited License for SINGuLAR Analysis Toolset

The SINGuLAR Analysis Toolset is a shared-source, proprietary data analysis resource for Fluidigm customers interested in analyzing or developing software for single-cell gene expression data generated on Fluidigm technology. It is comprised of unsupported software development resources, including R-scripts, documentation and reference data. Registered users of the SINGuLAR Analysis Toolset may use the code contained in this file in accordance with the terms set forth in sections 1 through 8 below. You may register to use the toolset at the following address: http://www.fluidigm.com/singular-sc-analysis-toolkitrequest.html. Unregistered users or users whose registration has not been confirmed with a receipt at the aforementioned website have no rights or permission to use this code.

1. Use of the code in source and binary forms, with or without modification is permitted solely in accordance with section 3 below.

2. Redistribution of the code in source and binary forms, with or without modification is permitted only to employees and agents of entities named as registered users of the SINGuLAR Analysis Toolset. Redistribution, whether in source or binary form must include this license statement.

3. Any use must be in conjunction with a Fluidigm product. Any use with a Fluidigm product may also be in conjunction with data from any source, including products from other vendors. In any case, the code may not be used in conjunction with any product similar to the Fluidigm BioMark Real-Time PCR System that is made by another entity.

4. Any redistribution and use shall be in accordance with the laws and export regulations of the United States of America. Under no circumstances shall code be distributed to or used by persons listed on the Denied Persons List maintained by the United States Department of Commerce, or be distributed to or used or executed in a country listed on the Export Control List, List of Extensively Embargoed Countries, or List of Targeted Sanctions Countries and Territories maintained by the United States Department of Commerce; appropriate measures shall be taken to ensure that recipients will also refrain from distribution to such parties.

5. Fluidigm will not provide, and is not responsible for providing any end-user support.

6. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

7. This license, and all matters relating to use of the SINGuLAR Analysis Toolset shall be governed by and interpreted in accordance with the law of the State of California except for its choice of law rules. For any disputes arising out of this Agreement, the parties consent to the personal and exclusive jurisdiction of, and venue in, the state and federal courts within San Mateo County, California.

8. This license constitutes the entire agreement between you and Fluidigm Corporation. This license may only be amended or supplemented by a writing that refers explicitly to this Agreement and that is signed by duly authorized representatives of both parties.

Information in this manual is subject to change without notice. Fluidigm assumes no responsibility for any errors or omissions. In no event shall Fluidigm be liable for any damages in connection with or arising from the use of this manual. Fluidigm, the Fluidigm logo, BioMark, C1, DELTAgene, Dynamic Array, FC1, and SINGuLAR are trademarks or registered trademarks of Fluidigm Corporation in the U.S. and/or other countries.

Contacting Fluidigm

By phone:
In the United States: 1.866.FLUIDLINE (1.866.358.4354)
Outside the United States: +1.650.266.6100

On the Internet:
www.fluidigm.com/support; techsupport@fluidigm.com

Fluidigm Corporation
7000 Shoreline Court, Suite 100
South San Francisco, CA 94080


Table of Contents

Section 1: Single-Cell Data Analysis
  Purpose of this Document
  The Nature of Single-Cell Transcription
  Transcriptional Bursting in Single Cells
  Replicates
  Identification and Use of Limit of Detection (LoD) and Log2Ex
  Limit of Detection
  Detection limit of the qPCR reaction
  Qualification of Assays Prior to Single-Cell Experiments
  Elimination of Cells or Genes from Subsequent Analysis
  Normalization
  Secondary Analysis

Section 2: The SINGuLAR Workflow
  Installing R and SINGuLAR
  Installing R
  Installing SINGuLAR
  Creating the SINGuLAR Directory for Data Analysis
  Preparing BioMark System Results
  Estimating the Limit of Detection (LoD) Ct Value
  Option 1: Experimental Determination of LoD
  Option 2: Iterative Determination of LoD
  Removing Failed Data Points and Low Expression Cells
  Loading and Analyzing Data for Single-Cell Experiment Results with SINGuLAR
  Single-Cell Data Analysis Performed Using Fluidigm SINGuLAR
  Violin Plots
  Hierarchical Clustering
  Principal Component Analysis (PCA)
  Loading and Individually Analyzing Data for Single-Cell Experiments
  To Calculate Log2Ex
  To Generate a Violin Plot
  To Generate a Hierarchical Cluster Heat map
  To Perform PCA
  Analyzing Multiple Chip Runs with SINGuLAR

Section 3: Appendices
  Appendix 1: Protocol for the Qualification of Assays
  Appendix 2: Removing Data Failed by Fluidigm Real-Time PCR Analysis Software
  Appendix 3: Eliminating Low-Expressing Cells from Subsequent Analysis
  Appendix 4: Normalizing Using Median Log2Ex
  Appendix 5: A Note on the Optimal Number of Cycles Needed for Preamplification
  References

Table of Figures

Figure 1: The single-cell workflow
Figure 2: ActB expression data; Fluidigm study
Figure 3: Data from Fluidigm experiment showing large fold-differences
Figure 4: Single-cell standard deviations
Figure 5: PCA showing subpopulations
Figure 6: Calculating LoD and Log2Ex
Figure 7: Comparison of IER3 transcripts
Figure 8: Compare Log2Ex levels of 10 genes in 75 single cells
Figure 9: Poisson distribution at average of 5 targets/chamber
Figure 10: Cutoff Ct three standard deviations below mean
Figure 11: Example where normalization does not greatly affect data analysis
Figure 12: The Analysis workflow
Figure 13: The SINGuLAR workflow
Figure 14: Spreadsheet exported as .csv file
Figure 15: Violin plots generated in R
Figure 16: Sample heat map
Figure 17: Scree and scatter plots
Figure 18: Selecting the Tm range

Section 1

Single-Cell Data Analysis

Purpose of this Document


Single-cell researchers use the Fluidigm BioMark System to measure gene expression levels for up to hundreds of genes in hundreds to thousands of samples. This document is a practical guide to the minimum steps required to obtain single-cell gene expression data with the BioMark System. Starting with background material on the nature of single-cell transcription, it takes the reader through a tutorial of data collection, preparation, and analysis. The fundamental steps in the single-cell workflow are shown in Figure 1.

Figure 1: The single-cell workflow

This document takes users through one particular path of the latter half of the single-cell workflow: qPCR detection, primary data processing, and secondary data analysis. The choices available at each step lie beyond the scope of this document but will provide topics for subsequent documentation.

The Nature of Single-Cell Transcription


Bengtsson et al. (2005) were among the first to use qPCR to quantify transcripts in single cells. They measured gene expression levels of five genes in individual cells from mouse pancreatic islets and found that the transcript levels of the different genes were lognormally distributed. Since a lognormal distribution is characterized by its geometric mean rather than its arithmetic mean, there are profound implications for the comparison of single-cell data to population data. In a lognormal distribution, the average expression level (arithmetic mean) observed for a population of cells is strongly biased by a few cells with a very high number of transcripts. Therefore, the average expression level does not reflect the expression level in a typical cell. The paper concluded, "Accordingly, it may not be valid to extrapolate results of gene expression measurements on cell populations to the single-cell level."
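The bias of the arithmetic mean can be illustrated with a few lines of Python. This is only a sketch: the transcript counts are invented for illustration and are not data from Bengtsson et al. or from Fluidigm, and Python is used here for convenience rather than the R scripts that make up SINGuLAR.

```python
from statistics import mean, median, geometric_mean

# Hypothetical transcript counts for ten cells: nine typical cells and one
# cell that happens to carry a very high number of transcripts.
transcripts_per_cell = [10, 10, 10, 10, 10, 10, 10, 10, 10, 1000]

arithmetic = mean(transcripts_per_cell)           # what a pooled (bulk) measurement reflects
geometric = geometric_mean(transcripts_per_cell)  # the natural summary of lognormal data
typical = median(transcripts_per_cell)            # the level found in a typical cell

print(f"arithmetic mean: {arithmetic:.1f}")
print(f"geometric mean:  {geometric:.1f}")
print(f"median:          {typical:.1f}")
```

In this made-up example a single high-expressing cell pulls the arithmetic mean roughly tenfold above the level seen in nine of the ten cells, while the geometric mean stays close to the typical value.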

The lognormal distribution means that data from single eukaryotic cells show cell-to-cell variation in mRNA amounts that ranges from 10-fold to 1,000-fold depending on the gene and type of cell. In the study cited, the levels of ActB transcript varied approximately 1,000-fold among the single cells analyzed. A Fluidigm replication of the study is shown in Figure 2 below.

Figure 2: ActB expression data; Fluidigm study

Fluidigm also ran a single-cell experiment on a 96.96 Dynamic Array integrated fluidic circuit (IFC), but analyzed a much larger number of genes. Data for 77 genes in 87 single human K562 cells showing large fold-differences between individual cells are presented in Figure 3 below.

Figure 3: Data from Fluidigm experiment showing large fold-differences

The Fluidigm experiment determined the number of genes exhibiting differential expression between individual cells, depicted here as fold-change (upper X-axis labels) and equivalent ΔCt values (lower X-axis labels). These results indicate that 10- to >500-fold variation in transcript levels should be expected when comparing individual cells.

Transcriptional Bursting in Single Cells


Data such as these, collected by several researchers, have led to the model that eukaryotic transcripts are produced in short but intense bursts interspersed with intervals of inactivity during which transcript levels decay. Raj et al. (2006) directly observed intrinsically random bursting of mRNA for two genes in CHO cells. Chubb et al. (2006) also observed this burst-and-decay behavior for the dscA gene in living Dictyostelium cells. For this gene, they measured a mean burst duration of 5.2 minutes and a mean interval of inactivity (presumably mRNA decay) of 5.8 minutes, but there was a great deal of stochastic variation in each of these averages.

This noise inherent in single-cell gene expression challenges conventional methods for obtaining and analyzing qPCR data. Factors such as replicates, data display, limits of detection, normalization, and univariate versus multivariate analysis need to be re-evaluated. Although one may think that this noise precludes the ability to get useful information from single cells, the reality is quite the opposite. By acknowledging and addressing the intrinsic noise (using appropriate statistical analysis methods), single-cell gene expression profiling can provide biological insights that are simply not visible when one is averaging expression levels from hundreds or thousands of cells.

Replicates
Another way to assess the variation observed in single cells is to look at the standard deviation of various transcript levels in a population of single cells. Figure 4 uses data from the Fluidigm experiment described earlier (using K562 cells) to depict the standard deviations observed for 77 genes in a population of 87 cells.

Figure 4: Single-cell standard deviations

Only two genes show a standard deviation of less than one cycle between single cells. For experiments run using bulk RNA on the BioMark System, the standard deviation observed for qPCR technical replicates is typically 0.16-0.25 cycle or less. Biological noise is thus greater than technical noise by a large amount. It is therefore better to focus on biological replicates rather than on technical replicates. Experimental bandwidth is thus better utilized by running more single-cell samples and by interrogating more genes than by running technical replicates of the single-cell samples or assays.


One way to restate the need for biological replicates is to say that data need to be collected from a statistically significant number of single cells in order to obtain reliable results. What is a statistically significant number of single cells? This is difficult to answer in absolute terms. Statistical significance depends not only on the number of cells, but also on other factors including the degree of variation within the population analyzed, the number of genes assayed, and the ability of those assays to differentiate the population variation.

Basic statistics would indicate that for any single gene, a homogeneous population can be characterized on the basis of 30 samples. Thus, if every subpopulation within a sample of single cells were represented by at least 30 cells, one would have reasonable confidence that the experiment would robustly identify all subpopulations. This would mean that if one wanted to reliably identify a subpopulation that was 10% of the total population, 300 cells would need to be examined.

In practice, subpopulations can be identified with fewer than 30 cells depending on the cells and genes being analyzed. Guo et al. (2010) analyzed 159 single cells from 64-cell stage mouse embryos, assaying 48 genes in each cell. A principal component analysis (PCA) from the study is shown in Figure 5.

Figure 5: PCA showing subpopulations


From Guo et al. (2010). Image reprinted with permission from Developmental Cell.

Guo et al. were able to clearly identify the epiblast (EPI) subpopulation, with only 17 cells in that subpopulation. They could do this because of the type of cells analyzed, the use of 48 genes, and the fact that those 48 genes revealed very distinct signatures between EPI, primitive endoderm (PE), and trophectoderm (TE) cells.
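The 30-cell rule of thumb discussed earlier in this section can be turned into a quick planning calculation. The sketch below is illustrative Python, not part of SINGuLAR; the function name and its default of 30 cells are assumptions taken from the rule of thumb, and the result is an expected count only (random sampling will sometimes yield fewer cells of the rare type).

```python
import math

def cells_needed(subpopulation_fraction, min_cells_per_subpopulation=30):
    """Total number of cells to examine so that a subpopulation making up
    `subpopulation_fraction` of the population is expected to contribute at
    least `min_cells_per_subpopulation` cells."""
    if not 0 < subpopulation_fraction <= 1:
        raise ValueError("subpopulation_fraction must be in (0, 1]")
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(min_cells_per_subpopulation / subpopulation_fraction, 9))

print(cells_needed(0.10))  # the 10% subpopulation example from the text
print(cells_needed(0.05))  # a rarer, 5% subpopulation
```

As the Guo et al. example shows, fewer cells can suffice when the assayed genes separate the subpopulations cleanly, so this figure is better read as a conservative planning number than as a hard requirement.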

Identification and Use of Limit of Detection (LoD) and Log2Ex


When qPCR experiments are run on bulk RNA samples, the results are typically displayed as fold-change differences between samples for each individual gene and known controls. Because of the extensive normal variation in a given gene at the single-cell level, looking at fold changes between individual cells is potentially not very informative. A better approach may be to first assess the population behavior for each gene. By assessing which genes display a lognormal distribution within the cell population under investigation, this type of first-pass analysis can provide the first significant insight to the unique biology of the cell population and dictate further, more directed analyses. This is best done by looking at histograms that bin expression levels and display the number of cells in each bin.

To generate such histograms, the expression for each gene must be comparable between different single-cell samples. One starts by calculating the limit of detection (LoD) and then computing Log2Ex values. Because of the lognormal distribution described by Bengtsson et al. (2005) and others, it is useful to view single-cell data as expression level above detection limit on a log scale. For qPCR data, it is convenient and appropriate to do this in log base 2 by defining the term Log2Ex:

Log2Ex = LoD Ct − Ct [Gene]
If the value is negative, Log2Ex = 0

Log2Ex represents transcript level above background expressed in log base 2. Conversion from a log scale to a linear scale can be accomplished by calculating 2^Log2Ex, which gives the fold change. These equations are expressed graphically in Figure 6. The value of each sample is subtracted from the LoD. In this example LoD = 22. Therefore, Ct values higher than 22 are assigned a Log2Ex value of 0.
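The Log2Ex definition above can be written out as a short calculation. This is a minimal Python sketch for illustration, not the SINGuLAR implementation; the function name and the sample Ct values are hypothetical, and only the LoD of 22 comes from the example in the text.

```python
def log2ex(ct, lod_ct):
    """Log2Ex = LoD Ct - Ct[Gene]; negative results are set to 0."""
    return max(lod_ct - ct, 0.0)

LOD_CT = 22.0  # the example LoD used in the text

# Hypothetical Ct values for one gene measured in four single cells
sample_cts = [15.0, 20.5, 22.0, 24.3]

log2ex_values = [log2ex(ct, LOD_CT) for ct in sample_cts]

# Converting back to a linear scale: fold change above the detection limit
fold_above_lod = [2 ** value for value in log2ex_values]

for ct, value, fold in zip(sample_cts, log2ex_values, fold_above_lod):
    print(f"Ct {ct:5.1f} -> Log2Ex {value:4.1f} -> {fold:6.1f}-fold above LoD")
```

Cells with a Ct at or above the LoD all receive a Log2Ex of 0, which is what allows undetected transcripts to be counted in the same histogram as detected ones.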


Figure 6: Calculating LoD and Log2Ex

The use of Log2Ex enables plotting the number of cells where the transcript level is at or below the detection limit. Figure 7 compares IER3 transcripts in 87 human K562 cells.

Figure 7: Comparison of IER3 transcripts

IER3 transcripts from 87 human K562 cells were plotted on a log (left) and linear (right) scale. No IER3 transcript was detected in 10 cells.

To compare histograms for multiple genes, it is convenient to use violin plots, which are essentially histograms turned on their side and mirrored. Violin plots from Guo et al. (2010), Figure 8 below, compare 10 genes in 75 single cells derived from 16-cell stage mouse embryos:


Figure 8: Compare Log2Ex levels of 10 genes in 75 single cells

Violin plots from Guo et al. (2010). Image reprinted with permission from Developmental Cell.

The violin plots reveal that seven genes have unimodal distributions and three (Id2, Nanog, Sox2) have bimodal distributions. The unimodal distributions indicate no detectable variation other than intrinsic noise. The bimodal distributions indicate that these three genes are differentially expressed in at least two subpopulations within these 75 cells. The vertical position of each histogram indicates the relative expression level. For example, ActB has the highest expression level among these 10 genes.

It is also possible to see that transcripts can have distributions of varying widths, distribution being an indicator of variation. For example, Pou5f1 has a much narrower distribution, or less variation on the Log2Ex axis, than Cdx2. This is because each gene has a characteristic transcriptional burst size, frequency, and decay rate.

If the histogram indicates two or more subpopulations, it is now possible to get meaningful average fold change values. For the Id2 gene in the violin diagram, the median Log2Ex value is roughly 7.5 for the higher expressing subpopulation and roughly 1.8 for the lower expressing subpopulation. Thus the ΔLog2Ex between these two subpopulations is about 7.5 − 1.8 = 5.7, which corresponds to a fold difference of 2^5.7, or approximately 50, in expression levels, on average.
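The Id2 arithmetic can be checked directly. The snippet below is an illustrative Python sketch; the helper name is hypothetical, and the two medians are the approximate values read from the violin plot as quoted in the text.

```python
def fold_difference(log2ex_high, log2ex_low):
    """Fold difference implied by a difference in Log2Ex (a log base 2 scale)."""
    return 2 ** (log2ex_high - log2ex_low)

# Approximate median Log2Ex values for the two Id2 subpopulations
id2_high_median = 7.5
id2_low_median = 1.8

id2_fold = fold_difference(id2_high_median, id2_low_median)
print(f"Delta Log2Ex = {id2_high_median - id2_low_median:.1f}, fold difference = {id2_fold:.0f}")
```

Because the medians are read off a plot to one decimal place, the result is best quoted as a rounded figure, as the text does.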

Limit of Detection
The Log2Ex calculation requires defining a limit of detection (LoD) Ct value. This raises the issue of defining the detection limit of qPCR. In fact, there are two separate questions:
1. What is the detection limit of the qPCR reaction by itself?
2. What is the detection limit of the overall process (going from single cell → RNA → cDNA → preamplified cDNA → qPCR reaction)?

Detection Limit of the qPCR Reaction

Based on digital PCR results using well-performing assays, it is clear that a single target DNA molecule in a reaction chamber will generate a positive amplification plot. That is why the theoretical limit of PCR is one molecule. A more stringent definition of detection limit, however, would incorporate some indication of the confidence of detecting a target. If a number of identical PCR reactions are performed at an average concentration of one target DNA molecule per reaction chamber, the Poisson distribution predicts that about 37% (e^-1) of the chambers will not receive a molecule and thus will not show a positive amplification plot. The chance of detection is therefore about 63%.

For stringent detection, at what concentration is there at least a 99% chance of generating a positive amplification plot? This occurs at an average concentration of five target molecules per reaction chamber, as shown by the Poisson distribution in Figure 9.
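The percentages quoted above follow from the Poisson probability of a chamber receiving zero molecules, P(0) = e^(-λ), where λ is the average number of targets per chamber. The Python sketch below reproduces them; the function names are hypothetical, and the calculation assumes, as the text does, that a single molecule in the chamber always yields a positive amplification plot.

```python
import math

def detection_probability(mean_targets_per_chamber):
    """Probability that a chamber holds at least one target molecule,
    assuming Poisson loading: 1 - exp(-lambda)."""
    return 1.0 - math.exp(-mean_targets_per_chamber)

def mean_targets_for_confidence(confidence):
    """Average targets per chamber needed to reach the given detection
    probability: lambda = -ln(1 - confidence)."""
    return -math.log(1.0 - confidence)

for lam in (1, 2, 3, 4, 5):
    print(f"{lam} target(s)/chamber on average -> {detection_probability(lam):.1%} detection")

print(f"99% detection needs about {mean_targets_for_confidence(0.99):.2f} targets/chamber")
```

The 99% threshold is crossed between four and five targets per chamber, which is consistent with the document's choice of five as a round, stringent value.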

Figure 9: Poisson distribution at average of 5 targets/chamber


Thus, a stringent definition of LoD would be the value that corresponds to five targets per reaction chamber, which in turn corresponds to a >99% chance of detection with one single-cell replicate. This stringent definition minimizes the number of false positives; however, it may exclude true positives. In other words, one can be very confident that a positive really is a positive, but some data may be excluded.

To explore the effects of sensitivity on results, data can be analyzed using different values for LoD, ranging from stringent to relaxed. For example, the data used in the workflow section of this document indicate that 22 cycles is a stringent LoD Ct value. Thus, Log2Ex values could be calculated using LoD = 22, 23, 24, or 25, and each data set then analyzed to see if altering stringency impacts conclusions.

In the single-cell gene expression workflow, qPCR reactions are preceded by preamplification of cDNA. Statistically, 18-20 cycles of preamplification will result in an average of five copies of target per chamber from a single copy of cDNA. Preamplification can have efficiencies close to 100%, as reported by Devonshire et al. (2011). More details on preamplification and its effect on target concentration are discussed in Section 3 (Appendices).

The foregoing discussion indicates that the single-cell protocol should be fairly robust even if only a single cDNA molecule is generated in the reverse transcription reaction on the mRNA from a single cell. Of course, the overall limit of detection is critically dependent on the efficiency of the reverse transcriptase. Furthermore, this efficiency probably varies depending on the transcript and the location of the assay amplicon within the transcript. Although reverse transcriptase efficiency deserves closer scrutiny, it will not be explored here. Also, the overall availability of RNA after cell lysis will have an effect on the limit of detection for single-cell gene expression.
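The stringency check described above (recomputing Log2Ex at LoD = 22, 23, 24, and 25) can be sketched as a simple loop. This is illustrative Python rather than SINGuLAR code; the cell names and Ct values are invented, and "detected" is taken here to mean Log2Ex greater than 0.

```python
def log2ex(ct, lod_ct):
    """Log2Ex = LoD Ct - Ct, floored at 0."""
    return max(lod_ct - ct, 0.0)

# Hypothetical Ct values for one gene in four single cells
cts = {"cell_1": 18.2, "cell_2": 22.6, "cell_3": 23.9, "cell_4": 26.0}

detected_by_lod = {}
for lod_ct in (22, 23, 24, 25):  # from stringent to relaxed
    values = {cell: log2ex(ct, lod_ct) for cell, ct in cts.items()}
    detected_by_lod[lod_ct] = sorted(cell for cell, v in values.items() if v > 0)

for lod_ct, cells in detected_by_lod.items():
    print(f"LoD Ct {lod_ct}: {len(cells)} cell(s) detected -> {cells}")
```

If downstream conclusions (for example, which subpopulations appear) are stable across these LoD settings, the exact choice of LoD is unlikely to matter for that data set.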

Qualification of Assays Prior to Single-Cell Experiments


There are two reasons to test assays on cDNA prepared from bulk RNA before embarking on analyzing single cells. First, when using DNA binding dye assays, such as DELTAgene™ Assays, the data are used to determine the correct Tm range for the amplicon generated by each assay. For this purpose, it is best to use bulk RNA from the same or similar cells as the single cells to be studied, so that splice variants will be the same as in the single cells. If bulk samples are not available, then appropriate tissue-specific or universal RNA or cDNA can be purchased from various vendors. Second, the data are used to estimate an LoD Ct value for use in data analysis.

These two properties, Tm and LoD Ct, are characteristics of the qPCR assay and not of the reverse transcriptase step or preamplification step. Therefore, this qualification test is performed using dilutions of preamplified cDNA in order to focus on the qPCR assays. For the purpose of empirically estimating an LoD Ct value, six replicates of each dilution concentration are run. For each assay, a preliminary LoD Ct is determined by taking the average Ct for the most dilute sample that has positive amplification plots for all six replicates.

Because of the approximate nature of this LoD Ct value, it is reasonable to use it for any additional primer pairs that are added to the experiment. The LoD Ct value is most drastically affected by platform. For any particular platform, however, the exact LoD Ct value is somewhat arbitrary and probably will not drastically impact the interpretation of a single-cell experiment. As discussed above, this can be tested by first using the stringent LoD Ct value, then increasing it in one-cycle increments and seeing how this affects the results.
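The rule for the preliminary LoD Ct (average Ct of the most dilute sample in which all six replicates amplify) can be sketched as follows. This is an illustrative Python example, not the procedure implemented in SINGuLAR or in the Fluidigm Real-Time PCR Analysis Software; the data layout, the use of None for a failed replicate, and the dilution values are all assumptions.

```python
from statistics import mean

def preliminary_lod_ct(dilution_series, replicates_required=6):
    """Return the mean Ct of the most dilute sample whose replicates all
    show positive amplification, or None if no dilution qualifies.

    dilution_series: list of (relative_concentration, [Ct or None, ...]),
    where None marks a replicate with no positive amplification plot.
    """
    qualifying = [
        (concentration, cts)
        for concentration, cts in dilution_series
        if len(cts) == replicates_required and all(ct is not None for ct in cts)
    ]
    if not qualifying:
        return None
    # The lowest relative concentration is the most dilute sample
    _, cts = min(qualifying, key=lambda item: item[0])
    return mean(cts)

# Hypothetical ten-fold dilution series for one assay, six replicates each
series = [
    (1.0,   [15.1, 14.9, 15.0, 15.0, 15.1, 14.9]),
    (0.1,   [18.3, 18.4, 18.2, 18.3, 18.5, 18.1]),
    (0.01,  [21.5, 21.7, 21.6, 21.8, 21.4, 21.6]),
    (0.001, [24.9, None, 25.3, None, None, 24.7]),  # some replicates dropped out
]

assay_lod_ct = preliminary_lod_ct(series)
print(f"preliminary LoD Ct: {assay_lod_ct:.1f}")
```

In this invented series the most dilute sample loses replicates, so the estimate comes from the 0.01 dilution.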

Elimination of Cells or Genes from Subsequent Analysis


It can be difficult to decide which cells can be eliminated from analysis due to abnormally low expression. Using low (or no) expression of a single control gene is not a reliable metric for excluding cells from the data set because the level of expression of any single gene (including housekeeping genes) can vary widely between single cells. Using multiple control genes in single-cell experiments allows greater confidence in eliminating samples, as cells with low expression across several genes are likely to be abnormal. We suggest including three highly-expressed, monophasic control genes in the set of assays used to interrogate the cells. The standard deviation of the control genes can be calculated, as well as a cutoff Ct that is three standard deviations below the mean, as shown in Figure 10. Cells whose expression is below the cutoff Ct for at least two of the three control genes can be eliminated.
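The two-of-three rule can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the procedure from Appendix 3 or from SINGuLAR: values are assumed to be on an expression scale such as Log2Ex, where larger means more transcript (on raw Ct values the direction of the comparison would flip), the mean and standard deviation are computed over all cells including any outliers, and the gene and cell names are hypothetical.

```python
from statistics import mean, stdev

def cells_to_eliminate(expression, n_sd=3.0, min_genes_below=2):
    """Return cells that fall below (mean - n_sd * SD) for at least
    `min_genes_below` control genes.

    expression: {gene: {cell: value}} on an expression scale where higher
    values mean more transcript (an assumption of this sketch).
    """
    below_counts = {}
    for gene, per_cell in expression.items():
        values = list(per_cell.values())
        cutoff = mean(values) - n_sd * stdev(values)
        for cell, value in per_cell.items():
            below_counts[cell] = below_counts.get(cell, 0) + (value < cutoff)
    return sorted(cell for cell, n in below_counts.items() if n >= min_genes_below)

# Hypothetical data: 20 cells, three control genes, mostly at Log2Ex = 10
cells = [f"cell_{i:02d}" for i in range(20)]

def profile(low_cells):
    return {cell: (0.0 if cell in low_cells else 10.0) for cell in cells}

controls = {
    "control_A": profile({"cell_00"}),
    "control_B": profile({"cell_00"}),
    "control_C": profile({"cell_01"}),
}

flagged = cells_to_eliminate(controls)
print(flagged)
```

Here cell_00 is low for two control genes and is flagged, while cell_01 is low for only one and is kept. Note that with a small number of cells a single outlier inflates the standard deviation enough that it can never fall three standard deviations from the mean, so the rule is only meaningful with a reasonably large population.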


Figure 10: Cutoff Ct three standard deviations below mean

Normalization
The ΔΔCt method (Livak and Schmittgen, 2001) may not be best for identifying differences among the single cells being analyzed. Normalization should be considered a variable that can be tried to see if it has any significant effect on the analysis of the expression data. Normalizing to a single reference gene that is varying 10- to 1,000-fold at the single-cell level is generally not useful. Guo et al. (2010) normalized using the average of ActB and Gapdh Log2Ex values. One way that normalization might be beneficial is by reducing variation due to differing cell size. It is not necessary to normalize Log2Ex data on a per-cell basis. In fact, many single-cell publications have not used any cell-based normalization. Vandesompele et al. (2002) describe the geNorm method, a robust way to use multiple reference genes to determine a normalization factor.

Figure 11 depicts an example where normalization does not seem to have much effect on data analysis. Guo et al. (2010) performed PCA on expression data from 159 single cells derived from 64-cell stage mouse embryos. Prior to the analysis, they normalized their data using the average of ActB and Gapdh Log2Ex values. Here, PCAs have been repeated using unnormalized data and median Log2Ex normalized data.


Figure 11: Example where normalization does not greatly affect data analysis

From Guo et al. (2010)

The distributions of single cells in these three plots do not seem to be significantly different, indicating that normalization would have little effect on data interpretation in this particular case. We suggest normalizing such that each cell has the same median Log2Ex value across all genes detected in that cell. This ensures that the normalization factor includes data from all genes in the study.
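One possible reading of the median normalization suggested above is sketched below in Python. It is not the procedure from Appendix 4 or from SINGuLAR: the choice of target (the median of the per-cell medians), the decision to leave undetected genes at 0, and the flooring of shifted values at 0 are all assumptions made for illustration, as are the gene and cell names.

```python
from statistics import median

def median_normalize(log2ex_by_cell):
    """Shift each cell's detected Log2Ex values so that every cell has the
    same median across the genes detected in that cell.

    log2ex_by_cell: {cell: {gene: Log2Ex}}; a gene counts as detected in a
    cell when its Log2Ex is greater than 0.
    """
    cell_medians = {
        cell: median(v for v in genes.values() if v > 0)
        for cell, genes in log2ex_by_cell.items()
        if any(v > 0 for v in genes.values())
    }
    # Assumed common target: the median of the per-cell medians
    target = median(cell_medians.values())

    normalized = {}
    for cell, genes in log2ex_by_cell.items():
        shift = target - cell_medians.get(cell, target)
        normalized[cell] = {
            # Undetected genes stay at 0; shifted values are floored at 0 to
            # respect the Log2Ex definition (this can perturb the median in
            # edge cases where a value is pushed below 0).
            gene: (max(v + shift, 0.0) if v > 0 else 0.0)
            for gene, v in genes.items()
        }
    return normalized

# Hypothetical Log2Ex values for three genes in three cells
data = {
    "cell_a": {"g1": 4.0, "g2": 6.0, "g3": 8.0},
    "cell_b": {"g1": 6.0, "g2": 8.0, "g3": 10.0},
    "cell_c": {"g1": 5.0, "g2": 0.0, "g3": 9.0},  # g2 not detected
}

normalized = median_normalize(data)
for cell, genes in normalized.items():
    print(cell, genes)
```

After the shift all three cells share a median of 7 across their detected genes, while the relative differences between genes within each cell are preserved.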

Secondary Analysis
Even if normalization issues are addressed by using data from multiple genes, as recommended earlier, the ΔΔCt method focuses on genes one at a time. With the expression of each gene varying 10- to 1,000-fold, it may be difficult to discern reliable patterns in data from any single gene. For lower expressed genes, analysis is complicated by the fact that a transcript may not be detected in a particular cell purely due to stochastic noise, not due to lack of expression. Instead, some form of multivariate analysis, such as hierarchical clustering or principal component analysis, will be more fruitful in identifying subpopulations with similar gene expression signatures.

The purpose of this section is to focus on the minimum steps required to process single-cell data to make it ready for secondary analysis, rather than to explore all available methods of secondary analysis. To provide additional guidance, below is a tabulated list of published research that used the BioMark System to obtain single-cell gene expression data, and the secondary analysis methods that were used in each. They shed additional light on ways to analyze single-cell data for biological insight.


Publication            Field
Buganim et al. 2012    Stem cells
Guo et al. 2010        Developmental Biology
Flatz et al. 2011      Immunology
Dalerba et al. 2011    Cancer
Pang et al. 2011       Neuroscience
Vincent et al. 2011    Developmental Biology
Aguilo et al. 2011     Stem cells

Method columns: Violin Plots, Plus/Minus, Pairwise Correlation, HC, PCA, LDA, DTA, JSD

Table 1: Comparison of secondary analysis methods in published research using BioMark for single-cell gene expression (HC = Hierarchical Clustering; PCA = Principal Component Analysis; LDA = Linear Discriminant Analysis; DTA = Decision Tree Analysis; JSD = Jensen-Shannon Divergence)


Section 2

The SINGuLAR Workflow


Key steps in single-cell gene expression analysis are depicted in Figure 12 below. Two powerful tools, the Fluidigm Real-Time PCR Analysis Software and the SINGuLAR™ package, are used in combination, either to process data or to perform the analysis.

Figure 12: The Analysis Workflow

The Fluidigm SINGuLAR Package


SINGuLAR leverages R's statistical computing capability to streamline data preparation and analysis. Among other things, the data processing ability of SINGuLAR enables users to:
1. Estimate Limit of Detection (LoD) Ct values
2. Generate Log2Ex values

For data analysis and representation, SINGuLAR permits users to:
1. Create violin plots
2. Perform multivariate analyses such as hierarchical clustering and principal component analysis (PCA)

Installing R
NOTE: If you have already installed R and SINGuLAR, you can skip this section and proceed directly to creating the SINGuLAR directory for data analysis.


Figure 13: The SINGuLAR Workflow

Installing R
1. Download the latest version of R for Windows. To do this, go to http://www.r-project.org/ and download from the Berkeley CRAN mirror located at http://cran.cnr.Berkeley.edu.
2. Run the downloaded .exe file. A setup wizard will walk you through installation. Choose to install the base version only.

Installing SINGuLAR
1. Download fluidigmSC_<VersionNumber>.zip by logging in to the Fluidigm single-cell analysis tools web page.
2. Open R. You will be taken to the R GUI.
3. From the menu bar, select Packages > Install package(s) from local zip files and select the file named fluidigmSC_<VersionNumber>.zip.
4. At the R command prompt, type library(fluidigmSC)
5. Press Enter and type fluidigmSC.firstrun()
6. When prompted, set the CRAN mirror for the session so that the additional packages can be installed, and press Enter. Select the nearest mirror to reduce network load.
7. To download from Berkeley, select the USA (CA1) mirror. This is required to continue downloading. It ensures that you receive R updates and have access to online help.
8. The R GUI will display a series of messages. You can now proceed to create the SINGuLAR directory for data analysis.

Creating the SINGuLAR Directory for Data Analysis


1. To load SINGuLAR, at the R command line, type library(fluidigmSC)
2. Navigate to File > Change dir to set the working directory for this session.

NOTE: Data files calculated by SINGuLAR are automatically saved to this directory. The working directory can be the same location as the single-cell data exported from the Fluidigm Real-Time PCR Analysis software.


Preparing BioMark System Results


SINGuLAR supports both 48.48 and 96.96 Dynamic Array IFCs. The examples in this document primarily use the 96.96 IFCs.

1. Process data using the Fluidigm Real-Time PCR Analysis software.
2. Export the data as heat map results (.csv files) as described earlier in this document.

Estimating the Limit of Detection (LoD) Ct Value


Background information on LoD is available in Section 1 of this document. To experimentally determine the LoD for greater accuracy in estimating the LoD Ct value, one can perform a qPCR experiment on cDNA prepared from bulk RNA. NOTE: Appendix 1 provides a detailed protocol for assay qualification. Please follow the setup carefully to ensure that assay data is formatted correctly for subsequent analysis.

Option 1: Experimental Determination of LoD
To estimate LoD, type the following command at the R command line:
fluidigmSC.LoD(number of replicates, number of samples, number of assays)

For example, as described in Appendix 1, for a run with six replicates of each dilution using 96 samples and 96 assays, your command would look like this:
fluidigmSC.LoD(6, 96, 96)

A file selection window will open. Select the .csv file that contains your assay qualification experiment. SINGuLAR will return the estimated LoD Ct value.

Option 2: Iterative Determination of LoD
If an assay qualification run has not been performed for all assays, we suggest using the conservative LoD Ct value of 22 for the initial run. Because the exact LoD Ct value is somewhat arbitrary and probably will not have a drastic impact on the overall interpretation of a single-cell experiment, the user can start with this stringent LoD Ct value and then go back and increase it in one-cycle steps to see how this affects results. To decrease stringency, the LoD Ct value can be increased to 23, 24, 25, and so on, and the single-cell experiments reanalyzed to see whether changing stringency has any effect on the conclusions.
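The effect of relaxing the LoD Ct can be previewed outside SINGuLAR with a few lines of code. The following is a minimal Python sketch, not part of the toolset: the Ct values are invented for illustration, and a reaction is treated as detected when its Ct is below the LoD Ct.

```python
# Hypothetical Ct values for one assay across eight cells (999 = no amplification).
ct_values = [18.2, 21.7, 22.4, 23.9, 24.6, 26.1, 999.0, 20.3]


def detected_count(cts, lod_ct):
    """Count reactions whose Ct is below the LoD Ct (treated here as detected)."""
    return sum(1 for ct in cts if ct < lod_ct)


# Start at the stringent value of 22 and relax it one cycle at a time.
sweep = {lod_ct: detected_count(ct_values, lod_ct) for lod_ct in range(22, 26)}

for lod_ct, n in sweep.items():
    print(f"LoD Ct {lod_ct}: {n} of {len(ct_values)} reactions detected")
```

If the number of detected reactions, and the conclusions drawn from them, barely change across the sweep, the choice of LoD Ct is unlikely to matter for that data set.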

Removing Failed Data Points and Low Expression Cells


Genes that are not detected in any of the single cells in the study can be eliminated. Optionally, genes expressed in fewer than 5% to 10% of the single cells can be eliminated. Sample and assay numbers and experimental layouts are unique for each experiment and the decision to remove failed data points and low expression cells must be made for each specific experiment. Appendices 2 and 3 cover these procedures in detail.
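As an illustration of the gene filter described above, here is a minimal Python sketch. It is not the SINGuLAR implementation; the function name, the data layout (a dictionary of per-gene Log2Ex lists), and the example values are assumptions, and a value of zero is taken to mean "not detected".

```python
def filter_genes(log2ex_by_gene, min_fraction=0.05):
    """Keep genes detected (Log2Ex > 0) in at least min_fraction of the cells.

    Genes that are not detected in any cell are always dropped.
    """
    kept = {}
    for gene, values in log2ex_by_gene.items():
        detected = sum(1 for v in values if v > 0)
        if detected > 0 and detected / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Hypothetical data for 20 cells.
data = {
    "GeneA": [5.0] * 10 + [0.0] * 10,  # detected in 10 of 20 cells
    "GeneB": [0.0] * 20,               # never detected
    "GeneC": [3.2] + [0.0] * 19,       # detected in 1 of 20 cells (5%)
}

kept_5_percent = sorted(filter_genes(data, min_fraction=0.05))
kept_10_percent = sorted(filter_genes(data, min_fraction=0.10))
```

The threshold is left as a parameter because the appropriate cutoff depends on the experiment.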

Loading and Analyzing Data for Single-Cell Experiment Results with SINGuLAR
1. Navigate to the R command line.
2. Enter an R command in the following format:
fluidigmSC.analysis(number of assays, number of samples, LoD=22, violin=TRUE, HC=TRUE, PCA=number of principal components)
NOTE: Starting with two principal components is highly recommended.

If you are using 96.96 Dynamic Array IFCs, then you will enter:
fluidigmSC.analysis(96, 96, LoD=22, violin=TRUE, HC=TRUE, PCA=2)
3. A file selection window will open. Select the heatmap.csv file, exported from the Fluidigm Real-Time PCR Analysis software, containing your single-cell experimental data.
NOTE: To analyze data from multiple Dynamic Array runs, please refer to the section on Analyzing Multiple Chip Runs with SINGuLAR.

Single-Cell Data Analysis Performed Using Fluidigm SINGuLAR


Graphics displaying violin plots, a hierarchical clustering map, a scree plot ranking the importance of each principal component axis, and a principal component plot will be generated. The resulting single-cell qPCR data will be expressed as log base 2 (Log2Ex) values. Log2Ex values are calculated as Log2Ex = LoD Ct - Ct, so a reaction whose Ct is higher than the LoD Ct yields a negative value. If the Log2Ex is negative, it is replaced with zero. The calculated Log2Ex values are exported to a .csv file named Log2Ex_data.csv and saved in the working directory that you set for this SINGuLAR session. Gene names will appear in Row 1 and sample names in Column A, in the order they were entered in the Fluidigm Real-Time PCR Analysis software.
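The Log2Ex conversion can be illustrated with a short Python sketch. This is not the SINGuLAR code; it assumes that Log2Ex is the LoD Ct minus the measured Ct, so that stronger expression (a lower Ct) gives a larger value, and that negative results are set to zero. The function name and example values are hypothetical.

```python
def log2ex(ct, lod_ct=22.0):
    """Return LoD Ct minus Ct, with negative results replaced by zero."""
    return max(lod_ct - ct, 0.0)


# Hypothetical Ct values; 999 is the value reported when no amplification is detected.
examples = {ct: log2ex(ct) for ct in (18.5, 22.0, 25.0, 999.0)}
for ct, value in examples.items():
    print(f"Ct {ct}: Log2Ex {value}")
```

Under this convention a Ct of 999 maps to zero automatically, so failed or undetected reactions need no special handling at this step.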

Figure 14: Spreadsheet exported as .csv file

Violin Plots
Violin plots display the distribution and frequency of Log2Ex values. Genes and assays in the plot are arranged in decreasing order of standard deviation of the Log2Ex values.


Figure 15: Violin plots generated in R

To save the violin plot or to copy it to another location, right-click on the plot within the R window.

Hierarchical Clustering
SINGuLAR performs unbiased hierarchical clustering (HC) on your data and presents it as a heat map. The reordered data are exported to a .csv file named Hierarchical_clustering_sorted_data.csv and saved in the working directory that you set up for this SINGuLAR session.

Figure 16: Sample heat map


To save the HC heat map or to copy it to another location, right-click on it within the R window.

Principal Component Analysis (PCA)
The PCA algorithm reduces the dimensionality of a data set by transforming it into a new set of uncorrelated variables with decreasing degrees of variability. The uncorrelated variables are called principal components. The first principal component explains the most variation in the data set, that is, it captures the direction of greatest variability among the samples. Each succeeding component, in turn, explains the next highest variance under the constraint that it is uncorrelated with the preceding components.

SINGuLAR produces two plots about the principal components: a PCA scree plot and a scatter plot. The scree plot displays the first ten PC scores, the height of each bar indicating the PC score. This provides a quick way to determine the number of principal components to use. For example, in the scree plot in Figure 17, there is a large height difference between the second and third bars, indicating that the first two principal components can be used and that they contain most of the original data variance. Once the number of principal components has been identified from the scree plot, the command can be repeated using that number.

The scatter plot graphs each principal component score on a separate axis. To find the label for any axis within the plot, trace that axis outward until the PCA score label is found.
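The way of reading a scree plot described above (look for the largest drop between neighboring bars) can be written down as a simple heuristic. The Python sketch below is only an illustration of that reading; SINGuLAR does not choose the number of components for you, and the function name and bar heights are hypothetical.

```python
def components_from_scree(bar_heights):
    """Return how many leading components precede the largest drop in bar height."""
    if len(bar_heights) < 2:
        raise ValueError("need at least two bars")
    drops = [bar_heights[i] - bar_heights[i + 1] for i in range(len(bar_heights) - 1)]
    largest_drop = max(range(len(drops)), key=lambda i: drops[i])
    return largest_drop + 1


# Hypothetical heights for the first ten bars of a scree plot.
heights = [30.0, 24.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]
n_components = components_from_scree(heights)
print(f"Suggested number of principal components: {n_components}")
```

A heuristic like this is only a starting point; visual inspection of the plot remains the deciding step.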

Figure 17: Scree and scatter plots


The PC scores for all samples for the first 10 principal components are exported to a file named PCA_rotated_data.txt and saved in the working directory that you set for this SINGuLAR session. The file should subsequently be opened in Microsoft Excel. To save the scatter plot or to copy it to another location, right-click on it within the R window.

Loading and Individually Analyzing Data for Single-Cell Experiments


SINGuLAR enables you to perform several data analyses with a single command but also permits the flexibility to run the same analyses individually.

To Calculate Log2Ex
1. To express your single-cell data in log base 2, type the following command at the R command line:
fluidigmSC.analysis(number of assays, number of samples, LoD=22)
2. A file selection window will open. Select the .csv file that contains your single-cell experiment. SINGuLAR will return your data in log base 2.

To Generate a Violin Plot
1. To plot your gene expression data as a violin plot, type the following command at the R command line:
fluidigmSC.analysis(number of assays, number of samples, LoD=22, violin=TRUE)
2. A file selection window will open. Select the .csv file that contains your single-cell experiment. SINGuLAR will generate a violin plot of your data.

To Generate a Hierarchical Cluster Heat Map
1. To perform hierarchical clustering on your gene expression data, type the following command at the R command line:
fluidigmSC.analysis(number of assays, number of samples, LoD=22, HC=TRUE)
2. A file selection window will open. Select the .csv file that contains your single-cell experiment. SINGuLAR will generate a hierarchical cluster heat map.

To Perform PCA
1. To perform PCA on your gene expression data, type the following command at the R command line:
fluidigmSC.analysis(number of assays, number of samples, LoD=22, PCA=number of principal components to plot)
2. A file selection window will open. Select the .csv file that contains your single-cell experiment. SINGuLAR will generate scree and scatter plots for your data.

Analyzing Multiple Chip Runs with SINGuLAR


Samples from different Dynamic Array IFC runs can be analyzed together. SINGuLAR will discard assays that differ between experiments and will analyze only those assays that are common in all the experiments. NOTE: Please ensure that every sample name is unique: no two names should match, even if they are the same sample from different runs. For example, if you have three runs of sample A, label them sampleA-1, sampleA-2, and sampleA-3. It is also helpful to name .csv files so that their filenames indicate the number of samples and the number of assays in the export.

Duplicate sample names will cause an error in the R scripts.

1. To perform the single-cell experiment analysis on combined data from multiple Dynamic Array IFC runs, type the following command at the R command line:
fluidigmSC.analysis(number of assays, number of samples, LoD=22, expt=number of data sets, violin=TRUE, HC=TRUE, PCA=number of principal components)
2. A file selection window will open. Select all the .csv files that contain your single-cell experiment data.

For example, if you have the following setup:

         Number of Assays    Number of Samples
Run 1    96                  96
Run 2    96                  90
Run 3    96                  72

Then you would type the following R command:
fluidigmSC.analysis(c(96,96,96), c(96,90,72), LoD=22, expt=3, violin=TRUE, HC=TRUE, PCA=2)
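Before combining runs, it can help to check which assays are shared and whether any sample names collide. The Python sketch below illustrates both checks; it is not part of SINGuLAR, and the data layout (one dictionary per run with "assays" and "samples" lists) and all names in it are assumptions made for the example.

```python
from collections import Counter


def common_assays(runs):
    """Return the assays present in every run, in the order used by the first run."""
    shared = set(runs[0]["assays"])
    for run in runs[1:]:
        shared &= set(run["assays"])
    return [assay for assay in runs[0]["assays"] if assay in shared]


def duplicate_samples(runs):
    """Return sample names that appear more than once across all runs."""
    counts = Counter(name for run in runs for name in run["samples"])
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical runs with three assays and two samples each.
runs = [
    {"assays": ["Actb", "Gapdh", "Nanog"], "samples": ["sampleA-1", "sampleB-1"]},
    {"assays": ["Actb", "Nanog", "Sox2"], "samples": ["sampleA-2", "sampleB-1"]},
]

shared = common_assays(runs)
duplicates = duplicate_samples(runs)
print("Assays common to all runs:", shared)
print("Duplicate sample names:", duplicates)
```

Any name reported as a duplicate should be renamed (for example with a run suffix) before the files are loaded together.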

Identifying Points in the PCA Graph


1. Specify the two PCA components that you are interested in and type a locate command in the R console. If, for example, you are interested in components 1 and 2, you will type:
locate <- fluidigmSC.locate(1, 2)
2. Once the command has been executed, click anywhere in the R console to enable logging of the coordinates.

3. Now start identifying the points of interest by clicking on them. The sample names for the particular points you selected will be displayed on the screen.


Section 3

Appendices


Appendix 1: Protocol for Qualification of Assays


A detailed protocol is available in Appendix B of the Fluidigm Real-Time PCR Analysis Software User Guide (PN 68000088).

Determining Limit of Detection Threshold Cycle (LoD Ct) Value Using All Assays
To estimate an LoD Ct value, six replicates are run of each dilution sample. For each assay, a preliminary LoD Ct is determined by taking the average Ct for the most dilute sample concentration that has positive amplification plots for all six replicates. A stringent LoD Ct value would be the Ct corresponding to five target molecules per reaction chamber. At this low concentration, there is considerable stochastic noise due to the Poisson distribution that affects detection and actual Ct value (see Figure 9 and its accompanying explanation). The goal therefore is to estimate a reasonable LoD Ct value using six replicates without precisely determining the Ct corresponding to five target molecules per reaction chamber.

Concentration expressed as average       Probability that all six replicates
target molecules per reaction chamber    have a positive amplification plot
1                                        0.064
2                                        0.418
3                                        0.736
4                                        0.895
5                                        0.960
6                                        0.985
7                                        0.995
8                                        0.998

Table 2: Probability that all six replicates have positive amplification plots
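The probabilities in Table 2 are consistent with a simple Poisson model, which the following Python sketch reproduces. The model is an assumption made here for illustration: a replicate is taken to be positive when its chamber contains at least one target molecule, so with an average of m molecules per chamber a single replicate is positive with probability 1 - exp(-m), and all six are positive with that probability raised to the sixth power.

```python
import math


def prob_all_positive(mean_molecules, replicates=6):
    """Probability that every replicate chamber holds at least one molecule (Poisson)."""
    p_one_positive = 1.0 - math.exp(-mean_molecules)
    return p_one_positive ** replicates


# Average of 1 to 8 target molecules per reaction chamber.
table = [round(prob_all_positive(m), 3) for m in range(1, 9)]
for m, p in zip(range(1, 9), table):
    print(f"{m} molecules per chamber: {p:.3f}")
```

Under this model, averages of one through eight molecules per chamber give the eight probabilities listed in the table.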

Thus, the preliminary LoD Ct values determined for each assay probably correspond to concentrations ranging from two to 10 target molecules per reaction chamber. The overall LoD Ct is then selected as the median of all the preliminary LoD Ct values rounded up to the next highest whole cycle. Because of the approximate nature of this LoD Ct value, it may be used for any subsequent assays that are used even if they were not run in this experiment.
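The selection of the overall LoD Ct can be sketched in a few lines of Python. This is an illustration rather than the code behind fluidigmSC.LoD: the function name and the preliminary values are hypothetical, and "rounded up to the next highest whole cycle" is interpreted as the ceiling of the median.

```python
import math
import statistics


def overall_lod_ct(preliminary_lod_cts):
    """Median of the per-assay preliminary LoD Ct values, rounded up to a whole cycle."""
    return math.ceil(statistics.median(preliminary_lod_cts))


# Hypothetical preliminary LoD Ct values for five assays.
preliminary = [23.4, 24.1, 22.8, 25.0, 23.9]
lod_ct = overall_lod_ct(preliminary)
print(f"Overall LoD Ct: {lod_ct}")
```

Using the median rather than the mean keeps a single poorly performing assay from shifting the overall value.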

Finally, the exact LoD Ct value is somewhat arbitrary and probably will not have a drastic effect on the interpretation of a single-cell experiment. As discussed above, this can be tested by first using the stringent LoD Ct value described here and then going back and increasing the value in one-cycle increments to see how this affects the Log2Ex results.

Preamplification
1. Prepare the following mixture:
   8 µL 2.5 ng/µL Biochain Human Universal cDNA (PN C4234565-R) or appropriate cDNA standard
   2 µL 500 nM each PreAmp Primers (pool of all assays)
   10 µL 2x AB TaqMan PreAmp Master Mix (PN 4391128)

2. Transfer the mix to the thermal cycler and run the following protocol:

Cycle      Step      Temperature    Time (minutes:seconds)
1 (1X)     Step 1    95 °C          10:00
2 (14X)    Step 1    95 °C          00:15
           Step 2    60 °C          04:00
3 (1X)     Step 1    4 °C           hold

3. Prepare the following mixture:
   2 µL 20 units/µL Exonuclease I (New England BioLabs, PN M0293L)
   1 µL 10X Exonuclease I Reaction Buffer
   7 µL H2O

4. Add 8 µL of this mixture to the preamplified sample.
5. Transfer to the thermal cycler and run the following protocol:

Cycle      Step      Temperature    Time (minutes:seconds)
1 (1X)     Step 1    37 °C          30:00
2 (1X)     Step 1    80 °C          15:00
3 (1X)     Step 1    4 °C           hold

6. Add 72 µL TE (10 mM Tris, pH 8.0, 1.0 mM EDTA) (TEKnova, PN T0224).
7. Store at -20 °C.

Preparation of 1:2 Dilutions


1. Prepare a mixture of 1560 µL TE + 40 µL 10% Tween-20.
2. Prepare the following dilutions in 1.5 mL tubes, vortexing and centrifuging after each dilution.

Table 3: Dilution Table

3. Transfer the samples to 96-well plates for ease of loading into IFCs.
4. Store at -20 °C.

qPCR Detection
1. Prime the chip.
2. Prepare the following mixture:
   420 µL 2X SsoFast™ EvaGreen Supermix with Low ROX
   42 µL 20X DNA Binding Dye Sample Loading Reagent
   7 µL H2O
   18 µL H2O

3. Add 20 µL to each of 16 wells (the first two columns of the 96-well plate).
4. Add 15 µL of diluted sample to each well.
5. Vortex gently and centrifuge.
6. Mix 0.3 µL assays (100 µM each combined F+R primers) with 2.7 µL DNA suspension buffer (TEKnova, PN) and 3 µL Assay Loading Reagent in a 96-well plate.
7. Dispense 5 µL of DELTAgene Assays to detector inlets of the 96.96 IFC.
8. Dispense 6 × 5 µL of each dilution sample + SsoFast MM to sample inlets of the 96.96 array.
9. Load the chip.
10. Run GE Fast 96x96 PCR+Melt v2.pcl

Segment    Type               Temperature (°C)    Duration (seconds)    BioMark HD Ramp Rate (°C/s)
1          Thermal Mix        70                  2400                  5.5
                              60                  30                    5.5
2          Hot Start          95                  60                    5.5
3          PCR (30 Cycles)    96                  5                     5.5
                              60                  20                    5.5
4          Melting Curve      60-95                                     1 °C / 3 seconds

Fluidigm DELTAgene Assay Qualification


NOTE: DELTAgene assays are DNA binding dye-based detection assays. If you are using TaqMan assays, please follow the assay qualification procedure at http://tinyurl.com/ctuavdx. For probe-based assays, use the Auto (Detector) Ct Threshold Method.

Run the First Chip
1. Use the protocol for assay qualification described earlier.
2. Annotate samples and detectors in the Sample Setup and Detector Setup windows, respectively.
3. Analyze the data using the Linear (Derivative) Baseline Correction Method and the Auto (Global) Ct Threshold Method.

Set the Tm Range for Each Assay
The BioMark HD system allows users to identify and eliminate data from non-specific amplification, thereby improving specificity and sensitivity. This is done by adjusting the Tm window in the data analysis software.
NOTE: The Fluidigm Real-Time PCR Analysis Software User Guide, downloadable from http://www.fluidigm.com/product-documents.html, provides a detailed procedure for Tm range selection. Be sure to select the Linear (Derivative) and Auto (Global) options.

Figure 18: Selecting the Tm range

Export Detector.plt
1. From the Detector Setup window, export the Detector.plt file. Use a filename appropriate for the set of assays being analyzed. The Tm range information is retained as part of the Detector.plt file.
2. Later, when you use the same set of assays to analyze single cells, use the Import button to import the Detector.plt file. This ensures that assay information is added to the chip run and that Tm ranges selected in the qualification run are automatically applied to the single-cell data.

Export Heat Map Results
Save the chip run file and navigate to File > Export to export the Ct data. Heat map results are exported to Microsoft Excel in comma-delimited (.csv) format.

Run Second Chip with Single-Cell Samples


1. Flow-sort and process cells, or use the C1 system, and run single cells on a 96.96 Dynamic Array IFC following the guidelines in Appendix A of the Fluidigm Real-Time PCR User Guide, PN 68000088.
NOTE: This analysis uses 96 single-cell samples as an example. It is often useful to include some control samples on the chip, but that topic will be discussed in subsequent documentation.
2. In Sample Setup, annotate the sample information.
3. In Detector Setup, import the detector.plt file generated in the assay qualification run. This will bring in the Tm range for each assay.
4. Analyze the file once again to incorporate the sample and assay information as part of the chip run file. To do this:
   Make sure the Baseline Correction Method is still set to Linear (Derivative).
   Make sure the Ct Threshold Method is still set to Auto (Global).
   Click Analyze.
5. Save the chip run file.
6. Navigate to File > Export to export the Ct data to Microsoft Excel as Heat Map Results. The file will be in .csv format.


Appendix 2: Removing Data Failed by Fluidigm Real-Time PCR Analysis Software


Although the Fluidigm Real-Time PCR Analysis software fails reactions with an improper Tm, it does not change the Ct value determined from the amplification plot. To eliminate Ct values for reactions failed due to Tm:
1. Open the Heat Map Results file in Excel.
2. Copy the sample information in cells A113:B208 to A213:B308.
3. Enter the formula =IF(C113="Pass",C13,999) in cell C213.
NOTE: The formula uses 999 because the Fluidigm Real-Time PCR Analysis software reports a Ct value of 999 for any reaction in which a positive amplification plot is not detected.
4. Copy the formula in cell C213 to fill the matrix C213:CT308.
5. Save the file in .xls or .xlsx format.
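The spreadsheet formula above can also be expressed in code. The Python sketch below applies the same rule to two small matrices of equal shape, one holding Ct values and one holding the Pass/Fail calls; the function name and the example data are hypothetical, and reading the two matrices from the exported file is left out.

```python
def mask_failed(ct_rows, call_rows, failed_ct=999):
    """Keep a Ct where the call is "Pass"; otherwise substitute the failed-reaction value."""
    return [
        [ct if call == "Pass" else failed_ct for ct, call in zip(cts, calls)]
        for cts, calls in zip(ct_rows, call_rows)
    ]


# Hypothetical Ct values and quality calls for two samples and three assays.
ct_rows = [[18.4, 21.0, 24.7], [19.1, 999, 23.3]]
call_rows = [["Pass", "Fail", "Pass"], ["Pass", "Fail", "Fail"]]

cleaned = mask_failed(ct_rows, call_rows)
for row in cleaned:
    print(row)
```

As in the Excel procedure, a failed reaction ends up with the same value (999) that the software reports when no amplification is detected.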


Appendix 3: Eliminating Low-Expressing Cells from Subsequent Analysis

One way to eliminate cells or genes from subsequent analysis is to include at least two highly expressed control genes in the set of assays used to interrogate the cells. To do this:
1. Select at least two control genes that are highly expressed and are not expected to be differentially expressed in the cells being studied.
2. Calculate the Log2Ex values for all genes and observe the expression histograms of the control genes to confirm that their transcript distribution is monophasic.
3. Calculate the median and standard deviation of the Log2Ex values for each control gene across all the single cells.
4. For each control gene, determine a cutoff value by taking the median Log2Ex and subtracting three times the standard deviation of the Log2Ex values for that gene.
5. If the measured Log2Ex values of a cell are lower than the cutoffs for at least two control genes, eliminate that cell from further analysis.
6. Replace the heat map in the .csv export file from the BioMark with this new heat map (without these samples) so that it can be loaded into the SINGuLAR package.
7. Save the file.
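The cutoff procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than a Fluidigm implementation: the cutoff is applied to Log2Ex values (where a lower value means lower expression), the sample standard deviation is used, and the function name, gene names, and expression values are all hypothetical.

```python
import statistics


def low_expression_cells(cell_names, control_log2ex, n_sd=3.0, min_genes=2):
    """Flag cells that fall below (median - n_sd * SD) for at least min_genes control genes."""
    below_counts = {name: 0 for name in cell_names}
    for values in control_log2ex.values():
        cutoff = statistics.median(values) - n_sd * statistics.stdev(values)
        for name, value in zip(cell_names, values):
            if value < cutoff:
                below_counts[name] += 1
    return [name for name in cell_names if below_counts[name] >= min_genes]


# Hypothetical Log2Ex values for two control genes across ten cells.
cells = [f"c{i}" for i in range(1, 11)]
controls = {
    "ControlGene1": [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 2.0],
    "ControlGene2": [12.0, 12.1, 11.9, 12.2, 11.8, 12.0, 12.3, 11.7, 12.0, 3.0],
}

flagged = low_expression_cells(cells, controls)
print("Cells to eliminate:", flagged)
```

Requiring agreement between at least two control genes reduces the chance of discarding a cell because of a single noisy measurement.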


Appendix 4: Normalizing Using Median Log2Ex

NOTE: Sample and assay numbers and experimental layouts are unique to each experiment.

1. Find the median of all Log2Ex values for each sample. To do this, use the formula:
=IFERROR(MEDIAN(All Log2Ex values for a single sample), )
2. Compute the average of all sample Log2Ex median values.
3. For each sample, calculate the difference between its median and the average of medians, and subtract that difference (whether positive or negative) from every Log2Ex value for that sample. The formula is:
=IFERROR(Individual Log2Ex value for a sample - (Sample Median Log2Ex Value - Avg of Medians),0)

These median-normalized Log2Ex values can then be copied and pasted into the .csv export file from the BioMark and saved to be loaded into the SINGuLAR package.
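The same normalization can be sketched in Python. The code below is an illustration, not part of SINGuLAR: the function name, data layout, and values are hypothetical, and for simplicity the median is taken over all Log2Ex values supplied for a sample.

```python
import statistics


def median_normalize(samples):
    """Shift each sample so that its median Log2Ex equals the average of all sample medians."""
    medians = {name: statistics.median(values) for name, values in samples.items()}
    average_of_medians = statistics.mean(list(medians.values()))
    return {
        name: [value - (medians[name] - average_of_medians) for value in values]
        for name, values in samples.items()
    }


# Hypothetical Log2Ex values for two samples measured with three assays.
samples = {
    "sampleA": [2.0, 4.0, 6.0],  # median 4.0
    "sampleB": [4.0, 6.0, 8.0],  # median 6.0
}

normalized = median_normalize(samples)
for name, values in normalized.items():
    print(name, values)
```

After the shift, every sample has the same median, which is the stated goal of this normalization.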


Appendix 5: A Note on the Optimal Number of Cycles Needed for Preamplification

In the single-cell gene expression workflow, the qPCR reactions are preceded by preamplification of cDNA. Statistically, 18-20 cycles of preamplification will result in an average of five copies of target per chamber from a single copy of cDNA. Preamplification can have efficiencies close to 100%, as reported by Devonshire et al. (2011). Preamplification affects the limit of detection. For Dynamic Array IFCs, five target molecules per reaction chamber correspond to 625 molecules/µL in the 48.48 IFC and 730 molecules/µL in the 96.96 IFC.


References
Aguilo, F., S. Avagyan, A. Labar, A. Sevilla, D. F. Lee, P. Kumar, I. R. Lemischka, B. Y. Zhou, and H. W. Snoeck (2011) Prdm16 is a physiologic regulator of hematopoietic stem cells, Blood 117:5057-5066.
Bengtsson, M., A. Ståhlberg, P. Rorsman, and M. Kubista (2005) Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels, Genome Research 15:1388-1392.
Chubb, J. R., T. Trcek, S. M. Shenoy, and R. H. Singer (2006) Transcriptional pulsing of a developmental gene, Current Biology 16:1018-1025.
Dalerba, P. et al. (2011) Single-cell dissection of transcriptional heterogeneity in human colon tumors, Nat Biotechnol 29:1120-1127.
Devonshire, A. S., R. Elaswarapu, and C. A. Foy (2011) Applicability of RNA standards for evaluating RT-qPCR assays and platforms, BMC Genomics 12:118-127.
Diehn, M. et al. (2009) Association of reactive oxygen species levels and radioresistance in cancer stem cells, Nature 458:780-783.
Flatz, L. et al. (2011) Single-cell gene-expression profiling reveals qualitatively distinct CD8 T cells elicited by different gene-based vaccines, Proc Natl Acad Sci USA 108:5724-5729.
Guo, G., M. Huss, G. Q. Tong, C. Wang, L. L. Sun, N. E. Clarke, and P. Robson (2010) Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst, Developmental Cell 18:675-685.
Livak, K. J. and T. D. Schmittgen (2001) Analysis of relative gene expression data using real-time quantitative PCR and the 2^(-ΔΔCT) method, Methods 25:402-408.
Pang, Z. P. et al. (2011) Induction of human neuronal cells by defined transcription factors, Nature 476:220-223.
Raj, A., C. S. Peskin, D. Tranchina, D. Y. Vargas, and S. Tyagi (2006) Stochastic mRNA synthesis in mammalian cells, PLoS Biol 4:e309.
Vandesompele, J., K. De Preter, F. Pattyn, B. Poppe, N. Van Roy, A. De Paepe, and F. Speleman (2002) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes, Genome Biology 3:research0034.1-research0034.11.
Vincent, J. J. et al. (2011) Single cell analysis facilitates staging of Blimp1-dependent primordial germ cells derived from mouse embryonic stem cells, PLoS ONE 6:e28960.

Acknowledgements
Fluidigm gratefully acknowledges the pioneering contributions of Dr. Paul Robson to single cell data analysis; the use of violin plots, principal component analysis, and unsupervised clustering was adopted from his work. We would also like to thank Dr. Robson and the Genome Institute of Singapore for providing the single cell gene expression data used in the SINGuLAR Practice Sets.


World Headquarters 7000 Shoreline Court, Suite 100 South San Francisco, CA 94080 USA Tel: 650-266-6000 Fax: 650-871-7152 Fluidigm Europe, BV Parnassustoren Locatellikade 1, 1076 AZ Amsterdam Netherlands Tel: +33 (1) 60 92 42 40 Fax: +31 (0) 20 203 1111 Fluidigm Japan KK Level 5, Ginza TK Building 1-1-7 Shintomi Chuo-ku, Tokyo 104-0041 Japan Office: +81335552351 Fax: +8133552353 Fluidigm Singapore PTE Ltd Block 1026 Tai Seng Avenue #07-3532 Singapore 534413 Office: +6568587316 Fax: +6562825531 Technical Support Email: TechSupport@fluidigm.com Phone in United States: 1.866.FLUIDLINE (1.866.358.4354) Outside the United States: 650.266.6100 On the Internet: www.fluidigm.com/support Visit our website at www.fluidigm.com

PN 100-5066, Rev. B1
