Title: | Scaling with Ranked Subsampling |
---|---|
Description: | Analysis of species count data in ecology often requires normalization to an identical sample size. Rarefying (random subsampling without replacement), which is a popular method for normalization, has been widely criticized for its poor reproducibility and potential distortion of the community structure. In the context of microbiome count data, researchers explicitly advised against the use of rarefying. An alternative to rarefying is scaling with ranked subsampling (SRS). SRS consists of two steps. In the first step, the total counts for all OTUs (operational taxonomic units) or species in each sample are divided by a scaling factor chosen in such a way that the sum of the scaled counts Cscaled equals Cmin. In the second step, the non-integer Cscaled values are converted into integers by an algorithm that we dub ranked subsampling. The Cscaled value for each OTU or species is split into the integer part Cint (Cint = floor(Cscaled)) and the fractional part Cfrac (Cfrac = Cscaled - Cint). Since the sum of Cint is smaller than or equal to Cmin, additional delta C = Cmin - the sum of Cint counts have to be added to the library to reach the total count of Cmin. This is achieved as follows. OTUs are ranked in the descending order of their Cfrac values. Beginning with the OTU of the highest rank, a single count per OTU is added to the normalized library until the total number of added counts reaches delta C and the sum of all counts in the normalized library equals Cmin. When the lowest Cfrac involved in picking delta C counts is shared by several OTUs, the OTUs used for adding a single count to the library are selected in the order of their Cint values. This selection minimizes the effect of normalization on the relative frequencies of OTUs. OTUs with identical Cfrac as well as Cint are sampled randomly without replacement. See Beule & Karlovsky (2020) <doi:10.7717/peerj.9593> for details. |
Authors: | Lukas Beule [aut, cre], Vitor Heidrich [aut], Petr Karlovsky [aut] |
Maintainer: | Lukas Beule <[email protected]> |
License: | CC BY-SA 4.0 |
Version: | 0.2.3 |
Built: | 2024-11-02 03:57:04 UTC |
Source: | https://github.com/cran/SRS |
Scaling with ranked subsampling (SRS) for the normalization of ecological count data. It is recommended to use SRS.shiny.app for the determination of Cmin.
SRS(data, Cmin, set_seed = TRUE, seed = 1)
data |
Data frame (species count or OTU table) in which columns are samples and rows are the counts of species or OTUs. Only integers are accepted as data. |
Cmin |
The number of counts to which all samples will be normalized. Typically, the total OTU count of the sample with the lowest sequencing depth is chosen as Cmin. Samples with sequencing depth lower than the chosen Cmin will be discarded. |
set_seed |
Logical, if TRUE, a seed is set to make SRS reproducible when OTUs with identical Cfrac as well as Cint are sampled randomly without replacement. See set.seed for details. Default is TRUE. |
seed |
Integer, specifying the seed. See set.seed for details. Default is 1. |
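The requirements on data and Cmin can be checked before SRS is called. The following base R sketch is not part of the package; the object name counts and its values are invented for illustration. It checks that all entries are whole numbers and lists the samples whose sequencing depth falls below a chosen Cmin, which are the samples SRS would discard.

```r
## Hypothetical count table: rows are OTUs, columns are samples.
counts <- data.frame(
  sample_1 = c(10, 0, 5, 25),
  sample_2 = c(3, 2, 0, 5),
  sample_3 = c(40, 10, 20, 30)
)

## Only integers are accepted: check that every entry is a whole number.
values <- unlist(counts)
all_integer <- all(values == round(values))

## Sequencing depth (total counts) of each sample.
depth <- colSums(counts)

## Samples with a depth below Cmin would be discarded by SRS.
Cmin <- 40
discarded <- names(depth)[depth < Cmin]

all_integer
depth
discarded
```

With these made-up counts, only sample_2 (depth 10) falls below Cmin = 40.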
It is recommended to use SRS.shiny.app for the determination of Cmin.
SRS consists of two steps. In the first step, the total counts for all OTUs (operational taxonomic units) or species in each sample are divided by a scaling factor chosen in such a way that the sum of the scaled counts Cscaled equals Cmin. In the second step, the non-integer Cscaled values are converted into integers by an algorithm that we dub ranked subsampling. The Cscaled value for each OTU or species is split into the integer part Cint (Cint = floor(Cscaled)) and the fractional part Cfrac (Cfrac = Cscaled - Cint). Since the sum of Cint is smaller than or equal to Cmin, additional delta C = Cmin - the sum of Cint counts have to be added to the library to reach the total count of Cmin. This is achieved as follows. OTUs are ranked in the descending order of their Cfrac values. Beginning with the OTU of the highest rank, a single count per OTU is added to the normalized library until the total number of added counts reaches delta C and the sum of all counts in the normalized library equals Cmin. When the lowest Cfrac involved in picking delta C counts is shared by several OTUs, the OTUs used for adding a single count to the library are selected in the order of their Cint values. This selection minimizes the effect of normalization on the relative frequencies of OTUs. OTUs with identical Cfrac as well as Cint are sampled randomly without replacement.
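The two steps can be illustrated for a single sample with a short base R sketch. The helper srs_sketch is hypothetical and is not the implementation used by the package. It assumes that ties in Cfrac are resolved in descending order of Cint (the text above only says "in the order of their Cint values"), and it ranks OTUs by the integer remainder of the division instead of the floating-point Cfrac so that ties are detected exactly.

```r
## 'srs_sketch' is a hypothetical helper for illustration only;
## it is not the implementation used by the SRS package.
srs_sketch <- function(x, Cmin, seed = 1) {
  stopifnot(all(x == round(x)), sum(x) >= Cmin)
  set.seed(seed)                       # affects only the random tie-break
  total   <- sum(x)
  product <- as.numeric(x) * Cmin      # Cscaled = product / total
  Cint    <- product %/% total         # integer part, floor(Cscaled)
  rest    <- product %% total          # Cfrac = rest / total, kept exact
  deltaC  <- Cmin - sum(Cint)          # counts still missing
  ## Rank by Cfrac (descending), then by Cint (descending, an assumption),
  ## then randomly for OTUs that tie on both.
  ranking <- order(-rest, -Cint, sample.int(length(x)))
  winners <- ranking[seq_len(deltaC)]
  Cint[winners] <- Cint[winners] + 1
  Cint
}

srs_sketch(c(10, 6, 3, 1), Cmin = 7)   # Cscaled = 3.5, 2.1, 1.05, 0.35
srs_sketch(c(5, 3, 1, 1), Cmin = 5)    # all Cfrac = 0.5, ties resolved by Cint
```

In the first call, delta C is 1 and the single extra count goes to the OTU with the highest Cfrac (0.5), giving 4, 2, 1, 0. In the second call, all four OTUs share Cfrac = 0.5 and delta C is 2, so the two OTUs with the largest Cint receive the extra counts, giving 3, 2, 0, 0.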
Data frame normalized to Cmin.
Lukas Beule, Vitor Heidrich, Devon O'rourke, Petr Karlovsky
Beule L, Karlovsky P. 2020. Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities. PeerJ 8:e9593
<https://doi.org/10.7717/peerj.9593>
##Samples should be arranged columnwise.
##Input data should not contain any categorical
##data such as taxonomic assignment or barcode sequences.
##An example of the input data can be found below:
example_input_data <- matrix(c(sample(1:20, 100, replace = TRUE),
                               sample(1:30, 100, replace = TRUE),
                               sample(1:40, 100, replace = TRUE)), nrow = 100)
colnames(example_input_data) <- c("sample_1","sample_2","sample_3")
example_input_data <- as.data.frame(example_input_data)
example_input_data

##Selection of the desired number of counts
##(e.g., total OTU counts of the sample with the lowest sequencing depth):
Cmin <- min(colSums(example_input_data))
Cmin

##Running the SRS function
SRS_output <- SRS(example_input_data, Cmin)
SRS_output

##Samples that have a total number of counts < Cmin will be discarded:
SRS_output <- SRS(example_input_data, Cmin+1)
SRS_output
Shiny app for the determination of Cmin for scaling with ranked subsampling (SRS).
SRS.shiny.app(data)
data |
Data frame (species count or OTU table) in which columns are samples and rows are the counts of species or OTUs. Only integers are accepted as data. |
Shiny app that generates a visualization of retained samples, summary statistics, SRS curves, and an interactive table in response to varying minimum sample size (Cmin).
Launches Shiny app for SRS in the default web browser.
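The trade-off that the app visualizes, namely how many samples are retained at a given Cmin, can also be tabulated by hand. The following base R sketch does not reproduce the app; the count table and the column names of summary_table are invented for illustration.

```r
## Hypothetical count table: rows are OTUs, columns are samples.
counts <- data.frame(
  sample_1 = c(10, 0, 5, 25),
  sample_2 = c(3, 2, 0, 5),
  sample_3 = c(40, 10, 20, 30)
)
depth <- colSums(counts)

## Use every observed sequencing depth as a candidate Cmin.
candidates <- sort(unique(depth))

## Number of samples retained (depth >= Cmin) for each candidate.
retained <- vapply(candidates, function(Cmin) sum(depth >= Cmin), integer(1))

summary_table <- data.frame(
  Cmin             = candidates,
  retained_samples = retained,
  retained_counts  = candidates * retained  # total counts after normalization
)
summary_table
```

A higher Cmin keeps more counts per sample but discards more samples; the table makes that compromise explicit for each candidate.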
Vitor Heidrich, Devon O'rourke, Petr Karlovsky, Lukas Beule
Beule L, Karlovsky P. 2020. Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities. PeerJ 8:e9593
<https://doi.org/10.7717/peerj.9593>
##Samples should be arranged columnwise.
##Input data should not contain any categorical
##data such as taxonomic assignment or barcode sequences.
##An example of the input data can be found below:
example_input_data <- matrix(c(sample(1:20, 100, replace = TRUE),
                               sample(1:30, 100, replace = TRUE),
                               sample(1:40, 100, replace = TRUE)), nrow = 100)
colnames(example_input_data) <- c("sample_1","sample_2","sample_3")
example_input_data <- as.data.frame(example_input_data)
example_input_data

##Launching the SRS shiny app with example_input_data as input
if (interactive()) {SRS.shiny.app(example_input_data)}
For each column of the input data, draws a line plot of an alpha diversity index (see metric) at different sample sizes (specified by step), normalized by scaling with ranked subsampling (using SRS). A minimum sample size (cutoff-level) can be evaluated by specifying sample. The function also visualizes trade-offs between cutoff-level and alpha diversity and enables direct comparison of SRS and repeated rarefying.
See Beule & Karlovsky (2020) <doi:10.7717/peerj.9593> for details regarding SRS.
SRScurve(data, metric = "richness", step = 50, sample = 0, max.sample.size = 0, rarefy.comparison = FALSE, rarefy.repeats = 10, rarefy.comparison.legend = FALSE, xlab = "sample size", ylab = "richness", label = FALSE, col, lty, ...)
data |
Data frame (species count or OTU table) in which columns are samples and rows are the counts of species or OTUs. Only integers are accepted as data. |
metric |
Character, "richness" (using specnumber) for species richness or "shannon", "simpson" or "invsimpson" (using diversity) for common diversity indices. Default is "richness". |
step |
Numeric, specifying the step used to vary the sample size. Default is 50. |
sample |
Numeric, specifying the cutoff-level to visualize trade-offs between cutoff-level and alpha diversity. Default is 0. |
max.sample.size |
Numeric, specifying the maximum sample size to which SRS curves are drawn. Default is 0 which does not limit the maximum sample size. |
rarefy.comparison |
Logical, if TRUE, median values of repeated rarefying (number of repeats specified by rarefy.repeats) will be drawn for comparison. Default is FALSE. |
rarefy.repeats |
Numeric, specifying the number of repeats used to obtain median values for rarefying. Default is 10. |
rarefy.comparison.legend |
Logical, if TRUE, a legend for the comparison between SRS and rarefy is plotted. Default is FALSE. |
xlab, ylab, label, col, lty, ... |
Graphical parameters. |
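For orientation, the four values accepted by metric can be computed by hand for a single sample. This base R sketch assumes the conventional definitions (natural logarithm for the Shannon index, 1 - sum(p^2) for Simpson and 1 / sum(p^2) for inverse Simpson); the vector x is invented and nothing here calls specnumber or diversity.

```r
## Hypothetical counts of five OTUs in one sample.
x <- c(50, 30, 15, 5, 0)

## Relative abundances of the OTUs that are present.
p <- x[x > 0] / sum(x)

richness   <- sum(x > 0)          # number of OTUs with at least one count
shannon    <- -sum(p * log(p))    # Shannon index, natural logarithm
simpson    <- 1 - sum(p^2)        # Simpson index
invsimpson <- 1 / sum(p^2)        # inverse Simpson index

c(richness = richness, shannon = shannon,
  simpson = simpson, invsimpson = invsimpson)
```

For these counts, richness is 4 and the Simpson index is 0.635.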
See Beule & Karlovsky (2020) <doi:10.7717/peerj.9593> for details regarding scaling with ranked subsampling.
Returns a line plot visualizing the change in alpha diversity indices with changing sample size.
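The idea of tracking an alpha diversity index across sample sizes, and of summarizing repeated rarefying by its median, can be sketched in base R. The helper rarefy_richness is hypothetical and is not how SRScurve is implemented; it only mirrors the roles of step (spacing of sample sizes) and rarefy.repeats (number of repeats) for one invented sample.

```r
## Hypothetical helper: median richness of one sample after repeated
## rarefying (random subsampling without replacement) to 'size' counts.
rarefy_richness <- function(x, size, repeats = 10) {
  pool <- rep(seq_along(x), times = x)   # one entry per counted individual
  draws <- replicate(repeats, length(unique(sample(pool, size))))
  as.numeric(median(draws))
}

set.seed(1)                              # rarefying is random
x <- c(50, 30, 15, 5, 0)                 # hypothetical OTU counts
sizes <- seq(10, sum(x), by = 10)        # analogous to 'step'
richness_curve <- vapply(sizes, function(n) rarefy_richness(x, n), numeric(1))

data.frame(sample_size = sizes, median_richness = richness_curve)
```

Passing sizes and richness_curve to plot with type = "l" would draw the corresponding curve. At the full sample size of 100 every count is drawn, so the median richness equals the observed richness of 4.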
Vitor Heidrich, Petr Karlovsky, Lukas Beule
Beule L, Karlovsky P. 2020. Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities. PeerJ 8:e9593
<https://doi.org/10.7717/peerj.9593>
##Samples should be arranged columnwise.
##Input data should not contain any categorical
##data such as taxonomic assignment or barcode sequences.
##An example of the input data can be found below:
example_input_data <- matrix(c(sample(1:20, 100, replace = TRUE),
                               sample(1:30, 100, replace = TRUE),
                               sample(1:40, 100, replace = TRUE)), nrow = 100)
colnames(example_input_data) <- c("sample_1","sample_2","sample_3")
example_input_data <- as.data.frame(example_input_data)
example_input_data

##Default settings of SRScurve.
SRScurve(example_input_data, metric = "richness", step = 50,
         ylab = "richness",
         col = c("#000000", "#E69F00", "#56B4E9"))

##Limit the computation of SRS curves to a sample size of 200.
SRScurve(example_input_data, metric = "richness", step = 50,
         max.sample.size = 200, ylab = "richness",
         col = c("#000000", "#E69F00", "#56B4E9"))

##SRScurve with comparison of SRS (solid lines) and repeated rarefying (dashed lines).
##Different colors correspond to individual samples. Cutoff-level set to 200.
SRScurve(example_input_data, metric = "richness", step = 50,
         sample = 200, max.sample.size = 200,
         rarefy.comparison = TRUE, rarefy.repeats = 10,
         rarefy.comparison.legend = TRUE, ylab = "richness",
         col = c(rep(c("#000000", "#E69F00", "#56B4E9"),2)),
         lty = c(1,2))