Title: | Compute Upper Prediction Bounds on the FDP in Competition-Based Setups |
---|---|
Description: | Implements functions that calculate upper prediction bounds on the false discovery proportion (FDP) in the list of discoveries returned by competition-based setups, following Ebadi et al. (2022) <arXiv:2302.11837>. Such setups include target-decoy competition (TDC) in computational mass spectrometry and the knockoff construction in linear regression (note that this package typically uses the terminology of TDC). Included are the standardized (TDC-SB) and uniform (TDC-UB) bounds on TDC's FDP, and the simultaneous standardized and uniform bands. Requires pre-computed Monte Carlo statistics available at <https://github.com/uni-Arya/fdpbandsdata>. This data can be downloaded by running the command 'devtools::install_github("uni-Arya/fdpbandsdata")' in R and restarting R after installation. The size of this data is roughly 81Mb. |
Authors: | Arya Ebadi [aut, cre],
Dong Luo [aut],
Jack Freestone [aut],
William Stafford Noble [aut],
Uri Keich [aut] |
Maintainer: | Arya Ebadi <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2025-02-07 05:42:10 UTC |
Source: | https://github.com/uni-arya/bandsfdp |
This function computes an upper prediction bound on the FDP among target wins
in any set of hypotheses of TDC. See details for more information.
gen_bound(labels, indices, gamma, type, d_max = NULL, max_fdp = 0.5, c = 0.5, lambda = 0.5)
genband(labels, indices, gamma, type, d_max = NULL, max_fdp = 0.5, c = 0.5, lambda = 0.5)
labels | A vector of (ordered) labels. See details below. |
indices | A vector specifying the indices of the hypotheses for which an upper prediction bound on the FDP is computed. |
gamma | The confidence parameter of the band. Typical values include 0.05 and 0.01, corresponding to the confidence levels 0.95 and 0.99 mentioned in the details. |
type | A character string specifying which band to use (for example, "stband" for the standardized band, as in the example below). |
d_max | An optional positive integer specifying the maximum number of decoy wins considered in calculating the bands. |
max_fdp | A number specifying the maximum FDP considered by the user in calculating the bands. Used to compute d_max when d_max = NULL. |
c | Determines the ranks of the target score that are considered winning. Defaults to 0.5. |
lambda | Determines the ranks of the target score that are considered losing. Defaults to 0.5. |
In (single-decoy) TDC, each hypothesis is associated with a
winning score and a label (1 for a target win, -1 for a decoy win). This
function assumes that the hypotheses are ordered in decreasing order of
winning scores (with ties broken at random). The argument labels must
therefore be ordered according to this rule.
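The ordering rule can be illustrated with a minimal base-R sketch. The score vectors and variable names below are hypothetical and not part of the package; the sketch assumes that no target score ties with its own decoy score.

```r
# Hypothetical target and decoy scores for six hypotheses (illustrative values).
set.seed(1)
target_scores <- c(3.2, 1.1, 2.5, 0.4, 2.5, 1.9)
decoy_scores  <- c(1.0, 1.6, 0.7, 0.9, 2.8, 0.3)

# The winning score is the larger of the two scores;
# the label is 1 for a target win and -1 for a decoy win.
winning_scores <- pmax(target_scores, decoy_scores)
labels <- ifelse(target_scores > decoy_scores, 1, -1)

# Order by decreasing winning score; the random second key breaks ties at random.
ord <- order(-winning_scores, runif(length(winning_scores)))
ordered_scores <- winning_scores[ord]
ordered_labels <- labels[ord]

ordered_labels
```

A vector built this way is in the form expected for the labels argument.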
This function also supports the extension of TDC that uses multiple
decoys. In that setup, the target score competes with multiple decoy
scores, and the rank of the target score after the competition determines
whether the hypothesis is a target win (label = 1), a decoy win (-1) or
uncounted (0). The top c proportion of ranks are considered winning, the
bottom 1-lambda proportion losing, and all the rest uncounted.
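The threshold of TDC is given by the formula (assuming hypotheses are ordered):
$$\max \left\{ k : \frac{D_k + 1}{T_k \vee 1} \cdot \frac{c}{1 - \lambda} \le \alpha \right\}$$
where $T_k$ is the number of target wins among the top $k$ hypotheses,
$D_k$ is the number of decoy wins among the same hypotheses, and
$\alpha$ is the FDR threshold.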
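The threshold can be computed directly from an ordered label vector. The following is a minimal sketch, assuming the threshold is the largest k for which (D_k + 1) / max(T_k, 1) * c / (1 - lambda) is at most the FDR threshold alpha, as in Ebadi et al. The helper name tdc_threshold is hypothetical and is not a function of this package.

```r
# Hypothetical helper: largest k whose estimated FDP is at most alpha.
# Returns 0 when no k qualifies (no discoveries).
tdc_threshold <- function(labels, alpha, c = 0.5, lambda = 0.5) {
  target_wins <- cumsum(labels == 1)   # T_k for k = 1, ..., n
  decoy_wins  <- cumsum(labels == -1)  # D_k for k = 1, ..., n
  est_fdp <- (decoy_wins + 1) / pmax(target_wins, 1) * c / (1 - lambda)
  passing <- which(est_fdp <= alpha)
  if (length(passing) == 0) 0L else max(passing)
}

# 30 target wins, then 5 decoy wins, then 5 more target wins.
labels <- c(rep(1, 30), rep(-1, 5), rep(1, 5))
tdc_threshold(labels, alpha = 0.05)
```

With these labels the estimate is 1/30 at k = 30 and stays above 0.05 from k = 31 onwards, so the sketch returns 30.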
The argument gamma sets a confidence level of 1-gamma. Both
the uniform and standardized bands require pre-computed Monte Carlo
statistics, so only certain values of gamma are available.
Commonly used confidence levels, like 0.95 and 0.99, are available.
We refer the reader to the README of this package for more details.
The argument d_max controls the rate at which the returned bounds
increase: a larger d_max results in a more conservative bound.
If, however, the number of decoy wins $D_k$ among the top $k$ hypotheses
exceeds d_max for some index $k$, each target win thereafter is considered
a false discovery when computing the bound. Thus it is important that d_max,
chosen a priori, is large enough. Provided it is sufficiently large, the
precise value of d_max does not have a significant effect on the resulting
bounds (see https://arxiv.org/abs/2302.11837 for more details).
We recommend setting d_max = NULL so that it is computed automatically
using max_fdp. This choice ensures that $D_k$, the number of decoy wins
among the top $k$ hypotheses, never exceeds d_max when the
(non-interpolated) FDP bound on the top $k$ hypotheses is less than max_fdp.
An upper prediction bound on the FDP among target wins in the set of
hypotheses whose indices are given as input.
Ebadi et al. (2022), Bounding the FDP in competition-based control of the FDR https://arxiv.org/abs/2302.11837.
if (requireNamespace("fdpbandsdata", quietly = TRUE)) {
  set.seed(123)
  labels <- c(
    rep(1, 250),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.9, 0.1)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.5, 0.5)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.1, 0.9))
  )
  indices <- c(1:100, 300:400, 600:650)
  gamma <- 0.05
  gen_bound(labels, indices, gamma, "stband")
}
This function computes upper prediction bounds on the FDP among target wins
in the top $k$ hypotheses of TDC, for each $k = 1, \ldots, n$, where $n$
is the total number of hypotheses.
sim_bound(labels, gamma, type, d_max = NULL, max_fdp = 0.5, c = 0.5, lambda = 0.5)
simband(labels, gamma, type, d_max = NULL, max_fdp = 0.5, c = 0.5, lambda = 0.5)
labels | A vector of (ordered) labels. See details below. |
gamma | The confidence parameter of the band. Typical values include 0.05 and 0.01, corresponding to the confidence levels 0.95 and 0.99 mentioned in the details. |
type | A character string specifying which band to use (for example, "stband" for the standardized band, as in the example below). |
d_max | An optional positive integer specifying the maximum number of decoy wins considered in calculating the bands. |
max_fdp | A number specifying the maximum FDP considered by the user in calculating the bands. Used to compute d_max when d_max = NULL. |
c | Determines the ranks of the target score that are considered winning. Defaults to 0.5. |
lambda | Determines the ranks of the target score that are considered losing. Defaults to 0.5. |
In (single-decoy) TDC, each hypothesis is associated with a
winning score and a label (1 for a target win, -1 for a decoy win). This
function assumes that the hypotheses are ordered in decreasing order of
winning scores (with ties broken at random). The argument labels must
therefore be ordered according to this rule.
This function also supports the extension of TDC that uses multiple
decoys. In that setup, the target score competes with multiple decoy
scores, and the rank of the target score after the competition determines
whether the hypothesis is a target win (label = 1), a decoy win (-1) or
uncounted (0). The top c proportion of ranks are considered winning, the
bottom 1-lambda proportion losing, and all the rest uncounted.
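The rank-based labelling can be sketched as follows. This is an illustration under stated assumptions rather than the package's own code: rank 1 is taken to be the best score among the target and its decoys, and a rank r is mapped to the fraction r / (number of decoys + 1). The helper name label_from_rank is hypothetical.

```r
# Hypothetical helper: map the target's rank (1 = best) to a label.
# Assumes the target is compared with n_decoys decoy scores.
label_from_rank <- function(rank_from_top, n_decoys, c = 0.5, lambda = 0.5) {
  frac <- rank_from_top / (n_decoys + 1)
  # Top c proportion of ranks win, bottom 1 - lambda proportion lose,
  # anything in between is uncounted.
  ifelse(frac <= c, 1, ifelse(frac > lambda, -1, 0))
}

# Single decoy with the default c = lambda = 0.5: rank 1 wins, rank 2 loses.
label_from_rank(1:2, n_decoys = 1)

# Three decoys with c = 0.25 and lambda = 0.5: rank 2 is uncounted.
label_from_rank(1:4, n_decoys = 3, c = 0.25, lambda = 0.5)
```

With the defaults and one decoy this reduces to the single-decoy labels of 1 and -1; the second call gives 1, 0, -1, -1.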
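The threshold of TDC is given by the formula (assuming hypotheses are ordered):
$$\max \left\{ k : \frac{D_k + 1}{T_k \vee 1} \cdot \frac{c}{1 - \lambda} \le \alpha \right\}$$
where $T_k$ is the number of target wins among the top $k$ hypotheses,
$D_k$ is the number of decoy wins among the same hypotheses, and
$\alpha$ is the FDR threshold.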
The argument gamma sets a confidence level of 1-gamma. Both
the uniform and standardized bands require pre-computed Monte Carlo
statistics, so only certain values of gamma are available.
Commonly used confidence levels, like 0.95 and 0.99, are available.
We refer the reader to the README of this package for more details.
The argument d_max controls the rate at which the returned bounds
increase: a larger d_max results in a more conservative bound.
If, however, the number of decoy wins $D_k$ among the top $k$ hypotheses
exceeds d_max for some index $k$, each target win thereafter is considered
a false discovery when computing the bound. Thus it is important that d_max,
chosen a priori, is large enough. Provided it is sufficiently large, the
precise value of d_max does not have a significant effect on the resulting
bounds (see https://arxiv.org/abs/2302.11837 for more details).
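To check whether a candidate d_max is large enough for a given label vector, one can locate the first index at which the cumulative number of decoy wins exceeds it. The sketch below is a base-R illustration only; first_exceedance is a hypothetical helper, not part of the package, and it does not reproduce how the package itself handles the exceedance.

```r
# Hypothetical helper: first index k at which the number of decoy wins
# among the top k hypotheses exceeds d_max (NA if it never does).
first_exceedance <- function(labels, d_max) {
  decoy_wins <- cumsum(labels == -1)
  hits <- which(decoy_wins > d_max)
  if (length(hits) == 0) NA_integer_ else hits[1]
}

labels <- c(1, 1, -1, 1, -1, -1, 1, 1)
first_exceedance(labels, d_max = 2)  # third decoy win occurs at index 6
first_exceedance(labels, d_max = 5)  # never exceeded, so NA
```

An NA result means the chosen d_max is never exceeded for these labels.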
We recommend setting d_max = NULL so that it is computed automatically
using max_fdp. This choice ensures that $D_k$, the number of decoy wins
among the top $k$ hypotheses, never exceeds d_max when the
(non-interpolated) FDP bound on the top $k$ hypotheses is less than max_fdp.
A vector of upper prediction bounds on the FDP of target wins among
the top $k$ hypotheses for each $k = 1, \ldots, n$, where $n$
is the total number of hypotheses.
Ebadi et al. (2022), Bounding the FDP in competition-based control of the FDR https://arxiv.org/abs/2302.11837.
if (requireNamespace("fdpbandsdata", quietly = TRUE)) {
  set.seed(123)
  labels <- c(
    rep(1, 250),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.9, 0.1)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.5, 0.5)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.1, 0.9))
  )
  gamma <- 0.05
  head(sim_bound(labels, gamma, "stband"))
}
This function computes an upper prediction bound, derived from the standardized band, on the FDP in TDC's list of discoveries.
tdc_sb(thresholds, labels, alpha, gamma, c = 0.5, lambda = 0.5, n = length(labels), interpolate = TRUE)
stband(thresholds, labels, alpha, gamma, c = 0.5, lambda = 0.5, n = length(labels), interpolate = TRUE)
thresholds | The rejection threshold of TDC. If given as a vector, an upper prediction bound is returned for each element. |
labels | A vector of (ordered) labels. See details below. |
alpha | The FDR threshold. |
gamma | The confidence parameter of the bound. Typical values include 0.05 and 0.01, corresponding to the confidence levels 0.95 and 0.99 mentioned in the details. |
c | Determines the ranks of the target score that are considered winning. Defaults to 0.5. |
lambda | Determines the ranks of the target score that are considered losing. Defaults to 0.5. |
n | The number of hypotheses. Defaults to the length of labels. |
interpolate | A boolean indicating whether the bands should be interpolated. Interpolation gives a slightly tighter bound at the cost of additional computation. Defaults to TRUE. |
In (single-decoy) TDC, each hypothesis is associated with a
winning score and a label (1 for a target win, -1 for a decoy win). This
function assumes that the hypotheses are ordered in decreasing order of
winning scores (with ties broken at random). The argument labels must
therefore be ordered according to this rule.
This function also supports the extension of TDC that uses multiple
decoys. In that setup, the target score competes with multiple decoy
scores, and the rank of the target score after the competition determines
whether the hypothesis is a target win (label = 1), a decoy win (-1) or
uncounted (0). The top c proportion of ranks are considered winning, the
bottom 1-lambda proportion losing, and all the rest uncounted.
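The threshold of TDC is given by the formula:
$$\max \left\{ k : \frac{D_k + 1}{T_k \vee 1} \cdot \frac{c}{1 - \lambda} \le \alpha \right\}$$
where $T_k$ is the number of target wins among the top $k$ hypotheses,
$D_k$ is the number of decoy wins among the same hypotheses, and
$\alpha$ is the FDR threshold given by the argument alpha.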
The argument gamma sets a confidence level of 1-gamma. Since
the standardized band requires pre-computed Monte Carlo quantiles, only
certain values of gamma are available. Commonly used
confidence levels, like 0.95 and 0.99, are available. We refer the reader
to the README of this package for more details.
The argument alpha, used to compute the threshold of TDC, is also
used in this function, where it serves to compute an appropriate d_max
for a non-trivial bound. In particular, if the user inputs a vector of
thresholds, a bound is returned for each element of thresholds using the
same d_max. For more details, see https://arxiv.org/abs/2302.11837.
We recommend the use of interpolate = TRUE (default), as it generally
results in a tighter bound. This comes at a computational cost: the bound
for each threshold is computed in O(n) time with interpolation and in O(1)
time without.
An upper prediction bound on the FDP in TDC's list of discoveries.
If thresholds is a vector, returns an upper prediction bound for each
element of thresholds.
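For orientation, the returned bounds can be set beside the familiar decoy-based point estimate of the FDP at the same thresholds. The sketch below is illustrative: the label vector is a deterministic toy example, the estimate (D_k + 1) / max(T_k, 1) assumes the single-decoy defaults c = lambda = 0.5, and the call to tdc_sb runs only if both bandsfdp and fdpbandsdata are installed.

```r
# Deterministic toy labels: 250 target wins, then three target wins per decoy win.
labels <- c(rep(1, 250), rep(c(1, 1, 1, -1), times = 125))
thresholds <- c(250, 500, 750)

# Cumulative target and decoy wins at each threshold.
target_wins <- cumsum(labels == 1)[thresholds]
decoy_wins  <- cumsum(labels == -1)[thresholds]

# Decoy-based point estimate of the FDP (single-decoy case).
estimate <- (decoy_wins + 1) / pmax(target_wins, 1)
print(data.frame(thresholds, target_wins, decoy_wins, estimate))

# Optional: add the TDC-SB bound when the package and its data are available.
if (requireNamespace("bandsfdp", quietly = TRUE) &&
    requireNamespace("fdpbandsdata", quietly = TRUE)) {
  bound <- bandsfdp::tdc_sb(thresholds, labels, alpha = 0.05, gamma = 0.05)
  print(data.frame(thresholds, estimate, bound))
}
```

The point estimate carries no confidence statement, whereas the bound is an upper prediction bound at confidence level 1-gamma.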
Ebadi et al. (2022), Bounding the FDP in competition-based control of the FDR https://arxiv.org/abs/2302.11837.
if (requireNamespace("fdpbandsdata", quietly = TRUE)) {
  set.seed(123)
  thresholds <- c(250, 500, 750, 1000)
  labels <- c(
    rep(1, 250),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.9, 0.1)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.5, 0.5)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.1, 0.9))
  )
  alpha <- 0.05
  gamma <- 0.05
  tdc_sb(thresholds, labels, alpha, gamma)
}
This function computes an upper prediction bound, derived from the uniform band, on the FDP in TDC's list of discoveries.
tdc_ub(thresholds, labels, alpha, gamma, c = 0.5, lambda = 0.5, n = length(labels), interpolate = TRUE)
uniband(thresholds, labels, alpha, gamma, c = 0.5, lambda = 0.5, n = length(labels), interpolate = TRUE)
thresholds | The rejection threshold of TDC. If given as a vector, an upper prediction bound is returned for each element. |
labels | A vector of (ordered) labels. See details below. |
alpha | The FDR threshold. |
gamma | The confidence parameter of the bound. Typical values include 0.05 and 0.01, corresponding to the confidence levels 0.95 and 0.99 mentioned in the details. |
c | Determines the ranks of the target score that are considered winning. Defaults to 0.5. |
lambda | Determines the ranks of the target score that are considered losing. Defaults to 0.5. |
n | The number of hypotheses. Defaults to the length of labels. |
interpolate | A boolean indicating whether the bands should be interpolated. Interpolation gives a slightly tighter bound at the cost of additional computation. Defaults to TRUE. |
In (single-decoy) TDC, each hypothesis is associated with a
winning score and a label (1 for a target win, -1 for a decoy win). This
function assumes that the hypotheses are ordered in decreasing order of
winning scores (with ties broken at random). The argument labels must
therefore be ordered according to this rule.
This function also supports the extension of TDC that uses multiple
decoys. In that setup, the target score competes with multiple decoy
scores, and the rank of the target score after the competition determines
whether the hypothesis is a target win (label = 1), a decoy win (-1) or
uncounted (0). The top c proportion of ranks are considered winning, the
bottom 1-lambda proportion losing, and all the rest uncounted.
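The threshold of TDC is given by the formula:
$$\max \left\{ k : \frac{D_k + 1}{T_k \vee 1} \cdot \frac{c}{1 - \lambda} \le \alpha \right\}$$
where $T_k$ is the number of target wins among the top $k$ hypotheses,
$D_k$ is the number of decoy wins among the same hypotheses, and
$\alpha$ is the FDR threshold given by the argument alpha.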
The argument gamma sets a confidence level of 1-gamma. Since
the uniform band requires pre-computed Monte Carlo statistics, only
certain values of gamma are available. Commonly used
confidence levels, like 0.95 and 0.99, are available. We refer the reader
to the README of this package for more details.
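In symbols, the confidence level can be read as follows. The notation is introduced here for illustration only: $Q$ denotes the FDP in TDC's list of discoveries and $\eta$ the bound returned by this function. The statement is the usual meaning of an upper prediction bound and is assumed to hold under the conditions given in Ebadi et al.

```latex
\[
  \mathbb{P}\bigl( Q \le \eta \bigr) \;\ge\; 1 - \gamma .
\]
```

For example, gamma = 0.05 corresponds to a bound that is expected to cover the realised FDP with probability at least 0.95.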
The argument alpha, used to compute the threshold of TDC, is also
used in this function, where it serves to compute an appropriate d_max
for a non-trivial bound. In particular, if the user inputs a vector of
thresholds, a bound is returned for each element of thresholds using the
same d_max. For more details, see https://arxiv.org/abs/2302.11837.
We recommend the use of interpolate = TRUE (default), as it generally
results in a tighter bound. This comes at a computational cost: the bound
for each threshold is computed in O(n) time with interpolation and in O(1)
time without.
An upper prediction bound on the FDP in TDC's list of discoveries.
If thresholds is a vector, returns an upper prediction bound for each
element of thresholds.
Ebadi et al. (2022), Bounding the FDP in competition-based control of the FDR https://arxiv.org/abs/2302.11837.
if (requireNamespace("fdpbandsdata", quietly = TRUE)) {
  set.seed(123)
  thresholds <- c(250, 500, 750, 1000)
  labels <- c(
    rep(1, 250),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.9, 0.1)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.5, 0.5)),
    sample(c(1, -1), size = 250, replace = TRUE, prob = c(0.1, 0.9))
  )
  alpha <- 0.05
  gamma <- 0.05
  tdc_ub(thresholds, labels, alpha, gamma)
}