1) Discovery of a new drug involves screening large chemical libraries to identify active compounds. These libraries are usually quite large, with thousands or perhaps millions of compounds, and the proportion p of active compounds is often quite small. We will assume that each individual compound is active with probability p and that the compounds are independent of one another.
(a) Discuss how the Bernoulli trial assumptions apply here; talk about individual drug compounds (not generic "trials"). Also, identify a practical situation where the Bernoulli trial assumptions would not hold in this context.
For the remainder of this problem, we will assume that the Bernoulli trial assumptions hold. From a discovery point of view, the obvious question is,
"How should we search an entire library of compounds to find those compounds that are active?"
Suppose that there are N = mk compounds in a chemical library and that we would like to determine the active/nonactive status of each compound. Consider the following two ways to do this:
(i) Each compound can be tested separately. This will require N tests.
(ii) Form m pools of k compounds by assigning each compound to exactly one pool. Test the pools. If a pool tests negative, all k compounds in it are negative (and only 1 test is needed). If a pool tests positive, each of the k compounds will subsequently be tested separately (therefore, k + 1 tests will be required for the k compounds).
Important: We will assume that the test that classifies compounds and compound pools is perfect; there are no mistakes in classification.
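Plan (ii) can be sketched as a small simulation, which is one way to sanity-check an analytical answer against the procedure as described. This is only an illustration and not part of the original problem: the function name `simulate_pooled_tests`, the seed, the number of repetitions, and the example values of m, k, and p are all assumptions chosen for illustration, and the test is taken to be perfect.

```python
import random


def simulate_pooled_tests(m, k, p, rng):
    """Simulate plan (ii) once and return the number of tests used.

    m pools of k compounds each; every compound is active independently
    with probability p; the classification test is assumed perfect.
    """
    tests = 0
    for _ in range(m):
        # Draw the active/inactive status of each compound in this pool.
        pool = [rng.random() < p for _ in range(k)]
        tests += 1          # one test for the pool itself
        if any(pool):
            tests += k      # positive pool: retest all k compounds
    return tests


rng = random.Random(0)       # arbitrary fixed seed, for reproducibility
m, k, p = 200, 10, 0.02      # hypothetical values, not from the problem
runs = [simulate_pooled_tests(m, k, p, rng) for _ in range(500)]
average_tests = sum(runs) / len(runs)
print(f"plan (i): {m * k} tests; plan (ii): about {average_tests:.1f} tests on average")
```

Each simulated run uses between m tests (no positive pools) and m(k + 1) tests (every pool positive), so the average over many runs gives a rough empirical counterpart to an expected number of tests.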
(b) What is the probability that a single pool of k compounds will test positive?
(c) Let Y denote the number of tests needed to screen the entire library under (ii). Find an expression for E(Y), the expected number of tests. Your answer here should depend on m, k, and p.
(d) In terms of minimizing the expected number of tests to be performed on the N compounds, which plan, (i) or (ii), would be preferred if p is close to 0? Justify your answer using the expression derived in part (c).
(e) For concreteness (only for this part), suppose that N = 100000, and consider the following choices of (m, k) under plan (ii):
m = 20000, k = 5
m = 10000, k = 10
m = 5000, k = 20
m = 2000, k = 50
For each choice, graph E(Y) as a function of p over the range 0 < p < 0.20. Try to identify regions of p where each (m, k) combination above would be preferred; i.e., where E(Y) is smallest.
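One way to cross-check a graph for this part is to estimate the average number of tests by simulation at a few values of p. The sketch below is only an illustration under the problem's assumptions (independent compounds, perfect test). The helper name `estimate_tests`, the grid of p values, the number of replications, and the seed are assumptions, and the printed numbers are Monte Carlo estimates rather than exact expectations.

```python
import random


def estimate_tests(m, k, p, rng, reps=2):
    """Monte Carlo estimate of the average number of tests under plan (ii)."""
    total = 0
    for _ in range(reps):
        # A pool tests positive when at least one of its k compounds is active.
        positive_pools = sum(
            any(rng.random() < p for _ in range(k)) for _ in range(m)
        )
        # m pool tests, plus k individual retests per positive pool.
        total += m + k * positive_pools
    return total / reps


designs = [(20000, 5), (10000, 10), (5000, 20), (2000, 50)]  # (m, k) from part (e)
p_grid = [0.001, 0.01, 0.05, 0.10, 0.19]   # illustrative grid inside 0 < p < 0.20
rng = random.Random(1)                      # arbitrary fixed seed

results = {}
for p in p_grid:
    results[p] = {(m, k): estimate_tests(m, k, p, rng) for m, k in designs}
    best = min(results[p], key=results[p].get)
    row = ", ".join(f"k={k}: {results[p][(m, k)]:.0f}" for m, k in designs)
    print(f"p={p}: {row} -> smallest estimate at k={best[1]}")
```

Because these are noisy estimates, designs whose averages are close at a given p may swap order between runs; a finer grid or more replications would sharpen the comparison at the cost of run time.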
(f) We made an assumption that there are no mistakes in classification (i.e., positive pools test positive; negative pools test negative). Provide two realistic scenarios where such an assumption may be violated.