{"title": "Learning Bounds for Importance Weighting", "book": "Advances in Neural Information Processing Systems", "page_first": 442, "page_last": 450, "abstract": "This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Renyi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights which are also commonly used.", "full_text": "Learning Bounds for Importance Weighting\n\nCorinna Cortes\nGoogle Research\n\nNew York, NY 10011\ncorinna@google.com\n\nYishay Mansour\nTel-Aviv University\nTel-Aviv 69978, Israel\nmansour@tau.ac.il\n\nMehryar Mohri\n\nCourant Institute and Google\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nAbstract\n\nThis paper presents an analysis of importance weighting for learning from \ufb01nite\nsamples and gives a series of theoretical and algorithmic results. We point out\nsimple cases where importance weighting can fail, which suggests the need for an\nanalysis of the properties of this technique. 
We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights, which are also commonly used.\n1 Introduction\nIn real-world applications of machine learning, the sampling of the training and test instances often differs, which results in a mismatch between the two distributions. For example, in web search applications, there may be data regarding users who clicked on some advertisement link but little or no information about other users. Similarly, in credit default analyses, there is typically some information available about the credit defaults of customers who were granted credit, but no such information is at hand about rejected customers. In other problems, such as adaptation, the training data available is drawn from a source domain different from the target domain. These issues of biased sampling or adaptation have long been recognized and studied in the statistics literature. There is also a large body of literature dealing with different techniques for sample bias correction [11, 29, 16, 8, 25, 6] or domain adaptation [3, 7, 19, 10, 17] in the recent machine learning and natural language processing literature.\nA common technique used in several of these publications for correcting the bias or discrepancy is based on the so-called importance weighting technique. 
This consists of weighting the cost of errors on training instances, emphasizing the error on some instances and de-emphasizing it on others, with the objective of correcting the mismatch between the distributions of training and test points, as in sample bias correction, adaptation, and other related contexts such as active learning [24, 14, 8, 19, 5]. Different definitions have been adopted for these weights. A common definition of the weight for point x is w(x) = P(x)/Q(x), where P is the target or test distribution and Q is the distribution according to which training points are drawn. A favorable property of this definition, which is not hard to verify, is that it leads to unbiased estimates of the generalization error [8].\nThis paper presents an analysis of importance weighting for learning from finite samples. Our study was originally motivated by the observation that, while this corrective technique seems natural, in some cases in practice it does not succeed. An example in dimension two is illustrated by Figure 1. The target distribution P is the even mixture of two Gaussians centered at (0, 0) and (0, 2), both with standard deviation σ_P, while the source distribution Q is the even mixture of two Gaussians centered at (0, 0) and (2, 0), but with standard deviation σ_Q. The hypothesis class is that of hyperplanes tangent to the unit sphere. The best classifier is selected by empirical risk minimization.\n[Figure 1 plots omitted.]\nFigure 1: Example of importance weighting. Left figure: P (in blue) and Q (in red) are even mixtures of Gaussians. The labels are positive within the unit sphere centered at the origin (in grey), negative elsewhere. The hypothesis class is that of hyperplanes tangent to the unit sphere. Right figures: plots of test error vs. training sample size using importance weighting for two different values of the ratio σ_Q/σ_P. The results indicate mean values of the error over 40 runs ± one standard deviation. 
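To make the setup concrete, the mixture densities and the importance-weighted empirical error can be sketched as follows. This is a minimal illustration, not the code behind Figure 1; the sample size, seed, and the constant-negative hypothesis are ours, chosen only to show that the weighted estimate of the error under P, computed from a Q-sample, agrees with a direct estimate under P:

```python
import math
import random

def gauss2(x, y, cx, cy, s):
    # density of an isotropic 2-D Gaussian centered at (cx, cy) with standard deviation s
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * s * s)) / (2 * math.pi * s * s)

def p_density(x, y, sp=1.0):
    # target P: even mixture of Gaussians centered at (0, 0) and (0, 2)
    return 0.5 * gauss2(x, y, 0.0, 0.0, sp) + 0.5 * gauss2(x, y, 0.0, 2.0, sp)

def q_density(x, y, sq=0.75):
    # source Q: even mixture of Gaussians centered at (0, 0) and (2, 0)
    return 0.5 * gauss2(x, y, 0.0, 0.0, sq) + 0.5 * gauss2(x, y, 2.0, 0.0, sq)

def label(x, y):
    # positive within the unit sphere centered at the origin, negative elsewhere
    return 1 if x * x + y * y <= 1.0 else -1

def weighted_error(h, pts):
    # importance-weighted empirical error: (1/m) sum_i w(x_i) 1[h(x_i) != f(x_i)]
    return sum(p_density(x, y) / q_density(x, y)
               for (x, y) in pts if h(x, y) != label(x, y)) / len(pts)

random.seed(0)
m, sq = 20000, 0.75
q_pts = [(random.gauss(0.0, sq) + (2.0 if random.random() < 0.5 else 0.0),
          random.gauss(0.0, sq)) for _ in range(m)]

h_neg = lambda x, y: -1  # illustrative hypothesis: always predict negative
est = weighted_error(h_neg, q_pts)

# for comparison, the error of h_neg under P (the P-mass of the unit disk),
# estimated directly from a sample drawn from P
p_pts = [(random.gauss(0.0, 1.0),
          random.gauss(0.0, 1.0) + (2.0 if random.random() < 0.5 else 0.0))
         for _ in range(m)]
true_err = sum(1 for (x, y) in p_pts if label(x, y) == 1) / m
print(est, true_err)
```

With σ_Q/σ_P = 0.75 the two estimates agree closely; rerunning with σ_Q = 0.3 makes the individual weights far more extreme and the weighted estimate much noisier, in line with the behavior reported in Figure 1.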
As shown in Figure 1, for σ_Q/σ_P = 0.3, the error of the hypothesis learned using importance weighting is close to 50% even for a training sample of 5,000 points, and the standard deviation of the error is quite high. In contrast, for σ_Q/σ_P = 0.75, convergence occurs relatively rapidly and learning is successful. In Section 4, we discuss other examples where importance weighting does not succeed.\nThe problem just described is not limited to isolated examples. Similar observations have been made in the past in both the statistics and learning literature, more recently in the context of the analysis of boosting by [9], who suggest that importance weighting must be used with care and highlight the need for convergence bounds and learning guarantees for this technique.\nWe study the theoretical properties of importance weighting. We show, using standard generalization bounds, that importance weighting can succeed when the weights are bounded. However, this condition often does not hold in practice. We also show that, remarkably, convergence guarantees can be given even for unbounded weights under the weak assumption that the second moment of the weights is bounded, a condition that relates to the Rényi divergence of P and Q. We further extend these bounds to guarantees for other possible reweightings. These results suggest minimizing a bias-variance tradeoff that we discuss and that leads to several algorithmic ideas. We explore in detail an algorithm based on these ideas and report the results of experiments demonstrating its benefits.\nThroughout this paper, we consider the case where the weight function w is known. When it is not, it is typically estimated from finite samples. The effect of this estimation error is specifically analyzed by [8]. 
This setting is closely related to the problem of importance sampling in statistics, which is that of estimating the expectation of a random variable according to P while using a sample drawn according to Q, with w given [18]. Here, we are concerned with the effect of the weights on learning from finite samples. A different setting is one where full access to Q is further assumed; von Neumann's rejection sampling technique [28] can then be used. We note, however, that it requires w to be bounded by some constant M, which is often not guaranteed and corresponds to the simplest case of our bounds. Even then, the method is wasteful, as it requires on average M samples to obtain one point.\nThe remainder of this paper is structured as follows. Section 2 introduces the definition of the Rényi divergences and gives some basic properties of the importance weights. In Section 3, we give generalization bounds for importance weighting in the bounded case. We also present a general lower bound indicating the key role played by the Rényi divergence of P and Q in this context. Section 4 deals with the more frequent case of unbounded w. Standard generalization bounds do not apply here since the loss function is unbounded. We give novel generalization bounds for unbounded loss functions under the assumption that the second moment is bounded (see Appendix) and use them to derive learning guarantees for importance weighting in this more general setting. In Section 5, we discuss an algorithm inspired by these guarantees for which we report preliminary experimental results. We also discuss why the commonly used remedy of truncating or capping importance weights may not always provide the desired effect of improved performance. 
Finally, in Section 6, we study the properties of an alternative reweighting, also commonly used, which is based on normalized importance weights, and discuss its relationship with the (unnormalized) weights w.\n2 Preliminaries\nLet X denote the input space, Y the label set, and let L : Y × Y → [0, 1] be a loss function. We denote by P the target distribution and by Q the source distribution according to which training points are drawn. We also denote by H the hypothesis set used by the learning algorithm and by f : X → Y the target labeling function.\n2.1 Rényi divergences\nOur analysis makes use of the notion of Rényi divergence, an information-theoretical measure of the difference between two distributions directly relevant to the study of importance weighting. For α ≥ 0, the Rényi divergence D_α(P‖Q) between distributions P and Q is defined by [23]\n\nD_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log_2 \sum_x P(x) \left[\frac{P(x)}{Q(x)}\right]^{\alpha - 1}. \quad (1)\n\nThe Rényi divergence is a non-negative quantity and, for any α > 0, D_α(P‖Q) = 0 iff P = Q. For α = 1, it coincides with the relative entropy. We denote by d_α(P‖Q) the exponential in base 2 of the Rényi divergence D_α(P‖Q):\n\nd_\alpha(P \| Q) = 2^{D_\alpha(P \| Q)} = \left[\sum_x \frac{P^\alpha(x)}{Q^{\alpha - 1}(x)}\right]^{\frac{1}{\alpha - 1}}. \quad (2)\n\n2.2 Importance weights\nThe importance weight for distributions P and Q is defined by w(x) = P(x)/Q(x). In the following, the expectations are taken with respect to Q.\nLemma 1. The following identities hold for the expectation, second moment, and variance of w:\n\nE_Q[w] = 1, \qquad E_Q[w^2] = d_2(P \| Q), \qquad \sigma^2(w) = d_2(P \| Q) - 1. \quad (3)\n\nProof. The first equality is immediate. 
The second moment of w can be expressed as follows in terms of the Rényi divergence:\n\nE_Q[w^2] = \sum_{x \in X} w^2(x)\, Q(x) = \sum_{x \in X} \left[\frac{P(x)}{Q(x)}\right]^2 Q(x) = \sum_{x \in X} P(x)\, \frac{P(x)}{Q(x)} = d_2(P \| Q).\n\nThus, the variance of w is given by σ²(w) = E_Q[w²] − E_Q[w]² = d_2(P‖Q) − 1.\nFor any hypothesis h ∈ H, we denote by R(h) its loss and by R̂_w(h) its weighted empirical loss:\n\nR(h) = E_{x \sim P}[L(h(x), f(x))], \qquad \hat{R}_w(h) = \frac{1}{m} \sum_{i=1}^m w(x_i)\, L(h(x_i), f(x_i)).\n\nWe shall use the abbreviated notation L_h(x) for L(h(x), f(x)), in the absence of any ambiguity about the target function f. Note that the unnormalized importance weighting of the loss is unbiased:\n\nE_Q[w(x) L_h(x)] = \sum_x \frac{P(x)}{Q(x)}\, L_h(x)\, Q(x) = \sum_x P(x)\, L_h(x) = R(h).\n\nThe following lemma gives a bound on the second moment.\nLemma 2. For all α > 0 and x ∈ X, the second moment of the importance weighted loss can be bounded as follows:\n\nE_{x \sim Q}[w^2(x)\, L_h^2(x)] \le d_{\alpha+1}(P \| Q)\, R(h)^{1 - \frac{1}{\alpha}}. \quad (4)\n\nFor α = 1, this becomes R(h)² ≤ E_{x∼Q}[w²(x) L_h²(x)] ≤ d_2(P‖Q).\nProof. The second moment can be bounded as follows:\n\nE_{x \sim Q}[w^2(x)\, L_h^2(x)] = \sum_x Q(x) \left[\frac{P(x)}{Q(x)}\right]^2 L_h^2(x) = \sum_x P(x)\, \frac{P(x)}{Q(x)}\, L_h^2(x)\n\le \left[\sum_x P(x) \left[\frac{P(x)}{Q(x)}\right]^\alpha\right]^{\frac{1}{\alpha}} \left[\sum_x P(x)\, L_h^{\frac{2\alpha}{\alpha-1}}(x)\right]^{\frac{\alpha-1}{\alpha}} \quad (\text{Hölder's inequality})\n= d_{\alpha+1}(P \| Q) \left[\sum_x P(x)\, L_h^{\frac{2\alpha}{\alpha-1}}(x)\right]^{\frac{\alpha-1}{\alpha}}\n\le d_{\alpha+1}(P \| Q) \left[\sum_x P(x)\, L_h(x)\right]^{\frac{\alpha-1}{\alpha}} = d_{\alpha+1}(P \| Q)\, R(h)^{1-\frac{1}{\alpha}},\n\nwhere the last inequality uses L_h(x) ∈ [0, 1] and 2α/(α−1) ≥ 1.\n3 Learning Guarantees - Bounded Case\nNote that sup_x w(x) = sup_x P(x)/Q(x) = d_∞(P‖Q). We first examine the case d_∞(P‖Q) < +∞ and use the notation M = d_∞(P‖Q). 
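The identities of Lemma 1, and the relation M = d_∞(P‖Q) used below, can be checked numerically on a small discrete example; a minimal sketch (the two distributions are arbitrary illustrative choices):

```python
def d_alpha(p, q, alpha):
    # exponentiated Rényi divergence: d_alpha(P||Q) = [sum_x P^alpha(x)/Q^(alpha-1)(x)]^(1/(alpha-1))
    return sum(pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q)) ** (1.0 / (alpha - 1))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
w = [pi / qi for pi, qi in zip(P, Q)]             # importance weights w(x) = P(x)/Q(x)

e_w = sum(wi * qi for wi, qi in zip(w, Q))        # E_Q[w] = 1
e_w2 = sum(wi * wi * qi for wi, qi in zip(w, Q))  # E_Q[w^2] = d_2(P||Q)
var = e_w2 - e_w ** 2                             # sigma^2(w) = d_2(P||Q) - 1
M = max(w)                                        # sup_x w(x) = d_inf(P||Q)
print(e_w, e_w2, var, M)
```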
The following proposition then follows directly from Hoeffding's inequality.\nProposition 1 (single hypothesis). Fix h ∈ H. For any δ > 0, with probability at least 1 − δ,\n\n|R(h) - \hat{R}_w(h)| \le M \sqrt{\frac{\log(2/\delta)}{2m}}.\n\nThe upper bound M, though finite, can be quite large. The following theorem provides a more favorable bound as a function of the ratio M/m when any of the moments of w, d_{α+1}(P‖Q), is finite, which is the case when d_∞(P‖Q) < ∞, since the Rényi divergence is a non-decreasing function of α [23, 2]; in particular:\n\n\forall \alpha > 0, \quad d_{\alpha+1}(P \| Q) \le d_\infty(P \| Q). \quad (5)\n\nTheorem 1 (single hypothesis). Fix h ∈ H. Then, for any α ≥ 1 and any δ > 0, with probability at least 1 − δ, the following bound holds for the importance weighting method:\n\nR(h) \le \hat{R}_w(h) + \frac{2M \log\frac{1}{\delta}}{3m} + \sqrt{\frac{2 \left[d_{\alpha+1}(P \| Q)\, R(h)^{1-\frac{1}{\alpha}} - R(h)^2\right] \log\frac{1}{\delta}}{m}}. \quad (6)\n\nFor α = 1, after further simplification, this gives R(h) \le \hat{R}_w(h) + \frac{2M \log\frac{1}{\delta}}{3m} + \sqrt{\frac{2 d_2(P \| Q) \log\frac{1}{\delta}}{m}}.\nProof. Let Z denote the random variable w(x) L_h(x) − R(h). Then |Z| ≤ M. By Lemma 2, the variance of the random variable Z can be bounded in terms of the Rényi divergence d_{α+1}(P‖Q):\n\n\sigma^2(Z) = E_Q[w^2(x)\, L_h(x)^2] - R(h)^2 \le d_{\alpha+1}(P \| Q)\, R(h)^{1-\frac{1}{\alpha}} - R(h)^2.\n\nThus, by Bernstein's inequality [4], it follows that\n\n\Pr[R(h) - \hat{R}_w(h) > \epsilon] \le \exp\left(\frac{-m\epsilon^2/2}{\sigma^2(Z) + \epsilon M/3}\right).\n\nSetting δ to match this upper bound shows that, with probability at least 1 − δ,\n\nR(h) \le \hat{R}_w(h) + \frac{M \log\frac{1}{\delta}}{3m} + \sqrt{\frac{M^2 \log^2\frac{1}{\delta}}{9m^2} + \frac{2\sigma^2(Z) \log\frac{1}{\delta}}{m}}.\n\nUsing the sub-additivity of \sqrt{\cdot} leads to the simpler expression\n\nR(h) \le \hat{R}_w(h) + \frac{2M \log\frac{1}{\delta}}{3m} + \sqrt{\frac{2\sigma^2(Z) \log\frac{1}{\delta}}{m}}.\n\nThese results can be straightforwardly extended to general hypothesis sets. In particular, for a finite hypothesis set and for α = 1, the application of the union bound yields the following result.\nTheorem 2 (finite hypothesis set). Let H be a finite hypothesis set. Then, for any δ > 0, with probability at least 1 − δ, the following bound holds for the importance weighting method:\n\nR(h) \le \hat{R}_w(h) + \frac{2M (\log|H| + \log\frac{1}{\delta})}{3m} + \sqrt{\frac{2 d_2(P \| Q)(\log|H| + \log\frac{1}{\delta})}{m}}. \quad (7)\n\nFor infinite hypothesis sets, a similar result can be shown straightforwardly using covering numbers instead of |H|, or a related measure based on samples of size m [20].\nIn the following proposition, we give a lower bound that further emphasizes the role of the Rényi divergence of the second order in the convergence of importance weighting in the bounded case.\nProposition 2 (Lower bound). Assume that M < ∞ and σ²(w)/M² ≥ 1/m. Assume that H
Assume that H\ncontains a hypothesis h0 such that Lh0(x) = 1 for all x. Then, there exists an absolute constant c,\nc = 2/412, such that\n\nh\u2208H,,R(h) \u2212 &Rw(h),, \u2265\u2019 d2(P$Q) \u2212 1\nPr$ sup\n\n4m\n\n% \u2265 c > 0.\n\n(8)\n\nProof. Let \u03c3H = suph\u2208H \u03c3(wLh). If for all x\u2208 X, Lh0(x) = 1, then \u03c32(wLh0) = d2(P$Q) \u2212 1 =\n\nH. The result then follows a general theorem, Theorem 9 proven in the Appendix.\n\n\u03c32(w)= \u03c32\n4 Learning Guarantees - Unbounded Case\nThe condition d\u221e(P$Q) <\u221e assumed in the previous section does not always hold, even in some\nnatural cases, as illustrated by the following examples.\n4.1 Examples\nAssume that P and Q both follow a Gaussian distribution with the standard deviations \u03c3P and \u03c3Q\nand with means \u00b5 and \u00b5&:\n1\n\n1\n\n% Q(x) =\n\nexp$ \u2212\n\nQ %.\n(x \u2212 \u00b5&)2\n\n2\u03c32\n\n\u221a2\u03c0\u03c3Q\n., thus, even for \u03c3P = \u03c3Q and \u00b5 += \u00b5& the\nQ(x) = +\u221e, and the bound of Theorem 1\n\nP (x)\n\nP (x) =\n\n\u221a2\u03c0\u03c3P\nQ(x) = \u03c3Q\n\n(x \u2212 \u00b5)2\n2\u03c32\nP\n\nexp$ \u2212\nexp- \u2212 \u03c32\nexp$ \u2212\n\u03c3P / +\u221e\nP\u221a2\u03c0/ +\u221e\n\n\u2212\u221e\n\n\u2212\u221e\n\n\u03c3Q\n\n\u03c32\n\n=\n\n\u03c3P\n\nQ(x\u2212\u00b5)2\u2212\u03c32\nP \u03c32\nQ\n\nIn that case, P (x)\nP (x\u2212\u00b5\")2\nimportance weights are unbounded, d\u221e(P$Q) = supx\nis not informative. The R\u00b4enyi divergence of the second order is given by:\nP (x \u2212 \u00b5&)2\n\n\u03c3Q\n\n2\u03c32\n\nQ(x \u2212 \u00b5)2 \u2212 \u03c32\n\u03c32\nP \u03c32\nQ\n\n2\u03c32\n\nd2(P$Q) =\n\nexp$ \u2212\n\n2\u03c32\n\nQ(x \u2212 \u00b5)2 \u2212 \u03c32\nP \u03c32\nQ\n\n2\u03c32\n\n%P (x)dx\n%dx.\nP (x \u2212 \u00b5&)2\n\n\u221a2\nThat is, for \u03c3Q >\n2 \u03c3P the variance of the importance weights is bounded. 
By the additivity\nproperty of the R\u00b4enyi divergence, a similar situation holds for the product and sums of such Gaussian\ndistributions. Hence, in the rightmost example of Figure 1, the importance weights are unbounded,\nbut their second moment is bounded. In the next section we provide learning guarantees even for\nthis setting in agreement with the results observed. For \u03c3Q = 0.3\u03c3P, the same favorable guarantees\ndo not hold, and, as illustrated in Figure 1, learning is signi\ufb01cantly more dif\ufb01cult.\nThis example of Gaussians can further illustrate what can go wrong in importance weighting. As-\nsume that \u00b5 = \u00b5& = 0, \u03c3Q = 1 and \u03c3P = 10. One could have expected this to be an easy case for\nimportance weighting since sampling from Q provides useful information about P. The problem\nis, however, that a sample from Q will contain a very small number of points far from the mean\n(of either negative or positive label) and that these points will be assigned very large weights. For\na sample of size m and \u03c3Q = 1, the expected value of an extreme point is \u221a2 log m \u2212 o(1) and its\n\n5\n\n\fP +1/\u03c32\n\nweight will be in the order of m\u22121/\u03c32\nQ = m0.99. Therefore, a few extreme points will domi-\nnate all other weights and necessarily have a huge in\ufb02uence on the selection of a hypothesis by the\nlearning algorithm.\nAnother related example is when \u03c3Q = \u03c3P = 1 and \u00b5& = 0. Let \u00b5 , 0 depend on the sample size\nm. If \u00b5 is large enough compared to log(m), then, with high probability, all the weights will be\nnegligible. This is especially problematic, since the estimate of the probability of any event would\nbe negligible (in fact both an event and its complement). 
If we normalize the weights, the issue is overcome, but then, with high probability, the maximum weight dominates the sum of all other weights, reverting the situation back to that of the previous example.\n4.2 Importance weighting learning bounds - unbounded case\nAs in these examples, in practice the importance weights are typically not bounded. However, we shall show that, remarkably, under the weak assumption that the second moment of the weights w, d_2(P‖Q), is bounded, generalization bounds can be given for this case as well. The following result relies on a general learning bound for unbounded loss functions proven in the Appendix (Corollary 1). We denote by Pdim(U) the pseudo-dimension of a real-valued function class U [21].\nTheorem 3. Let H be a hypothesis set such that Pdim({L_h(x): h ∈ H}) = p < ∞. Assume that d_2(P‖Q) < +∞ and w(x) ≠ 0 for all x. Then, for any δ > 0, with probability at least 1 − δ, the following holds:\n\nR(h) \le \hat{R}_w(h) + 2^{5/4} \sqrt{d_2(P \| Q)} \left[\frac{p \log\frac{2me}{p} + \log\frac{4}{\delta}}{m}\right]^{3/8}.\n\nProof. Since d_2(P‖Q) < +∞, the second moment of w(x) L_h(x) is finite and upper bounded by d_2(P‖Q) (Lemma 2). Thus, by Corollary 1, we can write\n\n\Pr\left[\sup_{h \in H} \frac{R(h) - \hat{R}_w(h)}{\sqrt{d_2(P \| Q)}} > \epsilon\right] \le 4 \exp\left(p \log\frac{2em}{p} - \frac{m \epsilon^{8/3}}{4^{5/3}}\right),\n\nwhere p is the pseudo-dimension of the function class H″ = {w(x) L_h(x): h ∈ H}. We now show that p = Pdim({L_h(x): h ∈ H}). Let H′ denote {L_h(x): h ∈ H} and let A = {x_1, ..., x_k} be a set shattered by H″. Then, there exist real numbers r_1, ..., r_k such that for any subset B ⊆ A there exists h ∈ H such that\n\n∀x_i ∈ B, w(x_i) L_h(x_i) ≥ r_i; \qquad ∀x_i ∈ A − B, w(x_i) L_h(x_i) < r_i. \quad (9)\n\nSince by assumption w(x_i) > 0 for all i ∈ [1, k], this implies that\n\n∀x_i ∈ B, L_h(x_i) ≥ r_i/w(x_i); \qquad ∀x_i ∈ A − B, L_h(x_i) < r_i/w(x_i). \quad (10)\n\nThus, H′ shatters A with the witnesses s_i = r_i/w(x_i), i ∈ [1, k]. Using the same observations, it is straightforward to see that, conversely, any set shattered by H′ is shattered by H″.\nThe convergence rate of the bound is slightly weaker (O(m^{−3/8})) than in the bounded case (O(m^{−1/2})). A faster convergence can be obtained, however, using the more precise bound of Theorem 8, at the expense of readability. The Rényi divergence d_2(P‖Q) thus seems to play a critical role in the bound, and hence in the convergence of importance weighting in the unbounded case.\n5 Alternative reweighting algorithms\nThe previous analysis can be generalized to the case of an arbitrary positive function u : X → R, u > 0. Let \hat{R}_u(h) = \frac{1}{m} \sum_{i=1}^m u(x_i) L_h(x_i) and let \hat{Q} denote the empirical distribution.\nTheorem 4. Let H be a hypothesis set such that Pdim({L_h(x): h ∈ H}) = p < ∞. Assume that 0 < E_Q[u²(x)] < +∞ and u(x) ≠ 0 for all x. 
Then, for any δ > 0, with probability at least 1 − δ, the following holds:\n\n|R(h) - \hat{R}_u(h)| \le \left| E_Q\big[(w(x) - u(x))\, L_h(x)\big] \right| + 2^{5/4} \max\left\{\sqrt{E_Q[u^2(x) L_h^2(x)]},\, \sqrt{E_{\hat{Q}}[u^2(x) L_h^2(x)]}\right\} \left[\frac{p \log\frac{2me}{p} + \log\frac{4}{\delta}}{m}\right]^{3/8}.\n\n[Figure 2 plots omitted.]\nFigure 2: Comparison of the convergence of 4 different algorithms for the learning task of Figure 1: learning with equal weights for all examples (Unweighted), Importance weighting, using Quantiles to parameterize the function u, and Capping the largest weights.\nProof. Since R(h) = E[w(x) L_h(x)], we can write\n\nR(h) - \hat{R}_u(h) = E_Q\big[(w(x) - u(x))\, L_h(x)\big] + E[u(x) L_h(x)] - \hat{R}_u(h),\n\nand thus\n\n|R(h) - \hat{R}_u(h)| \le \left| E_Q\big[(w(x) - u(x))\, L_h(x)\big] \right| + \left| E[u(x) L_h(x)] - \hat{R}_u(h) \right|.\n\nBy Corollary 2 applied to the function u L_h, with probability at least 1 − δ, |E[u(x) L_h(x)] − \hat{R}_u(h)| can be bounded by\n\n2^{5/4} \max\left\{\sqrt{E_Q[u^2(x) L_h^2(x)]},\, \sqrt{E_{\hat{Q}}[u^2(x) L_h^2(x)]}\right\} \left[\frac{p \log\frac{2me}{p} + \log\frac{4}{\delta}}{m}\right]^{3/8},\n\nwith p = Pdim({L_h(x): h ∈ H}), by a proof similar to that of Theorem 3.\nThe theorem suggests that functions u other than w can be used to reweight the cost of an error on each training point, by minimizing the upper bound, which expresses a trade-off, with explicitly given coefficients, between the bias term |E_Q[(w(x) − u(x)) L_h(x)]| and the second-moment term \max\{\sqrt{E_Q[u^2(x) L_h^2(x)]}, \sqrt{E_{\hat{Q}}[u^2(x) L_h^2(x)]}\}. Function u can be selected from different families. Using an upper bound on these quantities that is independent of h and a multiplicative bound of the form\n\n\max\left\{\sqrt{E_Q[u^2]},\, \sqrt{E_{\hat{Q}}[u^2]}\right\} \le \sqrt{E_Q[u^2]}\left(1 + O(1/\sqrt{m})\right)\n\nleads to the following optimization problem:\n\n\min_{u \in U}\; E_Q\big[|w(x) - u(x)|\big] + \gamma \sqrt{E_Q[u^2]}, \quad (11)\n\nwhere γ > 0 is a parameter controlling the trade-off between bias and variance minimization, and where U is a family of possible weight functions out of which u is selected.\nHere, we consider a family of functions U parameterized by the quantiles q of the weight function w. A function u_q ∈ U is then defined as follows: within each quantile, the value taken by u_q is the average of w over that quantile. For small values of γ, the bias term dominates, and very fine-grained quantiles minimize the bound of equation (11). For large values of γ, the variance term dominates, and the bound is minimized by using just one quantile, corresponding to an even weighting of the training examples. 
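A minimal sketch of this quantile-based family, using equal-frequency quantiles (the function name and the example weights are illustrative choices of ours):

```python
def quantile_weights(w, q):
    # map each weight to the average of w over its quantile: with q groups of
    # equal size, this computes the function u_q described above
    m = len(w)
    order = sorted(range(m), key=lambda i: w[i])
    u = [0.0] * m
    for g in range(q):
        block = order[g * m // q:(g + 1) * m // q]
        avg = sum(w[i] for i in block) / len(block)
        for i in block:
            u[i] = avg
    return u

w = [0.2, 0.4, 0.5, 0.9, 1.1, 1.4, 2.0, 9.5]  # illustrative importance weights
print(quantile_weights(w, 1))  # one quantile: even weighting (every value equals the mean of w)
print(quantile_weights(w, 8))  # one example per quantile: the standard importance weights
print(quantile_weights(w, 4))  # intermediate: a coarse, lower-variance version of w
```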
Hence, by varying γ from small to large values, the algorithm interpolates between standard importance weighting, with just one example per quantile, and unweighted learning, where all examples are given the same weight. Figure 2 also shows the results of experiments for the learning task of Figure 1 using the algorithm defined by (11) with this family of functions. The optimal q is determined by 10-fold cross-validation. We see that a more rapid convergence can be obtained by using these weights compared to the standard importance weights w.\nAnother natural family of functions is that of thresholded versions of the importance weights, {u_θ : θ > 0, ∀x ∈ X, u_θ(x) = min(w(x), θ)}. In fact, in practice, users often cap importance weights by choosing an arbitrary value θ. The advantage of this family is that, by definition, the weights are bounded. However, in some cases, larger weights could be critical to achieve a better performance. Figure 2 illustrates the performance of this approach. Compared to importance weighting, no change in performance is observed until the largest 1% of the weights are capped, at which point we only observe a performance degradation. We expect the thresholding to be less beneficial when the large weights reflect the true w and are not an artifact of estimation uncertainties.\n6 Relationship between normalized and unnormalized weights\nAn alternative approach based on the weight function w(x) = P(x)/Q(x) consists of normalizing the weights. Thus, while in the unnormalized case the unweighted empirical error is replaced by\n\n\frac{1}{m} \sum_{i=1}^m w(x_i)\, L_h(x_i) = \sum_{i=1}^m \frac{w(x_i)}{m}\, L_h(x_i),\n\nin the normalized case it is replaced by\n\n\sum_{i=1}^m \frac{w(x_i)}{W}\, L_h(x_i), \qquad \text{with } W = \sum_{i=1}^m w(x_i).\n\nWe refer to w̃(x) = w(x)/W as the normalized importance weight. An advantage of the normalized weights is that they are by definition bounded by one. 
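The two estimators can be written side by side; a minimal sketch (the function names, weights, and losses are illustrative values of ours):

```python
def unnormalized_estimate(w, losses):
    # (1/m) sum_i w(x_i) L_h(x_i): unbiased, but individual weights may be very large
    return sum(wi * li for wi, li in zip(w, losses)) / len(w)

def normalized_estimate(w, losses):
    # sum_i (w(x_i)/W) L_h(x_i) with W = sum_i w(x_i): normalized weights lie in [0, 1]
    W = sum(w)
    return sum((wi / W) * li for wi, li in zip(w, losses))

w = [0.5, 0.8, 1.0, 1.2, 4.5]        # illustrative importance weights
losses = [0.0, 1.0, 0.0, 1.0, 1.0]   # illustrative 0/1 losses L_h(x_i)
print(unnormalized_estimate(w, losses))
print(normalized_estimate(w, losses))
```

The two estimates differ exactly by the factor m/W; the concentration of W/m around 1 is the object of the analysis that follows.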
However, the price to pay for this benefit is the fact that the weights are no longer unbiased. In fact, several issues similar to those we pointed out in Section 4 affect the normalized weights as well.\nHere, we maintain the assumption that the second moment of the importance weights is bounded and analyze the relationship between normalized and unnormalized weights. We show that, under this assumption, normalized and unnormalized weights are in fact very close, with high probability. Observe that for any i ∈ [1, m],\n\nw̃(x_i) - \frac{w(x_i)}{m} = \frac{w(x_i)}{W} - \frac{w(x_i)}{m} = \frac{w(x_i)}{W}\left(1 - \frac{W}{m}\right).\n\nThus, since w(x_i)/W ≤ 1, we can write |w̃(x_i) − w(x_i)/m| ≤ |1 − W/m|. Since E[w(x)] = 1, we also have E_S[W/m] = \frac{1}{m} \sum_{k=1}^m E[w(x_k)] = 1. Thus, by Corollary 2, for any δ > 0, with probability at least 1 − δ, the following inequality holds:\n\n\left|1 - \frac{W}{m}\right| \le 2^{5/4} \max\left\{\sqrt{d_2(P \| Q)},\, \sqrt{d_2(P \| \hat{Q})}\right\} \left[\frac{\log(2me) + \log\frac{4}{\delta}}{m}\right]^{3/8},\n\nwhich implies the same upper bound on |w̃(x_i) − w(x_i)/m|, simultaneously for all i ∈ [1, m].\n7 Conclusion\nWe presented a series of theoretical results for importance weighting, both in the bounded-weights case and in the more general unbounded case, under the assumption that the second moment of the weights is bounded. We also initiated a preliminary exploration of alternative reweightings and showed their benefits. A more systematic study of new algorithms based on these learning guarantees could lead to even more beneficial and practically useful results. Several of the learning guarantees we gave depend on the Rényi divergence of the distributions P and Q. Accurately estimating that quantity is thus critical and should motivate further studies of the convergence of its estimates from finite samples. 
Finally, our novel unbounded-loss learning bounds are of independent interest and could be useful in a variety of other contexts.\nReferences\n[1] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47:207-217, 1993.\n[2] C. Arndt. Information Measures: Information and its Description in Science and Engineering. Signals and Communication Technology. Springer Verlag, 2004.\n[3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.\n[4] S. N. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97:1-59, 1927.\n[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, pages 49-56, 2009.\n[6] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In ICML, pages 81-88, 2007.\n[7] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS, 2008.\n[8] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In ALT, 2008.\n[9] S. Dasgupta and P. M. Long. Boosting with diverse base classifiers. In COLT, 2003.\n[10] H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101-126, 2006.\n[11] M. Dudík, R. E. Schapire, and S. J. Phillips. Correcting sample selection bias in maximum entropy density estimation. In NIPS, 2006.\n[12] R. M. Dudley. A course on empirical processes. Lecture Notes in Mathematics, 1097:2-142, 1984.\n[13] R. M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 14(4):1306-1326, 1987.\n[14] C. Elkan. 
The foundations of cost-sensitive learning. In IJCAI, pages 973-978, 2001.\n[15] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.\n[16] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19, pages 601-608, 2006.\n[17] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL, 2007.\n[18] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001.\n[19] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.\n[20] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.\n[21] D. Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.\n[22] D. Pollard. Asymptotics via empirical processes. Statistical Science, 4(4):341-366, 1989.\n[23] A. Rényi. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, pages 547-561, 1960.\n[24] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.\n[25] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2008.\n[26] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.\n[27] V. N. Vapnik. Estimation of Dependences Based on Empirical Data, 2nd ed. Springer, 2006.\n[28] J. von Neumann. Various techniques used in connection with random digits. Monte Carlo methods. Nat. Bureau Standards, 12:36-38, 1951.\n[29] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, 2003.", "award": [], "sourceid": 731, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}]}