Problem #1 (Semi-Supervised Learning): Is the learning rate in binary classification problems substantially lower when one knows one has available (potentially infinite amount of) unlabeled data. The conjecture is that no! More precisely, let H be any class of {0,1}-valued functions from some domain X. A learning algorithm A has access to a labeled sample {(x1, y1), (x2, y2), ..., (xm,ym)} where xi's are i.i.d. generated according to some probability distribution D over the domain and yi = h(xi) for some unknown ``target'' function h from the class H. Algorithm's output is a {0,1}-valued function g over the domain. Learning rate of an algorithm is rate = Expectation[|g(x)-h(x)|] where the expectation is over x, x1, x2,..., xm, generated i.i.d. according to D. Consider two learning rates: a) the D-specific which is defined as min max rate where the maximum is taken with respect to choice of h and the minimum is taken with respect to the choice of the learning algorithm A. (The D-specific rate is what is the rate that is theoretically achievable if we have infinite amounts of unlabeled data available and we use the ``best'' algorithm for that exploits the unlabeled data.) b) the (D,A)-specific rate is defined as max rate where the maximum is taken over the choice of h. (This is the rate that is achievable with a given algorithm if the distribution is fixed.) Our conjecture can be phrased as follow: Does there exist an algorithm A such that for any D the (D,A)-specific rate is within constant factor of the D-specific rate?
Problem #2 (Domain Adaptation): This an open ended problem and it's somewhat similar to Problem #1. Suppose we have infinite amounts of (unlabeled) data available from two distributions P (the source) and Q (the target) and a labeled sample {(x1, y1), (x2, y2), ..., (xm,ym)} where xi's are i.i.d. generated from the source distribution and yi = h(xi) for some unknown ``target'' function h from the class H. The learning rate is defined as in Problem #1 with the proviso that expectation is taken with respect to the target distribution. Define the (P,Q)-specific learning rate as min max rate where the maximum is taken with respect to choice of h and the minimum is taken with respect to the choice of the learning algorithm A. What is the limit of the (P,Q)-specific learning as sample size goes to infinity? Define the excess rate as the usual rate minus this limit. In the same spirit as in problem #1, define (P,Q)-specific excess rate and (P,Q,A)-specific excess rate. Does there exists an algorithm A such that for any pair P,Q the (P,Q,A)-specific excess rate is within constant factor of the (P,Q)-specific excess rate?
Problem #3 (Stochastic Optimization): Let X be a domain set (any non-empty set, potentially infinite), and let Y be the set of solutions (any non-empty set, potentially infinite). Consider a {0,1}-valued loss function L over X×Y and any distribution over D over the domain. Given an i.i.d. sample from the distribution, we want to find a solution y with (approximate) minimum cost where the cost is defined as cost(y) = E[L(x,y)]. We say that a procedure is consistent if, as sample size goes to infinity, the cost of the solution produced approaches the minimum cost. Under what conditions on L does there exist a procedure that is universally consistent (i.e. consistent for any distribution D). Think of L as an infinite binary matrix. Give a combinatorial characterization L when a universal constistent procedure exists! (In the spirit of VC-dimension.) Finite VC-dimension of L is sufficient but not necessary; (e.g. if for some y* and all x, L(x,y*)=0, then for the remaining y's L(x,y) can be arbitrarily crazy and the procedure that outputs y* regardless of the input is universally consistent.) What about distribution-free rates of the difference between the cost of the output solution and the minimum cost?
Problem #4 (Active Budgeted Learning) Let P be a probability distribution over Rd and a target function h from some class of functions H from Rd to a class of labels {0,1,2,...,K}. (H can be e.g. the class of hyperplanes.) We have a large data set available sampled from Rd however the coordinates of the data set are hidden. For each point x in the data set we only know its label f(x). However, for any point x in the data set we can ask for its features. How many features in total do we need to see in order to learn f(x) within precision epsilon.
Problem #5 (Motion planning) Consider a subset C of the unit d-dimensional cube in the Euclidean space. We have an oracle access to C, that is, for any point in the cube we can ask whether it belongs to C or not. Imagine a robot in ``lower-left'' corner that needs to get to the ``upper-right'' corner of the cube along a continuous path, as short as possible, that avoids C. We want to find such a path (i.e. a plan) for the robot that is as short as possible in as few oracle queries as possible. (Our strategy can be, of course, randomized.) Say, we define the regret as the difference between the length of the path the we output after n oracle queries and the length of the shortest path. Give an upper bound on the regret as a function of n. I think, in order to get meaningful results, we need to assume that C is ``nice''. For start, we can assume that C is a square. What about more complicated sets?
Problem #5 (Online K-means) Let x1,x2,...,xT be a sequence of points in the unit ball (in Rd or a Hilbert space). The sequence is revealed in online fashion to a clustering algorithm. In round t, first the algorithm chooses a k-tuple of points (i.e. a k-means clustering), then it is revealed xt and suffers loss k-means loss. Design an algorithm with sublinear regret!