New York University Tandon School of Engineering Computer Science and Engineering CS-GY 6923: Written Homework 1. Due Thursday, February 10th, 2022, 11:59pm. Discussion with other students is allowed for this problem set, but solutions must be written-up individually. For just this ﬁrst problem set, 10% extra credit will be given if solutions are typewritten (using LaTeX, Markdown, or some other mathematical formatting program). Problem 1: Practice Minimizing a Loss Function (10pts) Consider a linear model of the form: fβ(x) = βx, which is the same as the linear model we saw in class, but with the intercept forced to zero. Such models are used when we want to force the predicted value fβ(x) = 0 when x = 0. For example, if we are modeling y = output power of a motor vs. x = the input power, we would expect x = 0 ⇒ y = 0. (a) Given data (x1, y1), . . . , (xn, yn), write the equation for a loss function which measures prediction accu- racy using the sum-of-squared distances between the predicted values and target values. Make sure they remember to properly subscript yi, xi. n L(β) = (yi − βxi)2 i=1 (b) Derive an expression for the β that minimizes this loss function. Do you get the same expression that we got for β1 in the full linear model? • Set derivative to zero: d L(β) = 0 dβ • Calculate derivative d L(β) = n −2xi · (yi − βxi) dβ i=1 • Solve for β: β = ( n xiyi )/( n xi2) i=1 i=1 This expression is diﬀerent than what we got for the full linear model – its close, but you don’t subtract means oﬀ of x’s and y’s ﬁrst. Problem 2: Machine Learning Does Averages (15pts) Suppose we have data y1, . . . , yn ∈ R and we want to choose a single value m ∈ R which is “most repre- sentative” of our dataset. This is sometimes called the “central tendency” problem in statistics. A machine learning approach to this problem would measure how representative m is of the data using a loss function. (a) Consider the loss function L(m) = n (yi − m)2. Show that L(m) is minimized by setting m = y¯, mean of i=1 1 n where y¯ = n i=1 yi is the our data. • Set derivative to zero: d L(m) = n −2 · (yi − m) = 0 dm i=1 • Solve for m: m= 1 n yi n i=1 (b) Consider the loss function L(m) = maxi |yi − m|. What value of m minimizes this loss? Hint: Using derivatives will not help here – try just thinking about the minimization problem directly. Let ymin = mini yi and let ymax = maxi yi. We should set m = (ymin + ymax)/2. I.e. place m exactly between the largest and smallest value in our data set. No other point can achieve a better value on the loss because it will be further than (ymin + ymax)/2 from either ymin or ymax.

(c) Consider the loss function L(m) = n |yi − m|. Prove that L(m) is minimized by setting m to the i=1 median of the data. Hint: This question is harder than the previous two and takes some creativity! Again derivatives might not be helpful. There are a number of ways to prove this. One is as follows: • Sort our numbers from smallest to largest so that y1 ≤ y2 ≤ . . . , ≤ ym. • Split n |yi − m| into n |yi − m| = (|y1 − m| + |yn − m|) + (|y2 − m| + |yn−1 − m|) + .. . + i=1 i=1 (|yn/2 − m| + |yn/2+1 − m|). • Then observe that (|y1 − m| + |yn − m|) is minimized at cost |yn − y1| by any m that lies between y1 and yn. Similarly (|y2 − m| + |yn−1 − m|) is minimized by any m that lies between y2 and yn−1, so on and so forth. • So we can choose the best m by just making sure it lies simultaneously in the intervals [y1, yn], [y2, yn−1], . . . , [yn/2, yn/2 One such choice is clearly the median. (d) In a few short sentences, discuss when you might prefer each of the three losses above. Is the median typically considered a more “robust” measure of central tendency than the mean? Why? No real wrong or right answers here. Although they should be able to identify that the median is more robust. If you add a large outlier to a data set, the median will barely change, but the mean could move arbitrarily far. Problem 3: Practice with Non-linear Transformations. (8 pts) A medical researcher wants to model, f (t), the concentration of some chemical in the blood over time. She believes the concentration should decay exponentially in that f (t) = z0e−αt, (1) for some parameters z0 and α. To conﬁrm this model, and to estimate the parameters z0, α, she collects a large number of time-stamped samples (ti, ci), i = 1, . . . , n, where ci is the measured concentration at time ti. Unfortunately, the model (1) is non-linear, so she can’t directly apply the linear regression formula to estimate z0 and α. (a) Taking logarithms, show that we can transform our training data so that the conjectured relationship between predictor and target variables is in fact linear. log(f (t)) = log(z0) − αt. which is linear in t. (b) Write pseudocode (or actual Python) for how you might estimate z0 and α using this transformation. • Let c˜i = log(ci) for all i = 1, . . . , n. • Let c¯ = 1 n c˜i and t¯ = 1 n ti . n i=1 n i=1 • Compute σxy = 1 n (c˜i − c¯)(ti − t¯) and σx2 = 1 in=1(ti − t¯)2. n i=1 n • Compute β1 = σxy/σx2 and β0c¯ − β1t¯. • Set parameters z0 = eβ0 and α = −β1 It’s okay if they just say something like “ﬁnd β0 and β1 for inputs t1, . . . , tn and c˜1, . . . , c˜n using the equations from class”. Then they would only need to get the ﬁrst and last step of what I wrote above.

Problem 4: Practice With Gradients (8pts) For X ∈ Rn×d and target vector y ∈ Rn, consider ﬁtting a linear model of the form: fβ(x) = Xβ under the so-called p loss: Lp(β) = y − fβ(x) pp. Here · p denotes the p norm raised to the p power. p I.e. for any even integer p = 2, 4, 6, . . . n z p = zip p i=1 (a) Derive an expression for ∇g(z) where g(z) = z pp. We have that ∂g/∂zi = pzip−1. So if we stack all of the partial derivatives into a column vector we have: ∇g(z) = pzp−1 where zp−1 denotes raising every entry in the vector z to the p − 1 power. (b) Derive an expression for ∇Lp(β). Let h(z) = y−z p for some ﬁxed vector y. Using the same argument as in part (a), we have that p ∇h(z) = −p (y − z)p−1. Then we can apply the multivariate chain rule from class, and in particular the special case where the input of a function is multiplied by a matrix. Speciﬁcally, we have that Lp(β) = h(Xβ). So ∇Lp(β) = XT ∇h(Xβ) = −pXT (y − Xβ)p−1 where again the power denotes an entrywise operation. Problem 5: Piecewise Linear Regression via Feature Transformations (15pts) Your goal is to ﬁt a piecewise linear model to a single variate dataset of the form (x1, y1), . . . , (xn, yn) where all values are scalars. We will only use two pieces. In other words, for some known value λ, f (xi) = a1 + s1xi for xi < λ a2 + s2xi for xi ≥ λ with the additional constraint that a1 + s1λ = a2 + s2λ. This constraint ensures that our two linear models actually “meet” at x = λ, which means we get a continuous prediction function. For example, when λ = 100, a piecewise linear ﬁt for our MPG data might look like: (a) Show that this model is equivalent to the following unconstrained model: f (xi) = a1 + s1xi for xi < λ a1 + s1λ − s2λ + s2xi for xi ≥ λ From our constraint, we always have that a2 = a1 +s1λ−s2λ. So we can eliminate the a2 parameter: it is completely determined by the other model parameters. Simply replacing a2 in f (xi) with a1 + s1λ − s2λ gives the alternative formulation.

(b) Show how to ﬁt an optimal f under the squared loss using an algorithm for multiple linear regression. In particular, your approach should: • Transform the input data to form a data matrix X with multiple columns. • Use a multiple regression algorithm to ﬁnd the β which minimizes y − Xβ 22. • Extract from the optimal β optimal values for a1, s1, s2. You need to describe 1) a correct data transformation and 2) a correct mapping from β to a1, s1, s2. Note that in our model λ is known. It is not a model parameter which needs to be optimized. Construct X as follows: • First column equal to 1. • Second column equal to xi for any position i such that xi < λ. Otherwise, equal to λ for any position i such that xi ≥ λ. • Third column equal to 0 for any position i such that xi < λ. Otherwise, equal to xi − λ for any position i such that xi ≥ λ. Now consider Xβ. With some thought we see that the ith entry of this vector is equal to f (xi) with parameters a1 = β1, s1 = β2, and s2 = β3. (c) Implement your algorithm in Python and apply it to the dataset from demo auto mpg.ipynb. Produce a piecewise linear ﬁt for MPG as a function of Horsepower using the value λ = 100. Plot the result. You can attach a Jupyter notebook to your submission, or simply include the printed code and plot. See my solution in part 4 soln.ipynb. (d) (3pts bonus) Modify your approach to handle the case when λ is unknown. Again obtain a ﬁt for MPG vs. horsepower. What value of λ gives the optimal ﬁt? Include any modiﬁed code and a plot of your result. Easiest way to do this is by brute forcing over a grid of possible values of λ. For students who are struggling with the mathematical part but more comfortable with coding, you can give a hint to suggest this – e.g. ask if any tools from the ﬁrst lab might come in handy. It’s a good way for them to make up some points.

Search

### Read the Text Version

- 1 - 4

Pages: