15.1. Methods and Mathematics – EN – Deep Learning Bible – 2. Classification

en.wikipedia.org/wiki/Multi-task_learning

CONTENTS

  1. Methods
    1.1. Task grouping and overlap
    1.2. Exploiting unrelated tasks
    1.3. Transfer of knowledge
    1.4. Group online adaptive learning
  2. Mathematics / Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)
    2.1. RKHSvv concepts
    2.2. Separable kernels
    2.3. Known task structure
    $\quad$ 2.3.1. Task structure representations
    $\quad$ 2.3.2. Task structure examples
    2.4. Learning tasks together with their structure
    $\quad$ 2.4.1. Optimization of Q
    $\quad$ 2.4.2. Special cases
    $\quad$ 2.4.3. Generalizations

1. Methods

1.1. Task grouping and overlap

Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases. Task relatedness can be imposed a priori or learned from the data. Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly. For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.

1.2. Exploiting unrelated tasks

One can attempt to learn a group of principal tasks using a group of auxiliary tasks unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods which build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.

1.3. Transfer of knowledge

Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet, an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.
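For instance, both reuse patterns can be written in a few lines of PyTorch. The sketch below is illustrative, assuming torchvision's ImageNet-pretrained GoogLeNet and a hypothetical downstream task with ten classes:

```python
import torch
import torchvision.models as models

# Hypothetical downstream task: the number of classes is our assumption.
NUM_NEW_CLASSES = 10

# Pattern 1: feature extraction. Load a pretrained GoogLeNet, freeze all of
# its weights, and train only a new classification head.
model = models.googlenet(weights="DEFAULT")   # ImageNet-pretrained weights
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, NUM_NEW_CLASSES)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# Pattern 2 (fine-tuning) would skip the freezing loop and instead train
# every layer, typically with a smaller learning rate.
```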

1.4. Group online adaptive learning

Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed Group online adaptive learning (GOAL). Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from the previous experience of another learner to adapt quickly to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents.

2. Mathematics / Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)

The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.

2.1. RKHSvv concepts

Suppose the training data set is ${\displaystyle {\mathcal {S}}_{t}=\{(x_{i}^{t},y_{i}^{t})\}_{i=1}^{n_{t}}}$, with ${\displaystyle x_{i}^{t}\in {\mathcal {X}}}$, ${\displaystyle y_{i}^{t}\in {\mathcal {Y}}}$, where $t$ indexes the task and ${\displaystyle t\in \{1,\dots,T\}}$. Let ${\displaystyle n=\sum_{t=1}^{T}n_{t}}$. In this setting there is a consistent input and output space and the same loss function ${\displaystyle {\mathcal {L}}:\mathbb {R} \times \mathbb {R} \rightarrow \mathbb {R}_{+}}$ for each task. This results in the regularized machine learning problem:

$${\displaystyle \min_{f\in {\mathcal {H}}}\sum_{t=1}^{T}{\frac {1}{n_{t}}}\sum_{i=1}^{n_{t}}{\mathcal {L}}(y_{i}^{t},f_{t}(x_{i}^{t}))+\lambda ||f||_{\mathcal {H}}^{2}} \qquad \qquad (1)$$

where ${\mathcal {H}}$ is a vector valued reproducing kernel Hilbert space with functions ${\displaystyle f:{\mathcal {X}}\rightarrow {\mathcal {Y}}^{T}}$ having components ${\displaystyle f_{t}:{\mathcal {X}}\rightarrow {\mathcal {Y}}}$.

The reproducing kernel for the space ${\mathcal {H}}$ of functions ${\displaystyle f:{\mathcal {X}}\rightarrow \mathbb {R} ^{T}}$ is a symmetric matrix-valued function ${\displaystyle \Gamma :{\mathcal {X}}\times {\mathcal {X}}\rightarrow \mathbb {R} ^{T\times T}}$ such that, for every $c\in \mathbb {R} ^{T}$ and $x\in {\mathcal {X}}$, ${\displaystyle \Gamma (\cdot ,x)c\in {\mathcal {H}}}$ and the following reproducing property holds:

$${\displaystyle \langle f(x),c\rangle_{\mathbb {R}^{T}}=\langle f,\Gamma (x,\cdot )c\rangle_{\mathcal {H}}} \qquad \qquad (2)$$

The reproducing kernel gives rise to a representer theorem showing that any solution to equation 1 has the form:

$${\displaystyle f(x)=\sum_{t=1}^{T}\sum_{i=1}^{n_{t}}\Gamma (x,x_{i}^{t})c_{i}^{t}} \qquad \qquad (3)$$

2.2. Separable kernels

The form of the kernel $\Gamma$ induces both the representation of the feature space and the structure of the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space ${\mathcal {X}}$ and on the tasks ${\displaystyle \{1,\dots,T\}}$. In this case the kernel relating scalar components ${\displaystyle f_{t}}$ and ${\displaystyle f_{s}}$ is given by ${\textstyle \gamma ((x_{i},t),(x_{j},s))=k(x_{i},x_{j})k_{T}(s,t)=k(x_{i},x_{j})A_{s,t}}$. For vector-valued functions ${\displaystyle f\in {\mathcal {H}}}$ we can write ${\displaystyle \Gamma (x_{i},x_{j})=k(x_{i},x_{j})A}$, where $k$ is a scalar reproducing kernel and $A$ is a symmetric positive semi-definite ${\displaystyle T\times T}$ matrix. Henceforth denote ${\displaystyle S_{+}^{T}=\{{\text{PSD matrices}}\}\subset \mathbb {R} ^{T\times T}}$.

This factorization property, separability, implies that the input feature-space representation does not vary by task; there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by $A$. Methods for non-separable kernels $\Gamma$ are a current field of research.

For the separable case, the representer theorem reduces to ${\textstyle f(x)=\sum_{i=1}^{n}k(x,x_{i})Ac_{i}}$. The model output on the training data is then $KCA$, where $K$ is the $n\times n$ empirical kernel matrix with entries ${\textstyle K_{i,j}=k(x_{i},x_{j})}$, and $C$ is the ${\displaystyle n\times T}$ matrix of rows $c_{i}$.
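To make this concrete, here is a small numerical sketch of a separable kernel; all names, shapes, and parameter values are illustrative assumptions:

```python
import numpy as np

# Toy separable-kernel construction (illustrative, not from the source).
rng = np.random.default_rng(0)
n, T = 20, 3                          # n shared inputs, T tasks
X = rng.normal(size=(n, 2))

def scalar_kernel(X1, X2, sigma=1.0):
    """Gaussian scalar kernel k(x, x')."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

K = scalar_kernel(X, X)               # n x n kernel matrix, K_ij = k(x_i, x_j)
B = rng.normal(size=(T, T))
A = B @ B.T                           # symmetric PSD task-structure matrix
C = rng.normal(size=(n, T))           # coefficient matrix with rows c_i

F_train = K @ C @ A                   # model output on the training data: KCA

# Evaluating f at a new point x: f(x) = sum_i k(x, x_i) A c_i.
x_new = rng.normal(size=(1, 2))
f_new = scalar_kernel(x_new, X) @ C @ A   # shape (1, T): one value per task
```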

With the separable kernel, equation 1 can be rewritten as

$${\displaystyle \min_{C\in \mathbb {R}^{n\times T}}V(Y,KCA)+\lambda tr(KCAC^{\top })} \qquad \qquad (P)$$

where $V$ is a (weighted) average of ${\mathcal {L}}$ applied entry-wise to $Y$ and $KCA$. (The weight is zero if ${\displaystyle Y_{i}^{t}}$ is a missing observation.)

Note that the second term in $P$ can be derived as follows:

$${\displaystyle {\begin{aligned}||f||_{\mathcal {H}}^{2}
&=\left\langle \sum_{i=1}^{n}k(\cdot ,x_{i})Ac_{i},\sum_{j=1}^{n}k(\cdot ,x_{j})Ac_{j}\right\rangle_{\mathcal {H}} \\
&=\sum_{i,j=1}^{n}\langle k(\cdot ,x_{i})Ac_{i},k(\cdot ,x_{j})Ac_{j}\rangle_{\mathcal {H}}&{\text{(bilinearity)}} \\
&=\sum_{i,j=1}^{n}\langle k(x_{i},x_{j})Ac_{i},c_{j}\rangle_{\mathbb {R} ^{T}}&{\text{(reproducing property)}} \\
&=\sum_{i,j=1}^{n}k(x_{i},x_{j})c_{i}^{\top }Ac_{j}=tr(KCAC^{\top })\end{aligned}}}$$
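This identity is easy to check numerically with random positive semi-definite stand-ins for $K$ and $A$ (a sketch, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 10, 3
G = rng.normal(size=(n, n)); K = G @ G.T      # PSD stand-in for the kernel matrix
B = rng.normal(size=(T, T)); A = B @ B.T      # PSD task matrix
C = rng.normal(size=(n, T))                   # rows c_i

# Double sum  sum_{i,j} k(x_i, x_j) c_i^T A c_j ...
lhs = sum(K[i, j] * C[i] @ A @ C[j] for i in range(n) for j in range(n))
# ... equals the trace form tr(K C A C^T).
rhs = np.trace(K @ C @ A @ C.T)
assert np.isclose(lhs, rhs)
```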

2.3. Known task structure

2.3.1. Task structure representations

There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, and through an output mapping.

Regularizer — With the separable kernel, it can be shown (below) that ${\textstyle ||f||_{\mathcal {H}}^{2}=\sum_{s,t=1}^{T}A_{t,s}^{\dagger }\langle f_{s},f_{t}\rangle_{{\mathcal {H}}_{k}}}$, where ${\displaystyle A_{t,s}^{\dagger }}$ is the ${\displaystyle t,s}$ element of the pseudoinverse of $A$, ${\displaystyle {\mathcal {H}}_{k}}$ is the RKHS based on the scalar kernel $k$, and ${\textstyle f_{t}(x)=\sum_{i=1}^{n}k(x,x_{i})A_{t}^{\top }c_{i}}$. This formulation shows that ${\displaystyle A_{t,s}^{\dagger }}$ controls the weight of the penalty associated with ${\textstyle \langle f_{s},f_{t}\rangle_{{\mathcal {H}}_{k}}}$. (Note that ${\textstyle \langle f_{s},f_{t}\rangle_{{\mathcal {H}}_{k}}}$ arises from ${\textstyle ||f_{t}||_{{\mathcal {H}}_{k}}^{2}=\langle f_{t},f_{t}\rangle_{{\mathcal {H}}_{k}}}$.)

Proof
$${\displaystyle {\begin{aligned}||f||_{\mathcal {H}}^{2}
&=\left\langle \sum_{i=1}^{n}\gamma ((x_{i},t_{i}),\cdot )c_{i}^{t_{i}},\sum_{j=1}^{n}\gamma ((x_{j},t_{j}),\cdot )c_{j}^{t_{j}}\right\rangle_{\mathcal {H}} \\
&=\sum_{i,j=1}^{n}c_{i}^{t_{i}}c_{j}^{t_{j}}\gamma ((x_{i},t_{i}),(x_{j},t_{j})) \\
&=\sum_{i,j=1}^{n}\sum_{s,t=1}^{T}c_{i}^{t}c_{j}^{s}k(x_{i},x_{j})A_{s,t} \\
&=\sum_{i,j=1}^{n}k(x_{i},x_{j})\langle c_{i},Ac_{j}\rangle_{\mathbb {R} ^{T}} \\
&=\sum_{i,j=1}^{n}k(x_{i},x_{j})\langle c_{i},AA^{\dagger }Ac_{j}\rangle_{\mathbb {R} ^{T}} \\
&=\sum_{i,j=1}^{n}k(x_{i},x_{j})\langle Ac_{i},A^{\dagger }Ac_{j}\rangle_{\mathbb {R} ^{T}} \\
&=\sum_{i,j=1}^{n}\sum_{s,t=1}^{T}(Ac_{i})^{t}(Ac_{j})^{s}k(x_{i},x_{j})A_{s,t}^{\dagger } \\
&=\sum_{s,t=1}^{T}A_{s,t}^{\dagger }\left\langle \sum_{i=1}^{n}k(x_{i},\cdot )(Ac_{i})^{t},\sum_{j=1}^{n}k(x_{j},\cdot )(Ac_{j})^{s}\right\rangle_{{\mathcal {H}}_{k}} \\
&=\sum_{s,t=1}^{T}A_{s,t}^{\dagger }\langle f_{t},f_{s}\rangle_{{\mathcal {H}}_{k}}\end{aligned}}}$$

Output metric — an alternative output metric on ${\displaystyle {\mathcal {Y}}^{T}}$ can be induced by the inner product ${\displaystyle \langle y_{1},y_{2}\rangle_{\Theta }=\langle y_{1},\Theta y_{2}\rangle_{\mathbb {R} ^{T}}}$. With the squared loss there is an equivalence between the separable kernel ${\displaystyle k(\cdot ,\cdot )I_{T}}$ under the alternative metric and the separable kernel ${\displaystyle k(\cdot ,\cdot )\Theta }$ under the canonical metric.

Output mapping — Outputs can be mapped as ${\displaystyle L:{\mathcal {Y}}^{T}\rightarrow {\mathcal {\tilde {Y}}}}$ to a higher dimensional space to encode complex structures such as trees, graphs and strings. For linear maps $L$, with appropriate choice of separable kernel, it can be shown that ${\displaystyle A=L^{\top }L}$.
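For instance, a toy linear map $L$ (chosen arbitrarily here) and the task matrix it induces:

```python
import numpy as np

# Toy linear output mapping; L is an arbitrary illustrative choice.
# It embeds T = 3 task outputs into a 4-dimensional structured space,
# with the last coordinate coupling all three tasks.
L = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
A = L.T @ L   # induced task matrix for the separable kernel k(.,.)A
```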

2.3.2. Task structure examples

Via the regularizer formulation, one can represent a variety of task structures easily.

  • Letting ${\textstyle A^{\dagger }=\gamma I_{T}+(\gamma -\lambda ){\frac {1}{T}}\mathbf {1} \mathbf {1} ^{\top }}$ (where ${\displaystyle I_{T}}$ is the $T \times T$ identity matrix, and ${\textstyle \mathbf {1} \mathbf {1} ^{\top }}$ is the $T \times T$ matrix of ones) is equivalent to letting $\gamma$ control the variance ${\textstyle \sum_{t}||f_{t}-{\bar {f}}||_{{\mathcal {H}}_{k}}^{2}}$ of tasks from their mean ${\textstyle {\bar {f}}={\frac {1}{T}}\sum_{t}f_{t}}$. For example, blood levels of some biomarker may be taken on $T$ patients at $n_t$ time points during the course of a day, and interest may lie in regularizing the variance of the predictions across patients.

  • Letting ${\displaystyle A^{\dagger }=\alpha I_{T}+(\alpha -\lambda )M}$, where ${\displaystyle M_{t,s}={\frac {1}{|G_{r}|}}\mathbb {I} (t,s\in G_{r})}$, is equivalent to letting $\alpha$ control the variance measured with respect to a group mean: ${\displaystyle \sum_{r}\sum_{t\in G_{r}}||f_{t}-{\frac {1}{|G_{r}|}}\sum_{s\in G_{r}}f_{s}||_{{\mathcal {H}}_{k}}^{2}}$. (Here ${\displaystyle |G_{r}|}$ is the cardinality of group $r$, and ${\displaystyle \mathbb {I} }$ is the indicator function.) For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.

  • Letting ${\displaystyle A^{\dagger }=\delta I_{T}+(\delta -\lambda )L}$, where ${\displaystyle L=D-M}$ is the Laplacian of the graph with adjacency matrix $M$ giving pairwise similarities of tasks, is equivalent to giving a larger penalty to the distance separating tasks $t$ and $s$ when they are more similar (according to the weight ${\displaystyle M_{t,s}}$); i.e., $\delta$ regularizes ${\displaystyle \sum_{t,s}||f_{t}-f_{s}||_{{\mathcal {H}}_{k}}^{2}M_{t,s}}$.

  • All of the above choices of $A$ also induce the additional regularization term ${\textstyle \lambda \sum_{t}||f_{t}||_{{\mathcal {H}}_{k}}^{2}}$ which penalizes complexity in $f$ more broadly. These constructions are illustrated in the sketch below.
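The following numpy sketch builds the three ${\displaystyle A^{\dagger }}$ matrices above; function names, sizes, and parameter values are illustrative assumptions:

```python
import numpy as np

T, lam = 4, 0.1   # number of tasks and lambda; values are illustrative

def mean_variance_Adag(gamma):
    # A^dagger = gamma*I_T + (gamma - lambda)*(1/T)*11^T: penalize the
    # variance of tasks around their mean.
    return gamma * np.eye(T) + (gamma - lam) * np.ones((T, T)) / T

def group_Adag(alpha, groups):
    # groups: list of index arrays; M_{t,s} = 1/|G_r| if t and s share group r.
    M = np.zeros((T, T))
    for G in groups:
        M[np.ix_(G, G)] = 1.0 / len(G)
    return alpha * np.eye(T) + (alpha - lam) * M

def laplacian_Adag(delta, M):
    # M: symmetric adjacency matrix of pairwise task similarities; L = D - M.
    Lap = np.diag(M.sum(axis=1)) - M
    return delta * np.eye(T) + (delta - lam) * Lap

# Example: two groups of two tasks each.
A_dag = group_Adag(alpha=1.0, groups=[np.array([0, 1]), np.array([2, 3])])
```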

2.4. Learning tasks together with their structure

Learning problem $P$ can be generalized to admit learning the task matrix $A$ as follows:

$${\displaystyle \min_{C\in \mathbb {R}^{n\times T}, A \in S_{+}^{T}}V(Y,KCA)+\lambda tr(KCAC^{\top })+F(A)} \qquad \qquad (Q)$$

The choice of ${\displaystyle F:S_{+}^{T}\rightarrow \mathbb {R}_{+}}$ must be designed to learn matrices $A$ of a given type. See “Special cases” below.

2.4.1. Optimization of Q

Restricting to the case of convex losses and coercive penalties, Ciliberto et al. have shown that, although $Q$ is not convex jointly in $C$ and $A$, a related problem is jointly convex.

Specifically, on the convex set ${\displaystyle {\mathcal {C}}=\{(C,A)\in \mathbb {R} ^{n\times T}\times S_{+}^{T}\,|\,Range(C^{\top }KC)\subseteq Range(A)\}}$, the equivalent problem

$${\displaystyle \min_{C,A\in {\mathcal {C}}}V(Y,KC)+\lambda tr(A^{\dagger }C^{\top }KC)+F(A)} \qquad \qquad (R)$$

is convex with the same minimum value. And if ${\displaystyle (C_{R},A_{R})}$ is a minimizer for $R$ then ${\displaystyle (C_{R}A_{R}^{\dagger },A_{R})}$ is a minimizer for $Q$.

$R$ may be solved by a barrier method on a closed set by introducing the following perturbation:

$${\displaystyle \min_{C\in \mathbb {R} ^{n\times T},A\in S_{+}^{T}}V(Y,KC)+\lambda tr(A^{\dagger }(C^{\top }KC+\delta ^{2}I_{T}))+F(A)} \qquad \qquad (S)$$

The perturbation via the barrier ${\displaystyle \delta ^{2}tr(A^{\dagger })}$ forces the objective function to be equal to $+\infty$ on the boundary of ${\displaystyle \mathbb {R} ^{n\times T}\times S_{+}^{T}}$.

$S$ can be solved with a block coordinate descent method, alternating in $C$ and $A$. This results in a sequence of minimizers ${\displaystyle (C_{m},A_{m})}$ of $S$ that converges to the solution of $R$ as ${\displaystyle \delta_{m}\rightarrow 0}$, and hence gives the solution to $Q$.
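As an illustration, here is a hedged numpy sketch of this alternating scheme under specific assumptions: squared loss ${\textstyle V(Y,KC)=||Y-KC||_{F}^{2}}$, the Frobenius penalty ${\textstyle F(A)=\mu \,tr(A^{\top }A)}$ (the spectral penalty discussed below), and a fixed small $\delta$ rather than a decreasing sequence ${\displaystyle \delta_{m}}$. The closed-form updates are derived for these choices only and are not code from Ciliberto et al.:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def mtl_alternating(K, Y, lam=1e-2, mu=1e-2, delta=1e-3, iters=50):
    """Block coordinate descent sketch for problem (S), assuming squared loss
    and the Frobenius penalty F(A) = mu * tr(A^T A); delta is held fixed."""
    n, T = Y.shape
    A = np.eye(T)                      # initialize task matrix
    C = np.zeros((n, T))
    for _ in range(iters):
        # C-step: setting the gradient K(KC - Y) + lam*K C A^dagger to zero
        # and cancelling one factor of K (K assumed invertible) gives the
        # Sylvester equation K C + C (lam * A^dagger) = Y.
        A_dag = np.linalg.pinv(A)
        C = solve_sylvester(K, lam * A_dag, Y)
        # A-step: minimize lam*tr(A^{-1} M) + mu*tr(A^2), M = C^T K C + delta^2 I.
        # Stationarity gives A^3 = (lam / (2*mu)) * M, solved by an
        # eigendecomposition of the symmetric positive definite M.
        M = C.T @ K @ C + delta**2 * np.eye(T)
        w, V = np.linalg.eigh((M + M.T) / 2)
        A = (V * ((lam / (2 * mu)) * np.clip(w, 0, None)) ** (1 / 3)) @ V.T
    return C, A

# Usage on random toy data (Y has one column per task):
rng = np.random.default_rng(0)
n, T = 30, 4
G = rng.normal(size=(n, n))
K = G @ G.T + 1e-6 * np.eye(n)        # PSD stand-in for the kernel matrix
Y = rng.normal(size=(n, T))
C, A = mtl_alternating(K, Y)
```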

2.4.2. Special cases

Spectral penalties – Dinuzzo et al. suggested setting $F$ as the Frobenius norm ${\displaystyle {\sqrt {tr(A^{\top }A)}}}$. They optimized $Q$ directly using block coordinate descent, without accounting for difficulties at the boundary of ${\displaystyle \mathbb {R} ^{n\times T}\times S_{+}^{T}}$.

Clustered tasks learning – Jacob et al. suggested learning $A$ in the setting where $T$ tasks are organized in $R$ disjoint clusters. In this case let ${\displaystyle E\in \{0,1\}^{T\times R}}$ be the matrix with ${\displaystyle E_{t,r}=\mathbb {I} ({\text{task }}t\in {\text{group }}r)}$. Setting ${\displaystyle M=EE^{\dagger }}$ (the orthogonal projection onto the column space of $E$, with entries ${\displaystyle M_{t,s}={\frac {1}{|G_{r}|}}\mathbb {I} (t,s\in G_{r})}$ as in the group penalty above) and ${\displaystyle U={\frac {1}{T}}\mathbf {11} ^{\top }}$, the task matrix ${\displaystyle A^{\dagger }}$ can be parameterized as a function of $M$: ${\displaystyle A^{\dagger }(M)=\epsilon_{M}U+\epsilon_{B}(M-U)+\epsilon (I-M)}$, with terms that penalize the average, the between-cluster variance, and the within-cluster variance of the task predictions, respectively. The set of admissible $M$ is not convex, but there is a convex relaxation ${\displaystyle {\mathcal {S}}_{c}=\{M\in S_{+}^{T}:I-M\in S_{+}^{T}\land tr(M)=R\}}$. In this formulation, ${\displaystyle F(A)=\mathbb {I} (A(M)\in \{A:M\in {\mathcal {S}}_{c}\})}$.
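A toy construction of this parameterization, with group sizes and $\epsilon$ weights chosen arbitrarily for illustration:

```python
import numpy as np

T, R = 6, 2                                   # 6 tasks in 2 disjoint clusters
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]

E = np.zeros((T, R))                          # cluster indicator matrix
for r, G in enumerate(groups):
    E[G, r] = 1.0

M = E @ np.linalg.pinv(E)                     # projection E E^dagger, tr(M) = R
U = np.ones((T, T)) / T                       # projection onto the global mean

eps_M, eps_B, eps = 0.1, 1.0, 10.0            # illustrative penalty weights
A_dag = eps_M * U + eps_B * (M - U) + eps * (np.eye(T) - M)
```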

2.4.3. Generalizations

Non-convex penalties – Penalties can be constructed such that $A$ is constrained to be a graph Laplacian, or such that $A$ has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.

Non-separable kernels – Separable kernels are limited; in particular, they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.

