Reference
Consistent Rank Logits for Ordinal Regression
Network design
After the last fully-connected layer, which has a single output unit, a 1-D bias layer with num_classes - 1 independent bias units is introduced.
```python
self.fc = nn.Linear(4096, 1, bias=False)
self.linear_1_bias = nn.Parameter(torch.zeros(num_classes - 1).float())
```
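For context, here is a minimal sketch of how these two pieces could fit together in a forward pass. The module name `CoralHead`, the input size, and the default number of classes are assumptions for illustration, not taken from the original code:

```python
import torch
import torch.nn as nn

class CoralHead(nn.Module):
    """Sketch of the CORAL output layer: one shared weight vector
    plus (num_classes - 1) independent bias units."""
    def __init__(self, in_features=4096, num_classes=7):
        super().__init__()
        self.fc = nn.Linear(in_features, 1, bias=False)
        self.linear_1_bias = nn.Parameter(torch.zeros(num_classes - 1).float())

    def forward(self, x):
        g = self.fc(x)                      # g(x, W): shared score, shape (batch, 1)
        logits = g + self.linear_1_bias     # broadcast to (batch, num_classes - 1)
        probas = torch.sigmoid(logits)      # per-task probabilities
        return logits, probas
```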
Loss function
Let $W$ denote the weight parameters of the neural network excluding the bias units of the final layer. The penultimate layer, whose output is denoted as $g(x_i, W)$, shares a single weight with all nodes in the final output layer. $K-1$ independent bias units are then added to $g(x_i, W)$ such that $\{g(x_i, W)+b_k\}_{k=1}^{K-1}$ are the inputs to the corresponding binary classifiers in the final layer. Let $s(z)=1/(1+\exp(-z))$ be the logistic sigmoid function. The predicted empirical probability for task $k$ is defined as:

$$\hat{P}(y_i^k=1) = s(g(x_i, W) + b_k)$$

For model training, we minimize the loss function

$$L(W,b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \lambda^k \left[ \log\big(s(g(x_i,W)+b_k)\big)\, y_i^k + \log\big(1-s(g(x_i,W)+b_k)\big)\,\big(1-y_i^k\big) \right],$$

which is the weighted cross-entropy of the $K-1$ binary classifiers. For rank prediction, the binary labels are obtained via

$$f_k(x_i) = \mathbb{1}\{\hat{P}(y_i^k=1) > 0.5\}$$
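A hedged sketch of this loss in PyTorch is given below; the function name and the importance weights `imp` (standing in for the $\lambda^k$) are assumptions, and `F.logsigmoid(logits) - logits` is used as a numerically stable form of $\log(1 - s(z))$:

```python
import torch
import torch.nn.functional as F

def coral_loss(logits, levels, imp=None):
    """Weighted cross-entropy over the K-1 binary tasks (sketch).

    logits: (batch, num_classes - 1) values g(x_i, W) + b_k
    levels: (batch, num_classes - 1) binary labels y_i^k
    imp:    (num_classes - 1,) task weights lambda^k (optional)
    """
    if imp is None:
        imp = torch.ones(logits.shape[1])
    # log s(z) = logsigmoid(z);  log(1 - s(z)) = logsigmoid(z) - z
    term = (F.logsigmoid(logits) * levels
            + (F.logsigmoid(logits) - logits) * (1 - levels))
    return torch.mean(-torch.sum(term * imp, dim=1))
```

The rank prediction $f_k$ then reduces to counting the tasks whose probability exceeds 0.5, e.g. `torch.sum(torch.sigmoid(logits) > 0.5, dim=1)`.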
Example
Let's take a look at the labels for 7 ranks:
For cross-entropy, the one-hot encoded label for class 3 (zero-indexed) is $[0,0,0,1,0,0,0]^T$;
for the CORAL loss, the extended binary label is $[1,1,1,0,0,0]^T$:
```python
levels = [[1] * label + [0] * (self.num_classes - 1 - label) for label in batch_y]
```
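For instance, with num_classes = 7 and a hypothetical batch of labels, this encoding produces:

```python
num_classes = 7
batch_y = [3, 0, 6]
levels = [[1] * label + [0] * (num_classes - 1 - label) for label in batch_y]
# levels == [[1, 1, 1, 0, 0, 0],
#            [0, 0, 0, 0, 0, 0],
#            [1, 1, 1, 1, 1, 1]]
```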
Suppose the predicted probabilities (sigmoid outputs) of the CORAL model look like $[0.9, 0.8, 0.6, 0.4, 0.2, 0.1]^T$. We count the entries that are greater than 0.5; there are 3 of them, so the predicted label is 3.
During training, the loss for the current sample is calculated as
$$
\begin{aligned}
L = &-\sum_{k=1}^{K-1} [1,1,1,0,0,0] * \log\big([0.9,0.8,0.6,0.4,0.2,0.1]^T\big) \\
    &+ \big(1 - [1,1,1,0,0,0]\big) * \log\big(1 - [0.9,0.8,0.6,0.4,0.2,0.1]^T\big) \\
  = &-\sum_{k=1}^{K-1} [1,1,1,0,0,0] * \log\big([0.9,0.8,0.6,0.4,0.2,0.1]^T\big) \\
    &+ [0,0,0,1,1,1] * \log\big(1 - [0.9,0.8,0.6,0.4,0.2,0.1]^T\big)
\end{aligned}
$$
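Purely for illustration, the numbers of this example can be plugged in directly; the loss value in the comment is computed from these assumed probabilities:

```python
import torch

probas = torch.tensor([0.9, 0.8, 0.6, 0.4, 0.2, 0.1])
levels = torch.tensor([1., 1., 1., 0., 0., 0.])

loss = -torch.sum(levels * torch.log(probas) + (1 - levels) * torch.log(1 - probas))
# loss ≈ 1.6787

predicted_label = int(torch.sum(probas > 0.5))  # 3
```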
Ordinal Regression
Network design
The last fc layer outputs (num_classes - 1) * 2 logits:
```python
self.fc = nn.Linear(2048 * block.expansion, (self.num_classes - 1) * 2)
```
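Since the prediction and loss below apply a softmax over dim=2, the flat output of this layer is presumably reshaped into a (batch, num_classes - 1, 2) tensor first. A self-contained sketch of that step, where the batch size, num_classes = 7, and expansion = 4 are assumptions:

```python
import torch
import torch.nn as nn

num_classes, expansion = 7, 4
fc = nn.Linear(2048 * expansion, (num_classes - 1) * 2)

x = torch.randn(8, 2048 * expansion)             # hypothetical pooled features
logits = fc(x)                                   # (8, (num_classes - 1) * 2)
logits = logits.view(-1, num_classes - 1, 2)     # (8, num_classes - 1, 2)
```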
The final prediction is obtained similarly to the CORAL loss:
```python
probas = F.softmax(logits, dim=2)[:, :, 1]
predict_levels = probas > 0.5
predicted_labels = torch.sum(predict_levels, dim=1)
```
Loss function
```python
def cost_fn(logits, levels, imp):
    val = -torch.sum((F.log_softmax(logits, dim=2)[:, :, 1] * levels
                      + F.log_softmax(logits, dim=2)[:, :, 0] * (1 - levels)) * imp,
                     dim=1)
    return torch.mean(val)
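```

A hedged usage sketch tying the pieces together; the batch size, number of classes, random stand-in logits, and the uniform task-importance weights `imp` (the λ^k of the CORAL loss) are assumptions:

```python
import torch
import torch.nn.functional as F

num_classes, batch_size = 7, 8

# hypothetical ground-truth labels and their extended binary encoding (as above)
batch_y = torch.randint(0, num_classes, (batch_size,)).tolist()
levels = torch.tensor([[1] * y + [0] * (num_classes - 1 - y) for y in batch_y],
                      dtype=torch.float32)

# uniform task-importance weights
imp = torch.ones(num_classes - 1)

logits = torch.randn(batch_size, num_classes - 1, 2)  # stand-in for network output
loss = cost_fn(logits, levels, imp)
print(loss)  # scalar training loss for this batch
```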