Skip to the content.

Adaptive Dual-Teacher Distillation for Lightweight Video Models

TL;DR

Table of Contents

A dual-teacher knowledge distillation framework uses a heterogeneous Vision Transformer (ViT) teacher and a structurally similar CNN teacher to train a lightweight CNN student for video action recognition. Two key components drive gains:

  • Discrepancy‑Aware Teacher Weighting (DATW): adaptively mixes teacher logits per sample using teacher confidence and student–teacher discrepancy.
  • Structure Discrepancy‑Aware Distillation (SDD): the student learns the residual between ViT and CNN teacher features via a lightweight auxiliary branch during training (removed at inference).

On HMDB51, EPIC‑KITCHENS‑100, and Kinetics‑400, the student CNN surpasses strong ViT teachers while keeping CNN efficiency. FLOPs reduce dramatically (e.g., ~96% vs. a ViT teacher) with large parameter savings (e.g., ~89%), making deployment on mobile/edge devices practical.

Reference: Revisiting Cross‑Architecture Distillation: Adaptive Dual‑Teacher Transfer for Lightweight Video Models


Motivation

  • ViTs excel at global context but are compute‑heavy.
  • Lightweight CNNs are efficient but often trail in accuracy due to local‑bias feature extraction.
  • Prior single‑teacher ViT→CNN distillation struggles with architectural mismatch.
  • Insight: a structurally similar CNN teacher (homogeneous w.r.t. the student) provides aligned, stable guidance, while a ViT teacher contributes complementary global cues. Combining both yields better transfer than either alone.

Core Ideas and Contributions

  • Dual‑Teacher Distillation: one ViT teacher (heterogeneous) + one CNN teacher (homogeneous) supervise a lightweight CNN student simultaneously.
  • Discrepancy‑Aware Teacher Weighting (DATW): per‑sample adaptive weights favor teachers that are (i) confident and (ii) meaningfully different from the student’s current prediction.
  • Structure Discrepancy‑Aware Distillation (SDD): the student learns residual features that capture what the ViT adds beyond the CNN teacher’s features via a temporary training‑only auxiliary branch.
  • Relational Knowledge Distillation (RKD): transfers architecture‑agnostic relational structure (pairwise relations between samples) from ViT to student.

Discrepancy‑Aware Teacher Weighting (DATW)

For a given input, each teacher produces logits. DATW computes a per‑teacher efficacy score that increases when:

  • the teacher is confident (low‑entropy softmax), and
  • the teacher’s prediction meaningfully disagrees with the student (e.g., cosine distance in logit space).

These scores are normalized to obtain weights that form a weighted combination of teacher logits as the soft target for the student. This balances complementary strengths dynamically, favoring the most informative teacher signal for each sample.

Intuition:

  • High‑confidence + high‑discrepancy teacher ⇒ higher weight.
  • Low‑confidence or redundant teacher ⇒ lower weight.

Benefits:

  • Improves stability and efficacy vs. uniform mixing.
  • Avoids over‑relying on a single teacher when it’s uncertain or uninformative for the current sample.

Structure Discrepancy‑Aware Distillation (SDD) and Residual Features

Directly forcing a CNN student to mimic ViT intermediate features is brittle due to architectural mismatch (patch‑token attention vs. convolutional feature maps). SDD reframes the target as residual transfer:

  • Let $f_{\mathrm{ViT}}$ and $f_{\mathrm{CNN}}$ denote ViT and CNN teacher features at aligned stages (with optional projections (\phi) to match shapes).
  • The residual is $r = f_{\mathrm{ViT}} - f_{\mathrm{CNN}}$.
  • A lightweight Structure Discrepancy Branch (SDB) attached to the student predicts $r$ during training (often using a small non‑local block + squeeze‑and‑excitation).
  • The SDB is discarded at inference, so no runtime overhead is added.

Why it works:

  • Transfers the “what’s missing” global context from ViT relative to CNN, instead of copying incompatible features.
  • Focuses learning on complementary information while preserving the aligned CNN inductive biases.

Architectures: Heterogeneous vs. Structurally Similar Teachers

  • Structurally similar CNN teacher: same family/backbone style as the student (e.g., X3D‑M for an X3D‑S student, SlowFast‑R50 variant for SlowFast‑R50 student). This alignment makes feature‑level guidance coherent (similar receptive fields and spatial semantics).
  • Heterogeneous ViT teacher: models global dependencies with self‑attention over patches, providing powerful but architecturally different representations; transferred via residuals (SDD) and relational signals (RKD).

Result: CNN teacher stabilizes feature transfer; ViT teacher injects complementary global context.


Datasets and Results (Highlights)

  • Benchmarks: HMDB51, EPIC‑KITCHENS‑100, Kinetics‑400.
  • Students trained with the dual‑teacher framework outperform single‑teacher distillation and, in cases, surpass the ViT teacher’s accuracy (e.g., distilled X3D‑S > MViTv2‑S on HMDB51).
  • Ablations confirm both DATW and SDD materially contribute to gains; best transfer occurs when the CNN teacher is structurally aligned with the student.

Efficiency and Deployment

  • FLOPs reduction: reported up to ~96% vs. a ViT teacher for certain students (e.g., X3D‑S).
  • Parameter reduction: ~89% (e.g., 3.76M vs. 34.5M), improving memory footprint and load times.
  • No inference overhead from SDD (auxiliary branch is dropped after training).
  • Practical impact on devices:
    • Lower latency and power draw (subject to hardware/runtime specifics).
    • CNN‑friendly accelerators (GPU/NPU) often deliver better throughput than for heavy self‑attention workloads.

FAQs

What is a “structurally similar CNN” teacher?
A CNN teacher whose architecture closely matches the student’s (same backbone family and block types). This alignment yields comparable feature spaces and receptive fields, enabling more effective feature‑level guidance.

What is a “heterogeneous teacher”?
A teacher with a fundamentally different architecture than the student (e.g., a ViT teacher for a CNN student). Its features are transferred via residuals (SDD) and relational signals rather than direct imitation.

What does “low entropy” mean here?
Low entropy in a teacher’s softmax output implies high confidence (the probability mass concentrates on a few classes). DATW prioritizes confident and informative teachers—but also accounts for student–teacher discrepancy to avoid over‑trusting confident yet unhelpful signals.


References

Keep Reading