🚙 BECoTTA: Input-dependent Online Blending of Experts for Continual Test-time Adaptation [ICML2024]

1 Korea University, 2 UNC Chapel Hill, 3 KAIST

*Indicates Equal Contribution
Figure 1: Our BECoTTA achieves superior performance and parameter/memory efficiency against strong CTTA baselines on the CDS-hard scenario.

Continual Test-Time Adaptation (CTTA) requires a model to adapt efficiently to a continuous stream of unseen domains while retaining previously learned knowledge. Despite recent progress in CTTA, the forgetting-adaptation trade-off and efficiency remain unexplored. Moreover, current CTTA scenarios assume only disjoint domain shifts, even though real-world domains change seamlessly. To tackle these challenges, this paper proposes BECoTTA, an input-dependent yet efficient framework for CTTA. We propose Mixture-of-Domain Low-rank Experts (MoDE), which contains two core components: (i) Domain-Adaptive Routing, which helps selectively capture domain-adaptive knowledge with multiple domain routers, and (ii) Domain-Expert Synergy Loss, which maximizes the dependency between each domain and expert. We validate that our method outperforms existing approaches across multiple CTTA scenarios, including disjoint and gradual domain shifts, while requiring ∼98% fewer trainable parameters. We also provide analyses of our method, including the construction of experts, the effect of domain-adaptive experts, and visualizations.
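The "dependency between each domain and expert" can be made concrete with a small sketch. This page does not give the exact form of the Domain-Expert Synergy Loss, so the version below is only an assumption: it measures dependency as the mutual information between the domain and expert variables and returns its negative as a loss. The function name `synergy_loss` and the toy joint tables are hypothetical.

```python
import math

def synergy_loss(joint):
    """Negative mutual information between domains (rows) and experts (columns).

    joint[d][e] is an estimate of P(domain=d, expert=e); all entries sum to 1.
    Assumption: dependency is measured with mutual information. The exact
    loss used by BECoTTA is not specified on this page.
    """
    p_domain = [sum(row) for row in joint]          # marginal over experts
    p_expert = [sum(col) for col in zip(*joint)]    # marginal over domains
    mutual_info = 0.0
    for d, row in enumerate(joint):
        for e, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mutual_info += p * math.log(p / (p_domain[d] * p_expert[e]))
    return -mutual_info

# Experts chosen independently of the domain: no dependency, loss is zero.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Each domain always uses its own expert: maximal dependency for two domains.
specialized = [[0.5, 0.0], [0.0, 0.5]]

loss_independent = synergy_loss(independent)
loss_specialized = synergy_loss(specialized)
```

Under this assumption, minimizing the loss pushes each domain toward a distinct subset of experts, which matches the stated goal of maximizing domain-expert dependency.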

Motivation

Figure 2: Comparison of the TTA process with other SoTA models. We compare existing models and mark the modules activated during the CTTA process in yellow. CoTTA adopts a mean-teacher architecture and updates the entire model. TENT and EcoTTA update only a few parts of the model, but they achieve suboptimal performance and suffer from forgetting. In contrast, our BECoTTA updates only the MoDE layers, enabling efficient and rapid adaptation while preserving previous knowledge.

Method

Figure 3: Overview of BECoTTA. We propose a novel CTTA framework for dynamic real-world scenarios, including disjoint and gradual domain shifts. When the model receives a target-domain input, the Domain Discriminator (DD) first estimates a pseudo-domain label. Based on this pseudo label, the corresponding domain router sends the input to specific experts that hold domain-specific information; this routing is trained by minimizing our proposed Domain-Expert Synergy Loss. Finally, we obtain a domain-adaptive representation for downstream tasks at test time.
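The routing step in this overview can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `MoDELayer`, the top-k gating with a softmax over the selected logits, the residual connection, and the random initialization are all hypothetical choices, and the pseudo-domain label is passed in directly instead of being predicted by a Domain Discriminator.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class MoDELayer:
    """Hypothetical mixture of low-rank experts with one router per domain."""

    def __init__(self, dim, rank, num_experts, num_domains, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # One router per pseudo-domain, each mapping features to expert logits.
        self.routers = [rand_matrix(num_experts, dim, rng)
                        for _ in range(num_domains)]
        # Each expert is a low-rank pair: down (rank x dim) and up (dim x rank).
        self.experts = [(rand_matrix(rank, dim, rng), rand_matrix(dim, rank, rng))
                        for _ in range(num_experts)]

    def forward(self, x, domain):
        # The router for the estimated pseudo-domain scores every expert.
        logits = matvec(self.routers[domain], x)
        chosen = sorted(range(len(logits)), key=lambda i: logits[i],
                        reverse=True)[:self.top_k]
        weights = softmax([logits[i] for i in chosen])
        out = list(x)  # residual connection (assumption)
        for weight, index in zip(weights, chosen):
            down, up = self.experts[index]
            delta = matvec(up, matvec(down, x))  # low-rank expert update
            out = [o + weight * d for o, d in zip(out, delta)]
        return out, chosen, weights

layer = MoDELayer(dim=8, rank=2, num_experts=6, num_domains=3, top_k=2)
features = [0.5] * 8
adapted, chosen, weights = layer.forward(features, domain=1)
```

Because each expert holds only `2 * dim * rank` weights, updating the experts and routers alone touches far fewer parameters than updating a full backbone, which is the efficiency argument the overview makes.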

Main Results

Continual Disjoint Shifts (CDS) - Hard: Imbalanced Weather & Area Shifts

We devise a novel scenario encompassing imbalanced weather and area shifts. We present results both w/o WAD and w/ WAD for all baselines. We report S, M, and L versions of BECoTTA, which differ in the number of parameters.

Continual Disjoint Shifts (CDS) - Easy: Balanced Weather Shifts

We use the Cityscapes-to-ACDC benchmark, which contains balanced weather shifts across the target domains. For a fair comparison, we report the performance of our method both w/o WAD and w/ WAD. Parameter counts for DePT and VDP are not available because their official code has not been released.

Continual Gradual Shifts (CGS) Scenario

We construct a novel gradual-shift scenario from the CDS-Easy target domains. The input-dependent process of BECoTTA performs well in this blurry setting, ultimately showing a +13.4%p improvement over the source model.
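To illustrate what a gradual shift between domains can look like, the sketch below builds a stream of domain labels in which the probability of drawing from the next domain ramps up linearly. The exact construction used for the CGS scenario is not described on this page, so the linear ramp, the function name `gradual_stream`, and the domain ordering are assumptions; the four names are the ACDC weather conditions.

```python
import random

def gradual_stream(domains, steps_per_transition, seed=0):
    """Return a list of domain labels that drifts from one domain to the next.

    Assumption: within each transition, the chance of sampling the upcoming
    domain grows linearly from 0 toward 1. This is an illustrative schedule,
    not the benchmark's documented procedure.
    """
    rng = random.Random(seed)
    stream = []
    for current, upcoming in zip(domains, domains[1:]):
        for step in range(steps_per_transition):
            p_next = step / steps_per_transition
            stream.append(upcoming if rng.random() < p_next else current)
    stream.append(domains[-1])  # end fully inside the final domain
    return stream

acdc_domains = ["fog", "night", "rain", "snow"]
stream = gradual_stream(acdc_domains, steps_per_transition=20)
```

In a disjoint scenario the same stream would switch labels abruptly at fixed boundaries; here neighboring domains overlap, so a method that picks experts per input does not need to know where one domain ends.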

Zero-shot Domain Generalization Benchmark

We compare the zero-shot performance of our method with strong TTA methods on four unseen domains. Our method consistently outperforms these baselines, demonstrating its ability to generalize to unseen domains.

Analyses and Ablations

Expert Analysis

Left: We visualize the frequency of ten expert selections for each domain during CTTA. The frequency map shows which experts are co-selected across domains and which are isolated to a single domain. Right: We interpret the similarity between target domains by visualizing the assignment weights from each domain-adaptive router.
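A frequency map of this kind can be computed from a log of routing decisions. The sketch below is an assumption about the bookkeeping, not the authors' analysis code: the function name `selection_frequency`, the log format, and the toy entries are hypothetical.

```python
from collections import Counter, defaultdict

def selection_frequency(log, num_experts):
    """Turn (domain, selected_expert_ids) records into per-domain frequencies.

    Each row of the result sums to 1 and can be drawn as one row of a heatmap.
    """
    counts = defaultdict(Counter)
    for domain, experts in log:
        counts[domain].update(experts)
    freq = {}
    for domain, counter in counts.items():
        total = sum(counter.values())
        freq[domain] = [counter[e] / total for e in range(num_experts)]
    return freq

# Toy routing log: each record is a domain and the experts chosen for one input.
log = [
    ("fog", [0, 2]),
    ("fog", [0, 3]),
    ("night", [1, 2]),
    ("night", [1, 2]),
]
freq = selection_frequency(log, num_experts=4)
```

In this toy log, expert 2 is co-selected by both domains, while experts 0 and 3 appear only for "fog" and expert 1 only for "night".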

Detailed Analysis

We conduct an ablation study on the number of experts and k. Moreover, we analyze the relation between the pre-defined WAD and the target domains.