
SharpSeq: Empowering Continual Event Detection through Sharpness-Aware Sequential-task Learning

June 27, 2024

Motivations

State-of-the-art methodologies for continual event detection (CED) [1, 2, 3] are built upon memory-based techniques [4, 5, 6]. However, when employing such techniques in continual learning, managing the multiple objectives associated with previous and current tasks becomes crucial. Naively aggregating these objectives by simple summation overlooks the complicated trade-offs among them.

To tackle this problem, gradient-based frameworks for multi-objective optimization (MOO), which seek a solution on the Pareto front [7, 8, 9], have emerged as promising approaches. Despite their achievements, the effective application of such methods to continual NLP remains largely unexplored. Specifically, there are two main concerns: (i) the distribution of training data in later tasks is skewed, with current-task classes significantly more prevalent than replayed old-task classes; and (ii) existing MOO methods lack a clear criterion to determine whether a solution on the Pareto front is ideal for mitigating catastrophic forgetting, as well as a systematic approach to reach such a solution.

 

How we handle

Overview

Figure 1: Overview of SharpSeq’s workflow.

We propose SharpSeq, a novel approach that enables the effective use of multi-task learning frameworks with Sharpness-Aware Minimization (SAM) [10] for CED, tailored to the sequential emergence of tasks in continual learning.

  • Apply SAM exclusively to the objectives associated with the current task, while excluding it from the objectives of previously encountered tasks.
  • To address data imbalance, we use a generative model to learn the underlying distribution of event trigger representations for each event type, and thereby synthesize data that alleviates the imbalance between past-task and current-task data during replay (a minimal sketch follows below).
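
As a minimal sketch of this representation-generation idea (assuming stored trigger representations grouped per old event type; the helper names `fit_generators` and `synthesize` and the component count are illustrative, not the exact configuration used in the paper), one Gaussian mixture can be fit per event type and later sampled to rebalance replay batches:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_generators(feats_by_label, n_components=3):
    """Fit one GMM per old event type on its stored trigger representations."""
    generators = {}
    for label, feats in feats_by_label.items():   # feats: (n_samples, hidden_dim)
        k = min(n_components, len(feats))          # a GMM needs at least k samples
        generators[label] = GaussianMixture(n_components=k).fit(feats)
    return generators

def synthesize(generators, n_per_label):
    """Sample synthetic representations so old labels approach current-task frequency."""
    reps, labels = [], []
    for label, gmm in generators.items():
        samples, _ = gmm.sample(n_per_label)
        reps.append(samples)
        labels.extend([label] * n_per_label)
    return np.concatenate(reps), np.array(labels)
```

The synthetic representations are mixed into the replay data alongside the real stored examples, so every objective in the multi-objective step sees a more balanced label distribution.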

 

Algorithm

To reduce the chance that the adversarial perturbation from SAM disturbs the model on old tasks, we apply the sharpness-aware multi-task learning approach exclusively to the event detection loss $L_{ed}$ and the knowledge transfer loss $L_{kt}$ of the current task. Specifically, we modify each such objective $L_i$ with the worst-case loss perturbation in a neighborhood of the model parameters, assuming that $L_i$ is first-order differentiable with respect to the model parameters $\bm{\theta}$.
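
Concretely, following the standard SAM formulation [10] (the radius $\rho$ and the first-order approximation of the inner maximization below are the usual SAM ingredients, shown here as a sketch rather than the full derivation in the paper), each current-task objective becomes

```latex
\tilde{L}_i(\bm{\theta}) \;=\; \max_{\lVert \bm{\epsilon} \rVert_2 \le \rho} L_i(\bm{\theta} + \bm{\epsilon}),
\qquad
\bm{\epsilon}^{\star} \approx \rho \,\frac{\nabla_{\bm{\theta}} L_i(\bm{\theta})}{\lVert \nabla_{\bm{\theta}} L_i(\bm{\theta}) \rVert_2},
```

so the gradient used for $L_i \in \{L_{ed}, L_{kt}\}$ is evaluated at $\bm{\theta} + \bm{\epsilon}^{\star}$, while the objectives of previous tasks keep their ordinary gradients at $\bm{\theta}$.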

Algorithm 1: Sequential Sharpness Minimization for Continual Event Detection.
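
The sketch below gives a rough, PyTorch-style illustration of this sequential scheme; the helper names (`current_task_losses`, `old_task_losses`, `aggregate_gradients`) and the radius `rho` are hypothetical placeholders, not the authors' reference implementation. SAM gradients are computed only for the current-task objectives, ordinary gradients for the replayed old-task objectives, and all per-objective gradients are then handed to a MOO combiner.

```python
import torch

def sam_gradients(loss_fn, params, rho=0.05):
    """Gradient of the SAM-perturbed loss max_{||eps|| <= rho} L(theta + eps),
    using the usual first-order approximation of the inner maximization."""
    grads = torch.autograd.grad(loss_fn(), params)
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():                      # ascend to the worst-case point
        for p, e in zip(params, eps):
            p.add_(e)
    sam_grads = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():                      # restore the original parameters
        for p, e in zip(params, eps):
            p.sub_(e)
    return sam_grads

def sequential_sharpness_step(model, current_task_losses, old_task_losses,
                              aggregate_gradients, optimizer, rho=0.05):
    """One update of the sequential sharpness-aware scheme (sketch):
    SAM gradients for current-task objectives (e.g. L_ed, L_kt),
    ordinary gradients for replayed old-task objectives."""
    params = [p for p in model.parameters() if p.requires_grad]
    per_objective_grads = []
    for loss_fn in current_task_losses:        # e.g. [L_ed, L_kt]
        per_objective_grads.append(sam_gradients(loss_fn, params, rho))
    for loss_fn in old_task_losses:            # replay / distillation objectives
        per_objective_grads.append(torch.autograd.grad(loss_fn(), params))
    combined = aggregate_gradients(per_objective_grads)   # any MOO combiner
    for p, g in zip(params, combined):
        p.grad = g
    optimizer.step()
    optimizer.zero_grad()
```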

 

How we evaluate

We conduct experiments on two English datasets, ACE 2005 [11] and MAVEN [12], both preprocessed following Yu et al. [2]. We train models for 30 epochs with early stopping enabled after 5 epochs, and report results averaged over 5 random seeds. We evaluate three versions of our method: SharpSeq, SharpSeq-G, and SharpSeq-G-A. SharpSeq-G is SharpSeq without Representation Generation (RG). SharpSeq-G-A is the variant of SharpSeq-G that uses both the current-task and old-task losses for sharpness-aware minimization, as in Phan et al. [9].

Table 1: Classification F1 scores (%) on the MAVEN and ACE 2005 datasets.

Table 1 shows that SharpSeq-G-A achieved significant improvements in F1 score across most tasks on both datasets, outperforming the other baselines. Notably, compared to EMP, the final F1 score of SharpSeq-G-A increased by 4.15% on MAVEN and 4.56% on ACE, demonstrating the effectiveness of seeking a flat minimum in continual learning. Moreover, the consistent superiority of SharpSeq-G over SharpSeq-G-A highlights the efficiency and necessity of our sharpness-aware continual learning paradigm.

Furthermore, Representation Generation (RG) raised the F1 score of SharpSeq-G from 59.11% to 60.27% after the fifth task of MAVEN, and from 56.85% to 62.60% after the fifth task of ACE. These findings are concrete evidence of the effectiveness of our methods: RG synthesizes old-label data to rebalance the training set, which benefits the multi-objective optimization algorithm, while our optimization framework, tailored for continual learning, reaches a minimizer in a flat region and alleviates the noise introduced by SAM's adversarial perturbation. Together, these components enable our methods to outperform current state-of-the-art approaches in continual event detection.

We further explore the impact of the choice of generative model and multi-objective optimization method. The results are presented in Tables 2 and 3.

Table 2: Ablation results for generation methods. “SS” abbreviates SharpSeq.

From Table 2, we can see that all generation methods resulted in better performance than the baseline model KT. When combined with SharpSeq, all of them gained further significant improvements across most tasks. Considering the generation method in isolation, GMMs achieved the best performance, improving KT by 3.83% and 3.19% on MAVEN and ACE, respectively, after the fifth task. These findings show that the choice of generation method for SharpSeq is consequential and must be made carefully:

GMMs’ learning process can be viewed as soft clustering, which makes them excel at preserving the inherent separability among the latent trigger representations of different labels. Conversely, VAEs are trained so that their encoders map the data into a continuous latent probabilistic space, which allows smooth interpolation during reconstruction. This is a strength of VAEs when generating continuous-in-nature data such as images and sound. In our setting, however, mapping the latent trigger representations into yet another latent space can cause unnecessary information loss. As such, the expected benefit of VAEs, namely smooth interpolation between the two latent spaces, brings little gain to the replay process of our continual event detection model.

Table 3: Ablation results for MOO methods. “SS” abbreviates SharpSeq.

Table 3 shows that directly applying MOO methods to KT without any adjustment can even degrade performance; for instance, Nash-MTL worsened KT’s performance by 2.47% and 4.13% on MAVEN and ACE, respectively. The main reasons for these drops are the neglect of training data imbalance and the inherent differences between multi-task learning and continual learning. When the MOO methods were combined with SharpSeq, their performance improved markedly.
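
For intuition about what these gradient-based MOO methods do, the snippet below shows one of the simplest combiners, the closed-form two-objective min-norm rule of MGDA [7]; it is only an illustrative stand-in operating on hypothetical flattened gradient vectors `g1` and `g2`, not Nash-MTL or the exact combiners ablated in Table 3:

```python
import torch

def min_norm_combine(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Two-objective MGDA step: pick gamma in [0, 1] minimizing
    ||gamma * g1 + (1 - gamma) * g2||^2, the min-norm point of the convex hull."""
    diff = g1 - g2
    gamma = torch.clamp(torch.dot(g2 - g1, g2) / (torch.dot(diff, diff) + 1e-12),
                        0.0, 1.0)
    return gamma * g1 + (1.0 - gamma) * g2
```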

Why it matters

We introduce SharpSeq, a novel framework that enables the seamless integration of state-of-the-art gradient-based multi-objective optimization methods into continual event detection systems. By addressing the challenges of imbalanced training data and the unique nature of continual learning, our method significantly enhances continual event detection performance. Through rigorous empirical benchmarks, we demonstrate the effectiveness and versatility of our contributions, which extend beyond continual event detection and showcase the potential of multi-objective optimization for continual learning problems in other domains. This work sets a solid foundation and paves the way for future research in this exciting and rapidly evolving field.

 

Read our SharpSeq paper (Proceedings of NAACL 2024) at:

https://aclanthology.org/2024.naacl-long.200

References

[1] Pengfei Cao, Yubo Chen, Jun Zhao, and Taifeng Wang. 2020. Incremental event detection via knowledge consolidation networks. In Proceedings of EMNLP 2020.

[2] Pengfei Yu, Heng Ji, and Prem Natarajan. 2021. Lifelong event detection with knowledge transfer. In Proceedings of EMNLP 2021.

[3] Minqian Liu, Shiyu Chang, and Lifu Huang. 2022. Incremental prompting: Episodic memory prompt for lifelong event detection. In Proceedings of COLING 2022.

[4] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. 2018. End-to-end incremental learning. In Proceedings of ECCV 2018.

[5] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In Proceedings of ECCV 2018.

[6] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. 2018. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.

[7] Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. In NeurIPS 2018.

[8] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. 2022. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017.

[9] Hoang Phan, Lam Tran, Ngoc N Tran, Nhat Ho, Dinh Phung, and Trung Le. 2022. Improving multi-task learning via seeking task-based flat regions. arXiv preprint arXiv:2211.13723.

[10] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. 2021. Sharpness-aware minimization for efficiently improving generalization. In ICLR 2021.

[11] Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium.

[12] Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. 2020. MAVEN: A massive general domain event detection dataset. arXiv preprint arXiv:2004.13590.


Thanh-Thien Le, Viet Dao, Linh Van Nguyen, Thi-Nhung Nguyen, Linh Ngo Van, Thien Huu Nguyen

