Selecting Optimal Context Sentences for Event-Event Relation Extraction
Introduction
Understanding events entails recognizing the structural and temporal orders between event mentions to build event structures/graphs for input documents. To achieve this goal, our work addresses the problems of subevent relation extraction (SRE) and temporal event relation extraction (TRE), which aim to predict subevent and temporal relations between two given event mentions/triggers in text (event-event relation extraction problems, EERE).
Recent state-of-the-art methods for these problems employ transformer-based language models (e.g., BERT (Devlin et al. 2019)) to induce effective contextual representations for input event mention pairs. However, a major limitation of existing transformer-based models for SRE and TRE is that they can only encode input texts of limited length (i.e., up to 512 sub-tokens in BERT), and are thus unable to effectively capture important context sentences that appear farther away in the documents.
Therefore, in this work, we introduce a novel method to better model document-level context with important context sentences for event-event relation extraction. Our method seeks to identify the most important context sentences for a given event mention pair in a document and pack them into a shorter document that can be consumed entirely by transformer-based language models for representation learning.
Previous works addressing the length limit of transformer-based models
Self-Attention Architecture Modification (Zaheer et al. 2020; Beltagy, Peters, and Cohan 2020; Kitaev, Kaiser, and Levskaya 2020): Replace the vanilla self-attention of transformer networks with variant architectures, e.g., sparse self-attention (Zaheer et al. 2020), that allow modeling of larger document context while maintaining the same complexity as the original transformer.
Hierarchical Designs (Adhikari et al. 2019; Jörke et al. 2020): The standard transformer-based language models are still leveraged to encode input texts within certain length limits. For longer input documents, another network architecture is introduced on top to facilitate representation induction.
Drawbacks:
- Self-Attention Architecture Modification methods still suffer from a constraint on input length and generally lead to poorer performance on NLP tasks than vanilla transformers (Beltagy, Peters, and Cohan 2020).
- In Hierarchical Designs, the self-attention mechanism of the transformer-based language models cannot consume the entire input document, so its ability to capture long-range context dependencies across the whole document is not fully exploited.
- In both lines of work, the transformer-based models are typically used to encode a consecutive sequence of sentences in an input document, without considering the potential contribution of each individual sentence to the prediction task of interest.
Proposed approach
To address these issues, we propose to design models that can learn to select important context sentences for EERE to improve representation learning with BERT.
- Sentence context selection helps compress an input document into a shorter one (keeping only the important context) that can fit entirely within the length limit of transformer-based models, thus better leveraging their representation learning capacity (see the sketch after this list).
- Important sentences at arbitrary distances in the document can also be reached by this selection process to provide effective information for the predictions in EERE.
- Context selection can avoid irrelevant sentences in the inputs for transformer-based models to reduce noise in the induced representations for EERE.
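As a concrete illustration of this compression step, below is a minimal Python sketch. The `pack_document` helper, the 512-token budget handling, and the use of the HuggingFace `transformers` tokenizer are our own illustrative assumptions, not the exact implementation in the paper:

```python
from transformers import BertTokenizerFast

# Hypothetical helper: pack the host sentences and the selected context sentences
# into one input that respects BERT's 512-sub-token limit.
def pack_document(host_sentences, selected_sentences, tokenizer, max_len=512):
    # Keep the original document order of the chosen sentences; each item is (position, text).
    sentences = sorted(set(host_sentences + selected_sentences), key=lambda s: s[0])
    packed, n_tokens = [], 2  # reserve room for [CLS] and [SEP]
    for _, text in sentences:
        n = len(tokenizer.tokenize(text))
        if n_tokens + n > max_len:  # stop once the token budget is exhausted
            break
        packed.append(text)
        n_tokens += n
    return tokenizer(" ".join(packed), truncation=True,
                     max_length=max_len, return_tensors="pt")

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
hosts = [(0, "The earthquake struck at dawn."), (7, "Rescue operations began hours later.")]
context = [(3, "Aftershocks were reported throughout the morning.")]
inputs = pack_document(hosts, context, tokenizer)
print(inputs["input_ids"].shape)
```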
In particular, starting with the host sentences of the two event mentions of interest, we perform the sentence selection sequentially. The total length of the selected sentences is constrained to not exceed the input limit of transformer-based models, allowing the models to consume and encode the selected context in its entirety. To train this model, we leverage the policy-gradient method REINFORCE (Williams 1992), guided by three rewards:
- The performance of BERT-based models on the EERE tasks
- The contextual similarity between sentences
- The background knowledge-based similarity between sentences
Notation
- D is the input document with N sentences
- S_i and S_j are the host sentences of e_1 and e_2 in D (respectively), with S_i being the earlier sentence, i.e., i ≤ j.
- C is the set of selected sentences.
Important Sentence Selection Model (Selector model)
The goal of this section is to select the most important context sentences C for the relation prediction between e_1 and e_2 in D. A sentence S_k ∈ D is considered to carry important context information for EERE if including S_k in the compressed document D' leads to improved performance of the prediction model over e_1 and e_2. Let S^context = D \ {S_i, S_j} denote the set of candidate context sentences (i.e., the sentences of D other than the two host sentences).
Our sentence selection model follows an iterative process in which one sentence from S^context is chosen at each time step and added to the sentence set C. In particular, C is empty at the beginning (step 0). At step t + 1 (t ≥ 0), given the t sentences selected in previous steps (i.e., C), we aim to choose the next sentence S_{k_{t+1}} from the set of non-selected sentences:
S_t^context = S^context \ C
To summarize the sentences selected in prior steps, we run a Long Short-Term Memory network (LSTM) over their representation vectors x_{k_1}, ..., x_{k_t}. The hidden vector h_t of the LSTM at step t serves as the summarization vector of the previously selected sentences. Afterward, the selection of S_{k_{t+1}} at step t + 1 is conditioned on the sentences selected in prior steps via their summarization vector h_t. In particular, for each non-selected sentence S_u ∈ S_t^context, a selection score sc_u^{t+1} is computed as a function of the representation vector x_u of S_u and the summarization vector h_t:
sc_u^{t+1} = sigmoid(G[x_u, h_t])
where G is a two-layer feed-forward network.
Finally, the sentence S_{u*} with the highest selection score, i.e., u* = argmax_u sc_u^{t+1}, will be considered for selection at this step.
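To make this selection procedure concrete, here is a minimal PyTorch sketch of the selector. The hidden sizes, the greedy token-budget stopping rule, and the class name SentenceSelector are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SentenceSelector(nn.Module):
    """Minimal sketch of the iterative sentence selector (sizes are illustrative)."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        # LSTM cell that summarizes the previously selected sentences into h_t
        self.lstm = nn.LSTMCell(dim, hidden)
        # Two-layer feed-forward network G that scores each candidate sentence
        self.scorer = nn.Sequential(
            nn.Linear(dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def score(self, x, h):
        # sc_u^{t+1} = sigmoid(G[x_u, h_t]) for every candidate sentence u
        inp = torch.cat([x, h.expand(x.size(0), -1)], dim=-1)
        return torch.sigmoid(self.scorer(inp)).squeeze(-1)

    def select(self, sent_vecs, sent_lens, budget=512):
        """Greedily pick candidate sentences until the token budget is exhausted."""
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        chosen, used = [], 0
        remaining = list(range(sent_vecs.size(0)))
        while remaining:
            scores = self.score(sent_vecs[remaining], h)
            k = remaining[int(scores.argmax())]
            if used + sent_lens[k] > budget:  # respect the transformer length limit
                break
            chosen.append(k)
            used += sent_lens[k]
            remaining.remove(k)
            # Update the summarization vector h_t with the newly selected sentence
            h, c = self.lstm(sent_vecs[k].unsqueeze(0), (h, c))
        return chosen

# Toy usage: six candidate sentences with 768-d representation vectors and token lengths.
selector = SentenceSelector()
print(selector.select(torch.randn(6, 768), [40, 80, 120, 60, 200, 90], budget=300))
```

During training, the argmax in select would typically be replaced by sampling from the selection scores so that REINFORCE can be applied, as discussed in the next section.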
Training Sentence Selection Model
We utilize the REINFORCE algorithm (Williams 1992), which can treat the prediction performance as the reward function R(C) for the selected sentence sequence C, to train the selection process for input documents. In addition, another benefit of REINFORCE is its flexibility, which facilitates the incorporation of different information sources from C to enrich the reward function R(C) and provide more training signals for the selection model. As such, for EERE problems, we propose the following information sources to compute the reward function R(C) for REINFORCE training:
- Performance-based Reward R_per(C): We compute this reward via the relation prediction performance of the prediction model for the event mentions e_1 and e_2 in D. To condition on the selected sentence sequence C, the prediction model is applied to the compressed short document D' = {S_i, S_j} ∪ C. As such, R_per(C) is set to 1 if the prediction model correctly predicts the relation between e_1 and e_2, and to 0 otherwise.
- Context-based Reward R_context(C): The motivation for this reward is that a sentence should be preferred for inclusion in C during the selection process if its contextual semantics is more similar to that of the event mentions e_1 and e_2 in the host sentences S_i and S_j (i.e., our target sentences).
- Knowledge-based Reward R_know(C): This reward has the same motivation as the context-based reward R_context(C), i.e., sentences similar to e_1 and e_2 should be promoted for selection in C due to their potential to involve related events with helpful information for event-event relation prediction. However, instead of relying on contextual semantics (i.e., representation vectors) to obtain similarity measures as in R_context(C), R_know(C) exploits external knowledge resources to retrieve semantic word representations for the similarity-based reward (i.e., knowledge-based semantics). A sketch of both similarity-based rewards is given after this list.
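The sketch below illustrates one way such similarity-based rewards can be computed. The use of average cosine similarity over sentence vectors is an illustrative assumption; in this view, R_context and R_know differ only in where the vectors come from (a contextual encoder such as BERT vs. an external knowledge resource):

```python
import torch
import torch.nn.functional as F

def similarity_reward(selected_vecs, host_vecs):
    """Average cosine similarity between each selected sentence and the host sentences.

    Can stand in for both R_context (contextual sentence vectors) and R_know
    (vectors built from an external knowledge resource); only the vector source changes.
    """
    if len(selected_vecs) == 0:
        return torch.tensor(0.0)
    sims = [F.cosine_similarity(v, h, dim=0) for v in selected_vecs for h in host_vecs]
    return torch.stack(sims).mean()

# Toy usage with random 768-d vectors standing in for sentence representations.
host = [torch.randn(768), torch.randn(768)]        # vectors for S_i and S_j
selected = [torch.randn(768) for _ in range(3)]    # vectors for the sentences in C
print(float(similarity_reward(selected, host)))
```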
The overall reward function R(C) to train our context selection module with REINFORCE for EERE is:
R(C) = α_per R_per(C) + α_context R_context(C) + α_know R_know(C)
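As a minimal sketch of how this combined reward can drive REINFORCE training (the weight values, the function name reinforce_loss, and the log-probability bookkeeping are illustrative assumptions, not the paper's exact setup):

```python
import torch

def reinforce_loss(log_probs, r_per, r_context, r_know,
                   a_per=1.0, a_context=0.5, a_know=0.5):
    """Policy-gradient (REINFORCE) loss for one selected sentence sequence C.

    log_probs: log-probabilities of the sentence sampled at each selection step
    (obtained, e.g., by sampling from a softmax over the selection scores during
    training instead of taking the argmax).
    """
    # Overall reward R(C) = a_per*R_per + a_context*R_context + a_know*R_know
    reward = a_per * r_per + a_context * r_context + a_know * r_know
    # REINFORCE: maximize E[R(C)]  ->  minimize -R(C) * sum_t log p(action_t)
    return -reward * torch.stack(log_probs).sum()

# Toy usage: three selection steps with dummy log-probabilities and reward components.
log_probs = [torch.log(torch.tensor(p)) for p in (0.6, 0.3, 0.5)]
loss = reinforce_loss(log_probs, r_per=1.0, r_context=0.42, r_know=0.37)
print(float(loss))
```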
Evaluation
We evaluate our model on four datasets: HiEve (Glavaš et al. 2014), MATRES (Ning, Wu, and Roth 2018), and the TDDMan and TDDAuto datasets from the TDDiscourse corpus (Naik, Breitfeller, and Rose 2019).
We find that our proposed method achieves significantly better performance than previous state-of-the-art methods on all four datasets.
Conclusion
We present a novel model for event-event relation extraction that learns to select the most important context sentences in a document and directly uses them to induce representation vectors with transformer-based language models. Relevant context sentences are selected sequentially in our model, conditioned on the summarization vector of the previously selected sentences in the sequence. We propose three novel reward functions to train our model with REINFORCE. Our extensive experiments show that the proposed model can select important context sentences that are far away from the given event mentions and achieves state-of-the-art performance for subevent and temporal event relation extraction. In the future, we plan to extend our proposed method to other related tasks in event structure understanding (e.g., joint event and event-event relation extraction).
Hieu Man Duc Trong (Research Resident)