Relation extraction using Joint Bootstrapping Machines

Paper: Joint Bootstrapping Machines for High Confidence Relation Extraction

Introduction:

This paper proposes a method for effective semi-supervised bootstrapping in the context of relation extraction. Deep learning techniques have shown promise in solving entity extraction and relation extraction tasks. However, these techniques require a large amount of hand-labeled data, thus making it an expensive and time-consuming proposition. Hence, semi-supervised bootstrapping techniques are used to simplify and automate extraction from the corpus.

Bootstrapping methods start off with an initial set of positive and negative seed instances(say, examples of entity pairs for the target relation). Occurrences of positive seed instances in the corpus are converted into extraction patterns that are used to extract new relationship entity pairs that augment the initial seed set. This process is continued iteratively.

Since this is an iterative process that builds upon previous extractors, false positives can have a highly detrimental effect. False positives added during a particular iteration can effect extractions from all subsequent iterations, thus leading to semantic drift.

This paper tries to solve this problem by proposing a new bootstrapping method that aims to reduce false positives by using rigorous confidence measures and using a combination of two types of seed sets.

Methodology:

For a target relation like ‘X acquired Y’, the goal is to extract all entity pairs in the corpus for which this relationship holds true. The algorithm requires an initial set of positive and negative seeds. Seeds can be entity pairs, templates, or both. A template consists of three vectors, with the first vector representing the text before the relation occurrence, the second vector representing the text between the entity pair participating in the relation, and the third vector representing the text after the relation occurrence. An instance refers to the combination of the entity pair and its template(context).

The first step is to extract a set of instances from the corpus. The extracted instances are compared to the seeds according to a similarity measure. Instances that pass the similarity threshold are collected and clustered. This process is repeated for three hops (i.e for the next iteration, all instances similar to a hop-1 instance are added).

Each such collected instance is added to the extracted set only if they pass a confidence assessment. For an instance, the confidence of a single extractor is defined as the product of the reliability of the cluster and the similarity of the instance to the cluster. The similarity of an instance to a cluster is defined as the maximum similarity between the instance and any member of the cluster. The reliability of the cluster is an instance independent score.

One of the factors that affect the reliability score of a cluster is the ratio of the number of instances in the cluster that matches negative seeds to the number of instances in the cluster that matches positive seeds. If the ratio is close to zero, it means false positive extractions are less likely compared to true positive extractions. Thus, a large likelihood of false positive extractions reduces the reliability score of the cluster.

All instances that pass the confidence assessment are added to the extracted set and the algorithm is repeated for k iterations.

The number of iterations and the similarity threshold hyperparameters have a tradeoff between them.

Why use both entity and template seeds?

In this algorithm, instance matching is disjunctive, i.e, an instance need only match either the entity pair seed or the template seed, thus having a higher hit rate in terms of matched instances. Using both entity pair seeds and template seeds can act as a confidence booster, with instances matching both entity pair seeds and template seeds getting a higher confidence score.