View on GitHub

NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Temporal Processing

Document Dating (Time-stamping)

Document Dating is the problem of automatically predicting the date of a document based on its content. Date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as, information retrieval, temporal reasoning, text summarization, event detection, and analysis of historical text, among others.

For example, in the following document, the correct creation year is 1999. This can be inferred by the presence of terms 1995 and Four years after.

Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. Four years after, the IOC….

Datasets

Datasets # Docs Start Year End Year
APW 675k 1995 2010
NYT 647k 1987 1996

Comparison on year level granularity:

  APW Dataset NYT Dataset Paper/Source
NeuralDater (Vashishth et. al, 2018) 64.1 58.9 Document Dating using Graph Convolution Networks
Chambers (2012) 52.5 42.3 Labeling Documents with Timestamps: Learning from their Time Expressions
BurstySimDater (Kotsakos et. al, 2014) 45.9 38.5 A Burstiness-aware Approach for Document Dating

Temporal Information Extraction

Temporal information extraction is the identification of chunks/tokens corresponding to temporal intervals, and the extraction and determination of the temporal relations between those. The entities extracted may be temporal expressions (timexes), eventualities (events), or auxiliary signals that support the interpretation of an entity or relation. Relations may be temporal links (tlinks), describing the order of events and times, or subordinate links (slinks) describing modality and other subordinative activity, or aspectual links (alinks) around the various influences aspectuality has on event structure.

The markup scheme used for temporal information extraction is well-described in the ISO-TimeML standard, and also on www.timeml.org.

<?xml version="1.0" ?>

<TimeML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://timeml.org/timeMLdocs/TimeML_1.2.1.xsd">
<TEXT>


 PRI20001020.2000.0127 
 NEWS STORY 
 <TIMEX3 tid="t0" type="TIME" value="2000-10-20T20:02:07.85">10/20/2000 20:02:07.85</TIMEX3> 


 The Navy has changed its account of the attack on the USS Cole in Yemen.
 Officials <TIMEX3 tid="t1" type="DATE" value="PRESENT_REF" temporalFunction="true" anchorTimeID="t0">now</TIMEX3> say the ship was hit <TIMEX3 tid="t2" type="DURATION" value="PT2H">nearly two hours </TIMEX3>after it had docked.
 Initially the Navy said the explosion occurred while several boats were helping
 the ship to tie up. The change raises new questions about how the attackers
 were able to get past the Navy security.


 <TIMEX3 tid="t3" type="TIME" value="2000-10-20T20:02:28.05">10/20/2000 20:02:28.05</TIMEX3> 



<TLINK timeID="t2" relatedToTime="t0" relType="BEFORE"/>
</TEXT>
</TimeML>

To avoid leaking knowledge about temporal structure, train, dev and test splits must be made at document level for temporal information extraction.

TimeBank

TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, and annotated for all asepcts of ISO-TimeML - including temporal expressions. TimeBank is freely distributed by the LDC: TimeBank 1.2

Evaluation is for both entity chunking and attribute annotation, as well as temporal relation accuracy, typically measured with F1 – although this metric is not sensitive to inconsistencies or free wins from interval logic induction over the whole set.

Model F1 score Paper / Source
Catena 0.511 CATENA: CAusal and TEmporal relation extraction from NAtural language texts
CAEVO 0.507 Dense Event Ordering with a Multi-Pass Architecture

TempEval-3

The TempEval-3 corpus accompanied the shared TempEval-3 SemEval task in 2013. This uses a timelines-based metric to assess temporal relation structure. The corpus is fresh and somewhat more varied than TimeBank, though markedly smaller. TempEval-3 data

Model Temporal awareness Paper / Source
Ning et al. 67.2 A Structured Learning Approach to Temporal Relation Extraction
ClearTK 30.98 Cleartk-timeml: A minimalist approach to tempeval 2013

Timex normalisation

Temporal expression normalisation is the grounding of a lexicalisation of a time to a calendar date or other formal temporal representation.

Example:

10/18/2000 21:01:00.65

Dozens of Palestinians were wounded in scattered clashes in the West Bank and Gaza Strip, Wednesday, despite the Sharm el-Sheikh truce accord.

Chuck Rich reports on entertainment every Saturday

TimeBank

TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, and annotated for all asepcts of ISO-TimeML - including temporal expressions. TimeBank is freely distributed by the LDC: TimeBank 1.2

Model F1 score Paper / Source
TIMEN 0.89 TIMEN: An Open Temporal Expression Normalisation Resource
HeidelTime 0.876 A baseline temporal tagger for all languages

PNT

The Parsing Time Normalizations corpus in SCATE format allows the representation of a wider variety of time expressions than previous approaches. This corpus was release with SemEval 2018 Task 6.

Model F1 score Paper / Source
Laparra et al. 2018 0.764 From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations
HeidelTime 0.74 A baseline temporal tagger for all languages
Chrono 0.70 Chrono at SemEval-2018 task 6: A system for normalizing temporal expressions

Go back to the README