1 Introduction

Live blogs are dynamic news articles providing rolling textual coverage of an ongoing event. One or multiple journalists continually post micro-updates about the event, which are displayed in chronological order. The updates span a wide variety of modalities and genres, including text, video, audio, images, social media excerpts, and external links. Over the last five years, live blogging has emerged as a very popular way to disseminate news, offered by many major news organizations, such as the BBC, The Guardian, or The New York Times.

Several different kinds of events are regularly covered by live blogs, including sports games, elections, ceremonies, protests, conflicts, and natural disasters. Thurman and Schapals (2017, p. 1) report a journalist’s view that “live blogs have transformed the way we think about news, our sourcing, and everything”. Besides their timeliness, live blogs differ from common news articles by utilizing more original sources and providing information in smaller chunks, often written in a different tone than traditional news writing (Thurman and Walters 2013).

Figure 1 shows an example of a live blog on the constitution of a new Brexit committee provided by The Guardian.Footnote 1 Live blogs typically consist of metadata, such as date, title, and authors, and a list of postings with the updated information. For larger events, journalists provide intermediate summaries shown at the top of the article. At the end of the coverage, a journalist usually aggregates the postings and, if available, the intermediate summaries to present the most important information about the event to the users as timelines, short texts, or bullet point lists. Figure 2 shows an excerpt of a completed live blog by the BBC, which consists of 360 postings (distributed over 19 pages) and a summary shown as four bullet point items.Footnote 2

Fig. 1: Live blog example from The Guardian (two newest postings visible)

Fig. 2: Archived live blog example from the BBC (three newest postings visible)

In this work, we propose to leverage these human-written summaries to investigate the novel task of automatic live blog summarization. To this end, we provide a new corpus construction approach for producing a dataset of live blogs, and we evaluate state-of-the-art summarization systems on this new summarization task. Our work has multiple direct applications in digital journalism and news research, since automatic summarization tools for live blogs help journalists save time during live blogging and enable instant updates of the intermediate summaries of a live event. However, the automatic live blog summarization task also comes with new challenges:

1. Unlike a news article, the postings of a live blog do not form one coherent piece of text. Instead, each posting introduces facts or opinions from a single source, which might be highly or only marginally related to the overarching topic. For example, the live blog in Fig. 2 contains a posting commenting on the relationship between Theresa May and Angela Merkel, which is related to the overall Brexit topic, but not to the Supreme Court case. Similarly, the live blog contains multiple topic shifts (e.g., focusing on the MPs’ opinions or the government appeal). This suggests that single-document summarizers cannot be used out of the box.

2. A particular challenge is that positional features cannot be used to estimate information importance, because live blogs are chronologically ordered and, unlike news articles, do not necessarily report the most important information first. Thus, baselines that extract the first few sentences and single-document summarization approaches relying extensively on the position of a sentence are not suitable for live blog summarization.

3. The postings of live blogs are very heterogeneous, covering multiple genres, modalities, and styles. They also differ in their length and, unlike the documents of most multi-document summarization datasets, they are hardly redundant. Automatic live blog summarization approaches therefore have to deal with heterogeneous data and identify novel ways of judging importance that are not solely based on the frequency signal.

In summary, live blog summarization is a special kind of multi-document summarization that faces highly heterogeneous, temporally ordered input. It is similar to update summarization, but has to deal with low redundancy and occasional topic shifts. Moreover, it is related to real-time summarization, where summaries must be created without yet having full information about the topic.

The remainder of this article is structured as follows: Sect. 2 discusses related work on live blog summarization and summarization corpora. In Sect. 3, we introduce our first contribution by suggesting a novel pipeline to collect and extract the human-written summaries and postings from online live blogs, which we make available as open-source software from our GitHub repository.Footnote 3 Section 4 provides a detailed analysis of the corpus we created from live blogs of two major news publishers, the BBC and The Guardian, using our pipeline. Sections 5 and 6 describe our second contribution, as we propose the new task of live blog summarization and benchmark our corpus with multiple commonly used summarization methods. While we find that live blog summarization is a challenging task, our work aims at stimulating further research in this area, for which we provide both reference data and benchmark results. Our results show that off-the-shelf summarization systems are not effective for live blog summarization, as they do not properly take into account the large number of heterogeneous postings of a live blog. Section 7 concludes our work and points to multiple directions for future research.

2 Related work

In this section, we discuss previous work on summarization corpora and automatic summarization methods as well as journalistic NLP applications related to the task of live blog summarization.

2.1 Summarization corpora

The most widely used summarization corpora have been published in the Document Understanding ConferenceFootnote 4 (DUC) series. In total, there are 139 document clusters with 376 human-written reference summaries across DUC 2001, 2002, and 2004. Although the research community has often used these corpora, their limited size prevents training advanced methods, such as encoder–decoder architectures, and it is time-consuming and labor-intensive to extend such corpora with large numbers of manually written summaries.

Large datasets exist particularly for single-document summarization tasks, including the ACL Anthology Reference Corpus (Bird et al. 2008) and the CNN/Daily Mail dataset (Hermann et al. 2015). The latter contains 312k pairs of online news articles and multi-sentence summaries and has been used for neural summarization approaches (Nallapati et al. 2016; See et al. 2017). However, each summary in this dataset is paired with only one source document, whereas live blogs have a larger number of postings (typically more than 100) that act like individual small documents.

Another recent work uses social media posts on Twitter to create large-scale multi-document summaries for news: Cao et al. (2016) use hashtags to cluster the tweets on the same topic, and they assume the tweet’s content to be a reference summary for the document linked by the tweet. Their corpus consists of 204 document clusters with 1114 documents and 4658 reference tweets. Lloret and Palomar (2013) create a similar corpus of English and Spanish news documents and corresponding tweets linking to them.

Other multi-document summarization datasets focus on heterogeneous sources: Zopf et al. (2016) and Zopf (2018) use Wikipedia articles as reference summaries and automatically search for potential source documents on the web. Benikova et al. (2016) propose an expert-based annotation setup for creating a summarization corpus for highly heterogeneous text genres from the educational domain. In a similar line of research, Tauchmann et al. (2018) use a combination of crowdsourcing and expert annotation to create hierarchical summaries for a heterogeneous web crawl. Giannakopoulos et al. (2015) discuss multilingual summarization corpora, and Li et al. (2017) introduce a corpus of reader-aware multi-document summaries, which jointly aggregate news documents and reader comments.

2.2 Automatic summarization

2.2.1 Extractive summarization

Until recently (Yao et al. 2017), the vast majority of research focused on extractive summarization, which outputs a selection of important sentences or phrases available in the input sources (Ko and Seo 2008; Nenkova and McKeown 2012). By selecting already grammatical elements, extractive summarization reduces to a combinatorial optimization problem (McDonald 2007). To solve such combinatorial problems, summarization systems have leveraged powerful techniques like Integer Linear Programming (ILP) or submodular maximization.

In order to score sentences and phrases, Luhn (1958) initially introduced the simple but influential idea that sentences containing the most important words are most likely to embody the original document. This hypothesis was experimentally supported by Nenkova et al. (2006), who showed that humans tend to use words appearing frequently in the sources to produce their summaries. Many subsequent works exploited and refined this strategy, for instance by computing TF\(\cdot \)IDF (Sparck Jones 1972) or the likelihood ratio (Dunning 1993).

Words serve as a proxy to represent the topics discussed in the sources. However, different words with a similar meaning may refer to the same topic and should not be counted separately. This observation gave rise to a set of important techniques based on topic models (Allahyari et al. 2017). These approaches can be divided into sentence clustering (Radev et al. 2000), Latent Semantic Analysis (Deerwester et al. 1990; Gong and Liu 2001), and Bayesian topic models (Blei et al. 2003).

Graph-based methods form another powerful class of approaches which combine repetitions at the word and at the sentence level. They were developed to estimate sentence importance based on word and sentence similarities (Mani and Bloedorn 1997, 1999; Mihalcea and Tarau 2004). One of the most prominent examples is LexRank (Erkan and Radev 2004), which we run on our dataset in Sect. 6.

More generally, many indicators for sentence importance were proposed, and therefore the idea of combining them into stronger indicators emerged (Aone et al. 1995). Kupiec et al. (1995) suggested that statistical analysis of summarization corpora would reveal the best combination of features. For example, the frequency computation of words or n-grams can be replaced with learned weights (Hong and Nenkova 2014; Li et al. 2013). Additionally, structured output learning permits scoring smaller units while providing supervision at the summary level (Li et al. 2009; Peyrard and Eckle-Kohler 2017).

A variety of works proposed to learn importance scores for sentences (Yin and Pei 2015; Cao et al. 2015). This sparked a large body of research comparing different learning algorithms, features, and training data (Hakkani-Tur and Tur 2007; Hovy and Lin 1999; Wong et al. 2008). Nowadays, sequence-to-sequence methods are usually employed (Nallapati et al. 2017; Kedzie et al. 2018). These approaches are presented in Sect. 5 and tested on live blog summarization in Sect. 6.

2.2.2 Abstractive summarization

In contrast to extractive summarization, abstractive summarization aims to produce new and original texts (Khan et al. 2016) either from scratch (Rush et al. 2015; Chopra et al. 2016), by fusion of extracted parts (Barzilay and McKeown 2005; Filippova 2010), or by combining and compressing sentences from the input documents (Knight and Marcu 2000; Radev et al. 2002). Intuitively, abstractive systems have more degrees of freedom. Indeed, careful word choices, reformulation and generalization should allow condensing more information in the final summary.

Recently, end-to-end training based on the encoder–decoder framework with long short-term memory (LSTM) networks has achieved huge success in sequence transduction tasks like machine translation (Sutskever et al. 2014). For abstractive summarization, large single-document summarization datasets rendered the application of such techniques possible. For instance, Rush et al. (2015) introduced a sequence-to-sequence model for abstractive sentence summarization. Later, Chopra et al. (2016) and Nallapati et al. (2016) extended this work with attention mechanisms. Since words from the summary are often retained from the original source, copy mechanisms (Gu et al. 2016; Gulcehre et al. 2016) have been thoroughly investigated (Nallapati et al. 2016; See et al. 2017).

2.2.3 Update summarization

After the DUC series, the Text Analysis ConferenceFootnote 5 (TAC) series introduced the update summarization task (Dang and Owczarzak 2008). In this task, two summaries are produced for two sets of documents, and the summary of the second set is an update of the first set. While the importance of text to be included in the update summary depends primarily on the novelty of the information, the task usually covers only a single topic shift. In live blogs, however, there are multiple sub-topics, and the importance of the sub-topics changes over time.

2.2.4 Real-time summarization

Real-time summarization began at the Text REtrieval ConferenceFootnote 6 (TREC) 2016 and represents an amalgam of the microblog track and the temporal summarization track (Lin et al. 2016). In real-time summarization, the goal is to automatically monitor the stream of documents to keep a user up to date on topics of interest and to create email digests that summarize the events of the day for the user’s interest profile. The drawback of this task is its predefined evaluation time frame due to the real-time constraint, which makes developing systems and replicating results arduous. Note that live blog summarization is very similar to real-time summarization, as the real-time constraint also holds for live blogs if the summarization system is applied to the stream of postings. Moreover, the Guardian live blogs do contain updated real-time summaries, but exploiting them requires different real-time crawling strategies, which are out of the scope of this work.

2.2.5 Multi-tweet summarization

Tweets are 140-character short messages shared on Twitter, a microblogging website with a large number of users contributing and sharing content. Multi-tweet summarization allows users to quickly grasp the gist of a large number of tweets. For multi-tweet summarization, previous work employed graph-based approaches (Liu et al. 2012) similar to LexRank, Hybrid TF-IDF (Sharifi et al. 2010), which ranks tweets based on TF-IDF, and ILP (Cao et al. 2016; Liu et al. 2011) optimizing the coverage of information in the summary. Summarizing tweets is similar to live blog summarization, since the postings of live blogs are structured similarly to tweets, but typically use more formal language than on Twitter. The postings of live blogs are also more heterogeneous, as tweets can be part of a live blog along with many other types of postings, such as images, interviews, or reporting. In our work, we benchmark our live blog summarization corpus with similar approaches, including graph-based, TF-IDF, and ILP-based methods.

2.3 NLP and journalism

Leveraging natural language processing methods for journalism is an emerging research topic. The SciCAR conferencesFootnote 7 and the recent “Natural Language Processing meets Journalism” workshops (Birnbaum et al. 2016; Popescu and Strapparava 2017, 2018) are predominant examples for this development. Previous research focuses on news headline generation and click-bait analysis (Blom and Hansen 2015; Gatti et al. 2016; Szymanski et al. 2016), abusive language and comment moderation (Clarke and Grieve 2017; Kolhatkar and Taboada 2017; Pavlopoulos et al. 2017; Schmidt and Wiegand 2017), news bias and filter bubble analyses (Baumer et al. 2015; Bozdag and van den Hoven 2015; Fu et al. 2016; Kuang and Davison 2016; Potash et al. 2017), as well as news verification and fake news detection (Brandtzaeg et al. 2015; Thorne et al. 2017; Bourgonje et al. 2017; Hanselowski et al. 2018; Thorne et al. 2018). We are not aware of any work on live blog summarization or computational approaches closely related to journalistic live blogging.

Live blogs as such have been previously discussed in the domain of digital journalism. Thorsen (2013) gives a general introduction about challenges and opportunities of live blogging. Thurman and Walters (2013) and Thurman and Newman (2014) study the production processes and the readers’ consumption behavior, Thurman and Schapals (2017) evaluate aspects of transparency and objectivity, and Thorsen and Jackson (2018) analyze sourcing practices in live blogs. Further works discuss certain types of live blogs, such as live blogs on sport events (McEnnis 2016) or terrorist attacks (Wilczek and Blangetti 2018). None of these works focuses on intermediate or final summaries in live blogs or computational approaches to assist the journalists.

3 Corpus construction pipeline

In this section, we describe the three steps to construct our live blog summarization corpus: (1) live blog crawling, yielding a list of URLs, (2) content parsing and processing, where the documents and corresponding summaries with the metadata are extracted from the URLs and stored in a JSON format, and (3) live blog pruning as a final step for creating a high-quality gold standard live blog summarization corpus.

3.1 Live blog crawling

A frequently updated index webpageFootnote 8 references all archived live blogs of the Guardian. We take a snapshot of this page, yielding 16,246 unique live blog URLs. In contrast, the BBC website has no such live blog archive. Thus, we use an iterative approach similar to BootCaT (Baroni and Bernardini 2004) to bootstrap our corpus.

Algorithm 1 shows pseudo code for our iterative crawling approach, which is based on a small set of live blog URLs \(L_0\) shown in Table 1. From these live blogs, we extract a set of seed terms \(K_0\) using the 500 terms with the highest TF\(\cdot \)IDF scores. Table 2 shows \(K_0\) for our corpus. The iterative procedure uses the seed terms \(K_0\) to gather new live blog URLs by issuing automated Bing queriesFootnote 9 created using recurring URL patterns P for live blogs (line 7). We collect all valid links returned by the Bing search (line 8) and extract new key terms \(K_t\) from each crawled live blog (line 12). Similar to the seed terms, we define \(K_t\) as the top 500 terms sorted by TF\(\cdot \)IDF. The new key terms are then used to generate the Bing queries in the subsequent iterations (line 7). The process is repeated until no new live blogs are discovered anymore (line 9). For our corpus, we use the pattern

$$ \texttt {site:http://www.bbc.com/news/live/}<\textit{key term}>$$

where <key term> is one of the extracted key terms \(K_{t-1}\) from the previous iteration (or the seed terms if \(t = 1\)).
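
To make the iterative procedure more concrete, the following is a minimal Python sketch of the crawling loop. The helpers bing_search (wrapping the Bing Web Search API) and extract_key_terms (returning the top TF·IDF terms of a crawled live blog) are hypothetical placeholders for the corresponding pipeline components.

```python
from typing import Iterable, List, Set

def bing_search(query: str) -> List[str]:
    """Placeholder for a Bing Web Search API call; returns result URLs."""
    raise NotImplementedError

def extract_key_terms(url: str, top_k: int = 500) -> List[str]:
    """Placeholder: fetch a live blog and return its top-k TF-IDF terms."""
    raise NotImplementedError

def crawl_live_blogs(seed_urls: Iterable[str], seed_terms: Iterable[str],
                     url_pattern: str, top_k: int = 500) -> Set[str]:
    """Iteratively bootstrap live blog URLs from seed URLs and seed terms."""
    live_blogs = set(seed_urls)     # L_0
    key_terms = set(seed_terms)     # K_0
    while key_terms:
        # Issue one site-restricted query per key term.
        new_urls = set()
        for term in key_terms:
            query = f"site:{url_pattern} {term}"
            new_urls |= {u for u in bing_search(query) if u not in live_blogs}
        if not new_urls:            # stop once no new live blogs are discovered
            break
        live_blogs |= new_urls
        # Extract new key terms K_t from the newly crawled live blogs.
        key_terms = set()
        for url in new_urls:
            key_terms |= set(extract_key_terms(url, top_k))
    return live_blogs
```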

Table 1 Initial BBC live blogs links used to extract seed terms
Table 2 Sample seed terms extracted from the initial ten BBC live blogs

Using the proposed algorithm, we run 4000 search queries, each returning around 1000 results on average, from which we collect 9931 unique URLs. Although our method collects a majority of the live blogs within these 4000 search queries, a more sophisticated key term selection could minimize the number of search queries and maximize the number of unique URLs. An important point to note is that the collected BBC live blog URLs predominantly cover more recent years. This skew could be due to the Bing Search API preferring recent articles for the first 100 results.

By choosing a different set of seed URLs \(L_0\) or seed terms \(K_0\) and different URL patterns P, our methodology can be applied to other news websites featuring live blogs, such as The New York Times, the Washington Post or the German Spiegel.

3.2 Content parsing and processing

Once the URLs are retrieved, we fetch the HTML content, remove the boiler-plate using the BeautifulSoupFootnote 10 parser, and store the cleaned data in a JSON file. During this step, unreachable URLs are filtered out. We discard live blogs for which we could not retrieve the summary or correctly parse the postings.

We parse metadata, such as URL, author, date, genre, summaries, and all postings for each live blog using site-specific regular expressions on the HTML source files. The automatic extraction is generally difficult, as the markup structure may change over time. For BBC live blogs, both the postings and the bullet-point summaries follow a consistent pattern that we can easily extract automatically. For the Guardian, we identify several recurring patterns which cover most of the live blogs. The Guardian has provided live blogs since 2001, but they were in an experimental phase until 2008. Due to the lack of a specific structure or a summary during this experimental phase, we had to remove about 10k of the crawled live blogs, for which we could not automatically identify the postings or the summary. After 2008, however, the live blogs show a consistent structure, as they received a prominent place on the website. After this step, 7307 live blogs remain for the BBC and 6450 for the Guardian.
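
To illustrate this step, the sketch below fetches a page with requests, strips boiler-plate with BeautifulSoup, and stores the result as JSON. The CSS selectors are hypothetical; the actual pipeline relies on site-specific patterns for the BBC and Guardian markup, and the released JSON schema may differ.

```python
import json
import requests
from bs4 import BeautifulSoup

def parse_live_blog(url: str) -> dict:
    """Fetch a live blog page and extract metadata, summary, and postings."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Strip obvious boiler-plate elements before extracting text.
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()

    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        # Hypothetical selectors; real live blogs need site-specific patterns.
        "summary": [li.get_text(" ", strip=True)
                    for li in soup.select(".summary li")],
        "documents": [p.get_text(" ", strip=True)
                      for p in soup.select(".live-posting")],
    }

def store_as_json(live_blog: dict, path: str) -> None:
    """Store one parsed live blog as JSON (the released schema may differ)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(live_blog, f, ensure_ascii=False, indent=2)
```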

3.3 Live blog pruning

To further clean the data, we remove live blogs covering multiple topics, as they can be quite noisy. For example, BBC provides some live blogs discussing all events happening in a certain region within a given time frame (e.g., Essex: Latest updates). We also prune live blogs about sport games and live chats, because their summaries are based on simple, easy-to-replicate templates.

We further prune live blogs based on their summaries. We first remove a sentence of a summary if it has fewer than three words. Then, we discard live blogs whose summaries have fewer than three sentences. This ensures the quality of the corpus, since overly short summaries would yield a different summarization goal, similar to headline generation, and they are typically an indicator of a non-standard live blog layout in which the summary has been split across multiple parts of the website.
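
A minimal sketch of this summary-based pruning, assuming a parsed live blog dictionary whose summary field holds a list of sentences (as in the parsing sketch above):

```python
from typing import Dict, List

def prune_by_summary(live_blog: Dict[str, List[str]],
                     min_words: int = 3, min_sentences: int = 3) -> bool:
    """Return True if the live blog passes the summary-based pruning."""
    # Remove summary sentences with fewer than three words.
    summary = [s for s in live_blog["summary"] if len(s.split()) >= min_words]
    live_blog["summary"] = summary
    # Keep the live blog only if at least three summary sentences remain.
    return len(summary) >= min_sentences
```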

After the whole pruning step, 762 live blogs remain for BBC and 1683 for the Guardian. Overall, about 10% of the initial set of live blogs remain after our selective pruning, both for BBC and the Guardian. This ensures high-quality summaries for the live blogs. Although the pruning rejects 90% of the live blogs, the live blog corpus is still 20–30 times larger than the classical corpora released for the DUC, TREC, and TAC tasks (Table 3).

Table 3 Number of live blogs for BBC and the Guardian after each step of our pipeline

3.4 Code repository

We publish our tools for reconstructing the live blog corpus as open-source software under the Apache License 2.0 on GitHub.Footnote 11 This repository helps to replicate our results and advance research in live blog summarization.

The repository consists of (a) raw and pruned URL lists, (b) tools for crawling live blogs, (c) tools for parsing the content of the URLs and transforming the results into JSON, and (d) code for computing benchmark results and corpus statistics.

4 Corpus analysis

Our final corpus is a multi-document summarization corpus, in which the individual topics correspond to the crawled live blogs and the set of documents per topic corresponds to the postings of the live blog. We compute several statistics about our corpus and report them in Table 4. The number of postings per live blog is around 95 for BBC and 56 for the Guardian. In comparison, standard multi-document summarization datasets like DUC 2004Footnote 12 and TAC 2008AFootnote 13 have only 10 documents per topic. Furthermore, we observe that the postings are quite short, as there is an average of 62 words per posting for BBC and 108 for the Guardian. The summaries are also shorter than the summaries of standard datasets: the summaries of DUC 2004 and TAC 2008A are expected to contain 100 words. However, our final corpus is larger overall, because it contains 2655 live blogs (i.e., topics) and 186,999 postings (i.e., documents). With that many data points, machine learning approaches become readily applicable.

Table 4 Corpus statistics for BBC and the Guardian live blogs

4.1 Domain distribution

The live blogs in our corpus cover a wide range of subjects from multiple domains. In Table 5, we report the distribution across all domains in the final corpus (BBC and Guardian combined). While politics, business, and news are the most prominent domains, there are also several other well-represented domains, such as local and international events or culture.

Table 5 Domain distribution of our final corpus

4.2 Heterogeneity

The resulting corpus is expected to exhibit various levels of heterogeneity. Indeed, it contains live blogs with mixed writing styles (short and to-the-point vs. longer descriptive postings, informal language, quotations, encyclopedic background information, opinionated discussions, etc.). Furthermore, live blogs are subject to topic shifts, which can be observed through changes in word usage.

To measure this textual heterogeneity, we use information-theoretic metrics on word probability distributions, as was done before to analyze the heterogeneity of summarization corpora (Zopf et al. 2016). Based on the Jensen-Shannon (JS) divergence, Zopf et al. (2016) define a measure of textual heterogeneity \(\textit{TH}\) for a topic T composed of documents \(d_1, \ldots , d_n\) as

$$\begin{aligned} \textit{TH}_{JS}(T) = \frac{1}{n} \sum _{d_i \in T} JS(P_{d_i},P_{T\setminus d_i}) \end{aligned}$$
(1)

Here, \(P_{d_i}\) is the frequency distribution of words in document \(d_i\), and \(P_{T \setminus d_i}\) is the frequency distribution of words in all other documents of the topic except \(d_i\). The final quantity \(\textit{TH}_{JS}\) is the average divergence of each document from all the others and therefore provides a measure of diversity among the documents of a given topic.
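
The following Python sketch computes Eq. (1) on tokenized documents. The base-2 logarithm and the unsmoothed frequency estimates are assumptions of this illustration rather than details prescribed by Zopf et al. (2016).

```python
import math
from collections import Counter
from typing import List

def js_divergence(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence between two word frequency distributions."""
    p_sum, q_sum = sum(p.values()), sum(q.values())
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0) / p_sum + q.get(w, 0) / q_sum) for w in vocab}

    def kl(dist: Counter, total: int) -> float:
        return sum((dist[w] / total) * math.log2((dist[w] / total) / m[w])
                   for w in dist if dist[w] > 0)

    return 0.5 * kl(p, p_sum) + 0.5 * kl(q, q_sum)

def textual_heterogeneity(topic_docs: List[List[str]]) -> float:
    """TH_JS(T): average divergence of each document from the rest of its topic."""
    scores = []
    for i, doc in enumerate(topic_docs):
        rest = Counter(w for j, d in enumerate(topic_docs) if j != i for w in d)
        scores.append(js_divergence(Counter(doc), rest))
    return sum(scores) / len(topic_docs)
```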

We report the results in Table 6. To put the numbers in perspective, we also report the textual heterogeneity of the two standard multi-document summarization corpora DUC 2004 and TAC 2008A. The heterogeneity of the BBC and the Guardian live blogs is similar. However, the heterogeneity of our corpus is much higher than that of DUC 2004 and TAC 2008A, indicating that our corpus contains more lexical variation inside its topics.

Table 6 Average textual heterogeneity of our corpora compared to standard datasets

4.3 Compression ratio

Additional factors which determine the difficulty of the summarization task are the lengths of the source documents and the summary (Nenkova and Louis 2008). The input documents of the BBC and the Guardian contain on average 5890 and 6048 words, whereas the summaries contain only around 59 and 42 words, respectively. In contrast, typical multi-document DUC datasets have a much lower compression ratio, since their input documents have on average only 700 words, while the summaries have 100 words. Thus, we expect that the high compression ratio makes live blog summarization even more challenging.

5 Automatic summarization methods

To automatically summarize live blogs, we employ methods that have been successfully used for both single and multi-document summarization. Some variants of them have also been applied to update summarization tasks.

5.1 Unsupervised methods

5.1.1 TF·IDF

Luhn (1958) scores sentences with the term frequency and the inverse document frequency (TF\(\cdot \)IDF) of the words they contain. The best sentences are then greedily extracted.

5.1.2 LexRank

Erkan and Radev (2004) construct a similarity graph G(V, E) with the set of sentences V and edges \(e_{ij} \in E\) between two sentences \(v_i\) and \(v_j\) if and only if the cosine similarity between them is above a given threshold. Sentences are then scored according to their PageRank in G.
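
A condensed sketch of this idea, using scikit-learn for TF·IDF vectors and cosine similarities and networkx for PageRank; the similarity threshold below is an illustrative value, not a tuned setting.

```python
from typing import Dict, List
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_scores(sentences: List[str], threshold: float = 0.1) -> Dict[int, float]:
    """Score sentences by PageRank over a thresholded cosine-similarity graph."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] >= threshold:          # keep only sufficiently similar pairs
                graph.add_edge(i, j, weight=float(sim[i, j]))
    return nx.pagerank(graph, weight="weight")  # sentence index -> centrality score
```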

5.1.3 LSA

Steinberger and Jezek (2004) compute a dimensionality reduction of the term-document matrix via singular value decomposition (SVD). The extracted sentences should cover the most important latent topics.

5.1.4 KL-Greedy

Haghighi and Vanderwende (2009) minimize the Kullback-Leibler (KL) divergence between the word distributions of the summary and the documents.
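
A minimal greedy sketch of this criterion; the divergence direction (documents vs. candidate summary) and the add-one smoothing are simplifying assumptions of this illustration.

```python
import math
from collections import Counter
from typing import List

def smoothed_kl(p: Counter, q: Counter, vocab: set) -> float:
    """KL(p || q) with add-one smoothing over a shared vocabulary."""
    p_total = sum(p.values()) + len(vocab)
    q_total = sum(q.values()) + len(vocab)
    return sum(((p[w] + 1) / p_total)
               * math.log(((p[w] + 1) / p_total) / ((q[w] + 1) / q_total))
               for w in vocab)

def kl_greedy(sentences: List[str], budget: int) -> List[str]:
    """Greedily add the sentence that keeps KL(documents || summary) lowest."""
    doc_dist = Counter(w for s in sentences for w in s.split())
    vocab = set(doc_dist)
    summary, summary_dist = [], Counter()
    while sum(len(s.split()) for s in summary) < budget:
        candidates = [s for s in sentences if s not in summary]
        if not candidates:
            break
        best = min(candidates, key=lambda s: smoothed_kl(
            doc_dist, summary_dist + Counter(s.split()), vocab))
        summary.append(best)
        summary_dist += Counter(best.split())
    return summary
```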

5.1.5 ICSI

Gillick and Favre (2009) propose using global linear optimization to extract a summary by solving a maximum coverage problem considering the most frequent bigrams in the source documents. ICSI has been among the state-of-the-art MDS systems when evaluated with ROUGE (Hong et al. 2014).

ICSI’s concept-based summarization can be formalized using an Integer Linear Programming (ILP) framework. Let C be the set of concepts in a given set of source documents D, \(c_i\) the presence of the concept i in the resulting summary, \(w_i\) a concept’s weight, \(\ell _j\) the length of sentence j, \(s_j\) the presence of sentence j in the summary, and \( Occ _{ij}\) the occurrence of concept i in sentence j. Based on these definitions, the following ILP has to be solved:

$$ \text {Maximize}\quad {\textstyle \sum _i} \, w_i c_i$$
(2)
$$ \text {subject to}\quad {\textstyle \sum _j} \, \ell _j s_j \le L$$
(3)
$$\forall i,j. \quad s_j \, Occ _{ij} \le c_i $$
(4)
$$\forall i. \quad {\textstyle \sum _j} \, s_j \, Occ _{ij} \ge c_i $$
(5)
$$ \forall i. \quad c_i \in \{0,1\} $$
(6)
$$ \forall j. \quad s_j \in \{0,1\} $$
(7)

The objective function (2) maximizes the weighted occurrence of concepts \(c_i\) (typically bigrams) in the summary based on their weights \(w_i\) (e.g., document frequency). Constraint (3) ensures that the summary length is restricted to a maximum length L. Constraint (4) ensures the selection of all concepts contained in a sentence \(s_j\) if \(s_j\) has been selected for the summary, and constraint (5) ensures that a concept is only selected if it is present in at least one of the selected sentences. Constraints (6) and (7) restrict the concept and sentence indicators to binary values.
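
Equations (2)–(7) can be handed to any off-the-shelf ILP solver. The sketch below uses the PuLP library, which is an assumption of this illustration, with concepts (e.g., bigrams) weighted, for instance, by document frequency; constraints (6) and (7) are expressed via binary variable types.

```python
from typing import Dict, List, Set
import pulp

def icsi_summary(sentences: List[str], concept_sets: List[Set[str]],
                 weights: Dict[str, float], lengths: List[int],
                 budget: int) -> List[str]:
    """Solve the concept-coverage ILP of Eqs. (2)-(7)."""
    concepts = sorted(set().union(*concept_sets))
    index = {concept: i for i, concept in enumerate(concepts)}
    s = pulp.LpVariable.dicts("s", range(len(sentences)), cat="Binary")   # (7)
    c = pulp.LpVariable.dicts("c", range(len(concepts)), cat="Binary")    # (6)

    prob = pulp.LpProblem("icsi", pulp.LpMaximize)
    prob += pulp.lpSum(weights[k] * c[index[k]] for k in concepts)        # (2)
    prob += pulp.lpSum(lengths[j] * s[j]
                       for j in range(len(sentences))) <= budget          # (3)
    for j, cset in enumerate(concept_sets):
        for k in cset:
            prob += s[j] <= c[index[k]]                                   # (4)
    for k in concepts:
        prob += pulp.lpSum(s[j] for j, cset in enumerate(concept_sets)
                           if k in cset) >= c[index[k]]                   # (5)
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [sentences[j] for j in range(len(sentences)) if s[j].value() == 1]
```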

5.2 Supervised methods

We cast the supervised extractive summarization task as a sequence labeling problem using the formulation by Conroy and O’Leary (2001): Given a document set containing n sentences (\(s_{1},\ldots, s_{i},\ldots, s_{n}\)), the goal is to generate a summary by predicting a label sequence (\(y_{1},\ldots, y_{i},\ldots, y_{n}) \in \{0, 1\}^{n}\) corresponding to the n sentences, where \(y_{i} = 1\) indicates that the i-th sentence is included in the summary. The summaries are constructed with a word budget L, which enforces a constraint on the summary length: \(\sum _{i=1}^{n} y_{i} \cdot |s_{i}| \le L\). Figure 3 shows the neural network architectures of the four state-of-the-art sentence extractors we describe below.

Fig. 3: Architectures of the sentence extractors RNN, Seq2Seq, Cheng and Lapata, and SummaRuNNer

5.2.1 RNN

Kedzie et al. (2018) propose a simple bidirectional RNN-based tagging model. In the sentence encoder, the forward and backward outputs for each sentence are passed through a multi-layer perceptron with a sigmoid output layer to predict the probability of extracting each sentence.
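
A minimal PyTorch sketch of such a bidirectional tagger over precomputed sentence embeddings; the layer sizes are illustrative rather than the exact hyperparameters of the original model.

```python
import torch
import torch.nn as nn

class RNNExtractor(nn.Module):
    """Bidirectional GRU tagger over sentence embeddings (illustrative sketch)."""

    def __init__(self, emb_dim: int = 300, hidden_dim: int = 300):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_sentences, emb_dim)
        out, _ = self.rnn(sent_embs)                      # forward/backward states
        return torch.sigmoid(self.mlp(out)).squeeze(-1)   # extraction probabilities
```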

5.2.2 Seq2Seq

In the same paper, Kedzie et al. (2018) also propose a sequence-to-sequence (Seq2Seq) extractor, which tackles a shortcoming of the RNN extractor, i.e., its limited ability to capture long-range dependencies between sentences. The Seq2Seq extractor therefore uses an attention mechanism (Bahdanau et al. 2015; See et al. 2017; Rush et al. 2015) as popularly used in machine translation and abstractive summarization. The Seq2Seq extractor is divided into an encoder and a decoder: the sentence embeddings are first encoded by a bidirectional GRU, and a separate decoder GRU transforms each sentence into a query vector. The query vector attends to the encoder outputs and is concatenated with the decoder GRU’s output. These concatenated outputs are then fed into a multi-layer perceptron to compute the extraction probabilities.

5.2.3 Cheng and Lapata

Cheng and Lapata (2016) propose a Seq2Seq model where the encoder RNN is fed with the sentence embedding and the final encoder state is passed on to the first step of the decoder RNN. The decoder takes the same sentence embeddings as input and the outputs are used to predict the \(y_{i}\) labels defining the summary. To induce dependencies of \(y_{i}\) on \(y_{<i}\), the decoder input is weighted by the previous extraction probabilities \(y_{<i}\).

5.2.4 SummaRuNNer

Nallapati et al. (2017) propose a sentence extractor where the sentence embeddings are passed into a bidirectional RNN and the forward and backward outputs are concatenated. Then, they average the RNN outputs to construct a document representation, and they sum up the previous RNN outputs weighted by their extraction probabilities to construct a summary representation for each time step. Finally, the extraction probabilities are calculated using the document representation, the sentence position, the RNN outputs, and the summary representation at the i-th step. The iterative summary representation intuitively captures dependencies of \(y_{i}\) on all \(y_{<i}\).

We test each sentence extractor with two input encoders that compute sentence representations based on the sequence of word embeddings.

5.2.5 Averaging encoder (Avg)

The averaging encoder creates sentence representations

$$ h_{i} = \frac{1}{|s_{i}|} \sum _{j=1}^{|s_{i}|} w_{j}$$

by averaging the word embeddings (\(w_{1},\ldots, w_{j},\ldots, w_{|s_i|}\)) of a sentence \(s_i\).

5.2.6 CNN encoder

The CNN sentence encoder employs a series of one-dimensional convolutions over the word embeddings, similar to the architecture proposed by Kim (2014) for text classification. The final sentence representation \(h_{i}\) is the concatenation of the max-pooling over time of all convolutional filter outputs.
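
Both encoders can be sketched in PyTorch as follows; the number of filters and the kernel sizes of the CNN encoder are illustrative choices in the spirit of Kim (2014), not necessarily the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class AveragingEncoder(nn.Module):
    """Sentence representation = mean of the word embeddings."""

    def forward(self, word_embs: torch.Tensor) -> torch.Tensor:
        # word_embs: (batch, n_words, emb_dim) -> (batch, emb_dim)
        return word_embs.mean(dim=1)

class CNNEncoder(nn.Module):
    """One-dimensional convolutions with max-pooling over time (Kim 2014 style)."""

    def __init__(self, emb_dim: int = 300, n_filters: int = 100,
                 kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)

    def forward(self, word_embs: torch.Tensor) -> torch.Tensor:
        x = word_embs.transpose(1, 2)                   # (batch, emb_dim, n_words)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                 # concatenated filter outputs
```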

6 Benchmark results and discussion

In this section, we describe our live blog summarization experiments and provide benchmark results for future researchers using our data and setup.

6.1 Experimental setup

In our experiments, we measure performance using the ROUGE metrics identified by Owczarzak et al. (2012) as strongly correlating with human evaluation methods: ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) recall with stemming and without removing stop words. We explore two different summary lengths: 50 words, which corresponds to the average length of the human-written summaries, and 100 words, which is twice the average length, in order to give leeway to compensate for the high compression ratio of the human-written live blog summaries.
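
The snippet below illustrates one way to obtain recall-oriented ROUGE scores with stemming, using the rouge-score Python package; this package is an assumption of the example, as the reported benchmark numbers may have been produced with a different ROUGE implementation.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_recall(reference: str, system_summary: str) -> dict:
    """Return R1/R2/RL recall of a system summary against one reference."""
    scores = _scorer.score(reference, system_summary)
    return {name: score.recall for name, score in scores.items()}

def truncate(text: str, budget: int) -> str:
    """Truncate a system output to the 50- or 100-word budget before scoring."""
    return " ".join(text.split()[:budget])
```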

For the supervised setup, we split the dataset into training, validation, and test sets consisting of 80%, 10%, and 10% of the data, respectively. Table 7 shows the training, validation, and test split sizes used for our experiments.

Table 7 Training, validation and test split sizes for BBC and Guardian datasets

We train the models to minimize the weighted negative log-likelihood over the training data D:

$$ \mathcal {L} = - \sum _{(s,y) \in D} \; \sum _{i=1}^{n} \omega (y_{i}) \log p(y_{i} \mid y_{<i}, h), \quad \text {where } h = enc(s) $$

We use stochastic gradient descent with the Adam optimizer to optimize the objective function. \(\omega (y)\) represents the label weights, i.e., \(\omega (0) = 1\) and \(\omega (1) = \frac{N_{0}}{N_{1}}\), where \(N_{y}\) is the number of training samples with label y. The word embeddings are initialized with the pretrained GloVe embeddings (Pennington et al. 2014) and not updated during training. The training is carried out for a maximum of 50 epochs, and the best model is selected using an early stopping criterion for ROUGE-2 on the validation set. We use a learning rate of 0.0001, a dropout rate of 0.25, and bias terms of 0. The batch size is set to 32 for both BBC and the Guardian. Additionally, due to the GPU memory limitation, the number of input sentences used by the extractors is limited to 250 for BBC and 200 for the Guardian.
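
The weighted objective can be sketched in PyTorch as follows. For brevity, the label weights are computed per batch here, whereas the ratio N0/N1 described above is defined over the training data; the commented optimizer call mirrors the reported learning rate.

```python
import torch
import torch.nn.functional as F

def weighted_nll(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Weighted negative log-likelihood with w(0) = 1 and w(1) = N0 / N1."""
    labels = labels.float()
    n0 = (labels == 0).sum().clamp(min=1).float()
    n1 = (labels == 1).sum().clamp(min=1).float()
    weights = torch.where(labels == 1, n0 / n1, torch.ones_like(labels))
    return F.binary_cross_entropy(probs, labels, weight=weights)

# Illustrative optimizer setup (model could be, e.g., the RNNExtractor sketch above):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```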

6.2 Upper bound

For comparison, we compute two upper bounds, UB-1 and UB-2. The upper bound for extractive summarization is obtained by solving the maximum coverage of n-grams from the reference summary (Takamura and Okumura 2010; Peyrard and Eckle-Kohler 2016; Avinesh and Meyer 2017). Upper bound summary extraction is cast as an ILP problem as described in Eqs. (2)–(7), which form the core of the ICSI system. The only difference is that a concept's weight is set to 1 if the concept occurs in the human-written reference summary. The concept extraction depends on N, which represents the n-gram concept type. In our work, we set \(N=1\) and \(N=2\) and compute the upper bounds for ROUGE-1 (UB-1) and ROUGE-2 (UB-2), respectively.

6.3 Analysis

Table 8 shows the benchmark results of the five unsupervised summarization methods introduced in Sect. 5.1 on our live blog corpus in comparison to the standard DUC 2004 dataset. TF\(\cdot \)IDF and LSA consistently lag behind the other methods. The results of KL are in a mid-range for the DUC datasets, but low on our data. LexRank yields stable results, but ICSI, as a state-of-the-art method for unsupervised extractive summarization, consistently outperforms all other methods by a large margin. The automatic methods reach higher ROUGE scores on BBC than on Guardian data, which we attribute to the different levels of abstractiveness of these live blogs: in BBC and DUC 2004, the summaries tend to reuse verbatim phrases from the input documents, whereas the Guardian summaries often contain newly formulated sentences. This can also be observed in the upper bound, as both UB-1 and UB-2 for the Guardian data are lower than the corresponding values for BBC. The best unsupervised method ICSI is 0.15 ROUGE-1 and 0.2 ROUGE-2 below the upper bounds for BBC and 0.1 ROUGE-1 and 0.1 ROUGE-2 below the upper bounds for the Guardian.

Table 8 ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) scores of multiple unsupervised systems compared to the extractive upper bounds for ROUGE-1 (UB-1) and ROUGE-2 (UB-2) for summary lengths of 50 and 100 words

The results of the supervised approaches comparing different extractors and encoders are shown in Table 9. While ICSI is the only unsupervised approach which is able to reach one-third of the upper bound, supervised approaches reach up to 50% of the upper bound scores for BBC. This confirms that the supervised models are able to learn importance properties of the BBC dataset. However, the supervised models perform worse than ICSI on the Guardian dataset. We presume this is caused by the limit on the number of input sentences imposed by the GPU memory constraint.

Table 9 ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (L) scores across supervised neural methods with all extractor and encoder (enc.) pairs compared to the extractive upper bounds for ROUGE-1 (UB-1) and ROUGE-2 (UB-2)

Overall, there are improvements of about 0.03 ROUGE-1 and 0.02 ROUGE-2 when a CNN encoder is used for sentence representation compared to the averaging encoder across all the supervised approaches, which differs from the observation by Kedzie et al. (2018). When analyzing the different extractors, the Seq2Seq extractor performs best in the majority of the settings, closely followed by Cheng and Lapata and RNN. SummaRuNNer consistently yields lower scores across all settings. Although RNN yields slightly better results for the 100-word condition on the Guardian data, Seq2Seq and Cheng and Lapata with a CNN encoder yield consistently good results across both datasets.

Figure 4 shows the output of the best unsupervised system ICSI and the three best supervised systems (i.e., Cheng and Lapata, RNN, and Seq2Seq with a CNN encoder). The outputs are compared to the extractive upper bound UB-2 and the reference summary for the BBC live blog on “Junior doctors’ strike updates”.Footnote 14 ICSI extracts sentences with the most frequent concepts (e.g., junior doctor, strike, England), but fails to identify topic shifts in the live blog’s postings, such as the discussion of emergency cover. The best supervised approach Seq2Seq captures more diverse concepts (e.g., junior doctors, emergency cover, 24-h walkout, dispute with the government) covering a greater variety of information about the strike event and its agents and reasons.

Fig. 4: System outputs on the BBC.com live blog on Junior doctors’ strike updates

However, the example also shows the challenges of live blog summarization, since most methods incorporate general statements meant to capture the reader’s attention (e.g., “stay with us as we bring you the latest updates”), which contain little factual information, but are frequently found in the postings. Many of our methods fail to detect this, raising the need for methods that better take semantic aspects into account. Furthermore, none of the summaries provides information about the greater context and future outlook (i.e., the fact that three strikes are planned). Such information is very important for summarizing live blogs, since readers are typically interested in the implications of certain events or decisions. The same applies to quotes by major protagonists of an event, as they are often included in a live blog summary, but not yet specifically treated by the automatic summarization methods. The increasing use of multimedia also raises a need for multimodal approaches that are able to extract important content from images or videos and include it in a summary. For multimodal summarization, there are as yet only a few case studies in selected domains, such as financial reports (Ahmad et al. 2004). Among the biggest challenges, however, is the heterogeneity of the individual postings, which makes live blog summarization very different from multi-document summarization of multiple news articles covering very similar information or microblog summarization of a large number of highly redundant posts. In live blogs, the same fact is typically covered only once.

7 Conclusion and future work

Automatic live blog summarization is a new task with direct applications for journalists and news readers, as journalists can easily summarize the major facts about an event and even provide instant updates as intermediate summaries while the event is ongoing. In this paper, we suggest a pipeline to collect live blogs with human-written bullet-point summaries from two major news outlets, the BBC and the Guardian. Our pipeline can be extended to collect live blogs from other news organizations as well, including The New York Times, the Washington Post, or Der Spiegel.

Based on this live blog reference corpus, we analyze the domain distribution and the heterogeneity of the corpus, and we provide benchmark results using state-of-the-art summarization methods. Our results show that simple off-the-shelf unsupervised summarization systems are not very effective for live blog summarization. Supervised systems, however, yield better results, particularly on our BBC data. We find the Seq2Seq extractor with a CNN encoder for sentence representations to perform best in the majority of settings. Furthermore, sentence representations based on a CNN encoder show improvements of 0.03 ROUGE-1 and 0.02 ROUGE-2 compared to the averaging encoder. For the Guardian data, the supervised systems show worse results than the unsupervised ICSI system. Our results enable future research on novel approaches to live blog summarization that are able to successfully handle the large number of heterogeneous postings of a live blog.

Besides our benchmark results which allow for comparison, we provide the source code for constructing and reproducing the live blog corpus as well as the automatic summarization experiments under the permissive Apache License 2.0 from our GitHub repository https://github.com/AIPHES/live-blog-summarization.