CNN/DailyMail Dataset as SQLite#
Creates a SQLite database of the CNN and DailyMail summarization dataset.
Documentation#
See the full documentation. The API reference is also available.
Obtaining#
The easiest way to install the command line program is via the pip
installer:
pip3 install zensols.cnndmdb
Binaries are also available on PyPI.
Usage#
First create the SQLite database file:
cnndmdb load
Then check to make sure the file data/cnn.sqlite3 was created. This takes a while since the entire corpus is first downloaded and then inserted into the SQLite file.
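One way to confirm that the load step produced a usable database is to open the file read-only and list its tables. The following is a minimal sketch using only Python's standard `sqlite3` module; the helper name `list_tables` is hypothetical and is not part of this package, and no assumption is made about which tables the loader creates.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the sorted table names found in the SQLite file at ``db_path``.

    Hypothetical helper for illustration; not part of zensols.cnndmdb.
    """
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"no database file at {path}")
    # Open with mode=ro so the check can never modify the database.
    conn = sqlite3.connect(path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("data/cnn.sqlite3")` after the load finishes should return a non-empty list if the corpus was inserted; a `FileNotFoundError` means the file was not created at that path.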
Command Line#
The SQLite database keys can be listed:
cnndmdb keys
Then the command line can also be used to print articles:
cnndmdb show -t org 3b07f5102c69e3e609d73b2ccb0dc5549d4fbaf6
The -t org option tells the program to use the original corpus keys. The same option also accepts SQLite rowid keys or a Kth smallest article.
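The difference between an original corpus key and a SQLite rowid can be illustrated with a small in-memory database. This is only a sketch: the table name `article`, the columns `org_key` and `text`, and the sample rows are all hypothetical and do not reflect the schema this package actually creates.

```python
import sqlite3
from typing import Optional

# Hypothetical schema for illustration only; the real database layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (org_key TEXT UNIQUE, text TEXT)")
conn.executemany(
    "INSERT INTO article (org_key, text) VALUES (?, ?)",
    [("aaa111", "First article."), ("bbb222", "Second article.")],
)


def by_org_key(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Look up an article by its original (hash-like) corpus key."""
    row = conn.execute(
        "SELECT text FROM article WHERE org_key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


def by_rowid(conn: sqlite3.Connection, rowid: int) -> Optional[str]:
    """Look up an article by the integer rowid SQLite assigns to each row."""
    row = conn.execute(
        "SELECT text FROM article WHERE rowid = ?", (rowid,)
    ).fetchone()
    return row[0] if row else None
```

Both lookups reach the same rows; the original key is stable across rebuilds of the corpus, while a rowid is simply the integer SQLite assigned when the row was inserted.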
API#
The corpus objects are accessible as mapped Python objects. For example:
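# import path assumed from the package name; see the API reference
from zensols.cnndmdb import ApplicationFactory, Corpus, Article
# get the corpus, then fetch the first article from its stash
corpus: Corpus = ApplicationFactory.get_corpus()
art: Article = next(iter(corpus.stash.values()))
print(art.text)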
Data Source#
The data is sourced from a TensorFlow dataset, which in turn uses the Abigail See GitHub repository.
@article{DBLP:journals/corr/SeeLM17,
author = {Abigail See and
Peter J. Liu and
Christopher D. Manning},
title = {Get To The Point: Summarization with Pointer-Generator Networks},
journal = {CoRR},
volume = {abs/1704.04368},
year = {2017},
url = {http://arxiv.org/abs/1704.04368},
archivePrefix = {arXiv},
eprint = {1704.04368},
timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{hermann2015teaching,
title={Teaching machines to read and comprehend},
author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
booktitle={Advances in neural information processing systems},
pages={1693--1701},
year={2015}
}