zensols.cnndmdb package#

Submodules#

zensols.cnndmdb.app#

Inheritance diagram of zensols.cnndmdb.app

Creates a SQLite database of the CNN and DailyMail summarization dataset.

class zensols.cnndmdb.app.Application(config_factory, corpus)[source]#

Bases: object

Creates a SQLite database of the CNN and DailyMail summarization dataset.

__init__(config_factory, corpus)#
config_factory: ConfigFactory#

Used to create objects for load().

corpus: Corpus#

The corpus, which contains a stash that creates instances of Article.

load()[source]#

Load the SQLite database with the CNN/DailyMail corpus.

write_article(key, key_type=_KeyType.org, format=_Format.text, output_file=PosixPath('-'))[source]#

Write an article.

Parameters
  • key (str) – the key to the article

  • key_type (_KeyType) – db for the numeric database key, org for the original corpus ID, short for the Kth shortest article

  • format (_Format) – the output format

Return type

Article

write_keys(limit=1)[source]#

Print the keys of the corpus.

Parameters

limit (int) – the max number of keys to write
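The limit parameter caps how many keys are printed. A minimal self-contained sketch of that behavior over a stash-like mapping (the dict here is an illustrative stand-in, not the package's stash):

```python
from itertools import islice

# a stand-in for the corpus stash's key space
stash_like = {'1': 'first article', '2': 'second', '3': 'third'}

def write_keys(stash, limit: int = 1):
    """Print at most `limit` keys, mirroring the documented default."""
    keys = list(islice(stash.keys(), limit))
    for k in keys:
        print(k)
    return keys

assert write_keys(stash_like, limit=2) == ['1', '2']
```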

class zensols.cnndmdb.app.PrototypeApplication(app)[source]#

Bases: object

CLI_META = {'is_usage_visible': False}#
__init__(app)#
app: Application#
proto()[source]#

Prototype test.

zensols.cnndmdb.cli#

Inheritance diagram of zensols.cnndmdb.cli

Command line entry point to the application.

class zensols.cnndmdb.cli.ApplicationFactory(*args, **kwargs)[source]#

Bases: ApplicationFactory

__init__(*args, **kwargs)[source]#
classmethod get_corpus()[source]#

Return the corpus using the app context.

Return type

Corpus

zensols.cnndmdb.cli.main(args=sys.argv, **kwargs)[source]#
Return type

ActionResult

zensols.cnndmdb.corpus#

Inheritance diagram of zensols.cnndmdb.corpus

Data access objects (DAO) for the CNN/DailyMail news summarization corpus, which is sourced from a Tensorflow dataset instance, which in turn uses the Abi See GitHub repo.

GitHub: https://github.com/abisee/cnn-dailymail

class zensols.cnndmdb.corpus.Article(id, corp_id, split, publisher, text, highlights=None)[source]#

Bases: Dictable

Represents an article from the CNN/DailyMail corpus.

__init__(id, corp_id, split, publisher, text, highlights=None)#
asflatdict(*args, **kwargs)[source]#

Like asdict() but flattened into a data structure suitable for writing to JSON or YAML.

Return type

Dict[str, Any]

corp_id: str#

The original corpus unique identifier.

highlights: Tuple[str, ...] = None#

The highlights, or summarization, of the article.

id: int#

The database unique identifier.

publisher: Publisher#

The source of the article.

split: Split#

The dataset split of the article.

text: str#

The article’s (story) text.

write(depth=0, writer=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]#

Write this instance as either a Writable or as a Dictable. If class attribute _DICTABLE_WRITABLE_DESCENDANTS is set as True, then use the write() method on children instead of writing the generated dictionary. Otherwise, write this instance by first creating a dict recursively using asdict(), then formatting the output.

If the attribute _DICTABLE_WRITE_EXCLUDES is set, those attributes are removed from what is written in the write() method.

Note that this attribute will need to be set in all descendants in the instance hierarchy since writing the object instance graph is done recursively.

Parameters
  • depth (int) – the starting indentation depth

  • writer (TextIOBase) – the writer to dump the content of this writable
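For experimenting without a loaded database, the documented fields above can be mirrored in a minimal stand-in dataclass (an illustrative sketch only, not the package's Dictable-based implementation; the field types follow the attribute documentation above):

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class ArticleSketch:
    """Stand-in mirroring the documented Article fields."""
    id: int                 # database unique identifier
    corp_id: str            # original corpus unique identifier (40 characters)
    split: str              # split code: 't', 'r', or 'v'
    publisher: str          # publisher code: 'c' or 'd'
    text: str               # the article's (story) text
    highlights: Optional[Tuple[str, ...]] = None  # the summarization

art = ArticleSketch(1, '0' * 40, 'r', 'c',
                    'some story text', ('a highlight',))
flat = asdict(art)
assert flat['id'] == 1 and flat['split'] == 'r'
```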

class zensols.cnndmdb.corpus.Corpus(persister, stash)[source]#

Bases: object

Contains access to the CNN/DailyMail corpus.

__init__(persister, stash)#
get_by_corp_id(name)[source]#

Get an article using the original corpus ID.

Parameters

name (str) – the 40-character unique identifier from the original corpus

Return type

Article
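Under the hood, the persister resolves a corpus ID with a SQL lookup against the SQLite file. A rough self-contained sketch of that pattern using the standard library's sqlite3 (the table and column names here are assumptions for illustration, not the package's actual schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table article (id integer primary key, '
             'corp_id text unique, text text)')
conn.execute("insert into article values (1, ?, 'story text')", ('0' * 40,))

def get_by_corp_id(corp_id: str):
    """Fetch the row matching the original 40-character corpus ID."""
    cur = conn.execute(
        'select id, corp_id, text from article where corp_id = ?',
        (corp_id,))
    return cur.fetchone()

row = get_by_corp_id('0' * 40)
assert row[0] == 1
```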

get_kth_shortest(k)[source]#

Return the kth shortest article in the corpus.
Return type

Article

persister: BeanDbPersister#

The DB access object.

stash: Stash#

A stash for accessing and mapping the corpus as Article instances.

class zensols.cnndmdb.corpus.Publisher(value)[source]#

Bases: Enum

The source of the article.

cnn = 'c'#
daily_mail = 'd'#
class zensols.cnndmdb.corpus.Split(value)[source]#

Bases: Enum

The split of the news article.

test = 't'#
train = 'r'#
validation = 'v'#
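The single-character enum values above are what is stored in the database. A self-contained sketch re-declaring the enums from the documented values (not importing the package) shows how the stored codes decode:

```python
from enum import Enum

class Publisher(Enum):
    """The source of the article (values as stored in the database)."""
    cnn = 'c'
    daily_mail = 'd'

class Split(Enum):
    """The split of the news article."""
    test = 't'
    train = 'r'
    validation = 'v'

# decode single-character codes as read back from a database row
assert Publisher('c') is Publisher.cnn
assert Split('r') is Split.train
```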

zensols.cnndmdb.load#

Inheritance diagram of zensols.cnndmdb.load

Classes to populate the database. For data sources see zensols.stash.

class zensols.cnndmdb.load.DatabaseLoader(persister, chunk_size, dataset_name='cnn_dailymail', split_spec=None)[source]#

Bases: object

Loads the CNN/DailyMail corpus into a new SQLite database file. If the file already exists, it is deleted. This takes about 2 to load.

__init__(persister, chunk_size, dataset_name='cnn_dailymail', split_spec=None)#
chunk_size: int#

Number of rows to insert into SQLite at a time.

dataset_name: str = 'cnn_dailymail'#

The name of the dataset to load from.

property db_file: Path#

The SQLite file.

Return type

Path

load()[source]#

Load the SQLite database with the CNN/DailyMail corpus.

persister: BeanDbPersister#

The DB access object.

split_spec: Dict[str, str] = None#

Used to create the split format for loading the dataset.
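chunk_size batches the inserts so the entire corpus need not be held in memory at once. A self-contained sketch of that chunked-insert pattern with the standard library's sqlite3 (illustrative only; the real loader reads rows from the Tensorflow dataset and uses the persister):

```python
import sqlite3
from itertools import islice

def load_chunked(conn, rows, chunk_size: int):
    """Insert rows in batches of chunk_size using executemany."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        conn.executemany('insert into doc (text) values (?)', chunk)
        conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('create table doc (id integer primary key, text text)')
load_chunked(conn, ((f'article {i}',) for i in range(10)), chunk_size=3)
count = conn.execute('select count(*) from doc').fetchone()[0]
assert count == 10
```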

Module contents#