Welcome to zeugma’s documentation!

📝 Natural language processing (NLP) utils: word embeddings (Word2Vec, GloVe, FastText, …) and preprocessing transformers, compatible with scikit-learn Pipelines. 🛠 Check the documentation for more information.

Installation

Install the package with pip install zeugma.

Examples

Embedding transformers can either be used with downloaded embeddings (they all come with a default embedding URL) or trained.

Pretrained embeddings

As an illustrative example, the cosine similarity of the sentences “what is zeugma” and “a figure of speech” is computed using the pretrained GloVe embeddings:

>>> from zeugma.embeddings import EmbeddingTransformer
>>> glove = EmbeddingTransformer('glove')
>>> embeddings = glove.transform(['what is zeugma', 'a figure of speech'])
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity(embeddings)[0, 1]
0.8721696

Training embeddings

To train your own Word2Vec embeddings, use the Gensim sklearn API, as in the sketch below.
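
For instance, with gensim 3.x (where the scikit-learn wrappers live in gensim.sklearn_api; they were removed in gensim 4), training could look like the following minimal sketch. The toy corpus and the parameters are illustrative only:

>>> from gensim.sklearn_api import W2VTransformer
>>> corpus = [['the', 'cute', 'cat'], ['the', 'dog']]  # toy tokenized corpus
>>> w2v = W2VTransformer(size=10, min_count=1, seed=1)
>>> word_vectors = w2v.fit(corpus).transform(['cat', 'dog'])  # one vector per word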

Fine-tuning embeddings

Embedding fine-tuning (training embeddings initialized with preloaded values) will be implemented in the future.

Other examples

Usage examples can be found in the examples folder.

Additional examples using Zeugma can be found in some posts on my blog.

Contribute

Feel free to fork this repo and submit a Pull Request.

Development

The development workflow for this repo is as follows:

  1. create a virtual environment: python -m venv venv && source venv/bin/activate
  2. install required packages: pip install -r requirements.txt
  3. install the pre-commit hooks: pre-commit install
  4. run the test suite with pytest from the root folder

Distribution via PyPI

To upload a new version to PyPI, simply:

  1. tag your new version on git: git tag -a x.x -m "my tag message"
  2. update the download_url field in the setup.py file
  3. commit, push the code and the tag (git push origin x.x), and make a PR
  4. make sure you have a .pypirc file in your home folder, structured as in the sketch after this list (you can use https://upload.pypi.org/legacy/ for the repository URL field)
  5. once the updated code is present in master, run python setup.py sdist && twine upload dist/* from the root of the package to distribute it.
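
For reference, a minimal .pypirc sketch (the username is a placeholder to replace with your own):

[distutils]
index-servers =
    pypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = <your-pypi-username>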

Building documentation

To build the documentation locally simply run make html from the docs folder.

Bonus: what’s a zeugma?

It’s a figure of speech: “The act of using a word, particularly an adjective or verb, to apply to more than one noun when its sense is appropriate to only one.” (from Wiktionary).

For example, “He lost his wallet and his mind.” is a zeugma.

zeugma

zeugma package

Submodules

zeugma.conf module

Created on 05/01/18 by Nicolas Thiebaut (nkthiebaut@gmail.com).

zeugma.embeddings module

class zeugma.embeddings.EmbeddingTransformer(model: str = 'glove', aggregation: str = 'average')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Text vectorizer class: load pre-trained embeddings and transform texts into vectors.

fit(x: Iterable[Iterable[T_co]], y: Iterable[T_co] = None) → sklearn.base.BaseEstimator[source]

Defines a fit method to conform to the scikit-learn transformer interface and allow integration in a sklearn.Pipeline object

transform(texts: Iterable[str]) → Iterable[Iterable[T_co]][source]

Transform a corpus by applying the single-text transformation method, transform_sentence, to each of its texts

transform_sentence(text: Union[Iterable[T_co], str]) → numpy.array[source]

Compute an aggregate embedding vector for an input str or iterable of str.
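
A minimal usage sketch of transform_sentence (the model name follows the earlier example; the dimension of the returned vector depends on the loaded embeddings):

>>> from zeugma.embeddings import EmbeddingTransformer
>>> glove = EmbeddingTransformer('glove')
>>> vector = glove.transform_sentence('a figure of speech')
>>> # vector is a single numpy array aggregating the word embeddings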

zeugma.keras_transformers module

Created on 02/05/2018 by Nicolas Thiebaut (nicolas@visage.jobs).

class zeugma.keras_transformers.Padder(max_length=500)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Pad and crop uneven lists to the same length. Only the end of lists longer than the max_length attribute is kept, and lists shorter than max_length are left-padded with zeros.

Variables:
  • max_length (int) – size of the sequences after padding
  • max_index (int) – maximum index known by the Padder; if a higher index is encountered during transform it is replaced with 0
fit(X, y=None)[source]
transform(X, y=None)[source]
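
A minimal usage sketch, assuming keras-style pre-padding as described above:

>>> from zeugma.keras_transformers import Padder
>>> padder = Padder(max_length=4)
>>> padded = padder.fit_transform([[1, 2, 3], [1, 4]])
>>> # every sequence now has length 4, e.g. [1, 4] becomes [0, 0, 1, 4]
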
class zeugma.keras_transformers.TextsToSequences(**kwargs)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sklearn transformer to convert texts to lists of word indices

Example

>>> from zeugma import TextsToSequences
>>> sequencer = TextsToSequences()
>>> sequencer.fit_transform(["the cute cat", "the dog"])
[[1, 2, 3], [1, 4]]
fit(texts, y=None)[source]
transform(texts, y=None)[source]

zeugma.logger module

zeugma.texttransformers module

class zeugma.texttransformers.ItemSelector(key)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

For data grouped by feature, select subset of data at a provided key.

The data is expected to be stored in a 2D data structure, where the first index is over features and the second is over samples.

Parameters: key (hashable, required) – The key corresponding to the desired value in a mappable.
fit(x, y=None)[source]

Necessary fit method to include transformer in a sklearn.Pipeline

transform(data_dict)[source]

Return selected items
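
A minimal usage sketch (the data dictionary is illustrative), assuming transform simply returns the value stored at the given key, as described above:

>>> from zeugma.texttransformers import ItemSelector
>>> data = {'text': ['a figure of speech'], 'label': [1]}
>>> ItemSelector(key='text').fit_transform(data)
['a figure of speech']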

class zeugma.texttransformers.Namer(key)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Return a single-entry dictionary whose key is given by the attribute ‘key’ and whose value is the input data

Parameters: key (hashable, required) – The key corresponding to the output name.
fit(x, y=None)[source]

Necessary fit method to include transformer in a sklearn.Pipeline

transform(X)[source]

Return data in a dictionary with key provided at instantiation
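
A minimal usage sketch, following the docstring above:

>>> from zeugma.texttransformers import Namer
>>> Namer(key='text').fit_transform(['a figure of speech'])
{'text': ['a figure of speech']}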

class zeugma.texttransformers.RareWordsTagger(min_count, oov_tag='<oov>')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Replace rare words in a corpus (list of strings) with an out-of-vocabulary token

fit(texts, y=None)[source]
transform(texts)[source]
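
A minimal usage sketch, assuming whitespace tokenization and that words occurring fewer than min_count times are tagged:

>>> from zeugma.texttransformers import RareWordsTagger
>>> tagger = RareWordsTagger(min_count=2)
>>> tagged = tagger.fit_transform(["the cat and the dog"])
>>> # rare words like "cat" are replaced with the '<oov>' tag,
>>> # e.g. the text becomes "the <oov> <oov> the <oov>"
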
class zeugma.texttransformers.TextStats[source]

Bases: sklearn.preprocessing._function_transformer.FunctionTransformer

Extract features from each document for DictVectorizer

Module contents

Created on 05/01/18 by Nicolas Thiebaut (nkthiebaut@gmail.com).

License

The project is licensed under the MIT license.
