vectorization

  • Name: cognitivefactory.interactive_clustering.utils.vectorization
  • Description: Utility methods for NLP vectorization.
  • Author: Erwan SCHILD
  • Created: 17/03/2021
  • Licence: CeCILL (https://cecill.info/licences.fr.html)

vectorize(dict_of_texts, vectorizer_type='tfidf', spacy_language_model='fr_core_news_md')

A method used to vectorize texts. Several vectorizers are available: TF-IDF and spaCy language models.

References
  • Scikit-learn: Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
  • Scikit-learn 'TfidfVectorizer': https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
  • spaCy: Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
  • spaCy language models: https://spacy.io/usage/models

Parameters:

  • dict_of_texts (Dict[str, str], required): A dictionary that maps each data ID to the text to vectorize.
  • vectorizer_type (str, default 'tfidf'): The vectorizer type to use: "tfidf" or "spacy".
  • spacy_language_model (str, default 'fr_core_news_md'): The spaCy language model to load when vectorizer_type is "spacy".

Raises:

  • ValueError: If vectorizer_type is not implemented, or if spacy_language_model is not installed.
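
Because both failure modes surface as a ValueError, callers may want to handle them in one place. The sketch below shows one possible way to do that: `vectorize_with_fallback` is a hypothetical helper (not part of the package) that tries the spaCy vectorizer first and retries with "tfidf" when a ValueError is raised. It takes the vectorization function as an argument so the snippet runs without the package installed; in practice you would pass `vectorize` itself.

```python
from typing import Any, Callable, Dict


def vectorize_with_fallback(
    vectorize_fn: Callable[..., Dict[str, Any]],
    dict_of_texts: Dict[str, str],
    spacy_language_model: str = "fr_core_news_md",
) -> Dict[str, Any]:
    """Hypothetical helper: try spaCy first, fall back to TF-IDF on ValueError."""
    try:
        return vectorize_fn(
            dict_of_texts=dict_of_texts,
            vectorizer_type="spacy",
            spacy_language_model=spacy_language_model,
        )
    except ValueError as err:
        # Raised when the model is not installed (or the type is not implemented).
        print(f"spaCy vectorization unavailable ({err}); falling back to TF-IDF.")
        return vectorize_fn(dict_of_texts=dict_of_texts, vectorizer_type="tfidf")
```

Note that the two vectorizers produce vectors of different dimensionality, so a fallback like this only makes sense if downstream code does not mix vectors from both.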

Returns:

  • Dict[str, csr_matrix]: A dictionary that maps each data ID to its computed vector, stored as a single-row sparse matrix.
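
Since the result is a dictionary of separate sparse vectors, downstream code often stacks them into one matrix before comparing texts. The sketch below illustrates this with hand-made stand-in vectors (the values are invented for illustration and are not output of `vectorize`); `stack_vectors` is a hypothetical helper name, and the snippet assumes NumPy and SciPy are available.

```python
from typing import Dict, List, Tuple

import numpy as np
from scipy.sparse import csr_matrix, vstack


def stack_vectors(dict_of_vectors: Dict[str, csr_matrix]) -> Tuple[List[str], csr_matrix]:
    """Hypothetical helper: stack 1 x n rows into one (n_texts x n_features) matrix."""
    ids = list(dict_of_vectors.keys())
    matrix = vstack([dict_of_vectors[data_id] for data_id in ids], format="csr")
    return ids, matrix


# Stand-in vectors with made-up values, shaped like single-row sparse matrices.
dict_of_vectors = {
    "0": csr_matrix(np.array([1.0, 0.0, 2.0, 0.0])),
    "1": csr_matrix(np.array([0.0, 3.0, 0.0, 0.0])),
    "2": csr_matrix(np.array([2.0, 0.0, 4.0, 0.0])),
}

ids, matrix = stack_vectors(dict_of_vectors)

# Cosine similarity: L2-normalize each row, then take pairwise dot products.
dense = matrix.toarray()
normalized = dense / np.linalg.norm(dense, axis=1, keepdims=True)
similarities = normalized @ normalized.T

print(ids, matrix.shape)
print(np.round(similarities, 3))
```

The normalization step divides by each row's norm, so an all-zero vector would need a guard before this is used on real data.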

Example
# Import.
from cognitivefactory.interactive_clustering.utils.vectorization import vectorize

# Define data.
dict_of_texts = {
    "0": "comment signaler une perte de carte de paiement",
    "1": "quelle est la procedure pour chercher une carte de credit avalee",
    "2": "ma carte visa a un plafond de paiment trop bas puis je l augmenter",
}

# Apply vectorization.
dict_of_vectors = vectorize(
    dict_of_texts=dict_of_texts,
    vectorizer_type="spacy",
    spacy_language_model="fr_core_news_md",
)

# Print results.
print("Computed results", ":", dict_of_vectors)
Source code in src\cognitivefactory\interactive_clustering\utils\vectorization.py
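from typing import Dict

import spacy
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize(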
    dict_of_texts: Dict[str, str],
    vectorizer_type: str = "tfidf",
    spacy_language_model: str = "fr_core_news_md",
) -> Dict[str, csr_matrix]:
    """
    A method used to vectorize texts.
    Several vectorizers are available: TF-IDF and spaCy language models.

    References:
        - _Scikit-learn_: `Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.`
        - _Scikit-learn_ _'TfidfVectorizer'_: `https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html`
        - _spaCy_: `Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.`
        - _spaCy_ language models: `https://spacy.io/usage/models`

    Args:
        dict_of_texts (Dict[str,str]): A dictionary that contains the texts to vectorize.
        vectorizer_type (str, optional): The vectorizer type to use. The type can be `"tfidf"` or `"spacy"`. Defaults to `"tfidf"`.
        spacy_language_model (str, optional): The spaCy language model to use if vectorizer is spacy. Defaults to `"fr_core_news_md"`.

    Raises:
        ValueError: Raises error if `vectorizer_type` is not implemented or if the `spacy_language_model` is not installed.

    Returns:
        Dict[str, csr_matrix]: A dictionary that contains the computed vectors.

    Example:
        ```python
        # Import.
        from cognitivefactory.interactive_clustering.utils.vectorization import vectorize

        # Define data.
        dict_of_texts = {
            "0": "comment signaler une perte de carte de paiement",
            "1": "quelle est la procedure pour chercher une carte de credit avalee",
            "2": "ma carte visa a un plafond de paiment trop bas puis je l augmenter",
        }

        # Apply vectorization.
        dict_of_vectors = vectorize(
            dict_of_texts=dict_of_texts,
            vectorizer_type="spacy",
            spacy_language_model="fr_core_news_md",
        )

        # Print results.
        print("Computed results", ":", dict_of_vectors)
        ```
    """

    # Initialize dictionary of vectors.
    dict_of_vectors: Dict[str, csr_matrix] = {}

    ###
    ### Case of TFIDF vectorization.
    ###
    if vectorizer_type == "tfidf":
        # Initialize vectorizer.
        vectorizer = TfidfVectorizer(
            analyzer="word",
            ngram_range=(1, 3),
            min_df=2,
            # Disabled alternative settings:
            # min_df=0.0, max_df=0.95, max_features=20000,
            # ngram_range=(1, 5), analyzer="char_wb", sublinear_tf=True,
        )

        # Apply vectorization.
        tfidf_vectorization: csr_matrix = vectorizer.fit_transform(
            [str(text) for text in dict_of_texts.values()]
        )

        # Format dictionary of vectors to return (row i corresponds to the i-th key).
        dict_of_vectors = {data_ID: tfidf_vectorization[i] for i, data_ID in enumerate(dict_of_texts)}

        # Return the dictionary of vectors.
        return dict_of_vectors

    ###
    ### Case of SPACY vectorization.
    ###
    if vectorizer_type == "spacy":
        # Load vectorizer (spaCy language model).
        try:
            spacy_nlp = spacy.load(
                name=spacy_language_model,
                disable=[
                    "morphologizer",  # Not needed
                    "parser",  # Not needed
                    "attribute_ruler",  # Not needed
                    "lemmatizer",  # Not needed
                    "ner",  # Not needed
                ],
            )
        except OSError as err:  # `spacy_language_model` is not installed.
            raise ValueError(
                "The `spacy_language_model` '" + str(spacy_language_model) + "' is not installed."
            ) from err

        # Apply vectorization.
        dict_of_vectors = {data_ID: csr_matrix(spacy_nlp(str(text)).vector) for data_ID, text in dict_of_texts.items()}

        # Return the dictionary of vectors.
        return dict_of_vectors

    ###
    ### Other case : Raise a `ValueError`.
    ###
    raise ValueError("The `vectorizer_type` '" + str(vectorizer_type) + "' is not implemented.")