
preprocessing

  • Name: cognitivefactory.interactive_clustering.utils.preprocessing
  • Description: Utility methods to apply NLP preprocessing.
  • Author: Erwan SCHILD
  • Created: 17/03/2021
  • Licence: CeCILL (https://cecill.info/licences.fr.html)

preprocess(dict_of_texts, apply_stopwords_deletion=False, apply_parsing_filter=False, apply_lemmatization=False, spacy_language_model='fr_core_news_md')

A method used to preprocess texts. It applies simple preprocessing (lowercasing, punctuation deletion, accent replacement, whitespace deletion). Some options are available to delete stopwords, apply lemmatization, and filter tokens according to their depth in the dependency tree.

References
  • spaCy: Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
  • spaCy language models: https://spacy.io/usage/models
  • NLTK: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
  • NLTK 'SnowballStemmer': https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball

Parameters:

Name | Type | Description | Default
---- | ---- | ----------- | -------
dict_of_texts | Dict[str, str] | A dictionary that contains the texts to preprocess. | required
apply_stopwords_deletion | bool | The option to delete stopwords. Defaults to False. | False
apply_parsing_filter | bool | The option to filter tokens based on dependency parsing results. If set, only "ROOT" tokens and their direct children are kept. Defaults to False. | False
apply_lemmatization | bool | The option to lemmatize tokens. Defaults to False. | False
spacy_language_model | str | The spaCy language model to use. The model has to be installed. Defaults to "fr_core_news_md". | 'fr_core_news_md'
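
For intuition on apply_parsing_filter, the sketch below (a minimal illustration, assuming the fr_core_news_md model is installed) prints each token's number of ancestors in the dependency tree; the filter keeps exactly the tokens with at most one ancestor, i.e. the "ROOT" token and its direct children.

# Minimal sketch of the dependency-based filter (assumes "fr_core_news_md" is installed).
import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp("comment signaler une perte de carte de paiement")

for token in doc:
    n_ancestors = len(list(token.ancestors))
    kept = n_ancestors <= 1  # "ROOT" has 0 ancestors; its direct children have 1.
    print(f"{token.text!r}: {n_ancestors} ancestor(s) -> kept={kept}")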

Raises:

Type | Description
---- | -----------
ValueError | Raised if the spacy_language_model is not installed.
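
To avoid this error at runtime, the model's availability can be checked up front; a minimal sketch:

# Minimal sketch: check that the spaCy model is installed before calling preprocess().
import spacy

try:
    spacy.load("fr_core_news_md")
except OSError:
    # Install it first, for example with: python -m spacy download fr_core_news_md
    print("The spaCy model 'fr_core_news_md' is not installed.")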

Returns:

Type | Description
---- | -----------
Dict[str, str] | A dictionary that contains the preprocessed texts.

Example
# Import.
from cognitivefactory.interactive_clustering.utils.preprocessing import preprocess

# Define data.
dict_of_texts = {
    "0": "Comment signaler une perte de carte de paiement ?",
    "1": "Quelle est la procédure pour chercher une carte de crédit avalée ?",
    "2": "Ma carte Visa a un plafond de paiment trop bas, puis-je l'augmenter ?",
}

# Apply preprocessing.
dict_of_preprocessed_texts = preprocess(
    dict_of_texts=dict_of_texts,
    apply_stopwords_deletion=True,
    apply_parsing_filter=False,
    apply_lemmatization=False,
    spacy_language_model="fr_core_news_md",
)

# Print results.
print("Expected results", ";", {
    "0": "signaler perte carte paiement",
    "1": "procedure chercher carte credit avalee",
    "2": "carte visa plafond paiment l augmenter",
})
print("Computed results", ":", dict_of_preprocessed_texts)
Source code in src/cognitivefactory/interactive_clustering/utils/preprocessing.py
def preprocess(
    dict_of_texts: Dict[str, str],
    apply_stopwords_deletion: bool = False,
    apply_parsing_filter: bool = False,
    apply_lemmatization: bool = False,
    spacy_language_model: str = "fr_core_news_md",
) -> Dict[str, str]:
    """
    A method used to preprocess texts.
    It applies simple preprocessing (lowercasing, punctuation deletion, accent replacement, whitespace deletion).
    Some options are available to delete stopwords, apply lemmatization, and filter tokens according to their depth in the dependency tree.

    References:
        - _spaCy_: `Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.`
        - _spaCy_ language models: `https://spacy.io/usage/models`
        - _NLTK_: `Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.`
        - _NLTK_ _'SnowballStemmer'_: `https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball`

    Args:
        dict_of_texts (Dict[str,str]): A dictionary that contains the texts to preprocess.
        apply_stopwords_deletion (bool, optional): The option to delete stopwords. Defaults to `False`.
        apply_parsing_filter (bool, optional): The option to filter tokens based on dependency parsing results. If set, it only keeps `"ROOT"` tokens and their direct children. Defaults to `False`.
        apply_lemmatization (bool, optional): The option to lemmatize tokens. Defaults to `False`.
        spacy_language_model (str, optional): The spaCy language model to use. The model has to be installed. Defaults to `"fr_core_news_md"`.

    Raises:
        ValueError: Raised if the `spacy_language_model` is not installed.

    Returns:
        Dict[str,str]: A dictionary that contains the preprocessed texts.

    Example:
        ```python
        # Import.
        from cognitivefactory.interactive_clustering.utils.preprocessing import preprocess

        # Define data.
        dict_of_texts = {
            "0": "Comment signaler une perte de carte de paiement ?",
            "1": "Quelle est la procédure pour chercher une carte de crédit avalée ?",
            "2": "Ma carte Visa a un plafond de paiment trop bas, puis-je l'augmenter ?",
        }

        # Apply preprocessing.
        dict_of_preprocessed_texts = preprocess(
            dict_of_texts=dict_of_texts,
            apply_stopwords_deletion=True,
            apply_parsing_filter=False,
            apply_lemmatization=False,
            spacy_language_model="fr_core_news_md",
        )

        # Print results.
        print("Expected results", ";", {
            "0": "signaler perte carte paiement",
            "1": "procedure chercher carte credit avalee",
            "2": "carte visa plafond paiment l augmenter",
        })
        print("Computed results", ":", dict_of_preprocessed_texts)
        ```
    """

    # Initialize dictionary of preprocessed texts.
    dict_of_preprocessed_texts: Dict[str, str] = {}

    # Initialize punctuation translator.
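    # Each punctuation mark is mapped to a whitespace, so deleting punctuation never merges adjacent tokens.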
    punctuation_translator = str.maketrans(
        {
            punct: " "
            for punct in (
                ".",
                ",",
                ";",
                ":",
                "!",
                "¡",
                "?",
                "¿",
                "…",
                "•",
                "(",
                ")",
                "{",
                "}",
                "[",
                "]",
                "«",
                "»",
                "^",
                "`",
                "'",
                '"',
                "\\",
                "/",
                "|",
                "-",
                "_",
                "#",
                "&",
                "~",
                "@",
            )
        }
    )

    # Load vectorizer (spacy language model).
    try:
        spacy_nlp = spacy.load(
            name=spacy_language_model,
            disable=[
                # "morphologizer", # Needed for lemmatization.
                # "parser", # Needed for filtering on dependency parsing.
                # "attribute_ruler",  # Need for pos tagging.
                # "lemmatizer", # Needed for lemmatization.
                "ner",  # Not needed
            ],
        )
    except OSError as err:  # `spacy_language_model` is not installed.
        raise ValueError("The `spacy_language_model` '" + str(spacy_language_model) + "' is not installed.") from err

    # Initialize stemmer (stemming is currently not applied).
    # stemmer = SnowballStemmer(language="french")

    # For each text...
    for key, text in dict_of_texts.items():
        # Force string type.
        preprocessed_text: str = str(text)

        # Apply lowercasing.
        preprocessed_text = preprocessed_text.lower()

        # Apply punctuation deletion (before tokenization).
        preprocessed_text = preprocessed_text.translate(punctuation_translator)

        # Apply tokenization and spaCy pipeline.
        tokens = [
            token
            for token in spacy_nlp(preprocessed_text)
            if (
                # Spaces are not allowed.
                not token.is_space
            )
            and (
                # Punctuation and quotes are not allowed.
                not token.is_punct
                and not token.is_quote
            )
            and (
                # If set, stopwords are not allowed.
                (not apply_stopwords_deletion)
                or (not token.is_stop)
            )
            and (
                # If set, only "ROOT" tokens (no ancestor) and their direct children (one ancestor) are kept.
                (not apply_parsing_filter)
                or (len(list(token.ancestors)) <= 1)
            )
        ]

        # Apply retokenization with lemmatization.
        if apply_lemmatization:
            preprocessed_text = " ".join([token.lemma_.strip() for token in tokens])

        # Apply retokenization without lemmatization.
        else:
            preprocessed_text = " ".join([token.text.strip() for token in tokens])

        # Apply accents deletion (after lemmatization).
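        # NFKD decomposition splits accented characters into a base character plus combining marks, which are then dropped.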
        preprocessed_text = "".join(
            [char for char in unicodedata.normalize("NFKD", preprocessed_text) if not unicodedata.combining(char)]
        )

        # Store preprocessed text.
        dict_of_preprocessed_texts[key] = preprocessed_text

    return dict_of_preprocessed_texts
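
The accent-deletion step can also be used on its own; a minimal sketch of the same NFKD-based technique:

# Minimal sketch of the accent-deletion step used in preprocess().
import unicodedata

def remove_accents(text: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining marks.
    return "".join(char for char in unicodedata.normalize("NFKD", text) if not unicodedata.combining(char))

print(remove_accents("procédure avalée"))  # "procedure avalee"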