vectorization
- Name: cognitivefactory.interactive_clustering.utils.vectorization
- Description: Utility methods to apply NLP vectorization.
- Author: Erwan SCHILD
- Created: 17/03/2021
- Licence: CeCILL (https://cecill.info/licences.fr.html)
vectorize(dict_of_texts, vectorizer_type='tfidf', spacy_language_model='fr_core_news_md')
A method used to vectorize texts. Several vectorizers are available: TF-IDF and spaCy language models.
References
- Scikit-learn:
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
- Scikit-learn 'TfidfVectorizer':
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- spaCy:
Honnibal, M. and I. Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
- spaCy language models:
https://spacy.io/usage/models
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dict_of_texts | Dict[str, str] | A dictionary that contains the texts to vectorize. | required |
vectorizer_type | str | The vectorizer type to use. The type can be 'tfidf' or 'spacy'. | 'tfidf' |
spacy_language_model | str | The spaCy language model to use if vectorizer_type is 'spacy'. | 'fr_core_news_md' |
Raises:
Type | Description |
---|---|
ValueError | Raised if vectorizer_type is not one of the supported vectorizer types. |
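The kind of check that leads to this ValueError can be sketched as follows. This is only an illustration: the helper name check_vectorizer_type and the list of accepted values ('tfidf', the documented default, and 'spacy', the value used in the example on this page) are assumptions, not the library's actual code.

```python
from typing import Tuple

# Assumed set of supported values: the documented default and the value used in the example.
SUPPORTED_VECTORIZER_TYPES: Tuple[str, ...] = ("tfidf", "spacy")


def check_vectorizer_type(vectorizer_type: str) -> str:
    """Hypothetical helper: return the value unchanged, or raise ValueError if unsupported."""
    if vectorizer_type not in SUPPORTED_VECTORIZER_TYPES:
        raise ValueError(
            f"Unsupported vectorizer_type {vectorizer_type!r}; "
            f"expected one of {SUPPORTED_VECTORIZER_TYPES}."
        )
    return vectorizer_type


check_vectorizer_type("tfidf")  # accepted
try:
    check_vectorizer_type("word2vec")  # not in the assumed list
except ValueError as error:
    print("Rejected:", error)
```

Validating the argument before any model is loaded means a misspelled vectorizer name fails immediately rather than after an expensive setup step.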
Returns:
Type | Description |
---|---|
Dict[str, csr_matrix] | A dictionary that contains the computed vectors. |
Example
# Import.
from cognitivefactory.interactive_clustering.utils.vectorization import vectorize

# Define data.
dict_of_texts = {
    "0": "comment signaler une perte de carte de paiement",
    "1": "quelle est la procedure pour chercher une carte de credit avalee",
    "2": "ma carte visa a un plafond de paiment trop bas puis je l augmenter",
}

# Apply vectorization.
dict_of_vectors = vectorize(
    dict_of_texts=dict_of_texts,
    vectorizer_type="spacy",
    spacy_language_model="fr_core_news_md",
)

# Print results.
print("Computed results", ":", dict_of_vectors)
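The example above uses the spaCy vectorizer. To make the TF-IDF option more concrete, here is a minimal, standard-library-only sketch of TF-IDF weighting over a dictionary of texts. It is not the library's implementation: the References list scikit-learn's TfidfVectorizer, whose tokenization and weighting may differ in details, and vectorize returns sparse matrices rather than plain dictionaries. The function name tfidf_sketch, the whitespace tokenization, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict


def tfidf_sketch(dict_of_texts: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper: map each text ID to an L2-normalized TF-IDF vector.

    Vectors are stored as {token: weight} dictionaries, a plain-Python
    stand-in for a sparse row.
    """
    # Tokenize on whitespace (assumption: texts are already preprocessed).
    tokens_by_id = {text_id: text.split() for text_id, text in dict_of_texts.items()}
    n_docs = len(tokens_by_id)

    # Document frequency: number of texts in which each token occurs.
    doc_freq = Counter()
    for tokens in tokens_by_id.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for text_id, tokens in tokens_by_id.items():
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so every weight stays positive.
        weights = {
            token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
            for token, count in counts.items()
        }
        # L2-normalize so that texts of different lengths are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors[text_id] = {t: w / norm for t, w in weights.items()} if norm > 0 else {}
    return vectors


dict_of_texts = {
    "0": "comment signaler une perte de carte de paiement",
    "1": "quelle est la procedure pour chercher une carte de credit avalee",
    "2": "ma carte visa a un plafond de paiment trop bas puis je l augmenter",
}
dict_of_vectors = tfidf_sketch(dict_of_texts)

# "carte" occurs in every text, so it is weighted lower than a rarer token such as "perte".
print(dict_of_vectors["0"]["carte"] < dict_of_vectors["0"]["perte"])
```

Because tokens shared by all texts receive the smallest weights, the resulting vectors emphasize the words that distinguish one text from another, which is the property a clustering step relies on.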
Source code in src\cognitivefactory\interactive_clustering\utils\vectorization.py