-
Hi, thanks for your suggestion! As long as it uses permissive licenses (for its entire supply chain), a contribution would be great. The simplest approach imho is to create a custom EntityRecognizer rather than an NlpEngine, but both options are possible. In any case, it should be an optional dependency.
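To make the "optional dependency" point concrete, here is a rough sketch of the import guard and of the span conversion a recognizer would need. It is only an illustration under assumptions, not Presidio or Natasha code: `Finding`, `SpanLike`, `require_natasha`, `spans_to_findings`, the tag mapping, and the 0.85 score are all hypothetical. A real recognizer would presumably subclass `presidio_analyzer.EntityRecognizer` and return `RecognizerResult` objects instead of `Finding`. The sketch also assumes Natasha NER spans expose `start`, `stop`, and `type`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

try:
    # Optional dependency: Presidio should still import when Natasha is absent.
    import natasha
except ImportError:
    natasha = None

# Assumed mapping from Natasha NER tags to Presidio-style entity names.
NATASHA_TO_PRESIDIO = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


@dataclass
class SpanLike:
    """Hypothetical stand-in for a Natasha NER span (start, stop, type)."""
    start: int
    stop: int
    type: str


@dataclass
class Finding:
    """Hypothetical stand-in for a Presidio RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def require_natasha() -> None:
    """Fail with a clear message only when the recognizer is actually used."""
    if natasha is None:
        raise ImportError(
            "Natasha is not installed; install it with `pip install natasha` "
            "to use the Russian recognizer."
        )


def spans_to_findings(
    spans: Iterable[SpanLike],
    requested: Optional[List[str]] = None,
    score: float = 0.85,  # placeholder confidence, not a measured value
) -> List[Finding]:
    """Convert NER spans to findings, skipping unmapped or unrequested types."""
    findings = []
    for span in spans:
        entity = NATASHA_TO_PRESIDIO.get(span.type)
        if entity is None:
            continue  # tag has no mapping in this sketch
        if requested is not None and entity not in requested:
            continue  # caller did not ask for this entity type
        findings.append(Finding(entity, span.start, span.stop, score))
    return findings
```

With this shape, the import error surfaces only when someone actually enables the recognizer, so users who never install Natasha are unaffected.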
-
Thanks for the helpful feedback! I'll ensure Natasha and its dependencies use permissive licenses and will start working on the PR. Regarding the EntityRecognizer vs. NlpEngine approach: I'll take a closer look at Natasha's other capabilities (like tokenization, lemmatization). If they align well with the broader functionality of Presidio's NlpArtifacts, it might be beneficial to implement the full NlpEngine. If not, I'll stick with the simpler, self-contained EntityRecognizer as you suggested. I'll keep you updated on the approach I choose.
-
@omri374 Natasha is a full-featured NLP pipeline, like spaCy or Stanza. It provides not only NER but also tokenization, lemmatization, and morphological analysis. Implementing Natasha as a self-contained EntityRecognizer would require running a separate NLP engine (spaCy by default) alongside it, leading to redundant processing (e.g., tokenizing the same text twice). Alternatively, the recognizer would have to rely on the preprocessing results from the main NlpEngine, which may give worse results than Natasha's own native preprocessing. Therefore, I believe the NlpEngine approach is better suited here. This would:
I will proceed with the NlpEngine + Recognizer implementation, ensuring that Natasha is handled as an optional dependency, as you recommended. Please let me know if you have any objections.
-
Hello Presidio team,
I would like to propose adding support for the Natasha NLP engine to Presidio.
Natasha (https://github.com/natasha/natasha) is a powerful library for Russian NLP optimized for CPU performance. I have started implementing this in my own fork for personal use because I'm thrilled with the functionality Presidio provides, but I found it was missing an engine like this for the Russian language.
Would the team be open to a pull request for this feature? I'm happy to contribute and can prepare a PR with tests and documentation.