nlpmed_engine.components package
Submodules
nlpmed_engine.components.duplicate_checker module
Duplicate checking module for NLPMed-Engine.
This module provides functionality to detect and handle duplicate sentences within medical notes using MinHash and Locality-Sensitive Hashing (LSH). The DuplicateChecker class offers methods to process notes, identify duplicate sentences, and manage LSH states.
- Classes:
DuplicateChecker: Class for checking and handling duplicate sentences within notes.
- class nlpmed_engine.components.duplicate_checker.DuplicateChecker(num_perm: int = 256, sim_threshold: float = 0.9, length_threshold: int = 50)
Bases: object
Class for checking and managing duplicate sentences in medical notes.
This class uses MinHash and MinHashLSH to identify similar or duplicate sentences within notes based on a defined similarity threshold. It allows for adding sentences to the LSH structure, querying for duplicates, and clearing/resetting the LSH state.
- Attributes:
num_perm (int): Number of permutations used in MinHash.
sim_threshold (float): Similarity threshold for considering sentences as duplicates.
length_threshold (int): Minimum length of sentences to be considered for duplication checking.
lsh (MinHashLSH): The Locality-Sensitive Hashing structure used to store and query MinHash values.
- add_sentence(sentence: Sentence) -> None
Adds a sentence to the LSH structure for future duplicate detection.
- Args:
sentence (Sentence): The sentence to be added to the LSH structure.
- clear_lsh(num_perm: int | None = None, sim_threshold: float | None = None, **_: Any) -> None
Clears and reinitializes the LSH structure with optional new parameters.
- Args:
num_perm (int | None): Optional new number of permutations for MinHash.
sim_threshold (float | None): Optional new similarity threshold for LSH.
- get_minhash(sentence: Sentence) -> MinHash
Generates a MinHash object from the words in a sentence.
- Args:
sentence (Sentence): The sentence from which to generate the MinHash.
- Returns:
MinHash: The MinHash object representing the sentence.
- is_duplicate(sentence: Sentence) -> bool
Checks if a given sentence is a duplicate by querying the LSH structure.
- Args:
sentence (Sentence): The sentence to check for duplication.
- Returns:
bool: True if the sentence is considered a duplicate, False otherwise.
- process(note: Note, length_threshold: int | None = None, **_: Any) -> Note
Processes a note to check for duplicate sentences based on the defined thresholds.
- Args:
note (Note): The note object containing sections and sentences to be processed.
length_threshold (int | None): Optional length threshold to override the default.
- Returns:
Note: The processed note with sentences marked as duplicates if applicable.
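To make the MinHash idea concrete, here is a minimal sketch that estimates Jaccard similarity between two sentences using only the standard library. It is not the package's implementation: the helper names (tokenize, minhash_signature, estimated_jaccard), the whitespace tokenization, and the SHA-1 based hashing are all assumptions chosen for illustration, and it omits the LSH index entirely.

```python
import hashlib


def tokenize(sentence: str) -> set:
    """Lowercase and split on whitespace (an assumed, simplified tokenization)."""
    return set(sentence.lower().split())


def minhash_signature(tokens: set, num_perm: int = 128) -> list:
    """Build a MinHash signature: for each seed, keep the smallest token hash."""
    signature = []
    for seed in range(num_perm):
        smallest = min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()[:8], "big"
            )
            for token in tokens
        )
        signature.append(smallest)
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


first = minhash_signature(tokenize("Patient denies chest pain and shortness of breath today"))
repeat = minhash_signature(tokenize("patient denies chest pain and shortness of breath today"))
other = minhash_signature(tokenize("Follow up with cardiology in two weeks"))

print(estimated_jaccard(first, repeat))  # identical token sets
print(estimated_jaccard(first, other))   # sentences with no shared tokens
```

A sentence would be treated as a duplicate when its estimated similarity to a previously seen sentence reaches the similarity threshold; an LSH index avoids comparing every pair of signatures directly.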
nlpmed_engine.components.encoding_fixer module
Encoding fixer module for NLPMed-Engine.
This module provides functionality to fix encoding issues in medical notes using the ftfy library. The EncodingFixer class processes notes to correct common encoding problems, ensuring text is properly readable and standardized.
- Classes:
EncodingFixer: Class for fixing encoding issues in notes.
- class nlpmed_engine.components.encoding_fixer.EncodingFixer
Bases: object
Class for fixing encoding issues in medical notes.
This class uses the ftfy library to automatically correct encoding errors in the text of medical notes, making the text more readable and consistent.
- Methods:
process: Fixes encoding issues in the text of a note.
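As an illustration of the kind of problem being fixed, the sketch below reproduces one common encoding error (UTF-8 bytes decoded as Latin-1) and reverses it with the standard library. The function names are hypothetical, and this handles only a single known mis-decoding; ftfy's fix_text detects and repairs a much wider range of cases automatically.

```python
def make_mojibake(text: str) -> str:
    """Simulate a common error: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(text: str) -> str:
    """Reverse that single, known mis-decoding step."""
    return text.encode("latin-1").decode("utf-8")


original = "caf\u00e9"              # "cafe" with an acute accent on the e
garbled = make_mojibake(original)   # the accented letter becomes two characters
repaired = undo_mojibake(garbled)

assert len(garbled) == len(original) + 1
assert repaired == original
```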
nlpmed_engine.components.joiner module
Joiner module for NLPMed-Engine.
This module provides functionality to join important sentences and sections of medical notes into a cohesive preprocessed text. The Joiner class allows customization of sentence and section delimiters to structure the joined text appropriately.
- Classes:
Joiner: Class for joining sentences and sections within a note.
- class nlpmed_engine.components.joiner.Joiner(sentence_delimiter: str = '\n', section_delimiter: str = '\n\n')
Bases: object
Class for joining important sentences and sections within medical notes.
This class processes a note by joining important sentences within each section into a single string, then combines these joined sections into a final preprocessed text using specified delimiters.
- Attributes:
sentence_delimiter (str): Delimiter used to join sentences within a section.
section_delimiter (str): Delimiter used to join sections within the note.
- process(note: Note, sentence_delimiter: str | None = None, section_delimiter: str | None = None) -> Note
Processes a note to join important sentences and sections into preprocessed text.
- Args:
note (Note): The note object containing sections and sentences to be joined.
sentence_delimiter (str | None): Optional custom delimiter for joining sentences.
section_delimiter (str | None): Optional custom delimiter for joining sections.
- Returns:
Note: The processed note with the preprocessed text formed by joining sentences and sections.
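The two-level join described above can be sketched with plain lists of strings standing in for Note, Section, and Sentence objects. The function name join_note and the choice to skip sections with no sentences are assumptions made for this example, not documented behavior.

```python
def join_note(
    sections: list,
    sentence_delimiter: str = "\n",
    section_delimiter: str = "\n\n",
) -> str:
    """Join sentences within each section, then join the sections."""
    joined_sections = [
        sentence_delimiter.join(sentences) for sentences in sections if sentences
    ]
    return section_delimiter.join(joined_sections)


sections = [
    ["Chest pain for two days.", "No fever."],
    [],  # a section with no important sentences is skipped in this sketch
    ["Start aspirin."],
]
print(repr(join_note(sections)))
print(repr(join_note(sections, sentence_delimiter=" ", section_delimiter=" | ")))
```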
nlpmed_engine.components.ml_inference module
Machine Learning Inference module for NLPMed-Engine.
This module provides functionality for performing machine learning-based inference on medical notes using a pre-trained text classification model. The MLInference class uses the Hugging Face Transformers library to predict labels and scores for text data.
- Classes:
MLInference: Class for performing ML inference on notes and patients.
- class nlpmed_engine.components.ml_inference.MLInference(device: str = 'cpu', ml_model_path: Path | str = '', ml_tokenizer_path: Path | str = '', max_length: int = 512, *, use_preped_text: bool = True)
Bases: object
Class for performing machine learning inference on medical notes.
This class initializes a text classification pipeline using a specified model and tokenizer. It provides methods for predicting labels and scores for text data within notes and patients, allowing for batch processing and customization of input parameters.
- Attributes:
use_preped_text (bool): Whether to use preprocessed text for inference.
pipe (pipeline): The Hugging Face Transformers pipeline for text classification.
- process(note: Note, *, use_preped_text: bool | None = None, **_: Any) -> Note
Performs inference on a single note, predicting a label and score.
- Args:
note (Note): The note object containing text to be classified.
use_preped_text (bool | None): Optional override for using preprocessed text.
- Returns:
Note: The note object with predicted label and score updated.
- process_batch_patients(patients: list[Patient], *, use_preped_text: bool | None = None) -> list[Patient]
Performs batch inference on a list of patients, predicting labels and scores for their notes.
- Args:
patients (list[Patient]): A list of patient objects containing notes to be classified.
use_preped_text (bool | None): Optional override for using preprocessed text.
- Returns:
list[Patient]: The list of patients with their notes updated with predicted labels and scores.
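The per-call override of use_preped_text can be illustrated without loading a model. In the sketch below, InferenceSketch, toy_classifier, and the dictionary keys (text, preprocessed_text, predicted_label, predicted_score) are all hypothetical stand-ins; the toy classifier returns a dict with label and score keys, loosely modeled on the output of a text-classification pipeline.

```python
from typing import Callable, Optional


class InferenceSketch:
    """Hypothetical stand-in that mirrors the override behavior described above."""

    def __init__(self, classify: Callable[[str], dict], *, use_preped_text: bool = True) -> None:
        self.classify = classify
        self.use_preped_text = use_preped_text

    def process(self, note: dict, *, use_preped_text: Optional[bool] = None) -> dict:
        # A per-call value wins; None falls back to the constructor default.
        use_preped = self.use_preped_text if use_preped_text is None else use_preped_text
        text = note["preprocessed_text"] if use_preped else note["text"]
        result = self.classify(text)
        note["predicted_label"] = result["label"]
        note["predicted_score"] = result["score"]
        return note


def toy_classifier(text: str) -> dict:
    """Toy keyword rule standing in for a real text-classification model."""
    positive = "chest pain" in text.lower()
    return {"label": "POSITIVE" if positive else "NEGATIVE", "score": 1.0}


note = {
    "text": "Family history: father had chest pain. Patient feels well.",
    "preprocessed_text": "Patient feels well.",
}
engine = InferenceSketch(toy_classifier)
print(engine.process(dict(note))["predicted_label"])                         # uses preprocessed text
print(engine.process(dict(note), use_preped_text=False)["predicted_label"])  # uses raw text
```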
nlpmed_engine.components.note_filter module
Note filtering module for NLPMed-Engine.
This module provides functionality to filter medical notes based on specified keywords. The NoteFilter class checks whether a note contains any of the specified words and returns the note if it matches the criteria.
- Classes:
NoteFilter: Class for filtering notes based on keyword presence.
- class nlpmed_engine.components.note_filter.NoteFilter(words_to_search: list[str] | None = None)
Bases: object
Class for filtering medical notes based on the presence of specified keywords.
This class uses regular expressions to identify whether a note contains any of the specified words. If a match is found, the note is returned; otherwise, it is filtered out.
- Attributes:
words_to_search (list[str] | None): List of keywords to search for in notes.
- process(note: Note, words_to_search: list[str] | None = None) -> Note | None
Processes a note to check if it contains specified keywords.
- Args:
note (Note): The note object to be filtered.
words_to_search (list[str] | None): Optional list of keywords to search for, overriding the default list set during initialization.
- Returns:
Note | None: The note if it contains the keywords, otherwise None.
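A minimal sketch of keyword-based filtering with an optional return value follows. It works on a plain string rather than a Note, and the function name filter_note, the whole-word matching, and the case-insensitive search are assumptions; the keyword list is assumed to be non-empty.

```python
import re
from typing import Optional


def filter_note(text: str, words_to_search: list) -> Optional[str]:
    """Return the text if any keyword occurs as a whole word, else None."""
    # Assumes words_to_search is non-empty; keywords are escaped so they match literally.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in words_to_search) + r")\b",
        flags=re.IGNORECASE,
    )
    return text if pattern.search(text) else None


keywords = ["migraine", "seizure"]
print(filter_note("Patient reports Migraine since Monday.", keywords))
print(filter_note("Routine follow-up, no complaints.", keywords))
```

Callers need to handle the None case explicitly, for example by dropping filtered notes before passing them to later components.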
nlpmed_engine.components.pattern_replacer module
Pattern replacer module for NLPMed-Engine.
This module provides functionality to replace specified patterns within the text of medical notes. The PatternReplacer class allows for defining patterns and target replacements to standardize or clean up the text data.
- Classes:
PatternReplacer: Class for replacing patterns in notes with specified target strings.
- class nlpmed_engine.components.pattern_replacer.PatternReplacer(pattern: str | None = None, target: str | None = None)
Bases: object
Class for replacing patterns in the text of medical notes.
This class uses regular expressions to find and replace a specified pattern within the text of notes. Both the pattern and its target replacement are configurable, which helps standardize note content.
- Attributes:
pattern (str | None): Regex pattern to search for in the text.
target (str | None): The replacement string for text that matches the pattern.
- process(note: Note, pattern: str | None = None, target: str | None = None) -> Note
Processes a note by replacing text that matches the pattern with the specified target string.
- Args:
note (Note): The note object containing text to be modified.
pattern (str | None): Optional pattern to override the default pattern.
target (str | None): Optional target string to override the default replacement.
- Returns:
Note: The processed note with text matching the pattern replaced by the target string.
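As a rough illustration of regex-based replacement, the sketch below collapses runs of whitespace into a single space. The function name replace_pattern is hypothetical and it operates on a plain string rather than a Note; it assumes every match is replaced, which is the default behavior of re.sub.

```python
import re


def replace_pattern(text: str, pattern: str, target: str) -> str:
    """Replace every match of the regex pattern with the target string."""
    return re.sub(pattern, target, text)


text = "BP  120/80   recorded\t\tat triage"
print(replace_pattern(text, r"\s+", " "))
```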
nlpmed_engine.components.section_filter module
Section filtering module for NLPMed-Engine.
This module provides functionality to filter sections of medical notes based on specified inclusion and exclusion keywords. The SectionFilter class uses regular expressions to identify and retain important sections while optionally allowing fallback behavior.
- Classes:
SectionFilter: Class for filtering sections within notes based on inclusion and exclusion rules.
- class nlpmed_engine.components.section_filter.SectionFilter(section_inc_list: list[str] | None = None, section_exc_list: list[str] | None = None, *, fallback: bool = False)
Bases: object
Class for filtering sections of medical notes based on inclusion and exclusion criteria.
This class allows for defining inclusion and exclusion lists of keywords to filter sections of a note. Sections that match inclusion criteria are retained, while sections matching exclusion criteria are filtered out, unless fallback behavior is enabled.
- Attributes:
section_inc_list (list[str] | None): List of keywords for including sections.
section_exc_list (list[str] | None): List of keywords for excluding sections.
fallback (bool): Whether to enable fallback behavior if no sections match the criteria.
- process(note: Note, section_inc_list: list[str] | None = None, section_exc_list: list[str] | None = None, *, fallback: bool | None = None) -> Note
Processes a note by filtering its sections based on inclusion and exclusion keywords.
- Args:
note (Note): The note object containing sections to be filtered.
section_inc_list (list[str] | None): Optional list of inclusion keywords to override the default list.
section_exc_list (list[str] | None): Optional list of exclusion keywords to override the default list.
fallback (bool | None): Optional fallback behavior to override the default setting.
- Returns:
Note: The processed note with filtered sections based on the defined rules.
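The exact interaction between the inclusion list, the exclusion list, and fallback is not spelled out here, so the following sketch encodes one plausible reading, stated as an assumption: a section is kept when it matches an inclusion keyword and no exclusion keyword, and if nothing is kept and fallback is enabled, every non-excluded section is kept instead. The function name filter_sections and the use of plain strings for sections are likewise hypothetical.

```python
import re


def filter_sections(sections: list, inc_list: list, exc_list: list, *, fallback: bool = False) -> list:
    """Keep sections by keyword rules (assumed semantics, see the lead-in)."""

    def matches(words: list, text: str) -> bool:
        # Whole-word, case-insensitive keyword match.
        return any(
            re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
            for word in words
        )

    kept = [s for s in sections if matches(inc_list, s) and not matches(exc_list, s)]
    if not kept and fallback:
        # Assumed fallback: keep everything that is not explicitly excluded.
        kept = [s for s in sections if not matches(exc_list, s)]
    return kept


sections = [
    "History of Present Illness: chest pain.",
    "Family History: father had MI.",
    "Plan: start aspirin.",
]
print(filter_sections(sections, ["present illness", "plan"], ["family history"]))
print(filter_sections(sections, ["allergies"], ["family history"]))
print(filter_sections(sections, ["allergies"], ["family history"], fallback=True))
```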
nlpmed_engine.components.section_splitter module
Section splitter module for NLPMed-Engine.
This module provides functionality to split the text of medical notes into sections based on a specified delimiter. The SectionSplitter class facilitates the division of note text into manageable sections for further processing.
- Classes:
SectionSplitter: Class for splitting note text into sections using a specified delimiter.
- class nlpmed_engine.components.section_splitter.SectionSplitter(delimiter: str = '\n\n')
Bases: object
Class for splitting the text of medical notes into sections.
This class uses a specified delimiter to divide the text of a note into sections, creating Section objects for each part. Sections are marked with their respective start and end indices relative to the original note text.
- Attributes:
delimiter (str): The delimiter used to split the note text into sections.
- process(note: Note, delimiter: str | None = None) -> Note
Processes a note by splitting its text into sections based on the specified delimiter.
- Args:
note (Note): The note object containing text to be split into sections.
delimiter (str | None): Optional delimiter to override the default delimiter.
- Returns:
Note: The processed note with its text split into Section objects.
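Tracking start and end indices while splitting can be sketched as follows. Tuples of (start, end, text) stand in for Section objects, and the function name split_sections is hypothetical; in this sketch, empty parts produced by consecutive delimiters are kept rather than dropped.

```python
def split_sections(text: str, delimiter: str = "\n\n") -> list:
    """Split text on the delimiter and record (start, end, text) for each part."""
    sections = []
    start = 0
    for part in text.split(delimiter):
        end = start + len(part)
        sections.append((start, end, part))
        start = end + len(delimiter)  # skip over the delimiter itself
    return sections


note_text = "HPI: chest pain.\n\nPlan: aspirin."
for start, end, part in split_sections(note_text):
    # Each recorded span slices back to the same text in the original note.
    assert note_text[start:end] == part
    print(start, end, repr(part))
```

Keeping offsets relative to the original text lets later components map a section back to its position in the note.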
nlpmed_engine.components.sentence_expander module
Sentence expander module for NLPMed-Engine.
This module provides functionality to expand short sentences within medical notes by combining them with neighboring sentences until a specified length threshold is met. The SentenceExpander class helps ensure that important short sentences are contextually enriched with adjacent content.
- Classes:
SentenceExpander: Class for expanding short sentences in notes by merging them with surrounding sentences.
- class nlpmed_engine.components.sentence_expander.SentenceExpander(length_threshold: int = 50)
Bases: object
Class for expanding short sentences within sections of medical notes.
This class processes sentences within sections of a note, expanding short sentences by merging them with adjacent sentences until a specified length threshold is met. This approach enhances the contextual richness of important sentences.
- Attributes:
length_threshold (int): The minimum length a sentence should reach; sentences shorter than this are candidates for expansion.
- expand_section_sentences(sentences: list[Sentence], important_indices: list[int], length_threshold: int) -> list[int]
Expands short sentences in a list by merging them with adjacent sentences.
- Args:
sentences (list[Sentence]): A list of sentences within a section to be expanded.
important_indices (list[int]): Indices of the sentences in the list that are flagged as important.
length_threshold (int): The minimum length a sentence should have before it is considered sufficiently long.
- Returns:
list[int]: The indices of the sentences that make up the expanded result once the length threshold is met.
- process(note: Note, length_threshold: int | None = None) -> Note
Processes a note by expanding short sentences within its sections.
- Args:
note (Note): The note object containing sections and sentences to be expanded.
length_threshold (int | None): Optional length threshold to override the default.
- Returns:
Note: The processed note with expanded sentences in important sections.
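One way to picture the expansion is as a window that grows around each important sentence until the combined length reaches the threshold. The sketch below is an assumption-laden illustration, not the package's algorithm: the function name expand_indices, the use of plain strings, the character-count length measure, and the order in which neighbors are added (next sentence first, then previous) are all choices made for this example.

```python
def expand_indices(sentences: list, important_indices: list, length_threshold: int) -> list:
    """Return sorted indices covering each important sentence plus enough neighbors."""
    selected = set()
    last = len(sentences) - 1
    for index in important_indices:
        left = right = index
        length = len(sentences[index])
        # Grow the window outward until it is long enough or the section is exhausted.
        while length < length_threshold and (left > 0 or right < last):
            if right < last:
                right += 1
                length += len(sentences[right])
            if length < length_threshold and left > 0:
                left -= 1
                length += len(sentences[left])
        selected.update(range(left, right + 1))
    return sorted(selected)


sentences = [
    "Seen in clinic today.",
    "Denies pain.",
    "Vitals stable and afebrile on exam.",
    "Plan unchanged.",
]
print(expand_indices(sentences, [1], 40))    # short sentence borrows a neighbor
print(expand_indices(sentences, [1], 10))    # already long enough
print(expand_indices(sentences, [1], 1000))  # threshold never met, whole section used
```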
nlpmed_engine.components.sentence_filter module
Sentence filter module for NLPMed-Engine.
This module provides functionality to filter and flag important sentences within medical notes based on specified keywords. The SentenceFilter class uses regular expressions to identify sentences that contain the target words, marking them as important.
- Classes:
SentenceFilter: Class for filtering sentences in notes based on keyword presence.
- class nlpmed_engine.components.sentence_filter.SentenceFilter(words_to_search: list[str] | None = None)
Bases: object
Class for filtering and marking important sentences within sections of medical notes.
This class identifies sentences that contain specified keywords, flagging them as important for further processing. The filtering is achieved using compiled regular expressions to match target words in the sentences.
- Attributes:
words_to_search (list[str] | None): List of keywords to search for in sentences.
- process(note: Note, words_to_search: list[str] | None = None) -> Note
Processes a note by filtering its sentences based on the specified keywords.
- Args:
note (Note): The note object containing sections and sentences to be filtered.
words_to_search (list[str] | None): Optional list of keywords to override the default list.
- Returns:
Note: The processed note with sentences marked as important if they contain the keywords.
nlpmed_engine.components.sentence_segmenter module
Sentence segmenter module for NLPMed-Engine.
This module provides functionality to segment text into sentences using spaCy. The SentenceSegmenter class processes text from medical notes, splitting sections into sentences with accurate start and end indices, ensuring precise sentence-level segmentation.
- Classes:
SentenceSegmenter: Class for segmenting text into sentences using a spaCy NLP model.
- class nlpmed_engine.components.sentence_segmenter.SentenceSegmenter(model_name: str = 'en_core_sci_lg', batch_size: int = 10)
Bases: object
Class for segmenting text from medical notes into sentences using a spaCy NLP model.
This class utilizes a spaCy pipeline to split sections of notes into sentences, capturing their start and end positions within the text. It supports processing individual notes and batches of patients with configurable NLP models and batch sizes.
- Attributes:
nlp (spacy.Language): The spaCy NLP pipeline used for sentence segmentation.
batch_size (int): The batch size used when processing text in parallel.
- process(note: Note, **_: Any) -> Note
Processes a single note, segmenting its sections into sentences.
- Args:
note (Note): The note object containing sections to be segmented into sentences.
- Returns:
Note: The processed note with sections segmented into sentences.
- process_batch_patients(patients: list[Patient]) -> list[Patient]
Processes a batch of patients, segmenting the sections of their notes into sentences.
- Args:
patients (list[Patient]): A list of patient objects containing notes and sections to be segmented.
- Returns:
list[Patient]: The list of patients with their notes’ sections segmented into sentences.
nlpmed_engine.components.word_masker module
Word masker module for NLPMed-Engine.
This module provides functionality to mask specified words within the text of medical notes. The WordMasker class allows for defining words to be masked and the character used for masking, ensuring sensitive or unwanted terms are obscured in the text.
- Classes:
WordMasker: Class for masking specified words in the text of medical notes.
- class nlpmed_engine.components.word_masker.WordMasker(words_to_mask: list[str] | None = None, mask_char: str = '*')
Bases: object
Class for masking specified words within the text of medical notes.
This class uses regular expressions to identify and replace specified words with a masking character, enhancing privacy or readability by obscuring sensitive or unwanted terms.
- Attributes:
words_to_mask (list[str]): List of words to be masked in the text.
mask_char (str): The character used to replace each character of the masked words.
- process(note: Note, words_to_mask: list[str] | None = None, mask_char: str | None = None) -> Note
Processes a note by masking specified words in its text.
- Args:
note (Note): The note object containing text to be masked.
words_to_mask (list[str] | None): Optional list of words to override the default words to mask.
mask_char (str | None): Optional masking character to override the default.
- Returns:
Note: The processed note with specified words masked.
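Length-preserving masking, where each character of a matched word is replaced by the mask character, can be sketched like this. The function name mask_words is hypothetical, it works on a plain string rather than a Note, and the whole-word, case-insensitive matching is an assumption.

```python
import re


def mask_words(text: str, words_to_mask: list, mask_char: str = "*") -> str:
    """Replace each listed word with a run of mask_char of the same length."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in words_to_mask) + r")\b",
        flags=re.IGNORECASE,
    )
    # The replacement keeps the text length unchanged, so character offsets stay valid.
    return pattern.sub(lambda match: mask_char * len(match.group(0)), text)


print(mask_words("Dr. Smith saw the patient; smith ordered labs.", ["Smith"]))
print(mask_words("Dr. Smith saw the patient.", ["Smith"], mask_char="#"))
```

Because the masked text has the same length as the original, section and sentence start and end indices computed earlier remain aligned.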