E2E API¶
Main Summarizer Class¶
Transcribe¶
transcribe_main¶
Converts a .srt, .vtt, or .sbv file saved at transcript_path to a Python string. Optionally removes speaker entries by removing everything before “: ” in each subtitle cell.
- lecture2notes.end_to_end.transcribe.transcribe_main.check_transcript(generated_transcript, ground_truth_transcript)[source]¶
Compares generated_transcript to ground_truth_transcript to check for accuracy using the spacy similarity measurement. Requires the “en_vectors_web_lg” model to use “real” word vectors.
- lecture2notes.end_to_end.transcribe.transcribe_main.chunk_by_silence(audio_path, output_path, silence_thresh_offset=5, min_silence_len=2000)[source]¶
Split an audio file into chunks on areas of silence
- Parameters
audio_path (str) – path to a wave file
output_path (str) – path to a folder where wave file chunks will be saved
silence_thresh_offset (int, optional) – a value subtracted from the mean dB volume of the file. Default is 5.
min_silence_len (int, optional) – the length in milliseconds in which there must be no sound in order to be marked as a splitting point. Default is 2000.
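A minimal usage sketch (the file paths are hypothetical): split a recording on quiet regions, then transcribe each chunk with process_chunks(), documented below.

```python
from lecture2notes.end_to_end.transcribe import transcribe_main

# Split the recording wherever the volume stays below the silence
# threshold for at least `min_silence_len` milliseconds.
transcribe_main.chunk_by_silence(
    "lecture.wav",            # hypothetical input wave file
    "chunks/",                # folder that receives the wave-file chunks
    silence_thresh_offset=5,  # dB subtracted from the file's mean volume
    min_silence_len=2000,     # 2 seconds of silence marks a split point
)

# Transcribe every chunk in the folder (see process_chunks() below);
# assumed here to return the combined transcript.
transcript = transcribe_main.process_chunks("chunks/", method="sphinx")
```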
- lecture2notes.end_to_end.transcribe.transcribe_main.chunk_by_speech(audio_path, output_path=None, aggressiveness=1, desired_sample_rate=None)[source]¶
Uses the python interface to the WebRTC Voice Activity Detector (VAD) API to create chunks of audio that contain voice. The VAD that Google developed for the WebRTC project is reportedly one of the best available, being fast, modern and free.
- Parameters
audio_path (str) – path to the audio file to process
output_path (str, optional) – path to save the chunk files. If not specified then no wave files will be written to disk and the raw pcm data will be returned. Defaults to None.
aggressiveness (int, optional) – determines how aggressively non-speech is filtered out. Must be an integer between 0 and 3. Defaults to 1.
desired_sample_rate (int, optional) – the sample rate of the returned segments. The default is the sample rate of the input audio file. Defaults to None.
- Returns
(segments, sample_rate, audio_length). See vad_segment_generator().
- Return type
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.convert_deepspeech_json(transcript_json)[source]¶
Convert a deepspeech json transcript from a letter-by-letter format to word-by-word.
- Parameters
transcript_json (dict or str) – The json format transcript as a dictionary or a json string, which will be loaded using json.loads().
- Returns
The word-by-word transcript json.
- Return type
dict
- lecture2notes.end_to_end.transcribe.transcribe_main.convert_samplerate(audio_path, desired_sample_rate)[source]¶
Use SoX to resample wave files to 16 bits, 1 channel, and desired_sample_rate sample rate.
- Parameters
audio_path (str) – path to wave file to process
desired_sample_rate (int) – sample rate in hertz to convert the wave file to
- Returns
(desired_sample_rate, output) where desired_sample_rate is the new sample rate and output is the newly resampled pcm data
- Return type
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.extract_audio(video_path, output_path)[source]¶
Extracts audio from video at video_path and saves it to output_path.
- lecture2notes.end_to_end.transcribe.transcribe_main.get_youtube_transcript(video_id, output_path, use_youtube_dl=True)[source]¶
Downloads the transcript for video_id and saves it to output_path.
- lecture2notes.end_to_end.transcribe.transcribe_main.load_deepspeech_model(model_dir, beam_width=500, lm_alpha=None, lm_beta=None)[source]¶
Load the deepspeech model from model_dir.
- Parameters
model_dir (str) – path to folder containing the “.pbmm” and optionally “.scorer” files
beam_width (int, optional) – beam width for decoding. Default is 500.
lm_alpha (float, optional) – alpha parameter of the language model. Default is None.
lm_beta (float, optional) – beta parameter of the language model. Default is None.
- Returns
the loaded deepspeech model
- Return type
deepspeech.Model
- lecture2notes.end_to_end.transcribe.transcribe_main.load_wav2vec_model(model='facebook/wav2vec2-base-960h', tokenizer='facebook/wav2vec2-base-960h', **kwargs)[source]¶
- lecture2notes.end_to_end.transcribe.transcribe_main.metadata_to_json(candidate_transcript)[source]¶
Helper function to convert metadata tokens from deepspeech to a dictionary.
- lecture2notes.end_to_end.transcribe.transcribe_main.metadata_to_string(metadata)[source]¶
Helper function to convert metadata tokens from deepspeech to a string.
- lecture2notes.end_to_end.transcribe.transcribe_main.process_chunks(chunk_dir, method='sphinx', model_dir=None)[source]¶
Performs transcription on every noise activity chunk (audio file) created by chunk_by_silence() in a directory.
- lecture2notes.end_to_end.transcribe.transcribe_main.process_segments(segments, model, audio_length='unknown', method='deepspeech', do_segment_sentences=True)[source]¶
Transcribe a list of byte strings containing pcm data
- Parameters
segments (list) – list of byte strings containing pcm data (generated by chunk_by_speech())
model (deepspeech model) – a deepspeech model object or a path to a folder containing the model files (see load_deepspeech_model()).
audio_length (str, optional) – the length of the audio file if known (used for logging statements). Default is “unknown”.
method (str, optional) – The model to use to perform speech-to-text. Supports ‘deepspeech’ and ‘vosk’. Defaults to “deepspeech”.
do_segment_sentences (bool, optional) – Find sentence boundaries using segment_sentences(). Defaults to True.
- Returns
(full_transcript, full_transcript_json) The combined transcript of all the items in segments as a string and as dictionary/json.
- Return type
tuple
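A hedged sketch of how chunk_by_speech() feeds process_segments() (the paths and model directory are hypothetical):

```python
from lecture2notes.end_to_end.transcribe import transcribe_main

# Detect voiced regions with the WebRTC VAD; because no output_path is
# given, the raw pcm segments are returned instead of written to disk.
segments, sample_rate, audio_length = transcribe_main.chunk_by_speech(
    "lecture.wav", aggressiveness=1
)

# Load a DeepSpeech model (the folder holding the ".pbmm" and optional
# ".scorer" files is hypothetical) and transcribe each voiced segment.
model = transcribe_main.load_deepspeech_model("deepspeech-models/")
full_transcript, full_transcript_json = transcribe_main.process_segments(
    segments, model, audio_length=audio_length, method="deepspeech"
)
```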
- lecture2notes.end_to_end.transcribe.transcribe_main.read_wave(path, desired_sample_rate=None, force=False)[source]¶
Reads a “.wav” file and converts it to desired_sample_rate with one channel.
- Parameters
path (str) – path to wave file to load
desired_sample_rate (int, optional) – resample the loaded pcm data from the wave file to this sample rate. Default is None, no resampling.
force (bool, optional) – Force the audio to be converted even if it is detected to meet the necessary criteria.
- Returns
(PCM audio data, sample rate, duration)
- Return type
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.resolve_deepspeech_models(dir_name)[source]¶
Resolve directory path for deepspeech models and fetch each of them.
- Parameters
dir_name (str) – Path to the directory containing pre-trained models
- Returns
a tuple containing each of the model files (pb, scorer)
- Return type
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.segment_sentences(text, text_json=None, do_capitalization=True)[source]¶
Detect sentence boundaries without punctuation or capitalization.
- Parameters
text (str) – The string to segment by sentence.
text_json (str or dict, optional) – If the detected sentence boundaries should be applied to the JSON format of a transcript. Defaults to None.
do_capitalization (bool, optional) – If the first letter of each detected sentence should be capitalized. Defaults to True.
- Returns
The punctuated (and optionally capitalized) string
- Return type
str
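For example (a minimal sketch; the exact boundaries depend on the underlying punctuation model, so the output shown is only indicative):

```python
from lecture2notes.end_to_end.transcribe import transcribe_main

raw = "today we will cover gradient descent it is an iterative optimization method"
punctuated = transcribe_main.segment_sentences(raw, do_capitalization=True)
# Roughly: "Today we will cover gradient descent. It is an iterative
# optimization method." (actual output depends on the model)
print(punctuated)
```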
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio(audio_path, method='sphinx', **kwargs)[source]¶
Transcribe audio using DeepSpeech, Vosk, or a method offered by transcribe_audio_generic().
- Parameters
audio_path (str) – Path to the audio file to transcribe.
method (str, optional) – The method to use for transcription. Defaults to “sphinx”.
**kwargs – Passed to the transcription function.
- Returns
(transcript_text, transcript_json)
- Return type
tuple
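A minimal sketch of this common entry point (the audio path is hypothetical): “sphinx” and “google” are routed through transcribe_audio_generic(), while “deepspeech” and “vosk” use the model-specific functions documented on this page.

```python
from lecture2notes.end_to_end.transcribe import transcribe_main

transcript_text, transcript_json = transcribe_main.transcribe_audio(
    "lecture.wav", method="sphinx"
)
print(transcript_text)
```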
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_deepspeech(audio_path_or_data, model, raw_audio_data=False, json_num_transcripts=None, **kwargs)[source]¶
Transcribe an audio file or pcm data with the deepspeech model
- Parameters
audio_path_or_data (str or byte string) – a path to a wave file or a byte string containing pcm data from a wave file. Set raw_audio_data to True if pcm data is used.
model (deepspeech model or str) – a deepspeech model object or a path to a folder containing the model files (see load_deepspeech_model())
raw_audio_data (bool, optional) – must be True if audio_path_or_data is raw pcm data. Defaults to False.
json_num_transcripts (str, optional) – Specify this value to generate multiple transcripts in json format.
- Returns
(transcript_text, transcript_json) the transcribed audio file in string format and the transcript in json
- Return type
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_generic(audio_path, method='sphinx', **kwargs)[source]¶
Transcribe an audio file using CMU Sphinx or Google through the speech_recognition library
- Parameters
audio_path (str) – audio file path
method (str, optional) – which service to use for transcription (“google” or “sphinx”). Default is “sphinx”.
- Returns
the transcript of the audio file
- Return type
str
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_vosk(audio_path_or_chunks, model='../vosk_models', chunks=False, desired_sample_rate=16000, chunk_size=2000, **kwargs)[source]¶
Transcribe audio using a vosk model.
- Parameters
audio_path_or_chunks (str or generator) – Path to an audio file or a generator of chunks created by chunk_by_speech()
model (str or vosk.Model, optional) – Path to the directory containing the vosk models or a loaded vosk.Model. Defaults to “../vosk_models”.
chunks (bool, optional) – If the audio_path_or_chunks is chunks. Defaults to False.
desired_sample_rate (int, optional) – The sample rate that the model requires to convert audio to. Defaults to 16000.
chunk_size (int, optional) – The number of wave frames per loop. Amount of audio data transcribed at a time. Defaults to 2000.
- Returns
(text_transcript, results_json) The transcript as a string and as JSON.
- Return type
tuple
- lecture2notes.end_to_end.transcribe.transcribe_main.transcribe_audio_wav2vec(audio_path_or_chunks, model=None, chunks=False, desired_sample_rate=16000)[source]¶
webrtcvad_utils¶
- class lecture2notes.end_to_end.transcribe.webrtcvad_utils.Frame(bytes, timestamp, duration)[source]¶
Represents a “frame” of audio data.
- lecture2notes.end_to_end.transcribe.webrtcvad_utils.frame_generator(frame_duration_ms, audio, sample_rate)[source]¶
Generates audio frames from PCM audio data.
Takes the desired frame duration in milliseconds, the PCM data, and the sample rate.
Yields Frames of the requested duration.
- lecture2notes.end_to_end.transcribe.webrtcvad_utils.vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames)[source]¶
Filters out non-voiced audio frames.
Given a webrtcvad.Vad and a source of audio frames, yields only the voiced audio.
Uses a padded, sliding window algorithm over the audio frames. When more than 90% of the frames in the window are voiced (as reported by the VAD), the collector triggers and begins yielding audio frames. Then the collector waits until 90% of the frames in the window are unvoiced to detrigger.
The window is padded at the front and back to provide a small amount of silence or the beginnings/endings of speech around the voiced frames.
- Parameters
sample_rate – The audio sample rate, in Hz.
frame_duration_ms – The frame duration in milliseconds.
padding_duration_ms – The amount to pad the window, in milliseconds.
vad – An instance of webrtcvad.Vad.
frames – a source of audio frames (sequence or generator).
- Returns
A generator that yields PCM audio data.
- Return type
[generator]
- lecture2notes.end_to_end.transcribe.webrtcvad_utils.vad_segment_generator(wavFile, aggressiveness, desired_sample_rate=None)[source]¶
Generate VAD segments. Filters out non-voiced audio frames.
- Parameters
wavFile (str) – Path to input wav file to run VAD on.
- Returns
segments: a bytearray of multiple smaller audio frames (the longer audio split into multiple smaller ones)
sample_rate: sample rate of the input audio file
audio_length: duration of the input audio file
- Return type
[tuple]
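A hedged sketch of consuming the generator (the file path is hypothetical):

```python
from lecture2notes.end_to_end.transcribe import webrtcvad_utils

segments, sample_rate, audio_length = webrtcvad_utils.vad_segment_generator(
    "lecture.wav", aggressiveness=1
)
for i, segment in enumerate(segments):
    # Each segment holds the pcm audio data for one voiced region.
    print(f"segment {i}: {len(segment)} bytes")
```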
mic_vad_streaming¶
- class lecture2notes.end_to_end.transcribe.mic_vad_streaming.Audio(callback=None, device=None, input_rate=16000, file=None)[source]¶
Streams raw audio from microphone. Data is received in a separate thread, and stored in a buffer, to be read from.
- BLOCKS_PER_SECOND = 50¶
- CHANNELS = 1¶
- FORMAT = 8¶
- RATE_PROCESS = 16000¶
- property frame_duration_ms¶
- class lecture2notes.end_to_end.transcribe.mic_vad_streaming.VADAudio(aggressiveness=3, device=None, input_rate=None, file=None)[source]¶
Filter & segment audio with voice activity detection.
- vad_collector(padding_ms=300, ratio=0.75, frames=None)[source]¶
Generator that yields series of consecutive audio frames comprising each utterance, separated by yielding a single None. Determines voice activity by ratio of frames in padding_ms. Uses a buffer to include padding_ms prior to being triggered.
Example: (frame, ..., frame, None, frame, ..., frame, None, ...) |---utterance---| |---utterance---|
Cluster¶
- class lecture2notes.end_to_end.cluster.ClusterFilesystem(slides_dir, algorithm_name='kmeans', num_centroids=20, preference=None, damping=0.5, max_iter=200, model_path='model_best.ckpt')[source]¶
Clusters images from a directory and saves them to disk in folders corresponding to each centroid.
Corner Crop Transform¶
- lecture2notes.end_to_end.corner_crop_transform.all_in_folder(path, remove_original=False, **kwargs)[source]¶
Perform perspective cropping on every file in folder and return new paths.
**kwargs is passed to crop().
- lecture2notes.end_to_end.corner_crop_transform.cluster_points(points, nclusters)[source]¶
Perform KMeans clustering (using cv2.kmeans) on points, creating nclusters clusters. Returns the centroids of the clusters.
- lecture2notes.end_to_end.corner_crop_transform.contour_offset(cnt, offset)[source]¶
Offset contour because of 5px border
- lecture2notes.end_to_end.corner_crop_transform.crop(img_path, output_path=None, mode='automatic', debug_output_imgs=False, save_debug_imgs=False, create_debug_gif=False, debug_gif_optimize=True, debug_path='debug_imgs')[source]¶
Main method to perspective crop an image to the slide.
- Parameters
img_path (str) – path to the image to load
output_path (str, optional) – path to save the image. Defaults to [filename]_cropped.[ext].
mode (str, optional) – There are three modes available. Defaults to “automatic”.
contours: uses find_page_contours() to extract contours from an edge map of the image. Is ineffective if there are any gaps or obstructions in the outline around the slide.
hough_lines: uses hough_lines_corners() to get corners by looking for horizontal and vertical lines, finding the intersection points, and clustering the intersection points.
automatic: tries to use contours and falls back to hough_lines if contours reports a failure.
debug_output_imgs (bool or dict, optional) – if dictionary, modifies the dictionary by adding (image file name, image data) pairs. If boolean and True, creates a dictionary in the same way as if a dictionary was passed. Defaults to False.
save_debug_imgs (bool, optional) – uses write_debug_imgs() to save the debug_output_imgs to disk. Requires debug_output_imgs to not be False. Defaults to False.
create_debug_gif (bool, optional) – create a gif of the debug images. Requires debug_output_imgs to not be False. Defaults to False.
debug_gif_optimize (bool, optional) – optimize the gif produced by enabling the create_debug_gif option using pygifsicle. Defaults to True.
debug_path (str, optional) – location to save the debug images and debug gif. Defaults to “debug_imgs”.
- Returns
path to cropped image and failed (True if no slide bounding box found, False otherwise)
- Return type
[tuple]
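For instance (a sketch; the frame path is hypothetical):

```python
from lecture2notes.end_to_end import corner_crop_transform

# Try contour detection first and fall back to Hough-line corner
# detection if the contour method reports a failure.
cropped_path, failed = corner_crop_transform.crop(
    "frames/frame_0042.jpg", mode="automatic"
)
if failed:
    print("No slide bounding box was found in the frame.")
else:
    print(f"Cropped slide written to {cropped_path}")
```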
- lecture2notes.end_to_end.corner_crop_transform.edges_det(img, min_val, max_val, debug_output_imgs=None)[source]¶
Preprocessing (gray, thresh, filter, border) & Canny edge detection
- Parameters
img (image) – the image loaded using cv2.imread.
min_val (int) – minimum value for cv2.Canny.
max_val (int) – maximum value for cv2.Canny.
debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.
- Returns
(dilated, total_border), the dilated edges and the total border width added
- Return type
[tuple]
- lecture2notes.end_to_end.corner_crop_transform.find_intersection(line1, line2)[source]¶
Find the intersection between line1 and line2.
- lecture2notes.end_to_end.corner_crop_transform.find_page_contours(edges, img, border_size=11, min_area_mult=0.3, debug_output_imgs=None)[source]¶
Find corner points of page contour
- Parameters
edges (image) – edges extracted from img by edges_det().
img (image) – the image loaded by cv2.imread.
border_size (int, optional) – the size of the borders added by edges_det(). Defaults to 11.
min_area_mult (float, optional) – the minimum percentage of the image area that a contour’s area must be greater than to be considered as the slide. Defaults to 0.3.
- Returns
contour is the set of coordinates of the corners sorted by four_corners_sort(), or None when no contour meets the criteria.
- Return type
[contour or NoneType]
- lecture2notes.end_to_end.corner_crop_transform.four_corners_sort(pts)[source]¶
Sort corners: top-left, bot-left, bot-right, top-right
- lecture2notes.end_to_end.corner_crop_transform.horizontal_vertical_edges_det(img, thresh_blurred, debug_output_imgs=None)[source]¶
Detects horizontal and vertical edges and merges them together.
- Parameters
img (image) – the image as provided by cv2.imread
thresh_blurred (image) – the image processed by thresholding. See edges_det().
debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.
- Returns
result image with a black background and white edges
- Return type
[image]
- lecture2notes.end_to_end.corner_crop_transform.hough_lines_corners(img, edges_img, min_line_length, border_size=11, debug_output_imgs=None)[source]¶
Uses cv2.HoughLinesP to find horizontal and vertical lines, finds the intersection points, and finally clusters those points using KMeans.
- Parameters
img (image) – the image as loaded by cv2.imread.
edges_img (image) – edges extracted from img by edges_det().
min_line_length (int) – the shortest line length to consider as a valid line
border_size (int, optional) – the size of the borders added by edges_det(). Defaults to 11.
debug_output_imgs (dict, optional) – modifies this dictionary by adding (image file name, image data) pairs. Defaults to None.
- Returns
The corner coordinates as sorted by four_corners_sort().
- Return type
[list]
- lecture2notes.end_to_end.corner_crop_transform.persp_transform(img, s_points)[source]¶
Transform perspective of img from start points to target points.
- lecture2notes.end_to_end.corner_crop_transform.remove_contours(edges, contour_removal_threshold)[source]¶
Remove contours from an edge map by deleting contours shorter than contour_removal_threshold.
- lecture2notes.end_to_end.corner_crop_transform.resize(img, height=800, allways=False)[source]¶
Resize image to given height.
- lecture2notes.end_to_end.corner_crop_transform.segment_lines(lines, delta)[source]¶
Groups lines from cv2.HoughLinesP into vertical and horizontal bins.
- Parameters
lines (list) – the data returned from cv2.HoughLinesP
delta (int) – how far away the x and y coordinates can differ before they’re marked as different lines
- Returns
(h_lines, v_lines) the horizontal and vertical lines, respectively. each line in each list is formatted as (x1, y1, x2, y2).
- Return type
[tuple]
- lecture2notes.end_to_end.corner_crop_transform.straight_lines_in_contour(contour, delta=100)[source]¶
Returns True if contour contains lines that are horizontal or vertical. delta allows the lines to tilt by a certain number of pixels. For instance, if a line is vertical, its y values can change by delta pixels before it is considered not vertical.
- lecture2notes.end_to_end.corner_crop_transform.write_debug_imgs(debug_output_imgs, base_path='debug_imgs')[source]¶
Saves images from debug_output_imgs to disk in base_path.
- Parameters
debug_output_imgs (dict) – dictionary in format {image file name: image data}
base_path (str, optional) – the directory to store the debug images. Defaults to “debug_imgs”.
Text Detection¶
- lecture2notes.end_to_end.text_detection.get_text_bounding_boxes(image, net, min_confidence=0.5, resized_width=320, resized_height=320)[source]¶
Determine the locations of text in an image.
- Parameters
image (np.array) – The image to be processed.
net (cv2.dnn_Net) – The EAST model loaded with load_east().
min_confidence (float, optional) – Minimum probability required to inspect a region. Defaults to 0.5.
resized_width (int, optional) – Resized image width (should be multiple of 32). Defaults to 320.
resized_height (int, optional) – Resized image height (should be multiple of 32). Defaults to 320.
- Returns
The coordinates of bounding boxes containing text.
- Return type
list
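A hedged sketch of detecting and drawing text boxes. It assumes load_east() accepts the model path and that each returned box unpacks as (start_x, start_y, end_x, end_y); the image path is hypothetical.

```python
import cv2
from lecture2notes.end_to_end import text_detection

net = text_detection.load_east("frozen_east_text_detection.pb")
image = cv2.imread("slides/slide_03.png")  # hypothetical slide frame
boxes = text_detection.get_text_bounding_boxes(image, net, min_confidence=0.5)

# Assumed box layout: (start_x, start_y, end_x, end_y).
for start_x, start_y, end_x, end_y in boxes:
    cv2.rectangle(image, (start_x, start_y), (end_x, end_y), (0, 255, 0), 2)
cv2.imwrite("slide_03_text_boxes.png", image)
```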
Figure Detection¶
- lecture2notes.end_to_end.figure_detection.all_in_folder(path, remove_original=False, east='frozen_east_text_detection.pb', do_text_check=True, **kwargs)[source]¶
Perform figure detection on every file in folder and return new paths.
**kwargs is passed to detect_figures().
- lecture2notes.end_to_end.figure_detection.area_of_overlapping_rectangles(a, b)[source]¶
Find the overlapping area of two rectangles a and b. Inspired by https://stackoverflow.com/a/27162334.
- lecture2notes.end_to_end.figure_detection.detect_color_image(image, thumb_size=40, MSE_cutoff=22, adjust_color_bias=True)[source]¶
Detect if an image contains color, is black and white, or is grayscale. Based on this StackOverflow answer.
- Parameters
image (np.array) – Input image
thumb_size (int, optional) – Resize image to this size to speed up calculation. Defaults to 40.
MSE_cutoff (int, optional) – A larger value requires more color for an image to be labeled as “color”. Defaults to 22.
adjust_color_bias (bool, optional) – Mean color bias adjustment, which improves the prediction. Defaults to True.
- Returns
Either “grayscale”, “color”, “b&w” (black and white), or “unknown”.
- Return type
str
- lecture2notes.end_to_end.figure_detection.detect_figures(image_path, output_path=None, east='frozen_east_text_detection.pb', text_area_overlap_threshold=0.32, figure_max_area_percentage=0.6, text_max_area_percentage=0.3, large_box_detection=True, do_color_check=True, do_text_check=True, entropy_check=2.5, do_remove_subfigures=True, do_rlsa=False)[source]¶
Detect figures located in a slide.
- Parameters
image_path (str) – Path to the image to process.
output_path (str, optional) – Path to save the figures. Defaults to [filename]_figure_[index].[ext].
east (str or cv2.dnn_Net, optional) – Path to the EAST model file or the pre-trained EAST model loaded with load_east(). do_text_check must be true for this option to take effect. Defaults to “frozen_east_text_detection.pb”.
text_area_overlap_threshold (float, optional) – The percentage of the figure that can contain text. If the area of the text in the figure is greater than this value, the figure is discarded. do_text_check must be true for this option to take effect. Defaults to 0.32.
figure_max_area_percentage (float, optional) – The maximum percentage of the area of the original image that a figure can take up. If the figure uses more area than original_image_area*figure_max_area_percentage then the figure will be discarded. Defaults to 0.60.
text_max_area_percentage (float, optional) – The maximum percentage of the area of the original image that a block of text (as identified by the EAST model) can take up. If the text block uses more area than original_image_area*text_max_area_percentage then that text block will be ignored. do_text_check must be true for this option to take effect. Defaults to 0.30.
large_box_detection (bool, optional) – Detect edges and classify large rectangles as figures. This will ignore do_color_check and do_text_check. This is useful for finding tables, for example. Defaults to True.
do_color_check (bool, optional) – Check that potential figures contain color. This helps to remove large quantities of black and white text from the potential figure list. Defaults to True.
do_text_check (bool, optional) – Check that at most text_area_overlap_threshold of each potential figure contains text. This is useful to remove blocks of text that are mistakenly classified as figures. Checking for text increases processing time so be careful if processing a large number of files. Defaults to True.
entropy_check (float, optional) – Check that the entropy of all potential figures is above this value. Figures with a shannon_entropy lower than this value will be removed. Set to False to disable this check. The shannon_entropy implementation is from skimage.measure.entropy. IMPORTANT: This check applies to both the regular tests and large_box_detection, which most checks do not apply to. Defaults to 2.5.
do_remove_subfigures (bool, optional) – Check that there are no overlapping figures. If an overlapping figure is detected, the smaller figure will be deleted. This is useful to have enabled when using large_box_detection since large_box_detection will commonly mistakenly detect subfigures. Defaults to True.
do_rlsa (bool, optional) – Use RLSA (Run Length Smoothing Algorithm) instead of dilation. Does not apply to large_box_detection. Defaults to False.
- Returns
(figures, output_paths) A list of figures extracted from the input slide image and a list of paths to those figures on disk.
- Return type
tuple
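A minimal sketch with mostly default options (the slide path is hypothetical):

```python
from lecture2notes.end_to_end import figure_detection

figures, output_paths = figure_detection.detect_figures(
    "slides/slide_03.png",                 # hypothetical slide image
    east="frozen_east_text_detection.pb",  # EAST model used by do_text_check
    do_color_check=True,
    do_text_check=True,
)
print(f"Extracted {len(figures)} figures: {output_paths}")
```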
Frames Extractor¶
Helpers¶
- lecture2notes.end_to_end.helpers.copy_all(list_path_files, output_dir, move=False)[source]¶
Copy (or move) every path in list_path_files (if a list is given) or every file in a directory (if a path is given) to output_dir.
Image Hash¶
- lecture2notes.end_to_end.imghash.get_hash_func(hashmethod='phash')[source]¶
Returns a hash function from the imagehash library.
- Hash Methods:
ahash: Average hash
phash: Perceptual hash
dhash: Difference hash
whash-haar: Haar wavelet hash
whash-db4: Daubechies wavelet hash
- lecture2notes.end_to_end.imghash.remove_duplicates(img_dir, images)[source]¶
Remove duplicate frames/slides from disk.
- Parameters
img_dir (str) – path to directory containing image files
images (dict) – dictionary in format {image hash: image filenames} provided by sort_by_duplicates().
- lecture2notes.end_to_end.imghash.sort_by_duplicates(img_dir, hash_func='phash')[source]¶
Find duplicate images in a directory.
- Parameters
img_dir (str) – path to folder containing images to scan for duplicates
hash_func (str, optional) – the hash function to use as given by get_hash_func(). Defaults to “phash”.
- Returns
dictionary in format {image hash: image filenames}
- Return type
[dict]
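Together, these two functions implement a simple de-duplication pass (the directory is hypothetical):

```python
from lecture2notes.end_to_end import imghash

# Group frames by perceptual hash, then delete all but one per group.
duplicates = imghash.sort_by_duplicates("frames/", hash_func="phash")
imghash.remove_duplicates("frames/", duplicates)
```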
OCR¶
Segment Cluster¶
- class lecture2notes.end_to_end.segment_cluster.SegmentCluster(slides_dir, model_path='model_best.ckpt')[source]¶
Iterates through frames in order and splits based on large visual differences (measured by the cosine difference between the feature vectors from the slide classifier)
SIFT Matcher¶
- lecture2notes.end_to_end.sift_matcher.does_camera_move(old_frame, frame, gamma=10, border_ratios=(10, 19), bottom=False)[source]¶
Detects camera movement between two frames by tracking features in the borders of the image. Only the borders are used because the center of the image probably contains a slide. Thus, tracking features of the slide is not robust since those features will disappear when the slide changes.
- Parameters
old_frame (np.array) – First frame/image as loaded with
cv2.imread()
frame (np.array) – Second frame/image as loaded with
cv2.imread()
gamma (int, optional) – The threshold pixel movement value. If the camera moves more than this value, then there is assumed to be camera movement between the two frames. Defaults to 10.
border_ratios (tuple, optional) – The ratios of the height and width respectively of the first frame to be searched for features. Only the borders are searched for features. These values specify how much of the image should be counted as a border. Defaults to (10, 19).
bottom (bool, optional) – Whether to find features in the bottom border. This is not recommended because ‘presenter_slide’ images may have the peoples’ heads at the bottom, which will move and do not represent camera motion. Defaults to False.
- Returns
(total_movement > gamma, total_movement) If there is camera movement between the two frames and the total movement between the frames.
- Return type
tuple
- lecture2notes.end_to_end.sift_matcher.does_camera_move_all_in_folder(folder_path)[source]¶
Runs
does_camera_move()
on all the files in a folder and calculates statistics about camera movement within those files.
- Parameters
folder_path (str) – Directory containing the files to be processed.
- Returns
(movement_detection_percentage, average_move_value, max_move_value) A float representing the percentage of frames where movement was detected from the previous frame, the average of the total_movement values returned from does_camera_move(), and the maximum of the total_movement values returned from does_camera_move().
- Return type
tuple
- lecture2notes.end_to_end.sift_matcher.is_content_added(first, second, first_area_modifier=0.7, second_area_modifier=0.4, gamma=0.09, dilation_amount=22)[source]¶
Detect if second contains more content than first and how much more content it adds. This algorithm dilates both images and finds contours. It then computes the total area of those contours. If gamma % more than the area of the first image’s contours is greater than the area of the second image’s contours then it is assumed more content is added.
- Parameters
first (np.array) – Image loaded using cv2.imread() belonging to the ‘slide’ class
second (np.array) – Image loaded using cv2.imread() belonging to the ‘presenter_slide’ class
first_area_modifier (float, optional) – The maximum percent area of the first image that a contour can take up before it is excluded. Defaults to 0.70.
second_area_modifier (float, optional) – The maximum percent area of the second image that a contour can take up before it is excluded. Images belonging to the ‘presenter_slide’ class are more likely to have mistaken large contours. Defaults to 0.40.
gamma (float, optional) – The percentage increase in content area necessary for second to be classified as having more content than first. Defaults to 0.09.
dilation_amount (int, optional) – How much the canny edge maps of both images first and second should be dilated. This helps to combine multiple components of one object into a single contour. Defaults to 22.
- Returns
(content_is_added, amount_of_added_content) Boolean if second contains more content than first and float describing the difference in content from first to second. amount_of_added_content can be negative.
- Return type
tuple
- lecture2notes.end_to_end.sift_matcher.match_features(slide_path, presenter_slide_path, min_match_count=33, min_area_percent=0.37, do_motion_detection=True)[source]¶
Match features between images in slide_path and presenter_slide_path. The images in slide_path are the queries to the matching algorithm and the images in presenter_slide_path are the train/searched images.
- Parameters
slide_path (str) – Path to the images classified as “slide” or any directory containing query images.
presenter_slide_path (str) – Path to the images classified as “presenter_slide” or any directory containing train images.
min_match_count (int, optional) – The minimum number of matches returned by sift_flann_match() required for the image pair to be considered as containing the same slide. Defaults to 33.
min_area_percent (float, optional) – Percentage of the area of the train image (images belonging to the ‘presenter_slide’ category) that a matched slide must take up to be counted as a legitimate duplicate slide. This removes incorrect matches that can result in crops to small portions of the train image. Defaults to 0.37.
do_motion_detection (bool, optional) – Whether motion detection using does_camera_move_all_in_folder() should be performed. If set to False then it is assumed that there is movement since assuming no movement leaves room for a lot of false positives. If no camera motion is detected and this option is enabled then all slides that are unique to the “presenter_slide” category (they have no matches in the “slide” category) will automatically be cropped to contain just the slide. They will be saved to the originating folder but with the string defined by the variable OUTPUT_PATH_MODIFIER in their filename. Even if does_camera_move_all_in_folder() detects no movement it is still possible that movement is detected while running this function since a check is performed to make sure all slide bounding boxes found contain 80% overlapping area with all previously found bounding boxes. Defaults to True.
- Returns
(non_unique_presenter_slides, transformed_image_paths)
non_unique_presenter_slides: The images in the “presenter_slide” category that are not unique and should be deleted
transformed_image_paths: The paths to the cropped images if do_motion_detection was enabled and no motion was detected.
- Return type
tuple
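A hedged usage sketch (the sorted-frame directories are hypothetical and would normally be produced by the slide classifier):

```python
from lecture2notes.end_to_end import sift_matcher

non_unique_presenter_slides, transformed_image_paths = sift_matcher.match_features(
    "frames_sorted/slide",            # query images
    "frames_sorted/presenter_slide",  # train/searched images
    min_match_count=33,
    min_area_percent=0.37,
)
# Presenter-slide frames that duplicate an existing slide can be removed.
print(f"{len(non_unique_presenter_slides)} duplicate presenter slides found")
```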
- lecture2notes.end_to_end.sift_matcher.ransac_transform(sift_matches, kp1, kp2, img1, img2, draw_matches=False)[source]¶
Use data from sift_flann_match() to find the coordinates of img1 in img2. sift_matches, kp1, kp2, img1, and img2 are all outputs of sift_flann_match(). If draw_matches is enabled then the feature matches will be drawn and shown on the screen.
- Returns
The corner coordinates of the quadrilateral representing img1 within img2.
- Return type
np.array
- lecture2notes.end_to_end.sift_matcher.sift_flann_match(query_image, train_image, algorithm='orb', num_features=1000)[source]¶
Locate query_image within train_image using algorithm for feature detection/description and FLANN (Fast Library for Approximate Nearest Neighbors) for matching. You can read more about matching in the OpenCV “Feature Matching” documentation or about homography in the OpenCV Python tutorial “Feature Matching + Homography to find Objects”.
- Parameters
query_image (np.array) – Image to find. Loaded using cv2.imread().
train_image (np.array) – Image to search. Loaded using cv2.imread().
algorithm (str, optional) – The feature detection/description algorithm. Can be one of ORB (ORB Class Reference), SIFT (SIFT Class Reference), or FAST (FAST Class Reference). Defaults to “orb”.
num_features (int, optional) – The maximum number of features to retain when using ORB and SIFT. Does not take effect when using the FAST detection algorithm. Setting to 0 for SIFT is a good starting point. The default for ORB is 500, but it was increased to 1000 to improve accuracy. Defaults to 1000.
- Returns
(good, kp1, kp2, img1, img2) The good matches as per Lowe’s ratio test, the key points from image 1, the key points from image 2, modified image 1, and modified image 2.
- Return type
tuple
Slide Classifier¶
- lecture2notes.end_to_end.slide_classifier.classify_frames(frames_dir, do_move=True, incorrect_threshold=0.6, model_path='model_best.ckpt')[source]¶
Classifies images in a directory using the slide classifier model.
- Parameters
frames_dir (str) – path to directory containing images to classify
do_move (bool, optional) – move the images to their sorted folders instead of copying them. Defaults to True.
incorrect_threshold (float, optional) – the certainty value that the model must be below for a prediction to be marked “probably incorrect”. Defaults to 0.60.
- Returns
(frames_sorted_dir, certainties, percent_wrong)
- Return type
[tuple]
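For example (a sketch; the frames directory and checkpoint path are hypothetical):

```python
from lecture2notes.end_to_end import slide_classifier

frames_sorted_dir, certainties, percent_wrong = slide_classifier.classify_frames(
    "frames/", model_path="model_best.ckpt"
)
print("Sorted frames are in:", frames_sorted_dir)
print("Portion flagged as probably incorrect:", percent_wrong)
```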
Slide Structure Analysis¶
- lecture2notes.end_to_end.slide_structure_analysis.all_in_folder(path, do_rename=True, **kwargs)[source]¶
Perform structure analysis and OCR on every file in folder using
analyze_structure()
.
- Parameters
path (str) – Directory containing images to process.
do_rename (bool, optional) – Rename files to just their frame number. Defaults to True.
**kwargs (dict, optional) – Passed to lecture2notes.end_to_end.slide_structure_analysis.analyze_structure().
- Returns
(raw_texts, json_texts) A list of the raw text for each slide and a list of the json structure analysis data for each slide.
- Return type
tuple
- lecture2notes.end_to_end.slide_structure_analysis.analyze_structure(image, to_json=None, return_unstructured_text=True, gamma=0.1, beta=0.2, orient='index', extra_json=None)[source]¶
Perform slide structure analysis.
- Parameters
image (np.array) – Image to be processed as loaded with cv2.imread().
to_json (str or bool, optional) – Path to write json output or a boolean to return json data as a string. The default return value is a pd.DataFrame. Defaults to None.
return_unstructured_text (bool, optional) – If the raw recognized text should be returned in addition to the other return values.
gamma (float, optional) – The percentage greater than or less than the average stroke width that a text line must meet to be classified as bold/subtitle or small text respectively. Defaults to 0.1.
beta (float, optional) – The percentage greater than or less than the average height that a text line must meet to be classified as bold/subtitle or small text respectively. This is greater than gamma because height is on a larger scale than gamma. Defaults to 0.2.
orient (str, optional) – The format of the output json data if to_json is set. The acceptable values can be found in the pandas.DataFrame.to_json documentation. Defaults to “index”.
extra_json (dict, optional) – Additional keys and values to add to the json output if to_json is enabled. Defaults to None.
- Returns
The default is to return a pd.DataFrame. However, setting to_json to a string will instead write json data to to_json and return the path to the data. Setting to_json to True will return the json data as a string. Setting return_unstructured_text returns the previously described data and the raw recognized text as a tuple. Will return None if no text is detected.
- Return type
pd.DataFrame or str or tuple or None
- lecture2notes.end_to_end.slide_structure_analysis.identify_title(tesseract_df, image, left_start_maximum=0.77, character_limit=3, enabled_checks=None)[source]¶
- lecture2notes.end_to_end.slide_structure_analysis.stroke_width(image)[source]¶
Determine the average stroke length in an image. Inspired by: https://stackoverflow.com/a/61914060.
- lecture2notes.end_to_end.slide_structure_analysis.write_to_file(raw_texts, json_texts, raw_save_file, json_save_file)[source]¶
Write the raw text in raw_texts to raw_save_file and the json data in json_texts to json_save_file. Used to write results from all_in_folder() to disk.
- Parameters
raw_texts (list) – List of raw text outputs from analyze_structure().
json_texts (list) – List of json ssa outputs from analyze_structure().
.raw_save_file (str) – The path to save the raw text. A “.txt” file.
json_save_file (str) – The path to save the json output. A “.json” file.
Spell Check¶
- class lecture2notes.end_to_end.spell_check.SpellChecker(max_edit_distance_dictionary=2, max_edit_distance_lookup=2, prefix_length=7)[source]¶
A spell checker.
Summarization Approaches¶
- lecture2notes.end_to_end.summarization_approaches.cluster(text, coverage_percentage=0.7, final_sort_by=None, cluster_summarizer='extractive', title_generation=False, num_topics=10, minibatch=False, hf_inference_api=False, feature_extraction='neural_sbert', **kwargs)[source]¶
Summarize text to coverage_percentage length of the original document by extracting features from the text, clustering based on those features, and finally summarizing each cluster. See the scikit-learn documentation on clustering text for more information since several sections of this function were borrowed from that example.
Notes: **kwargs is passed to the feature extraction function, which is either extract_features_bow() or extract_features_neural() depending on the feature_extraction argument.
- Parameters
text (str) – a string of text to summarize
coverage_percentage (float, optional) – The length of the summary as a percentage of the original document. Defaults to 0.70.
final_sort_by (str, optional) – If cluster_summarizer is extractive and title_generation is False then this argument is available. If specified, it will sort the final cluster summaries by the specified string. Options are ["order", "rating"]. Defaults to None.
cluster_summarizer (str, optional) – Which summarization method to use to summarize each individual cluster. “Extractive” uses the same approach as keyword_based_ext() but instead of using keywords from another document, the keywords are calculated in the TfidfVectorizer or HashingVectorizer. Each keyword is a feature in the document-term matrix, thus the number of words to use is specified by the n_features parameter. Options are ["extractive", "abstractive"]. Defaults to “extractive”.
title_generation (bool, optional) – Option to generate titles for each cluster. Cannot be used if final_sort_by is set. Generates titles by summarizing the text using BART finetuned on XSum (a dataset of news articles and one sentence summaries, aka headline generation) and forcing results to be from 1 to 10 words long. Defaults to False.
num_topics (int, optional) – The number of clusters to create. This should be set to the number of topics discussed in the lecture if generating good titles is desired. If separating into groups is not very important and a final summary is desired then this parameter is not incredibly important; it just should not be set super low (3) or super high (50) unless your document is super short or long. Defaults to 10.
minibatch (bool, optional) – Two clustering algorithms are used: ordinary k-means and its more scalable cousin minibatch k-means. Setting this to True will use minibatch k-means with a batch size set to the number of clusters set in num_topics. Defaults to False.
hf_inference_api (bool, optional) – Use the huggingface inference API for abstractive summarization. Defaults to False.
feature_extraction (str, optional) – Specify how features should be extracted from the text.
neural_hf: uses a huggingface/transformers pipeline with the roberta model by default
neural_sbert: special bert and roberta models fine-tuned to extract sentence embeddings
spacy: uses a spacy model. All other options use the small spacy model to split the text into sentences since sentence detection does not improve with larger models. However, if spacy is specified for feature_selection then the en_core_web_lg model will be used to extract high-quality embeddings.
bow: bow = “bag of words”. This method is extremely fast since it is based on word frequencies throughout the input text. The extract_features_bow() function contains more details on recommended parameters that you can pass to this function because of **kwargs.
Options are ["neural_hf", "neural_sbert", "spacy", "bow"]. Default is “neural_sbert”.
- Raises
Exception – If incorrect parameters are passed.
- Returns
The summarized text as a normal string. Line breaks will be included if title_generation is true.
- Return type
[str]
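A hedged sketch with the defaults (the transcript file is hypothetical):

```python
from lecture2notes.end_to_end import summarization_approaches

with open("transcript.txt") as f:  # hypothetical transcript file
    transcript_text = f.read()

summary = summarization_approaches.cluster(
    transcript_text,
    coverage_percentage=0.7,
    cluster_summarizer="extractive",
    num_topics=10,
    feature_extraction="neural_sbert",
)
print(summary)
```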
- lecture2notes.end_to_end.summarization_approaches.create_sumy_summarizer(algorithm, language='english')[source]¶
- lecture2notes.end_to_end.summarization_approaches.extract_features_bow(data, return_lsa_svd=False, use_hashing=False, use_idf=True, n_features=10000, lsa_num_components=False)[source]¶
Extract features using a bag of words statistical word-frequency approach.
- Parameters
data (list) – List of sentences to extract features from
return_lsa_svd (bool, optional) – Return the features and lsa_svd. See “Returns” section below. Defaults to False.
use_hashing (bool, optional) – Use a HashingVectorizer instead of a CountVectorizer. Defaults to False. A HashingVectorizer should only be used with large datasets. Large to the degree that you’ll probably never pass enough data through this function to warrant the usage of a HashingVectorizer. HashingVectorizers use very little memory and are thus scalable to large datasets because there is no need to store a vocabulary dictionary in memory. More information can be found in the HashingVectorizer scikit-learn documentation.
use_idf (bool, optional) – Option to use inverse document-frequency. Defaults to True. In the case of use_hashing a TfidfTransformer will be appended in a pipeline after the HashingVectorizer. If not use_hashing then the use_idf parameter of the TfidfVectorizer will be set to use_idf. This step is important because, as explained by the scikit-learn documentation: “In a large text corpus, some words will be very present (e.g. ‘the’, ‘a’, ‘is’ in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms. In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.”
n_features (int, optional) – Specifies the number of features/words to use in the vocabulary (which are the rows of the document-term matrix). In the case of the TfidfVectorizer the n_features acts as a maximum since the max_df and min_df parameters choose words to add to the vocabulary (to use as features) that occur within the bounds specified by these parameters. This value should probably be lowered if use_hashing is set to True. Defaults to 10000.
lsa_num_components (int, optional) – If set then preprocess the data using latent semantic analysis to reduce the dimensionality to lsa_num_components components. Defaults to False.
- Returns
list of features extracted and optionally the u, sigma, and v of the svd calculation on the document-term matrix. Only returned if return_lsa_svd is set to True.
- Return type
[list or tuple]
- lecture2notes.end_to_end.summarization_approaches.extract_features_neural_hf(sentences, model='roberta-base', tokenizer='roberta-base', n_hidden=768, squeeze=True, **kwargs)[source]¶
Extract features using a transformer model from the huggingface/transformers library
- lecture2notes.end_to_end.summarization_approaches.extract_features_neural_sbert(sentences, model='roberta-base-nli-mean-tokens')[source]¶
Extract features using Sentence-BERT (SBERT) or SRoBERTa from the sentence-transformers library
- lecture2notes.end_to_end.summarization_approaches.full_sents(ocr_text, transcript_text, remove_newlines=True, cut_off=0.7)[source]¶
- lecture2notes.end_to_end.summarization_approaches.generic_abstractive(to_summarize, summarizer=None, min_length=None, max_length=None, hf_inference_api=False, *args, **kwargs)[source]¶
- lecture2notes.end_to_end.summarization_approaches.generic_abstractive_hf_api(to_summarize, summarizer='facebook/bart-large-cnn', *args, **kwargs)[source]¶
- lecture2notes.end_to_end.summarization_approaches.generic_extractive_sumy(text, coverage_percentage=0.7, algorithm='text_rank', language='english')[source]¶
- lecture2notes.end_to_end.summarization_approaches.get_best_sentences(sentences, count, rating, *args, **kwargs)[source]¶
- lecture2notes.end_to_end.summarization_approaches.get_complete_sentences(text, return_string=False)[source]¶
- lecture2notes.end_to_end.summarization_approaches.get_sentences(text, model='en_core_web_sm')[source]¶
- lecture2notes.end_to_end.summarization_approaches.initialize_abstractive_model(sum_model, use_hf_pipeline=True, *args, **kwargs)[source]¶
- lecture2notes.end_to_end.summarization_approaches.keyword_based_ext(ocr_text, transcript_text, coverage_percentage=0.7)[source]¶
- lecture2notes.end_to_end.summarization_approaches.structured_joined_sum(ssa_path, transcript_json_path, frame_every_x=1, ending_char='.', first_slide_frame_num=0, to_json=False, summarization_method='abstractive', max_summarize_len=50, abs_summarizer='sshleifer/distilbart-cnn-12-6', ext_summarizer='text_rank', hf_inference_api=False, *args, **kwargs)[source]¶
Summarize slides by combining the Slide Structure Analysis (SSA) and transcript json to create a per-slide summary of the transcript. The content from the beginning of one slide to the start of the next, extended to the nearest ending_char, is considered the transcript that belongs to that slide. The summarized transcript content is organized in a dictionary where the slide titles are keys. This dictionary can be returned as json or written to a json file.
- Parameters
ssa_path (str) – Path to the SSA JSON file.
transcript_json_path (str) – Path to the transcript JSON file.
frame_every_x (int, optional) – How often frames were extracted from the video that the SSA was conducted on. This is used to convert frame numbers to time (seconds). Defaults to 1.
ending_char (str, optional) – The character that the transcript belonging to each slide will be extended to. For instance, if the next slide appears in the middle of a word, the transcript content will continue to be added to the previous slide until the ending_char is reached. It is recommended to use periods or a special end-of-sentence token if present. These can be generated with lecture2notes.end_to_end.transcribe.transcribe_main.segment_sentences(). Defaults to “.”.
first_slide_frame_num (int, optional) – The frame number of the first slide. Used to create a ‘preface’ (aka an introduction) if the first slide is not immediately shown. Defaults to 0.
to_json (bool or str, optional) – If the output dictionary should be returned as a JSON string. This can also be set to a path as a string and the JSON data will be dumped to the file at that path. Defaults to False.
summarization_method (str, optional) – The method to use to summarize each slide’s transcript content. Options include “abstractive”, “extractive”, or “none”. Defaults to “abstractive”.
max_summarize_len (int, optional) – Text longer than this many tokens will be summarized. Defaults to 50.
abs_summarizer (str, optional) – The abstractive summarization model to use if summarization_method is “abstractive”. Defaults to “sshleifer/distilbart-cnn-12-6”.
hf_inference_api (bool, optional) – Use the huggingface inference API for abstractive summarization. Defaults to False.
*args and **kwargs – Passed to the summarization function, which is either generic_abstractive() or generic_extractive_sumy() depending on summarization_method.
- Returns
A dictionary containing the slide titles as keys and the summarized transcript content for each slide as values. A string will be returned when to_json is set. If to_json is True (boolean), the JSON data formatted as a string will be returned. If to_json is a path (string), then the JSON data will be dumped to the file specified and the path to the file will be returned.
- Return type
dict or str
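A minimal sketch (both input paths are hypothetical and would normally come from the SSA and transcription steps):

```python
from lecture2notes.end_to_end import summarization_approaches

slide_summaries = summarization_approaches.structured_joined_sum(
    "slide_structure_analysis.json",  # hypothetical SSA output
    "transcript.json",                # hypothetical transcript with timings
    summarization_method="abstractive",
    abs_summarizer="sshleifer/distilbart-cnn-12-6",
)
# Slide titles are the keys; summarized transcript content is the value.
for title, content in slide_summaries.items():
    print(title, "->", content)
```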
Transcript Downloader¶
- class lecture2notes.end_to_end.transcript_downloader.TranscriptDownloader(youtube=None, ytdl=True)[source]¶
Download transcripts from YouTube using the YouTube API or youtube-dl.
- static check_suffix(output_path)[source]¶
Gets the file extension from output_path and verifies it is either “.srt”, “.vtt”, or it is not present in output_path. The default is “.vtt”.
- download(video_id, output_path)[source]¶
Convenience function to download transcript with one call. If self.ytdl is False, calls get_caption_id() and passes the result to get_transcript(). If self.ytdl is True, calls get_transcript_ytdl().
- get_caption_id(video_id, lang)[source]¶
Gets the caption id with language lang for a video on YouTube with id video_id.
- get_transcript_api(caption_id, output_path)[source]¶
Downloads a caption track by id directly from the YouTube API.
- Parameters
caption_id (str) – the id of the caption track to download
output_path (str) – path to save the captions. File extensions are parsed by check_suffix()
- Returns
the path where the transcript was saved (may not be the same as the output_path parameter)
- Return type
[str]
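A hedged end-to-end sketch (the video id and output path are hypothetical placeholders, and download() is assumed not to need a return value captured):

```python
from lecture2notes.end_to_end.transcript_downloader import TranscriptDownloader

downloader = TranscriptDownloader(ytdl=True)  # use youtube-dl, not the API
downloader.download("VIDEO_ID", "lecture_transcript.vtt")
```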