Published 10/2024
MP4 | Video: h264, 1280×720 | Audio: AAC, 44.1 KHz
Language: English | Size: 1.15 GB | Duration: 2h 0m
How to extract information WITHOUT building custom Machine Learning models
What you’ll learn
Understand the spaCy document object
How spaCy pipelines work
How to use rule-based matching for information extraction
A system for practical, iterative text analytics using the itables library
Requirements
Intermediate Knowledge of Python programming
Basic knowledge of the pandas dataframe library
Description
What is text analytics? I like this definition: "Text analytics is the process of transforming unstructured text documents into usable, structured data. Text analysis works by breaking apart sentences and phrases into their components, and then evaluating each part's role and meaning using complex software rules and machine learning algorithms." [Source: Lexalytics website]

In spaCy, you can use machine learning algorithms in two ways:

1) Pretrained models provided by spaCy and other organizations. For example, en_core_web_md, which I use in this course, is a pretrained model provided by Explosion, the company that created spaCy.

2) Custom machine learning models that you train on your own data, which the documentation often refers to as "statistical models".

Why not statistical models? This is what the makers of spaCy say in their documentation:

"For complex tasks, it's usually better to train a statistical entity recognition model. However, statistical models require training data, so for many situations, rule-based approaches are more practical. This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you 'bootstrap' a statistical model.

Training a model is useful if you have some examples and you want your system to be able to generalize based on those examples. It works especially well if there are clues in the local context. For instance, if you're trying to detect person or company names, your application may benefit from a statistical named entity recognition model.

Rule-based systems are a good choice if there's a more or less finite number of examples that you want to find in the data, or if there's a very clear, structured pattern you can express with token rules or regular expressions. For instance, country names, IP addresses or URLs are things you might be able to handle well with a purely rule-based approach."

To be clear, I am not against developing statistical models. But, as the documentation states, it is often more practical to start with a rule-based system. One of my main aims in this course is to provide a solid understanding of what you can and cannot do using just a rule-based system; in fact, I use only one dataset in the entire course, which makes it easier for students to see this distinction.

When you combine a rule-based system with the data visualization technique I describe in this course, you will also gain a very good understanding of your dataset. You can then use this understanding to improve your statistical model, if you choose to build one. In my view, most people barely scratch the surface when it comes to using spaCy rules for text analytics. I hope this course provides a lot of new insight into how to approach this task.
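To give a flavour of the rule-based approach the spaCy documentation describes, here is a minimal sketch (not code from the course) that finds URLs with a single token pattern. It assumes spaCy is installed; a blank English pipeline is enough, since the LIKE_URL token attribute does not require a trained model such as en_core_web_md.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, no trained model needed
# for lexical attributes such as like_url.
nlp = spacy.blank("en")

# One token pattern: match any single token that looks like a URL.
matcher = Matcher(nlp.vocab)
matcher.add("URL", [[{"LIKE_URL": True}]])

doc = nlp("Docs at https://spacy.io and code at https://github.com/explosion")
urls = [doc[start:end].text for _, start, end in matcher(doc)]
print(urls)
```

The same idea extends to IP addresses or country names by swapping in other token patterns or a term list, which is exactly the "finite number of examples" case the documentation recommends rules for.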
Overview
Section 1: About this course
Lecture 1 How this course is different from other spaCy courses
Lecture 2 The best dataset for learning text analytics
Section 2: Exploring spaCy document objects
Lecture 3 Import libraries
Lecture 4 Splitting text into sentences
Lecture 5 Splitting text into words
Lecture 6 Part-of-speech tagging
Lecture 7 Stop words and punctuation
Lecture 8 Text spans
Lecture 9 Dependency Parse Tree
Lecture 10 Named Entity Recognition
Lecture 11 Token is_ attributes
Lecture 12 Token like_ attributes
Lecture 13 More token attributes
Lecture 14 Remaining token attributes
Lecture 15 Visualizing the Subtree
Lecture 16 Visualizing the token head
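As a taste of the Doc object features this section covers (sentences, tokens, lexical attributes), here is a small sketch, not taken from the course notebook. It uses a blank English pipeline plus the built-in "sentencizer" component, so no trained model is required; the course itself uses en_core_web_md.

```python
import spacy

# Blank pipeline plus rule-based sentence splitting.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("spaCy is free. It costs 0 dollars.")

# Splitting text into sentences and words (Lectures 4-5).
sentences = [sent.text for sent in doc.sents]
print(sentences)

# Token is_ and like_ attributes (Lectures 11-12).
print([t.text for t in doc if t.like_num])   # number-like tokens
print([t.text for t in doc if t.is_punct])   # punctuation tokens
```

Attributes that depend on a trained model, such as part-of-speech tags, the dependency parse tree, and named entities, are covered in the lectures using a pretrained pipeline.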
Section 3: spaCy pipelines
Lecture 17 Display pipeline
Lecture 18 Tokenizer is unique
Lecture 19 tagger
Lecture 20 parser
Lecture 21 attribute_ruler
Lecture 22 lemmatizer
Lecture 23 ner
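A quick sketch of pipeline inspection, the theme of this section (this is illustrative, not the course notebook). With en_core_web_md loaded you would see components such as tagger, parser, attribute_ruler, lemmatizer and ner; a blank pipeline is used here so the example runs without downloading a model.

```python
import spacy

nlp = spacy.blank("en")

# The tokenizer is unique (Lecture 18): it is not listed in
# pipe_names because it sits outside the pipeline proper.
print(nlp.pipe_names)       # empty list for a blank pipeline
print(type(nlp.tokenizer))

# Components added with add_pipe do appear in pipe_names.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)
```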
Section 4: Rule-based matching
Lecture 24 Token matcher
Lecture 25 Dependency Matcher based on position
Lecture 26 Dependency Matcher based on the parse tree
Lecture 27 Phrase matcher
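To illustrate one of the matchers covered here, this is a minimal PhraseMatcher sketch (mine, not the course's): matching a finite term list, such as country names, case-insensitively. It assumes only that spaCy is installed; a blank pipeline suffices because the PhraseMatcher needs no trained model.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" makes matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["France", "New Zealand"]
matcher.add("COUNTRY", [nlp.make_doc(t) for t in terms])

doc = nlp("Exports from france rose while New Zealand imports fell.")
hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(hits)
```

The Token and Dependency matchers in the other lectures follow the same add-patterns-then-match flow, but describe tokens by their attributes or their position in the parse tree rather than by exact phrases.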
Section 5: Download the Jupyter notebook
Lecture 28 Download the Jupyter notebook used in this course
Who this course is for
Data science practitioners who want to use spaCy and Natural Language Processing
Anyone who has a spreadsheet where one of the columns is a paragraph of text and wants to extract useful information from that text, so they can use it with the filters (sort, less than, greater than, etc.) that tools like Excel and Airtable apply to the OTHER columns