Hello everybody,
Michael here, and today I thought I’d get back into some Python lessons. In particular, I wanted to start a new series of Python lessons on a topic I’ve been wanting to cover: NLP (or natural language processing).
See, Python isn’t just good for mathematical operations; there’s so much more you can do with it (computer vision, natural language processing, graphic design, etc.). Heck, if I really wanted to, I could post only Python content on this blog and still have enough material to keep it running for another 6-10 years.
In the context of Python, what is natural language processing? It’s basically a concept that encompasses how computers process natural language. Natural language is basic, conversational language (whether English, Spanish, or any other language on the planet), much like what you’d use when talking to your buddies or writing a job resume.
See, when you’re writing a Python program (or any program, really), you’re not feeding natural language to the computer for processing. Rather, what you feed to the computer are programming instructions (loops, conditions, print statements; they all count as programming instructions). Humans don’t speak in code, and computers don’t process instructions in “people-talk”, if you will. This is where natural language processing comes in: developers (depending on the program being created) sometimes want to process natural language in their programs for a variety of purposes, such as data analysis, finding certain parts of speech, and so on.
Now that I’ve given you a basic NLP intro, let’s dive into some coding! To start exploring natural language processing with Python, let’s first pip install the NLTK package by running this line at the command prompt (the regular command prompt, not the Anaconda prompt, if you happen to have that): pip install nltk.
- Remember to run the pip list command to see if you already have the nltk package installed on your device.
Once you get the NLTK package installed on your device, let’s start coding!
Take a look at the code below; I’ll discuss it after I show you the example:
import nltk
nltk.download('punkt')  # grab the pre-trained Punkt tokenizer models
test = input('Please type in a string: ')
nltk.word_tokenize(test)
Please type in a string: Don't worry I won't be going anywhere.
['Do', "n't", 'worry', 'I', 'wo', "n't", 'be', 'going', 'anywhere', '.']
In this example, I was demonstrating one of the most basic concepts of NLP: tokenization. Tokenization is simply the process of splitting up text strings, either by word or by sentence (more on the sentence part later).
For this example to work, I imported the nltk package and downloaded NLTK’s punkt module. The reason for doing this is that punkt is a good pre-trained tokenizer model, and for the tokenization process to work, you’ll need to download a pre-trained model (which doesn’t come with the NLTK package’s pip installation, sadly).
After importing the NLTK package and downloading the pre-trained model, I typed in a sentence that I wanted to tokenize and then ran the NLTK package’s word_tokenize method on it. The last line of the example contains the output after the text is tokenized; as you can see, the tokenized output is displayed as a list of the words in the input sentence (stored in the test variable).
Pay attention to the list of words that was generated. With words like be, going, and worry, nothing too remarkable, right? However, look at the way the two contractions in the test sentence were tokenized. Don’t was tokenized as Do and n't, while won’t was tokenized as wo and n't. Why might that be? Well, NLTK’s word tokenizer follows Penn Treebank conventions, which treat common English contractions as two separate words: don’t is shorthand for “do not” and won’t is shorthand for “will not”. However, just because a word in the string has an apostrophe doesn’t mean it will automatically be split in two; for instance, “Côte d’Ivoire” (the Ivory Coast nation in Africa) wouldn’t be split, as it’s not a common English contraction.
Pretty neat stuff right? Now, let’s take a look at sentence-based tokenization:
import nltk
nltk.download('punkt')  # grab the pre-trained Punkt tokenizer models
test = input('Please type in a string: ')
nltk.sent_tokenize(test)  # split by sentence instead of by word
Please type in a string: How was your Memorial Day weekend? Mine was fun. Lots of sun!
['How was your Memorial Day weekend?', 'Mine was fun.', 'Lots of sun!']
To perform sentence-based tokenization, you’d utilize NLTK’s sent_tokenize method (just as you’d utilize word_tokenize for word-based tokenization). Just like word_tokenize, sent_tokenize returns a list of strings derived from the larger string, but in this case it splits the string into sentences rather than individual words. Notice how sent_tokenize picks up on where the punctuation is located in order to split the string into sentences.
Thanks for reading,
Michael