
Python - Parsing or Regular Expression or Regex or RE (search, split, findall)

🐍 Parsing text is simple in Python. Basic parsing is done with the built-in re (Regular Expression) module.

TRY ON JUPYTER NOTEBOOK

Start by importing the module. This line must come before any use of the regular expression functions:

import re

re.search(pattern, text)

import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term1'
print(re.search(pattern, text))

This searches for 'term1' in the text. Since the term is present, it returns a Match object holding the details of the match, such as its position (span) and the matched text.

import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term2'
print(re.search(pattern, text))

This searches for 'term2' in the text and returns None, since the term is not present.
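Because re.search returns None when nothing matches, calling a method on the result directly would raise an AttributeError. Here is a minimal sketch of guarding against that. The helper name find_term is made up for this illustration and is not part of the re module:

```python
import re

def find_term(pattern, text):
    """Hypothetical helper: return the matched text, or None if absent."""
    match = re.search(pattern, text)
    if match is None:
        return None        # nothing matched, so there is no Match object
    return match.group()   # the exact substring that matched

text = 'This is a string with term1, but it does not have the other term.'
found = find_term('term1', text)
missing = find_term('term2', text)
print(found)
print(missing)
```

Checking for None first is one way to do it; wrapping the call in an if statement works just as well.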

re.search(pattern, text).start()

It returns the index of the first character of the match in the text.

import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term1'
re.search(pattern, text).start()

re.search(pattern, text).end()

It returns the index just past the last character of the match, so text[start:end] gives exactly the matched text.

import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term1'
re.search(pattern, text).end()
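To see how start() and end() fit together, the sketch below slices the original text with the two indexes. It reuses the text from the examples above; the variable names are only for illustration:

```python
import re

text = 'This is a string with term1, but it does not have the other term.'
match = re.search('term1', text)

start = match.start()  # index of the first matched character
end = match.end()      # index just past the last matched character

print(match.span())      # the (start, end) pair in a single call
print(text[start:end])   # slicing with the two indexes recovers the match
```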

re.split(term, text)

re.split(split_term, text) splits the text where the split_term is encountered and returns a list of the split parts.

import re
text = 'This is a string with term1 in between.'
split_term = 'term1'
re.split(split_term, text)
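A quick sketch of what the call above gives back. Note that the split term itself is dropped from the result, and that a term which never occurs leaves the text in one piece:

```python
import re

text = 'This is a string with term1 in between.'

parts = re.split('term1', text)
print(parts)  # two pieces: the text before and after 'term1'

# A term that is not in the text produces a one-item list.
unsplit = re.split('term2', text)
print(unsplit)
```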

re.findall(term, text)

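When you need to find every occurrence of a term in a string, use re.findall(term, text). It returns a list of all the matches. Matching is case-sensitive, so the 'Term1' at the start of the example below is not returned.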

import re
text = 'Term1 and term1 again. In love with term1'
term = 'term1'
re.findall(term, text)
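The example above finds only the two lowercase matches. If you also want 'Term1', one option (not the only one) is the re.IGNORECASE flag, sketched here:

```python
import re

text = 'Term1 and term1 again. In love with term1'

# Default behaviour: case-sensitive, so 'Term1' is skipped.
exact = re.findall('term1', text)
print(exact)

# With re.IGNORECASE the capitalised occurrence is picked up as well.
any_case = re.findall('term1', text, flags=re.IGNORECASE)
print(any_case)
```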

Pattern Recognition Using Metacharacters

* (0...n occurrence)

The asterisk matches zero or more occurrences of the preceding character.

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd*'
re.findall(term, text)

This returns every 's' followed by any number of 'd', including an 's' with no 'd' after it.

+ (1...n occurrence)

Plus matches one or more occurrences of the preceding character. Zero occurrences do not match.

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd+'
re.findall(term, text)

This returns every 's' followed by at least one 'd'. An 's' with no 'd' after it is not matched.

? (0 or 1 occurrence)

The question mark matches zero or one occurrence of the preceding character.

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd?'
re.findall(term, text)

This returns every 's' followed by at most one 'd', so each match is either 's' or 'sd'.
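The three quantifiers are easiest to tell apart side by side. The sketch below runs each one on a shorter text chosen for this illustration:

```python
import re

text = 's sd sdd sddd'

star = re.findall('sd*', text)      # 's' followed by zero or more 'd'
plus = re.findall('sd+', text)      # 's' followed by one or more 'd'
question = re.findall('sd?', text)  # 's' followed by zero or one 'd'

print(star)      # includes the lone 's'
print(plus)      # skips the lone 's'
print(question)  # never takes more than one 'd'
```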

{} (m to n occurrences)

To match a specific number of occurrences, put the count in curly brackets: {n} for exactly n occurrences, or {m,n} for anything from m to n occurrences. See the examples below:

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd{3}'
re.findall(term, text)

This returns every 's' followed by exactly three occurrences of 'd'.

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd{1,3}'
re.findall(term, text)

This returns every 's' followed by one, two, or three occurrences of 'd'.
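Here is a small sketch of the curly-bracket forms on a shorter text picked for this illustration. It also shows the open-ended form {m,}, which means "m or more" and is not covered in the examples above:

```python
import re

text = 's sd sdd sddd sdddd'

exactly_three = re.findall('sd{3}', text)   # 's' plus exactly three 'd'
one_to_three = re.findall('sd{1,3}', text)  # 's' plus one to three 'd'
two_or_more = re.findall('sd{2,}', text)    # 's' plus at least two 'd'

print(exactly_three)
print(one_to_three)
print(two_or_more)
```

Notice that 'sdddd' still yields 'sddd' for the first two patterns: the match simply stops after the third 'd'.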

Character Sets []

A set in square brackets matches any single one of the characters listed inside it.

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = '[sd]'
re.findall(term, text)

This returns every occurrence of 's' or 'd' as a separate item.

import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 's[sd]+'
re.findall(term, text)

This returns every 's' followed by one or more characters that are each either 's' or 'd', in any mix, such as 'sdsd' or 'sssddd'.
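A short sketch of both patterns on a trimmed-down text (chosen for this illustration), to show that the set is checked one character at a time:

```python
import re

text = 'sdsd..sssddd...dsds'

singles = re.findall('[sd]', text)  # each 's' or 'd' on its own
runs = re.findall('s[sd]+', text)   # 's' plus a run of 's'/'d' in any mix

print(singles)
print(runs)
```

The last match is 'sds' rather than 'dsds' because the pattern has to start on an 's'.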

Exclusion by [^]

A ^ at the start of a character set negates it: the set then matches any character that is not listed after the ^.

import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[^!.? ]+'
re.findall(term, text)

This matches runs of characters that are not exclamation marks, full stops, question marks, or spaces, so it returns the words of the text with punctuation and spaces stripped out. The plus sign at the end keeps adjacent matching characters together instead of returning each character separately. Try it without the plus to see the difference.
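The effect of the trailing plus sign is easy to see on a tiny input. A minimal sketch, with a text made up for this illustration:

```python
import re

text = 'Hi! Ok?'

# With +: adjacent allowed characters are grouped into words.
words = re.findall('[^!.? ]+', text)
print(words)

# Without +: every allowed character is returned on its own.
chars = re.findall('[^!.? ]', text)
print(chars)
```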

Character Ranges

A hyphen inside square brackets defines a range of characters to match. Everything outside the range is left out.

import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[a-z]+'
re.findall(term, text)

This returns every run of lowercase letters. The plus sign keeps adjacent lowercase letters together, so a match ends only where a character that is not a lowercase letter appears. Note that 'This' comes back as 'his', because the uppercase 'T' is outside the range.

import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[A-Z]+'
re.findall(term, text)

This returns every run of uppercase letters. The plus sign keeps adjacent uppercase letters together. In this text each uppercase letter stands alone, so the result is 'T', 'B', and 'H'.

import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[A-Za-z]+'
re.findall(term, text)

This returns every run of letters, upper or lower case. The plus sign keeps adjacent letters together, so a match ends only where a non-letter appears.

import re
text = 'This is a string! But it has PunCtuaTion. How can we remove it?'
term = '[A-Z][a-z]+'
re.findall(term, text)

This returns every uppercase letter that is followed by one or more lowercase letters. A match ends at a non-letter or at the next uppercase letter, so 'PunCtuaTion' is split into 'Pun', 'Ctua', and 'Tion'.
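The four range patterns can be compared on one short text. This is only a sketch, and the sample text is made up for the illustration:

```python
import re

text = 'Hello big World'

lower = re.findall('[a-z]+', text)            # runs of lowercase letters
upper = re.findall('[A-Z]+', text)            # runs of uppercase letters
letters = re.findall('[A-Za-z]+', text)       # runs of any letters
capitalised = re.findall('[A-Z][a-z]+', text) # one capital, then lowercase

print(lower)
print(upper)
print(letters)
print(capitalised)
```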

Escape Codes

The following escape codes cover some more patterns you may need to extract. Each pattern is written with an 'r' prefix, which makes it a raw string: Python leaves the backslashes untouched, so the re module receives the escape code exactly as written.
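To see why the 'r' prefix matters, here is a small sketch. It uses \b, the regex word-boundary code, which is not covered in this post but makes the difference obvious, because in an ordinary Python string \b means a backspace character:

```python
import re

# In an ordinary string, '\n' is a single newline character.
# In a raw string, r'\n' is two characters: a backslash and the letter n.
plain_len = len('\n')
raw_len = len(r'\n')
print(plain_len, raw_len)

text = 'term terminal'

# Without r, Python turns \b into a backspace before re ever sees it,
# so the pattern looks for backspace characters and finds nothing.
without_raw = re.findall('\bterm\b', text)

# With r, re receives \b and treats it as a word boundary.
with_raw = re.findall(r'\bterm\b', text)

print(without_raw)
print(with_raw)
```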

Digits Only (r'\d+')

r'\d+' extracts digits from a string. The plus sign keeps adjacent digits together.

import re
text = 'This is a string! It has 3 numbers. 9766 863 513. How can we extract it?'
term = r'\d+'
re.findall(term, text)

Non-Digits Only (r'\D+')

r'\D+' extracts non-digits from a string, which means everything except digits.

import re
text = 'This is a string! It has 3 numbers. 9766 863 513. How can we extract it?'
term = r'\D+'
re.findall(term, text)
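\d and \D are complements of each other, which a short sketch makes clear. The sample text is made up for this illustration:

```python
import re

text = 'Call 9766 or 863 now'

digits = re.findall(r'\d+', text)      # runs of digits
non_digits = re.findall(r'\D+', text)  # everything between the digits

print(digits)
print(non_digits)
```

Note that the non-digit pieces keep their spaces, since a space is also "not a digit".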

Spaces Only (r'\s+')

r'\s+' extracts whitespace from a string. Whitespace includes spaces, tabs, and newlines.

import re
text = 'This is a string!   It has many spaces .   How can we extract it?'
term = r'\s+'
re.findall(term, text)

Non-Spaces Only (r'\S+')

r'\S+' extracts everything except whitespace from a string.

import re
text = 'This is a string!   It has many spaces .   How can we extract it?'
term = r'\S+'
re.findall(term, text)
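A minimal sketch showing that \s covers more than the space character. The sample text, with a tab and a newline in it, is made up for this illustration:

```python
import re

text = 'a  b\tc\nd'  # two spaces, then a tab, then a newline

gaps = re.findall(r'\s+', text)    # runs of whitespace
tokens = re.findall(r'\S+', text)  # runs of non-whitespace

print(gaps)
print(tokens)
```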

Letters and Digits Both (r'\w+')

r'\w+' extracts runs of letters and digits from a string. The underscore also counts as a \w character.

import re
text = 'This is a string!   It has letters and 3 numbers too - 9766 863 513.   How can we extract it?'
term = r'\w+'
re.findall(term, text)

Non-Letters and Non-Digits Only (r'\W+')

r'\W+' extracts everything except letters, digits, and underscores from a string.

import re
text = 'This is a string!   It has letters and 3 numbers too - 9766 863 513.   How can we extract it?'
term = r'\W+'
re.findall(term, text)
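A short sketch of \w and \W together. The sample text is made up for this illustration and includes an underscore to show that it is treated like a letter:

```python
import re

text = 'user_name42 says: hi!'

word_chars = re.findall(r'\w+', text)   # letters, digits, and underscore
other_chars = re.findall(r'\W+', text)  # everything else

print(word_chars)
print(other_chars)
```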
