Parsing text is straightforward in Python. Basic parsing is done with the built-in re (regular expression) library.
Try the examples in a Jupyter notebook, where the value of the last expression in a cell is displayed automatically.
To use regular expressions, start by importing the library:
import re
re.search(pattern, text)
import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term1'
print(re.search(pattern, text))
This searches for 'term1' in the text and, because the term is present, prints a match object describing the match (its span and the matched text).
import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term2'
print(re.search(pattern, text))
This searches for 'term2' in the text and prints None, because the term is not present.
re.search(pattern, text).start()
It returns the index of the first character of the match in the text (indices start at 0). For the text below it returns 22.
import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term1'
re.search(pattern, text).start()
re.search(pattern, text).end()
It returns the index just past the last character of the match, so text[start:end] is exactly the matched text. For the text below it returns 27.
import re
text = 'This is a string with term1, but it does not have the other term.'
pattern = 'term1'
re.search(pattern, text).end()
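Calling .start() or .end() directly on the result of re.search fails when nothing is found, because the result is then None. A minimal sketch of a safer lookup follows; the helper name locate is hypothetical and only used for illustration.

```python
import re

text = 'This is a string with term1, but it does not have the other term.'

def locate(pattern, text):
    """Hypothetical helper: return (start, end) of the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        # No match: calling match.start() here would raise AttributeError.
        return None
    return match.start(), match.end()

found = locate('term1', text)
missing = locate('term2', text)

print(found)    # (22, 27)
print(missing)  # None

# Slicing with start and end gives back the matched text.
start, end = found
print(text[start:end])  # term1
```

Checking for None first keeps the lookup usable on text where the term may or may not appear.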
re.split(term, text)
re.split(split_term, text) splits the text where the split_term is encountered and returns a list of the split parts.
import re
text = 'This is a string with term1 in between.'
split_term = 'term1'
re.split(split_term, text)
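As a small sketch of what re.split returns, the snippet below splits once on a term that is present and once on a term that is absent. The variable names are illustrative only.

```python
import re

text = 'This is a string with term1 in between.'

# The split term itself is dropped; the text on either side is kept.
parts = re.split('term1', text)
print(parts)  # ['This is a string with ', ' in between.']

# When the term does not occur, the list holds the whole text unchanged.
unsplit = re.split('term2', text)
print(unsplit)  # ['This is a string with term1 in between.']
```

Note that the surrounding spaces stay attached to the parts, since only the term is removed.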
re.findall(term, text)
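When multiple instances of a term have to be found in a string, we use re.findall(term, text). It returns a list of all non-overlapping matches. The search is case-sensitive, so the example below returns ['term1', 'term1'] and skips 'Term1'.
import re
text = 'Term1 and term1 again. In love with term1'
term = 'term1'
re.findall(term, text)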
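If the capitalised 'Term1' should be found as well, one option is the re.IGNORECASE flag from the standard library. This is a sketch of that option, not something the examples above rely on.

```python
import re

text = 'Term1 and term1 again. In love with term1'
term = 'term1'

# Default behaviour: matching is case-sensitive.
exact = re.findall(term, text)
print(exact)  # ['term1', 'term1']

# With re.IGNORECASE, upper and lower case letters match each other.
any_case = re.findall(term, text, flags=re.IGNORECASE)
print(any_case)  # ['Term1', 'term1', 'term1']
```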
Pattern Recognition Using Metacharacters
* (0...n occurrence)
The asterisk matches any number of occurrences of the preceding character, including zero.
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd*'
re.findall(term, text)
This returns every 's' followed by any number of 'd' characters, including a lone 's' with no 'd' after it.
+ (1...n occurrence)
The plus sign matches one or more occurrences of the preceding character, so zero occurrences do not count.
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd+'
re.findall(term, text)
This returns every 's' followed by at least one 'd'. An 's' with no 'd' after it is not matched.
? (0 or 1 occurrence)
The question mark matches zero or one occurrence of the preceding character.
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd?'
re.findall(term, text)
This returns every 's' followed by at most one 'd', so each match is either 's' or 'sd'.
{} (m to n occurrences)
Curly brackets fix the number of occurrences of the preceding character: {n} means exactly n, and {m,n} means from m to n. See the examples below:
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd{3}'
re.findall(term, text)
This returns 'sddd' wherever an 's' is followed by at least three 'd' characters. Only the first three are included in the match.
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 'sd{1,3}'
re.findall(term, text)
This returns every 's' followed by one, two, or three 'd' characters, taking as many as are available up to three.
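To compare the four quantifiers side by side, the sketch below runs each pattern on one short string. The sample string is made up for illustration and is deliberately simpler than the text used above.

```python
import re

# Illustrative sample: an 's' followed by zero to four 'd' characters.
sample = 's sd sdd sddd sdddd'

star = re.findall('sd*', sample)         # zero or more 'd'
plus = re.findall('sd+', sample)         # one or more 'd'
optional = re.findall('sd?', sample)     # zero or one 'd'
exactly3 = re.findall('sd{3}', sample)   # exactly three 'd'
one_to_3 = re.findall('sd{1,3}', sample) # one to three 'd'

print(star)      # ['s', 'sd', 'sdd', 'sddd', 'sdddd']
print(plus)      # ['sd', 'sdd', 'sddd', 'sdddd']
print(optional)  # ['s', 'sd', 'sd', 'sd', 'sd']
print(exactly3)  # ['sddd', 'sddd']
print(one_to_3)  # ['sd', 'sdd', 'sddd', 'sddd']
```

The last two results show that a bounded quantifier stops once its upper limit is reached, which is why 'sdddd' contributes only 'sddd'.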
Character Sets []
A set in square brackets matches any single character listed inside it.
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = '[sd]'
re.findall(term, text)
This returns every occurrence of 's' or 'd' as a separate one-character match.
import re
text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
term = 's[sd]+'
re.findall(term, text)
This returns every 's' followed by one or more characters that are each either 's' or 'd', in any mix, for example 'sdsd' or 'sssddd'.
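A short sketch of both behaviours, using made-up sample strings that are shorter than the text above so the results are easy to check by eye:

```python
import re

# A set on its own matches one character at a time.
singles = re.findall('[sd]', 'sd.ds')
print(singles)  # ['s', 'd', 'd', 's']

# 's' followed by one or more characters drawn from the set, in any order.
sample = 'sdsd..sssddd..dsds'
runs = re.findall('s[sd]+', sample)
print(runs)  # ['sdsd', 'sssddd', 'sds']
```

In the last piece, 'dsds', the leading 'd' is skipped because every match has to begin with 's'.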
Exclusion by [^]
A ^ at the start of a character set negates it: the set matches any character except the ones listed after ^.
import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[^!. ?]+'
re.findall(term, text)
This excludes exclamation marks, full stops, spaces, and question marks, and returns a list of the pieces of text between them. The plus sign at the end keeps adjacent allowed characters together, so the text is split into words rather than single characters. Try it without the plus to see the difference.
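The sketch below shows the effect of the plus sign and one way to rebuild cleaned text from the pieces. It assumes the excluded set is the exclamation mark, full stop, space, and question mark; the short sample 'Hi! Ok.' is made up for illustration.

```python
import re

# Without the plus, every allowed character is its own match.
chars = re.findall('[^!. ?]', 'Hi! Ok.')
print(chars)  # ['H', 'i', 'O', 'k']

# With the plus, adjacent allowed characters are grouped into words.
text = 'This is a string! But it has punctuation. How can we remove it?'
words = re.findall('[^!. ?]+', text)
print(words)

# Joining the words with spaces gives the text without punctuation.
cleaned = ' '.join(words)
print(cleaned)  # This is a string But it has punctuation How can we remove it
```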
Character Ranges
A hyphen inside a character set defines a range of characters to match. Everything outside the range is skipped.
import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[a-z]+'
re.findall(term, text)
This returns runs of lowercase letters. The plus sign keeps adjacent letters together, so a match ends only where a character that is not a lowercase letter occurs.
import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[A-Z]+'
re.findall(term, text)
This returns runs of uppercase letters. The plus sign keeps adjacent letters together, so a match ends only where a character that is not an uppercase letter occurs.
import re
text = 'This is a string! But it has punctuation. How can we remove it?'
term = '[A-Za-z]+'
re.findall(term, text)
This returns runs of letters, whether uppercase or lowercase. The plus sign keeps adjacent letters together, so a match ends only where a non-letter occurs.
import re
text = 'This is a string! But it has PunCtuaTion. How can we remove it?'
term = '[A-Z][a-z]+'
re.findall(term, text)
This returns runs that start with one uppercase letter followed by one or more lowercase letters. A match ends at the first character that is not a lowercase letter, so 'PunCtuaTion' is returned as 'Pun', 'Ctua', and 'Tion'.
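The four range patterns can be compared on one short string. The sample below is made up for illustration, and the results are shown as comments.

```python
import re

sample = 'Hello World, this is PunCtuaTion'

lower = re.findall('[a-z]+', sample)
upper = re.findall('[A-Z]+', sample)
letters = re.findall('[A-Za-z]+', sample)
capitalised = re.findall('[A-Z][a-z]+', sample)

print(lower)        # ['ello', 'orld', 'this', 'is', 'un', 'tua', 'ion']
print(upper)        # ['H', 'W', 'P', 'C', 'T']
print(letters)      # ['Hello', 'World', 'this', 'is', 'PunCtuaTion']
print(capitalised)  # ['Hello', 'World', 'Pun', 'Ctua', 'Tion']
```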
Escape Codes
Some more patterns can be extracted with the escape codes below. The patterns are written with an 'r' prefix, which makes them raw strings: Python then leaves the backslash untouched instead of treating it as the start of a string escape, so it reaches the regular expression engine as written.
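A minimal sketch of what the 'r' prefix changes, using only built-in string behaviour; the sample string 'a1b22' is made up for illustration.

```python
import re

# In a normal string, backslash-n is one character (a newline).
plain = '\n'
# In a raw string, it stays as two characters: a backslash and the letter n.
raw = r'\n'
print(len(plain), len(raw))  # 1 2

# Without the prefix, the backslash has to be doubled to get the same pattern.
same = (r'\d+' == '\\d+')
print(same)  # True

digits = re.findall(r'\d+', 'a1b22')
print(digits)  # ['1', '22']
```

Writing patterns as raw strings avoids having to double every backslash.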
Digits Only (r'\d+')
r'\d+' is used to extract digits from a string. The plus sign keeps adjacent digits together.
import re
text = 'This is a string! It has 3 numbers. 9766 863 513. How can we extract it?'
term = r'\d+'
re.findall(term, text)
Non-Digits Only (r'\D+')
r'\D+' is used to extract non-digits from a string, that is, runs of everything except digits.
import re
text = 'This is a string! It has 3 numbers. 9766 863 513. How can we extract it?'
term = r'\D+'
re.findall(term, text)
Spaces Only (r'\s+')
r'\s+' is used to extract whitespace (spaces, tabs, and newlines) from a string.
import re
text = 'This is a string! It has many spaces . How can we extract it?'
term = r'\s+'
re.findall(term, text)
Non-Spaces Only (r'\S+')
r'\S+' is used to extract everything except whitespace from a string.
import re
text = 'This is a string! It has many spaces . How can we extract it?'
term = r'\S+'
re.findall(term, text)
Letters and Digits Both (r'\w+')
r'\w+' is used to extract word characters from a string: letters, digits, and the underscore.
import re
text = 'This is a string! It has letters and 3 numbers too - 9766 863 513. How can we extract it?'
term = r'\w+'
re.findall(term, text)
Non-Letters and Non-Digits Only (r'\W+')
r'\W+' is used to extract everything except letters, digits, and underscores from a string.
import re
text = 'This is a string! It has letters and 3 numbers too - 9766 863 513. How can we extract it?'
term = r'\W+'
re.findall(term, text)
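To see the six escape codes next to each other, the sketch below applies all of them to one short string. The sample strings are made up for illustration.

```python
import re

sample = 'Room 42, floor 7'

digits = re.findall(r'\d+', sample)      # runs of digits
non_digits = re.findall(r'\D+', sample)  # runs of anything but digits
spaces = re.findall(r'\s+', sample)      # runs of whitespace
non_spaces = re.findall(r'\S+', sample)  # runs of anything but whitespace
words = re.findall(r'\w+', sample)       # runs of letters, digits, underscore
non_words = re.findall(r'\W+', sample)   # runs of everything else

print(digits)      # ['42', '7']
print(non_digits)  # ['Room ', ', floor ']
print(spaces)      # [' ', ' ', ' ']
print(non_spaces)  # ['Room', '42,', 'floor', '7']
print(words)       # ['Room', '42', 'floor', '7']
print(non_words)   # [' ', ', ', ' ']

# The underscore counts as a word character, so this stays in one piece.
underscore = re.findall(r'\w+', 'snake_case')
print(underscore)  # ['snake_case']
```

Each uppercase code is the complement of its lowercase counterpart, which is why the paired results together cover the whole string.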