Mastering Regular Expressions in Python

Published Sept. 7, 2023, 3:26 p.m.

OUTLINE:

  1. Introduction to Regular Expressions
  2. Basic Pattern Matching
  3. Predefined Character Classes
  4. Quantifiers for Pattern Matching
  5. Important Functions of the re Module
    • compile()
    • finditer()
    • match()
    • fullmatch()
    • search()
    • findall()
    • finditer()
    • sub()
    • subn()
    • split()
  6. Examples to Master Regular Expressions
    • Validate Yava Language Identifiers
    • Check Mobile Numbers
    • Extract Mobile Numbers from a File
    • Web Scraping with Regex
    • Gmail Address Validation
    • Telangana Vehicle Registration
    • Flexible Mobile Number Checker

🔍 Mastering Regular Expressions in Python 🧩

Are you ready to dive into the fascinating world of Regular Expressions in Python? 🚀 Let's explore this powerful tool that helps you represent and manipulate text patterns effortlessly.

What Are Regular Expressions?

Regular Expressions, often referred to as regex or regexp, are declarative mechanisms for representing and manipulating text patterns. 📜 They're like magic spells for text, allowing you to find, match, and manipulate strings based on specific patterns.

Examples of Regular Expressions in Action ✨

1️⃣ You can write a regular expression to represent all mobile numbers.
2️⃣ You can write a regular expression to represent all email addresses.

Key Applications of Regular Expressions 🌟

  1. Validation Frameworks: Regular expressions are vital for creating validation logic in applications.
  2. Pattern Matching: They power pattern-matching tools such as Ctrl+F search in Windows and grep in UNIX.
  3. Translators: Regular expressions are used in compilers and interpreters, most visibly in the lexical analysis (tokenizing) stage.
  4. Digital Circuits: The finite-state machines behind regular expressions also play a role in designing digital circuits.
  5. Communication Protocols: The same finite-state ideas are used when specifying protocols like TCP/IP and UDP.

Python's re Module 🐍

Python provides the re module, which offers several built-in functions to work with regular expressions effortlessly.

Core Functions:

  1. compile(): Compiles a pattern into a reusable Pattern object (re.Pattern).
  2. finditer(): Returns an iterator yielding Match objects for every match.

Match Object Methods:

  • start(): Returns the start index of the match.
  • end(): Returns the index just past the last matched character (that is, the end index + 1).
  • group(): Returns the matched string.
import re

pattern = re.compile("ab")
matcher = pattern.finditer("abaababa")

count = 0
for match in matcher:
    count += 1
    print(match.start(), "...", match.end(), "...", match.group())

print("The number of occurrences:", count)

Output:

0 ... 2 ... ab
3 ... 5 ... ab
5 ... 7 ... ab
The number of occurrences: 3
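
As a quick check of what start() and end() mean, here is a minimal sketch that reuses the same pattern and string as above. It shows that the two values behave like ordinary slice bounds, so slicing the target string with them gives back group(). The variable names target and piece are only for illustration.

```python
import re

target = "abaababa"
pattern = re.compile("ab")

for match in pattern.finditer(target):
    # end() is one past the last matched character, so it works as a slice bound
    piece = target[match.start():match.end()]
    print(match.span(), piece, piece == match.group())
```

span() simply returns the (start, end) pair in one call.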

Character Classes 🧩

Character classes allow you to search for groups of characters:

  • [abc]: Matches either 'a', 'b', or 'c'.
  • [^abc]: Matches any character except 'a', 'b', or 'c'.
  • [a-z]: Matches any lowercase letter.
  • [A-Z]: Matches any uppercase letter.
  • [a-zA-Z]: Matches any letter.
  • [0-9]: Matches any digit from 0 to 9.
  • [a-zA-Z0-9]: Matches any alphanumeric character.
  • [^a-zA-Z0-9]: Matches any special character (non-alphanumeric).

Character Class Examples 🧩

  1. [abc]: Matches either 'a', 'b', or 'c'.

Example:

import re

pattern = re.compile("[abc]")
result = pattern.search("The apple is on the table.")
if result:
    print(result.group())  # Output: 'a'
  2. [^abc]: Matches any character except 'a', 'b', or 'c'.

Example:

import re

pattern = re.compile("[^abc]")
result = pattern.search("The apple is on the table.")
if result:
    print(result.group())  # Output: 'T' (matches the first non-'abc' character)
  3. [a-z]: Matches any lowercase letter.

Example:

import re

pattern = re.compile("[a-z]")
result = pattern.search("The Quick Brown Fox")
if result:
    print(result.group())  # Output: 'h' (matches the first lowercase letter)
  4. [A-Z]: Matches any uppercase letter.

Example:

import re

pattern = re.compile("[A-Z]")
result = pattern.search("The Quick Brown Fox")
if result:
    print(result.group())  # Output: 'T' (matches the first uppercase letter)
  5. [a-zA-Z]: Matches any letter.

Example:

import re

pattern = re.compile("[a-zA-Z]")
result = pattern.search("12345 Hello World!")
if result:
    print(result.group())  # Output: 'H' (matches the first alphabet character)
  6. [0-9]: Matches any digit from 0 to 9.

Example:

import re

pattern = re.compile("[0-9]")
result = pattern.search("The price is $25.99")
if result:
    print(result.group())  # Output: '2' (matches the first digit)
  7. [a-zA-Z0-9]: Matches any alphanumeric character.

Example:

import re

pattern = re.compile("[a-zA-Z0-9]")
result = pattern.search("User123 is online!")
if result:
    print(result.group())  # Output: 'U' (matches the first alphanumeric character)
  8. [^a-zA-Z0-9]: Matches any special character (non-alphanumeric).

Example:

import re

pattern = re.compile("[^a-zA-Z0-9]")
result = pattern.search("Hello World!")
if result:
    print(result.group())  # Output: ' ' (matches the first non-alphanumeric character, which is a space)

Explore these character classes in your regex patterns to customize your searches. 🕵️‍♀️
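
The examples above use search(), which stops at the first hit. As a small sketch of the same classes applied to a whole string, the snippet below uses findall() to collect every match. The sample text and the variable names are only illustrative.

```python
import re

text = "User123 is online!"

letters = re.findall("[a-zA-Z]", text)        # every letter
digits = re.findall("[0-9]", text)            # every digit
specials = re.findall("[^a-zA-Z0-9]", text)   # everything else

print(len(letters), "letters")
print(digits)    # ['1', '2', '3']
print(specials)  # [' ', ' ', '!']
```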

Predefined Character Classes 🧩

Use predefined character classes to simplify your patterns:

  • \s: Matches any whitespace character (space, tab, newline, and so on).
  • \S: Matches any character except whitespace.
  • \d: Matches any digit from 0 to 9.
  • \D: Matches any character except a digit.
  • \w: Matches any word character (letters, digits, or underscore).
  • \W: Matches any character except a word character (special characters).
  • .: Matches any character, including special characters, except a newline (unless the re.DOTALL flag is set).

These shortcuts are incredibly useful for common patterns like spaces, digits, and more! 🧙‍♂️

Here are examples for each of the predefined character classes:

  1. \s: Matches any whitespace character.
import re

text = "Hello World"
matches = re.findall(r"\s", text)
print(matches)  # Output: [' ']
  2. \S: Matches any character except whitespace.
import re

text = "Hello World"
matches = re.findall(r"\S", text)
print(matches)  # Output: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
  3. \d: Matches any digit from 0 to 9.
import re

text = "The price is $25.99"
matches = re.findall(r"\d", text)
print(matches)  # Output: ['2', '5', '9', '9']
  4. \D: Matches any character except a digit.
import re

text = "The price is $25.99"
matches = re.findall(r"\D", text)
print(matches)  # Output: ['T', 'h', 'e', ' ', 'p', 'r', 'i', 'c', 'e', ' ', 'i', 's', ' ', '$', '.']
  5. \w: Matches any word character (letters, digits, or underscore).
import re

text = "User123 is online!"
matches = re.findall(r"\w", text)
print(matches)  # Output: ['U', 's', 'e', 'r', '1', '2', '3', 'i', 's', 'o', 'n', 'l', 'i', 'n', 'e']
  6. \W: Matches any character except a word character (special characters).
import re

text = "User123 is online!"
matches = re.findall(r"\W", text)
print(matches)  # Output: [' ', ' ', '!']
  7. .: Matches any character, including special characters, except a newline by default.
import re

text = "Hello World!"
matches = re.findall(r".", text)
print(matches)  # Output: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!']

These examples demonstrate how to use predefined character classes in regular expressions to match specific types of characters or character groups in text strings.
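
Two details are easy to miss with these shortcuts: \s covers tabs and newlines as well as plain spaces, and the dot skips newlines unless you pass re.DOTALL. The following is a minimal sketch of both behaviors on a made-up string.

```python
import re

text = "a b\tc\nd"  # contains a space, a tab, and a newline

whitespace = re.findall(r"\s", text)
dot_default = re.findall(r".", text)
dot_all = re.findall(r".", text, re.DOTALL)

print(whitespace)   # [' ', '\t', '\n']
print(dot_default)  # the newline is skipped
print(dot_all)      # the newline is included
```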

Quantifiers 🔢

Quantifiers help you specify the number of occurrences to match:

  • a: Matches exactly one 'a'.
  • a+: Matches at least one 'a'.
  • a*: Matches any number of 'a's (including zero).
  • a?: Matches at most one 'a' (zero or one occurrence).
  • a{m}: Matches exactly 'm' occurrences of 'a'.
  • a{m,n}: Matches between 'm' and 'n' occurrences of 'a'.

With quantifiers, you can fine-tune your regex patterns for precise matches. 🧰 Here are examples for each of these quantifiers:

  1. a: Matches exactly one 'a'.
import re

text = "An apple a day keeps the doctor away."
matches = re.findall(r"a", text)
print(matches)  # Output: ['a', 'a', 'a', 'a', 'a'] (the uppercase 'A' in "An" is not matched)
  2. a+: Matches at least one 'a'.
import re

text = "She saw a beautiful sunset."
matches = re.findall(r"a+", text)
print(matches)  # Output: ['a', 'a', 'a']
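  3. a*: Matches any number of 'a's (including zero).
import re

text = "The cat sat on the mat."
matches = re.findall(r"a*", text)
print(matches)
# Output (Python 3.7+): an empty match at every position where no 'a' starts, plus the three 'a's:
# ['', '', '', '', '', 'a', '', '', '', 'a', '', '', '', '', '', '', '', '', '', '', 'a', '', '', '']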
  4. a?: Matches at most one 'a' (zero or one occurrence). In this example the optional character is 'u'.
import re

text = "color or colour, choose your favorite."
matches = re.findall(r"colou?r", text)
print(matches)  # Output: ['color', 'colour']
  5. a{m}: Matches exactly 'm' occurrences of 'a'.
import re

text = "The aardvark walked along the road."
matches = re.findall(r"a{2}", text)
print(matches)  # Output: ['aa']
  6. a{m,n}: Matches between 'm' and 'n' occurrences of 'a'.
import re

text = "The meeting is scheduled for aaamorningaaa."
matches = re.findall(r"a{2,3}", text)
print(matches)  # Output: ['aaa', 'aaa']

These examples illustrate how to use quantifiers in regular expressions to match specific numbers of occurrences of a character or a group of characters within a text string.
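
The quantifier examples above all repeat a single literal 'a' or 'u'. As a rough sketch of the "group of characters" case, the snippet below applies the same quantifiers to a predefined class and to a parenthesised group. The sample text is made up for illustration.

```python
import re

text = "Order 42 shipped in 2023, ref 1234567."

# \d{2,4} is greedy: it takes up to four digits at a time
chunks = re.findall(r"\d{2,4}", text)
print(chunks)  # ['42', '2023', '1234', '567']

# A quantifier after a parenthesised group repeats the whole group
print(re.fullmatch(r"(ab){3}", "ababab") is not None)  # True
print(re.fullmatch(r"(ab){3}", "abab") is not None)    # False
```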

Important Functions of re Module 📚

  1. match(): Checks if a pattern matches at the beginning of a target string.
  2. fullmatch(): Checks if a pattern matches the entire target string.
  3. search(): Searches for the pattern anywhere in the target string.
  4. findall(): Finds all occurrences of the pattern.
  5. finditer(): Returns an iterator with Match objects for each match.
  6. sub(): Replaces matched patterns with a specified string.
  7. subn(): Similar to sub(), but also returns the number of replacements.
  8. split(): Splits a string based on a pattern.
  9. compile(): Compiles a pattern into a reusable Pattern object (re.Pattern).
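
Since match(), fullmatch(), and search() are easy to mix up, here is a small side-by-side sketch on one target string. It is only an illustration of where each function is willing to look, not a complete tour of their options.

```python
import re

target = "abcabdefg"

at_start = re.match("abd", target)      # None: "abd" is not at the beginning
anywhere = re.search("abd", target)     # found later in the string
whole = re.fullmatch("abc", target)     # None: "abc" is only a prefix

print(at_start)
print(anywhere.span())                  # (3, 6)
print(whole)
print(re.match("abc", target).span())   # (0, 3)
```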

1. compile():

import re

pattern = re.compile("ab")
matcher = pattern.finditer("abaababa")

count = 0
for match in matcher:
    count += 1
    print(match.start(), "...", match.end(), "...", match.group())

print("The number of occurrences:", count)

2. finditer():

import re

pattern = re.compile("ab")
matcher = pattern.finditer("abaababa")

count = 0
for match in matcher:
    count += 1
    print(match.start(), "...", match.end(), "...", match.group())

print("The number of occurrences:", count)

3. match():

import re

s = "abcabdefg"
m = re.match("abc", s)

if m is not None:
    print("Match is available at the beginning of the String")
    print("Start Index:", m.start(), "and End Index:", m.end())
else:
    print("Match is not available at the beginning of the String")

4. fullmatch():

import re

s = "ababab"
m = re.fullmatch("ababab", s)

if m is not None:
    print("Full String Matched")
else:
    print("Full String not Matched")

5. search():

import re

s = "abaaaba"
m = re.search("aaa", s)

if m is not None:
    print("Match is available")
    print("First Occurrence of match with start index:", m.start(), "and end index:", m.end())
else:
    print("Match is not available")

6. findall():

import re

text = "My phone number is 1234567890, and my friend's number is 9876543210."
numbers = re.findall(r"[7-9]\d{9}", text)  # raw string keeps \d from being read as a string escape

print(numbers)

7. finditer():

import re

text = "My phone number is 1234567890, and my friend's number is 9876543210."
matcher = re.finditer(r"[7-9]\d{9}", text)

for match in matcher:
    print(match.start(), "...", match.end(), "...", match.group())

8. sub():

import re

text = "My phone number is 1234567890."
new_text = re.sub(r"\d{10}", "XXXXXXXXXX", text)

print(new_text)

9. subn():

import re

text = "My phone number is 1234567890."
new_text, replacements = re.subn(r"\d{10}", "XXXXXXXXXX", text)

print("Result String:", new_text)
print("The number of replacements:", replacements)
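
split() appears in the function list but has no example of its own, so here is a minimal sketch. The sample text and the separator pattern are assumptions chosen for illustration.

```python
import re

text = "apple, banana;cherry  date"

# Split on commas, semicolons, or whitespace, treating runs of them as one separator
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits how many splits are made
first_only = re.split(r"[,;\s]+", text, maxsplit=1)
print(first_only)  # ['apple', 'banana;cherry  date']
```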

These functions are your toolbox for regex operations in Python. 🧰🐍

Examples to Master Regular Expressions 🧩

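  1. Validate Yava Language Identifiers: Represent and validate Yava language identifiers, where the first character is a lowercase letter from a to k, the second is a digit divisible by 3 (0, 3, 6, or 9), and any remaining characters are letters, digits, or #.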
  2. Check Mobile Numbers: Verify if a given number is a valid 10-digit mobile number.
  3. Extract Mobile Numbers: Extract mobile numbers mixed with text from a file.
  4. Web Scraping with Regex: Use regex for web scraping tasks, like extracting titles from websites.
  5. Gmail Address Validation: Validate Gmail email addresses.
  6. Telangana Vehicle Registration: Check if a vehicle registration number is valid for Telangana state.
  7. Flexible Mobile Number Checker: Verify mobile numbers of varying lengths: 10 digits, 11 digits with a leading 0, or 12 digits with a leading 91.

Examples for the Additional Tasks:

Validate Yava Language Identifiers:

import re

def is_valid_yava_identifier(identifier):
    pattern = re.compile("[a-k][0369][a-zA-Z0-9#]*")
    if pattern.fullmatch(identifier):
        return True
    else:
        return False

identifier1 = "a6kk9z##"
identifier2 = "k9b876"
identifier3 = "k7b9"

print(f"{identifier1} is {'valid' if is_valid_yava_identifier(identifier1) else 'invalid'} Yava Identifier")
print(f"{identifier2} is {'valid' if is_valid_yava_identifier(identifier2) else 'invalid'} Yava Identifier")
print(f"{identifier3} is {'valid' if is_valid_yava_identifier(identifier3) else 'invalid'} Yava Identifier")

Check Mobile Numbers:

import re

def is_valid_mobile_number(number):
    pattern = re.compile(r"[7-9]\d{9}")
    if pattern.fullmatch(number):
        return True
    else:
        return False

number1 = "9898989898"
number2 = "6786786787"
number3 = "898989"

print(f"{number1} is {'valid' if is_valid_mobile_number(number1) else 'invalid'} Mobile Number")
print(f"{number2} is {'valid' if is_valid_mobile_number(number2) else 'invalid'} Mobile Number")
print(f"{number3} is {'valid' if is_valid_mobile_number(number3) else 'invalid'} Mobile Number")

Extract Mobile Numbers from a File:

import re

with open("input.txt", "r") as f1, open("output.txt", "w") as f2:
    for line in f1:
        numbers = re.findall(r"[7-9]\d{9}", line)
        for n in numbers:
            f2.write(n + "\n")

print("Extracted all Mobile Numbers into output.txt")
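
One caveat with the extraction above: the pattern also matches ten digits that sit inside a longer run of digits. The sketch below is one possible way to guard against that, using lookarounds, which this article does not otherwise cover. The io.StringIO objects are in-memory stand-ins for input.txt and output.txt, and the sample data is hypothetical.

```python
import io
import re

# In-memory stand-ins for input.txt and output.txt (hypothetical sample data)
source = io.StringIO("Call 9876543210 or 8123456789.\nOrder id 5598765432101.\n")
sink = io.StringIO()

# (?<!\d) and (?!\d) reject matches that are part of a longer run of digits
pattern = re.compile(r"(?<!\d)[7-9]\d{9}(?!\d)")

for line in source:
    for number in pattern.findall(line):
        sink.write(number + "\n")

print(sink.getvalue())
```

Without the lookarounds, the order id on the second line would contribute a false match.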

Web Scraping with Regex:

import re
import urllib.request

sites = ["google", "rediff"]
for s in sites:
    print("Searching...", s)
    u = urllib.request.urlopen("http://" + s + ".com")
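    # Decode the response bytes; str() on bytes would give the "b'...'" repr instead of the page text
    text = u.read().decode("utf-8", errors="replace")
    # Non-greedy match stops at the first closing tag; re.S lets '.' span newlines
    title = re.findall(r"<title>.*?</title>", text, re.I | re.S)
    print(title[0] if title else "No <title> found")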
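
The scraping example above prints the title together with its tags. As a small offline sketch, the snippet below uses a capture group to pull out only the text between the tags. The html string is a made-up stand-in for a downloaded page, so no network access is needed.

```python
import re

# Hypothetical page content standing in for a real download
html = "<html><head><TITLE>Example Domain</TITLE></head><body><title>not this</title></body></html>"

# (.*?) captures the title text lazily, so the match stops at the first closing tag
match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
if match:
    print(match.group(1))  # Example Domain
else:
    print("No title found")
```

For anything beyond a quick extraction like this, an HTML parser is usually a safer choice than regex.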

Gmail Address Validation:

import re

def is_valid_gmail_address(email):
    pattern = re.compile(r"\w[a-zA-Z0-9_.]*@gmail[.]com")
    if pattern.fullmatch(email):
        return True
    else:
        return False

email1 = "durgatoc@gmail.com"
email2 = "durgatoc"

print(f"{email1} is {'valid' if is_valid_gmail_address(email1) else 'invalid'} Gmail Address")
print(f"{email2} is {'valid' if is_valid_gmail_address(email2) else 'invalid'} Gmail Address")

Telangana Vehicle Registration:

import re

def is_valid_telangana_vehicle_registration(registration_number):
    pattern = re.compile(r"TS[012][0-9][A-Z]{2}\d{4}")
    if pattern.fullmatch(registration_number):
        return True
    else:
        return False

registration1 = "TS07EA7777"
registration2 = "TS07KF0786"
registration3 = "AP07EA7898"

print(f"{registration1} is {'valid' if is_valid_telangana_vehicle_registration(registration1) else 'invalid'} Telangana Vehicle Registration")
print(f"{registration2} is {'valid' if is_valid_telangana_vehicle_registration(registration2) else 'invalid'} Telangana Vehicle Registration")
print(f"{registration3} is {'valid' if is_valid_telangana_vehicle_registration(registration3) else 'invalid'} Telangana Vehicle Registration")

Flexible Mobile Number Checker:

import re

def is_valid_flexible_mobile_number(number):
    # Optional prefix 0 or 91, followed by a 10-digit number starting with 7, 8, or 9
    pattern = re.compile(r"(0|91)?[7-9][0-9]{9}")
    if pattern.fullmatch(number):
        return True
    else:
        return False

number1 = "9898989898"
number2 = "918989898989"
number3 = "6786786787"

print(f"{number1} is {'valid' if is_valid_flexible_mobile_number(number1) else 'invalid'} Mobile Number")
print(f"{number2} is {'valid' if is_valid_flexible_mobile_number(number2) else 'invalid'} Mobile Number")
print(f"{number3} is {'valid' if is_valid_flexible_mobile_number(number3) else 'invalid'} Mobile Number")
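
Once a flexible number validates, you often want just the 10-digit part. The following sketch wraps that part in a second capture group and returns it. The function name normalize_mobile_number is hypothetical and not part of the original example.

```python
import re

# Group 1 is the optional prefix (0 or 91); group 2 is the 10-digit number itself
pattern = re.compile(r"(0|91)?([7-9][0-9]{9})")

def normalize_mobile_number(number):
    """Return the bare 10-digit number, or None if the input is not valid."""
    m = pattern.fullmatch(number)
    return m.group(2) if m else None

print(normalize_mobile_number("9898989898"))    # 9898989898
print(normalize_mobile_number("918989898989"))  # 8989898989
print(normalize_mobile_number("09898989898"))   # 9898989898
print(normalize_mobile_number("6786786787"))    # None
```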

These examples cover a wide range of tasks you can accomplish with regular expressions in Python. 🧩✨ These practical examples will boost your regex skills and empower you to tackle real-world tasks. 🛠️ Now, armed with this regex knowledge, you can conquer text manipulation challenges like a pro! 🏆 Happy coding! 🚀🐍