Extracting information from text
Once text has been cleaned and split into manageable pieces, the next practical step is to pull useful signals out of it. Real programs rarely consume raw text directly. They extract names, numbers, labels, or markers and then make decisions based on what was found. This lesson introduces simple ways Python programs extract structured information from unstructured text and turn it into data that program logic can work with.
Identifying simple patterns in text
Text often contains repeated shapes. Dates look similar to other dates. Identifiers follow familiar formats. Headings, tags, and markers tend to repeat.
At this stage, we focus on recognizing simple patterns rather than parsing the text in full. The goal is to notice that part of a string matches something we care about.
In practice, this usually starts with checking whether a substring appears inside a larger block of text.
html = "<h1>Mars</h1>"
if "<h1>" in html:
    print("Found a heading")
This kind of pattern check is crude, but it is often enough to decide what to do next.
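The same membership test extends to several markers at once. The following is a minimal sketch, not a definitive approach: the `MARKERS` table and the `detect_markers` helper are hypothetical names chosen for illustration, and the sample string is invented.

```python
# Hypothetical table of marker names and the substrings that signal them.
MARKERS = {
    "heading": "<h1>",
    "paragraph": "<p>",
    "link": "<a ",
}


def detect_markers(text):
    """Return the names of all known markers that appear in text."""
    return [name for name, marker in MARKERS.items() if marker in text]


found = detect_markers("<h1>Mars</h1><p>The red planet.</p>")
print(found)  # ['heading', 'paragraph']
```

The result is a list of names rather than a single yes or no, so later code can decide what to extract based on which markers were seen.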
Extracting structured information from unstructured text
Once a pattern is detected, the next step is to pull out the meaningful part. This turns vague text into a concrete value the program can store.
A common approach is to locate known markers in the text and keep the part between them.
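html = "<h1>Mars</h1>"
# This assumes both tags are present; find() returns -1 when a marker is missing.
start = html.find("<h1>") + len("<h1>")  # index just past the opening tag
end = html.find("</h1>", start)  # index of the closing tag, searched from start
title = html[start:end]  # "Mars"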
The result is no longer HTML. It is a plain string that represents a specific piece of information.
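Slicing between markers gives misleading results when a marker is absent, because `str.find` returns -1 instead of raising an error. One way to guard against that, sketched here under the assumption that returning `None` is an acceptable "not found" signal, is a small helper. The name `extract_between` is hypothetical.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None  # opening marker not found
    start += len(start_marker)  # move past the opening marker

    # Search for the closing marker only after the opening one.
    end = text.find(end_marker, start)
    if end == -1:
        return None  # closing marker not found
    return text[start:end]


title = extract_between("<h1>Mars</h1>", "<h1>", "</h1>")
missing = extract_between("No heading here", "<h1>", "</h1>")
print(title)    # Mars
print(missing)  # None
```

Callers can then test the result with `if title is not None:` before storing it.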
Using basic pattern matching techniques
Python provides tools that make pattern-based extraction more explicit. Regular expressions are one such tool, and even simple patterns can be useful without mastering the full feature set.
Here we match the text inside an <h1> tag.
import re
html = "<h1>Mars</h1>"
match = re.search(r"<h1>(.*?)</h1>", html)
if match:
    title = match.group(1)
Pattern matching makes the intent clearer: we are not just slicing strings, we are describing the shape of what we are looking for.
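Two other features of the standard `re` module fit the same idea: `re.findall` collects every match instead of only the first, and named groups label the parts of a match. The sketch below is illustrative only. The sample string and its date are invented, and the date pattern checks shape (four digits, two digits, two digits), not whether the date is valid.

```python
import re

# Invented sample text containing two headings and one date-shaped string.
html = "<h1>Mars</h1><p>Updated 2024-03-15</p><h1>Venus</h1>"

# findall returns the captured group for every non-overlapping match.
titles = re.findall(r"<h1>(.*?)</h1>", html)

# Named groups (?P<name>...) let us refer to parts of the match by name.
date_match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", html)
date_parts = date_match.groupdict() if date_match else None

print(titles)      # ['Mars', 'Venus']
print(date_parts)  # {'year': '2024', 'month': '03', 'day': '15'}
```

The captured values are still strings; converting them with `int()` is a separate step.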
Converting extracted information into structured data
Extracted values become more useful when they are grouped into familiar data structures. Strings turn into fields. Multiple fields turn into dictionaries or lists.
page = {
    "title": title,
    "source": "planet_page.html"
}
At this point, text has been transformed into data. The program no longer needs to re-interpret the original string to understand what it represents.
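When there are several documents, the same grouping step can be repeated to produce a list of dictionaries. This is a rough sketch under stated assumptions: the `build_page_record` helper, the file names, and the sample HTML are all hypothetical, and a missing heading is recorded as `None` rather than treated as an error.

```python
import re


def build_page_record(html, source):
    """Extract the <h1> title from html and pair it with its source name."""
    match = re.search(r"<h1>(.*?)</h1>", html)
    return {
        "title": match.group(1) if match else None,  # None when no heading
        "source": source,
    }


# Hypothetical input: source name mapped to raw page text.
raw_pages = {
    "mars.html": "<h1>Mars</h1>",
    "notes.html": "<p>No heading here</p>",
}

pages = [build_page_record(html, source) for source, html in raw_pages.items()]
print(pages)
```

Each record has the same keys, so later code can process the list without looking at the original text again.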
Using extracted data as signals for program logic
Structured data can now drive decisions. Programs branch, loop, or generate output based on what was extracted.
if page["title"] == "Mars":
    print("Generating Mars page")
This is where text processing connects directly to control flow. The text itself is no longer the focus. The extracted signal is.
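The same branching can run over many records, including ones where extraction found nothing. The following sketch assumes records shaped like the dictionaries above, with `None` standing in for a missing title. The `describe` function and the sample records are hypothetical.

```python
def describe(page):
    """Return the action to take for one extracted page record."""
    title = page.get("title")
    if title is None:
        # Extraction found no title, so there is nothing to generate.
        return f"Skipping {page['source']}: no title found"
    return f"Generating {title} page"


# Hypothetical records, as they might come out of an extraction step.
pages = [
    {"title": "Mars", "source": "mars.html"},
    {"title": None, "source": "notes.html"},
]

messages = [describe(page) for page in pages]
for message in messages:
    print(message)
```

The loop never inspects HTML. Every decision is made from the extracted fields.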
Conclusion
We now have a clear picture of how Python programs move from raw text to actionable information. By identifying patterns, extracting specific values, and converting them into structured data, text becomes something logic can reason about. This orientation is enough to recognize where text extraction fits and how it supports decision-making in real Python systems.