How journalists can use regular expressions to match strings of text

By: Joshua Hatch

August 15, 2011

One of the most powerful tools included in every word processor or text editor is the ability to find and replace text. This tool is invaluable to any writer or coder who needs to find a string of text (that is, a series of characters) and replace it with another.

For example, let’s say you routinely misspell “Barack Obama” as “Barak Obama.” A simple find and replace will allow you to find “Barak” and replace with “Barack.”

Or, to take a common example from the online world, what if you have some text that contains double quotation marks and you’re turning that text into XML (eXtensible Markup Language)? Double quotation marks are special reserved characters, so you would need to replace them with the appropriate code, as seen here:

Find: "
Replace: &quot;

XML has other special reserved characters, such as the ampersand (&), so we need to find those, too. To that end, we’d do this search:

Find: &
Replace: &amp;

Only there’s a problem. We’ve now broken our quotation mark code, changing it from &quot; to &amp;quot;.

We should have performed the ampersand find/replace first, and then performed the find/replace for quotation marks. This is a prime example of why it pays to plan a find/replace strategy before starting: the order of searches can matter.
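To see the ordering problem outside a text editor, here is a minimal Python sketch. The function names and the sample sentence are made up for illustration; the only technique used is plain string replacement, run in the two possible orders.

```python
def escape_xml(text: str) -> str:
    """Escape ampersands first, then double quotation marks."""
    text = text.replace("&", "&amp;")
    return text.replace('"', "&quot;")


def escape_xml_wrong_order(text: str) -> str:
    """Escape quotation marks first; the ampersand pass then mangles &quot;."""
    text = text.replace('"', "&quot;")
    return text.replace("&", "&amp;")


sample = 'He said "R&D"'  # hypothetical sample text
print(escape_xml(sample))              # He said &quot;R&amp;D&quot;
print(escape_xml_wrong_order(sample))  # He said &amp;quot;R&amp;D&amp;quot;
```

The second function reproduces the broken result described above, because the ampersand it just wrote as part of &quot; gets escaped a second time.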

All of that is just the beginning of what’s possible when it comes to finding and replacing text. Where the power really lies is in the ability to find patterns.

As an example, let’s say we have a list of U.S. presidents and the years they served in office, as shown below:

George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809
James Madison, 1809-1817
James Monroe, 1817-1825
John Quincy Adams, 1825-1829
Andrew Jackson, 1829-1837
Martin Van Buren, 1837-1841
William Henry Harrison, 1841
John Tyler, 1841-1845
James Knox Polk, 1845-1849
Zachary Taylor, 1849-1850
Millard Fillmore, 1850-1853
Franklin Pierce, 1853-1857
James Buchanan, 1857-1861
Abraham Lincoln, 1861-1865
Andrew Johnson, 1865-1869
Ulysses Simpson Grant, 1869-1877
Rutherford Birchard Hayes, 1877-1881
James Abram Garfield, 1881
Chester Alan Arthur, 1881-1885
Grover Cleveland, 1885-1889
Benjamin Harrison, 1889-1893
Grover Cleveland, 1893-1897
William McKinley, 1897-1901
Theodore Roosevelt, 1901-1909
William Howard Taft, 1909-1913
Woodrow Wilson, 1913-1921
Warren Gamaliel Harding, 1921-1923
Calvin Coolidge, 1923-1929
Herbert Clark Hoover, 1929-1933
Franklin Delano Roosevelt, 1933-1945
Harry S. Truman, 1945-1953
Dwight David Eisenhower, 1953-1961
John Fitzgerald Kennedy, 1961-1963
Lyndon Baines Johnson, 1963-1969
Richard Milhous Nixon, 1969-1974
Gerald Rudolph Ford, 1974-1977
James Earl Carter, Jr., 1977-1981
Ronald Wilson Reagan, 1981-1989
George Herbert Walker Bush, 1989-1993
William Jefferson Clinton, 1993-2001
George Walker Bush, 2001-2009
Barack Hussein Obama, 2009-

There’s nothing wrong with this list, but let’s say we want to put it in XML. To do that, we might want to change some things. For example, we might want to separate first and last names, as well as the first and last years of each president’s service.

So, instead of:

George Washington, 1789-1797

We want each part of that line (first name, last name, first year and last year) wrapped in its own XML element.

And we want that for every line. Clearly, a standard search and replace is not going to work in this case.

Instead, we can use pattern matching, or something known as “regular expressions.” (By the way, UNIX geeks may refer to it as GREP, which stands for global/regular expression/print.)

Not all word processors/text editors can handle regular expressions. A few that can include TextWrangler, BBEdit, TextMate and Komodo Edit, as well as command-line tools. I prefer to use BBEdit, so the syntax I’ll be using below corresponds to that program (and its free cousin, TextWrangler).

To start, we need to break down the lines into a pattern. We see the first word is the first name (or “fn”). Then there’s a space, followed by an initial or another name. After that second name, there might be a third name, or there might be a comma. Following the comma is another space, then a four-digit year, usually followed by a hyphen and another four-digit year. (Obama’s line stops at the hyphen, and the lines for William Henry Harrison and Garfield have only a single year.) Lastly, the line ends with a return.

With that, we can start to build our regular expression search. To search for any letter, we write [a-z]. (That covers capital letters only when the search is not case-sensitive; in a case-sensitive search, write [A-Za-z] instead.) To have the computer search for multiple characters in a row, we can follow that with an asterisk. (An asterisk can also mean zero of those characters in a row. You’ll see that later.)

So, to search for the first name, we’ll search for:

[a-z]*

That search is for any series of letters, so it will match both “George” and “Washington.” To make it match just “George,” we’ll tell the search to start at the beginning of the line. To do that, we add a ^ character.

So now our search would read:

^[a-z]*

That will search for the beginning of the line, followed by a series of letters. The only matching terms for that will be first names.
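The effect of the ^ anchor can be checked with a short Python sketch. This uses Python’s re module rather than BBEdit, so two flags stand in for editor settings: re.IGNORECASE makes [a-z] match capitals, and re.MULTILINE makes ^ match at the start of every line. The two-line sample is taken from the list above.

```python
import re

lines = "George Washington, 1789-1797\nJohn Adams, 1797-1801"

# Without the anchor, every run of letters matches.
# ([a-z]* can also match nothing at all, so empty matches are filtered out.)
all_words = [m for m in re.findall(r"[a-z]*", lines, flags=re.IGNORECASE) if m]

# With ^, only the run of letters at the start of each line matches.
first_names = re.findall(r"^[a-z]*", lines, flags=re.IGNORECASE | re.MULTILINE)

print(all_words)    # ['George', 'Washington', 'John', 'Adams']
print(first_names)  # ['George', 'John']
```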

However, if we search only for that, the first names will be replaced with new text. We don’t want that. We want to preserve the first names, adding text before or after them.

To do that, we need to further alter our search term as follows:

^([a-z]*)

The addition of the parentheses tells the computer to remember what the string of characters was. We can recall that string of text by using the code \1. (If there were multiple sets of parentheses in a single search, we would increase the number to refer to each successive set. More on that later.)
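As a small illustration of remembering and recalling a match, the Python sketch below wraps the first name in a tag. The tag name fn is borrowed from the shorthand used earlier and is only a placeholder; Python’s re.sub happens to use the same \1 notation in its replacement string.

```python
import re

line = "George Washington, 1789-1797"

# The parentheses save the first name; \1 puts it back inside the new text.
tagged = re.sub(r"^([a-z]*)", r"<fn>\1</fn>", line, flags=re.IGNORECASE)

print(tagged)  # <fn>George</fn> Washington, 1789-1797
```

The first name survives the replacement because it is recalled rather than overwritten.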

Now, we want to search for the additional names. To do that, we’ll expand our search to include a space and the middle name. However, we have to account for four different middle name scenarios. The first is where there is no middle name. The second is where there is a standard middle name. We must also account for a middle initial followed by a period, and for two middle names.

So, we’ll follow the first name by an optional space (it’s optional because the asterisk means zero, one or multiple instances), followed by an optional word, an optional space, an optional word and then an optional period. (Note the backslash in front of the period, which indicates that the period refers to the punctuation mark rather than a wild card character, which is what a period otherwise stands for.)

^([a-z]*) *([a-z]* *[a-z]*\.*)

Now we’ll add the search for the last name. To do that, we’ll add a required space (that is, one without an asterisk), followed by the string of letters, followed by the comma. Keep the comma outside the parentheses so that it is not part of the saved pattern.

^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*),
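To check that this search copes with the four middle-name scenarios, here is a hedged Python sketch that runs it against one line of each kind. The variable name NAME_PATTERN is invented for the example, and re.IGNORECASE again stands in for a case-insensitive editor search.

```python
import re

NAME_PATTERN = re.compile(r"^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*),", re.IGNORECASE)

samples = [
    "George Washington, 1789-1797",           # no middle name
    "John Quincy Adams, 1825-1829",           # one middle name
    "Harry S. Truman, 1945-1953",             # middle initial with a period
    "George Herbert Walker Bush, 1989-1993",  # two middle names
]

for sample in samples:
    match = NAME_PATTERN.match(sample)
    print(match.groups())

# ('George', '', 'Washington')
# ('John', 'Quincy', 'Adams')
# ('Harry', 'S.', 'Truman')
# ('George', 'Herbert Walker', 'Bush')
```

When there is no middle name, the second saved pattern is simply empty, which is the asterisk’s “zero instances” case at work.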

Finally, we need to add the dates. As with the letters, we can add the dates by repeating patterns of digits, in this case zero through nine. For the first set of years, we’d write [0-9]* and wrap that in parentheses to save the pattern.

Not all presidents have a range of dates. Obama, for example, has not finished his term yet. So, we have to make the hyphen and the second set of digits optional, using the asterisk. Doing so, we wind up with this:

^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*), ([0-9]*)-*([0-9]*)

Just to break that down again:

^ = beginning of the line
[a-z] = any letter between a and z.
() = a pattern to remember
* = zero, one or multiple iterations in a row.
\ = an escape character that tells search to see what follows as the literal character and not as a command.
[0-9] = any digit between 0 and 9

With that search string, we have broken each line up into its component parts and saved them as patterns we can call back. To do that, we use the backslash and the number for each pattern. The first name is pattern 1, the middle name is pattern 2, the last name is pattern 3, the start date is pattern 4 and the end date is pattern 5.
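The numbering of the five saved patterns can be seen in the following Python sketch. The function name split_line and the dictionary keys are hypothetical labels chosen for readability; the search string itself is the one built above.

```python
import re

FULL_PATTERN = re.compile(
    r"^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*), ([0-9]*)-*([0-9]*)",
    re.IGNORECASE,
)


def split_line(line: str) -> dict:
    """Return the five saved patterns of one line, keyed by descriptive labels."""
    match = FULL_PATTERN.match(line)
    return {
        "first": match.group(1),
        "middle": match.group(2),
        "last": match.group(3),
        "start": match.group(4),
        "end": match.group(5),
    }


print(split_line("Barack Hussein Obama, 2009-"))
# {'first': 'Barack', 'middle': 'Hussein', 'last': 'Obama', 'start': '2009', 'end': ''}
```

Because the second set of digits is optional, Obama’s open-ended term still matches, and pattern 5 comes back empty.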

With that information, we can now write our replace string: the text we want around each piece, with \1 through \5 placed wherever the saved patterns belong.

And voila! Every line in the list is rewritten into the new structure.
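As a rough illustration of such a replace string, the Python sketch below rewrites two lines from the list. The element names (president, fn, mn, ln, start, end) are hypothetical placeholders, not a prescribed schema, and the re.MULTILINE flag is a Python detail that lets ^ match at the start of every line.

```python
import re

PATTERN = re.compile(
    r"^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*), ([0-9]*)-*([0-9]*)",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical element names; \1 through \5 recall the five saved patterns.
REPLACEMENT = (
    r"<president><fn>\1</fn><mn>\2</mn><ln>\3</ln>"
    r"<start>\4</start><end>\5</end></president>"
)

text = "George Washington, 1789-1797\nHarry S. Truman, 1945-1953"
result = PATTERN.sub(REPLACEMENT, text)
print(result)
# <president><fn>George</fn><mn></mn><ln>Washington</ln><start>1789</start><end>1797</end></president>
# <president><fn>Harry</fn><mn>S.</mn><ln>Truman</ln><start>1945</start><end>1953</end></president>
```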

Even if XML isn’t your destination, you could do something similar by rewriting the patterns as a CSV or tab-delimited file to bring in to Excel. Whatever your target text, knowing how to take apart and reassemble text through regular expressions is a powerful skill.
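For the CSV route, one possible sketch in Python hands the saved patterns to the standard csv module instead of writing a replace string by hand, which also takes care of quoting if a field ever contains a comma. The function name to_csv is an invention for this example.

```python
import csv
import io
import re

PATTERN = re.compile(
    r"^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*), ([0-9]*)-*([0-9]*)",
    re.IGNORECASE,
)


def to_csv(lines: list) -> str:
    """Write first, middle, last, start and end as one CSV row per matching line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        match = PATTERN.match(line)
        if match:  # skip any line the search does not fit
            writer.writerow(match.groups())
    return buffer.getvalue()


print(to_csv(["George Washington, 1789-1797", "Barack Hussein Obama, 2009-"]))
# George,,Washington,1789,1797
# Barack,Hussein,Obama,2009,
```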

As with many programmatic things, there are multiple ways to do the same task. For example, one could use \d to refer to a digit instead of [0-9], or + to mean one or more instances instead of * to mean zero, one or more than one.
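The difference between * and + is easy to miss, so here is a hedged Python sketch that compares the search built above with a variant whose first year is written as \d+. The names STAR and PLUS are made up for the comparison. The Carter line, with its extra “Jr.,”, shows what is at stake: the all-asterisk search accepts it with empty years, while the + version declines to match at all.

```python
import re

# The search as built above: every quantifier is an asterisk.
STAR = re.compile(
    r"^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*), ([0-9]*)-*([0-9]*)",
    re.IGNORECASE,
)

# Variant: \d instead of [0-9], and + so the first year needs at least one digit.
PLUS = re.compile(
    r"^([a-z]*) *([a-z]* *[a-z]*\.*) ([a-z]*), (\d+)-*(\d*)",
    re.IGNORECASE,
)

ordinary = "George Washington, 1789-1797"
tricky = "James Earl Carter, Jr., 1977-1981"

print(STAR.match(ordinary).groups() == PLUS.match(ordinary).groups())  # True
print(STAR.match(tricky).groups())  # ('James', 'Earl', 'Carter', '', '')
print(PLUS.match(tricky))           # None
```

Which behavior is preferable depends on the job; a line that fails to match is at least easy to spot and fix by hand.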

You might want to explore those distinctions by experimenting further with regular expressions or, as those in the know call it, regex.

For more on regular expressions, check out the many sites or books devoted to regex. And understand that while different programs might have slightly different syntax, the concepts remain the same.

This story is part of a new Poynter Hacks/Hackers series. Each week, we’ll feature a How To focused on what journalists can learn from emerging trends in technology and new tech tools.


Tags: Digital Strategies, Hacks/Hackers, Media Innovation

Joshua Hatch

Joshua Hatch is an online content manager of Sunlight Live and an adjunct professor at American University.
