Paul "Pablo" Croubalian

7 years ago · 4 min. reading time · ~100 ·

How to get a GREP. Not a typo. This is Real and Powerful

A Primer on Regular Expressions

Now that myTweetPack.com is pretty much done (or as done as any major software ever is), I'm turning my attention back to self-publishing.

Specifically, I'm looking at Amazon publishing. 

After all, Amazon is responsible for 70% of all books sold. Not just online books . . . ALL books. That's a market that begs investigation.

While digging into the technical side of creating an eBook, I saw a need to fiddle with the HTML Microsoft Word poops out. 

Was that too harsh? I don't think so.

Issue #1: Amazon prefers to start from HTML exported from Microsoft Word.

OMG

Finagling a 17,000 to 70,000-word eBook pooped out as HTML by Word is a minefield of Search and Replace functions that can get you into serious trouble in an awful hurry.

That's when you need to get a GREP

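This post is a primer on Regular Expressions, which are not so "regular" at all.

GREP is a nickname for Regular Expressions, borrowed from the Unix search tool grep (short for "globally search for a regular expression and print"). The term "regular expression" is usually shortened to "regex" or "regexp." Some people, cruel ones, even call them "rational expressions."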

GREP has a lot more uses than just cleaning up poopy HTML. Regex is common in data validation for web forms.

GREP is a search "language," or a pattern-matching scheme. It gives superpowers to things like Find and Replace functions.

To activate GREP in Word, check "Use wildcards" in the dialog box. Turn it on when you need it, and back off when you don't. Warning: Word's wildcards are its own limited dialect. They cover only part of what full regex can do, and some of the symbols differ.

Better yet, download Notepad++, a free code editor, and use GREP at full-throttle.

What, the Hell, is GREP?

Word gives us a hint when it offers to "Use wildcards." Regular Expressions are a search language, a set of wildcards if you will. They are simple strings (text) that are really a search function on more steroids than a Russian weight-lifter.

This ain't a simple subject. Believe me. 

This is as simple as it can get.

I can see my favourite technophobes, Lisa Gallagher and Trent Selbrede, both falling to the floor in complete catatonic states as they read this.

My apologies to their families and other loved ones.

I'd miss them too.

Trent, Lisa, please stop reading now.

Seriously, GREP really is the "easier" way to fix Word's HTML-poop. The alternative would be to use a few hundred search-and-replace commands that would need to be done in the correct order.

That would be a nightmare.

The "language" looks weird but isn't very tough to learn. It can make your search and replace actions lightning fast and super-easy. 

I cleaned up a 26,000-word manuscript in under 3 seconds.

Practice on something unimportant, or on a copy of the actual document.

WARNING: If you're new to Regular Expressions, and even if you're not, never work on the document itself. Work on a copy. Regex search and replace is lightning fast. It's very flexible, but it can't read minds. It will do what you tell it to do even if you made a mistake. It can fix a 300-page manuscript in seconds. It can destroy it just as fast.
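If you script your clean-up instead of using an editor, the same rule applies. Here is a minimal sketch in Python, assuming a UTF-8 text or HTML file. The function name replace_in_copy and the ".copy" naming scheme are made up for this example. It copies the file first and only ever writes to the copy.

```python
import re
import shutil
from pathlib import Path

def replace_in_copy(source, pattern, replacement):
    """Copy `source` next to itself and run the regex replace on the copy only.

    Hypothetical helper: "book.html" becomes "book.copy.html".
    Returns the path of the copy and the number of replacements made.
    """
    source = Path(source)
    copy_path = source.with_name(source.stem + ".copy" + source.suffix)
    shutil.copyfile(source, copy_path)  # the original is never written to

    text = copy_path.read_text(encoding="utf-8")
    new_text, count = re.subn(pattern, replacement, text)
    copy_path.write_text(new_text, encoding="utf-8")
    return copy_path, count
```

Calling replace_in_copy("book.html", r" {2,}", " ") would squeeze runs of spaces down to one space in book.copy.html and leave book.html alone. If the result is a disaster, delete the copy and try again.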

Strictly speaking, regex is case-sensitive. "Paul" is not the same as "paul." That isn't always true in code editors. Ex: Notepad++ defaults to case-insensitive searching unless you check "Match case."
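To see the difference outside an editor, here is a small sketch using Python's built-in re module. The sample sentence is made up. Python is case-sensitive by default, and the re.IGNORECASE flag switches that off.

```python
import re

text = "Paul wrote this. paul did not."

# Default behaviour: the pattern is case-sensitive.
strict = re.findall(r"Paul", text)

# With the IGNORECASE flag, "Paul" and "paul" both match.
relaxed = re.findall(r"Paul", text, flags=re.IGNORECASE)

print(strict)   # ['Paul']
print(relaxed)  # ['Paul', 'paul']
```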

Regex is all about pattern matching

You already know about pattern matching. If you want to find all txt files in a folder, you just ask for *.txt. You know that means, "Show me files that start with anything and end in .txt." In Windows, * is a placeholder for "anything."

Regex kicks it up a notch or twenty. 

In case you're curious, the regex equivalent of *.txt is ^.*\.txt$
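Here is that pattern at work, as a rough sketch in Python. The file names are invented for the example.

```python
import re

# Hypothetical file names, made up for this example.
filenames = ["notes.txt", "chapter1.html", "todo.txt", "txt_backup.doc"]

# ^ start of the name, .* anything, \. a literal dot, txt, $ end of the name.
pattern = re.compile(r"^.*\.txt$")

text_files = [name for name in filenames if pattern.search(name)]
print(text_files)  # ['notes.txt', 'todo.txt']
```

Notice the \. in the middle. Without the backslash, the dot would be a placeholder for any character instead of a literal period.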

You tell Regex what pattern you want to find. It goes out and finds them.

The rules are simple. 

So are the rules for chess. 

Simplicity often masks complexity.

You can make a single super-complicated regex or a few simple ones. Your choice.

Even ten or so regex strings are better than a couple of hundred search-replace commands.

REGEX in a nutshell

  • Any letter, digit or symbol will be matched as is. It's important to remember that some symbols have special meaning. They can't be used as-is. Technically, they are called "metacharacters." Think of them as "reserved." The metacharacters are the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. Type a \ before them to let regex know that you want to treat a metacharacter as a literal value. Don't use 1+2=3. Use 1\+2=3.
  • Some letters are shortcuts. You turn them into shortcuts by adding a \. Ex: \t is a tab.
  • Square brackets [ ] are a choice of characters. Gr[ae]y will match both "Gray" and "Grey." You can also use a ^ after the opening square bracket. A ^ after a square bracket means "not."
  • A hyphen is a range. [0-9] will match any digit. [a-zA-Z] will match any letter, upper or lower case.
  • A period is a placeholder for any single character. So, r.d will match "rid," "red," or even "r&d."
  • A caret (^) means "not" when it comes right after an opening square bracket. It turns inclusive searches into exclusive searches. [^P] would match any single character except P. P[^a] would match the "Pe" in "Peter" but nothing in "Paul." Outside of square brackets, it means "at the start of the line." So, ^a wouldn't match Paul because the "a" is not in the first position.
  • Plus signs (+) mean one or more. [0-9]+ means one or more digits in a row. Use {} for a specific number. [0-9]{3} means any 3 digits in a row.
  • An asterisk (*) means zero or more. [0-9]* means any run of digits, or none at all. <[A-Za-z][A-Za-z0-9]*> will match any opening HTML tag without attributes.
  • A question mark (?) means zero or one. You can use that to match British and US spellings. Ex: colou?r matches both "color" and "colour."
  • The pipe character (|) means OR. You can also group them. To match "cat food" or "dog food," you'd use (cat|dog) food. You can add as many |'s as you want. That means you can also use (cat|dog|horse|rabbit|lemur|fish) food. Gerbils would be SOL.
  • ANCHORS: Anchors don't match any characters. They match positions. For example, \b matches at a word boundary, the spot where a word starts or ends. See the next item for an example.
  • Curly braces "{}" are for repetition. [0-9]{3} looks for any three digits in a row. It will match both "Testing123" and "123." To only match the digits, tell regex to only find them alone as a word. . . add \b. Like this, \b[0-9]{3}\b
  • GROUPING, CAPTURING, AND BACK REFERENCES: Round brackets group a regex together. Once they are grouped, you can refer to them by number. (regex-a) (regex-b) can now be \1 for whatever regex-a matched and \2 for whatever regex-b matched. So, in Notepad++ I can use this in a Find and Replace: Find - ([a-zA-Z]+)[0-9]([a-zA-Z]+) Replace - \1\2. That would look for one or more letters on either side of any single digit and replace the whole shebang with just the text part. Any number of any letters on either side of a digit would be replaced. I don't need to know which letters or which numbers. I don't even need to know how many letters.
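If you'd rather poke at these rules in code than in an editor, here is a rough sketch using Python's re module. The sample strings and variable names are made up. Python happens to accept \1 and \2 in the replacement, much like Notepad++ does, but other tools may use a different notation.

```python
import re

# Escaping a metacharacter: \+ is a literal plus sign.
escaped = bool(re.search(r"1\+2=3", "1+2=3"))    # True
unescaped = bool(re.search(r"1+2=3", "1+2=3"))   # False: here + means "one or more 1s"

# Character class: [ae] accepts either letter in that position.
greys = re.findall(r"Gr[ae]y", "Gray or Grey, never Groy")   # ['Gray', 'Grey']

# Optional character: u? means zero or one "u".
colours = re.findall(r"colou?r", "color and colour")         # ['color', 'colour']

# Alternation inside a group: any one of the listed animals.
dog = bool(re.search(r"(cat|dog) food", "dog food"))         # True
gerbil = bool(re.search(r"(cat|dog) food", "gerbil food"))   # False

# Repetition, without and with word boundaries.
loose = re.findall(r"[0-9]{3}", "Testing123 and 123")        # ['123', '123']
whole = re.findall(r"\b[0-9]{3}\b", "Testing123 and 123")    # ['123']

# Capture groups and back references: keep the letters, drop the digit.
cleaned = re.sub(r"([a-zA-Z]+)[0-9]([a-zA-Z]+)", r"\1\2", "abc1def")   # 'abcdef'

print(escaped, unescaped, greys, colours, dog, gerbil, loose, whole, cleaned)
```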

What do you think this regex matches? 

 \b[1-9][0-9]{3}\b   

Because of the "\b's"  it looks for a whole word. First, it looks for any digit from 1 through 9, then it looks for any three digits in a row from 0 through 9. 

Answer: This regex will match the numbers 1000 through 9999, but only if they are on their own. I.E. "1000" not "file_1000."
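You can check that answer yourself. A quick sketch in Python, with made-up sample strings:

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")

# Made-up samples: some should match, some should not.
samples = ["1000", "9999", "0999", "12345", "file_1000", "call 2024 now"]
matches = {s: four_digits.findall(s) for s in samples}

for sample, found in matches.items():
    print(repr(sample), "->", found)
```

Only "1000," "9999," and the "2024" in the last sample are found. "0999" starts with a zero, "12345" has too many digits, and in "file_1000" the underscore counts as part of the word, so there is no word boundary before the 1.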

Here's some homework. What do these regexes match?

\b[1-9][0-9]{2,4}\b

or, how about this more complex but much more common one?

 \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b 

Leave your answers in the comments if you dare [insert diabolical laugh here]
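If you want to test your guesses before posting them, here is a small sketch in Python. The helper name which_match and the sample strings are invented for this example. One assumption to flag: the second pattern only lists capital letters, so the sketch runs it with case ignored, the way a case-insensitive editor would.

```python
import re

def which_match(pattern, candidates, ignore_case=False):
    """Return the candidates in which `pattern` finds at least one match."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [c for c in candidates if compiled.search(c)]

# Made-up test strings; swap in your own guesses.
numbers = ["7", "42", "100", "99999", "100000", "0123"]
print(which_match(r"\b[1-9][0-9]{2,4}\b", numbers))

# Assumption: case is ignored, since the pattern only lists A-Z.
addresses = ["someone@example.com", "not an address", "a@b"]
print(which_match(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b",
                  addresses, ignore_case=True))
```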

To dig deeper into Regex

For more info on Regex, including a full tutorial, visit Regular-Expressions.info



About the Author

I'm a ghost, but not the kind that's into pottery wheels. I'm the writing kind.

I often wonder if I'm a tech-savvy writer or a writing-savvy technologist. Maybe I'm both. As one CMO put it, "Paul makes tech my bitch!" That might be going a little too far.

beBee VIP, Ambassador

Comments

Paul "Pablo" Croubalian

7 years ago #6

#8
LMAO... it certainly isn't suggested to use HTML from Word... EXCEPT for Amazon. HTML from Word is Amazon's preferred source file. I guess even crap is useful if it's uniformly crappy. You can get about 70% of the way going straight from Word. We will go over some HTML fixes. That's why I wrote the lead-weighted GREP post.

Wayne Yoshida

7 years ago #5

Paul - thanks for the lessons here. A long time ago, I tried to use Word for HTML, and quickly gave up. Never went back.

Wayne Yoshida

7 years ago #4

#1
Deb 🐝 Helfrich - I love the way you think!

Paul "Pablo" Croubalian

7 years ago #3

#5
Don't look at it like "devaluing a book," Deb. In any business venture, pricing is one of the toughest things to nail down. No one works in a vacuum. Sometimes competitors boost your profits, sometimes they cut them. I started a reply in comment form but it keeps getting too long. Look for a reply as a producer post. In short, I both agree and disagree with your contention that, "the $1.99 price point has devalued ebooks to the point of being free toasters with account.... I don't think it is right, nor necessarily long-lived." On one hand, it's up to us to fit the market, not the other way around. On the other, $1.99 IS low. We need to figure out how to get as close as possible to that 1.99 mark and still make it worthwhile. Not gonna be easy, and yes, I do believe it will be long-lived

Paul "Pablo" Croubalian

7 years ago #2

#2
I'm not sure what you're asking. If it's about myTweetPack.com, it's at http://www.myTweetPack.com

Paul "Pablo" Croubalian

7 years ago #1

#1
Actually, Joyce Bowen gave me an idea for that. It was so intriguing that I put on my MBA hat for a couple of hours and crunched numbers. I think it would be viable and fair. Getting a manuscript on Amazon as an eBook is fairly easy. Having it well formatted so as to look professional is much less so. Let's not even talk about creating cover art, audioBook versions, and formatting for print-on-demand (yikes). Pricing strategies, translation etc is also a nightmare. Then comes the REALLY tough part. . . Promoting and selling the damned thing. Joyce's idea? An e-publishing house that would work along the same line as a traditional publishing house and offer fee-based services as well. Authors should be able to submit works. Those we want to push, we would take on for a percentage of royalties. If the author prefers, or if we don't want to invest in the work for one reason or another, he/she can get the same services a la carte. NOT DONE by a long shot! Not even sure I will go ahead with it. But, it's a damned good idea. I guess I'll stick to testing Amazon. One book is nearly ready. Another is written and waiting to be formatted. A third is in final edit stages. Those three will be my test subjects. If we can crack the Amazon code, I'll revisit the idea.
