Use a sed script to print only valid email entries

emaillist.txt

1. Saman.desilva@tamucc.edu
2. saman_desilva@tamucc.edu
3. saman&desilva@tamucc.edu
4. Saman.desilva@gmail.com
5. saman@desilva@yahoo.com
6. saman@mail@com
7. saman.desilva@yahoo com

I want to print only the valid email addresses, but I’m having trouble figuring this out. So far I have this script, but its output still includes invalid entries.

sed -nr '/\w+@\w+\.\w+$/p' emaillist.txt

The output:

Saman.desilva@tamucc.edu
saman_desilva@tamucc.edu
saman&desilva@tamucc.edu
Saman.desilva@gmail.com
saman@desilva@yahoo.com

Answer

First of all, a regular expression that matches all valid email addresses is notoriously complex. I’m going to assume, given the test data, that you’re aiming for a much simpler concept of email address validity.

One issue with your regex is that it isn’t anchored to the beginning of the line, which is signified with ^. That lets invalid entries such as the one with an ampersand in the username match, because the pattern only has to match the text after the ampersand (desilva@tamucc.edu). If we add the ^, we get the following output:

$ sed -nr '/^\w+@\w+\.\w+$/p' emaillist.txt
saman_desilva@tamucc.edu
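
To see what the unanchored pattern was actually matching, it can help to print only the matched part of each line. This is a minimal sketch, assuming GNU grep (for -o and \w) and feeding in two of the invalid entries from the question directly rather than reading the file:

```shell
# Print only the substring matched by the original, unanchored pattern.
# Assumes GNU grep; the two input lines are copied from the question's test data.
matched=$(printf '%s\n' 'saman&desilva@tamucc.edu' 'saman@desilva@yahoo.com' |
  grep -oE '\w+@\w+\.\w+$')
printf '%s\n' "$matched"
# desilva@tamucc.edu
# desilva@yahoo.com
```

In both cases the match starts partway through the line, which is exactly what the ^ anchor rules out.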

Well, that’s not right either. The problem now is that \w only matches a letter, digit, or underscore. Periods are the other “valid” non-alphanumeric character for usernames in your test data, so the pattern also needs to allow them in the username. With that change we get the correct output:

$ sed -nr '/^(\w|\.)+@\w+\.\w+$/p' emaillist.txt
Saman.desilva@tamucc.edu
saman_desilva@tamucc.edu
Saman.desilva@gmail.com
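
The same idea can also be written with a bracket expression instead of the (\w|\.) alternation, which may read a little more clearly. This is only a sketch under a few assumptions: the file holds one address per line without the list numbers, sed supports -E (GNU and BSD sed both do), and [[:alnum:]_] is an acceptable stand-in for \w. The temporary file and the variable names list and valid are made up for the example.

```shell
# Recreate the sample data in a temporary file (assumed layout: one entry per line).
list=$(mktemp)
cat > "$list" <<'EOF'
Saman.desilva@tamucc.edu
saman_desilva@tamucc.edu
saman&desilva@tamucc.edu
Saman.desilva@gmail.com
saman@desilva@yahoo.com
saman@mail@com
saman.desilva@yahoo com
EOF

# Username: letters, digits, underscores, or periods.
# Domain: one word, a literal period, then another word.
valid=$(sed -nE '/^[[:alnum:]_.]+@[[:alnum:]_]+\.[[:alnum:]_]+$/p' "$list")
printf '%s\n' "$valid"

rm -f "$list"
```

Like the pattern above, this is deliberately simple: it would still reject addresses with more than one period in the domain, so treat it as a fit for this test data rather than a general validator.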