How to grep different strings followed by a tab on multiple lines

Apologies, I could not find an answer that worked for me. Working on a Win10 machine with Cygwin and/or Gitbash, I have a file of sequence read names (“readsfile”) followed by other information all separated by tabs. The reads file looks like this:

NB501827:133:HMV5HAFX2:1:11101:3747:1066    75  NODE_622711+_length_75_cov_990.55   100.000 43
NB501827:133:HMV5HAFX2:1:11101:8852:1068    74  NODE_622752+_length_4244_cov_356.337    100.000 74

I want to simply use grep to parse out the read names up to the first tab of each line, outputting the results to a separate file “readnames.txt”. Not including the “tab” character would be a plus, but is fixable later. The output file “readnames.txt” should be:

NB501827:133:HMV5HAFX2:1:11101:3747:1066
NB501827:133:HMV5HAFX2:1:11101:3747:1066
NB501827:133:HMV5HAFX2:1:11101:8852:1068

(For now, the duplicated read names are okay.) I have tried a multitude of solutions found on this site. Some examples, taking into account grep vs. egrep vs. grep -E vs. Perl-style grep -P, include:

grep -oE $'^*\t' readsfile > readnames.txt
egrep '^NB*\t' readsfile > readnames.txt
grep -oE '^NB'$'\t' readsfile > readnames.txt
grep -oP 'NB*\t' readsfile > readnames.txt
grep -o $'NB*\t' readsfile > readnames.txt
grep -oE ^NB*$'\t' readsfile > readnames.txt
grep -o '[NB*|[[:space:]]]' readsfile > readnames.txt
grep -o ^NB*[[:space:]] readsfile > readnames.txt
grep -o $"NB*$'\t'" readsfile > readnames.txt
grep -o \<NB*\> readsfile > readnames.txt

Note that I have also used scripts to include “actual” tabs (typed with <Ctrl-V><Tab>), for example grep -oE '^NB*<Tab>' readsfile > readnames.txt or grep -oE '^NB.*<Tab>' readsfile > readnames.txt, in most of the combinations used at the command line.

Also some other unsuccessful solutions:

sed -n 's/NB*\t/&/p' readsfile > readnames.txt
sed -n 's/*\t/&/p' readsfile > readnames.txt

I suspect this has been done but help is needed. Thank-you.

Answer

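If you want everything from the first tab onward removed, including the tab itself, this sed command does that: sed 's/\t.*//g' (the \t escape is understood by GNU sed, which is what Cygwin and Git Bash provide).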
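As a rough sketch of how this could be run against the files named in the question: the sample below rebuilds the two example lines with real tab separators and strips everything from the first tab onward. The `tab` variable is an illustrative helper, not part of the original answer. It passes a literal tab to sed so the command does not rely on sed's own handling of the tab escape.

```shell
# Recreate the two sample lines with real tab separators (printf expands \t).
printf 'NB501827:133:HMV5HAFX2:1:11101:3747:1066\t75\tNODE_622711+_length_75_cov_990.55\t100.000\t43\n' > readsfile
printf 'NB501827:133:HMV5HAFX2:1:11101:8852:1068\t74\tNODE_622752+_length_4244_cov_356.337\t100.000\t74\n' >> readsfile

# Hypothetical helper: hold one literal tab character in a variable.
tab=$(printf '\t')

# Delete the first tab and everything after it on each line.
sed "s/${tab}.*//" readsfile > readnames.txt

cat readnames.txt
```

Lines that contain no tab at all are passed through unchanged, since the substitution simply does not match them.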

Alternatively, sed 's/\([^\t]*\)\t.*/\1/g' matches any run of non-tab characters followed by a tab and the rest of the line, captures the part before the first tab, and replaces the whole match with that capture.
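A minimal sketch of the capture-group form, using made-up read names (`read_A`, `read_B`) rather than real data. As an assumption for portability, the tab is again supplied through a hypothetical `tab` shell variable instead of an escape inside the sed expression.

```shell
# Hypothetical helper: one literal tab character.
tab=$(printf '\t')

# \( ... \) captures the run of non-tab characters before the first tab;
# \1 puts that captured text back in place of the whole matched line.
first_fields=$(printf 'read_A\t75\tNODE_1\nread_B\t74\tNODE_2\n' |
  sed "s/\([^${tab}]*\)${tab}.*/\1/")

printf '%s\n' "$first_fields"
```

Inside double quotes the shell leaves `\(`, `\)` and `\1` untouched, so sed still sees the basic-regex grouping and back-reference.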

awk handles tab-delimited input very well too. awk -F'\t' '!a[$1]++ {print $1}' prints the first tab-delimited field of each line, deduplicated. It works by incrementing an array entry keyed on the first field: the first time a value is encountered the entry is 0, so the test evaluates to !0, which is true, and print fires. Each subsequent time the same value is seen the test is !1, !2, etc., which evaluates to false, so nothing is printed.
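The deduplicating behaviour can be checked with a small sketch like the one below. The input lines and the array name `seen` are illustrative only. It prints `$1` explicitly so that only the read name is emitted, not the whole line.

```shell
# Three made-up lines; the first two share the same first field.
unique_names=$(printf 'read_A\t75\nread_A\t60\nread_B\t74\n' |
  awk -F'\t' '!seen[$1]++ { print $1 }')

# seen[$1] is 0 on the first occurrence, so !0 is true and the name prints;
# the post-increment makes later occurrences of the same name evaluate to false.
printf '%s\n' "$unique_names"
```

If duplicates should be kept, as the question allows for now, dropping the pattern and using just `{ print $1 }` would emit the first field of every line.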