Your commands would discard lines containing no | character, and lines where the mouse gene identifier has no version number. I’m not certain this is intended, but it’s a side effect of using sed -n with the p flag on the s command. I’m going to assume that this is unintended.
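To illustrate that side effect, here is a small sketch. The sample lines are made up for the demonstration, and the command shown with -n is only my guess at the general shape of your original command, not a copy of it.

```shell
# Hypothetical sample data: the second line has no '|' character.
sample='ENSMUST00000000001.4|ENSMUSG00000000001.4
ENSMUSG00000000028
ENSMUST00000000090.7|ENSMUSG00000000093'

# With -n and the p flag, a line is printed only if the substitution matched,
# so the line without '|' is silently dropped.
with_n=$(printf '%s\n' "$sample" | sed -n 's/.*|//p')

# Without -n, every line is printed, whether or not it was modified.
without_n=$(printf '%s\n' "$sample" | sed 's/.*|//')

printf 'with -n and p:\n%s\n\nwithout -n:\n%s\n' "$with_n" "$without_n"
```

The first variant outputs two lines and the second outputs all three, which is why I drop -n and the p flag in what follows.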
Just use two expressions with sed:
sed -e 's/.*|//' -e 's/\..*//' file >newfile
With a grep command that has the non-standard -o option, and assuming that you just want to extract all Ensembl mouse gene stable IDs from the file (and that the file only contains stable IDs that you’d like to extract):
grep -o 'ENSMUSG[[:digit:]]*' file >newfile
You may also use two chained cut commands, each one modifying the data in much the same way as the corresponding sed substitution earlier in this answer. Since cut splits on a fixed delimiter rather than matching a regular expression, it would probably be quicker, but I doubt you’d see any major speed difference unless your input data is huge.
cut -d '|' -f 2 file | cut -d '.' -f 1 >newfile
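As a quick sanity check, the sketch below runs the sed and cut variants on the same input and compares the results. The sample lines are hypothetical and only assume the "something|gene.version" layout discussed above.

```shell
# Hypothetical sample data; the last line has no '|' and no version number.
sample='ENSMUST00000000001.4|ENSMUSG00000000001.4
ENSMUST00000000090.7|ENSMUSG00000000093.6
ENSMUSG00000000028'

# Two sed expressions: strip up to the last '|', then strip from the first dot.
via_sed=$(printf '%s\n' "$sample" | sed -e 's/.*|//' -e 's/\..*//')

# Two chained cut commands: take field 2 on '|', then field 1 on '.'.
# cut passes through lines that lack the delimiter (unless -s is given).
via_cut=$(printf '%s\n' "$sample" | cut -d '|' -f 2 | cut -d '.' -f 1)

if [ "$via_sed" = "$via_cut" ]; then
    printf 'same output:\n%s\n' "$via_sed"
else
    printf 'outputs differ\n'
fi
```

On this input both pipelines produce the three bare gene IDs. They would diverge on lines containing more than one | character, since sed removes everything up to the last one while cut picks the second field.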