bioinformatics – using sed to capture groups

Your commands would discard lines containing no | character, and lines where the mouse gene identifier has no version number. I’m not certain this is intended, but it’s a side effect of using sed -n with the p flag on the s command. I’m going to assume that this is unintended.
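To illustrate that side effect, here is a minimal sketch. The sed command below is a hypothetical stand-in for a single-substitution approach, not the command from the question, and the sample IDs and line layout are made up for illustration.

```shell
# Hypothetical input: one complete line, one without "|", one without a version.
input='ENSG00000139618|ENSMUSG00000041147.8
ENSMUSG00000017167.15
ENSG00000141510|ENSMUSG00000059552'

# With -n and the p flag, a line is printed only if the substitution succeeds,
# so the second and third lines are silently dropped.
kept=$(printf '%s\n' "$input" | sed -n 's/.*|\(.*\)\..*/\1/p')
printf '%s\n' "$kept"
```

Only the first line matches the whole pattern, so only its ID is printed.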

Just use two expressions with sed:

sed -e 's/.*|//' -e 's/\..*//' file >newfile
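As a rough check of what the two expressions do, the sketch below runs them on a few made-up lines (the file name sample.txt and the IDs are assumptions, not data from the question). Note the backslash before the dot in the second expression: unescaped, `.` would match any character and the substitution would empty every line.

```shell
# Hypothetical sample data; the real file's layout is an assumption.
printf '%s\n' \
  'ENSG00000139618|ENSMUSG00000041147.8' \
  'ENSMUSG00000017167.15' \
  'ENSG00000141510|ENSMUSG00000059552' >sample.txt

# First expression: delete everything up to and including the last "|".
# Second expression: delete the first literal "." and everything after it.
result=$(sed -e 's/.*|//' -e 's/\..*//' sample.txt)
printf '%s\n' "$result"
```

Because neither substitution has to succeed for a line to be printed, all three lines survive, each reduced to its unversioned mouse gene ID.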

If your grep supports the non-standard -o option, and assuming that you just want to extract all Ensembl mouse gene stable IDs from the file (and that the file contains only stable IDs that you’d like to extract), you could use:

grep -o 'ENSMUSG[[:digit:]]*' file >newfile

You may also chain two cut commands, each making a modification similar to the corresponding sed substitution earlier in this answer. Because cut splits on a fixed delimiter rather than matching a regular expression, it would probably be quicker, but I doubt you’d see any major speed difference unless your input data is huge.

cut -d '|' -f 2 file | cut -d '.' -f 1 >newfile
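The sketch below runs the cut pipeline on a few made-up lines (the IDs and layout are assumptions) to show how it treats lines that lack a delimiter. By default, cut passes such lines through unchanged, so nothing is dropped. Adding the standard -s option would suppress them instead.

```shell
# Hypothetical input: one complete line, one without "|", one without a version.
input='ENSG00000139618|ENSMUSG00000041147.8
ENSMUSG00000017167.15
ENSG00000141510|ENSMUSG00000059552'

# Take the second "|"-separated field, then the first "."-separated field.
# Lines with no "|" (or no ".") are passed through whole by each cut.
ids=$(printf '%s\n' "$input" | cut -d '|' -f 2 | cut -d '.' -f 1)
printf '%s\n' "$ids"
```

One difference worth keeping in mind: on a line with more than one |, cut -f 2 selects the second field, whereas the greedy .*| in the sed expression strips everything up to the last |.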
