(1): Explore the UCSC (U. California Santa Cruz) Genome Browser website

GENOME DATA

Question:

Write a script to obtain all protein sequences encoded in the human genome. Your output should be in multi-FASTA format, which looks like:

>ID1
Sequence 1…
>ID2
Sequence 2…

The ID field describes what the sequence is. Build the ID by concatenating the name and name2 fields of the RefSeq table, with a colon “:” as the delimiter. For example, for the first record in the RefSeq table, the corresponding ID should be “>NM_001276352.2:C1orf141”.
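A minimal sketch of building the header line described above. It assumes each RefSeq row has already been read into a dict keyed by the table's column names (`name` for the transcript accession, `name2` for the gene symbol); the helper name `fasta_header` is hypothetical.

```python
def fasta_header(row):
    """Build a FASTA header line '>name:name2' from one RefSeq table row.

    `row` is assumed to be a mapping that holds the table's `name`
    (transcript accession) and `name2` (gene symbol) columns.
    """
    return ">{}:{}".format(row["name"], row["name2"])


# Example row for transcript NM_001276352.2 (gene symbol C1orf141).
print(fasta_header({"name": "NM_001276352.2", "name2": "C1orf141"}))
```

Keeping the header logic in its own function makes it easy to change the delimiter or add fields later without touching the sequence code.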

The sequence field contains the corresponding protein sequence, written on a single line.

For example: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
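One possible way to get from a RefSeq row to a protein sequence is to splice the coding part of each exon, reverse-complement it for minus-strand genes, and translate it. The sketch below makes several assumptions: the chromosome sequence is already loaded as a Python string, the row uses the genePred-style columns `strand`, `cdsStart`, `cdsEnd`, `exonStarts` and `exonEnds` with 0-based, half-open coordinates, and the standard genetic code applies (so mitochondrial genes and selenoproteins are not handled). The function names `coding_sequence` and `translate`, and the toy data, are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def coding_sequence(chrom_seq, strand, cds_start, cds_end, exon_starts, exon_ends):
    """Return the spliced CDS for one transcript, read 5' to 3'.

    Coordinates are assumed to be 0-based and half-open.
    """
    parts = []
    for start, end in zip(exon_starts, exon_ends):
        # Keep only the part of the exon that overlaps the coding region.
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:
            parts.append(chrom_seq[lo:hi])
    cds = "".join(parts).upper()
    if strand == "-":
        # Minus-strand genes are read from the reverse complement.
        cds = cds.translate(COMPLEMENT)[::-1]
    return cds


def translate(cds):
    """Translate a CDS into a one-line protein string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for codons containing N etc.
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy two-exon transcript on the plus strand (lowercase = UTR or intron).
toy_chrom = "ccATGGTCgtaagCTTTAAcc"
cds = coding_sequence(toy_chrom, "+", 2, 19, [0, 13], [8, 21])
print(">toy_tx:TOY")
print(translate(cds))
```

For the toy transcript the spliced CDS is ATGGTCCTTTAA, so the record printed is the header followed by MVL. Noncoding transcripts, where `cdsStart` equals `cdsEnd`, yield an empty string and would need to be skipped before writing the output file.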
