(1): Explore the UCSC (University of California, Santa Cruz) Genome Browser website
GENOME DATA
Question:
Write a script to obtain all protein sequences encoded in the human genome. Your output should be in multi-FASTA format, which looks like:
>ID1
Sequence 1…
>ID2
Sequence 2…
The ID field describes what the sequence is. You should build the ID by concatenating the name and name2 fields of the RefSeq table, with a colon (“:”) as the delimiter. For example, for the first record in the RefSeq table, the corresponding ID should be “>NM_001276352.2:C1orf141”.
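As a rough illustration of the ID rule, here is a minimal Python sketch. It assumes the RefSeq table has been downloaded as a tab-separated genePred-style dump with a leading bin column, so that name is column 1 and name2 is column 12 (0-based); check this against the table schema before relying on it. The helper names and the coordinates in the sample row are made up for the example.

```python
import csv
import io

# Assumed column positions (0-based) in a tab-separated genePred-style dump
# that starts with a "bin" column. Verify against the actual table schema.
NAME_COL = 1    # transcript accession, e.g. NM_001276352.2
NAME2_COL = 12  # gene symbol, e.g. C1orf141


def fasta_id(row):
    """Build the FASTA header line '>name:name2' from one table row."""
    return f">{row[NAME_COL]}:{row[NAME2_COL]}"


def iter_ids(handle):
    """Yield one FASTA header per data row, skipping blank and '#' lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if row and not row[0].startswith("#"):
            yield fasta_id(row)


# One illustrative row; the coordinates are invented, only name/name2 matter here.
SAMPLE = "\t".join([
    "585", "NM_001276352.2", "chr1", "-", "100", "900", "200", "800",
    "2", "100,600,", "300,900,", "0", "C1orf141", "cmpl", "cmpl", "0,1,",
]) + "\n"

for header in iter_ids(io.StringIO(SAMPLE)):
    print(header)
```

In a real script the `io.StringIO(SAMPLE)` handle would be replaced by the opened table file.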
The sequence field simply records the corresponding sequence, all in one line.
For example: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
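One possible way to produce the sequence line is to splice the coding parts of the exons out of the chromosome sequence, reverse-complement them for minus-strand genes, and translate with the standard genetic code. The sketch below assumes 0-based, half-open coordinates (as in UCSC table dumps) and a chromosome already loaded as a string; the function names and the toy chromosome are hypothetical, and this is not necessarily the approach the assignment intends.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def cds_sequence(chrom_seq, strand, cds_start, cds_end, exon_starts, exon_ends):
    """Join the coding portion of each exon (0-based, half-open coordinates)."""
    parts = []
    for start, end in zip(exon_starts, exon_ends):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:  # exon overlaps the coding region
            parts.append(chrom_seq[s:e])
    cds = "".join(parts).upper()
    return reverse_complement(cds) if strand == "-" else cds


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons containing N etc.
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def fasta_record(name, name2, protein):
    """Format one record: header line, then the sequence on a single line."""
    return f">{name}:{name2}\n{protein}\n"


# Toy chromosome: two exons separated by a lowercase intron.
TOY_CHROM = "GG" + "ATGGTG" + "ccccc" + "CTGTAA" + "TT"
toy_protein = translate(
    cds_sequence(TOY_CHROM, "+", 2, 19, [0, 13], [8, 21])
)
print(fasta_record("TOY_1", "toyGene", toy_protein), end="")
```

Transcripts whose coding start and end coincide yield an empty sequence here, so a full script would probably skip them rather than write an empty record.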