How to create a custom gene-to-GO mapping file for topGO


I’m trying to run a GO term enrichment analysis on some custom annotations, using the topGO package in R.

I’m following section 4.3 of the user guide, found here.

The data needs to be in the following format (note: the file should have two tab-delimited columns, the second of which lists the corresponding GO terms, separated by commas):

068724  GO:0005488, GO:0003774, GO:0001539, GO:0006935, GO:0009288
119608  GO:0005634, GO:0030528, GO:0006355, GO:0045449, GO:0003677, GO:0007275
049239  GO:0016787, GO:0017057, GO:0005975, GO:0005783, GO:0005792, GO:0004345, GO:0005788, GO:0047936, GO:0006098, GO:0005488, GO:0006006, GO:0055114, GO:0016491
067829  GO:0045926, GO:0016616, GO:0000287, GO:0030145, GO:0005739, GO:0000166, GO:0005575, GO:0006099, GO:0005524, GO:0008152, GO:0006102, GO:0005759, GO:0005975, GO:0004449, GO:0055114, GO:0016491

However, my data currently looks like this:

QBM89824.1  GO:0072659
QBM86167.1  GO:0070072
QBM87744.1  GO:0031307
QBM87744.1  GO:0045040
QBM87744.1  GO:0070096
QBM87389.1  GO:0000500
QBM87389.1  GO:0042790
QBM85935.1  GO:0035859
QBM85935.1  GO:0050790
QBM85935.1  GO:0005096
QBM85935.1  GO:0042819
QBM85935.1  GO:0032007

I’m having trouble transforming my data into the required format. There are currently over 11k rows, so sorting it out manually isn’t an option. Does anyone know of a method for doing this? I’m comfortable using Python, but not so much with R.

Thanks in advance!
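
For reference, here is a minimal Python sketch of one way the conversion could go, assuming the input has one whitespace-separated "gene ID, GO term" pair per line as in the sample above. The function names (group_go_terms, format_mapping, convert_file) and the file names in the usage comment are made up for illustration and are not part of topGO. It collects the GO terms for each gene ID, drops exact duplicates, and writes one tab-delimited line per gene with the terms joined by ", ".

```python
from collections import defaultdict


def group_go_terms(lines):
    """Collapse 'gene_id  GO:term' rows into {gene_id: [GO terms]}."""
    gene_to_go = defaultdict(list)  # keeps genes in first-seen order
    for line in lines:
        parts = line.split()  # splits on tabs or runs of spaces
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        gene, term = parts[0], parts[1]
        if term not in gene_to_go[gene]:  # avoid repeating a term for a gene
            gene_to_go[gene].append(term)
    return gene_to_go


def format_mapping(gene_to_go):
    """Return one 'gene_id<TAB>GO:1, GO:2, ...' string per gene."""
    return [f"{gene}\t{', '.join(terms)}" for gene, terms in gene_to_go.items()]


def convert_file(in_path, out_path):
    """Read the one-pair-per-line file and write the grouped mapping file."""
    with open(in_path) as src:
        mapping = group_go_terms(src)
    with open(out_path, "w") as dst:
        for row in format_mapping(mapping):
            dst.write(row + "\n")


# Hypothetical usage (both file names are placeholders):
# convert_file("annotations.tsv", "gene2go.map")
```

If I am reading the user guide correctly, the resulting file should then be loadable in R with topGO's readMappings(), but that part is an assumption I have not verified against this data.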
