The sardine run in southeastern Africa is a mass migration into an ecological trap


Large-scale annual migrations occur in an extraordinary range of animals, from insects to the great whales. While the driving mechanisms of these migrations are varied and sometimes poorly understood, they often represent a way of optimizing conditions for breeding and adult fitness when these are in conflict. Often, populations may follow the availability of suitable food as it shifts in a geographically predictable fashion, as in the case of the mass migrations of the Serengeti (1). However, migrations may also allow adults to take advantage of high food availability in places, or at times, that are not suitable for breeding or for earlier ontogenetic stages, a strategy pursued by several Antarctic cetaceans (2). Species that depend on migrations for their survival are especially vulnerable because anything that disrupts this behavior threatens a huge proportion, or indeed all, of their populations. Understanding the drivers of migrations is key to predicting how or whether such behavior may alter with long-term changes in environmental conditions, such as those expected under anthropogenic climate warming.
One of the best documented annual marine mass migrations is the sardine run off the east coast of South Africa (3). While the so-called “greatest shoal on Earth” involves several small pelagic fish species and attracts huge numbers of opportunistic predators, including seabirds, gamefish, sharks, and marine mammals (4, 5), it primarily represents the migration of the southern African population of Pacific sardine, Sardinops sagax (Jenyns, 1842). This species is southern Africa’s commercially most important small pelagic fish. Its range in this region includes a distinct Namibian stock (6, 7) and a South African population with two major high-density areas separated by the boundary between the Atlantic and Indian oceans, approximately at Cape Agulhas (Fig. 1) (8, 9). Sardines from these two South African regions spawn at different temperatures (10) and have different nursery areas (11, 12). These and other differences (13) suggest that they comprise distinct western and southern stocks (8, 1418), but genetic studies have so far failed to confirm this (19).

Fig. 1. A map of the South African range of the southern African population of Pacific sardine, Sardinops sagax.

The map shows sites at which sardines were caught for genome and transcriptome sequencing. Colors represent mean sea surface temperatures (SSTs). The coastline was divided into five temperature-defined geographical regions (temperate core range: W, west; SW, southwest; S, south; SE, southeast; sardine run: E, east). Cape Agulhas is the boundary between the Atlantic and Indian oceans. The broken line represents the edge of the continental shelf, beyond which the sardines rarely disperse. The black and white arrows represent the approximate path of the Agulhas Current, which transports tropical Indian Ocean water southward and confines sardines participating in the run (blue arrows) to a narrow coastal band of cooler water.

The sardine run is a spawning migration of sardines from the eastern portion of their temperate South African core range to the subtropical east coast (Fig. 1) (3). It involves the movement of tens to hundreds of millions of sardines migrating to the northeast in high-density shoals by remaining associated with pockets of cool water that are sporadically uplifted onto the shallow continental shelf inshore of the warm, southward-flowing Agulhas Current (20, 21). During the migration, water temperatures may exceed the species’ preferred range (22), leading to thermal stress and reduced condition (20, 23). It has been suggested that the sardine run represents relic spawning behavior dating back to the previous glacial period, when what is now tropical Indian Ocean habitat may have been an important sardine nursery area (24). However, genetic data have so far failed to support the distinctness of sardines participating in the run from those in the remainder of the South African range (19, 25).
Here, we use genomic and transcriptomic data to test the hypothesis that the sardine run represents the spawning migration of a distinct stock (24). Such datasets have considerable potential to identify subtle population structure in high-dispersal species where traditional genetic methods suggest spatial homogeneity (2629). We also tested the hypothesis of distinct regional sardine stocks in the species’ core range, which is important for management of the sardine fishery, and then used this information to investigate the origins of fish that participate in the sardine run.


Large-scale annual migrations are a characteristic of many species and often represent a way of optimizing conditions for breeding and adult fitness when these are in conflict. In terms of spectacle and the sheer magnitude of the event, the sardine run is comparable to East Africa’s great wildebeest migration, but while the drivers of the latter are well known, the same cannot be said for the sardine run. Biological explanations for this migration into what is apparently unsuitable habitat include relic spawning behavior (20), equatorward movement of juvenile fish (30), natal homing (20), a feeding migration facilitated by habitat extension (22), and northward herding of sardines by predators (22). Early studies of the physical environment speculated that the sardine run was enabled by the establishment of a northeastward-flowing countercurrent inshore of the Agulhas Current on the east coast during winter (31) or that it represents a response to the expansion of suitable habitat during cooler winter conditions (22). Neither hypothesis has been strongly supported by subsequent research (21, 3234).
Our study provides new insights into the causes of the sardine run. First, it provides the first molecular support for a previously proposed two-stock hypothesis based on nongenetic data (8, 1018). This suggests that the previous evidence for stocks following the definition of Ihssen et al. (35) as being “intraspecific groups of randomly mating individuals with temporal and spatial integrity” is not merely a result of phenotypic plasticity in regional assemblages affected by different environmental conditions but that it has a heritable basis. The finding that spatial population structure in candidate SNPs is linked to the region’s temperature-defined biogeographical provinces (36, 37) rather than isolation by geographic distance previously reported for selectively neutral SNPs (38) has important implications for the management of the sardine fishery. Despite admixture between the two stocks that is facilitated by dispersal in both directions, a single-stock strategy can result in population declines, as regional stocks adapted to specific temperature ranges are overexploited. Although it was found that the sardine run comprises a heterogeneous assemblage of fish with ancestry from both the western and southern stocks, ancestry proportions indicated that these migrants originate mostly from the species’ cool-temperate core area in the Atlantic Ocean.
Our results further show that sardines participating in the run are not a distinct east coast population that mingles with the south coast stock during summer and then separates from it in winter. Instead, the primarily Atlantic origin of these sardines supports a previous hypothesis based on morphological similarities between east coast sardines and those from the western portion of the species’ range (31). A preference of these sardines for colder water not only explains why only a fraction of the sardines present on the south coast participates in the run (23) but also explains why cold water upwelling off the southern east coast is a prerequisite for annual sardine runs to occur.
The sequence of events that results in a sardine run is briefly summarized in light of our evidence (Fig. 4). During winter, shelf waters off the eastern south coast can temporarily become cooler than those off the western south coast (39), providing appropriate conditions that cause migrants of cool-temperate Atlantic ancestry to aggregate on the eastern south coast. Here, cooling of the shelf environment due to the uplift of cold, nutrient-rich waters by cyclonic eddies (21, 34, 4042), and wind-driven advection and vertical mixing that promote shoreward movement of these cold waters (34, 42), result in conditions of elevated productivity (43) that briefly resemble those found during summer upwelling in the Atlantic Ocean (39). The intermittent nature and random timing of these events provide an explanation for why the sardine run does not occur at the same time or with the same intensity every year. In addition, the cyclonic eddies do not always affect the shelf regions to the same degree, with some eddies resulting in only minimal cooling of the shelf waters (34). Ekman veering that results in the uplifting of cooler water at the shelf edge (34, 42) may facilitate the northward movement of sardines at depths of 100 to 200 m. The introduction of upwelled water, with temperatures that are closer to the western sardines’ preferred thermal tolerance range, may stimulate sardine run participants to separate from the warm-temperate southern stock and follow these transient pockets of cooler water up the east coast.

Fig. 4. Stock structure of Pacific sardine, S. sagax, in South African waters and sequence of events that results in a sardine run.

The spawning area in the Atlantic Ocean (blue) is numerically dominated by cool-temperate sardines (gray in Fig. 3), and the spawning area in the Indian Ocean (orange) is dominated by warm-temperate sardines (black in Fig. 3). There is considerable exchange between these areas, with eggs and larvae from the Indian Ocean stock primarily moving westward and juveniles and adults of both stocks moving eastward. Upwelling on the southeast coast attracts cool-temperate sardines present on the south coast, which follow the cooler water as it is transported northward. When the upwelling ceases, these sardines eventually find themselves in an ecological trap of suboptimal subtropical habitat.
Cessation of upwelling eventually results in these sardines finding themselves in subtropical waters that exceed their preferred thermal range. Thermal stress, coupled with physical exhaustion and generally lower food availability (compared to their typical west coast environment), results in poor body condition (20, 23). Moreover, avoidance of the tropical water of the Agulhas Current concentrates them on the narrow continental shelf off the east coast (Fig. 4) and makes them an easy target for predators.
The presence of sardine eggs in the plankton confirms their survival off the east coast until austral summer (44). At best, this would make the sardine run a migration of west coast sardines into east coast habitat where conditions are temporarily suitable for feeding and spawning during a time when such conditions do not exist on the south coast. At worst, it constitutes a costly navigation error into an ecological trap sensu Robertson and Hutto (45), where sardines show a preference for one habitat over another and fitness will differ between habitats and ultimately be lower in the preferred habitat. Specifically, the migration is driven by initially favorable environmental cues, which lure sardines with a preference for colder temperatures into habitat that eventually becomes unfavorable. Here, their fitness and survival are ultimately compromised because of physiologically challenging high temperatures and intense predation (35, 20, 22).
Long-term increases in sea surface temperature (SST) along the South African east coast that are driven by anthropogenic climate warming (46) increasingly make this region less hospitable to sardine [but see (47)]. Under the representative concentration pathway 4.5 scenario and using National Oceanic and Atmospheric Administration’s (NOAA’s) Coupled Model Intercomparison Project 5 ensemble (48), temperatures off the east coast during April to June (the period during which the sardine run starts) are predicted to increase by almost 1°C by 2055 and to show substantially increased variability (fig. S6). Given the colder water origins of sardines participating in the run, the projected warming could lead to the cessation of the sardine run in the next few decades (49). Despite the huge numbers of fish involved, the sardine run involves only a small portion (S. sagax population (8). Although cessation of the sardine run would mean the loss of one of nature’s most spectacular migrations, the effects on the population as a whole are likely to be negligible.


Sampling design

Population structure in the South African population of Pacific sardine, S. sagax, was assessed using genomic and transcriptomic data from sardines collected throughout the species’ core range and from three sardine runs. Genomic data were generated from tissue samples obtained from 284 sardines collected at 40 locations throughout the species’ range (table S2). To confirm the genomic findings independently, transcriptome data were generated using RNA extracted from liver tissue of 20 individuals collected at nine locations (table S2). Sardines from the west coast to the southeast coast were collected as part of pelagic research surveys conducted in autumn/winter (May and June) and spring/summer (October and November) in 2014 and 2015. All of these were used to generate genomic data, with a subset of 14 individuals from seven sites also used to generate transcriptome data. East coast samples for genomic analyses were obtained from artisanal fishers during the 2015 and 2018 sardine runs. As all these samples failed the quality screening for RNA sequencing, we obtained six additional samples from the 2019 sardine run that were not used for genomic analyses. For each sardine run, samples originated from two different locations that were sampled at different times (table S2).

Genome assembly

Genomic libraries were prepared on the basis of a double-digest restriction site–associated DNA sequencing approach, ddRADseq (50), using the restriction enzymes Sbf I–HF and Mse I (New England Biolabs), and one of 96 unique 6–base pair (bp) barcodes was ligated to each individual library. We multiplexed 48 or 96 samples per lane, and libraries were randomly assigned to each of five Illumina HiSeq 2500 (Illumina Inc., San Diego, USA) lanes (single end, 100 bp). Raw sequences were demultiplexed using the process_radtags script from Stacks v2.4 (51); barcodes and RAD tags were then trimmed, and reads with low average base quality (Q > 20) were removed using Trimmomatic v0.39 (52). Using dDocent v2.19 (53), the remaining reads were then de novo assembled in a reference catalog. This catalog was used to call SNPs and genotype each sample. The obtained variants were filtered using VCFtools v1.19 (54) following Teske et al. (28). To reduce the effect of linkage disequilibrium, we removed one SNP from each pair of SNPs with r2 > 0.8 (55). Details about the different steps in the pipeline are listed in table S3.

Transcriptome assembly

The transcriptome data were generated from RNA using sardine livers that were cut into small pieces using a sterile scalpel blade and stored in RNAlater solution (Thermo Fisher Scientific). Although RNA sequence data are often used for gene expression analyses, we considered this approach unsuitable because the sardine livers could not be preserved under controlled conditions. For example, while some sardine livers from the pelagic surveys were preserved immediately, some of the sardine run fish had died at least an hour before preservation. Total RNA was extracted from each liver sample using a combination of mechanical homogenization with TRIzol and QIAGEN RNeasy purification kit (QIAGEN, Hilden, Germany). Then, cDNA libraries were constructed from each extraction, indexed separately, and sequenced on an Illumina HiSeq 4000 platform (Illumina Inc., San Diego, USA) following the manufacturer’s instructions for 2 × 150 paired-end chemistry.

The quality of raw sequences was checked using FastQC (56). Leading and trailing low-quality bases (with Phred score 52). A reference transcriptome for liver tissue was assembled de novo in Trinity v2.8.6 by using default settings in the Inchworm, Chrysalis, and Butterfly plugins (57). The quality of the assembled reference transcriptome was evaluated in QUAST v.4.6.3 (58). Quality-filtered sequences were mapped against assembled transcripts using the semi-global aligner BWA-MEM v0.7.12 (59). Resulting Sequence Alignment Map (SAM) files were converted to binary format and sorted in Samtools v1.6 (60). Only sequences with a minimum alignment quality score of ≥30 Phred (probability of erroneous alignment ≤0.001) in the SAM files were selected for downstream variant calling.
A combination of Samtools v1.6 mpileup and VarScan v2.3.9 mpileup2snp commands (61) was used to detect variant sites within sardine transcriptome. In VarScan, the minimum number of reads to call a variant site was set to eight reads, the P value was set to a maximum 0.001, and all variants with >90% support on one strand (forward or reverse reads) were removed. Indels and variant sites with more than 20% missing data were removed using VCFtools (54). The resulting Variant Format Calling (VCF) file was indexed using the tabix command in Samtools, and sites that showed a linkage disequilibrium coefficient (r2) >80% (55) and genotyping quality less than 50 were removed in BCFtools v1.6.33 (62). Details about the trinity assembly are shown in table S4. Details on the functional content and core biochemical pathways of the liver transcripts are provided in (63).

Identification of candidate and selectively neutral SNPs

Population structure was assessed with the genome data using both selectively neutral and candidate adaptive SNPs. The latter were identified using consensus between FST-based genome scans and GEAs using water temperature as the explanatory variable, while datasets of neutral SNPs were generated by removing a larger candidate dataset identified using more relaxed settings. Outlier detection by means of genome scans was performed using the program BayeScan v2.1 (64), using a burn-in of 100,000 iterations, 20 pilot runs of 5000 iterations, a thinning interval of 100, and 10,000 recorded iterations, with prior odds (PO) set to 10 and a false discovery rate (FDR) of 0.05. To identify SNPs under environmental selection (i.e., GEAs), we compiled data for the following predictor variables for each site: SST, salinity, dissolved oxygen, nitrate, phosphate, and silicate. It was assumed that during the last month of each sardine’s life, the fish did not move beyond the 0.25° × 0.25° area in which it was caught. For each location, NOAA optimally interpolated 0.25° resolution daily SST data (65) were extracted for 30 days leading up to each fish catch. Several descriptors, including minimum, maximum, and mean SST over the 30-day period, as well as the 5th and 95th percentiles, were computed. Salinity at 5 m depth was obtained from the World Ocean Atlas (WOA) (66) using the objectively interpolated climatological (2005–2017) mean at ¼° resolution. The salinity of each sample was assigned on the basis of the ¼° block in which it was collected or to the nearest ¼° block. Data for the remaining environmental variables, namely, dissolved O2, nitrate, phosphate, and silicate (all at 5 m depth) were also obtained from the WOA, in this case, using objectively interpolated climatological means (all available years) at 1° resolution. As there were several sardine samples collected in 1° blocks for which there were no WOA data (particularly inshore), contour plots for each environmental variable were generated in Surfer ( using available data around the coast and within the region 16°E to 32°E and 29°S to 38°S. The environmental parameter value for each sample location was assigned on the basis of the contour plots.
Linear regression analyses identified strong correlations between all pairs of variables (P P 67) to identify loci under selection. This method corrects for geographic structure and population history and thus reduces the occurrence of false positives. Although correcting for geographic distance also reduces the signal from true positives (68), this was considered necessary given the strong correlations between environmental parameters and geography along the South African coastline. It is well known that coarse resolution SST products can illustrate large biases close to the coast (69) and do not always adequately resolve smaller transient features (21), particularly in narrow shelf regions such as the east coast of South Africa. However, the SST data provided sufficient distinction between the broadly defined geographical regions (Fig. 1). Specifically, there was a clear west-to-east gradient (Fig. 1), with west coast sites being coldest and east coast sites being warmest.
To minimize the occurrence of false positives, only loci with a log10 Bayes factor (logBF) of >1.3 (“strong support”) (70) were considered to be under selection. Geographical coordinates were corrected for coastline curvature using the melfuR ( script viamaris, with extent.buffer set to 5 and default settings for all other parameters, followed by the mass v7.3-47 (71) script isoMDS, with k set to 2. In their core range, sardines are normally constrained to an SST of 22) but may experience warmer conditions during the sardine run (20). To reduce the effect of sardines being present at unfavorably high SST, we excluded all sardines caught on the east coast (i.e., north of 32°S) from the gINLAnd analyses. The procedure for identifying outlier loci from the transcriptome data was identical, but since no loci were identified using genome scans, we selected a more stringent logBF of >2.2 (“overwhelming support”) for the gINLAnd analyses.

Datasets of selectively neutral loci were created by relaxing conditions for detection of candidate loci (BayeScan: PO = 1, FDR = 0.05; gINLAnd: logBF = 0.25, i.e., “hardly worth mentioning”) and removing the resulting larger number of candidate loci from the complete dataset. The final genomic dataset comprised 8296 loci, of which 63 and 198 loci were identified as candidate loci using gINLAnd and BayeScan, respectively. Eleven loci that were identified by both methods were used in subsequent analyses. The transcriptome dataset comprised 14,973 loci, 234 of which were identified as candidate loci. Using more relaxed conditions for detection of candidate loci to be removed from the genomic dataset to create a dataset of neutral loci, 85 and 495 loci were identified using gINLAnd and BayeScan, respectively. Of these, 26 were shared, resulting in a total of 554 unique loci that were removed from the complete dataset, with 7742 loci remaining. A total of 1960 putative candidate loci were removed from the transcriptome dataset to create a dataset of neutral loci, but as this exceeded the column limit of GenAlEx, the dataset was further reduced to 8191 loci (corresponding to 16,384 columns).

Assessment of population structure

Population structure using neutral SNPs was assessed using fastSTRUCTURE (72), ADMIXTURE (73), and PCoA. For fastSTRUCTURE, default settings were used, and K was identified using the chooseK algorithm. ADMIXTURE was run with default settings, and the most suitable K was identified by comparing the magnitude of cross-validation errors. The PCoA was conducted in GenAlEx v6.5 (74). Raw distance data were converted to standardized covariance matrices using the algorithm of Orlóci (75).
Population structure for candidate SNPs was assessed using the same methods as above, except that the fastSTRUCTURE analysis was replaced by an analysis using STRUCTURE 2.3.4 (76), with the “locprior” model (77) implemented. We specified a burn-in of 1 million generations followed by 10 million iterations and assessed genetic structure for one to four clusters (K). For each value of K, 10 runs were conducted, and consistency of results was assessed in CLUMPAK (78). The same program was used to generate CLUMPP (79) bar plots, and both L(K) (76) and ΔK (80) in STRUCTURE HARVESTER (81) were used to identify the best K for each dataset.


We are grateful to Y. Geja and other Fish Team members (DFFE) for collecting sardine and to the Centre for High Performance Computing (CHPC) for the use of computational resources. We also thank R. Waples and an anonymous reviewer for comments on an earlier version of this manuscript. Ethical clearance for sampling was granted by the Faculty of Science Ethics Committee, University of Johannesburg (approval code: 19022015). Funding: NRF CSUR (grant no. 87702) to P.R.T., C.D.v.d.L., C.D.M., and L.B.B.; ARC Future Fellowship (FT130101068) to L.B.B.; NRF (grant no. 64801) to C.D.M.; and DFFE funding to T.L. and C.D.v.d.L. Author contributions: Conceptualization: P.R.T., C.D.M., L.B.B., and C.D.v.d.L. Analysis: P.R.T., A.E.-K., T.R.G., J.S.-C., T.L., and C.D.v.d.L. Visualization: P.R.T., T.L., and C.D.v.d.L. Supervision: P.R.T. Writing (original draft): P.R.T. Writing (review and editing): P.R.T., B.C., T.L., C.D.M., L.B.B., and C.D.v.d.L. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Genomic data are available from figshare (

Read more here: Source link