sequence alignment – MarkDuplicatesSpark failing with cryptic error message. MarkDuplicates succeeds


I have been trying to follow the GATK Best Practices workflow for ‘Data pre-processing for variant discovery’ (gatk.broadinstitute.org/hc/en-us/articles/360035535912).

This has all been run under Windows Subsystem for Linux 2 (WSL 2) in a Bash shell.

I started off with FASTQ files from IGSR (www.internationalgenome.org/data-portal) and performed alignment with Bowtie2 (instead of BWA).
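
For context, the alignment step was roughly along these lines (the index prefix, FASTQ names, thread count and read-group values below are placeholders rather than my exact command):

bowtie2 -x GRCh38_index -1 HG00102_1.fastq.gz -2 HG00102_2.fastq.gz --rg-id HG00102 --rg SM:HG00102 --rg PL:ILLUMINA -p 8 -S HG00102_hGRCH38_exome_aignment.sam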

This produced a SAM file, which I converted to a BAM file using samtools.
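
The conversion was done roughly as follows (exact options may have differed slightly):

samtools view -b -o HG00102_hGRCH38_exome_aignment.bam HG00102_hGRCH38_exome_aignment.sam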

I have been trying to use MarkDuplicatesSpark (gatk.broadinstitute.org/hc/en-us/articles/4409897162139-MarkDuplicatesSpark) but am getting an output that I can’t make sense of.

I am running the following command:

gatk MarkDuplicatesSpark -I HG00102_hGRCH38_exome_aignment.bam -O HG00102_hGRCH38_exome_aignment.marked_duplicates.bam --remove-sequencing-duplicates --spark-master local[*]

I am getting the following output:

Using GATK jar /mnt/c/Users/angus/Documents/Bioinformatics/GATK4/gatk-4.2.4.0/gatk-package-4.2.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mnt/c/Users/angus/Documents/Bioinformatics/GATK4/gatk-4.2.4.0/gatk-package-4.2.4.0-local.jar MarkDuplicatesSpark -I HG00102_hGRCH38_exome_aignment.bam -O HG00102_hGRCH38_exome_aignment.marked_duplicates.bam --remove-sequencing-duplicates --spark-master local[*]

22:45:08.487 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/c/Users/angus/Documents/Bioinformatics/GATK4/gatk-4.2.4.0/gatk-package-4.2.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Dec 19, 2021 10:45:08 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:45:08.748 INFO  MarkDuplicatesSpark - ------------------------------------------------------------
22:45:08.748 INFO  MarkDuplicatesSpark - The Genome Analysis Toolkit (GATK) v4.2.4.0
22:45:08.748 INFO  MarkDuplicatesSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
22:45:08.748 INFO  MarkDuplicatesSpark - Executing as *removed* on Linux v4.19.128-microsoft-standard amd64
22:45:08.749 INFO  MarkDuplicatesSpark - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1+1-Ubuntu-0ubuntu1.18.04
22:45:08.749 INFO  MarkDuplicatesSpark - Start Date/Time: December 19, 2021 at 10:45:08 PM GMT
22:45:08.749 INFO  MarkDuplicatesSpark - ------------------------------------------------------------
22:45:08.749 INFO  MarkDuplicatesSpark - ------------------------------------------------------------
22:45:08.750 INFO  MarkDuplicatesSpark - HTSJDK Version: 2.24.1
22:45:08.750 INFO  MarkDuplicatesSpark - Picard Version: 2.25.4
22:45:08.751 INFO  MarkDuplicatesSpark - Built for Spark Version: 2.4.5
22:45:08.751 INFO  MarkDuplicatesSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:45:08.751 INFO  MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:45:08.751 INFO  MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:45:08.751 INFO  MarkDuplicatesSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:45:08.751 INFO  MarkDuplicatesSpark - Deflater: IntelDeflater
22:45:08.751 INFO  MarkDuplicatesSpark - Inflater: IntelInflater
22:45:08.751 INFO  MarkDuplicatesSpark - GCS max retries/reopens: 20
22:45:08.752 INFO  MarkDuplicatesSpark - Requester pays: disabled
22:45:08.752 INFO  MarkDuplicatesSpark - Initializing engine
22:45:08.752 INFO  MarkDuplicatesSpark - Done initializing engine
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/12/19 22:45:09 WARN Utils: Your hostname, *hostname removed* resolves to a loopback address: *IP removed*; using 172.27.147.65 instead (on interface eth0)
21/12/19 22:45:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/mnt/c/Users/angus/Documents/Bioinformatics/GATK4/gatk-4.2.4.0/gatk-package-4.2.4.0-local.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/12/19 22:45:10 INFO SparkContext: Running Spark version 2.4.5
21/12/19 22:45:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/12/19 22:45:10 INFO SparkContext: Submitted application: MarkDuplicatesSpark
21/12/19 22:45:10 INFO SecurityManager: Changing view acls to: abgane
21/12/19 22:45:10 INFO SecurityManager: Changing modify acls to: abgane
21/12/19 22:45:10 INFO SecurityManager: Changing view acls groups to:
21/12/19 22:45:10 INFO SecurityManager: Changing modify acls groups to:
21/12/19 22:45:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(abgane); groups with view permissions: Set(); users  with modify permissions: Set(abgane); groups with modify permissions: Set()
21/12/19 22:45:10 INFO Utils: Successfully started service 'sparkDriver' on port 35319.
21/12/19 22:45:10 INFO SparkEnv: Registering MapOutputTracker
21/12/19 22:45:11 INFO SparkEnv: Registering BlockManagerMaster
21/12/19 22:45:11 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/12/19 22:45:11 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/12/19 22:45:11 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-1e5e323f-dd90-4093-980c-0ceda43349c2
21/12/19 22:45:11 INFO MemoryStore: MemoryStore started with capacity 3.6 GB
21/12/19 22:45:11 INFO SparkEnv: Registering OutputCommitCoordinator
21/12/19 22:45:11 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/12/19 22:45:11 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.27.147.65:4040
21/12/19 22:45:11 INFO Executor: Starting executor ID driver on host localhost
21/12/19 22:45:11 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33189.
21/12/19 22:45:11 INFO NettyBlockTransferService: Server created on 172.27.147.65:33189
21/12/19 22:45:11 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/12/19 22:45:11 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.27.147.65, 33189, None)
21/12/19 22:45:11 INFO BlockManagerMasterEndpoint: Registering block manager 172.27.147.65:33189 with 3.6 GB RAM, BlockManagerId(driver, 172.27.147.65, 33189, None)
21/12/19 22:45:11 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.27.147.65, 33189, None)
21/12/19 22:45:11 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.27.147.65, 33189, None)
22:45:11.804 INFO  MarkDuplicatesSpark - Spark verbosity set to INFO (see --spark-verbosity argument)
21/12/19 22:45:11 INFO GoogleHadoopFileSystemBase: GHFS version: 1.9.4-hadoop3
21/12/19 22:45:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 172.9 KB, free 3.6 GB)
21/12/19 22:45:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 35.5 KB, free 3.6 GB)
21/12/19 22:45:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.27.147.65:33189 (size: 35.5 KB, free: 3.6 GB)
21/12/19 22:45:13 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at PathSplitSource.java:96
21/12/19 22:45:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 172.9 KB, free 3.6 GB)
21/12/19 22:45:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 35.5 KB, free 3.6 GB)
21/12/19 22:45:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.27.147.65:33189 (size: 35.5 KB, free: 3.6 GB)
21/12/19 22:45:13 INFO SparkContext: Created broadcast 1 from newAPIHadoopFile at PathSplitSource.java:96
21/12/19 22:45:13 INFO FileInputFormat: Total input files to process : 1
21/12/19 22:45:13 INFO SparkUI: Stopped Spark web UI at http://172.27.147.65:4040
21/12/19 22:45:13 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/12/19 22:45:13 INFO MemoryStore: MemoryStore cleared
21/12/19 22:45:13 INFO BlockManager: BlockManager stopped
21/12/19 22:45:13 INFO BlockManagerMaster: BlockManagerMaster stopped
21/12/19 22:45:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/12/19 22:45:13 INFO SparkContext: Successfully stopped SparkContext
22:45:13.851 INFO  MarkDuplicatesSpark - Shutting down engine
[December 19, 2021 at 10:45:13 PM GMT] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=419430400
java.lang.IllegalArgumentException: Unsupported class file major version 55
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
    at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
    at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
    at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
    at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
    at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
    at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
    at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
    at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2100)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
    at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:309)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:151)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
        at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
        at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
        at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortUsingElementsAsKeys(SparkUtils.java:165)
        at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortReadsAccordingToHeader(SparkUtils.java:143)
        at org.broadinstitute.hellbender.utils.spark.SparkUtils.querynameSortReadsIfNecessary(SparkUtils.java:306)
        at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.mark(MarkDuplicatesSpark.java:206)
        at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.mark(MarkDuplicatesSpark.java:270)
        at org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark.runTool(MarkDuplicatesSpark.java:350)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
21/12/19 22:45:13 INFO ShutdownHookManager: Shutdown hook called
21/12/19 22:45:13 INFO ShutdownHookManager: Deleting directory /tmp/spark-f49df209-c946-40e3-96ca-9d8125c3a733

As an alternative, I am trying to sort the BAM file and use the non-Spark MarkDuplicates instead (gatk.broadinstitute.org/hc/en-us/articles/4409924785691-MarkDuplicates-Picard-). This appears to work fine. However, I would obviously like to make use of multiprocessing, so ideally I want to get the Spark implementation working.
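
For reference, the fallback I am running instead looks roughly like this (the output file names are illustrative):

samtools sort -o HG00102_hGRCH38_exome_aignment.sorted.bam HG00102_hGRCH38_exome_aignment.bam
gatk MarkDuplicates -I HG00102_hGRCH38_exome_aignment.sorted.bam -O HG00102_hGRCH38_exome_aignment.marked_duplicates.bam -M HG00102_hGRCH38_exome_aignment.duplicate_metrics.txt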

Does anyone know what might be causing this command to fail?

Any ideas for a fix?

Any alternative software that might be useful?

Thanks for your time,

Angus

