# Getting started with tomahawk

This is an introductory tutorial for using Tomahawk. It will cover:

• Importing data into Tomahawk
• Subsetting and filtering output data
• Aggregating datasets

## Usage instructions

The Tomahawk CLI comprises several distinct subroutines (listed below). Executing tomahawk gives a list of commands with brief descriptions, and tomahawk <command> gives detailed usage information for that command.

All primary Tomahawk commands operate on the binary Tomahawk twk and Tomahawk output two file formats. Interconversions between twk and vcf/bcf are supported through the commands import (vcf/bcf -> twk) and view (twk -> vcf). Linkage disequilibrium data is written out in the binary two format or the human-readable ld format.

| Command | Description |
|---|---|
| aggregate | data rasterization framework for TWO files |
| calc | calculate linkage disequilibrium |
| scalc | calculate linkage disequilibrium for a single site |
| concat | concatenate TWO files from the same set of samples |
| import | import VCF/VCF.gz/BCF to TWK |
| sort | sort TWO file |
| view | TWO->LD/TWO view, TWO subset and filter |
| haplotype | extract per-sample haplotype strings in FASTA/binary format |
| relationship | compute marker-based pair-wise sample relationship matrices |
| decay | compute LD-decay over distance |
| prune | perform graph-based LD-pruning of variant sites |

## Importing into Tomahawk

By design, Tomahawk only operates on diploid, bi-allelic SNVs and therefore filters out indels and complex variants. Tomahawk does not support mixed phasing of genotypes at the same variant (e.g. 0|0, 0/1). If mixed phasing is found for a record, all genotypes for that site are converted to unphased genotypes. This is a conscious design choice, as it ensures that the correct (unphased) algorithm is internally invoked for mixed-phase sites.
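The conversion rule can be illustrated with a short sketch (this is illustrative only, not Tomahawk's internal code; genotypes are assumed to be VCF-style strings):

```python
def normalize_site(genotypes):
    """If a site mixes phased ('|') and unphased ('/') genotypes,
    demote every genotype at that site to unphased."""
    has_phased = any("|" in g for g in genotypes)
    has_unphased = any("/" in g for g in genotypes)
    if has_phased and has_unphased:  # mixed phasing detected
        return [g.replace("|", "/") for g in genotypes]
    return genotypes  # uniformly phased (or unphased) sites are kept as-is

print(normalize_site(["0|0", "0/1", "1|1"]))  # ['0/0', '0/1', '1/1']
print(normalize_site(["0|0", "1|1"]))         # unchanged: ['0|0', '1|1']
```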

Importing standard files into Tomahawk involves the import command. The following command imports a bcf file and outputs file.twk while filtering out variants with >20% missingness and sites that deviate from Hardy-Weinberg equilibrium with a probability < 0.001.

```
tomahawk import -i file.bcf -o file -m 0.2 -H 1e-3
```
```
$ md5sum 1kgp3_chr6.bcf
8c05554ebe1a51e99be2471c96fad4d9  1kgp3_chr6.bcf

$ bcftools view 1kgp3_chr6.bcf -HG | wc -l
5024119
```
```
$ time tomahawk import -i 1kgp3_chr6.bcf -o 1kgp3_chr6
```

Auto-completion of file extensions

You do not have to append the .twk suffix to the output name as Tomahawk will automatically add it if missing. In the example above, "1kgp3_chr6" is expanded to "1kgp3_chr6.twk". This is true for most Tomahawk commands when using the CLI.

```
Program: tomahawk-264d039a-dirty (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.3.8; htslib 1.9
Contact: Marcus D. R. Klarqvist
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-01-21 11:58:15,692][LOG] Calling import...
[2019-01-21 11:58:15,692][LOG][READER] Opening 1kgp3_chr6.bcf...
[2019-01-21 11:58:15,695][LOG][VCF] Constructing lookup table for 86 contigs...
[2019-01-21 11:58:15,695][LOG][VCF] Samples: 2,504...
[2019-01-21 11:58:15,695][LOG][WRITER] Opening 1kgp3_chr6.twk...
[2019-01-21 11:58:39,307][LOG] Duplicate site dropped: 6:18233985
[2019-01-21 11:59:26,749][LOG] Duplicate site dropped: 6:55137646
[2019-01-21 11:59:39,315][LOG] Duplicate site dropped: 6:67839893
[2019-01-21 11:59:47,176][LOG] Duplicate site dropped: 6:74373442
[2019-01-21 11:59:51,440][LOG] Duplicate site dropped: 6:77843171
[2019-01-21 12:00:41,784][LOG] Duplicate site dropped: 6:121316830
[2019-01-21 12:01:12,830][LOG] Duplicate site dropped: 6:148573620
[2019-01-21 12:01:16,557][LOG] Duplicate site dropped: 6:151397786
[2019-01-21 12:01:42,644][LOG] Wrote: 4,784,608 variants to 9,570 blocks...
[2019-01-21 12:01:42,644][LOG] Finished: 03m26,952s
[2019-01-21 12:01:42,644][LOG] Filtered out 239,511 sites (4.76722%):
[2019-01-21 12:01:42,645][LOG] Invariant: 15,485 (0.308213%)
[2019-01-21 12:01:42,645][LOG] Missing threshold: 0 (0%)
[2019-01-21 12:01:42,645][LOG] Insufficient samples: 0 (0%)
[2019-01-21 12:01:42,645][LOG] Mixed ploidy: 0 (0%)
[2019-01-21 12:01:42,645][LOG] No genotypes: 0 (0%)
[2019-01-21 12:01:42,645][LOG] No FORMAT: 0 (0%)
[2019-01-21 12:01:42,645][LOG] Not biallelic: 26,277 (0.523017%)
[2019-01-21 12:01:42,645][LOG] Not SNP: 194,566 (3.87264%)
```

## Computing linkage-disequilibrium

### All pairwise comparisons

In this first example, we will compute all-vs-all LD associations using the data we imported in the previous section. To limit compute, we restrict our attention to a 1/45 section of the data by passing the -c and -C job parameters. We will be using 8 threads (-t), but you may need to adjust this to match the hardware available on your machine. This job compares pairs of variants >141 billion times and takes around 30 minutes to finish on most machines.

```
$ tomahawk calc -pi 1kgp3_chr6.twk -o 1kgp3_chr6_1_45 -C 1 -c 45 -t 8
```

Valid job balancing partitions

When computing genome-wide LD, load balancing requires that the number of sub-problems (-c) is a triangular number, i.e. of the form $$\binom{c}{2} + c = \frac{c^2 + c}{2}, \quad c > 0$$ This is equivalent to the upper triangle of a square (c-by-c) matrix plus its diagonal. Read more about load partitioning in Tomahawk.

Here are the first 100 valid partition sizes:
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561, 595, 630, 666, 703, 741, 780, 820, 861, 903, 946, 990, 1035, 1081, 1128, 1176, 1225, 1275, 1326, 1378, 1431, 1485, 1540, 1596, 1653, 1711, 1770, 1830, 1891, 1953, 2016, 2080, 2145, 2211, 2278, 2346, 2415, 2485, 2556, 2628, 2701, 2775, 2850, 2926, 3003, 3081, 3160, 3240, 3321, 3403, 3486, 3570, 3655, 3741, 3828, 3916, 4005, 4095, 4186, 4278, 4371, 4465, 4560, 4656, 4753, 4851, 4950, 5050
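Since the valid sizes are the triangular numbers, a candidate -c value can be generated or checked directly; a minimal sketch:

```python
import math

def valid_partition_sizes(n):
    """First n valid -c values: binom(c, 2) + c = c*(c+1)/2 for c = 1..n."""
    return [c * (c + 1) // 2 for c in range(1, n + 1)]

def is_valid_partition(x):
    """True if x is a triangular number, i.e. a legal -c value."""
    c = (math.isqrt(8 * x + 1) - 1) // 2  # invert c*(c+1)/2 = x
    return c * (c + 1) // 2 == x

print(valid_partition_sizes(10))  # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
print(is_valid_partition(45), is_valid_partition(44))  # True False
```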

By default, Tomahawk will attempt to compute genome-wide linkage disequilibrium for all the provided variants and samples. This generally requires a huge number of pairwise variant comparisons. Depending on your interests, you may not want to undertake such a large compute and should parameterize for your particular use-case.

As a consequence of the job balancing approach in Tomahawk, the diagonal jobs will always involve fewer variant comparisons ($$\binom{n}{2}$$) than off-diagonal jobs ($$n \times m$$), where $$n$$ and $$m$$ are the number of records in either block. Keep this in mind when choosing a partition size based on the run-times of a given job. For example, examine the total run-time of the second job (-C 2), the first off-diagonal job, as a proxy for the expected average run-time per job.
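The asymmetry is easy to quantify; for blocks of n and m records (a sketch of the counting only, not Tomahawk code):

```python
from math import comb

def job_comparisons(n, m=None, diagonal=False):
    """Pairwise variant comparisons performed by one job:
    a diagonal job compares all pairs within a single block of n records,
    while an off-diagonal job compares every record in a block of n
    against every record in another block of m."""
    return comb(n, 2) if diagonal else n * m

n = m = 10_000
print(job_comparisons(n, diagonal=True))  # n*(n-1)/2 = 49,995,000
print(job_comparisons(n, m))              # n*m       = 100,000,000
```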

Output ordering

By design, Tomahawk computes pairwise associations using an out-of-order execution paradigm and further permutes the input ordering by cache-blocking. Additionally, each thread transiently stores computed associations in a private buffer that is flushed upon reaching a frequency threshold. Because of these technicalities, Tomahawk will neither produce ordered output nor will the permutation of the output order be identical between runs on identical data. This has no practical consequence for most downstream applications in Tomahawk, with the exception of query speed. To address this limitation we have introduced a powerful sorting paradigm, discussed below.

```
Program: tomahawk-264d039a-dirty (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.3.8; htslib 1.9
Contact: Marcus D. R. Klarqvist
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-01-21 12:13:50,210][LOG] Calling calc...
[2019-01-21 12:13:50,211][LOG][READER] Opening 1kgp3_chr6.twk...
[2019-01-21 12:13:50,212][LOG] Samples: 2,504...
[2019-01-21 12:13:50,212][LOG][BALANCING] Using ranges [0-1063,0-1063] in square mode...
[2019-01-21 12:13:50,212][LOG] Allocating 1,063 blocks...
[2019-01-21 12:13:50,213][LOG] Running in standard mode. Pre-computing data...
[2019-01-21 12:13:50,213][LOG][SIMD] Vectorized instructions available: SSE4...
[2019-01-21 12:13:50,213][LOG] Constructing list, vector, RLE...
[2019-01-21 12:13:50,213][LOG][THREAD] Unpacking using 7 threads: ....... Done! 01,291s
[2019-01-21 12:13:51,504][LOG] 531,500 variants from 1,063 blocks...
[2019-01-21 12:13:51,504][LOG][PARAMS] square=TRUE,window=FALSE,low_memory=FALSE,bitmaps=FALSE,single=FALSE,force_phased=TRUE,force_unphased=FALSE,compression_level=1,block_size=500,output_block_size=10000,l_surrounding=500000,minP=1.000000,minR2=0.100000,maxR2=100.000000,minDprime=0.000000,maxDprime=100.000000,n_chunks=45,c_chunk=0,n_threads=8,ldd_type=3,cycle_threshold=0
[2019-01-21 12:13:51,504][LOG] Performing: 141,245,859,250 variant comparisons...
[2019-01-21 12:13:51,504][LOG][WRITER] Opening 1kgp3_chr6_1_45.two...
[2019-01-21 12:13:51,505][LOG][THREAD] Spawning 8 threads: ........
[2019-01-21 12:13:51,533][PROGRESS] Time elapsed Variants Genotypes Output Progress Est. Time left
[2019-01-21 12:14:21,533][PROGRESS] 30,000s 3,685,746,500 9,229,109,236,000 988,858 2.60945% 18m39s
[2019-01-21 12:14:51,533][PROGRESS] 01m00,000s 6,425,868,750 16,090,375,350,000 1,689,422 4.54942% 20m58s
[truncated]
[2019-01-21 12:39:51,546][PROGRESS] 26m00,013s 140,131,382,750 350,888,982,406,000 48,635,114 99.211% 12s
[2019-01-21 12:40:04,317][PROGRESS] Finished in 26m12,784s. Variants: 141,245,859,250, genotypes: 353,679,631,562,000, output: 49,870,388
[2019-01-21 12:40:04,317][PROGRESS] 89,806,242 variants/s and 224,874,830,855 genotypes/s
[2019-01-21 12:40:04,321][LOG][PROGRESS] All done...26m12,815s!
```

This run generated 935817820 bytes (935.8 MB) of output data in binary format using the default compression level (1):

```
$ ls -l 1kgp3_chr6_1_45.two
-rw-rw-r-- 1 mk819 mk819 935817820 Jan 21 12:40 1kgp3_chr6_1_45.two
```

If this data were stored in plain, human-readable text format it would occupy over 4.5 GB:

```
$ tomahawk view -i 1kgp3_chr6_1_45.two | wc -c
4544702179
```

### Sliding window

If you are working with a species with well-known LD structure (such as humans), you can reduce the computational cost by limiting the search space to a fixed-size sliding window (-w). In window mode you are free to choose any sub-problem count (-c).

Window properties

By default, Tomahawk computes linkage disequilibrium associations for all pairs of variants within -w bases of the target SNV in either direction. To maximize computational throughput, we made the design decision to maintain blocks of data that overlap the target interval. As a consequence, adjacent data that does not directly overlap the target region will still be computed. The technical reasoning, in simplified terms, is that the online repackaging of internal data blocks is more expensive than computing a relatively small number of off-target associations.

```
$ time tomahawk calc -pi 1kgp3_chr6.twk -o 1kgp3_chr6_4mb -w 4000000
```
```
Program: tomahawk-264d039a-dirty (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.3.8; htslib 1.9
Contact: Marcus D. R. Klarqvist
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-01-21 14:10:33,616][LOG] Calling calc...
[2019-01-21 14:10:33,616][LOG][READER] Opening 1kgp3_chr6.twk...
[2019-01-21 14:10:33,619][LOG] Samples: 2,504...
[2019-01-21 14:10:33,619][LOG][BALANCING] Using ranges [0-9570,0-9570] in window mode...
[2019-01-21 14:10:33,619][LOG] Allocating 9,570 blocks...
[2019-01-21 14:10:33,620][LOG] Running in standard mode. Pre-computing data...
[2019-01-21 14:10:33,620][LOG][SIMD] Vectorized instructions available: SSE4...
[2019-01-21 14:10:33,620][LOG] Constructing list, vector, RLE...
[2019-01-21 14:10:33,620][LOG][THREAD] Unpacking using 7 threads: ....... Done! 15,774s
[2019-01-21 14:10:49,394][LOG] 4,784,608 variants from 9,570 blocks...
[2019-01-21 14:10:49,394][LOG][PARAMS] square=TRUE,window=TRUE,low_memory=FALSE,bitmaps=FALSE,single=FALSE,force_phased=TRUE,force_unphased=FALSE,compression_level=1,block_size=500,output_block_size=10000,window_size=4000000,l_surrounding=500000,minP=1.000000,minR2=0.100000,maxR2=100.000000,minDprime=0.000000,maxDprime=100.000000,n_chunks=1,c_chunk=0,n_threads=8,ldd_type=3,cycle_threshold=0
[2019-01-21 14:10:49,394][LOG] Performing: 11,446,234,464,528 variant comparisons...
[2019-01-21 14:10:49,394][LOG][WRITER] Opening 1kgp3_chr6_4mb.two...
[2019-01-21 14:10:49,402][LOG][THREAD] Spawning 8 threads: ........
[2019-01-21 14:10:49,424][PROGRESS] Time elapsed Variants Genotypes Output Progress Est. Time left
[2019-01-21 14:11:19,424][PROGRESS] 30,000s 2,311,240,500 5,787,346,212,000 853,368 0 0
[2019-01-21 14:11:49,424][PROGRESS] 01m00,000s 4,562,731,500 11,425,079,676,000 1,717,956 0 0
[truncated]
[2019-01-21 16:12:49,492][PROGRESS] 02h02m00,068s 527,889,726,500 1,321,835,875,156,000 470,354,846 0 0
[2019-01-21 16:13:17,236][PROGRESS] Finished in 02h02m27,812s. Variants: 529,807,522,528, genotypes: 1,326,638,036,410,112, output: 473,514,018
[2019-01-21 16:13:17,237][PROGRESS] 72,104,114 variants/s and 180,548,701,880 genotypes/s
[2019-01-21 16:13:17,257][LOG][PROGRESS] All done...02h02m27,854s!
```

### Single interval vs its local neighbourhood

In many cases we are not interested in computing large-scale linkage disequilibrium associations but have a limited genomic window of interest. In that case, you can use scalc, a specialization of the calc subroutine that is parameterized with a target region and a neighbourhood size in bases.

In this example, we will compute the regional associations of variants mapping to the interval chr6:10e6-10.1e6 and a 100kb neighbourhood.

```
tomahawk scalc -i 1kgp3_chr6.twk -o test -I 6:10e6-10.1e6 -w 100000
```

Neighbourhood interval

The neighbourhood comprises the regions [start of interval - neighbourhood, start of interval) and (end of interval, end of interval + neighbourhood]. If the start of the interval minus the neighbourhood falls outside the chromosome (position < 0), the left end of that region is truncated to 0.
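Assuming the convention above, the flanking regions can be computed as follows (an illustrative sketch, not the scalc implementation):

```python
def flanks(start, end, w):
    """Left and right neighbourhood regions for the interval [start, end]
    with neighbourhood size w; the left flank is truncated at position 0."""
    left = (max(0, start - w), start)  # truncate at the chromosome start
    right = (end, end + w)
    return left, right

# chr6:10e6-10.1e6 with a 100kb neighbourhood:
print(flanks(10_000_000, 10_100_000, 100_000))
# ((9900000, 10000000), (10100000, 10200000))
print(flanks(50_000, 60_000, 100_000))  # left flank truncated to 0
```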

Scalability

This subroutine was designed to be used with relatively short intervals. It is possible to provide any valid interval as the target region, but providing large intervals will result in poor performance and potentially undesired output.

## Concatenating multiple archives

One of the immediate downsides of partitioning compute into multiple non-overlapping sub-problems is that we generate a large number of independent files that must be merged prior to downstream analysis. Fortunately, this is a trivial operation in Tomahawk using the concat command.

First, let's compute some data using the first 3 out of 990 partitions of the dataset used above.

```
for i in {1..3}; do time tomahawk calc -pi 1kgp3_chr6.twk -c 990 -C $i -o part${i}_3.two; done
```

Next, since we only have three files, we can concatenate (merge) them into a single archive by passing each file name to the command:

```
$ time tomahawk concat -i part1_3.two -i part2_3.two -i part3_3.two -o part3_concat
[2019-01-22 10:55:47,799][LOG] All files are compatible. Beginning merging...
[2019-01-22 10:55:47,800][LOG] Appending part1_3.two... 397.953344 Mb/98.987848 Mb
[2019-01-22 10:55:47,889][LOG] Appending part2_3.two... 359.538884 Mb/54.763322 Mb
[2019-01-22 10:55:47,938][LOG] Appending part3_3.two... 346.075500 Mb/52.072741 Mb
[2019-01-22 10:55:47,988][LOG] Finished. Added 3 files...
[2019-01-22 10:55:47,988][LOG] Total size: Uncompressed = 1.103568 Gb and compressed = 205.823911 Mb

real    0m0.226s
user    0m0.036s
sys     0m0.190s
```

Passing every input file on the command line becomes infeasible when there are many hundreds to thousands of files. To address this, it is possible to first store the target file paths in a text file and then pass that to Tomahawk. Using the data from above, we save these file paths to the file part_file_list.txt:

```
for i in {1..3}; do echo part${i}_3.two >> part_file_list.txt; done
```

Then we simply pass this list of file paths to Tomahawk using the -I argument:

```
$ time tomahawk concat -I part_file_list.txt -o part3_concat
adding file=part1_3.two
adding file=part2_3.two
adding file=part3_3.two
[2019-01-22 10:57:43,227][LOG] All files are compatible. Beginning merging...
[2019-01-22 10:57:43,262][LOG] Appending part1_3.two... 397.953344 Mb/98.987848 Mb
[2019-01-22 10:57:43,350][LOG] Appending part2_3.two... 359.538884 Mb/54.763322 Mb
[2019-01-22 10:57:43,400][LOG] Appending part3_3.two... 346.075500 Mb/52.072741 Mb
[2019-01-22 10:57:43,447][LOG] Finished. Added 3 files...
[2019-01-22 10:57:43,447][LOG] Total size: Uncompressed = 1.103568 Gb and compressed = 205.823911 Mb

real    0m0.477s
user    0m0.040s
sys     0m0.241s
```

Possible duplications

The concat subroutine will dedupe input file paths but will neither check nor guarantee that the actual input data is not duplicated. It is therefore possible to end up with duplicate records if you are not careful, and such duplicates could corrupt any downstream insights!

## Sorting output files

All subroutines in Tomahawk work on unsorted output files. However, sorted files are much faster to query and will result in considerable time savings, especially for very large files. Sorting files in Tomahawk is trivial and involves calling the sort command with the required input (-i) and output (-o) fields.

Disk usage

Tomahawk can sort huge .two files with the help of an external merge sort followed by a single memory-assisted k-way merge. The external merge step requires additional disk space approximately equal to the size of the input dataset. The merge operation then writes a generally smaller output file. On average, we can approximate that the sort operation requires additional disk space of roughly twice the input file size. After the sorting procedure completes, the intermediate files are removed.

Temporary files

By default, Tomahawk generates temporary files in the same directory as the input file. If this file path is undesired, you have to modify the appropriate parameters.

In this example we will sort the almost 500 million (473,514,826) records computed in the sliding window example above. This archive corresponds to >50 Gb of binary data.

```
$ time tomahawk sort -i 1kgp3_chr6_4mb.two -o 1kgp3_chr6_4mb_sorted
```
```
Program: tomahawk-264d039a-dirty (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.3.8; htslib 1.9
Contact: Marcus D. R. Klarqvist
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-01-22 11:20:43,978][LOG] Calling sort...
[2019-01-22 11:20:43,993][LOG] Blocks: 47,358
[2019-01-22 11:20:43,993][LOG] Uncompressed size: 50.192950 Gb
[2019-01-22 11:20:43,993][LOG] Sorting 473,514,826 records...
[2019-01-22 11:20:43,993][LOG][THREAD] Data/thread: 6.274119 Gb
[2019-01-22 11:20:43,993][PROGRESS] Time elapsed Variants Progress Est. Time left
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-0: range=0->5919/47358 and name 1kgp3_chr6_4mb_sorted_fileSucRkC.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-1: range=5919->11838/47358 and name 1kgp3_chr6_4mb_sorted_filewPI10T.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-2: range=11838->17757/47358 and name 1kgp3_chr6_4mb_sorted_fileo1zcHb.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-3: range=17757->23676/47358 and name 1kgp3_chr6_4mb_sorted_fileBvRnnt.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-4: range=23676->29595/47358 and name 1kgp3_chr6_4mb_sorted_filewbBz3K.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-5: range=29595->35514/47358 and name 1kgp3_chr6_4mb_sorted_fileCrNLJ2.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-6: range=35514->41433/47358 and name 1kgp3_chr6_4mb_sorted_filerCfYpk.two
[2019-01-22 11:20:43,994][LOG][THREAD] Slave-7: range=41433->47358/47358 and name 1kgp3_chr6_4mb_sorted_filexx9a6B.two
[2019-01-22 11:21:13,994][PROGRESS] 30,000s 70,594,275 14.9086% 2m51s
[2019-01-22 11:21:43,994][PROGRESS] 01m00,000s 146,732,835 30.988% 2m13s
[2019-01-22 11:22:13,994][PROGRESS] 01m30,000s 245,535,675 51.8539% 1m23s
[2019-01-22 11:22:43,995][PROGRESS] 02m00,001s 323,064,235 68.2268% 55s
[2019-01-22 11:23:13,995][PROGRESS] 02m30,001s 421,592,790 89.0348% 18s
[2019-01-22 11:23:24,951][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_filewbBz3K.two with 23676-29595. Sorted n=59190000 variants with size=1.176156 Gb
[2019-01-22 11:23:25,114][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_fileBvRnnt.two with 17757-23676. Sorted n=59190000 variants with size=1.286938 Gb
[2019-01-22 11:23:25,385][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_filerCfYpk.two with 35514-41433. Sorted n=59190000 variants with size=1.219126 Gb
[2019-01-22 11:23:25,710][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_fileCrNLJ2.two with 29595-35514. Sorted n=59190000 variants with size=1.161081 Gb
[2019-01-22 11:23:26,384][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_filexx9a6B.two with 41433-47358. Sorted n=59184826 variants with size=1.215231 Gb
[2019-01-22 11:23:26,425][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_fileSucRkC.two with 0-5919. Sorted n=59190000 variants with size=1.242923 Gb
[2019-01-22 11:23:29,966][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_fileo1zcHb.two with 11838-17757. Sorted n=59190000 variants with size=1.686758 Gb
[2019-01-22 11:23:31,300][LOG][THREAD] Finished: 1kgp3_chr6_4mb_sorted_filewPI10T.two with 5919-11838. Sorted n=59190000 variants with size=2.013554 Gb
[2019-01-22 11:23:31,338][PROGRESS] 02m47,344s 473,514,826 (2,829,587 variants/s)
[2019-01-22 11:23:31,338][PROGRESS] Finished!
[2019-01-22 11:23:31,341][LOG] Spawning 112 queues with 2.380952 Mb each...
[2019-01-22 11:23:35,090][LOG][WRITER] Opening "1kgp3_chr6_4mb_sorted.two"...
[2019-01-22 11:23:35,091][PROGRESS] Time elapsed Variants Progress Est. Time left
[2019-01-22 11:24:05,091][PROGRESS] 30,000s 39,757,023 8.39615% 5m27s
[2019-01-22 11:24:35,091][PROGRESS] 01m00,000s 76,990,000 16.2593% 5m9s
[2019-01-22 11:25:05,091][PROGRESS] 01m30,000s 107,840,000 22.7744% 5m5s
[2019-01-22 11:25:35,091][PROGRESS] 02m00,000s 133,420,000 28.1765% 5m5s
[2019-01-22 11:26:05,092][PROGRESS] 02m30,000s 170,215,235 35.9472% 4m27s
[2019-01-22 11:26:35,092][PROGRESS] 03m00,001s 208,070,000 43.9416% 3m49s
[2019-01-22 11:27:05,092][PROGRESS] 03m30,001s 247,010,000 52.1652% 3m12s
[2019-01-22 11:27:35,092][PROGRESS] 04m00,001s 287,690,000 60.7563% 2m35s
[2019-01-22 11:28:05,092][PROGRESS] 04m30,001s 325,480,000 68.737% 2m2s
[2019-01-22 11:28:35,093][PROGRESS] 05m00,001s 364,299,966 76.9353% 1m29s
[2019-01-22 11:29:05,093][PROGRESS] 05m30,002s 401,367,427 84.7634% 59s
[2019-01-22 11:29:35,093][PROGRESS] 06m00,002s 439,290,000 92.7722% 28s
[2019-01-22 11:30:02,180][PROGRESS] 06m27,089s 473,514,826 (1,223,269 variants/s)
[2019-01-22 11:30:02,180][PROGRESS] Finished!
[2019-01-22 11:30:02,194][LOG] Finished merging! Time: 06m27,103s
[2019-01-22 11:30:02,194][LOG] Deleting temp files...
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_fileSucRkC.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_filewPI10T.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_fileo1zcHb.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_fileBvRnnt.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_filewbBz3K.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_fileCrNLJ2.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_filerCfYpk.two
[2019-01-22 11:30:02,194][LOG] Deleted 1kgp3_chr6_4mb_sorted_filexx9a6B.two
[2019-01-22 11:30:02,678][LOG] Finished!

real    9m18.793s
user    23m31.390s
sys     0m53.197s
```

Faster queries

Success: You can now use the newly created sorted archive exactly like an unsorted file, but with faster query times.

## Format descriptions

Memory layout (technical)

A frequent decision when designing a file format is how to lay out records in memory: as lists of records (array-of-struct) or in a column-oriented layout (struct-of-array). Pivoting the twk1_two_t structs into a struct-of-array layout would yield considerable savings in disk space at the expense of query and compute speed. We decided to build the two format around the classical array-of-struct memory layout because we want to maximize the computability of the data, and because we will never support slicing or subsetting individual columns.

### LD format

Tomahawk can output binary two data in the human-readable ld format by invoking the view command. The general schema for ld is below:

| Column | Description |
|---|---|
| FLAG | Bit-packed boolean flags (see below) |
| CHROM_A | Chromosome for marker A |
| POS_A | Position for marker A |
| CHROM_B | Chromosome for marker B |
| POS_B | Position for marker B |
| REF_REF | Inner product of (0,0) haplotypes |
| REF_ALT | Inner product of (0,1) haplotypes |
| ALT_REF | Inner product of (1,0) haplotypes |
| ALT_ALT | Inner product of (1,1) haplotypes |
| D | Coefficient of linkage disequilibrium |
| DPrime | Normalized coefficient of linkage disequilibrium (scaled to [-1,1]) |
| R | Pearson correlation coefficient |
| R2 | Squared Pearson correlation coefficient |
| P | Fisher's exact test P-value of the 2x2 haplotype contingency table |
| ChiSqModel | Chi-squared critical value of the 3x3 unphased table of the selected cubic root (α, β, or δ) |
| ChiSqTable | Chi-squared critical value of the table (useful comparator when P = 0) |

The 2x2 contingency table, or matrix, for Fisher's exact test (P) on haplotypes looks like this:

|  | REF-A | ALT-A |
|---|---|---|
| REF-B | A | B |
| ALT-B | C | D |

The 3x3 contingency table, or matrix, for the Chi-squared test of the unphased model looks like this:

|  | 0/0 | 0/1 | 1/1 |
|---|---|---|---|
| 0/0 | A | B | C |
| 0/1 | D | E | F |
| 1/1 | G | H | J |
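From the four haplotype counts (REF_REF, REF_ALT, ALT_REF, ALT_ALT), the phased statistics D, DPrime, R, and R2 follow from their textbook definitions. A sketch (not Tomahawk's implementation), checked against a row of the view output shown later in this tutorial:

```python
def phased_ld(ref_ref, ref_alt, alt_ref, alt_alt):
    """Textbook D, D', r and r^2 from 2x2 haplotype counts."""
    n = ref_ref + ref_alt + alt_ref + alt_alt
    p_ab = ref_ref / n             # frequency of the (0,0) haplotype
    p_a = (ref_ref + ref_alt) / n  # frequency of REF at marker A
    p_b = (ref_ref + alt_ref) / n  # frequency of REF at marker B
    d = p_ab - p_a * p_b
    # D' scales D by its theoretical maximum given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max else 0.0
    denom = (p_a * (1 - p_a) * p_b * (1 - p_b)) ** 0.5
    r = d / denom if denom else 0.0
    return d, d_prime, r, r * r

# Counts 4421/387/82/118 reproduce D=0.0195352, DPrime=0.54402,
# R=0.331324, R2=0.109775 from the example ld output below.
d, dp, r, r2 = phased_ld(4421, 387, 82, 118)
print(round(d, 7), round(dp, 5), round(r, 5), round(r2, 6))
```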

The FLAG values in two records are bit-packed booleans stored in a single integer field and describe a variety of states a pair of markers can be in.

| Bit position | Numeric value | One-hot | Description |
|---|---|---|---|
| 1 | 1 | 0000000000000001 | Used phased math. |
| 2 | 2 | 0000000000000010 | Acceptor and donor variants are on the same contig. |
| 3 | 4 | 0000000000000100 | Acceptor and donor variants are far apart on the same contig. |
| 4 | 8 | 0000000000001000 | The output contingency matrix has at least one empty cell (referred to as complete). |
| 5 | 16 | 0000000000010000 | Output correlation coefficient is perfect (1.0). |
| 6 | 32 | 0000000000100000 | Output solution is one of >1 possible solutions. This only occurs for unphased pairs. |
| 7 | 64 | 0000000001000000 | Output data was generated in 'fast mode'. |
| 8 | 128 | 0000000010000000 | Output data is estimated from a subsampling of the total pool of genotypes. |
| 9 | 256 | 0000000100000000 | Donor vector has missing value(s). |
| 10 | 512 | 0000001000000000 | Acceptor vector has missing value(s). |
| 11 | 1024 | 0000010000000000 | Donor vector has low allele count (<5). |
| 12 | 2048 | 0000100000000000 | Acceptor vector has low allele count (<5). |
| 13 | 4096 | 0001000000000000 | Acceptor vector has a HWE-P value < 1e-4. |
| 14 | 8192 | 0010000000000000 | Donor vector has a HWE-P value < 1e-4. |
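Since FLAG is a plain bit mask, the individual states can be decoded with a bitwise AND; a small sketch (descriptions paraphrased from the table above):

```python
# Descriptions for each FLAG bit, keyed by numeric value.
FLAG_BITS = {
    1: "used phased math",
    2: "markers on the same contig",
    4: "markers far apart on the same contig",
    8: "contingency matrix has at least one empty cell",
    16: "perfect correlation (1.0)",
    32: "one of several possible solutions (unphased only)",
    64: "generated in fast mode",
    128: "estimated from subsampled genotypes",
    256: "donor vector has missing value(s)",
    512: "acceptor vector has missing value(s)",
    1024: "donor vector has low allele count (<5)",
    2048: "acceptor vector has low allele count (<5)",
    4096: "acceptor HWE-P < 1e-4",
    8192: "donor HWE-P < 1e-4",
}

def decode_flag(flag):
    """List the descriptions of every bit set in a two FLAG value."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]

# FLAG 2059 (seen in the view examples below) = 1 + 2 + 8 + 2048:
print(decode_flag(2059))
```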

## Viewing and manipulating output data

It is possible to filter two output data by:

• either start or end contig, e.g. chr1
• position in that contig, e.g. chr1:10e6-20e6
• a particular contig mapping, e.g. chr1,chr2
• interval mappings in both contigs, e.g. chr1:10e3-10e6,chr2:0-10e6

```
tomahawk view -i 1kgp3_chr6_1_45.two -I 6:10e3-10e6,6:0-10e6 -H | head -n 6
```

Viewing all data

```
$ tomahawk view -i 1kgp3_chr6_1_45.two -H | head -n 6
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF ALT_ALT D DPrime R R2 P ChiSqModel ChiSqTable
2059 6 89572 6 214654 4999 6 0 3 0.000597965 1 0.577004 0.332934 4.01511e-09 1667.33 0
2059 6 89573 6 214654 4999 6 0 3 0.000597965 1 0.577004 0.332934 4.01511e-09 1667.33 0
2059 6 122855 6 214654 4988 17 0 3 0.000596649 1 0.38664 0.149491 5.44908e-08 748.648 0
3 6 143500 6 212570 4421 387 82 118 0.0195352 0.54402 0.331324 0.109775 1.53587e-69 549.755 0
3 6 143500 6 213499 4419 387 84 118 0.0194949 0.537523 0.329068 0.108286 7.57426e-69 542.296 0
```

Slicing a range

```
$ tomahawk view -i 1kgp3_chr6_1_45.two -H -I 6:5e6-6e6 | head -n 6
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF ALT_ALT D DPrime R R2 P ChiSqModel ChiSqTable
2063 6 5022893 6 73938 5000 7 0 1 0.000199362 1 0.353306 0.124825 0.00159744 625.125 0
2063 6 5020564 6 89339 5000 7 0 1 0.000199362 1 0.353306 0.124825 0.00159744 625.125 0
7 6 5018459 6 156100 1150 1283 388 2187 0.0804323 0.509361 0.348864 0.121706 1.17649e-138 609.504 0
7 6 5018691 6 156100 861 811 677 2659 0.0693918 0.339199 0.31898 0.101748 1.10017e-109 509.554 0
7 6 5018910 6 156100 861 811 677 2659 0.0693918 0.339199 0.31898 0.101748 1.10017e-109 509.554 0
```

Slicing matches in B string

```
$ tomahawk view -i 1kgp3_chr6_1_45.two -H -I 6:5e6-6e6,6:5e6-6e6 | head -n 6
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF ALT_ALT D DPrime R R2 P ChiSqModel ChiSqTable
3 6 5000012 6 5073477 4988 9 5 6 0.0011915 0.544089 0.465743 0.216917 1.05037e-13 1086.32 0
1027 6 5000160 6 5072168 5000 2 4 2 0.000398404 0.4994 0.407677 0.166201 7.1708e-06 832.333 0
1027 6 5000160 6 5078340 5000 2 4 2 0.000398404 0.4994 0.407677 0.166201 7.1708e-06 832.333 0
3 6 5000482 6 5079621 4993 6 2 7 0.0013931 0.777199 0.64641 0.417846 3.9492e-18 2092.57 0
3 6 5000482 6 5082227 4994 6 1 7 0.00139362 0.874675 0.685808 0.470333 8.78523e-19 2355.43 0
```

As alluded to in the sorting section, sorted files generally have much faster query times. We can demonstrate this difference using the sorted and unsorted data from the sliding window example above.

Unsorted

```
$ time tomahawk view -i 1kgp3_chr6_4mb.two -H -I 6:1e6-5e6,6:1e6-5e6 > /dev/null

real    3m1.029s
user    2m41.929s
sys     0m5.682s
```

Sorted

```
$ time tomahawk view -i 1kgp3_chr6_4mb_sorted.two -H -I 6:1e6-5e6,6:1e6-5e6 > /dev/null

real    0m38.167s
user    0m37.579s
sys     0m0.192s
```

Viewing ld data from the binary two file format, filtering on a Fisher's exact test P-value cutoff of 1e-4 (-P) and a minor haplotype count cutoff of 5 (-a), and requiring both markers to be on the same contig (-f 2, FLAG bit 2):

```
tomahawk view -i file.two -P 1e-4 -a 5 -f 2
```

Example output

```
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF ALT_ALT D Dprime R R2 P ChiSqModel ChiSqTable
15 20 1314874 20 2000219 5002 5 0 1 0.00019944127 1 0.4080444 0.16650023 0.0011980831 0 833.83313
15 20 1315271 20 1992301 5005 2 0 1 0.00019956089 1 0.57723492 0.33320019 0.00059904153 0 1668.6665
15 20 1315527 20 1991024 5004 0 3 1 0.00019952102 1 0.49985018 0.2498502 0.00079872204 0 1251.2498
15 20 1315763 20 1982489 5006 0 1 1 0.00019960076 1 0.70703614 0.49990013 0.00039936102 0 2503.4999
15 20 1315807 20 1982446 5004 3 0 1 0.00019952102 1 0.49985018 0.2498502 0.00079872204 0 1251.2498
```

Example unphased output. Notice that the estimated haplotype counts are now floating-point values and that bit-flag 1 is not set.

```
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF ALT_ALT D Dprime R R2 P ChiSqModel ChiSqTable
46 20 1882564 20 1306588 5003.9996 2.0003999 1.0003999 0.99960008 0.00019936141 0.49950022 0.40779945 0.1663004 0.0011978438 0.0011996795 832.83241
46 20 1895185 20 1306588 5004.9998 1.0001999 1.0001999 0.99980011 0.0001994811 0.49970025 0.49970022 0.24970032 0.00079864228 0.00069960005 1250.4992
46 20 1306588 20 1901581 5003.9996 2.0003999 1.0003999 0.99960008 0.00019936141 0.49950022 0.40779945 0.1663004 0.0011978438 0.0011996795 832.83241
46 20 1901581 20 1306588 5003.9996 2.0003999 1.0003999 0.99960008 0.00019936141 0.49950022 0.40779945 0.1663004 0.0011978438 0.0011996795 832.83241
46 20 1306649 20 1885268 5006 1 1.2656777e-08 0.99999999 0.00019960076 1 0.70703614 0.49990013 0.00039936102 0.00039969285 2503.4999
```

Example output for forced phased math in fast mode. Note that only the REF_REF count is available, the fast-math bit-flag is set, and all P and chi-squared CV values are 0.

```
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF ALT_ALT D Dprime R R2 P ChiSqModel ChiSqTable
67 20 1345922 20 1363176 4928 0 0 0 0.011788686 0.98334378 0.86251211 0.74392718 0 0 0
67 20 1345922 20 1367160 4933 0 0 0 0.011800847 0.98336071 0.89164203 0.79502547 0 0 0
67 20 1345958 20 1348347 4944 0 0 0 0.0092644105 0.97890127 0.85316211 0.72788554 0 0 0
67 20 1345958 20 1354524 4938 0 0 0 0.0092493389 0.86871892 0.80354655 0.64568704 0 0 0
75 20 1345958 20 1356626 4945 0 0 0 0.0033518653 0.99999994 0.51706308 0.26735422 0 0 0
```
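The FLAG column is a bit field. The text above identifies only some of the bits (bit 1 for phased math, bit 2 for same-contig pairs, plus a fast-math bit); regardless of their meanings, decomposing any FLAG into its set bits is straightforward:

```python
def set_bits(flag):
    """Decompose an integer FLAG into the list of its set bit values."""
    return [1 << i for i in range(flag.bit_length()) if flag >> i & 1]

# FLAG values taken from the example outputs above:
print(set_bits(15))  # [1, 2, 4, 8]    phased rows (bit 1 set)
print(set_bits(46))  # [2, 4, 8, 32]   unphased rows (bit 1 not set)
print(set_bits(67))  # [1, 2, 64]      fast-mode rows
```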

## Aggregating and visualizing datasets¶

Tomahawk typically outputs from many millions up to billions of linkage disequilibrium (LD) associations generated from millions of input SNVs. Visualizing datasets of this size is technically very challenging. Read more about aggregation in Tomahawk.

| Aggregation | Description |
| --- | --- |
| R | Pearson correlation coefficient |
| R2 | Squared Pearson correlation coefficient |
| D | Coefficient of linkage disequilibrium |
| Dprime | Scaled coefficient of linkage disequilibrium |
| Dp | Alias for Dprime |
| P | Fisher's exact test P-value of the 2x2 haplotype contingency table |
| Hets | Number of (0,1) or (1,0) associations |
| Alts | Number of (1,1) associations |
| Het | Alias for Hets |
| Alt | Alias for Alts |

| Reduction | Description |
| --- | --- |
| Mean | Mean of the values in an aggregate bin |
| Max | Largest value in an aggregate bin |
| Min | Smallest value in an aggregate bin |
| Count | Total number of records in a bin |
| N | Alias for Count |
| Total | Sum of the aggregated values in a bin |

In this example, we will aggregate the genome-wide data generated from the sliding-window example above, aggregating on R2 values and reducing to a count in a (4000,4000)-bin space. If a bin has fewer than 50 observations (-c), all values in that bin are dropped and 0 is reported for the bin.

```
twk aggregate -i 1kgp3_chr6_4mb_sorted.two -x 4000 -y 4000 -f r2 -r count -c 50 -t 4 -O b -o 1kgp3_chr6_4mb_aggregate.twa
```
```
Program:   tomahawk-7f8eef9b-dirty (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.3.8; htslib 1.9
Contact: Marcus D. R. Klarqvist
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-01-25 14:55:43,299][LOG] Calling aggregate...
[2019-01-25 14:55:43,299][LOG] Performing 2-pass over data...
[2019-01-25 14:55:43,299][LOG] ===== First pass (peeking at landscape) =====
[2019-01-25 14:55:43,299][LOG] Blocks: 47,352
[2019-01-25 14:55:43,299][LOG] Uncompressed size: 50.192950 Gb
[2019-01-25 14:55:43,299][LOG][THREAD] Data/thread: 12.548238 Gb
[2019-01-25 14:55:43,299][PROGRESS] Time elapsed Variants Progress Est. Time left
[2019-01-25 14:56:13,299][PROGRESS] 30,000s 344,930,000 72.8446% 11s
[2019-01-25 14:56:27,979][PROGRESS] 44,679s 473,514,826 (10,597,935 variants/s)
[2019-01-25 14:56:27,979][PROGRESS] Finished!
[2019-01-25 14:56:27,979][LOG] ===== Second pass (building matrix) =====
[2019-01-25 14:56:27,979][LOG] Aggregating 473,514,826 records...
[2019-01-25 14:56:27,979][LOG][THREAD] Allocating: 2.560000 Gb for matrices...
[2019-01-25 14:56:27,979][PROGRESS] Time elapsed Variants Progress Est. Time left
[2019-01-25 14:56:57,979][PROGRESS] 30,000s 348,560,000 73.6112% 10s
[2019-01-25 14:57:12,499][PROGRESS] 44,520s 473,514,826 (10,635,926 variants/s)
[2019-01-25 14:57:12,499][PROGRESS] Finished!
[2019-01-25 14:57:13,713][LOG] Aggregated 473,514,826 records in 16,000,000 bins.
[2019-01-25 14:57:13,713][LOG] Finished.
```
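Conceptually, the aggregation maps each record to an (x, y) bin, applies the chosen reduction within each bin, and zeroes out bins with fewer observations than the -c cutoff. A small Python sketch of that logic, using a count reduction (illustrative only, not Tomahawk's implementation):

```python
def aggregate_count(points, n_bins_x, n_bins_y, x_max, y_max, cutoff):
    """Bin (x, y) points into an n_bins_y-by-n_bins_x count matrix,
    reporting 0 for bins with fewer than `cutoff` observations."""
    grid = [[0] * n_bins_x for _ in range(n_bins_y)]
    for x, y in points:
        # Map coordinates to bin indices; clamp the upper edge into the last bin.
        bx = min(int(x / x_max * n_bins_x), n_bins_x - 1)
        by = min(int(y / y_max * n_bins_y), n_bins_y - 1)
        grid[by][bx] += 1
    # Apply the -c cutoff: sparse bins are reported as 0.
    return [[c if c >= cutoff else 0 for c in row] for row in grid]

# Toy run: a 2x2 grid over a 10x10 coordinate space with cutoff 2.
grid = aggregate_count([(1, 1), (1, 2), (9, 9)], 2, 2, 10, 10, cutoff=2)
print(grid)  # [[2, 0], [0, 0]]
```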

You can now use the binary .twa output in your downstream analysis. Alternatively, it is possible to output a human-readable (x,y)-matrix by setting the -O parameter to u. This tab-delimited matrix can then be loaded in any programming language and used as input for graphical visualization or further analysis. It is easiest to use these files directly in rtomahawk, the R bindings for Tomahawk.
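Loading the human-readable matrix is a one-liner in most languages. A Python sketch (the output filename here is a hypothetical placeholder):

```python
def parse_matrix(text):
    """Parse a tab-delimited (x, y) matrix into rows of floats."""
    return [[float(v) for v in line.split("\t")]
            for line in text.strip().splitlines()]

# With a real file, e.g.:
#   matrix = parse_matrix(open("aggregate_output.txt").read())
matrix = parse_matrix("0\t0.5\n1.25\t0")
print(matrix)  # [[0.0, 0.5], [1.25, 0.0]]
```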