{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "82629a77", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os" ] }, { "cell_type": "markdown", "id": "63a907ae", "metadata": {}, "source": [ "# ENCODE DNase I and ChIP-Seq allelic imbalance data\n", "\n", "### If you have any questions, please reach us at:\n", "+ sabramov@altius.org Sergey Abramov \n", "+ sboytsov@altius.org Alexandr Boytsov\n" ] }, { "cell_type": "markdown", "id": "9942ab9e", "metadata": {}, "source": [ "# Accessing the data\n", "\n", "The data is stored at https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2
\n", "The directory contains:\n", "+ DNase I variants (`dnase-seq` folder): \n", " - `metadata+encode_id.tsv` - metadata file\n", " - `all.aggregation.dnase.bed.gz` - all tested variants (aggregated accross all the samples)\n", " - `cell_type.aggregation.bed.gz` - tested variants (aggregated accross cell types). Last column corresponds to `taxonomy_name_id` in the metadata.\n", " - `genotypes_dnase.vcf.gz` - genotypes file\n", " - `sample-split-variants.zip` - non-aggregated variants, used for custom aggregations.\n", "

\n", "+ ChIP-seq variants (`chip-seq` folder):\n", " - `metadata.tsv` - metadata file\n", " - `encode_meta.tsv` - metadata provided by Anshul, Vivek and Idan\n", " - `all.aggregation.chipseq.bed.gz` - all tested variants (aggregated accross all the samples)\n", " - `genotypes_chipseq.vcf.gz` - genotypes file\n", " - `sample-split-variants.zip` - non-aggregated variants, used for custom aggregations\n", "

\n", "+ Readme file: `readme.txt`\n", "+ This notebook" ] }, { "cell_type": "markdown", "id": "153fe2ec", "metadata": {}, "source": [ "## Metadata file\n", "+ `ag_id` - a unique identifier of a sample\n", "+ `indiv_id` - individual genotype id. Samples with the same indiv_id have very similar genotype (not necessarily the same cell type)\n", "+ `taxonomy_name` - cell type name\n", "+ `taxonomy_name_id` - cell type ID." ] }, { "cell_type": "code", "execution_count": 2, "id": "db332905", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-01-25 17:43:23-- https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/metadata+encode_id.tsv\r\n", "Resolving resources.altius.org (resources.altius.org)... 10.129.64.28\r\n", "Connecting to resources.altius.org (resources.altius.org)|10.129.64.28|:443... connected.\r\n", "HTTP request sent, awaiting response... 200 OK\r\n", "Length: 388372 (379K) [text/tab-separated-values]\r\n", "Saving to: ‘metadata+encode_id.tsv’\r\n", "\r\n", "\r", " 0% [ ] 0 --.-K/s \r", "100%[======================================>] 388,372 --.-K/s in 0.003s \r\n", "\r\n", "2023-01-25 17:43:23 (124 MB/s) - ‘metadata+encode_id.tsv’ saved [388372/388372]\r\n", "\r\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ag_idln_numbertaxonomy_nameontology_idontology_termsystemsubsystemorganorgan_regionside_position...treatment_nametreatment_detailsdosedose_unitshotspot2_spotnuclear_percent_duplicationpaired_nuclear_alignencode_library_idtaxonomy_name_idindiv_id
0AG7206LN40338HNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaN0.571927.9486348371100.0ENCLB096YUZNaNINDIV_D0001
1AG80615LN3817K562EFO:0002067K562HematopoieticLymphoidBloodNaNNaN...untreateduntreatedNaNNaN0.544716.3414237879484.0ENCLB843GMH134.0INDIV_D0001
2AG80660LN3327K562EFO:0002067K562HematopoieticLymphoidBloodNaNNaN...untreateduntreatedNaNNaN0.475111.7948213181356.0ENCLB253REF134.0INDIV_D0001
3AG80850LN1691K562EFO:0002067K562HematopoieticLymphoidBloodNaNNaN...untreateduntreatedNaNNaN0.470233.4191NaNENCLB540ZZZ134.0INDIV_D0001
4AG80851LN1684K562EFO:0002067K562HematopoieticLymphoidBloodNaNNaN...untreateduntreatedNaNNaN0.283714.1151NaNENCLB539ZZZ134.0INDIV_D0001
..................................................................
1406AG79936LN69969Ah.CACO-2EFO:0001099Caco-2DigestiveNaNColonNaNNaN...untreateduntreatedNaNNaN0.422718.8354249544258.0ENCLB351LZG101.0INDIV_D0756
1407AG80025LN54667Ah.CACO-2EFO:0001099Caco-2DigestiveNaNColonNaNNaN...untreateduntreatedNaNNaN0.393534.8901228776942.0ENCLB564DMT101.0INDIV_D0757
1408AG80858LN1269h.CACO-2EFO:0001099Caco-2DigestiveNaNColonNaNNaN...untreateduntreatedNaNNaN0.572441.4670NaNENCLB422ZZZ101.0INDIV_D0758
1409AG80857LN1289h.CACO-2EFO:0001099Caco-2DigestiveNaNColonNaNNaN...untreateduntreatedNaNNaN0.421423.4804NaNENCLB423ZZZ101.0INDIV_D0759
1410AG80650LN3462fHeartUBERON:0000948heartCardiovascularCardiacHeartNaNNaN...untreateduntreatedNaNNaN0.107730.6859NaNENCLB740GHK33.0INDIV_D0760
\n", "

1411 rows × 36 columns

\n", "
" ], "text/plain": [ " ag_id ln_number taxonomy_name ontology_id ontology_term \\\n", "0 AG7206 LN40338H NaN NaN NaN \n", "1 AG80615 LN3817 K562 EFO:0002067 K562 \n", "2 AG80660 LN3327 K562 EFO:0002067 K562 \n", "3 AG80850 LN1691 K562 EFO:0002067 K562 \n", "4 AG80851 LN1684 K562 EFO:0002067 K562 \n", "... ... ... ... ... ... \n", "1406 AG79936 LN69969A h.CACO-2 EFO:0001099 Caco-2 \n", "1407 AG80025 LN54667A h.CACO-2 EFO:0001099 Caco-2 \n", "1408 AG80858 LN1269 h.CACO-2 EFO:0001099 Caco-2 \n", "1409 AG80857 LN1289 h.CACO-2 EFO:0001099 Caco-2 \n", "1410 AG80650 LN3462 fHeart UBERON:0000948 heart \n", "\n", " system subsystem organ organ_region side_position ... \\\n", "0 NaN NaN NaN NaN NaN ... \n", "1 Hematopoietic Lymphoid Blood NaN NaN ... \n", "2 Hematopoietic Lymphoid Blood NaN NaN ... \n", "3 Hematopoietic Lymphoid Blood NaN NaN ... \n", "4 Hematopoietic Lymphoid Blood NaN NaN ... \n", "... ... ... ... ... ... ... \n", "1406 Digestive NaN Colon NaN NaN ... \n", "1407 Digestive NaN Colon NaN NaN ... \n", "1408 Digestive NaN Colon NaN NaN ... \n", "1409 Digestive NaN Colon NaN NaN ... \n", "1410 Cardiovascular Cardiac Heart NaN NaN ... \n", "\n", " treatment_name treatment_details dose dose_units hotspot2_spot \\\n", "0 NaN NaN NaN NaN 0.5719 \n", "1 untreated untreated NaN NaN 0.5447 \n", "2 untreated untreated NaN NaN 0.4751 \n", "3 untreated untreated NaN NaN 0.4702 \n", "4 untreated untreated NaN NaN 0.2837 \n", "... ... ... ... ... ... \n", "1406 untreated untreated NaN NaN 0.4227 \n", "1407 untreated untreated NaN NaN 0.3935 \n", "1408 untreated untreated NaN NaN 0.5724 \n", "1409 untreated untreated NaN NaN 0.4214 \n", "1410 untreated untreated NaN NaN 0.1077 \n", "\n", " nuclear_percent_duplication paired_nuclear_align encode_library_id \\\n", "0 27.9486 348371100.0 ENCLB096YUZ \n", "1 16.3414 237879484.0 ENCLB843GMH \n", "2 11.7948 213181356.0 ENCLB253REF \n", "3 33.4191 NaN ENCLB540ZZZ \n", "4 14.1151 NaN ENCLB539ZZZ \n", "... ... ... ... \n", "1406 18.8354 249544258.0 ENCLB351LZG \n", "1407 34.8901 228776942.0 ENCLB564DMT \n", "1408 41.4670 NaN ENCLB422ZZZ \n", "1409 23.4804 NaN ENCLB423ZZZ \n", "1410 30.6859 NaN ENCLB740GHK \n", "\n", " taxonomy_name_id indiv_id \n", "0 NaN INDIV_D0001 \n", "1 134.0 INDIV_D0001 \n", "2 134.0 INDIV_D0001 \n", "3 134.0 INDIV_D0001 \n", "4 134.0 INDIV_D0001 \n", "... ... ... \n", "1406 101.0 INDIV_D0756 \n", "1407 101.0 INDIV_D0757 \n", "1408 101.0 INDIV_D0758 \n", "1409 101.0 INDIV_D0759 \n", "1410 33.0 INDIV_D0760 \n", "\n", "[1411 rows x 36 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "!wget https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/metadata+encode_id.tsv\n", "meta = pd.read_table('metadata+encode_id.tsv')\n", "meta" ] }, { "cell_type": "markdown", "id": "724b814e", "metadata": {}, "source": [ "## Variants format\n", "### Aggregated variants are stored in the bed-like format:\n", "+ `#chr, start, end`: genomic position of the SNV, hg38 genome assembly;\n", "+ `ID`: rsSNP ID of the SNV according to the dbSNP build 151;\n", "+ `ref`: reference allele (A,C,G, or T, according to hg38);\n", "+ `alt`: alternative allele;\n", "+ `AAF, RAF`: Topmed allele frequencies of reference and alternative alleles;\n", "+ `FMR`: Failed mapping rate `#WASP filtered reads`/`#total reads`;\n", "+ `mean_BAD`: Mean background allelic dosage estimation at the variant. Higher BAD values correspond to the higher contribution of aneuploidy and local copy-number variants. BAD scores serve as a baseline when estimating the statistical significance and the effect size of each tested variant. BAD=1 in case of diploid and BAD=2 in case of triploid;\n", "+ `footprints_n`: number of individual variants in DNase footrpints (called with https://www.vierstra.org/resources/dgf), always 0 for ChIP-seq;\n", "+ `hotspots_n`: number of individual variants in DNase peaks (called with hotspots2), always 0 for ChIP-seq;\n", "+ `logit_pval_ref, logit_pval_alt`: Logit aggregated p-values for reference and alternative alleles;
\n", "Effect sizes are in log2 scale, positive values correspond to preference towards reference allele\n", " + `es_mean`: Aggregated effect size; mean of individual effect sizes. Less affected by outliers\n", " + `es_weighted_mean`: Aggregated effect size, averaged with weights `-log10[min(pval_ref, pval_alt)]`. More sensitive estimate.\n", "\n", "+ `nSNPs`: # of individual variants aggregated at this position;\n", "+ `max_cover`: Maximum read coverage among aggregated variants;\n", "+ `fdrp_bh_ref, fdrp_bh_alt`: FDR-corrected reference and alternative logit aggregated p-values(`logit_pval_ref, logit_pval_alt`);\n", "+ `min_fdr`: minimum of `fdrp_bh_ref` and `fdrp_bh_alt`. AS events have `min_fdr` <= 0.05;\n", "+ (only in cell-type specific aggregation) `taxonomy_name_id`: cell type name" ] }, { "cell_type": "code", "execution_count": 3, "id": "58e4201e", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-01-25 17:43:24-- https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/cell_type.aggregation.bed.gz\n", "Resolving resources.altius.org (resources.altius.org)... 10.129.64.28\n", "Connecting to resources.altius.org (resources.altius.org)|10.129.64.28|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 254483660 (243M) [application/x-gzip]\n", "Saving to: ‘cell_type.aggregation.bed.gz’\n", "\n", "100%[======================================>] 254,483,660 112MB/s in 2.2s \n", "\n", "2023-01-25 17:43:26 (112 MB/s) - ‘cell_type.aggregation.bed.gz’ saved [254483660/254483660]\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#chrstartendIDrefaltAAFRAFFMRmean_BAD...logit_pval_altes_meanes_weighted_meannSNPsmax_coverfdrp_bh_reffdrp_bh_altmin_fdrtaxonomy_name_idtaxonomy_name
0chr1804903804904rs61770167CT0.004093399592252800.99590660040774710.0000001.0...0.221049-0.540568-0.5405681271.0000001.01.00000075h.CD19.naive
1chr1976214976215rs7417106AG0.709663290010193670.29033670998980630.0000001.0...0.9970600.8460240.8805852450.2524241.00.25242475h.CD19.naive
2chr1999841999842rs2298214CA0.452479931192660550.54752006880733940.0666671.0...0.285810-0.415037-0.4150371281.0000001.01.00000075h.CD19.naive
3chr110000171000018rs146254088GA0.025476235983690110.97452376401630990.1212121.0...0.5747690.0000000.0000001281.0000001.01.00000075h.CD19.naive
4chr110000781000079rs3128113AG0.673133282364933740.32683486238532110.2187501.0...0.500529-0.125531-0.1255311231.0000001.01.00000075h.CD19.naive
..................................................................
5209518chr158889524688895247rs12148326CT0.128089959225280320.85172974006116210.0000001.5...0.253940-0.512982-0.5129821280.9745881.00.974588153h.HAP1
5209519chr158890378588903786rs35729705GA0.231699159021406720.7682928771661570.0000001.5...0.5600940.0380490.0380491220.8724621.00.872462153h.HAP1
5209520chr158890953188909532rs2280213TC0.222357607033639140.77764239296636080.0000001.5...0.570281-0.008538-0.0085381230.8724621.00.872462153h.HAP1
5209521chr158891338788913388rs139643219CG0.011451962283384300.98854803771661570.0480771.5...0.7941970.2100900.26297621490.8724621.00.872462153h.HAP1
5209522chr158899297888992979rs34757735CT0.314833397043832820.68516660295616720.0000001.5...0.7797620.1940970.2027422540.8724621.00.872462153h.HAP1
\n", "

5209523 rows × 23 columns

\n", "
" ], "text/plain": [ " #chr start end ID ref alt AAF \\\n", "0 chr1 804903 804904 rs61770167 C T 0.00409339959225280 \n", "1 chr1 976214 976215 rs7417106 A G 0.70966329001019367 \n", "2 chr1 999841 999842 rs2298214 C A 0.45247993119266055 \n", "3 chr1 1000017 1000018 rs146254088 G A 0.02547623598369011 \n", "4 chr1 1000078 1000079 rs3128113 A G 0.67313328236493374 \n", "... ... ... ... ... .. .. ... \n", "5209518 chr15 88895246 88895247 rs12148326 C T 0.12808995922528032 \n", "5209519 chr15 88903785 88903786 rs35729705 G A 0.23169915902140672 \n", "5209520 chr15 88909531 88909532 rs2280213 T C 0.22235760703363914 \n", "5209521 chr15 88913387 88913388 rs139643219 C G 0.01145196228338430 \n", "5209522 chr15 88992978 88992979 rs34757735 C T 0.31483339704383282 \n", "\n", " RAF FMR mean_BAD ... logit_pval_alt \\\n", "0 0.9959066004077471 0.000000 1.0 ... 0.221049 \n", "1 0.2903367099898063 0.000000 1.0 ... 0.997060 \n", "2 0.5475200688073394 0.066667 1.0 ... 0.285810 \n", "3 0.9745237640163099 0.121212 1.0 ... 0.574769 \n", "4 0.3268348623853211 0.218750 1.0 ... 0.500529 \n", "... ... ... ... ... ... \n", "5209518 0.8517297400611621 0.000000 1.5 ... 0.253940 \n", "5209519 0.768292877166157 0.000000 1.5 ... 0.560094 \n", "5209520 0.7776423929663608 0.000000 1.5 ... 0.570281 \n", "5209521 0.9885480377166157 0.048077 1.5 ... 0.794197 \n", "5209522 0.6851666029561672 0.000000 1.5 ... 0.779762 \n", "\n", " es_mean es_weighted_mean nSNPs max_cover fdrp_bh_ref \\\n", "0 -0.540568 -0.540568 1 27 1.000000 \n", "1 0.846024 0.880585 2 45 0.252424 \n", "2 -0.415037 -0.415037 1 28 1.000000 \n", "3 0.000000 0.000000 1 28 1.000000 \n", "4 -0.125531 -0.125531 1 23 1.000000 \n", "... ... ... ... ... ... \n", "5209518 -0.512982 -0.512982 1 28 0.974588 \n", "5209519 0.038049 0.038049 1 22 0.872462 \n", "5209520 -0.008538 -0.008538 1 23 0.872462 \n", "5209521 0.210090 0.262976 2 149 0.872462 \n", "5209522 0.194097 0.202742 2 54 0.872462 \n", "\n", " fdrp_bh_alt min_fdr taxonomy_name_id taxonomy_name \n", "0 1.0 1.000000 75 h.CD19.naive \n", "1 1.0 0.252424 75 h.CD19.naive \n", "2 1.0 1.000000 75 h.CD19.naive \n", "3 1.0 1.000000 75 h.CD19.naive \n", "4 1.0 1.000000 75 h.CD19.naive \n", "... ... ... ... ... \n", "5209518 1.0 0.974588 153 h.HAP1 \n", "5209519 1.0 0.872462 153 h.HAP1 \n", "5209520 1.0 0.872462 153 h.HAP1 \n", "5209521 1.0 0.872462 153 h.HAP1 \n", "5209522 1.0 0.872462 153 h.HAP1 \n", "\n", "[5209523 rows x 23 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "!wget https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/cell_type.aggregation.bed.gz\n", "\n", "tested_variants = pd.read_table('cell_type.aggregation.bed.gz')\n", "tested_variants = tested_variants.merge(meta[['taxonomy_name_id', 'taxonomy_name']].drop_duplicates())\n", "tested_variants" ] }, { "cell_type": "code", "execution_count": 4, "id": "c7153138", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
#chrstartendIDrefaltAAFRAFFMRmean_BAD...logit_pval_altes_meanes_weighted_meannSNPsmax_coverfdrp_bh_reffdrp_bh_altmin_fdrtaxonomy_name_idtaxonomy_name
18chr111719311171932rs61768481GA0.008218654434250760.9917733817533130.0281691.0...1.0000001.9655272.2829782670.0000071.00.00000775h.CD19.naive
59chr159920305992031rs34031616GC0.188559187054026500.81093909276248720.1025641.0...0.9999991.7137392.1457602390.0013321.00.00133275h.CD19.naive
71chr185260438526044rs151028640CT0.009078746177370030.990921253822630.0000001.0...0.9999441.8919491.9022512410.0022381.00.00223875h.CD19.naive
111chr11174934411749345rs11121820TG0.402706103465851170.59729389653414880.0303031.0...0.9999212.0588942.0588941310.0348741.00.03487475h.CD19.naive
141chr11607367016073671rs578144077GA0.033002038735983690.96699796126401630.0000001.0...0.9999811.4747691.5282002460.0139081.00.01390875h.CD19.naive
..................................................................
10861chr9124800693124800694rs913232GA0.630694125891946990.369281982670744140.0588241.0...0.9999501.5035291.6271952300.0328531.00.03285375h.CD19.naive
10893chr9129803078129803079rs76154399CT0.012590787461773700.987401248725790.0129871.0...0.9998360.5835220.76107421490.0308991.00.03089975h.CD19.naive
10907chr9129888566129888567rs2296791GT0.013434951580020380.9865491207951070.0441181.0...1.0000000.9586251.62792221380.0003511.00.00035175h.CD19.naive
10931chr9131669097131669098rs117156575GC0.028191896024464830.97180810397553520.0294121.0...0.9999511.3554091.7585732310.0303491.00.03034975h.CD19.naive
11021chr9137761126137761127rs11137204CT0.273756052497451580.72603688837920490.0625001.0...0.9998680.9771401.3412252690.0450601.00.04506075h.CD19.naive
\n", "

236 rows × 23 columns

\n", "
" ], "text/plain": [ " #chr start end ID ref alt AAF \\\n", "18 chr1 1171931 1171932 rs61768481 G A 0.00821865443425076 \n", "59 chr1 5992030 5992031 rs34031616 G C 0.18855918705402650 \n", "71 chr1 8526043 8526044 rs151028640 C T 0.00907874617737003 \n", "111 chr1 11749344 11749345 rs11121820 T G 0.40270610346585117 \n", "141 chr1 16073670 16073671 rs578144077 G A 0.03300203873598369 \n", "... ... ... ... ... .. .. ... \n", "10861 chr9 124800693 124800694 rs913232 G A 0.63069412589194699 \n", "10893 chr9 129803078 129803079 rs76154399 C T 0.01259078746177370 \n", "10907 chr9 129888566 129888567 rs2296791 G T 0.01343495158002038 \n", "10931 chr9 131669097 131669098 rs117156575 G C 0.02819189602446483 \n", "11021 chr9 137761126 137761127 rs11137204 C T 0.27375605249745158 \n", "\n", " RAF FMR mean_BAD ... logit_pval_alt es_mean \\\n", "18 0.991773381753313 0.028169 1.0 ... 1.000000 1.965527 \n", "59 0.8109390927624872 0.102564 1.0 ... 0.999999 1.713739 \n", "71 0.99092125382263 0.000000 1.0 ... 0.999944 1.891949 \n", "111 0.5972938965341488 0.030303 1.0 ... 0.999921 2.058894 \n", "141 0.9669979612640163 0.000000 1.0 ... 0.999981 1.474769 \n", "... ... ... ... ... ... ... \n", "10861 0.36928198267074414 0.058824 1.0 ... 0.999950 1.503529 \n", "10893 0.98740124872579 0.012987 1.0 ... 0.999836 0.583522 \n", "10907 0.986549120795107 0.044118 1.0 ... 1.000000 0.958625 \n", "10931 0.9718081039755352 0.029412 1.0 ... 0.999951 1.355409 \n", "11021 0.7260368883792049 0.062500 1.0 ... 0.999868 0.977140 \n", "\n", " es_weighted_mean nSNPs max_cover fdrp_bh_ref fdrp_bh_alt min_fdr \\\n", "18 2.282978 2 67 0.000007 1.0 0.000007 \n", "59 2.145760 2 39 0.001332 1.0 0.001332 \n", "71 1.902251 2 41 0.002238 1.0 0.002238 \n", "111 2.058894 1 31 0.034874 1.0 0.034874 \n", "141 1.528200 2 46 0.013908 1.0 0.013908 \n", "... ... ... ... ... ... ... \n", "10861 1.627195 2 30 0.032853 1.0 0.032853 \n", "10893 0.761074 2 149 0.030899 1.0 0.030899 \n", "10907 1.627922 2 138 0.000351 1.0 0.000351 \n", "10931 1.758573 2 31 0.030349 1.0 0.030349 \n", "11021 1.341225 2 69 0.045060 1.0 0.045060 \n", "\n", " taxonomy_name_id taxonomy_name \n", "18 75 h.CD19.naive \n", "59 75 h.CD19.naive \n", "71 75 h.CD19.naive \n", "111 75 h.CD19.naive \n", "141 75 h.CD19.naive \n", "... ... ... \n", "10861 75 h.CD19.naive \n", "10893 75 h.CD19.naive \n", "10907 75 h.CD19.naive \n", "10931 75 h.CD19.naive \n", "11021 75 h.CD19.naive \n", "\n", "[236 rows x 23 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# CAVs for h.CD19.naive\n", "tested_variants[tested_variants.eval('(taxonomy_name == \"h.CD19.naive\") & (min_fdr <= 0.05)')]" ] }, { "cell_type": "markdown", "id": "fcd61c50", "metadata": {}, "source": [ "## To increase sensitivity of allele-specific events calling, data can be grouped by various properties and then aggregated within each group.\n", "### For example, we can group by indiv_id (samples with the same genotype), taxonomy_name (i.e. cell type), e.t.c\n", "### Within each group, variants at the same genomic position and with the same genotype (ref and alt alleles) are aggregated\n", "### Due to unknown structural haplotypes at each variant, we perform one-sided test for reference and alternative allele imbalance separately. This yields two significance estimates: one for each of the alleles.\n" ] }, { "cell_type": "markdown", "id": "f58da725", "metadata": {}, "source": [ "### For DNase I data we provide two aggregations: cell-type specific and cell-type independent (aggregating variants from all the samples)\n", "### For ChIP-Seq data we provide only aggregation over all the samples." ] }, { "cell_type": "markdown", "id": "d99c3f46", "metadata": {}, "source": [ "# To aggregate the data in a custom way:" ] }, { "cell_type": "markdown", "id": "d4fc1f3d", "metadata": {}, "source": [ "## Download non-aggregated data\n", "- `wget https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/sample-split-variants.zip`\n", "- `unzip sample-split-variants.zip`\n", "- Download corresponding metadata `wget https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/metadata+encode_id.tsv`\n", "- Add custom annotation as one more column in `metadata+encode_id.tsv`. Samples with empty values in that column are excluded from aggregation\n", "\n", "## Download repo with CAV calling scripts\n", "- `git pull https://github.com/wishabc/nf-babachi`\n", "- `conda env create -n cavs-calling -f ./nf-babachi/environment.yml` (or use `mamba` for faster installation)\n", "- Modify `nextflow.config` to computing enviroment specifications\n", "\n", "\n", "## Run aggregation script\n", "- Activate created conda environment `conda activate cavs-calling`\n", "- ```nextflow run ./nf-babachi/main.nf \\\n", " -profile Altius -entry aggregatePvals \\\n", " --conda \"$CONDA_PREFIX\" \\\n", " --sample_pvals_dir ./by_sample_pvals/ \\\n", " --samples_file metadata+encode_id.tsv \\\n", " --aggregation-key ```\n", "- Alternatively, you can specify `conda`, `aggregation_key`, `samples_file` and `sample_pvals_dir` in `params.config` file as command line parameters and run just `nextflow run ./nf-babachi/main.nf -profile Altius -entry aggregatePvals`\n", "\n", "### The command will produce `./output/final.ag_files_binom.` folder with aggregated files for each group." ] }, { "cell_type": "code", "execution_count": null, "id": "1e1dc293", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }