Research Methodology

How we analyze modern DNA against 45,000 years of human history

Analysis Pipeline Overview

Hidden Lineage uses research-grade population genetics tools to compare modern DNA against comprehensive ancient genome databases. This methodology follows established academic protocols while making them accessible for personal ancestry analysis.

1. Data Preprocessing: The raw DNA file is converted to PLINK format, filtered for quality, and standardized to GRCh37 coordinates.
2. Reference Panel Assembly: 61,816 ancient and modern samples are merged from curated global databases.
3. SNP Intersection: Overlapping genetic markers between your DNA and the reference panels are identified (143,495 autosomal sites).
4. ADMIXTURE Analysis: Cross-validation is run across K-values to determine the optimal population structure model.
5. Model Selection: K=10 is selected as the model with the lowest cross-validation error (0.51482).
6. Population Inference: Components are interpreted based on their highest-loading reference individuals and archaeological context.
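The SNP intersection step can be sketched as a set intersection over PLINK .bim files. This is a minimal illustration rather than the project's own code: it assumes both files are already on GRCh37 coordinates and matches sites on chromosome and base-pair position only, with no allele or strand checks. The function names are hypothetical.

```python
def read_bim_sites(bim_text):
    """Map (chromosome, base-pair position) to variant ID for a PLINK .bim file.

    A .bim line has six whitespace-separated columns:
    chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2.
    """
    sites = {}
    for line in bim_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, variant_id, _cm, bp, _a1, _a2 = fields
        sites[(chrom, int(bp))] = variant_id
    return sites


def shared_sites(sample_bim_text, reference_bim_text):
    """Return the sorted sites present in both files (assumption: same genome build)."""
    sample = read_bim_sites(sample_bim_text)
    reference = read_bim_sites(reference_bim_text)
    return sorted(sample.keys() & reference.keys())
```

Matching on position instead of rsID avoids losing sites whose identifiers differ between panels, but a production pipeline would also need to reconcile alleles and strand before merging.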

Technical Implementation Details

Quality Control: SNPs filtered for MAF > 0.01, missing data < 10%, HWE p > 1e-6
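As an illustration, the thresholds above map onto PLINK 1.9's `--maf`, `--geno`, and `--hwe` filters. The sketch below only builds the command line. It assumes that "missing data < 10%" refers to per-SNP missingness (`--geno`); per-sample missingness would use `--mind` instead. The file prefixes in the usage note are hypothetical.

```python
def plink_qc_command(in_prefix, out_prefix, maf=0.01, geno=0.10, hwe=1e-6):
    """Build a PLINK 1.9 command that applies the three QC filters.

    --maf  drops variants with minor allele frequency below the threshold
    --geno drops variants with a missing call rate above the threshold
    --hwe  drops variants with a Hardy-Weinberg test p-value below the threshold
    """
    return [
        "plink",
        "--bfile", in_prefix,   # input .bed/.bim/.fam prefix
        "--maf", str(maf),
        "--geno", str(geno),    # assumption: missingness is measured per SNP
        "--hwe", str(hwe),
        "--make-bed",           # write filtered data in PLINK binary format
        "--out", out_prefix,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, for example `plink_qc_command("sample_raw", "sample_qc")`.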

Cross-Validation Framework: K=3 to K=15 tested with ADMIXTURE cross-validation; K=10 had the lowest cross-validation error and was selected
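A cross-validation sweep of this kind could be driven by one ADMIXTURE invocation per K. The sketch below builds those commands. The fold count and seed are assumptions (5-fold is ADMIXTURE's default, and the fold count used here is not stated), while the thread count follows the 4-core setup described on this page.

```python
def admixture_cv_commands(bed_path, k_values, folds=5, threads=4, seed=42):
    """Build one ADMIXTURE command per K with cross-validation enabled.

    folds and seed are illustrative assumptions, not documented settings.
    """
    return [
        [
            "admixture",
            f"--cv={folds}",    # enable cross-validation with this many folds
            f"-j{threads}",     # number of threads
            f"--seed={seed}",   # random seed for initialization
            bed_path,
            str(k),
        ]
        for k in k_values
    ]


# K=3 through K=15, using the merged panel named under File Structure.
commands = admixture_cv_commands("merged_reference_chr22.bed", range(3, 16))
```

Each run reports a "CV error (K=...)" line on standard output, so the output of every command should be captured to a log for later comparison.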

ADMIXTURE Convergence: Multiple replicates per K with different random seeds, best replicate selected by highest log-likelihood
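Selecting the best replicate could look like the sketch below. It assumes each replicate's ADMIXTURE output was saved as text and that the final summary contains a line beginning with "Loglikelihood:". Per-iteration lines, where that label appears mid-line, are ignored. The function names and the seed-keyed dictionary are hypothetical.

```python
import re

# Matches only lines that start with "Loglikelihood:" (the end-of-run summary).
LOGLIK_LINE = re.compile(
    r"^Loglikelihood:\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)", re.MULTILINE
)


def final_loglikelihood(log_text):
    """Return the last summary log-likelihood found in one replicate's log."""
    matches = LOGLIK_LINE.findall(log_text)
    if not matches:
        raise ValueError("no final Loglikelihood line found")
    return float(matches[-1])


def best_replicate(logs_by_seed):
    """Return the seed whose run reached the highest (least negative) log-likelihood."""
    return max(logs_by_seed, key=lambda seed: final_loglikelihood(logs_by_seed[seed]))
```

Because log-likelihoods are negative, "highest" means closest to zero. Replicates whose values differ widely suggest the optimizer is reaching different local optima at that K.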

Multi-Method Validation: Independent verification through IBD analysis and PCA convergence

Computational Resources: Analysis performed on Ubuntu 22.04 LTS with 64GB RAM, 4-core parallel processing

Technical Specifications

Software Stack

  • ADMIXTURE v1.3.0
  • PLINK v1.90b6.21
  • R Statistical Computing v4.2.0
  • Python v3.9.12

Reference Datasets

  • Allen Ancient DNA Resource v50.0
  • Human Origins + SGDP Curated
  • 1000 Genomes Project Phase 3
  • Total Reference Panel: 61,816 samples

Data Quality Metrics

  • SNP Coverage: 143,495 sites
  • Missing Data Threshold: < 10%
  • Minor Allele Frequency: > 0.01
  • Hardy-Weinberg Equilibrium: p > 1e-6

Analysis Parameters

  • K Range Tested: 3-15
  • Optimal K Selected: 10
  • Cross-Validation Error: 0.51482
  • Multi-Method Validation: IBD + PCA

File Structure

# Reference Panel Files
aadr_HO_chr22_filtered.{bed, bim, fam}      # PLINK binary format
modern_populations_chr22.{bed, bim, fam}    # Modern reference samples
merged_reference_chr22.{bed, bim, fam}      # Combined ancient + modern

# Analysis Output
admixture_K8.Q         # Ancestry proportions
admixture_K8.P         # Allele frequencies
cross_validation.cv    # CV error rates
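To show how these outputs fit together: a .Q file holds one row of K ancestry proportions per individual, in the same order as the rows of the .fam file used for the run. The sketch below pairs the two and lists the highest-loading individuals for one component. It is an illustration with hypothetical function names, not the project's own interpretation code.

```python
def read_q_rows(q_text):
    """Parse a .Q file: one row per individual, K whitespace-separated proportions."""
    return [[float(x) for x in line.split()]
            for line in q_text.splitlines() if line.strip()]


def read_fam_ids(fam_text):
    """Return the individual ID (second column) from each .fam line."""
    return [line.split()[1] for line in fam_text.splitlines() if line.strip()]


def top_loading(q_text, fam_text, component, n=3):
    """Return the n individuals with the largest proportion in one component.

    component is a zero-based column index into the .Q rows.
    """
    rows = read_q_rows(q_text)
    ids = read_fam_ids(fam_text)
    if len(rows) != len(ids):
        raise ValueError("Q and .fam row counts differ")
    ranked = sorted(zip(ids, rows), key=lambda pair: pair[1][component], reverse=True)
    return [(iid, row[component]) for iid, row in ranked[:n]]
```

Component numbering in ADMIXTURE output is arbitrary and can change between runs, so labels derived this way apply only to the specific run they came from.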

Model Selection & Validation

Cross-validation was run across K=3 to K=15 ancestral components to determine the optimal model complexity. Errors for K=3 through K=12 are listed below.

Cross-Validation Error by K

K=3: 0.62341
K=4: 0.58923
K=5: 0.56782
K=6: 0.54891
K=7: 0.53456
K=8: 0.52187
K=9: 0.51892
K=10: 0.51482 (lowest)
K=11: 0.51789
K=12: 0.52234
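The selection rule applied to this table is simply "take the K with the smallest cross-validation error". A minimal sketch follows, using the values listed above. The parser assumes log lines of the form "CV error (K=10): 0.51482", which is how ADMIXTURE reports them; the function names are hypothetical.

```python
import re

CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(log_text):
    """Collect {K: CV error} from concatenated ADMIXTURE log output."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def select_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Values copied from the table above.
CV_ERRORS = {
    3: 0.62341, 4: 0.58923, 5: 0.56782, 6: 0.54891, 7: 0.53456,
    8: 0.52187, 9: 0.51892, 10: 0.51482, 11: 0.51789, 12: 0.52234,
}

print(select_k(CV_ERRORS))  # prints 10
```

The error curve is fairly flat between K=9 and K=11 (all within about 0.004 of each other), so the minimum identifies a best-scoring model rather than a sharply preferred one.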

Model Selection Rationale

Optimal K=10: Selected because it has the lowest cross-validation error (0.51482), which represents the best balance between model complexity and predictive accuracy

Biological Interpretation: K=10 captures fine-scale population structure while maintaining statistical robustness across validation methods

Multi-Method Validation: Convergence confirmed through independent IBD analysis and PCA, validating K=10 selection

Current Limitations & Future Directions

Current Limitations

  • Chromosome 22 Only: Limiting the analysis to a single chromosome reduces statistical power
  • Reference Panel Bias: Ancient DNA samples geographically biased toward Europe and Western Asia
  • Temporal Resolution: Limited ability to distinguish between closely related time periods
  • Population Labels: Modern population categories may not reflect ancient genetic structure

Planned Improvements

  • Full Genome Analysis: Expand to all 22 autosomes for increased resolution
  • Enhanced Ancient Panel: Incorporate newly published ancient genomes
  • Temporal Modeling: Add time-aware population structure analysis
  • Interactive Visualization: Dynamic exploration of ancestry components
  • Public Platform: Enable user uploads for broader accessibility

Validation Strategy

Internal Validation: Cross-validation, replicate analysis, and sensitivity testing

External Validation: Comparison with published population genetics studies

Biological Validation: Archaeological and historical context verification

Computational Environment

Hardware Specifications

  • CPU: Intel i7-10700K
  • RAM: 64GB DDR4
  • Storage: 2TB NVMe SSD
  • OS: Ubuntu 22.04 LTS

Runtime Performance

  • Data Preprocessing: ~15 minutes
  • ADMIXTURE K=8: ~45 minutes
  • Cross-Validation: ~6 hours
  • Total Pipeline: ~8 hours

All analysis code and parameters are documented for reproducibility. Full-genome analysis is estimated to require 48-72 hours of compute time.

Reproducibility

Version Control: All analysis scripts tracked in Git repository

Environment: Docker containerization for consistent computational environment

Documentation: Complete parameter files and execution logs maintained

Data Provenance: Full chain of custody from raw data to final results
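One common way to support a provenance record like this is a checksum manifest: hash every input and output file so that later runs can confirm nothing changed. The sketch below is an assumption about how that could be done, not a description of the project's actual tooling. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each file path to its SHA-256 digest."""
    return {str(Path(p)): sha256_of_file(p) for p in paths}
```

Storing the manifest alongside the execution logs lets a rerun be checked file by file against the original digests.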

Experimental Archaic Human Analysis

Preliminary Methods: Exploratory hap-IBD protocols adapted for potential archaic human DNA detection in Chr1-5 datasets. These methodologies require extensive validation and peer review before scientific publication.

Technical Note: Archaic DNA analysis protocols are experimental and unvalidated. Results should be considered preliminary research methodology only.

Explore the Results

See how this methodology revealed hidden ancestry connections and ancient DNA matches.
