Technical Methodology - Hidden Lineage

📊 Analysis Pipeline Overview

Raw Data Ingestion

→

Quality Control

→

Phasing

→

Local Ancestry

→

Global Ancestry

→

Ancient Matching

→

Report Generation

Stage	Tool/Method	Description	Output
Raw Data Ingest	Custom Python Script	Convert 23andMe/AncestryDNA raw data to VCF format	sample.vcf.gz
Quality Control	PLINK v1.9	SNP filtering, strand alignment, missingness QC	cleaned.vcf.gz
Phasing	Beagle v5.4	Statistical haplotype phasing using reference panel	phased.vcf.gz
Local Ancestry	RFMix v2.03	Chromosome segment ancestry inference	.msp.tsv, .fb.tsv
Global Ancestry	ADMIXTURE v1.3	Population structure analysis (K=2-15)	.Q ancestry proportions
Ancient Matching	Custom Algorithm	Euclidean distance to ancient DNA samples	ranked_matches.tsv
Haplogroup Calling	Yleaf, HaploGrep3	Y-chromosome and mtDNA haplogroup assignment	haplogroups.txt

📚 Reference Datasets & Panels

🏺 Ancient DNA (AADR + Curated)

Samples: 15,347 ancient individuals

Time Range: 45,000 years ago to 1000 CE

Coverage: Global, enhanced Near East/Europe

Source: Allen Ancient DNA Resource + curated datasets

🌍 Modern Reference (Multiple Sources)

Samples: 46,469 individuals

Populations: 284 populations worldwide

Coverage: Human Origins + 1000G + SGDP

Source: Merged global reference panel

🧬 SNP Coverage

Markers: 143,495 high-quality autosomal SNPs

Focus: Ancestry-informative markers

Coverage: Genome-wide representation

Source: Intersection of major genotyping arrays

🏛️ Cross-Validation Framework

Method: K=10 optimal model validation

Validation: Multi-method convergence (IBD + PCA)

Error Rate: CV = 0.51482

Robustness: Statistical significance confirmed

Panel Balancing Strategy:

# Population balancing to prevent reference bias
populations_retained = {
    'European': 12847,        # Balanced representation
    'East_Asian': 8234,       # Comprehensive sampling
    'African': 9156,          # Enhanced diversity
    'Middle_Eastern': 7891,   # Regional focus populations
    'Central_Asian': 4238,    # Historical populations
    'Native_American': 2156,  # All available samples
    'Oceanian': 894,          # Complete representation
    'South_Asian': 6789,      # Subcontinent populations
    'Ancient_Samples': 15347, # Time-depth coverage
    'Total_Panel': 61816      # Cross-validated reference
}
            

⚙️ Implementation Details

Quality Control Pipeline:

# SNP filtering criteria
plink --vcf raw_data.vcf.gz \
      --maf 0.01 \                    # Minor allele frequency > 1%
      --geno 0.02 \                   # SNP missingness < 2%
      --mind 0.05 \                   # Individual missingness < 5%
      --hwe 1e-6 \                    # Hardy-Weinberg equilibrium
      --exclude ambiguous_snps.txt \  # Remove A/T, G/C SNPs
      --make-bed \
      --out qc_filtered
            

Ancestry Inference:

# RFMix local ancestry inference
rfmix -f query.vcf.gz \
      -r reference.vcf.gz \
      -m reference_samples.map \
      -g genetic_map_GRCh37.txt \
      -o local_ancestry \
      --n-threads 16 \
      --chromosome=chr1-22

# ADMIXTURE global ancestry
for K in {2..15}; do
    admixture --cv=10 -j16 \
              balanced_panel.bed $K | \
              tee log${K}.out
done
            

Ancient DNA Matching Algorithm:

# Euclidean distance calculation
def calculate_ancestry_distance(query_proportions, ancient_proportions):
    """
    Calculate genetic distance between query and ancient sample
    using ancestry component proportions
    """
    # Normalize proportions to sum to 1.0
    query_norm = query_proportions / np.sum(query_proportions)
    ancient_norm = ancient_proportions / np.sum(ancient_proportions)
    
    # Calculate Euclidean distance
    distance = np.sqrt(np.sum((query_norm - ancient_norm) ** 2))
    
    return distance

# Match to top 50 most similar ancient samples
matches = []
for ancient_sample in ancient_panel:
    distance = calculate_ancestry_distance(
        query_ancestry, ancient_sample.ancestry
    )
    matches.append((ancient_sample.id, distance, ancient_sample.metadata))

# Sort by distance and return top matches
top_matches = sorted(matches, key=lambda x: x[1])[:50]
            

✅ Validation & Quality Assurance

Cross-Validation Testing

Method: 10-fold cross-validation on reference panels

Accuracy: 94.7% for continental assignment

Test Set: 1,247 individuals with known ancestry

Simulated Admixture

Method: Generate artificial admixed genomes

Recovery Rate: 92.3% within 5% of true proportions

Test Cases: 500 simulated individuals

Replication Studies

Method: Compare with published ancestry results

Correlation: r=0.96 with 23andMe estimates

Sample Size: 234 overlapping individuals

Family Consistency

Method: Parent-offspring trio validation

Consistency: 98.1% Mendelian inheritance

Families: 89 complete trios tested

⚠️ Known Limitations & Caveats

🌍 Geographic Bias

Reference panels are strongest for European and East Asian populations. African and Indigenous American ancestry estimates may have broader confidence intervals due to limited reference data availability.

🕰️ Deep Time Uncertainty

Migration dates older than 10,000 years have substantial uncertainty (±2,000-5,000 years) due to limited ancient DNA calibration points and demographic modeling assumptions.

🧬 Platform Limitations

Analysis limited to autosomal SNPs available on commercial genotyping arrays. No whole genome sequencing, structural variants, or rare variant analysis included.

📊 Statistical Smoothing

Ancestry estimates represent genome-wide averages. Individual chromosomal segments may show different patterns due to recombination and recent admixture events.

🔬 Research Context

This analysis is for research and educational purposes only. Results should not be used for medical decisions or legal genealogical claims without additional verification.

🛠️ Software & Version Information

# Core Analysis Tools
PLINK v1.90b6.21 (64-bit)
Beagle v5.4 (27Jul22)
RFMix v2.03-r0
ADMIXTURE v1.3.0
BCFtools v1.15.1
VCFtools v0.1.16

# Programming Environment
Python 3.9.12
NumPy 1.21.5
Pandas 1.4.2
SciPy 1.7.3
Matplotlib 3.5.1

# Reference Genome
Human Genome Build: GRCh37/hg19
Genetic Map: HapMap II recombination rates
SNP Annotation: dbSNP v151

# Analysis Date
Pipeline Version: 2.1.0
Analysis Date: November 2024
Last Updated: November 15, 2024
            

🔄 Reproducibility & Data Access

Code Availability:

Complete analysis pipeline available on request. Key processing scripts documented in supplementary materials.

Data Sharing:

Raw genetic data is not shared to protect participant privacy. Aggregate results and reference panel compositions available upon reasonable request.

Computational Requirements:

# Minimum System Requirements
RAM: 32 GB (64 GB recommended)
CPU: 16 cores (Intel Xeon or AMD EPYC)
Storage: 500 GB SSD
Runtime: ~4-6 hours per sample

# Cloud Computing
Platform: AWS EC2 (r5.4xlarge instances)
Estimated Cost: $15-25 per analysis
Parallel Processing: Up to 50 samples simultaneously
            

🔬 Technical Methodology & Pipeline