Software / programs


RECUR is an R application for the comparative analysis of allelic imbalance (AI) profiles derived from SNP array and next-generation sequencing data for two or more samples from the same individual. The algorithm compares AI profiles and identifies genomic segments where samples have opposite haplotype shifts, termed “mirrored subclonal allelic imbalance.” RECUR was developed by Sasha Jakubek and Anthony San Lucas and can be obtained here.
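To make the idea of a "mirrored" segment concrete, here is a minimal Python sketch. It is not RECUR's algorithm or interface. It assumes, purely for illustration, that each sample has already been reduced to one signed haplotype shift per shared genomic segment (positive toward one haplotype, negative toward the other). The function name `mirrored_segments` and the `min_shift` threshold are hypothetical.

```python
from typing import List, Sequence


def mirrored_segments(
    shifts_a: Sequence[float],
    shifts_b: Sequence[float],
    min_shift: float = 0.05,  # hypothetical threshold, not taken from RECUR
) -> List[int]:
    """Return indices of segments where two samples shift toward opposite haplotypes.

    shifts_a and shifts_b hold one signed shift per segment, in the same
    segment order for both samples (an assumption of this sketch).
    """
    mirrored = []
    for idx, (a, b) in enumerate(zip(shifts_a, shifts_b)):
        # Both samples must show an imbalance of at least min_shift.
        both_imbalanced = abs(a) >= min_shift and abs(b) >= min_shift
        # Opposite signs mean the imbalance favors different haplotypes.
        if both_imbalanced and a * b < 0:
            mirrored.append(idx)
    return mirrored


sample_1 = [0.10, 0.08, 0.00, -0.20]
sample_2 = [-0.12, 0.09, 0.30, 0.01]
print(mirrored_segments(sample_1, sample_2))
```

In this toy input only the first segment qualifies: both samples are imbalanced there, and in opposite directions.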


hapLOH profiles and characterizes tumor genomes using data from SNP microarrays (Vattathil & Scheet, 2013). It is designed to be effective in difficult settings, such as samples with low tumor purity or subclones, and for detecting clonal mosaicism in normal tissue. For details on use and further documentation please visit our Google Sites page.

The software (written by Selina Vattathil and Jerry Fowler) is freely available under The MIT License at For additional inquiries, please contact Selina Vattathil (


hapLOHseq has been developed for the detection of subtle allelic imbalance events from next-generation sequencing data. It is a sequencing-based extension of hapLOH, which detects such events from SNP array data. hapLOHseq is capable of identifying events of 10 megabases or greater occurring in as little as 16% of the sample using exome sequencing data (at 80x) and 4% using whole genome sequencing data (at 30x), exceeding the capabilities of existing software. hapLOHseq runs on Linux and Mac OS X platforms. For details on use and further documentation please visit our Google Sites page.

The software (written by Anthony San Lucas) is freely available under The MIT License for Linux ( and Mac OS X ( bundles. For additional inquiries, please contact Anthony San Lucas (


Cancer in silico Drug Discovery (CiDD) is a platform that integrates data from the TCGA, the Connectivity Map (CMap) and the Cancer Cell Line Encyclopedia to facilitate and automate the discovery of candidate drug compounds, with the ultimate goal of treatment or chemoprevention of cancer. Our manuscript is currently under review. CiDD may be obtained by visiting


Our System for Quality-Assured Data Analysis (SyQADA) is a workflow management system that seeks to improve reproducibility in as simple a framework as feasible. Numerous researchers have used SyQADA workflows for dozens of projects.

SyQADA allows the simple specification of protocols that marshal and manage input data, so that analyses can be run repeatably over large volumes of data for many samples and for different projects. SyQADA requires Unix and Python 3, and no other software besides a text editor and the analysis programs used in the workflow. SyQADA exposes no programming interface. Configuration files are simple property lists; the protocol of a workflow is expressed as a list of task definitions and attributes. To date SyQADA has supported analysis of the following: whole exome and 409-gene panel DNA sequence, RNA, 16S ribosomal DNA, whole genome microbiome sequence, and SNP6 arrays, using Affymetrix, Illumina, and Ion Torrent technologies. Analyses have included the Affymetrix Birdsuite, the GATK Best Practices variant calling workflow, hapLOH, hapLOHseq, Ion Reporter, JLOH, MaCH, MuTect, pairwise phasing, the Tuxedo suite and VarScan2, as well as numerous small programs for pre- and post-processing data.

SyQADA is coming soon at


Homopolymer Analyzer (PolyAna) was developed as a quality control step in the processing of Ion Torrent sequencing data to identify and remove potential homopolymers from the detected somatic mutations. Ion semiconductor technology inaccurately estimates the number of nucleotides within homopolymer regions, which results in false somatic mutation calls. We therefore developed an automated way to detect and annotate variants as potential homopolymers based on their position in the reference genome. A bundled file with example input/output files, a README and the script (Perl) is available here.
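The position-based idea can be illustrated with a short Python sketch. This is not the PolyAna Perl script and makes no claim about its actual rules. It assumes a reference sequence held in memory as a string, 0-based variant positions, and a hypothetical `min_run` cutoff for what counts as a homopolymer; all function names are illustrative.

```python
from typing import Dict, List, Sequence


def homopolymer_run_length(reference: str, pos: int) -> int:
    """Length of the run of identical bases that contains 0-based index pos."""
    base = reference[pos].upper()
    start = pos
    while start > 0 and reference[start - 1].upper() == base:
        start -= 1
    end = pos
    while end + 1 < len(reference) and reference[end + 1].upper() == base:
        end += 1
    return end - start + 1


def annotate_variants(
    reference: str,
    positions: Sequence[int],
    min_run: int = 4,  # hypothetical cutoff, not taken from PolyAna
) -> List[Dict[str, object]]:
    """Flag variants that fall in, or directly next to, a homopolymer run."""
    annotations = []
    for pos in positions:
        # Check the variant position and its immediate neighbors, since a
        # miscounted run often shows up as a call at the edge of the run.
        candidates = [p for p in (pos - 1, pos, pos + 1) if 0 <= p < len(reference)]
        longest = max(homopolymer_run_length(reference, p) for p in candidates)
        annotations.append(
            {"pos": pos, "run_length": longest, "homopolymer": longest >= min_run}
        )
    return annotations


reference = "ACGTTTTTTACGA"  # toy sequence with a run of six T bases
for record in annotate_variants(reference, [2, 11]):
    print(record)
```

In the toy sequence, the variant at position 2 sits immediately before the run of T bases and is flagged, while the variant at position 11 is not.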


vtools is a set of tools for annotating and tracking sequence variation for large-scale exome sequencing projects (San Lucas et al., 2011). It was developed and authored by Anthony San Lucas and Bo Peng, and is available for download at


phylogenY-aware Effect size Tests for Interaction (YETI) is a statistical framework for detecting genetic interactions. This is joint work with Dr. Yong Chen and Yulun Liu. R code for YETI can be accessed here and will eventually also be available at Yong Chen's site.


Haploscope is a tool for visualizing haplotype diversity, based on a cluster-based model for haplotype variation (Scheet & Stephens, 2006). It automates the production of images such as those in Jakobsson et al. (2008). It is written in Java (by Anthony San Lucas) and may be obtained by visiting

Haploscope is freely downloadable under the GNU GPL v3 license.


fastPHASE is a program to estimate missing genotypes and unobserved haplotypes. It is an implementation of the model described in Scheet & Stephens (2006). This is a cluster-based model for haplotype variation, and gains its utility from implicitly modeling the genealogy of chromosomes in a random sample from a population as a tree but summarizing all haplotype variation in the "tips" of the trees.

The program also offers additional functionality, including estimation and correction of genotyping errors based on patterns of linkage disequilibrium (Scheet & Stephens, 2008), haplotype-based association mapping of binary phenotypes, and estimation of missing genotypes from low-coverage sequencing data. We are in the process of developing a web-based tutorial for fastPHASE and will be updating this space soon.

fastPHASE is available as a Mac OS and a Linux executable. Documentation is available here.