RVASSOC is distributed as source code so that individual users can compile their own platform-optimized versions. It requires the freely available GNU Scientific Library (GSL), version 1.14 or higher, and a BLAS library to compile. GSL provides its own BLAS library, libgslcblas, but more optimized BLAS libraries, such as the freely available ATLAS, could be linked instead.

After unpacking the tar file into a directory, the following files should be visible:

LICENSE macros.h permwrap.cpp permwrap.h README.txt rvassoc.cpp teststats.cpp teststats.h vardata.h examples/example1.dat examples/example2.dat examples/example3.dat examples/example1.rvassoc.out examples/example2.rvassoc.out examples/example3.rvassoc.out

RVASSOC can be compiled using the GNU Compiler Collection (GCC) C++ compiler with the following minimal command:

g++ -I[GSL INCLUDE PATH] [GCC OPTIMIZATION FLAGS] -o rvassoc rvassoc.cpp permwrap.cpp teststats.cpp -L[GSL LIB PATH] -L[BLAS LIB PATH] -lgsl -l[BLAS LIB] -lm

The [bold] items must be supplied by the user. To use libgslcblas, the -L[BLAS LIB PATH] flag can be omitted and [BLAS LIB] is replaced by gslcblas. The user can choose [GCC OPTIMIZATION FLAGS] depending upon the version of GCC and the platform; see the GCC online documentation for details. At the very minimum, we recommend the -O2 optimization flag, although our original compilations used -O3 for greater optimization.

RVASSOC has only been compiled and extensively tested on x86-64 Linux platforms with GCC 4.1.2. While it should in principle be able to compile with any C++ compiler and run on any platform for which a GSL distribution exists, we do not currently provide support for platforms other than x86-64 Linux or compilers other than GCC.

## Input File

The program takes as input a tab-delimited text file containing column names in the first row and data for independent cases and controls in the following rows. The first column must contain a case indicator (1 = case/0 = control), and all subsequent columns must contain as many tab-separated genotype indicators (coded as 0, 1, or 2 minor alleles, as determined by the combined sample allele frequency) as there are names for columns other than the case indicator. Missing genotypes must be coded as -99. Below is an example of an input file:

caseind	geno1	geno2	geno3	geno4	geno5	geno6	geno7	geno8
0	0	0	0	0	0	0	0
0	0	0	0	0	-99	0	0
1	0	-99	0	0	0	0	1
1	-99	0	-99	0	2	1	0
0	0	1	0	0	0	1	0
1	1	0	0	2	0	0	0
1	0	1	2	0	0	0	0
0	2	1	1	0	0	0	-99
0	0	0	0	1	0	0	0

See the example*.dat files in the examples subdirectory of the tarfile for additional illustrations.

The observations for cases and controls can appear in any order in the input file, but changing this order between runs on the same data set may change the permutation p-values slightly even when the same random number seed is used. None of the observed test statistics or CMC results will change, but, because the permutation procedure shuffles the case status vector from its inputted order, it generates a unique set of shufflings (and sample from the permutation null distribution) for each random number seed and initial ordering of the case status vector.

All columns in the input file are treated as a single group of variants over which the user wishes to perform inference. Thus, if a user wishes to perform inference on a single gene, the input file must contain variants from only that gene. Alternatively, if a user wants to perform inference over an entire pathway, the input file must contain variants from only that pathway.

Although RVASSOC will automatically exclude variants that are monomorphic in the input file from analysis, we highly recommend that the input file include only variants that are polymorphic (i.e., have two or three distinct genotypes in the sample) and individuals with genotype data for at least one of these variants. When performing inference on small groups of variants, not following our recommendation increases the risk of including individuals with observed genotypes at only monomorphic variants. Such individuals will have completely missing genotypes after monomorphic variants are excluded from analysis, which may cause the results given by the CA max, CA sum, and WSS tests to differ from what they would have been had these individuals been excluded a priori. However, because they have some genotype data in the input file, these individuals will not trigger a warning message from the program.

## Running

The RVASSOC command line has the following format, where arguments are required arguments supplied by the user, -flags are optional flags that alter the operation of the program, and [values] are default values for the flags:

rvassoc filename -seed [314159265] -perms [10000] -cmccolmaf [0.01] -cmcswptol [1e-8] -buffsize [1048576] -fceps [1e-10]

More detail about the arguments and flags is provided below:

• filename: The path/name of the input file
• -seed: Positive integer random number seed
• -perms: Number of permutations
• -cmccolmaf: Variants with MAF ≤ cmcolmaf are collapsed in the CMC analysis
• -cmcswptol: Tolerance for diagonal elements in the covariance matrix below which the corresponding column is declared linearly dependent on previously swept columns by G2SWEEP in the CMC analysis (see Details)
• -buffsize: Length of line buffer for input file in bytes (i.e., number of characters)
• -fceps: Relative epsilon for comparing two floating-point numbers using Knuth's algorithm (see Details)

## Test Examples

There are 3 examples provided for users to test their compilations versus our original compilation and familiarize themselves with the software. The example data files are located in the examples subdirectory of the distribution tarfile and named example*.dat. When analyzed with the following RVASSOC calls, output identical to that in the example*.rvassoc.out files should be printed to STDOUT:

rvassoc example1.dat -cmccolmaf 0.02 rvassoc example2.dat -cmccolmaf 0.005 rvassoc example3.dat

The resulting output for a new compilation can be redirected to a file and compared to the output from our original compilation using a utility such as diff.

## Output

The following is an example of the output produced by the call:  rvassoc example2.dat -cmccolmaf 0.005 The contents of the output are explained in greater detail in the Details section.

------------------------------------------------------------------- RARE VARIANT ASSOCIATION ANALYSIS Version: 1.12 Copyright (C) 2010-12 University of Miami Miller School of Medicine This program is distributed under the GNU General Public License version 3. It comes with ABSOLUTELY NO WARRANTY. ------------------------------------------------------------------- Input File: example2.dat Case Indicator Variable: caseind Genotype Variables: geno1 geno2 geno3 geno4 geno5 geno6 geno7 geno8 geno9 geno10 geno11 geno12 geno13 geno14 geno15 geno16 geno17 geno18 geno19 geno20 geno21 geno22 Total Variants: 22 -------------------------------- RARE VARIANT ASSOCIATION RESULTS -------------------------------- Random Number Seed: 314159265 Cases/Controls: 97/412 Average Per-Individual Call Rate: 0.897446 FP Equality Relative Epsilon: 1e-10 Cochran-Armitage Permutation Max X2 Test: Number of Polymorphic Variants: 20 Permutations: 10000 Observed Max X2: 8.1914 P: 0.151285 P(tied): 0.00309969 Cochran-Armitage Permutation Sum X2 Test: Number of Polymorphic Variants: 20 Permutations: 10000 Observed Sum X2: 24.7015 P: 0.213279 P(tied): 9.999e-05 WSS Permutation Test: Number of Polymorphic Variants: 20 Permutations: 10000 Observed W: 25413.5 P: 0.510549 P(tied): 0.00049995 CMC Hotelling's T2 Test: Number of Polymorphic Variants: 20 Rare Variants Collapsed if MAF <= 0.005 Number of Common Variants: 10 Number of Complete Individuals: 64 G2SWEEP Tolerance: 1e-08 F(7,56): 0.841933 P: 0.557553

## Details

This program should be used only for inference on case-control designs with independent (i.e., unrelated) subjects and biallelic variants. For all tests, any variant with only a single genotype in the sample or completely missing genotypes is eliminated from consideration. The number of polymorphic variants, defined as those having 2 or more different genotypes in the sample, is reported by the software with the output of each test for convenience.

The CA max/sum tests base inferences on the permutation null distributions of the maximum/sum over single-variant Cochran-Armitage chi-square statistics in the group of variants under consideration. Kinnamon et al. (2012) found that these tests had power comparable to or greater than the CMC and WSS tests in the presence of extensive neutral variation and missing genotypes. Therefore,we recommend the use of these tests over the others implemented in RVASSOC.

Letting $Q_t$ denote the value of the summary statistic in permutation $t$ and $Q_{obs}$ denote the observed value in the sample, the two-sided p-value for the CA max/sum test is estimated from $m$ permutations by:
$$\hat{p}=\frac{\#(Q_t \ge Q_{obs})+1}{m+1}$$ This p-value follows "P:" in the output. The program also estimates the permutation null distribution probability mass function at $Q_{obs}$ by: $$\hat{p}_{tied}=\frac{\#(Q_t=Q_{obs})+1}{m+1}$$ This probability mass follows "P(tied):" in the output. High values of $\hat{p}_{tied}$ are indicative of a highly discrete permutation null distribution.

Because two floating-point representations of the same number may differ due to rounding and truncation errors, all comparisons of test statistics in the above calculations are performed using Knuth's algorithm, as implemented in the GSL function gsl_fcmp. Briefly, Knuth's algorithm returns: $$\begin{array}{rcl} Q_t \lt Q_{obs} & \mathrm{if} & Q_t-Q_{obs} \lt -2^{k} \epsilon \\ Q_t \gt Q_{obs} & \mathrm{if} & Q_t-Q_{obs} \gt 2^{k} \epsilon \\ Q_t=Q_{obs} & \mathrm{if} & |Q_t-Q_{obs}| \le 2^{k} \epsilon \end{array}$$ In these formulas, $k$ is the larger of the base-2 exponents of $Q_t$ and $Q_{obs}$, and $\epsilon$ is the floating-point equality relative epsilon set by the flag -fceps.

The CA max/sum tests use all available genotype data at every variant. The average per-individual percentage of available genotypes is reported by the software. In cases in which there are as many or more individuals with missing genotypes at a variant in the entire sample than there are individuals in either the case or control groups, the CA statistic may not be calculable at that variant in the observed sample or in one or more permutations. If this occurs, RVASSOC issues a warning message to STDERR, and the analysis should be rerun with the offending variants excluded. Make sure to keep the STDERR so that important error and warning messages are not lost!

The CMC test (Li and Leal 2008) collapses variants having overall sample MAF (i.e., MAF estimate from both cases and controls) ≤ the value specified by -cmccolmaf into a single presence/absence indicator in the analysis. An approximate $F$ test based Hotelling's $T^2$ is used for inference on the vector comprising this indicator and minor allele counts at the common variants. This vector has a number of elements equaling the number of common variants plus 1.

One issue not considered in the original paper is that linkage disequilibrium (LD) among common variants can reduce the number of linearly independent elements in this random vector, which leads to a singular covariance matrix for which a standard inverse does not exist. However, calculating Hotelling’s $T^2$ statistic with any generalized inverse is equivalent to calculating the statistic with a standard inverse on a full-rank subset of linearly independent common variants (Kinnamon et al. 2012). Goodnight (1979) provides an algorithm for automatically calculating a $g2$ generalized inverse and the dimension of the full-rank subset without any prior knowledge of the full-rank subset. The algorithm involves applying the G2SWEEP operator once to each of the columns of the covariance matrix in succession. This operator zeros the rows and columns corresponding to common variants that are linearly dependent on the previous common variant minor allele counts and/or the rare variant indicator. Linear dependence is determined numerically using a relative tolerance specified by the flag -cmcswptol. The effective number of linearly independent vector elements, $v$, is then automatically obtained by subtracting the number of columns that are zeroed from the total number of columns in the covariance matrix. The p-value is then calculated using the approximate $F(v,N-v-1)$ distribution of the appropriately scaled Hotelling’s $T^2$ statistic calculated using the $g2$ generalized inverse of the covariance matrix, where $N$ is the total sample size.

Only individuals with complete genotype data at common variants can be used in calculating Hotelling’s $T^2$. Provided genotype data at all common variants are complete, individuals with missing genotype data at rare variants can be used if at least one minor allele is present for a variant with a non-missing genotype because the coding of the rare variant indicator would be 1 regardless of the other variant genotypes. However, if no minor allele is present at any variant with non-missing genotypes, the coding of the rare variant is ambiguous because it would depend on the values of the unobserved genotypes. Therefore, such individuals also have to be excluded from calculating Hotelling’s $T^2$. The software outputs the number of individuals with complete data who were used in the CMC. If there are not at least 2 cases, 2 controls, and 1 denominator degree of freedom (ddf), the CMC test cannot be performed and the results will read F(-99,-99): -99 P: -99. With ddf≤4, results are reported, but they may not be reliable based on our experiments and the fact that the variance of the $F$ distribution exists only with ddf>4.

In cases where there are as many common variants as polymorphic variants (i.e., no rare variants below MAF cutoff), the CMC yields the same result as Hotelling's $T^2$ test on all variants because the column of zeros for the rare variant indicator is eliminated as linearly dependent by the G2SWEEP operator.

The implementation of the WSS test follows the description in Madsen and Browning (2009) with four modifications. First, midranks are used to break ties in genetic scores when calculating the rank-sum statistic, $W$. Second, a two-sided p-value is used. Third, the two-sided p-value is estimated directly from the permutation distribution of $W$. Letting $W_t$ denote the value in permutation $t$, $W_{obs}$ denote the observed value in the sample, and $\overline{W}$ denote the mean of $W$ over all $m$ permutations, the two-sided p-value is estimated as: $$\hat{p}^{WSS}=\frac{\#(|W_t - \overline{W}|\ge|W_{obs} - \overline{W}|)+1}{m+1}$$ This p-value follows "P:" in the output. The program also estimates the permutation null distribution probability mass function at $|W_{obs} - \overline{W}|$ by: $$\hat{p}^{WSS}_{tied}=\frac{\#(|W_t - \overline{W}|=|W_{obs} - \overline{W}|)+1}{m+1}$$ This probability mass follows "P(tied):" in the output. High values of $\hat{p}^{WSS}_{tied}$ are indicative of a highly discrete permutation null distribution. Again, all comparisons in the above calculations are performed using the gsl_fcmp implementation of Knuth's algorithm. Finally, missing single-variant genotypes, which were not considered in the original paper, are not used in estimation of the MAF in controls and assigned values of 0 so as not to contribute to the WSS in an individual. This procedure is equivalent to calculating the genetic score over only nonmissing genotypes in each individual.

## References

CA Max/Sum Tests
Kinnamon DD, Hershberger RE, Martin ER. Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. PLoS ONE. 2012;7(2):e30238. Epub 2012 Feb 17. PMID: 22363423

Knuth's Floating-Point Comparison Algorithm
Knuth DE. The Art of Computer Programming. 3rd ed. Vol. 2, Seminumerical Algorithms. Upper Saddle River, NJ: Addison-Wesley; 1998. Section 4.2.2; p. 233.

CMC Test
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008 Sep 12;83(3):311-21. Epub 2008 Aug 7. PMID: 18691683.

G2SWEEP Operator
Goodnight JH. A tutorial on the SWEEP operator. Am Stat. 1979;33:149-158.

WSS Test
Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009 Feb;5(2):e1000384. Epub 2009 Feb 13. PMID: 19214210.