Short Courses (May 22-23, 2022)

Unless otherwise specified, the morning session of a short course is from 9:00 AM to 12:30 PM EST, and the afternoon session is from 1:30 PM to 5:00 PM EST.

Bayesian Clinical Trial Designs and Use of RWD

Instructor: Dr. Peter Mueller

Peter Mueller is Professor in the Department of Statistics and Data Science and in the Department of Mathematics at UT Austin. Before coming to Austin he served on the faculty in the Institute of Statistics and Decision Science at Duke University, and in the Department of Biostatistics at M.D. Anderson Cancer Center. He received his Ph.D. degree in statistics from Purdue University in 1991. Dr. Mueller's current major area of interest is the theory and application of statistics to biomedical problems. In particular, he has developed methods for non-parametric Bayesian data analysis, semi-parametric statistical methods for repeated measurement data, simulation based approaches to optimal design, innovative clinical trial designs, model based smoothing methods, and simulation based methods for posterior inference. He is an elected fellow of the ASA, the IMS, and ISBA, and a recipient of the Zellner Medal, and he has served as president of ISBA and as chair of the ISBA/BNP section and the ASA/SBSS section.


We discuss Bayesian approaches to clinical trial design, focusing on early phase studies. We start with a brief review of Bayesian inference, to introduce notation and concepts. The discussion includes Bayesian decision problems that add a formal description of selecting optimal actions in the context of Bayesian inference. We review the setup and elements of basic decision problems. We then proceed with a review of Bayesian approaches to phase I designs, including CRM (O'Quigley et al., 1990), EWOC (Tighiouart and Rogatko, 2010), Bayesian logistic regression (Neuenschwander et al., 2008), mTPI (Ji et al., 2010) and BOIN (Liu and Yuan, 2015). We summarize some recent results (Duan et al., 2022) showing how these designs can be represented as special instances of a more general unified decision-theoretic formulation of the phase I design problem.

In the afternoon we will discuss general notions of adaptive Bayesian designs for phase II trials, including in particular (adaptive) sequential stopping. We will review master protocols and underlying hierarchical models for Bayesian inference across related cohorts (sub-models). As a last major theme we discuss challenges and opportunities related to using real world data (RWD) in clinical trial design. We will introduce methods that exploit RWD and historical trials for prior construction. This includes historical data priors (Chen and Ibrahim, 2000), commensurate priors (Hobbs et al., 2011), MAP priors (Neuenschwander et al., 2010) and robust MAP priors (Schmidli et al., 2014). Several methods use propensity scores to adjust for a lack of randomization, including approaches proposed in Liu et al. (2021), Chen et al. (2020) and Wang and Rosner (2019). Finally, we discuss in more detail a recently proposed approach by Chandra et al. (2022).

Statistical Learning Methods in Neuroimaging Data Analysis (Recording)

Instructor: Dr. Hongtu Zhu

Dr. Hongtu Zhu is a tenured professor of biostatistics and computer science at the University of North Carolina at Chapel Hill (UNC-CH). He was DiDi Fellow and Chief Scientist of Statistics at DiDi Chuxing, an IOT company, and held the Endowed Bao-Shan Jing Professorship in Diagnostic Imaging at MD Anderson Cancer Center. He is an internationally recognized expert in statistical learning, medical image analysis, precision medicine, biostatistics, artificial intelligence, and big data analytics. He has been an elected Fellow of the American Statistical Association and the Institute of Mathematical Statistics since 2011. He received an established investigator award from the Cancer Prevention Research Institute of Texas in 2016 and received the INFORMS Daniel H. Wagner Prize for Excellence in Operations Research Practice in 2019. His work has received more than 19,000 Google citations since 2001. His group is a pioneer in the joint analysis of imaging, clinical, and genetic data from large-scale biobank studies, such as the UK Biobank. He has published more than 300 papers in top journals including Nature, Science, Cell, Nature Genetics, PNAS, Biometrika, JASA, AOS, and JRSSB, as well as more than 45 conference papers in top conferences including NeurIPS, AAAI, KDD, ICDM, MICCAI, and IPMI. He has served or is serving as an editorial board member of premier international journals including Statistica Sinica, JRSSB, AOS, and JASA.


With modern imaging techniques, massive imaging data can be observed over both time and space. Such techniques include functional magnetic resonance imaging (fMRI), electroencephalography (EEG), diffusion tensor imaging (DTI), positron emission tomography (PET), and single-photon emission computed tomography (SPECT), among many others. The subject of medical imaging analysis has grown rapidly, from simple algebraic operations on imaging data to advanced statistical and mathematical methods. This course is designed to introduce students to advanced topics in statistical learning methods for medical imaging data.

This course is designed for researchers and students who wish to analyze and model medical image data quantitatively. The course material is applicable to a wide variety of medical and biological imaging problems. The topics cover some basic neuroimaging modalities, shape representation, population statistics, manifold-data analysis, big-data integration, imaging genetics, and mapping genetic-imaging-clinical networks.

Challenges and Methods for Extracting Reliable Evidence From Real World Data

Instructor: Dr. Yong Chen

Yong Chen is an Associate Professor of Biostatistics in the Department of Biostatistics, Epidemiology and Informatics at the University of Pennsylvania. He has a strong interest in statistical theory with a focus on robust inference, and methodological research on leveraging large healthcare data (EHR data, administrative claims data) for evidence-based medicine and personalized disease prevention/intervention strategies. He is keen on developing informatics and statistical methods with associated software, using EHR data, to facilitate evidence extraction and synthesis for comparative effectiveness studies, as well as well-calibrated risk prediction models for aiding clinical decision-making. He has published over 160 peer-reviewed papers in statistical inference, medical informatics, comparative effectiveness research, and biomedical sciences. He has taught short courses at JSM, ENAR, the Deming Conference on Applied Statistics, the ICSA annual conference, and workshops at the University of Pennsylvania.


The widespread adoption of electronic health records (EHR) has created a vast resource for the study of treatments and health outcomes in the general population. The 21st Century Cures Act and the FDA’s subsequent publication of a framework for using real world data (RWD) to generate real world evidence (RWE) have spurred additional interest in using EHR to generate RWE. While there are many benefits to conducting research with RWD, many challenges arise due to the complex and messy processes that give rise to EHR data. To make valid inference, statisticians must be aware of data generation, capture, integration, and availability issues and utilize appropriate study designs and statistical analysis methods to account for these issues. In this half-day short course, we will discuss key issues for research conducted using RWD, including errors in covariates and outcomes extracted from EHR data and the synthesis of evidence from EHR data across heterogeneous clinical sites. For each issue we will present a motivating case study to focus our discussion and use this to spur thinking about the pros and cons of using RWD for a given research question and alternative methodological choices that can strengthen inference. The overarching goal is to provide participants with a framework for thinking about the design and analysis of EHR-based studies to help guide their use of statistical best practices in the conduct of their own research.

  1. Introduction
  2. Overview of real-world data and EHR data
  3. Correcting for bias due to EHR data errors
  4. Data integration for distributed healthcare data networks
  5. Wrap-up

Programming With Hierarchical Statistical Models Using NIMBLE

Instructor: Dr. Sally Paganin

Sally Paganin is a Research Fellow in the Department of Biostatistics at Harvard T.H. Chan School of Public Health, currently working on statistical methods for early cancer detection. Previously, she was a Postdoctoral Researcher at UC Berkeley, where she worked on Bayesian methodology and algorithms, contributing to the NIMBLE project. Her research focuses on latent variable models and Bayesian nonparametrics, along with the development of statistical software and algorithms.


NIMBLE is a system for building and sharing methods for statistical models, especially for hierarchical models and computationally-intensive methods. NIMBLE is built in R but compiles models and algorithms using C++ for speed. The resulting objects are manipulated from R without any need for analysts to program in C++. NIMBLE provides analysts with a flexible system for using MCMC, sequential Monte Carlo, MCEM, and other techniques, along with the ability to write computationally efficient algorithms in an R-like syntax that can be easily disseminated.

This workshop will introduce the NIMBLE system and demonstrate how one can use NIMBLE to:

  • Flexibly specify an MCMC for a specific model, including choosing samplers and blocking approaches;
  • Tailor an MCMC to a specific model using user-defined distributions and user-defined functions;
  • Write your own MCMC sampling algorithms and use them in combination with samplers from NIMBLE's library of samplers;
  • Use specialized model components such as Dirichlet processes, conditional autoregressive (CAR) models, and reversible jump for variable selection.

Outline (tentative):

  • Introduction to NIMBLE: basic concepts and workflows
  • Customizing an MCMC and advanced model building
  • Highlights of special features in NIMBLE
  • Programming algorithms in NIMBLE

Participants should have a basic understanding of Bayesian/hierarchical models and of one or more algorithms such as MCMC. Some experience with R is also expected. Please bring a laptop; I’ll give instructions in advance for installing NIMBLE.

Applied Event Time Data Analytics with R

Instructors: Dr. Sy Han (Steven) Chiou and Dr. Jun Yan

Dr. Sy Han (Steven) Chiou is an assistant professor in the Department of Mathematical Sciences at the University of Texas at Dallas (UTD). Before joining UTD, Dr. Chiou was a postdoctoral research fellow in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health during 2015-2017 and an assistant professor in the Department of Mathematics at the University of Minnesota Duluth during 2013-2015. Dr. Chiou received his PhD in Statistics from the University of Connecticut in 2013. Dr. Chiou's primary research interests focus on addressing important questions that arise with data under complicated sampling schemes, dependent truncation, and recurrent event data. Dr. Chiou has developed several R packages on related topics, including aftgee, reReg, rocTree, and spef. Dr. Chiou is an Elected Member of the International Statistical Institute.

Dr. Jun Yan is a Professor in the Department of Statistics at the University of Connecticut (UConn) and a Research Fellow in the Center for Population Health at UConn Health. He received his PhD in Statistics from the University of Wisconsin-Madison in 2003. He was on the faculty of the Department of Statistics and Actuarial Science at the University of Iowa for four years before joining UConn in 2007. Dr. Yan's methodological research interests include survival analysis, clustered data analysis, spatial extremes, and statistical computing. His application domains are public health, environmental sciences, and sports. With a special interest in making his statistical methods available via open source software, he and his coauthors have developed and maintain a collection of R packages in the public domain. Since July 2020, he has been the editor of the Journal of Data Science and has led the reform of the journal. Dr. Yan is an Elected Member of the International Statistical Institute and a fellow of the American Statistical Association.


This course will provide a comprehensive and practical introduction to analyzing data in the form of time-to-event or survival times. We will begin with the fundamental survival analysis concepts and techniques, including censoring, truncation, survival functions, hazard functions, Kaplan-Meier curves, and log-rank tests. Regression analysis covers the Cox proportional hazards model and the accelerated failure time (AFT) model. The models will be extended to allow a cure rate and variable selection. Multivariate event times will be analyzed with marginal Cox or AFT models. A special type of multivariate event time data is recurrent events. Standard survival analysis methods that focus only on time to the first event cannot capture the cumulative experience of the recurrent events and could lead to invalid inferences. Thus, the development of statistical methods that appropriately address the structure of recurrent events has attracted considerable attention. We will introduce visualization tools, nonparametric estimation, and regression analysis for recurrent event data. All statistical analyses will be illustrated with practical applications in R.

Target audience

This course's intended audience includes researchers who want to gain basic exposure to analyzing time-to-event data with the ultimate goal of incorporating R into their research programs.


Prerequisites: introductory statistics, entry-level knowledge of R, and a laptop.

  • Introduction to survival data: survival, survminer
  • Cox & AFT: survival, aftgee
  • Cure-rate: intsurv, smcure
  • Recurrent events: reda, reReg

Machine Learning for Analyzing Patient Health Data: Small n, Large p, and the Implications

Instructor: Dr. Fei Wang

Fei Wang is an Associate Professor in the Division of Health Informatics, Department of Population Health Sciences, Weill Cornell Medicine, Cornell University. His major research interest is data mining, machine learning and their applications in health data science. He has published more than 250 papers in top venues of related areas such as ICML, KDD, NIPS, CVPR, AAAI, IJCAI, JAMA Internal Medicine, Annals of Internal Medicine, Lancet Digital Health, etc. His papers have received over 19,000 citations so far, with an h-index of 67. His (or his students’) papers have won 8 best paper (or nomination) awards at top international conferences on data mining and medical informatics. His team won the championship of the NIPS/Kaggle Challenge on Classification of Clinically Actionable Genetic Mutations in 2017 and the Parkinson's Progression Markers Initiative data challenge organized by the Michael J. Fox Foundation in 2016. Dr. Wang is the recipient of the NSF CAREER Award in 2018, as well as the inaugural research leadership award at the IEEE International Conference on Health Informatics (ICHI) 2019. Dr. Wang’s research has been supported by NSF, NIH, ONR, PCORI, MJFF, AHA, Amazon, etc. Dr. Wang is the past chair of the Knowledge Discovery and Data Mining working group in the American Medical Informatics Association (AMIA). Dr. Wang is a fellow of AMIA and a Distinguished Member of ACM.


Machine learning algorithms, especially deep learning, have achieved great successes in a number of application domains in recent years. These approaches typically need a large data set for effective training (large n) due to the high model complexity. In healthcare and medicine, the study problems are typically complicated due to the complexity of the diseases, and the size of the patient samples available for model training is typically limited (small n). At the same time, with the rapid development of computer software and hardware technologies and initiatives such as precision medicine, richer and more heterogeneous information is captured for each individual patient nowadays (large p, such as electronic health records, multi-omics, biomedical images, etc.). These trends and characteristics of patient health data make machine learning model development promising but challenging. In this short course, I will present experiences (successes and failures) from recent years in developing machine learning models in such scenarios, covering a diverse set of topics including multi-modal learning, algorithmic fairness, model interpretability, and federated and transfer learning. I will demonstrate that all these topics can be naturally unified and understood within such a small n, large p learning framework. I will also discuss its implications for future research directions.

Statistical Topics in Outcomes Research: Patient-Reported Outcomes, Meta-Analysis, and Health Economics

Instructors: Dr. Joseph C. Cappelleri and Dr. Thomas Mathew

Joseph C. Cappelleri, PhD, MPH, MS is an executive director in the Statistical Research and Data Science Center at Pfizer Inc. He earned his M.S. in statistics from the City University of New York (Baruch College), Ph.D. in psychometrics from Cornell University, and M.P.H. in epidemiology from Harvard University. As an adjunct professor, Dr. Cappelleri has served on the faculties of Brown University, University of Connecticut, and Tufts Medical Center. He has delivered numerous conference presentations and has published extensively on clinical and methodological topics, including on regression-discontinuity designs, meta-analyses, and health measurement scales. He is lead author of the book Patient-Reported Outcomes: Measurement, Implementation and Interpretation and has co-authored or co-edited three other books (Phase II Clinical Development of New Drugs, Statistical Topics in Health Economics and Outcomes Research, Design and Analysis of Subgroups with Biopharmaceutical Applications). Dr. Cappelleri is a fellow of the American Statistical Association and president of the New England Statistical Society.

Thomas Mathew, PhD, is Professor in the Department of Mathematics & Statistics, University of Maryland Baltimore County (UMBC). He earned his PhD in statistics from the Indian Statistical Institute in 1983, and has been a faculty member at UMBC since 1985. He has delivered numerous conference presentations, nationally and internationally, and has published extensively on methodological and applied topics, including cost-effectiveness analysis, bioequivalence testing, exposure data analysis, meta-analysis, mixed and random effects models, and tolerance intervals. He is the co-author of two books, Statistical Tests in Mixed Linear Models and Statistical Tolerance Regions: Theory, Applications and Computation, both published by Wiley. He has served on the Editorial Boards of several journals, and is currently an Associate Editor of the Journal of the American Statistical Association, Journal of Multivariate Analysis, and Sankhya. Dr. Mathew is a Fellow of the American Statistical Association, and a Fellow of the Institute of Mathematical Statistics. He has also been appointed as Presidential Research Professor at his campus.


Based in part on the co-edited volume “Statistical Topics in Health Economics and Outcomes Research” (Alemayehu et al.), this four-hour short course recognizes that, with ever-rising healthcare costs, evidence generation through health economics and outcomes research (HEOR) plays an increasingly important role in decision-making about the allocation of resources. This course highlights three major topics related to HEOR, with objectives to learn about 1) patient-reported outcomes, 2) analysis of aggregate data, and 3) methodological issues in health economic analysis. Key themes on patient-reported outcomes are presented regarding their development and validation: content validity, construct validity, and reliability. Regarding analysis of aggregate data, several areas are elucidated: traditional meta-analysis, network meta-analysis, assumptions, and best practices for the conduct and reporting of aggregated data. For methodological issues on health economic analysis, cost-effectiveness criteria are covered: traditional measures of cost-effectiveness, the cost-effectiveness acceptability curve, statistical inference for cost-effectiveness measures, the fiducial approach (or generalized pivotal quantity approach), and a probabilistic measure of cost-effectiveness. Illustrative examples are used throughout the course to complement the concepts. Attendees are expected to have taken at least one graduate-level course in statistics.

Learning Objectives

To understand and critique the major methodological issues in outcomes research on the development and validation of patient-reported outcomes, traditional meta-analysis and network meta-analysis, and health economic analysis.

  • Alemayehu D, Cappelleri JC, Emir B, Zou KH (editors). Statistical Topics in Health Economics and Outcomes Research. Boca Raton, Florida: Chapman & Hall/CRC Press. 2017.
  • Bebu I, Luta G, Mathew T, Kennedy TA, Agan BK. Parametric cost-effectiveness inference with skewed data. Computational Statistics and Data Analysis. 2016; 94:210–220.
  • Bebu I, Mathew T, Lachin JM. Probabilistic measures of cost-effectiveness. Statistics in Medicine. 2016; 35:3976–3986.
  • Cappelleri JC, Zou KH, Bushmakin AG, Alvir JMJ, Alemayehu D, Symonds T. Patient-Reported Outcomes: Measurement, Implementation and Interpretation. Boca Raton, Florida: Chapman & Hall/CRC Press. 2013.