This course introduces existing libraries and tools available within the Hadoop and Spark ecosystem. We will cover general practices for using a Hadoop/Spark cluster on practical analysis problems, such as running batch jobs under different cluster deployment modes and running interactive jobs. Existing analysis libraries and applications will be introduced during the class, including Hadoop Streaming, MLlib, Spark SQL and GraphX. We will also show how to use a Hadoop/Spark cluster with other programming languages, including R and Python. Participants should have basic coding knowledge and experience and be comfortable writing code. Participants are also expected to be familiar with the Hadoop cluster system and concepts of parallelism, and to be able to work on computing resources at TACC.



Analysis with Spark

Practical Machine Learning with MLlib and GraphX

Data Frame JSON File


Weijia Xu, Ph.D.

Dr. Weijia Xu is the group lead for the Data Mining & Statistics group. Prior to joining TACC, he obtained a master's degree in Biological Sciences and a doctoral degree in Computer Science from The University of Texas at Austin. Dr. Xu's main research interest is in the field of large-scale information management and analysis. The goal of his research is to enable data-driven discoveries by developing new methods and applications that facilitate the data-to-knowledge transfer process. Dr. Xu has extensive experience working with domain scientists on database and analytical methods development. He has over thirty peer-reviewed conference and journal publications in similarity-based data retrieval, data analysis and information visualization, with data from various scientific domains.

Zhao Zhang, Ph.D.

Dr. Zhao Zhang is a computer scientist in the Data Intensive Computing group at TACC. His research interest is building computer systems that enable and facilitate scientific research in parallel and distributed computing environments. Dr. Zhang's current work focuses on machine learning and deep learning systems. He supports open source deep learning frameworks on TACC supercomputers and clusters, and is actively researching deep learning framework scalability, performance prediction, reproducibility, and usability. Dr. Zhang has rich collaboration experience with domain scientists in astronomy, bioinformatics, and earth science. Dr. Zhang joined TACC in 2016. Before that, he was a joint postdoctoral researcher in AMPLab and the Berkeley Institute for Data Science at the University of California, Berkeley. He received his Ph.D. from the Department of Computer Science at the University of Chicago in 2014.

Last modified: Tuesday, May 21, 2019, 11:07 AM