All fields of science are experiencing an explosive growth of data, requiring new techniques to analyze, manage, store and visualize this information in order to drive new scientific discoveries and enable new technologies. Faculty and students conduct research on fundamental and applied topics related to all aspects of data science, including statistical analytics, data management systems, high performance and distributed computing, artificial intelligence/machine learning and visualization.
Hong Yang (advisors: Travis Desell and Alexander Ororbia)
Introduction to Big Data
This course provides a broad introduction to the exploration and management of large datasets being generated and used in the modern world. First, practical techniques used in exploratory data analysis and mining are introduced; topics include data preparation, visualization, statistics for understanding data, and grouping and prediction techniques. Second, approaches used to store, retrieve, and manage data in the real world are presented; topics include traditional database systems, query languages, and data integrity and quality. Case studies will examine issues in data capture, organization, storage, retrieval, visualization, and analysis in diverse settings such as urban crime, drug research, census data, social networking, and space exploration. Big data exploration and management projects, a term paper and a presentation are required. Sufficient background in database systems and statistics is recommended.
Foundations of Database System Implementation
This course provides a broad introduction to database management systems including data modeling, the relational model, and SQL. Database system implementation issues are covered next, where the focus is on data structures and algorithms used to implement database management systems. Topics include physical data organizations, indexing and hashing, query processing and optimization, database recovery techniques, transaction management, concurrency control, and database performance evaluation. Current research topics in database system implementation are also explored. Programming projects, a term paper, and presentations will be required.
Note: Students who take this course may not take CSCI-421 for credit.
Data Security and Privacy
This course examines policies, methods and mechanisms for securing enterprise and personal data and ensuring data privacy. Topics include data integrity and confidentiality; access control models; secure database architectures; secure transaction processing; information flow, aggregation, and inference controls; auditing; securing data in contemporary (relational, XML and other NoSQL) database systems; data privacy; and legal and ethical issues in data protection. Programming projects are required.
Big Data Analytics
This course provides a graduate-level introduction to the concepts and techniques used in data mining. Topics include the knowledge discovery process; prototype development and building data mining models; current issues and application domains for data mining; and legal and ethical issues involved in collecting and mining data. Both algorithmic and application issues are emphasized to permit students to gain the knowledge needed to conduct research in data mining and apply data mining techniques in practical applications. Data mining projects, a term paper, and presentations are required.
Foundations of Data Cleaning and Preparation
This course provides an introduction to the concepts and techniques used in preparing data for subsequent data mining. Topics include the knowledge discovery process; data exploration and its role; data extraction, cleaning, integration and transformation; handling numeric, unstructured, text, web, and other forms of data; and ethical issues underlying data preparation and mining. Data cleaning projects, a term paper, and presentations are required.
Note: Students who take this course may not take CSCI-521 for credit.
Data Analytics Cognitive Comp
Building on prior knowledge of data analytics, this course brings in the impact of natural language processing and cognitive computing on data analysis. Topics include an overview of natural language processing; data mining, information retrieval and knowledge processing; corpus identification and preparation; training and test data and methods; current research in the field; and ethical concerns. Students will apply the concepts learned in class through team projects, programming assignments, presentations, and a research paper.
Web Services and Service Oriented Computing
This course introduces fundamental concepts of Web services and the Service-Oriented Computing (SOC) paradigm, and reviews seminal work, current research, and modern practices in these areas. Topics in Web Services include XML; reference model (WSDL, UDDI, SOAP); service coordination and composition; and service security and privacy. Big data analytics in SOC will also be covered, such as large scale service data retrieval and storage, service clustering and classification, service recommendation, and service discovery. Students will apply the concepts learned in the class through programming assignments and a comprehensive term project.
Topics in Data Science for Computer Scientists
This course examines current topics in Data Science. This is intended to allow faculty to pilot potential new graduate offerings. Specific course details (such as prerequisites, course topics, format, learning outcomes, assessment methods, and resource needs) will be determined by the faculty member(s) who propose a specific topics course in this area. Specific course instances will be identified as belonging to the Data Science cluster, the Security cluster, or both.
Foundations of Data Science and Analytics
A foundations course in data science, emphasizing both concepts and techniques. The course provides an overview of data analysis tasks and the associated challenges, spanning data preprocessing, model building, model evaluation, and visualization. The major areas of machine learning, such as unsupervised, semi-supervised and supervised learning, are covered through data analysis techniques including classification, clustering, association analysis, anomaly detection, and statistical testing. The course includes a series of assignments utilizing practical datasets from diverse application domains, which are designed to reinforce the concepts and techniques covered in lectures. A substantial project related to one or more datasets culminates the course.
Software Engineering for Data Science
This course focuses on the software engineering challenges of building scalable and highly available big data software systems. It covers software design and development methodologies and the available technologies that address the major software aspects of a big data system, including software architectures, application design patterns, different types of data models and data management, and deployment architectures.
High Performance Data Science
This course will cover concurrent, parallel and distributed programming paradigms and methodologies with a focus on implementing them for use in applied data science or scientific computing tasks. In particular, the course will focus on developing software using graphics processing units (GPUs) and the Message Passing Interface (MPI), with an emphasis on properly handling large-scale, real-world data as part of these applications. The course will also teach scalability and load balancing techniques for developing efficient distributed systems. Programming assignments are required.
NoSQL Database Normalization [Mior]: Constructing a normalized model from a NoSQL database is a challenging problem. Traditional normalization algorithms such as lossless join BCNF decomposition fail to appropriately handle the forms of denormalization present in NoSQL databases. Appropriate algorithms for constructing a normalized schema from NoSQL databases are a critical step in performing meaningful data integration with other sources. https://michael.mior.ca/projects/eson
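For readers unfamiliar with the traditional algorithm mentioned above, the following is a minimal textbook-style sketch of lossless-join BCNF decomposition on a flat relation. It is not the project's method; it illustrates only the classical baseline that the project description says falls short on NoSQL data. The attribute names, the functional dependencies, and the function names (`closure`, `find_violation`, `bcnf_decompose`) are all hypothetical, chosen for illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return the closure of a set of attributes under the given FDs.

    fds is a list of (lhs, rhs) pairs, each side a frozenset of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def find_violation(relation, fds):
    """Return (X, X+ restricted to relation) for a BCNF violation, or None.

    X violates BCNF when it determines something outside itself
    but is not a superkey of the relation. Exponential in the number
    of attributes, which is acceptable for a small illustration.
    """
    attrs = sorted(relation)
    for size in range(1, len(attrs)):
        for subset in combinations(attrs, size):
            x = frozenset(subset)
            x_plus = closure(x, fds) & relation
            if x_plus != x and x_plus != relation:
                return x, x_plus
    return None


def bcnf_decompose(relation, fds):
    """Recursively split a relation until every part is in BCNF."""
    relation = frozenset(relation)
    violation = find_violation(relation, fds)
    if violation is None:
        return [relation]
    x, x_plus = violation
    # The two parts share exactly X, and X determines all of the first part,
    # so joining them back is lossless.
    first = x_plus
    second = x | (relation - x_plus)
    return bcnf_decompose(first, fds) + bcnf_decompose(second, fds)


# Hypothetical denormalized "orders" record with an embedded customer name.
orders = {"order_id", "customer_id", "customer_name", "total"}
fds = [
    (frozenset({"order_id"}), frozenset({"customer_id", "total"})),
    (frozenset({"customer_id"}), frozenset({"customer_name"})),
]
tables = bcnf_decompose(orders, fds)
for table in tables:
    print(sorted(table))
```

The sketch assumes flat tuples with known functional dependencies. Nested documents, repeated arrays, and the same data duplicated across collections fall outside that assumption, which is the gap the project description refers to.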
Column2Vec [Mior, Ororbia]: Column2Vec is a distributed representation of database columns based on column metadata. Our distributed representation has several applications. Using known names for groups of columns (i.e., a table name), we train a model to generate an appropriate name for columns in an unnamed table. We demonstrate the viability of our approach using schema information collected from open source applications on GitHub. https://arxiv.org/abs/1903.08621
Data quality and security evaluation framework for mobile device platforms [Reznik]: The project builds a proof-of-concept design, which will be used to develop, verify and promote a comprehensive methodology for data quality and cybersecurity (DQS) evaluation. The methodology focuses on integrating cybersecurity with other diverse metrics reflecting DQS, such as accuracy, reliability, timeliness, and safety, into a single methodological and technological framework. The framework will include generic data structures and algorithms covering DQS evaluation. While the developed evaluation techniques will cover a wide range of data sources, from cloud-based data systems to embedded sensors, the framework implementation will concentrate on mobile devices owned by ordinary users, and Android-based smartphones in particular.
Intelligent Security Systems [Reznik]: The project designs a curriculum and develops course materials for a college-level course on Intelligent Security Systems, tests and evaluates them in real college classroom settings, and prepares and submits them for dissemination. To facilitate interconnections with other courses and inclusion in national cybersecurity curricula, the course is composed of nine separate modules. Five modules cover specialized topics: a review of the current state of cybersecurity and its problems and approaches; firewall design; intrusion detection systems; anti-malware methods and tools; and hacking activity and attack recognition and prevention. Other modules provide additional support for course teaching preparation, such as test and exam questions, course project and research assignment specifications, and tool presentation descriptions. The course is distinctive in merging knowledge areas as diverse as artificial intelligence and machine learning techniques with computer security systems and applications. It combines theoretical knowledge with practical skills development, aims to prepare students better for their practical work ahead, and advances students' research, communication and presentation skills.