Data Science

Narrative

All fields of science are experiencing an explosive growth of data, requiring new techniques to analyze, manage, store and visualize this information in order to drive new scientific discoveries and enable new technologies. Faculty and students conduct research on fundamental and applied topics related to all aspects of data science, including statistical analytics, data management systems, high performance and distributed computing, artificial intelligence/machine learning and visualization.

Ph.D. Students

  • Farhad Akhbardeh (advisor: Travis Desell)
  • Asma Alnemari (advisors: R. Raj and C. Romanowski)
  • Chris Bondy (advisor: Pengcheng Shi)
  • Eduardo Coelho (advisor: Xumin Liu)
  • Fernando Cueva (advisor: Pengcheng Shi)
  • AbdElRahman ElSaid (advisors: Travis Desell and Alex Ororbia)
  • Igor Khokhlov (advisor: Leon Reznik)
  • Justin Namba (advisor: Michael Mior)
  • Mark Petrie (advisor: Pengcheng Shi)
  • Shao-Hsuan Su (advisor: S. Jay Yang)
  • Gordon Werner (advisor: S. Jay Yang)

Related Courses

CSCI-620
Credits 3
This course provides a broad introduction to the exploration and management of large datasets being generated and used in the modern world. First, practical techniques used in exploratory data analysis and mining are introduced; topics include data preparation, visualization, statistics for understanding data, and grouping and prediction techniques. Second, approaches used to store, retrieve, and manage data in the real world are presented; topics include traditional database systems, query languages, and data integrity and quality. Case studies will examine issues in data capture, organization, storage, retrieval, visualization, and analysis in diverse settings such as urban crime, drug research, census data, social networking, and space exploration. Big data exploration and management projects, a term paper and a presentation are required. Sufficient background in database systems and statistics is recommended.
CSCI-621
Credits 3
This course provides a broad introduction to database management systems including data modeling, the relational model, and SQL. Database system implementation issues are covered next, where the focus is on data structures and algorithms used to implement database management systems. Topics include physical data organizations, indexing and hashing, query processing and optimization, database recovery techniques, transaction management, concurrency control, and database performance evaluation. Current research topics in database system implementation are also explored. Programming projects, a term paper, and presentations will be required.
CSCI-622
Credits 3
This course examines policies, methods and mechanisms for securing enterprise and personal data and ensuring data privacy. Topics include data integrity and confidentiality; access control models; secure database architectures; secure transaction processing; information flow, aggregation, and inference controls; auditing; securing data in contemporary (relational, XML and other NoSQL) database systems; data privacy; and legal and ethical issues in data protection. Programming projects are required.
CSCI-720
Credits 3
This course provides a graduate-level introduction to the concepts and techniques used in data mining. Topics include the knowledge discovery process; prototype development and building data mining models; current issues and application domains for data mining; and legal and ethical issues involved in collecting and mining data. Both algorithmic and application issues are emphasized to permit students to gain the knowledge needed to conduct research in data mining and apply data mining techniques in practical applications. Data mining projects, a term paper, and presentations are required.
CSCI-721
Credits 3
This course provides an introduction to the concepts and techniques used in preparing data for subsequent data mining. Topics include the knowledge discovery process; data exploration and its role; data extraction, cleaning, integration and transformation; handling numeric, unstructured, text, web, and other forms of data; and ethical issues underlying data preparation and mining. Data cleaning projects, a term paper, and presentations are required.
CSCI-722
Credits 3
Building on prior knowledge of data analytics, this course brings in the impact of natural language processing and cognitive computing on data analysis. Topics include an overview of natural language processing; data mining, information retrieval and knowledge processing; corpus identification and preparation; training and test data and methods; current research in the field; and ethical concerns. Students will apply the concepts learned in class through team projects, programming assignments, presentations, and a research paper.
CSCI-724
Credits 3
This course introduces fundamental concepts of Web services and the Service-Oriented Computing (SOC) paradigm, and reviews seminal work, current research, and modern practices in these areas. Topics in Web Services include XML; reference model (WSDL, UDDI, SOAP); service coordination and composition; and service security and privacy. Big data analytics in SOC will also be covered, such as large scale service data retrieval and storage, service clustering and classification, service recommendation, and service discovery. Students will apply the concepts learned in the class through programming assignments and a comprehensive term project.
CSCI-729
Credits 3
This course examines current topics in Data Management. It is intended to allow faculty to pilot potential new graduate offerings. Specific course details (such as prerequisites, course topics, format, learning outcomes, assessment methods, and resource needs) will be determined by the faculty member(s) who propose a specific topics course in this area. Specific course instances will be identified as belonging to the Data Management cluster, the Security cluster, or both clusters.
DSCI-633
Credits 3
A foundations course in data science, emphasizing both concepts and techniques. The course provides an overview of data analysis tasks and the associated challenges, spanning data preprocessing, model building, model evaluation, and visualization. Major families of data analysis techniques covered include classification, clustering, association analysis, anomaly detection, and statistical testing. The course includes a series of programming assignments which will involve implementation of specific techniques on practical datasets from diverse application domains, reinforcing the concepts and techniques covered in lectures.
DSCI-644
Credits 3
This course focuses on the software engineering challenges of building scalable and highly available big data software systems. It covers software design and development methodologies and available technologies that address the major software aspects of a big data system, including software architectures, application design patterns, different types of data models and data management, and deployment architectures.
DSCI-650
Credits 3
This course covers concurrent, parallel, and distributed programming paradigms and methodologies, with a focus on implementing them for applied data science or scientific computing tasks. In particular, the course focuses on developing software using graphics processing units (GPUs) and the Message Passing Interface (MPI), with an emphasis on properly handling large-scale, real-world data as part of these applications. The course also teaches scalability and load balancing techniques for developing efficient distributed systems. Programming assignments are required.

Research Projects

  • NoSQL Database Normalization [Mior]: Constructing a normalized model from a NoSQL database is a challenging problem. Traditional normalization algorithms such as lossless join BCNF decomposition fail to appropriately handle the forms of denormalization present in NoSQL databases. Appropriate algorithms for constructing a normalized schema from NoSQL databases are a critical step in performing meaningful data integration with other sources. https://michael.mior.ca/projects/eson
  • Column2Vec [Mior, Ororbia]: Column2Vec is a distributed representation of database columns based on column metadata. Our distributed representation has several applications. Using known names for groups of columns (i.e., a table name), we train a model to generate an appropriate name for columns in an unnamed table. We demonstrate the viability of our approach using schema information collected from open source applications on GitHub. https://arxiv.org/abs/1903.08621
  • Data quality and security evaluation framework for mobile devices platform [Reznik]: The project builds a proof-of-concept design that will be used to develop, verify, and promote a comprehensive methodology for data quality and cybersecurity (DQS) evaluation. The methodology integrates cybersecurity with other diverse metrics reflecting DQS, such as accuracy, reliability, timeliness, and safety, into a single methodological and technological framework. The framework will include generic data structures and algorithms covering DQS evaluation. While the evaluation techniques will cover a wide range of data sources, from cloud-based data systems to embedded sensors, the framework implementation will concentrate on mobile devices owned by ordinary users, and on Android-based smartphones in particular.
  • Intelligent Security Systems [Reznik]: The project designs a curriculum for a college-level course on Intelligent Security Systems, develops the course materials, tests and evaluates them in real college classroom settings, and prepares and submits them for dissemination. To facilitate interconnections with other courses and inclusion in national cybersecurity curricula, the course is composed of nine separate modules. Five modules cover specialized topics: a review of the current state of cybersecurity and its current problems and approaches; firewall design; intrusion detection systems; anti-malware methods and tools; and hacking activity and attack recognition and prevention. The other modules provide additional support for course preparation, such as test and exam questions, course project and research assignment specifications, and tool presentation descriptions. The course idea is innovative in that it merges knowledge areas as diverse as artificial intelligence and machine learning techniques with computer security systems and applications. The course combines theoretical knowledge with practical skills development, gives students specialized knowledge in a demanding domain, and prepares them better for their practical work ahead. It also advances students' research, communication, and presentation skills.