CI7340 Programming | Jupyter Notebook – Study Room
Question:

Note: This assessment is only meant to be submitted by students who have been given a reassessment opportunity by the Assessment Board. If you have already passed this module, please DO NOT submit this assignment.

Your retake assessment consists of three different but related components:
1) Research report, worth 40% of your total marks for this module.
2) Practical report, worth 50% of your total marks for this module.
3) Oral presentation, worth 10% of your total marks for this module.

The assessment described below combines the report and practical components. Your oral presentation will be arranged by your module leader, who will inform you of the arrangements. Please note that all three components are linked, and your oral presentation performance will be taken into account when grading your whole report (including but not limited to checking for any academic misconduct).

Dataset

You are being provided with a Bank Marketing Data Set. The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’). The required dataset is available here. It consists of 41188 observations with 20 input features and one output feature. The data is ordered by date (from May 2008 to November 2010). The names and meanings of the input variables and the output (target) variable are listed below.

This section should include an introduction to the field of data science and to the industry/application/domain that the dataset represents. You are required to formulate a problem statement and explain the approach you propose to take to solve the identified problem. Identify the (research) questions (minimum 8, maximum 10) that you plan to investigate in this report.
Define the objectives, intended research methodology and expected outcomes. Describe your workflow, supported with figures.

Tools, Dataset and Initial Data Analysis (35%)

a) Programming Language and Tools (8%): In this section, you are required to discuss different tools (3-5) and methods available for data analysis. Discuss their features, advantages and disadvantages, your choice of programming language (Python) and the related libraries that you will be using. What are the advantages and disadvantages of your selected tool and its associated libraries? Which functionality or features will you be using for your analysis, and why?

b) Dataset(s) (7%): Read the dataset into the Jupyter notebook and discuss its source, its format (structured/unstructured), the type of data, the columns, the type of data in each column, etc.

c) Initial Data Analysis (20%): Perform Initial Data Analysis using Jupyter Notebook (including but not limited to):
1) Quality of data:
   a) Frequency counts
   b) Descriptive/summary statistics (mean, median, standard deviation, etc.)
   c) Normality (frequency histograms), if applicable
2) Quality of measurements, if applicable
3) Data transformation: data merging (features and labels), transpose, data type change, data sorting, data deletion, etc.
4) Characteristics of dataset:
   a) Printing top and bottom 8 rows
   b) Basic plots
   c) Correlation analysis

Discuss the quality of the data that you have. Are there any missing values? Is the data clean? Are different transformation methods required for your further analysis? Make sure that you discuss the different methods of data cleaning, the different methods of accounting for missing data (null values), pre-processing, and feature transformation. What are the different types of data wrangling skills (e.g., extraction, merging, and/or construction of an analytical data set) that can be used?
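The Initial Data Analysis checklist above can be sketched roughly as follows. This is a minimal illustration, not a model answer: the small DataFrame is a hand-made stand-in for the real file, and the column names (`age`, `job`, `duration`, `y`), the values, and the use of the string "unknown" as a missing-value placeholder are all assumptions made for the example. With the real data you would replace the literal frame with a call such as `pd.read_csv(...)` on the file you were given.

```python
import pandas as pd

# Toy stand-in for the bank marketing data. Column names and values are
# illustrative assumptions, not taken from the real file.
df = pd.DataFrame({
    "age": [30, 45, 52, 28, 39, 61, 33, 47, 25, 58],
    "job": ["admin.", "technician", "retired", "student", "admin.",
            "retired", "services", "unknown", "student", "technician"],
    "duration": [120, 300, 85, 410, 60, 230, 150, 95, 500, 75],
    "y": ["no", "yes", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
})

# 1) Quality of data: frequency counts and summary statistics
job_counts = df["job"].value_counts()
summary = df.describe()                      # numeric columns only
medians = df[["age", "duration"]].median()

# Missing values: NaN count per column, plus an assumed "unknown" placeholder
missing = df.isna().sum()
unknown_jobs = int((df["job"] == "unknown").sum())

# 3) Data transformation: type change, sorting, deletion
df["job"] = df["job"].astype("category")
df["subscribed"] = (df["y"] == "yes").astype(int)   # numeric copy of the target
df_sorted = df.sort_values("age", ascending=False)
df_clean = df[df["job"] != "unknown"]                # drop placeholder rows

# 4) Characteristics of the dataset: top/bottom 8 rows and correlation
top8 = df.head(8)
bottom8 = df.tail(8)
corr = df[["age", "duration", "subscribed"]].corr()

print(job_counts, summary, medians, missing, top8, bottom8, corr, sep="\n\n")
```

Whether rows with placeholder values should be deleted, kept as their own category, or imputed is one of the choices the brief asks you to discuss; the deletion shown here is only one option.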
Exploratory Data Analysis (35%)

a) Introduction to EDA (5%): In this section, you need to discuss EDA. Make sure to also discuss whether further data cleaning or transformation is required. What are the different approaches, and which ones do you think will be suitable for the data in hand?

b) Descriptive Statistics (10%) (including but not limited to): In this section, discuss the various statistical methods. Discuss which ones you think will be relevant for understanding your data and helping to answer your research questions, and why. You need to present a detailed discussion of the various methods and their suitability for your use case.

c) Data Visualization (20%): What is data visualization, and how and why is it used? Discuss the best practices. What are the different types of graphs (pie chart, scatter plot, bar plot, histogram), and when are they used? Which visualization methods do you think will be appropriate for your research questions, and why? (Note: You need to justify at least one visualization method for each of your research questions.) Using appropriate plots, present the results for the research questions previously identified in the introduction section. You need to use both matplotlib and seaborn for your visualization.

Tasks:

You are asked to submit your presentation in .pptx or .pdf format before your presentation. You will be required to provide a demo of the code (Jupyter Notebook) and will be asked questions about the code and implementation. Hence, make sure that you have prepared beforehand so that you can show the required application running during your presentation.
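As a rough illustration of the descriptive statistics and data visualization requirements above, the sketch below computes a grouped summary and draws one matplotlib plot and two seaborn plots on shared axes. The DataFrame is a hand-made stand-in, and its column names (`age`, `job`, `duration`, `y`) and values are assumptions for the example rather than the real dataset; the plot choices are examples, not the required answer to any research question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in data. Column names and values are illustrative assumptions.
df = pd.DataFrame({
    "age": [30, 45, 52, 28, 39, 61, 33, 47, 25, 58],
    "job": ["admin.", "technician", "retired", "student", "admin.",
            "retired", "services", "unknown", "student", "technician"],
    "duration": [120, 300, 85, 410, 60, 230, 150, 95, 500, 75],
    "y": ["no", "yes", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
})

# Descriptive statistics: call duration summarised per outcome
by_outcome = df.groupby("y")["duration"].agg(["mean", "median", "std"])
print(by_outcome)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram (matplotlib): distribution of a single numeric variable
axes[0].hist(df["age"], bins=5, edgecolor="black")
axes[0].set_title("Age distribution")
axes[0].set_xlabel("age")
axes[0].set_ylabel("count")

# Count plot (seaborn): category frequencies split by outcome
sns.countplot(data=df, x="job", hue="y", ax=axes[1])
axes[1].set_title("Outcome by job")
axes[1].tick_params(axis="x", rotation=45)

# Scatter plot (seaborn): relationship between two numeric variables
sns.scatterplot(data=df, x="age", y="duration", hue="y", ax=axes[2])
axes[2].set_title("Call duration vs age")

fig.tight_layout()
```

Each panel has a title and labelled axes, which is one of the basic practices the brief asks you to discuss; in a notebook the figure renders inline once the backend line is removed.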