Format
The term project will be delivered in teams of 4 students each.
Goal
- You will create a data product suitable for analysis and submit it together with documentation outlining the technical choices you made. The purpose of this assignment is for you to apply all the concepts learned during the course.
- Your grade depends on the number of different tools and concepts you apply and how appropriately you apply them. The topic of the data product has no bearing on your grade, but you may want to use the data you plan to work with in your Capstone Project.
High-level requirements
- Combine at least two distinct datasets
- Persist at least one dataset in a database (SQL or NoSQL); the rest can be loaded directly from a file
- Use an API to enrich your data
- Build an ETL data pipeline with Knime
- Do some data cleaning
- Do some basic analytics or visualization at the end of the pipeline
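To make the requirements above concrete, here is a minimal Python sketch of the same steps on toy data. It is only an illustration of the flow and not a substitute for the required Knime workflow. Everything in it is an assumption made for the example: the sample orders and customers, the table and column names, the in-memory SQLite database, and the `fetch_country_name` function, which is a local stand-in for a real API call.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; a real project would read an actual file.
ORDERS_CSV = """order_id,customer_id,amount
1,10,25.50
2,11,
3,10,14.50
4,12,100.00
5,11,40.00
"""

# Hypothetical second dataset, persisted in the database below.
CUSTOMERS = [(10, "Anna", "HU"), (11, "Ben", "AT"), (12, "Cleo", "HU")]


def fetch_country_name(code):
    """Stand-in for an API call; a real pipeline would query a web service."""
    return {"HU": "Hungary", "AT": "Austria"}.get(code, "Unknown")


def extract_orders(text):
    """Read order rows from CSV text (the dataset loaded directly from a file)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_orders(rows):
    """Drop rows with a missing amount and convert fields to proper types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # data cleaning: skip incomplete records
        cleaned.append(
            (int(row["order_id"]), int(row["customer_id"]), float(row["amount"]))
        )
    return cleaned


def run_pipeline():
    """Combine both datasets, enrich with country names, total sales per country."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, country_code TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", CUSTOMERS)
    # The cleaned file data is staged in the same database so it can be joined.
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        clean_orders(extract_orders(ORDERS_CSV)),
    )
    rows = conn.execute(
        "SELECT c.country_code, SUM(o.amount) "
        "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
        "GROUP BY c.country_code ORDER BY c.country_code"
    ).fetchall()
    conn.close()
    # Enrichment step: replace country codes with names from the (stubbed) API.
    return {fetch_country_name(code): total for code, total in rows}


if __name__ == "__main__":
    print(run_pipeline())
```

In a Knime workflow each function would roughly correspond to one or more nodes: a file reader, a database connector and writer, a request to the API, a row filter for cleaning, and a join plus aggregation before the final analytics or visualization.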
Delivery
Store and hand over the project artifacts in a folder named “Term2” in a GitHub repo (the repo of one team member).
The main artifact submitted is a 1-2 page report (including figures and data citation), with the following requirements:
- document the solution provided
- document the technical choices made
- document the data model (ER diagram for RDBMS)
- document the analytics and/or visualization
Artifacts to be submitted:
- Term project report
- PowerPoint presentation (or similar): material for the 7-minute team presentation
- Knime workflow file
- Additional artifacts to run the workflow (with description)
Reproducibility: the project should be reproducible in a straightforward manner. In other words, we should be able to run your code and obtain the same outcome as you.
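One simple way to check that two runs produce the same outcome is to compare a fingerprint of the final output. The snippet below is a small Python sketch of that idea, assuming the output can be read as a list of rows; the function name `fingerprint` and the choice to ignore row order are assumptions made for this example, not course requirements.

```python
import hashlib


def fingerprint(rows):
    """Return a SHA-256 hex digest of the rows, independent of row order."""
    digest = hashlib.sha256()
    # Sort the serialized rows so that a different row order gives the same digest.
    for line in sorted(",".join(str(value) for value in row) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


if __name__ == "__main__":
    first_run = [(1, "a"), (2, "b")]
    second_run = [(2, "b"), (1, "a")]
    # Same content in a different order yields the same fingerprint.
    print(fingerprint(first_run) == fingerprint(second_run))
```

If the fingerprints of your run and a fresh run from the repo differ, something in the pipeline (for example an API response or an unpinned input file) is not reproducible and is worth describing in the documentation.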
Grading criteria
- Fitness of the input dataset to the purpose: 5 points
- Complexity of the input dataset: 5 points
- Use of concepts covered in class: 10 points
- Knime pipeline: 15 points
- Using database(s): 10 points
- API as data source: 10 points
- Delivery: Naming, structure: 5 points
- Delivery: Report: 10 points
- Delivery: Presentation: 20 points
- Reproducibility: 10 points
Extra points:
- Using NoSQL in the project
- Using cloud instead of local servers
- Anything special not covered during the course that makes sense in the project context
Submission and deadlines
4th December - All materials should be committed to GitHub. Submit the GitHub link on Moodle when you are ready.
6th December (17:00 - 20:00)
- We will start the session with a 10-minute quiz (questions from the course, covering only the topics we discussed)
- After the quiz, teams will present their results online on Teams
- The order of presentation will be by team number
- Every team will have 7 minutes to present, followed by 7 minutes of Q&A