bhavyasj/data-engineering-project

Data Engineering Project

This repository documents my work, progress, and contributions to the Data Engineering Project, undertaken as part of the Applied Data Science and Artificial Intelligence program.

Project Overview

The goal of this project is to design and implement a complete data engineering pipeline involving:

  • Web data crawling from multiple dynamic websites
  • Collection of structured data and media content
  • Distributed storage using Hadoop HDFS
  • Data redundancy and fault-tolerance testing
  • Database design and implementation
  • Data ingestion from HDFS into a relational database
  • Business intelligence and analytical query development
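
As a rough illustration of the last two steps, the sketch below loads newline-delimited JSON records into a relational table and runs one aggregate query. It is only a sketch under assumptions that are not taken from this project: SQLite stands in for the relational database, the records are assumed to have already been read out of HDFS as text lines, and the `products` table, its columns, and the sample rows are hypothetical.

```python
import json
import sqlite3

# Hypothetical sample: lines as they might look after being read from a file in HDFS.
SAMPLE_LINES = [
    '{"site": "shop-a", "title": "Laptop", "price": 900.0}',
    '{"site": "shop-a", "title": "Mouse", "price": 20.0}',
    '{"site": "shop-b", "title": "Laptop", "price": 950.0}',
    'not valid json',
]


def parse_lines(lines):
    """Yield (site, title, price) tuples, skipping malformed lines."""
    for line in lines:
        try:
            rec = json.loads(line)
            yield rec["site"], rec["title"], float(rec["price"])
        except (ValueError, KeyError, TypeError):
            # Bad JSON, a missing field, or a non-numeric price: skip the line.
            continue


def ingest(conn, lines):
    """Create the (hypothetical) products table if needed and insert all valid rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "site TEXT NOT NULL, title TEXT NOT NULL, price REAL NOT NULL)"
    )
    rows = list(parse_lines(lines))
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    return len(rows)


def average_price_per_site(conn):
    """Example analytical query: average price grouped by source site."""
    cur = conn.execute(
        "SELECT site, AVG(price) FROM products GROUP BY site ORDER BY site"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(ingest(conn, SAMPLE_LINES))
    print(average_price_per_site(conn))
```

Skipping malformed lines rather than aborting keeps one bad crawl record from blocking a whole batch; a real pipeline would probably also log or count the rejected lines.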

Repository Purpose

This repository serves as a record of:

  • Project development progress
  • Individual contributions
  • Source code and scripts
  • Configuration files
  • Documentation
  • Testing results
  • Data processing workflows

Technologies Used

  • Python
  • Playwright
  • Hadoop HDFS
  • Apache Spark
  • Docker
  • SQL
  • Git & GitHub
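
To show how Python and Playwright might be combined for the crawling step, here is a minimal sketch. The helper names `clean_text` and `crawl_texts`, the example URL, and the selector are hypothetical and not taken from this repository. The Playwright import is deferred into the function so that the text-cleaning helper can be used on machines without a browser installed.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace and trim the ends of a scraped string."""
    return re.sub(r"\s+", " ", raw).strip()


def crawl_texts(url, selector):
    """Render a dynamic page in headless Chromium and return the cleaned
    text of every element matching the CSS selector."""
    # Imported here so the module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so scripted content is present.
            page.goto(url, wait_until="networkidle")
            texts = page.locator(selector).all_inner_texts()
        finally:
            browser.close()
    return [clean_text(t) for t in texts if t.strip()]


# Example call (hypothetical URL and selector):
#   crawl_texts("https://example.com", "h1")
```

Waiting for `networkidle` is a simple way to handle pages that load content with JavaScript, though waiting for a specific selector is often more reliable on busy sites.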

Current Work

The repository is continuously updated throughout the project lifecycle to document:

  • Data collection activities
  • Data storage implementation
  • Cluster setup and configuration
  • Database development
  • Query creation and analysis
  • Testing and optimization
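
For the storage and testing items above, the following sketch builds the HDFS command lines one might use to upload a file with a chosen replication factor and then inspect its block placement. It assumes the `hdfs` command-line client is installed and on the `PATH`; the function names and the paths in the example are hypothetical, and this is not necessarily how the project itself drives the cluster.

```python
import subprocess


def put_command(local_path, hdfs_path, replication=3):
    """Build an `hdfs dfs -put` command that overwrites the target and
    sets the replication factor for this upload."""
    return [
        "hdfs", "dfs",
        "-D", f"dfs.replication={replication}",
        "-put", "-f", local_path, hdfs_path,
    ]


def fsck_command(hdfs_path):
    """Build an `hdfs fsck` command that reports files, blocks, and the
    DataNodes holding each replica."""
    return ["hdfs", "fsck", hdfs_path, "-files", "-blocks", "-locations"]


def run(cmd):
    """Run a command and return its standard output; raises on a non-zero exit."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Example (hypothetical paths; requires a running cluster):
#   run(put_command("products.jsonl", "/data/raw/products.jsonl"))
#   print(run(fsck_command("/data/raw/products.jsonl")))
```

Running the `fsck` report before and after stopping a DataNode is one way to observe whether the replicas are still available, which fits the redundancy and fault-tolerance testing described in the overview.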
