Project 1: Data Engineering (GCP)

This is an ETL project I built on Google Cloud Platform.
The data is dummy data generated with Python.

Technologies Used



Python Code Snippet: Data Loading

import csv
from faker import Faker
import pandas as pd

fake = Faker()

def sanitize_text(text):
    """Removes newlines, extra spaces, and ensures proper formatting."""
    return text.replace('\n', ' ').replace('\r', ' ').replace(',', ' ').strip()

def generate_employee_data(num_employees=1000):
    with open("cleaned_employee_data.csv", "w", newline="", encoding="utf-8") as csvfile:
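        fieldnames = [
            "employee_id", "first_name", "last_name", "email", "phone_number", "address", "birthdate",
            "hire_date", "job_title", "department", "salary", "password"
        ]
        # Write the header row; the row-generation loop is cut off in this snippet.
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()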

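A side note on the comma stripping in sanitize_text: Python's csv module already quotes fields that contain commas or line breaks, so such values survive a write/read round trip. The sketch below illustrates this with an in-memory buffer. The sample address is made up for illustration and is not taken from the project's data. Stripping line breaks can still be worthwhile if a downstream loader does not accept quoted newlines.

```python
import csv
import io

# Made-up sample value containing both a comma and a line break.
address = "12 Example Street, Apt 4\nSpringfield"

# Write one header row and one data row to an in-memory buffer.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["employee_id", "address"])
writer.writerow([1, address])

# Read the rows back; the csv module unquotes the field again.
buffer.seek(0)
rows = list(csv.reader(buffer))

# True when the address came back unchanged.
round_trip_ok = rows[1][1] == address
print(round_trip_ok)
```
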
GCP Composer

GCP Instances


GCP Data Fusion Pipeline


GCP BigQuery


GCP Looker


Project 2: Blood Donor (PySpark|Machine Learning)

This is a project I did to predict whether a patient has Hepatitis C based on laboratory parameters.
The dataset contains laboratory values of blood donors and Hepatitis C patients, plus demographic values such as age.


Feature Engineering


Google Colab code snippet: PySpark Session

!pip install pyspark

# Load our Pkgs
from pyspark.sql import SparkSession

# Spark
spark = SparkSession.builder.appName("MLwithSpark").getOrCreate()

# Load our dataset
df = spark.read.csv("/content/drive/MyDrive/Colab Notebooks/Data/hcvdata.csv", header=True, inferSchema=True)

Google Colab code snippet: Logistic Model

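from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier

# Split the feature DataFrame (vec_df, prepared earlier) into 70% train and 30% test
train_df, test_df = vec_df.randomSplit([0.7, 0.3])

# Logistic Model
lr = LogisticRegression(featuresCol='features', labelCol='Target')
lr_model = lr.fit(train_df)
y_pred = lr_model.transform(test_df)
y_pred.select('target', 'rawPrediction', 'probability', 'prediction').show()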
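
Spark's randomSplit samples rows rather than cutting at an exact index, so the 70/30 proportions are approximate, and it accepts an optional seed for reproducibility. The plain-Python sketch below only mimics that idea. The random_split helper is a hypothetical stand-in, not Spark's implementation, and the rows are placeholder integers.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # close to 700 / 300, usually not exact
print(train_a == train_b)         # same seed gives the same split
```

Fixing the seed is what makes a later accuracy figure comparable between runs.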

Google Colab code snippet: Model Evaluation

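from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

# How to Check For Accuracy
multi_evaluator = MulticlassClassificationEvaluator(labelCol='Target', metricName='accuracy')
multi_evaluator.evaluate(y_pred)

# MulticlassMetrics expects an RDD of (prediction, label) pairs as floats
prediction_and_label = y_pred.select('prediction', 'Target').rdd.map(lambda row: (float(row[0]), float(row[1])))
lr_metric = MulticlassMetrics(prediction_and_label)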
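
To make the accuracy metric concrete, here is a minimal plain-Python sketch of what it measures: the share of rows where the prediction equals the label. The helper names accuracy and confusion_counts are hypothetical, and the sample pairs are made up rather than taken from the project's results.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of (prediction, label) pairs where both values match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)

def confusion_counts(pairs):
    """Count how often each (label, prediction) combination occurs."""
    return Counter((label, prediction) for prediction, label in pairs)

# Made-up (prediction, label) pairs, purely for illustration.
sample = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]
print(accuracy(sample))          # 3 of the 4 pairs match
print(confusion_counts(sample))
```

On an imbalanced dataset the per-class counts are often more informative than the single accuracy number.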


Project 3: Sentiment Analysis on X (Data Science)

This is a project I did for sentiment analysis on X (Twitter).
The data was retrieved from X through its API.


Google Colab code snippet: Authentication for Twitter API

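import tweepy

# consumerKey, consumerSecret, accessToken and accessTokenSecret hold your own API credentials
auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth)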

Google Colab code snippet: Getting Tweets With Keyword or Hashtag

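from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # assumes NLTK's VADER

# `tweets` is assumed to hold the tweets already fetched through the API
analyzer = SentimentIntensityAnalyzer()  # create once, reuse for every tweet
polarity = 0
for tweet in tweets:
    analysis = TextBlob(tweet.text)
    score = analyzer.polarity_scores(tweet.text)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    polarity += analysis.sentiment.polarity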
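
The next snippet filters on a sentiment column, but the labelling step itself is not shown. One plausible rule, assumed here rather than taken from the project, is to compare the neg and pos scores and call ties neutral. The function name label_sentiment and the sample scores are hypothetical.

```python
def label_sentiment(neg, pos):
    """Return 'negative', 'positive' or 'neutral' from two VADER scores."""
    if neg > pos:
        return "negative"
    if pos > neg:
        return "positive"
    return "neutral"

# Made-up score dictionaries in the shape polarity_scores() returns.
scores = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_sentiment(s["neg"], s["pos"]) for s in scores]
print(labels)
```

Thresholding the compound score is a common alternative; either rule yields the labels the filters rely on.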

Google Colab code snippet: Creating new data frames for all sentiments (positive, negative and neutral)

tw_list_negative = tw_list[tw_list["sentiment"]=="negative"]
tw_list_positive = tw_list[tw_list["sentiment"]=="positive"]
tw_list_neutral = tw_list[tw_list["sentiment"]=="neutral"]

Sentiment Analysis for word: “UCM”

Positive Sentiment for word: “UCM”

Negative Sentiment for word: “UCM”

Project 4: Data Visualization with Tableau

1. Sales Analysis in the USA. See on Tableau Public

2. Analysis of Sales and Profitability in the EU. See on Tableau Public

3. Customer Segmentation in the UK. See on Tableau Public

4. World Tertiary Education, STEM vs. non-STEM. See on Tableau Public