Erick Bautista - Portfolio

Project 1: Data Engineering (GCP)

This is a project I did for ETL in Google Clousd Platform.
Data was a dummy created with Python.

Technologies Used

Python, SQL (BigQuery), Google Data Fusion, Looker

WorkFlow

Project Design: Draw.io
Data Generation: Python code
ETL: GCP Enviroment, Instance, Data Fusion, Pipeline, Big Query
Data Visualization: Looker

Design

Python Code Snippet: Data Loading

import csv
from faker import Faker
import pandas as pd

fake = Faker()

def sanitize_text(text):
    """Removes newlines, extra spaces, and ensures proper formatting."""
    return text.replace('\n', ' ').replace('\r', ' ').replace(',', ' ').strip()


def generate_employee_data(num_employees=1000):
    with open("cleaned_employee_data.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = [
            "employee_id", "first_name", "last_name", "email", "phone_number", "address", "birthdate", 
            "hire_date", "job_title", "department", "salary", "password"
        ]

View Full Code on GitHub

GCP Composer

GCP Instances

GCP Data Fusion Pipeline

GCP Big Query

GCP Looker

Project 2: Blood Donor (PySpark|Machine Learning)

This is a project I did to Predict if a patient is Hep or not based parameter.
Dataset contains laboratory values of blood donors and Hepatitis C patients and demographic values like age.

WorkFlow

Data Prep
Feature Engineering
Build Model
Evaluate

Feature Engineering

Numberical Values
Vectorization
Scaling

Technologies

Google Colab, PySpark, Machine Learning (Linear Regression), Pandas, Numpy, matplotlib, seaborn

Google Colab code snippet: Pyspark Session

!pip install pyspark

# Load our Pkgs
from pyspark import SparkContext

# Spark
spark = SparkSession.builder.appName("MLwithSpark").getOrCreate()

# Load our dataset
df = spark.read.csv("/content/drive/MyDrive/Colab Notebooks/Data/hcvdata.csv",header=True,inferSchema=True)

Google Colab code snippet: Logistic Model

train_df,test_df = vec_df.randomSplit([0.7,0.3])
from pyspark.ml.classification import LogisticRegression,DecisionTreeClassifier

# Logist Model
lr = LogisticRegression(featuresCol='features',labelCol='Target')
lr_model = lr.fit(train_df)
y_pred = lr_model.transform(test_df)
y_pred.show()
y_pred.select('target','rawPrediction', 'probability', 'prediction').show()

Google Colab code snippet: Model Evaluation

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# How to Check For Accuracy
multi_evaluator = MulticlassClassificationEvaluator(labelCol='Target',metricName='accuracy')
multi_evaluator.evaluate(y_pred)
from pyspark.mllib.evaluation import MulticlassMetrics
lr_metric = MulticlassMetrics(y_pred['target', 'prediction'].rdd)

print("Accuracy",lr_metric.accuracy)
print("Precision",lr_metric.precision(1.0))
print("Recall",lr_metric.recall(1.0))
print("F1Score",lr_metric.fMeasure(1.0))

View Full Code on GitHub

Heatmap

—

Project 3: Sentiment Analysis on X (Data Science)

This is a project I did for Sentiment Analysis on X (Twitter).
Data was get from X through their API. (Sentiment analysis with data from twitter:)

WorkFlow

Preparation: textblob, tweepy, pycountry, wordcloud, langdetect
Authentication for Twitter API
Getting Tweets With Keyword or Hashtag
Extracting text values

Google Colab code snippet: Authentication for Twitter API

auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth)

Google Colab code snippet: Getting Tweets With Keyword or Hashtag

for tweet in tweets:
    
    #print(tweet.text)
    tweet_list.append(tweet.text)
    analysis = TextBlob(tweet.text)
    score = SentimentIntensityAnalyzer().polarity_scores(tweet.text)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    polarity += analysis.sentiment.polarity

Google Colab code snippet: Creating new data frames for all sentiments (positive, negative and neutral)

tw_list_negative = tw_list[tw_list["sentiment"]=="negative"]
tw_list_positive = tw_list[tw_list["sentiment"]=="positive"]
tw_list_neutral = tw_list[tw_list["sentiment"]=="neutral"]

Erick Bautista

Erick Bautista - Portfolio

Project 1: Data Engineering (GCP)

Technologies Used

WorkFlow

Design

Python Code Snippet: Data Loading

GCP Composer

GCP Instances

GCP Data Fusion Pipeline

GCP Big Query

GCP Looker

Project 2: Blood Donor (PySpark|Machine Learning)

WorkFlow

Feature Engineering

Technologies

Google Colab code snippet: Pyspark Session

Google Colab code snippet: Logistic Model

Google Colab code snippet: Model Evaluation

Heatmap

Project 3: Sentiment Analysis on X (Data Science)

WorkFlow

Google Colab code snippet: Authentication for Twitter API

Google Colab code snippet: Getting Tweets With Keyword or Hashtag

Google Colab code snippet: Creating new data frames for all sentiments (positive, negative and neutral)

Sentiment Analysis for word: “UCM”

Positive Sentiment for word: “UCM”

Negative Sentiment for word: “UCM”

Project 4: Data Visualization with Tableau

1. Sales Analisis in the USA. see on Public Tableau

2. Analysis of Sales and Profitability in EU. see on Public Tableau

3. Customer Segmentation in the UK. see on Public Tableau

4. World Tertiary Education, STEM vs non STEM. see on Public Tableau