Hi, I'm Nina Plotko

And I'm a Senior Data Scientist at ADP. I earned my B.S. in Industrial and Systems Engineering at the Georgia Institute of Technology, with a concentration in Data Analytics and Science. This is my data science portfolio. Scroll down to see my coursework, skills, and project experience. You can also check out some of my work on GitHub and find me on LinkedIn.

Coursework at Georgia Tech

Data Input and Manipulation, Linear Algebra, Calculus (up to Multivariable), Discrete Mathematics, Probability with Applications, Statistics, Regression and Forecasting, Database Systems, Engineering Economy, Simulation Analysis and Design, Optimization (Engineering Optimization, Advanced Optimization), Stochastic Systems (Stochastic Manufacturing and Service Systems, Advanced Stochastic Systems), Machine Learning, Methods for Quality Improvement, Human-Computer Interaction, Foundations of Modern Data Science.



My Data Science Skills

Large Language Models (LLMs) and Natural Language Processing (NLP)

Prior to the popular release of GPT-3.5, I was using classic NLP techniques like embedding models (BERT), calculating semantic distance, and building sentiment/topic classifiers. Now, I have over a year of experience using LLMs for retrieval-augmented Q&A, content generation, and knowledge engineering. I'm skilled in combining traditional NLP tools with prompting techniques to create high-quality and reliable output from LLMs.
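As a toy illustration of the semantic-distance idea mentioned above, here is a minimal cosine-similarity sketch; the small hand-made vectors stand in for real BERT embeddings, which would have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model outputs
query = np.array([0.9, 0.1, 0.0, 0.3])
doc_a = np.array([0.8, 0.2, 0.1, 0.4])   # semantically close to the query
doc_b = np.array([0.0, 0.9, 0.8, 0.1])   # semantically far from the query

print(cosine_similarity(query, doc_a))  # higher score
print(cosine_similarity(query, doc_b))  # lower score
```

The same scoring, applied at scale over document embeddings, is what powers retrieval in the RAG projects described below.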

Data Visualizations

Using Python packages like Matplotlib, Seaborn, and Plotly, I create meaningful visualizations to drive my exploratory data analysis and convey findings from predictive models. I create scatter plots, bar and line charts, pie charts, heatmaps, and choropleth maps, as well as statistical graphs like boxplots and histograms. Aggregating and plotting data in various ways helps people outside of data specialties understand a dataset and spot trends.
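A minimal sketch of the statistical graphs mentioned above, using Matplotlib on synthetic data (the dataset here is invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=500)  # synthetic measurements

# Histogram and boxplot side by side: two standard views of a distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=30, color="steelblue")
ax1.set_title("Histogram")
ax2.boxplot(data)
ax2.set_title("Boxplot")
fig.tight_layout()
```

In practice I pair plots like these with aggregations (group means, counts) so non-specialists can read the trend at a glance.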


Machine Learning Algorithms

My deep foundational understanding of popular models like regression, neural networks, and k-nearest neighbors gives me an advantage when implementing algorithms with Scikit-Learn and Statsmodels. My data analytics skills support the ML pipeline during exploratory data analysis. I have a strong appreciation for hyperparameter tuning, where I use k-fold cross-validation, validation curves, and grid search to optimize parameters. I have trained models on standard datasets and corporate data, and I have also experimented with unsupervised learning techniques like LDA and dimensionality reduction techniques like PCA and UMAP.
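The tuning workflow described above can be sketched in a few lines: a grid search with 5-fold cross-validation over a k-nearest neighbors model, run here on synthetic data for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 5-fold cross-validated grid search over the number of neighbors
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern extends to any estimator and parameter grid; validation curves then show how the score moves across a single hyperparameter.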


Data Analytics

I have a passion for identifying trends in data. My statistical knowledge, combined with my technical skills, makes me an excellent data analyst. Using data manipulation libraries like Pandas and NumPy, I can calculate measures of central tendency and view the distributions of data to understand if, why, and how something occurred in a dataset. I have experience analyzing datasets from Kaggle, as well as web scraping and using APIs to analyze web data.
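A tiny sketch of the central tendency checks described above, using Pandas on made-up salary figures; a large gap between mean and median is a quick hint that skew or outliers are present:

```python
import pandas as pd

# Invented salary data for illustration; the last value is an outlier
df = pd.DataFrame({"salary": [48_000, 52_000, 55_000, 61_000, 250_000]})

summary = {
    "mean": df["salary"].mean(),      # pulled upward by the outlier
    "median": df["salary"].median(),  # robust to the outlier
    "std": df["salary"].std(),
}
print(summary)
```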


My Data Science Projects

Below are descriptions of some of the projects I have worked on. The projects with links lead to the code (in the form of a Jupyter notebook) on my personal GitHub.

LLM Search Snippets

Project as Data Scientist at ADP
This was a proof of concept for augmenting the Enterprise Search experience with LLM-powered Q&A. The PoC focused on creating "Google search snippets" for questions relating to payroll, retirement, benefits, and other HR topics. I owned the generation and testing portion of the project: prompt engineering, guardrails, and manipulating the data returned from the search results.

LLM Training Scenario Creation

Project as Data Scientist at ADP
This was a proof of concept for augmenting the Learning Business Partner role with an LLM-powered training scenario creation tool. The PoC focused on taking high-quality calls and using GPT models to create new scenarios for associates to learn from. This solution created high-quality training scenarios 20x faster than was possible without Generative AI.


LLM RAG Q&A for Human Resource Outsourcing

Project as Data Scientist at ADP
This was a proof of concept for augmenting the HR Business Partner role with LLM-powered Q&A. The PoC focused on leave under the Family and Medical Leave Act (FMLA). I uploaded government websites, regulations, and fact sheets to S3 and set up a vector database within AWS OpenSearch. I ran experiments and benchmarked the search parameters and prompts for different LLMs, including GPT-3, GPT-4, and Claude v1.
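The retrieval step of a RAG pipeline can be sketched in a few lines. This toy version uses NumPy cosine similarity in place of the actual OpenSearch vector index, and the documents and embeddings below are invented for illustration:

```python
import numpy as np

# Toy corpus of (text, embedding) pairs; a real system embeds these with a model
docs = [
    ("FMLA grants up to 12 weeks of unpaid leave.", np.array([0.9, 0.1, 0.2])),
    ("Payroll runs on a biweekly schedule.",        np.array([0.1, 0.9, 0.3])),
]

def retrieve(query_vec: np.ndarray, k: int = 1) -> list[str]:
    """Return the k documents closest to the query by cosine similarity."""
    scores = [
        float(np.dot(query_vec, emb) / (np.linalg.norm(query_vec) * np.linalg.norm(emb)))
        for _, emb in docs
    ]
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [text for _, (text, _) in ranked[:k]]

query_vec = np.array([0.8, 0.2, 0.1])  # stand-in for an embedded user question
context = retrieve(query_vec)
prompt = f"Answer using only this context:\n{context[0]}\nQuestion: How long is FMLA leave?"
print(prompt)
```

The assembled prompt is then sent to the LLM; grounding the answer in retrieved context is what keeps the Q&A reliable.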

Outlier Detection Algorithm

Project as Data Scientist at ADP
Upstream misclassifications and unexpected salaries were affecting a downstream deep learning model (a compensation benchmarking model). I developed an algorithm that identifies the misclassifications using semantic distance and statistical anomaly detection techniques. This algorithm now runs in production on monthly payroll data of roughly 20M+ records. The resulting filtered table is used to train the downstream model and also powers many of the Analytics products at ADP.
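The statistical side of an outlier filter like this can be sketched with a standard IQR rule; the salary figures here are invented, and the production algorithm also combines this with semantic distance:

```python
import numpy as np

def iqr_outliers(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Invented salaries within one job group; the last looks like an entry error
salaries = np.array([52_000, 55_000, 58_000, 61_000, 640_000])
print(iqr_outliers(salaries))
```

Applied per job group, a rule like this flags records for exclusion before the benchmarking model is retrained.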


Cost Estimation Tool for Honeywell Building Technologies

Senior Design Project
Honeywell approached the team about large deviations between its cost estimates and the actual costs incurred. I led the team in building one of the deliverables, the Cost Estimation Tool, which predicts the labor cost of a project. The tool has three regression models in the back end, one for each type of project. This project involved a meticulous feature selection process to determine which materials and project attributes should be used as variables. I led the regression assumption analysis, feature engineering and selection, and the training of the models. Each model is non-linear, with interactions between material type and project attributes. The project had another deliverable, the Risk Simulation Model; together, these deliverables won best overall Industrial Engineering project at the Georgia Tech Senior Design Capstone Expo.


IMDB Movie Review Analysis and Classification

Personal Project
Using the famous IMDB dataset (taken from Kaggle), I performed exploratory data analysis, created wordclouds, and pre-processed the text using TF-IDF vectorization. Then, I performed cross-validation on logistic regression and k-nearest neighbors models using sklearn's validation_curve to visualize how accuracy changes across hyperparameter settings. I also compared these models to LinearSVC. The packages used in this project were Pandas, Scikit-Learn, and SpaCy.
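A condensed sketch of the validation_curve step, using synthetic data in place of the vectorized IMDB text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the TF-IDF feature matrix
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Cross-validated train/validation accuracy across the regularization strength C
param_range = np.logspace(-3, 2, 6)
train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=param_range, cv=5,
)
print(val_scores.mean(axis=1))  # one mean accuracy per C value
```

Plotting the mean train and validation scores against the parameter range is what produces the curve used to pick the hyperparameter.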

Analysis of Sitewide Feedback

Project as Data Science and Machine Learning Intern at ADP
I led the project kickoff to streamline the process for viewing sitewide feedback on a product. I experimented with unsupervised learning models like LDA (topic modeling) and UMAP with HDBSCAN (clustering). I also built a multi-label topic classification model identifying whether feedback was about support, system performance, UI/UX issues, and more; to do this, I fine-tuned SBERT. I also trained a classifier to predict the sentiment of feedback.


Analysis of University Data

Personal Project
To learn more about Python's Plotly library, I performed exploratory data analysis on a dataset of university statistics. This dataset includes world rank, teaching scores, research scores, the number of students, the female/male ratio, and more. The exploratory data analysis consists of bar charts, line charts, wordclouds, and scatter plots, and everything built with Plotly is interactive. To learn what qualities make a university rank higher, I performed linear regression to predict the total score of a university given the features of the dataset.


Experiments with Overparameterized Linear Regression Models

Project for Foundations of Modern Data Science Class
After researching the novel "Double Descent Phenomenon," I conducted a series of experiments with randomly generated data to understand and visualize how it arises in linear regression. One experiment replicates the work of Preetum Nakkiran, while the others were based on my own research. Because this project required a form of linear regression not covered by sklearn's estimators, I performed the linear algebra calculations to train the models myself, using PyTorch to speed up the matrix operations.
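In the overparameterized regime (more features than samples) the least-squares problem has infinitely many solutions; the minimum-norm one, central to double descent experiments, can be computed directly with the Moore-Penrose pseudo-inverse. A NumPy sketch on randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                      # fewer samples than features: overparameterized
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Minimum-norm least-squares solution via the pseudo-inverse
w_hat = np.linalg.pinv(X) @ y

train_error = np.linalg.norm(X @ w_hat - y)
print(train_error)  # near zero: the model interpolates the training data
```

Sweeping d past n while tracking test error on held-out data is what produces the double descent curve.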


Applied Data Science Capstone

Project for IBM Data Science Professional Certificate on Coursera
Given a few datasets containing information on SpaceX rocket launches, I performed the full data science pipeline. This included data cleaning, normalization, and exploratory data analysis, during which I created a dashboard using Plotly's Dash package. Using this dashboard, I found correlations between rocket characteristics and launch failures. I then performed feature engineering, created classifiers with multiple algorithms to predict successful missions, and analyzed the effect of payload mass on landing success rates.

Drone Grocery Delivery Database

Project for Database Systems Class
In this class, I was given a series of 15 screens outlining the full functionality of a drone grocery delivery system, with users including grocery store managers, drone technicians, and customers. Customers pick a grocery store, select groceries and quantities, and place an order; the order is then assigned to a drone that delivers the groceries. From these screens, I created an ERD and mapped it to a relational schema. In SQL, I created the database, inserted the data, and wrote procedures implementing the functionality for all users.
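A miniature sketch of the order-to-drone mapping using Python's built-in sqlite3; the table and column names here are hypothetical, not the actual project schema:

```python
import sqlite3

# Hypothetical two-table slice of the schema: drones and the orders assigned to them
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drone (id INTEGER PRIMARY KEY, store TEXT);
    CREATE TABLE customer_order (
        id INTEGER PRIMARY KEY,
        customer TEXT,
        drone_id INTEGER REFERENCES drone(id)
    );
    INSERT INTO drone VALUES (1, 'Midtown Market');
    INSERT INTO customer_order VALUES (10, 'Ada', 1);
""")

# Join orders to their assigned drone, as a delivery screen would
row = conn.execute("""
    SELECT o.customer, d.store
    FROM customer_order o JOIN drone d ON o.drone_id = d.id
""").fetchone()
print(row)
```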


Predicting Life Expectancy from Immunizations and Socio-Economic Factors

Project for Regression and Forecasting Class
Using a Kaggle dataset, my team built multiple regression models and performed factor analysis to determine the most significant factors in predicting life expectancy across several countries. First, we cleaned the data. Next, we checked the assumptions for linear regression (constant variance and normality) and transformed the data to fix issues with collinearity of features. We used several techniques such as forward/backward selection, stepwise regression, and best subsets regression to learn the most significant features, and we used RMSE and R-squared as our performance measures. The most significant features found were mortality rates, deaths under 5 years of age, GDP, infant deaths, status of country (developing or developed), total government expenditure percentage, income composition of resources, and alcohol usage. With these features, the team obtained an R-squared value of 83.76 percent. This project was done entirely in R.
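The original project was done in R; as a hedged Python sketch of the same fit-and-score loop, here is a linear regression on synthetic life-expectancy-style data, scored with RMSE and R-squared (the generating coefficients are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(7)
n = 200

# Synthetic predictors loosely mimicking GDP and mortality-style features
gdp = rng.normal(10, 2, size=n)
mortality = rng.normal(5, 1, size=n)
life_exp = 70 + 0.8 * gdp - 1.5 * mortality + rng.normal(0, 1, size=n)

X = np.column_stack([gdp, mortality])
model = LinearRegression().fit(X, life_exp)
pred = model.predict(X)

rmse = mean_squared_error(life_exp, pred) ** 0.5
r2 = r2_score(life_exp, pred)
print(round(rmse, 2), round(r2, 3))
```

Feature selection (forward/backward, stepwise, best subsets) then repeats this fit-and-score step over candidate feature sets, keeping the set with the best held-out metrics.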


Find me on ...