Exploring GitHub Copilot for Data Engineering

Home >
Perspectives >
Exploring GitHub Copilot for Data Engineering

Blog November 26, 2024

10 min read

Exploring GitHub Copilot for Data Engineering

As GitHub Copilot’s popularity grows among Fortune 100 companies, we explore this AI-powered coding assistant’s potential impact on development workflows. Built by GitHub and OpenAI, this tool integrates with IDEs like Visual Studio Code, suggesting contextually relevant code snippets in real-time. While Copilot can boost productivity, developers must carefully review its suggestions to ensure they align with their needs. Read on to explore how this tool can enhance your coding experience.

TAGS:

Sagar Adabaddi

Sidharth Ramesh

AI in Beauty: Decoding Customer Preferences with the Power of Affinity Embeddings How can AI help luxury retailers unlock a premium customer experience? GenAI for ITSM – 4 Ways LLMs Improve IT Ticket Handling and User Experience Navigating the Digital Seas: How Snowflake’s External Access Integration Streamlines Maritime Data Management Building Trusted Data: A Comprehensive Guide to Tiger Analytics’ Snowflake Native Data Quality Framework

GitHub Copilot is a fine-tuned AI model designed to act as your pair programmer, helping you write code faster and with fewer errors. Developed by GitHub in collaboration with OpenAI, Copilot integrates seamlessly with popular code editors like Visual Studio Code and provides dynamic suggestions, making your development process effortless. While Microsoft Copilot might sound similar to GitHub Copilot, they are two separate applications. Microsoft 365 Copilot supports applications like Word, Excel, and PowerPoint, whereas GitHub Copilot is specifically designed for software development.

As we at Tiger Analytics continue to leverage the latest AI tools to make our development process more efficient and innovative, we noticed a spike in interest among our Fortune 100 clients in adopting GitHub Copilot as part of their data engineering development cycles. While Copilot is no silver bullet, we found that integrating it into either the development or testing process can improve project throughput levels.

In the following sections, we will explain the basic architecture of GitHub Copilot and present some data engineering use cases that can aid in solving challenges like data lineage, code documentation, and refactoring. All examples covered can broadly be applied to solve business problems in insurance, pharma, and retail domains.

Key Features

AI-Powered: GitHub Copilot leverages OpenAI’s language models and is fine-tuned on GitHub’s code repositories to better understand your code’s context and provide relevant, applicable suggestions.
Seamless Integration: GitHub Copilot integrates with editors like Visual Studio Code while also supporting other development environments.
Context-Aware Suggestions: GitHub Copilot includes comments and existing functions to provide code suggestions that match your context.

Copilot Architecture:

Copilot Architecture

As the above architecture indicates, GitHub copilot uses advanced machine learning models that are housed in the cloud and delivers suggestions directly to your IDE. This integration allows Copilot to deliver real-time development suggestions which are also relevant to developer context. Here’s a closer look at its architecture:

1. OpenAI Codex Model

OpenAI Codex is a specialized version of the GPT-3 model fine-tuned specifically for programming tasks. It is a regressed model that takes from a large corpus of internet data [2] with use of natural language text and is finetuned on the 54 millions of lines of code repos that are publicly housed on github.com [3]. This gives Copilot the ability to understand and produce code more efficiently than models trained solely on natural language.

2. Cloud Infrastructure

Copilot provides responses within the IDE through its cloud service, powered by GitHub and OpenAI, using an API gateway to facilitate communication between the IDE plugin and the Codex model [4]. This ensures scalability, allowing the service to handle multiple requests from users around the globe simultaneously. To protect user data, security measures are usually implemented for the transmission between the user’s IDE and cloud servers. However, users should remain cautious when handling sensitive data to avoid potential security risks. [5]

3. IDE Integration

GitHub Copilot integrates seamlessly with popular code editors like Visual Studio Code through plugins, which act as intermediaries between the local development environment and the cloud-based Codex model [6]. Users can install and manage the Copilot extension through marketplaces such as the VS Code Marketplace, which allows user authentication, configuration changes and updates. While the heavy lifting of suggestion generation happens in the cloud, the plugin is responsible for local processing, including context and display of suggestions.

4. User Interface and Experience

As developers write code, the GitHub Copilot plugin sends the current context, including code and comments, to the Codex model, which returns real-time suggestions. These suggestions are displayed inline within the code editor or displayed on the Copilot chat, allowing developers to accept, reject, or modify them. Users can also customize Copilot’s behavior through settings in the IDE, adjusting suggestion frequency and enabling or disabling specific features to suit their preferences.

Mechanics of Prompting

Prompting plays a crucial role in maximizing the effectiveness of AI tools like GitHub Copilot. Writing prompts for GitHub Copilot is all about being clear and providing enough context to guide the AI model effectively.

This is where in-context learning plays a pivotal role. Copilot doesn’t just react to isolated prompts — it continuously learns from the code, comments, and patterns you’ve used in your project. This adaptive learning helps Copilot refine its suggestions, aligning more closely with your unique coding style and making your interactions with the AI more personalized and efficient. By mastering prompting, developers can significantly boost productivity and enhance the quality of their AI-assisted code.

For data engineering use cases, the context provided should include:

Data formats
Source and target table formats
Steps involved in programming logic

Unrefined Prompt: “Write a function to load file onto table”
Refined Prompt: “Write a Python function to load data from a csv file onto a postgres table.
The function should take in parameterized values for Database connections and have input argument for specifying CSV file location.
Use Pandas and Pscycopg2 to prepare the dataframe and ingest it into Postgres”

Formulating prompts is vital because LLMs have a large base of parameters to draw from. Furthermore, Copilot is a fine-tuned LLM, meaning it is trained on a specific task – providing output related to software development. The context we provide as input helps Copilot formulate attention, allowing it to draw relevant information from its vast knowledge base that directly applies to the given input problem.

Stacking up GitHub Copilot Across Various Data Engineering Use Cases

1. Code Generation- ETL Pipeline

pipeline

To demonstrate Copilot’s application in data engineering tasks, we look at ETL code generation. Consider the following pipeline: Data flowing from Landing, Staging, Dimension, and Fact tables, ultimately feeding Reporting Views for an e-commerce company. We will look at Customers, Products, and Sales.

Prompt 1

I am looking to develop a simple data flow for an e-commerce company that will take data that is ingested in CSV and will ingest into Postgres Landing Table.

The data flow will go from Landing Table which will be of SCD type 1 to Staging Table which will be of SCD type 2 to Dimension and Fact Tables. Please generate data flow example for storing customer data

The above prompt generates the following response:

To develop a simple data flow for an eCommerce company that ingests data from CSV files into a Postgres database, we can outline the following steps:

Ingest CSV Data into Landing Tables: These tables will store raw data as it is ingested.

Transform Data into Staging Tables: These tables will handle Slowly Changing Dimensions (SCD) type 2.

Load Data into Dimension and Fact Tables: These tables will be used for analytics and reporting.

Step-by-Step Data Flow Example

-- Customers Landing Table
DROP TABLE IF EXISTS customers_landing CASCADE;
CREATE TABLE customers_landing (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255),
    email VARCHAR(255),
    created_at TIMESTAMP
);

-- Customers Staging Table
DROP TABLE IF EXISTS customers_staging CASCADE;
CREATE TABLE customers_staging (
    customer_dim_id SERIAL PRIMARY KEY,
    customer_id INT,
    customer_name VARCHAR(255),
    email VARCHAR(255),
    start_date TIMESTAMP,
    end_date TIMESTAMP,
    is_current BOOLEAN
);

-- Customer Dimension Table
DROP TABLE IF EXISTS customer_dim CASCADE;
CREATE TABLE customer_dim (
    customer_dim_id SERIAL PRIMARY KEY,
    customer_id INT,
    customer_name VARCHAR(255),
    email VARCHAR(255),
    start_date TIMESTAMP,
    end_date TIMESTAMP,
    is_current BOOLEAN
);

Note that Copilot is able to give us the table definitions in the manner requested for Landing and Staging. Copilot also generates Python code that can be used to ingest CSV files into the Postgres database. For the code reference, click here.

2. Code Refactoring-SQL

Code Refactoring SQL

Some code pipelines, especially in the insurance domain, were put in production nearly 10 or even 20 years ago. Modernizing these lines of code can take a while. Since Copilot has access to data from global software development repositories, it is exposed to examples of optimized code written by experienced developers. This gives Copilot the ability to:

Explain complex code
Take patterns from prior examples and apply it to a given problem

You can attach code files to establish a long context window by attaching the file to Copilot chat [13][12][14] and ask Copilot to refactor SQL and Python code snippets that may be complex and poorly written. Some useful prompts:

P 1
Explain this block of code

P 2
Can you please help refactor the above code snippet

P 3
Can you please explain recursive WITH clause used in this sql

P 4
Are any methods common between the two files

P 5
Is this method running with the best Big(O) value ?

Here is an example of refactoring poorly written SQL code using Copilot with a prompt:

SELECT full_name 
	,  kpi1*kpi2 as ratio
from (
SELECT full_name
	,  product_name
	,  quantity
	,  price
	,  price/quantity kpi1
	,  1/quantity::float kpi2
FROM (
SELECT
    c.customer_id,
    c.first_name || ' ' || c.last_name AS full_name,
    c.email,
    p.product_name,
    p.category,
    p.price,
    s.quantity,
    s.sale_amount,
    s.effective_date AS sale_effective_date,
    s.expiration_date AS sale_expiration_date
FROM
    staging_sales_fact s
JOIN customer_dim c ON s.customer_id = c.customer_id AND c.is_current = 'y'
JOIN product_dim p ON s.product_id = p.product_id AND p.is_current = 'y'
WHERE
    s.is_current = 'y'
)
)

Copilot is able to refactor the SQL to the below:

SELECT 
    c.first_name || ' ' || c.last_name AS full_name,
    (p.price / s.quantity) * (1 / s.quantity::float) AS ratio
FROM
    staging_sales_fact s
JOIN 
    customer_dim c ON s.customer_id = c.customer_id AND c.is_current = 'y'
JOIN 
    product_dim p ON s.product_id = p.product_id AND p.is_current = 'y'
WHERE
    s.is_current = 'y';

In the above example, there is a clear difference between the raw SQL version and the refactored version. Copilot could recognize the redundancy in the raw version and eliminate it to provide an optimized version.

3. Code Documentation

Code Documentation
[GPT generated image]

The retail domain has tighter timelines and deploys at a rapid pace. Oftentimes, low-value tasks like documentation fall by the wayside. But, documentation for the deployed code is important from a team continuity standpoint.

In this case, you can use Copilot to create code documentation, the most common example being inline commenting. You can also use the chat feature, thanks to the long context feature, to add Python files that may have complicated logic and ask Copilot to create an exhaustive Markdown file explaining all the methods and classes used. This is beneficial for code documentation and collaborations. Consider the following example file,

Prompt 1: – Please create a markdown file that will explain the repo to a new team member

Overview
This repository contains a Python module designed to facilitate interactions with the CSV module. It provides utilities for handling CSV files, including reading, writing, and managing CSV dialects. The module also includes classes and functions to support operations on CSV headers and dialects.

Features
Quote Lookup: Functions to convert between quote names and quote numbers.
Header Management: A class to manage CSV header information, allowing the use of field names instead of field numbers.
Dialect Handling: A class and functions to manage CSV dialects, including auto-detection and validation.

File Structure
csvhelper.py: Main module containing all the functionalities.
LICENSE: License file governing the use of this code.

Key Components
Functions
get_quote_number(quote_name: Optional[str]) -> Optional[int]

Converts a quote name to its corresponding quote number.
get_quote_name(quote_number: int) -> Optional[str]

Converts a quote number to its corresponding quote name.
convert_dialect(std_dialect: Type[_csv.Dialect]) -> Dialect

Converts a standard CSV dialect to a custom Dialect object.

Classes
● Header
Manages CSV header information to support using field names rather than field numbers in configurations.

Methods
load_from_file(file_name: str, dialect)
load_from_list(field_names: List[str])
get_field_position(field_name)
.
.
.

Usage
To use this module, import the necessary classes and functions from csvhelper.py. Below is an example of how to use the Header class to load and manage CSV headers:

From csvhelper import Header, Dialect

# Create a Dialect object
dialect = Dialect(delimiter=',', has_header=True, quoting=csv.QUOTE_MINIMAL)

# Create a Header object
header = Header()

# Load header from a CSV file
header.load_from_file('example.csv', dialect)

# Get field position by name
position = header.get_field_position('field_name')

# Get field name by position
name = header.get_field_name(0)

Code documentation is often overlooked by developers, but Copilot’s ability to generate markdown files with descriptions of the methods used along with usage references helps create wiki page documentation for teams.

4. Data Lineage and Table Relationships

Data Lineage and Table Relationships

Drawing out the relationships between data and tables is always helpful, especially in the insurance domain where the table widths are rather unwieldy, and some tables have close to 500 columns. While Copilot does not have the visualization ability that Commercial ETL tools have for data lineage, it can provide a text-based relationship between tables and columns. This would require the entire schema to be loaded onto the Copilot chat, and then the prompt should establish the context of table syntax and specify the column at the View level for which you would want to see the lineage.

P1
Can you please draw an ERD diagram for table_relationship.sql file and describe how many tables have been referred to in it? Can you please use a line diagram to demonstrate the relationships between the tables with the columns that link them ?

+-------------------+          +-------------------+          +-------------------+
|   Application     |          |   Application     |          |   Application     |
|     Cities        |          |   StateProvinces  |          |     Countries     |
+-------------------+          +-------------------+          +-------------------+
| CityID (PK)       |          | StateProvinceID (PK) |       | CountryID (PK)    |
| CityName          |          | StateProvinceName    |       | CountryName       |
| StateProvinceID (FK) --------> CountryID (FK)       |       | LastEditedBy (FK) |
| Location          |          | LastEditedBy (FK)    |       +-------------------+
| LatestRecordedPop |          +-------------------+
| LastEditedBy (FK) |
+-------------------+
        |
        |
        v
+-------------------+
|   Application     |
|     People        |
+-------------------+
| PersonID (PK)     |
| FullName          |
| PreferredName     |
| IsPermittedToLogon|
| LastEditedBy (FK) |
+-------------------+

+-------------------+          +-------------------+          +-------------------+
|   Application     |          |   Application     |          |   Application     |
| DeliveryMethods   |          | PaymentMethods    |          | TransactionTypes  |
+-------------------+          +-------------------+          +-------------------+
| DeliveryMethodID (PK) |      | PaymentMethodID (PK) |       | TrnsctnTypeID(PK) |
| DeliveryMethodName    |      | PaymentMethodName    |       | TransctionTypeName|
| LastEditedBy (FK)     |      | LastEditedBy (FK)    |       | LastEditedBy (FK) |
+-------------------+          +-------------------+          +-------------------+
        |
        |
        v
+-------------------+
|   Application     |
|     People        |
+-------------------+
| PersonID (PK)     |
| FullName          |
| PreferredName     |
| IsPermittedToLogon|
| LastEditedBy (FK) |
+-------------------+

While the ERD diagrams produced by Copilot are good, they must be reviewed for accuracy before the output is considered as documentation.

5. Code Conversion – R to Python

Code Conversion – R to Python

Insurance and pharmaceutical clients often have legacy code that continues to perform its intended function and thus, has not been modified. However, enterprise modernization is still relevant and often involves replacing legacy code with modern programming paradigms. GitHub Copilot can work with multiple programming languages. While we’ve primarily focused on SQL and Python – as they are the common denominators in data engineering, Copilot can also work with other languages such as Scala, Java, R, and JavaScript. Drawing from a vast pool of data on GitHub, it is proficient in virtually all languages with extensive codebases. One potential application of particular interest is code conversions from R to Python.

Can you please convert the following R code to Python?

library(RSQLite)
library(dplyr)

# Load CSV 
csv_file_path <- "path/to/your/data.csv"

data <- read.csv(csv_file_path)
data <- data %>%
  rename_all(tolower) %>%
  # Filter out rows with missing values
  filter(!is.na(column1) & !is.na(column2))
db <- dbConnect(SQLite(), dbname = "path/to/your/database.sqlite")
dbWriteTable(db, "table_name", data, overwrite = TRUE)
dbDisconnect(db)
cat("ETL process completed successfully.\n")

The above R Code is refactored into Python

import pandas as pd
import sqlite3

# Define the file path for the CSV file
csv_file_path = "path/to/your/data.csv"

# Extract: Read data from the CSV file
data = pd.read_csv(csv_file_path)

# Transform: Clean and transform the data
# Convert column names to lower case
data.columns = data.columns.str.lower()
# Filter out rows with missing values in 'column1' and 'column2'
data = data.dropna(subset=['column1', 'column2'])

As always, the output from Copilot must be tested for its logic before being added to the code repo.

For supporting files, please visit Git Repo.

Limitations:

Here are some main limitations of GitHub Copilot [8][9]

Limitation	Description	Remedy
Wrong Context	GitHub Copilot may not always get the context in the way you intended and might suggest inaccurate implementations	Establishing context takes more than a single prompt and substantial amount of text to ensure Copilot sees things the way you see it.
Intellectual Prop Security	As GitHub Copilot is trained on GitHub data, some data could be from copyrighted sources, and using code generated from such sources as is could lead to Intellectual property concerns	Generic logic that serves as boilerplate for projects won’t differ significantly. However, for logic that contains a specific type of implementation, it is prudent to remove sensitive or infringing snippets
Code Quality	GitHub Copilot churns a lot of code, and some of it may not exactly adhere to your conventions	Review generated code and remove aspects that don't fit into your code conventions to ensure best practice
Overreliance	Developers can go into autopilot mode with aggressive use of GitHub Copilot	As with any AI output, there are three things to remember when using GitHub Copilot: Review - Review - Review

Examining the landscape of popular codegen applications - GitHub Copilot, LLM Studio, and Blackbox.ai

Here is a table comparing GitHub Copilot with other popular codegen applications like LLM Studio and Blackbox.ai

Feature	Microsoft Copilot	GitHub Copilot	Blackbox.ai
Main Purpose	AI assistant for Microsoft 365 apps	AI-powered coding assistant for code completion and suggestions	Code generation from natural language queries
User Interface	Integrated within Microsoft 365 applications interface	Integrated into IDE (e.g., VS Code)	Integrated into IDE (e.g., VS Code)
Programming Function	Generate Content in word, ppt, analyze excel, summarize emails	Supports many languages like Python, JavaScript, TypeScript, Ruby, Go	Supports major languages (Python, Java, JavaScript, etc.)
Customization	Adapts to organizational data and user context within office apps	Customizable code suggestions based on context	Basic customization with natural language queries
IDE Integration	Integrated into Microsoft 365 apps	Integrated with VS Code, JetBrains IDEs	Integrated with VS Code, JetBrains IDEs
Real-time Suggestions	Provides real time assistance within Office apps	Yes, real-time inline code suggestions	Yes, real-time code completion
Pricing Model	Enterprise license	Freemium ($10/month for individuals)	Freemium model (more straightforward pricing)
Target Audience	Business professionals, enterprises	Developers needing code suggestions and completion	Beginners and developers seeking quick code solutions
Strengths	Enhances productivity across Office apps	Context-aware suggestions, broad language support	Natural language to code, simple query-based code generation

Final Thoughts

As noted in the examples above, we can infer that GitHub Copilot can enhance the coding experience for developers at all levels. Developers can

Increase their productivity - By reducing the amount of boilerplate and repetitive code, Copilot increases productivity, allowing developers to focus on more complex and creative tasks.
Learn from better implementations -Copilot serves as a valuable learning tool for new developers, providing examples and promoting best practices.
Reduce errors -By offering accurate code completions, Copilot helps reduce syntax errors and typos, leading to fewer mistakes and improved code quality.
Iterate quickly -Copilot quickly generates necessary code snippets, expediting prototyping with alternative approaches.

GitHub Copilot is fast, almost precise, and convenient, enabling dev teams to move quickly from concept to implementation. As impressive as all the above demonstrations have been, it is crucial to check the output provided before applying it as necessary.

References

[1] Copilot Architecture and Features - https://github.com/features/copilot
[2] Open AI - https://openai.com/index/openai-codex/
[3] GitHub 54M Lines of Code - https://venturebeat.com/business/openai-warns-ai-behind-githubs-copilot-may-be-susceptible-to-bias/
[4] Workings of Copilot - https://blog.quastor.org/p/github-copilot-works
[5] Privacy - https://docs.github.com/en/site-policy/privacy-policies/github-general-privacy-statement
[6] VSCode Integration - https://code.visualstudio.com/docs/copilot/overview
[7] GitHub Copilot Custom Fine Tuning - https://github.blog/news-insights/product-news/fine-tuned-models-are-now-in-limited-public-beta-for-github-copilot-enterprise/
[8] GitHub Copilot Pros and Cons - https://medium.com/neudesic-innovation/github-copilot-unveiling-the-pros-cons-and-key-considerations-f7da9389676
[9] GH Copilot Review - https://intellias.com/github-copilot-review/
[10] Fine Tuning Limitations - https://codeium.com/blog/what-github-copilot-lacks-finetuning-on-your-private-code
[11] GitHub Fine Tuning Blog - https://github.blog/news-insights/product-news/fine-tuned-models-are-now-in-limited-public-beta-for-github-copilot-enterprise/
[12] Large Context Window - https://github.blog/changelog/2024-09-09-larger-context-window-improved-test-generation-and-more-in-vs-codes-copilot-chat/
[13] Attach File - https://code.visualstudio.com/updates/v1_93#_attach-context-in-quick-chat
[14] Code Review - https://github.blog/changelog/2024-10-29-multi-file-editing-code-review-custom-instructions-and-more-for-github-copilot-in-vs-code-october-release-v0-22/

Sagar Adabaddi

Sidharth Ramesh

Exploring GitHub Copilot for Data Engineering

Key Features

Copilot Architecture:

1. OpenAI Codex Model

2. Cloud Infrastructure

3. IDE Integration

4. User Interface and Experience

Mechanics of Prompting

Stacking up GitHub Copilot Across Various Data Engineering Use Cases

1. Code Generation- ETL Pipeline

2. Code Refactoring-SQL

3. Code Documentation

4. Data Lineage and Table Relationships

5. Code Conversion – R to Python

Limitations:

Final Thoughts

References

Explore more blogs

Thank you!

Thank you!