Hello, and welcome to the second post of the BigQuery Bootcamp series. In this post, we will delve deeply into the core stages of data analysis: Data Collection, Cleaning, Analysis, and Visualization, and explore how to effectively utilize BigQuery at each stage. Additionally, we will provide actual CSV data as a practical example to help you follow along hands-on. By understanding the key stages of data analysis and practicing them in conjunction with BigQuery, you can further enhance your data analysis skills.
Overview of the Key Stages of Data Analysis
Data analysis is the process of transforming data into meaningful information through a series of systematic steps. This process consists of the following four key stages:
1. Data Collection
2. Data Cleaning
3. Data Analysis
4. Data Visualization
Each stage is closely interconnected, and thorough execution of each step is crucial for effective data analysis.
1. Data Collection
1.1 Importance of Data Collection
Data collection is the first stage of data analysis, and securing the right data is pivotal to the success of the analysis. The source and quality of data directly impact the reliability of the analysis results, necessitating careful data collection.
1.2 Methods of Data Collection
Data can be collected through a variety of methods; the primary ones are as follows:
• Databases: Extract data from existing relational databases (MySQL, PostgreSQL, Amazon RDS, etc.).
• APIs: Collect real-time data through the APIs of external services or applications.
• Log Files: Analyze server logs, application logs, and similar sources to gather data.
• File Uploads: Upload files in formats such as CSV or JSON to secure data.
1.3 Collecting Data in BigQuery
BigQuery supports integration with diverse data sources, facilitating the data collection process. The main methods to collect data in BigQuery are:
• Load Data: Load files in formats such as CSV, JSON, Avro, and Parquet into BigQuery tables.
• Streaming Insert: Insert data into BigQuery in real time to support real-time analysis.
• Data Transfer Service: Automatically transfer data from Google services such as Google Analytics and Google Ads into BigQuery.
• External Data Sources Connection: Connect external data sources such as Cloud Storage or Google Sheets to BigQuery and query them directly (see the sketch below).
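As a quick illustration of the external-connection option, the sketch below exposes the CSV file used later in this post as an external table, assuming it has been uploaded to gs://your_bucket/sales_data.csv (the project, dataset, bucket, and table names are placeholders):

-- Minimal sketch: query a CSV in Cloud Storage without loading it into BigQuery
-- (project, dataset, and bucket names are placeholders)
CREATE OR REPLACE EXTERNAL TABLE `your_project.your_dataset.sales_data_ext` (
  sale_id INT64,
  date DATE,
  category STRING,
  product STRING,
  quantity INT64,
  unit_price FLOAT64,
  total_sales FLOAT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://your_bucket/sales_data.csv'],
  skip_leading_rows = 1  -- skip the header row
);

-- The external table can then be queried like any other table
SELECT * FROM `your_project.your_dataset.sales_data_ext` LIMIT 10;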
Practical Example: Loading a CSV File into BigQuery
In the example below, we will explain how to load a sales_data.csv file containing sales data into BigQuery.
Example CSV File: sales_data.csv
sale_id,date,category,product,quantity,unit_price,total_sales
1,2024-01-15,Electronics,Smartphone,10,500,5000
2,2024-01-17,Electronics,Laptop,5,1200,6000
3,2024-01-20,Home Appliances,Vacuum Cleaner,7,150,1050
4,2024-02-05,Furniture,Office Chair,12,85,1020
5,2024-02-10,Furniture,Desk,8,200,1600
6,2024-02-15,Electronics,Tablet,15,300,4500
7,2024-03-01,Home Appliances,Air Conditioner,3,400,1200
8,2024-03-10,Electronics,Smartwatch,20,150,3000
9,2024-03-15,Furniture,Bookshelf,6,120,720
10,2024-04-01,Home Appliances,Microwave,10,100,1000
11,2024-04-05,Electronics,Headphones,25,80,2000
12,2024-04-10,Furniture,Sofa,2,700,1400
13,2024-04-15,Electronics,Camera,4,600,2400
14,2024-05-01,Home Appliances,Refrigerator,2,800,1600
15,2024-05-05,Furniture,Bed Frame,3,500,1500
16,2024-05-10,Electronics,Television,6,400,2400
17,2024-06-01,Home Appliances,Blender,10,50,500
18,2024-06-05,Furniture,Dining Table,4,350,1400
19,2024-06-10,Electronics,Speaker,18,90,1620
20,2024-06-15,Home Appliances,Washer,1,700,700
Explanation of the CSV File
• sale_id: Unique sale identifier
• date: Sale date (in YYYY-MM-DD format)
• category: Product category (e.g., Electronics, Home Appliances, Furniture)
• product: Product name
• quantity: Quantity sold
• unit_price: Unit price (USD)
• total_sales: Total sales amount (USD), i.e., quantity × unit_price
How to Create the CSV File
1. Using a Text Editor
1. Copy the data above and paste it into a text editor (e.g., Notepad, VS Code, Sublime Text).
2. Save the file with the name sales_data.csv.
2. Using Spreadsheet Software
1. Open a spreadsheet program such as Microsoft Excel or Google Sheets.
2. Enter the headers and data corresponding to each column.
3. Export (download) the file in CSV format.
Example with Google Sheets:
1. Open Google Sheets and create a new spreadsheet.
2. Enter the data into each cell as shown above.
3. From the top menu, select File > Download > Comma-separated values (.csv) to save the file as a CSV.
Loading CSV Data into BigQuery
Now, let's briefly explain how to load the created sales_data.csv file into BigQuery.
1. Uploading the CSV File to Google Cloud Storage
1. Log in to the Google Cloud Console.
2. Navigate to Cloud Storage and create a new bucket or select an existing one.
3. Click the Upload Files button and upload the sales_data.csv file.
2. Loading Data in BigQuery
1. Navigate to BigQuery.
2. From the left navigation menu, select an existing dataset or create a new one.
3. Within the dataset, click Create Table.
4. In the Source section, select Google Cloud Storage and enter the path of the uploaded CSV file (gs://your_bucket/sales_data.csv).
5. Set the File format to CSV.
6. In the Destination section, select your Project and Dataset, and specify a Table name (e.g., sales_data).
7. Define the Schema:
   • Enable Auto-detect for BigQuery to automatically detect the schema, or
   • Manually define the data type for each column:
     sale_id: INTEGER
     date: DATE
     category: STRING
     product: STRING
     quantity: INTEGER
     unit_price: FLOAT
     total_sales: FLOAT
8. Click Create Table to load the data. (A SQL alternative to these console steps is sketched below.)
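If you prefer SQL to the console, the same load can be expressed with BigQuery's LOAD DATA statement. The sketch below uses the placeholder project, dataset, and bucket names from this post; adjust them to your environment:

-- Minimal sketch: load the CSV from Cloud Storage with SQL instead of the console
LOAD DATA INTO `your_project.your_dataset.sales_data` (
  sale_id INT64,
  date DATE,
  category STRING,
  product STRING,
  quantity INT64,
  unit_price FLOAT64,   -- FLOAT64 corresponds to FLOAT in the console schema editor
  total_sales FLOAT64
)
FROM FILES (
  format = 'CSV',
  uris = ['gs://your_bucket/sales_data.csv'],
  skip_leading_rows = 1  -- skip the header row
);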
3. Verifying Data Load
Once the load completes, you can click on the table in BigQuery to confirm that the data arrived correctly. For example, run the following query to preview the data:
SELECT * FROM `your_project.your_dataset.sales_data` LIMIT 10;
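As an additional sanity check (assuming all 20 rows of the sample file were loaded), you can confirm the row counts and verify that total_sales equals quantity × unit_price:

-- Row count per category (the sample file contains 20 rows in total)
SELECT
  category,
  COUNT(*) AS row_count
FROM
  `your_project.your_dataset.sales_data`
GROUP BY
  category;

-- Rows where total_sales does not match quantity * unit_price (expect zero rows)
SELECT
  *
FROM
  `your_project.your_dataset.sales_data`
WHERE
  total_sales != quantity * unit_price;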
2. Data Cleaning
2.1 Importance of Data Cleaning
Collected data often contains missing values, duplicate records, and errors. Data cleaning is the process of resolving these issues to transform the data into a suitable format for analysis. Cleaned data ensures accurate analysis and reliable results.
2.2 Key Data Cleaning Tasks
During the data cleaning stage, the following tasks are performed:
• Handling Missing Values: Remove or replace missing data.
• Removing Outliers: Identify and remove abnormal values.
• Data Type Conversion: Ensure data types are consistent.
• Removing Duplicates: Identify and remove duplicate records.
• Data Integration: Combine data collected from multiple sources.
2.3 Cleaning Data in BigQuery
BigQuery provides robust SQL-based data cleaning capabilities. The main methods to clean data in BigQuery are:
Handling Missing Values
-- Replace NULL values with 0
SELECT
  IFNULL(quantity, 0) AS quantity,
  IFNULL(unit_price, 0) AS unit_price,
  IFNULL(total_sales, 0) AS total_sales
FROM
  `your_project.your_dataset.sales_data`;
Removing Duplicates
-- Remove duplicate records
CREATE OR REPLACE TABLE `your_project.your_dataset.cleaned_sales_data` AS
SELECT DISTINCT
  *
FROM
  `your_project.your_dataset.sales_data`;
Data Type Conversion
-- Parse the date column into DATE format (useful if it was loaded as a STRING)
SELECT
  sale_id,
  PARSE_DATE('%Y-%m-%d', CAST(date AS STRING)) AS date,
  category,
  product,
  quantity,
  unit_price,
  total_sales
FROM
  `your_project.your_dataset.sales_data`;
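Removing Outliers
Section 2.2 also lists outlier removal. The query below is one illustrative approach, assuming a simple three-standard-deviation rule on quantity; the threshold is an assumption you should tune for your own data.

-- Keep rows whose quantity lies within 3 standard deviations of the mean
-- (the 3-sigma threshold is an illustrative assumption)
WITH stats AS (
  SELECT
    AVG(quantity) AS avg_quantity,
    STDDEV(quantity) AS stddev_quantity
  FROM
    `your_project.your_dataset.sales_data`
)
SELECT
  s.*
FROM
  `your_project.your_dataset.sales_data` AS s,
  stats
WHERE
  s.quantity BETWEEN stats.avg_quantity - 3 * stats.stddev_quantity
                 AND stats.avg_quantity + 3 * stats.stddev_quantity;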
Practical Example: Removing Missing Values and Duplicates
In the example below, we remove records with missing values and eliminate duplicate records from the sales_data table to create the cleaned_sales_data table.
-- Remove records with missing values
CREATE OR REPLACE TABLE `your_project.your_dataset.cleaned_sales_data` AS
SELECT
  *
FROM
  `your_project.your_dataset.sales_data`
WHERE
  quantity IS NOT NULL
  AND unit_price IS NOT NULL
  AND total_sales IS NOT NULL;

-- Remove duplicate records
CREATE OR REPLACE TABLE `your_project.your_dataset.cleaned_sales_data` AS
SELECT DISTINCT
  *
FROM
  `your_project.your_dataset.cleaned_sales_data`;
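The two statements above can also be combined into a single pass that filters NULLs and de-duplicates at the same time:

-- Equivalent single-step version: drop rows with NULLs and de-duplicate in one pass
CREATE OR REPLACE TABLE `your_project.your_dataset.cleaned_sales_data` AS
SELECT DISTINCT
  *
FROM
  `your_project.your_dataset.sales_data`
WHERE
  quantity IS NOT NULL
  AND unit_price IS NOT NULL
  AND total_sales IS NOT NULL;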
3. Data Analysis
3.1 Purpose of Data Analysis
Data analysis is the process of identifying patterns, deriving insights, and supporting decision-making based on cleaned data. Various methods and techniques can be employed depending on the purpose of the analysis.
3.2 Data Analysis Techniques
There are numerous techniques in data analysis, with the primary ones being:
• Descriptive Statistics: Summarize the basic features of the data (mean, median, standard deviation, etc.).
• Exploratory Data Analysis (EDA): Visually explore data patterns and relationships.
• Regression Analysis: Model and predict relationships between variables (a BigQuery ML sketch follows this list).
• Classification & Clustering: Categorize or group data.
• Time Series Analysis: Analyze data changes over time.
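For regression in particular, BigQuery ML lets you train and use a model with plain SQL. The sketch below is only an illustration: the model name is hypothetical, and the 20-row sample dataset is far too small for a meaningful fit.

-- Minimal BigQuery ML sketch: linear regression on the sample sales data
-- (model name is hypothetical; a 20-row sample is only enough for a demonstration)
CREATE OR REPLACE MODEL `your_project.your_dataset.sales_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['total_sales']
) AS
SELECT
  quantity,
  unit_price,
  total_sales
FROM
  `your_project.your_dataset.sales_data`;

-- Predict total sales for a hypothetical new record
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `your_project.your_dataset.sales_model`,
    (SELECT 10 AS quantity, 250.0 AS unit_price)
  );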
3.3 Analyzing Data in BigQuery
BigQuery offers a variety of SQL functions to efficiently analyze large datasets. The main methods to perform data analysis in BigQuery are:
Calculating Descriptive Statistics
-- Calculate mean, median, and standard deviation
SELECT
  AVG(quantity) AS average_quantity,
  APPROX_QUANTILES(quantity, 2)[OFFSET(1)] AS median_quantity,
  STDDEV(quantity) AS stddev_quantity,
  AVG(unit_price) AS average_unit_price,
  APPROX_QUANTILES(unit_price, 2)[OFFSET(1)] AS median_unit_price,
  STDDEV(unit_price) AS stddev_unit_price,
  AVG(total_sales) AS average_total_sales,
  APPROX_QUANTILES(total_sales, 2)[OFFSET(1)] AS median_total_sales,
  STDDEV(total_sales) AS stddev_total_sales
FROM
  `your_project.your_dataset.sales_data`;
Grouping and Aggregation
-- Calculate total and average sales by category
SELECT
  category,
  SUM(total_sales) AS total_sales,
  AVG(total_sales) AS average_sales
FROM
  `your_project.your_dataset.sales_data`
GROUP BY
  category
ORDER BY
  total_sales DESC;
Joining Tables
(The regions table referenced below is an illustrative example; the sample dataset in this post does not include it.)
-- Join two tables to integrate data
SELECT
  a.sale_id,
  a.date,
  a.category,
  a.product,
  a.quantity,
  a.unit_price,
  a.total_sales,
  b.region
FROM
  `your_project.your_dataset.sales_data` AS a
JOIN
  `your_project.your_dataset.regions` AS b
ON
  a.sale_id = b.sale_id;
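Time Series Aggregation
The time series analysis mentioned in section 3.2 can start from a simple monthly aggregation, assuming the date column was loaded with the DATE type:

-- Total sales per month (simple time series aggregation)
SELECT
  DATE_TRUNC(date, MONTH) AS sales_month,
  SUM(total_sales) AS monthly_sales
FROM
  `your_project.your_dataset.sales_data`
GROUP BY
  sales_month
ORDER BY
  sales_month;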
Practical Example: Analyzing Sales by Category
In the example below, we use the sales_data table to calculate the total and average sales by category.
-- Calculate total and average sales by category
SELECT
  category,
  SUM(total_sales) AS total_sales,
  AVG(total_sales) AS average_sales
FROM
  `your_project.your_dataset.sales_data`
GROUP BY
  category
ORDER BY
  total_sales DESC;
Example Results:
category        | total_sales | average_sales
----------------|-------------|--------------
Electronics     | 26,920      | 3,365.0
Furniture       | 7,640       | 1,273.333333
Home Appliances | 6,050       | 1,008.333333
4. Data Visualization
4.1 Importance of Data Visualization
Data visualization is a crucial process that conveys analysis results in an intuitive and easily understandable manner. Through visualization, complex data patterns can be quickly grasped, and insights can be effectively communicated to decision-makers.
4.2 Data Visualization Tools
The main data visualization tools that can be integrated with BigQuery are as follows:
• Google Data Studio (now Looker Studio): A free data visualization tool from Google that integrates seamlessly with BigQuery to create dashboards and reports.
• Tableau: A commercial tool offering powerful data visualization capabilities with support for BigQuery integration.
• Looker: Google's data analysis and visualization platform, tightly integrated with BigQuery.
• Power BI: Microsoft's data visualization tool that connects with BigQuery to generate various visualizations.
4.3 Integrating BigQuery with Google Data Studio
Google Data Studio is optimized for visualizing data through seamless integration with BigQuery. Here’s how to connect BigQuery data to Google Data Studio:
1. Access Google Data Studio and sign in with your Google account.
2. Add Data Source: Click on Create > Data Source, then select BigQuery.
3. Select Project and Dataset: Choose the BigQuery project and dataset you want to connect.
4. Select Table: Choose the table you want to visualize (the connector also supports a custom SQL query; see the sketch below).
5. Complete the Connection: Load the data to use it for visualization.
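If you prefer to connect a pre-aggregated result instead of the raw table, the BigQuery connector also offers a Custom Query option. As a sketch, a query like the following could serve as the dashboard's data source:

-- Example custom query for the Data Studio BigQuery connector:
-- pre-aggregate sales by category and month before visualizing
SELECT
  category,
  DATE_TRUNC(date, MONTH) AS sales_month,
  SUM(total_sales) AS total_sales,
  SUM(quantity) AS total_quantity
FROM
  `your_project.your_dataset.sales_data`
GROUP BY
  category,
  sales_month;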
Practical Example: Creating a Sales Dashboard in Google Data Studio
The steps below describe how to create a sales dashboard in Google Data Studio using the sales_data loaded into BigQuery.
1. Adding a Data Source
1. In Google Data Studio, click on Create > Data Source.
2. Select BigQuery, choose the project and dataset you wish to connect to, then click Connect.
3. In the Select Table screen, choose the sales_data table and click Add.
2. Creating Charts
1. Bar Chart: Visualize total sales by category.
   • Dimension: category
   • Metric: total_sales
2. Pie Chart: Display sales distribution by category.
   • Dimension: category
   • Metric: total_sales
3. Table: Show detailed information for each sale item.
   • Dimensions: date, category, product
   • Metrics: quantity, unit_price, total_sales
3. Applying Filters
1. Apply filters for specific categories or time periods to analyze the data in detail.
   • Example: Add a Category filter to allow selection among Electronics, Home Appliances, and Furniture.
   • Example: Add a Date filter to allow selection of specific months or quarters.
4. Assembling the Dashboard
1. Arrange multiple visualization elements on a single dashboard to provide an overview of sales status.
2. Adjust the layout to make the dashboard visually appealing and easy to interpret.
Conclusion
We have thoroughly explored the key stages of data analysis: Data Collection, Cleaning, Analysis, and Visualization. These four stages form the foundation of data analysis, and by effectively utilizing BigQuery at each stage, you can derive more precise and meaningful analysis results. In the Data Collection stage, you secure data from various sources; in the Cleaning stage, you transform data into a suitable format for analysis. During the Analysis stage, you explore the data using SQL, and in the Visualization stage, you present the analysis results intuitively to provide insights necessary for decision-making.
Use the sales_data.csv file provided in this post to practice loading and analyzing the data yourself. This will allow you to experience BigQuery's powerful features firsthand and understand the entire data analysis process.
In the next post, we will delve deeply into the Relationship Between AI and Data Analysis – How AI Impacts Data Analysis. We will explore how AI technologies integrate into the data analysis process and enhance the efficiency and accuracy of analysis, supported by specific examples. We appreciate your interest and look forward to your continued support!
If you have any questions or feedback about this post, please leave a comment. We will incorporate your opinions to provide more valuable content.
Thank you!