본문 바로가기

Data Analytics/Google Data Analytics

[Coursera] Course 5. Analyze Data to Answer Questions

WEEK1 - Organizing data to begin analysis

[Definition]

-Filtering: Showing only the data that meets a specific criteria while hiding the rest

-Sorting: When you arrange data into a meaningful order to make it easier to understand, analyze, and visualize

  • Sort sheet: All of the data in a spreadsheet is sorted by the ranking of a specific sorted column - data across rows is kept together
  • Sort range: Nothing else on the spreadsheet is rearranged besides the specified cells in a column
  • Customized sort order: When you sort data in a spreadsheet using multiple conditions

[SORT function]

-Sheets

=SORT(range, sort_column, is_ascending, [sort_column2, is_ascending2, ...])

 

-SQL

SELECT *
FROM movie_table.movies
ORDER BY Release_Date DESC

[Filter function]

-Sheets

=FILTER(range, condition1, [condition2, ...])-SQL

 

-SQL

SELECT *
FROM movie_table.movies
WHERE Genre = 'Comedy' AND Revenue  > 30000000
ORDER BY
Release_Date DESC

[The 4 phases of analysis]

1. Organize data

2. Format and adjust data

3. Get input from others

4. Transform data

 

The goal of analysis is to identify trends and relationships within data
so you can accurately answer the question you're asking

 


WEEK 2 - Formatting and adjusting data

[Definition]

-Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions

 

[CONVERT function]

=CONVERT(value, start_unit, end_unit)

 

+) Available unit conversions

https://support.google.com/docs/answer/6055540?hl=en 

 

CONVERT - Google Docs Editors Help

Converts a numeric value to a different unit of measure. Sample Usage CONVERT(5.1, "g", "kg") CONVERT(32, "C", "F") CONVERT(35.7, "in^2", "m^2") CONVERT(A1, A2, A3) Syntax CONVERT(value, start_unit, end_unit) value - the numeric value in start_unit to conv

support.google.com

[Data validation]

  • Add dropdown lists with predetermined options
  • Create custom checkboxes
  • Protect structured data and formulas

[Google sheets function list]

https://support.google.com/docs/table/25273?hl=en 

 

Google Sheets function list - Google Docs Editors Help

LogicalLETLET(name1, value_expression1, [name2, …], [value_expression2, …], formula_expression )Assigns name with the value_expression results and returns the result of the formula_expression. The formula_expression can use the names defined in the sco

support.google.com

 

 


WEEK 3 - Aggregating data for analysis

[Definition]

-Aggregation: Collecting or gathering many separate pieces into a whole

-Data aggregation: The process of gathering data from multiple sources in order to combine it into a single summarized collection

-Subquery: A query within another query

-Absolute reference: A reference that is locked so that rows and columns won't change when copied

 

 

Before applying VALUE function, check these.
1) FORMAT data if needed
2) TRIM extra spaces
3) REMOVE duplicates

[VALUE function]

A function that converts a text string that represents a number to a numerical value

=VALUE(text)

[JOIN function]

A SQL clause that is used to combine rows from two or more tables based on a related column

  • INNER JOIN: A function that returns records with matching values in both tables
  • LEFT / RIGHT JOIN: A function that will return all the records from the left/right table and only the matching records from the right/left table
  • OUTER JOIN: A function that combines LEFT and RIGHT JOIN to return all matching records in both tables

source: coursera google data analytics certificate course

 

SELECT
      employees.name AS employee_name,
      employees.role AS employee_role,
      departments.name AS department_name
FROM
      employees
LEFT JOIN
      departments
ON
employees.department_id = departments.department_id

 

※ You can set aliases for tables by specifying the alias for the table after the table's name in FROM and/or JOIN statements.

https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/

 

SQL Join Types Explained in Visuals

Data within a database exists across multiple tables, JOINs allow you to combine datasets into new tables for analysis. Learn more.

dataschool.com

[Subqueries]

-Question: What percentage of the orders are fulfilled by each warehouse?

SELECT
      Warehouse.warehouse_id,
      CONCAT(Warehouse.state, ': ', Warehouse.warehouse_alias) AS warehouse_name,
      COUNT(Orders.order_id) AS number_of_orders,
      (SELECT COUNT(*) FROM warehouse_orders.Orders) AS total_orders,
      CASE
            WHEN COUNT(Orders.order_id) / (SELECT COUNT(*) FROM warehouse_orders.Orders) <= 0.20
            THEN "Fulfilled 0-20% of Orders"       
            WHEN COUNT(Orders.order_id) / (SELECT COUNT(*) FROM warehouse_orders.Orders) > 0.20
            AND COUNT(Orders.order_id) / (SELECT COUNT(*) FROM warehouse_orders.Orders) <= 0.60
            THEN "Fulfilled 21-60% of Orders"    
            ELSE "Fulfilled more than 60% of Orders"    
            END AS fulfillment_summary
FROM
      warehouse_orders.Warehouse AS Warehouse
LEFT JOIN
      warehouse_orders.Orders AS Orders
ON
      Warehouse.warehouse_id = Orders.warehouse_id
GROUP BY
      Warehouse.warehouse_id,
      warehouse_name
HAVING
      COUNT(Orders.order_id) > 0

[Troubleshooting questions]

  • How should I prioritize these issues?
  • In a single sentence, what's the issue I'm facing?
  • What resources can help me solve the problem?
  • How can I stop this problem from happening in the future?

 


WEEK 4 - Performing data calculations

[Definition]

-Profit margin: A percentage that indicates how many cents of profit has been generated for each dollar of sale

-Calculated field: A new field within a pivot table that carries out certain calculations based on the values of other fields

-Modulo(SQL): An operator (%) that returns the remainder when one number is divided by another

(MODULO: SQL, MOD: sheets)

-GROUP BY(SQL): A command that groups rows that have the same values from a table into summary rows

-Data validation process: Checking and rechecking the quality of your data so that it is complete, accurate, secure and consistent

-Temporary table: A database table that is created and exists temporarily on a database server

 

[SUMIF, SUMIFS, COUNTIF, COUNTIFS]

Function Example
=SUMIF(range, criteria, [sum_range]) =SUMIF(A1:A9, "Fuel", B1:B9)
=SUMIFS(sum_range, range1, value1, [range2, value2, ...]) =SUMIFS(B1:B9, A1:A9, "Fuel", C1:C9, "12/15/20")
=COUNTIF(range, value) =COUNTIF(A1:A9, "Food")
=COUNTIFS(range1, value1, [range2, value2, ...]) =COUNTIFS(A1:A9, "Coffee", C1:C9, "12/15/2020")

 

https://exceljet.net/functions/ifs-function

 

IFS Function

The Excel IFS function can run multiple tests and return a value corresponding to the first TRUE result. Use the IFS function to evaluate multiple conditions without multiple nested IF statements. IFS allows shorter, easier to read formulas.

exceljet.net

[SUMPRODUCT function]

Multiply arrays and return the sum of those products.

=SUMPRODUCT(array1, array2, ...)

[EXTRACT, GROUP BY]

SELECT
      EXTRACT( YEAR FROM STARTTIME)
      COUNT(*) AS number_of_rides
FROM
      bigquery-public-data.new_york.citibike_trips
GROUP BY
      year
ORDER BY
      year

[Types of data validation]

  Purpose Example Limitations
Data type Check that the data matches the data type defined for a field. Data values for school grades 1-12 must be a numeric data type. The data value 13 would pass the data type validation. For this case, data range validation is also needed.
Data range Check that the data falls within an acceptable range of values defined for the field. Data values for school grades should be values between 1 and 12. The data value 11.5 would pass the range validation. For this case, data constraint validation is also needed.
Data constraints Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field, such as number of characters. Content constraint: Data values for school grades 1-12 must be whole numbers. The data value 13 would pass the content would pass the content constraint validation. For this case, data range validation is also needed.
Data consistency Check that the data makes sense in the context of other related data. Data values for product shipping dates can't be earlier than product production dates. Data might be consistent but still incorrect or inaccurate. A shipping date could be later than a production date and still be wrong.
Data structure Check that the data follows or conforms to a set structure. Web pages must follow a prescribed structure to be displayed properly. A data structure might be correct with the data still incorrect or inaccurate. Content on a web page could be displayed properly and still contain the wrong information.
Code validation Check that the application code systematically performs any of the previously mentioned validations during user data input. Common problems discovered during code validationi include: more than one data type allowed, data range checking not done, or ending of text strings not well defined. Code validation might not validate all possible variatioins with data input.

[Temporary table]

Temporary tables in a SQL database that aren't stored permanenlty.

  • They are automatically deleted from the database when you end your SQL session.
  • They can be used as a holding area for storing values if you are making a series of calculations. (pre-processing)
  • They can collect the results of multiple, separate queries. (staging)
  • They can store a filtered subset of the database. Using fewer SQL commands hepls to keep your data clean.

[How to create temporary tables]

  • WITH clauses - BigQuery
  • SELECT INTO statements - MySQL 
  • CREATE TABLE / DROP TABLE statements
WITH trips_over_1hr AS (
      SELECT *
      FROM bigquery-public-data.new_york.citibike_trips
      WHERE tripduration > 60
       )

## Count how many trips are 60+ minutes long
SELECT
      COUNT (*) AS cnt
FROM trips_over_1_hr

 

SELECT
      *
INTO
      AfricaSales
FROM
      GlobalSales
WHERE
      Region = "Africa"