본문 바로가기

Data Analytics/Google Data Analytics

[Coursera] Course 4. Process Data from Dirty to Clean

WEEK 1 - The importance of integrity

[Definition] 

-Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle

-Data replication: The process of storing data in multiple locations

-Data transfer: The process of copying data from a storage device to memory, or from one computer to another

-Data manipulation: The process of changing data to make it more organized and easier to read

-Population: All possible data values in a certain dataset

-Sample size: A part of a population that is representative of the population

-Sampling bias: A sample isn't representative of the population as a whole

-Random sampling: A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen

-Statistical power: The probability of getting meaningful results from a test

-Hypothesis testing: A way to see if a survey or experiment has meaningful results

-Confidence level: The probability that your sample size accurately reflects the greater population

-Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population

 

If a test is statistically significant,
it means the results of the test are real and not an error caused by random chance.

 

[Data constraints and examples]

Data constraint Definition Examples
Data type Values must be of a certain type:
data, number, percentage, boolean, etc.
If the data type is a date, a single number like 30 would fail the constraint and be invaild
Data range Values must fall between predefined maximum and minimum values If the data range is 10-20, a value of 30 would fail the constraint and be invalid
Mandatory Values can't be left blank or empty If age is mandatory, that value must be filled in
Unique Values can't have a duplicate Two people can't have the same mobile number within the same service area
Regular expression(regex) patterns Values must match a prescribed pattern A phone number must match ###-###-#### (no other characters allowed)
Cross-field validation Certain conditions for multiple fields must be satisfied Values are percentages and values from multiple fields must add up to 100%
Primary-key (Databases only) value must be unique per column A database table can't have two rows with the same primary key value. A primary key is an identifier. 
Set-membership (Databases only) values for a column must come from a set of discrete values Value for a column must be set to Yes, No, or Not Applicable
Foreign-key (Databases only) values for a column must be unique values coming from a column in another table In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table
Accuracy The degree to which the data conforms to the actual entity being measured or described If values for zip codes are validated by street location, the accuracy of the data goes up
Completeness The degree to which the data contains all desired components or measures If data for personal profiles required hair and eye color, and both are collected, the data is complete
Consistency The degree to which the data is repeatable from different points of entry or collection If a customer has the same address in the sales and repair databases, the data is consistent

 


WEEK 2 - Sparkling-clean data

[Definition]

-Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve

-Clean data: Data that is complete, correct, and relevant to the problem you're trying to solve

-Data engineers: Transform data into a useful format for analysis and give it a reliable infrastructure

-Data warehousing specialists: Develop processes and procedures to effectively store and organize data

-Data validation: A tool for checking the accuracy and quality of data before adding or importing it

-Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions

-Remove duplicates: A tool that automatically searches for and eliminates duplicate entries from a spreadsheet

-Split: A tool that divides text around a specified character and puts each fragment into a new, separate cell

-Concatenate: A function that joins multiple text strings into a single string

-Syntax: A predetermined structure that includes all required information and its proper placement

-Pivot table: A data summarization tool that is used in data processing

-Data mapping: The process of matching fields from one data source to another

[Dirty data]

Types Description Possible causes
Duplicate data Any data record that shows up more than once Manual data entry, batch data imports, or data migration
Outdated data Any data that is old which should be replaced with newer and more auucrate information People changing roles or companies, or software and system becoming obsolete
Incomplete data Any data that is missing important fields Improper data collection or incorrect data entry
Incorrect/Inaccurate data Any data that is complete but inaccurate Human error inserted during data input, fake information, or mock data
Inconsistent data Any data that uses different formats to represent the same thing Data stored incorrectly or errors inserted during data transfer

[Data Cleaning Tools]

  • Data validation
  • Conditional formatting
  • COUNTIF
  • Sorting
  • Filtering

[Useful functions]

Description Function Example
기준을 충족하는 셀의 개수 반환 =COUNTIF(range, "value") =COUNTIF(A2:A72, "<100")
텍스트 문자열의 문자 수 반환 =LEN(range) =LEN(A2)
지정한 개수의 문자열 왼쪽부터 반환 =LEFT(range, number of characters) =LEFT(A2,5)
지정한 개수의 문자열 오른쪽부터 반환 =RIGHT(range, number of characters) =RIGHT(A2,4)
지정한 개수의 문자열 원하는 곳부터 반환 =MID(range, reference starting point, number of middle characters) =MID(D2,4,2)
두 개 이상의 텍스트 문자열을 하나로 연결 =CONCATENATE(item1,item2) =CONCATENATE(G2,H2)
=CONCATENATE(G2," ",H2)
지정한 구분자에 의해 문자열을 쪼개어 반환 =SPLIT(range, delimiter) =SPLIT(A2,"-")
텍스트에 불필요한 공백 제거(단어 사이 제외) =TRIM(range) =TRIM(A2)
행별로 항목을 찾아야 할 때 =VLOOKUP(data to look up,'where to look'!Range, column, FALSE) =VLOOKUP(A2,'Sheet 2'!A1:B31,2,FALSE)

 

 

 


WEEK 3 - Cleaning data with SQL

[Definition]

-Typecasting: Converting data from one type to another

 

[SQL function]

-CAST()

Can be used to convert anything from one data type to another

SELECT
      CAST(purchase_price AS FLOAT64)
FROM
      customer_data.customer_purchase
ORDER BY
      CAST(purchase_price AS FLOAT64) DESC

 

-CONCAT()

Adds strings together to create new text strings that can be used as unique keys

SELECT
      CONCAT(product_code, product_color) AS new_product_code
FROM
      customer_data.customer_purchase
WHERE
      product = 'couch'

 

-COALESCE()

Can be used to return non-null values

SELECT
      COALESCE(product, product_code) AS product_info
FROM
      customer_data.customer_purchase

WEEK 4 - Verify and report on your cleaning results

[Definition]

-Verification: A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable

-Documentation: The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort

-Changelog: A file containing a chronologically ordered list of modifications made to a project

-COUNTA: A function that counts the total number of values within a specified range

[The most common problems]

Problems Checklist
Sources of errors Did you use the right tools and functions to find the source of the errors in your dataset?
Null data Did you search for NULLs using conditional formatting and filters?
Misspelled words Did you locate all misspellings?
Mistyped numbers Did you double-check that your numeric data has been entered correctly?
Extra spaces and characters Did you remove any extra spaces or characters using the TRIM function?
Duplicates Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
Mismatched data types Did you check that numeric, date, and string data are typecast correctly?
Messy (inconsistent) strings Did you make sure that all of your strings are consistent and meaningful?
Messy (inconsistent) data formats Did you format the dates consistently throughout your dataset?
Misleading variable labels (columns) Did you name your columns meaningfully?
Truncated data Did you check for truncated or missing data that need correction?
Business Logic Did you check that the data makes sense given your knowledge of the business?

[Review the goal of the project]

  • Confirm the business problem
  • Confirm the goal of the project
  • Verify that data can solve the problem and is aligned to the goal

[CASE statement]

The CASE statement goes through one or more conditions and returns a value as soon as a condition is met

SELECT
      customer_id,
      CASE
          WHEN first_name = 'Tnoy' THEN 'Tony'
          WHEN first_name = 'Sohe' THEN 'Sophie'
          ELSE first_name
          END AS cleaned_name
FROM
      customer_data.customer_name

[Benefits of documentation]

  • Recall data-cleaning errors
  • Inform other users of changes
  • Determine quality of data

WEEK 5 - Optional: Adding data to your resume

[Definition]

-Transferable skills: Skills and qualities that can transfer from one job or industry to another

-Soft skills: Non-technical traits and behaviors that relate to how you work

[Soft skills]

  • Presentation skills
  • Collaboration
  • Communication
  • Research
  • Problem-solving skills
  • Adaptability
  • Attention to detail