Skip to content
iabaako edited this page Oct 7, 2024 · 9 revisions

Overview

The IPA Data Cleaning Package is a Stata package developed by Innovations for Poverty Action (IPA) to streamline the process of cleaning and validating survey data. This package includes a suite of commands designed to handle common data cleaning tasks efficiently.

Software Requirements

Some of the commands in the ipaclean program are heavily dependent on Stata’s data frames. Unfortunately, Stata data frames are only available in Stata 17 or later. Consequently, ipaclean requires that the user has Stata 17.0 or later installed on their machine prior to running ipaclean. IPA employees with older versions of Stata should contact IT for access to a newer version.

Installation

To install the package, use the following command in Stata:

* Install ipaclean using
net install ipaclean, all replace from("https://raw.githubusercontent.com/PovertyAction/ipaclean/main")

* After installation, run the following command to install helper commands:
ipaclean update

* check your version of ipaclean with the command:
ipaclean version

Features

The package includes several key commands:

  • ipaappend: Safely append datasets.
  • ipamergerepeats: Safely merge datasets.
  • ipaodksplit: Create dummy variables from SurveyCTO/ODK style select_multiple type questions.
  • ipaodkmergerepeats: Reshape and Merge ODK/SurveyCTO repeat groups.
  • ipacompare: Compare Datasets across multiple rounds of survey data collection.
  • ipacodebook: Describe data content and export codebook to excel.

ipaappend – Safely Append Datasets

The ipaappend command is designed to safely append Stata-format datasets to an existing dataset in memory with additional features to avoid common issues such as mismatched variable types. This command ensures data consistency and optionally allows for a detailed append report to assess potential type conflicts. ipaappend's safely option is an alternative to using the force option with the Stata default append command which will lead to data loss. The safely option checks for the best data type that can accommodate all values and converts the variable in the master or using datasets so the append can happen without data loss.

Use case:

The ipaappend command is especially useful in cases where datasets need to be appended but variables contain inconsistent data types across datasets eg. price variable is numeric in master dataset and string in using dataset. It ensures that appending doesn’t lead to data loss or incompatibilities that could result from variable mismatches.

Example:

Suppose you have two datasets containing car data — one with data for domestic cars, where price is recorded as a string, and one with data for foreign cars, where price is numeric. Using Stata's native append command with the force option could result in data loss. However, ipaappend with the safely option will handle the type mismatch without data loss.

* Prepare datasets
sysuse auto, clear
keep if foreign == 0
tostring price, replace
save domestic

sysuse auto, clear
keep if foreign == 1
save foreign

* Attempting to append with the native append command results in an error
append using domestic

* Using the force option will result in data loss
append using domestic, force

* Using ipaappend's safely option instead
use foreign, clear
ipaappend using domestic, outfile("append_report.xlsx") safely replace

ipamerge – Safely Merge Datasets

The ipamerge command is designed to safely merge Stata-format datasets on one or more key variables while avoiding issues such as mismatched variable types. This command performs match merges (one-to-one, one-to-many, many-to-one, and many-to-many), ensuring data consistency and offering an alternative to Stata's merge command. With the safely option, ipamerge can handle type conflicts that would otherwise result in data loss or errors, converting variable types where necessary to retain all data during the merge process.

Use case:

The ipamerge command is especially useful for merging datasets with inconsistent variable types. For example, if a variable is numeric in the master dataset but a string in the using dataset, ipamerge with the safely option will convert data types as needed to avoid data loss or merge errors.

Example:

Suppose you have two datasets containing car information, one with foreign as a string variable and another with price as a string variable. Attempting a merge with the default Stata merge command may lead to errors or data loss when using force. However, ipamerge with the safely option will manage these inconsistencies without data loss.

* Prepare datasets
sysuse auto, clear
keep make price trunk weight length turn displacement gear_ratio foreign
tostring foreign, force replace
save "using_data", replace

sysuse auto, clear
keep make price mpg rep78 headroom foreign
tostring price, replace
save "master_data", replace

* Attempting a standard merge with the native merge command will cause an error
merge 1:1 make using "using_data"

* Using the force option will lead to missing values
merge 1:1 make using "using_data", force

* Using ipamerge's safely option to handle data type mismatch safely
use "master_data", clear
ipamerge 1:1 make using "using_data", safely outfile("mergereport.xlsx", replace)

ipaodksplit – Create Dummy Variables from SurveyCTO/ODK select_multiple Questions

The ipaodksplit command is designed to process XLSForm files and create dummy variables for select_multiple questions in SurveyCTO or ODK datasets. It reads the specified XLSForm and generates dummy variables based on the choices provided for each select_multiple question. This command can efficiently handle multiple languages and provides flexibility in variable labeling and ordering.

Use case:

The ipaodksplit command is especially useful for survey datasets with select_multiple questions, allowing researchers to create individual dummy variables for each option in the select_multiple questions. The command provides a much more streamlined and easier method for splitiing select_multiple variables during data cleaning.

Example

In this example, ipaodksplit is used to label the dummy variables and assign value labels for binary responses (e.g., "Yes" and "No"). A prefix is added to avoid naming conflicts.

* Import Data
use "Employment Status and Consumption Patterns in 2023.dta", clear
* Define custom value labels and process select_multiple questions
label define yesno 0 "No" 1 "Yes"
* Split variables and label dummies using yesno value label
ipaodksplit using "Employment Status and Consumption Patterns in 2023.xlsx", order label vallab(yesno) prefix(_)
Clone this wiki locally