SAVI outlier analysis using R.

Description of the project:

This project involves a meticulous investigation into outliers within a dataset that captures a range of educational health indicators across various school corporations over multiple years. Utilizing robust statistical techniques, our analysis aims to detect and understand anomalies within the data, potentially revealing crucial insights into public health trends and educational impacts.

Outlier Analysis Data:

Each record in the dataset corresponds to one combination of year, geography, and indicator. These records hold the actual data points that will be examined to detect anomalies.

Geographic Contextual Data:

This part of the dataset provides more detail about the geographic areas in question, which are key to the analysis. It includes a unique identifier and descriptive information for each geographic location, so that data points can be mapped to specific areas and the results interpreted in the proper context.
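As a minimal sketch of how records might be mapped to their geographic context, the example below joins two small data frames on a shared identifier. The column names (`geo_id`, `name`, `value`) and the sample rows are assumptions made for illustration, not the project's actual schema.

```r
# Hypothetical sample data; real column names may differ.
records <- data.frame(
  geo_id = c(101, 102, 999),
  value  = c(3.2, 4.1, 5.0)
)
geographies <- data.frame(
  geo_id = c(101, 102),
  name   = c("Corporation One", "Corporation Two")
)

# Left join: keep every record, even when no geography matches.
joined <- merge(records, geographies, by = "geo_id", all.x = TRUE)

# Records whose identifier has no entry in the geography table.
unmatched <- joined[is.na(joined$name), ]
print(unmatched)
```

Keeping unmatched rows (`all.x = TRUE`) makes missing geography entries visible instead of silently dropping those records.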

Indicator Visualization Data:

The dataset contains details about the indicators that are being analyzed, including whether there is a corresponding visualization available for each indicator. This facilitates a better understanding of the data and assists in communicating the findings, as visual representations can make it easier to identify patterns and convey complex information to stakeholders.


The project currently has three goals.


The first goal is to compute the standard deviation, mean, bias, and trend of the data.
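The following sketch shows one way these summary statistics might be computed per indicator in base R. It makes several assumptions that should be checked against the real data: the column names (`indicator`, `year`, `value`) are hypothetical, "bias" is interpreted here as the difference between mean and median, and "trend" as the slope of a linear fit of value against year.

```r
# Hypothetical toy data; the column names are assumptions.
records <- data.frame(
  indicator = rep(c("A", "B"), each = 4),
  year      = rep(2018:2021, times = 2),
  value     = c(10, 12, 14, 16, 5, 5, 6, 20)
)

# Summarise the records of a single indicator.
summarise_indicator <- function(df) {
  fit <- lm(value ~ year, data = df)  # linear trend over time
  data.frame(
    indicator = df$indicator[1],
    mean      = mean(df$value, na.rm = TRUE),
    sd        = sd(df$value, na.rm = TRUE),
    # Assumed definition of bias: mean minus median.
    bias      = mean(df$value, na.rm = TRUE) - median(df$value, na.rm = TRUE),
    trend     = unname(coef(fit)["year"])  # change in value per year
  )
}

# Apply the summary to each indicator and stack the results.
summary_stats <- do.call(
  rbind,
  lapply(split(records, records$indicator), summarise_indicator)
)
print(summary_stats)
```

If the project defines bias differently (for example, relative to a reference value), only the `bias` line would need to change.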


The second goal is to count the outliers for each indicator, to see whether some indicators are consistently problematic.
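A minimal sketch of counting outliers per indicator is given below. The project does not state which outlier rule it uses, so this example assumes the common 1.5 × IQR rule; the column names and sample values are also hypothetical.

```r
# Hypothetical toy data; the column names are assumptions.
records <- data.frame(
  indicator = rep(c("A", "B"), each = 6),
  value     = c(10, 11, 12, 13, 14, 50, 5, 6, 7, 8, 9, 10)
)

# Flag values outside the 1.5 * IQR fences (an assumed rule).
is_outlier <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Evaluate the rule within each indicator; ave() returns 0/1 here.
records$outlier <- ave(records$value, records$indicator, FUN = is_outlier) == 1

# Count flagged values per indicator.
outlier_counts <- aggregate(outlier ~ indicator, data = records, FUN = sum)
print(outlier_counts)
```

Computing the fences separately for each indicator matters because indicators are typically measured on different scales.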


The third goal is a comparative analysis that quantifies and compares the counts of high and low outliers by geographic region, which in this context means the school corporations. This will help determine whether some regions have significantly more or fewer outliers, which could indicate regional disparities or inconsistencies in data collection.
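The comparison by region could be sketched as follows. As before, the 1.5 × IQR rule, the column names, and the sample school corporations are assumptions for illustration only.

```r
# Hypothetical toy data; names and values are assumptions.
corps <- c("Corp 1", "Corp 2", "Corp 3", "Corp 4", "Corp 5", "Corp 6")
records <- data.frame(
  geography = rep(corps, times = 2),
  indicator = rep(c("A", "B"), each = 6),
  value     = c(10, 11, 12, 13, 14, 50, -30, 20, 21, 22, 23, 24)
)

# Label each value as "high", "low", or "none" using 1.5 * IQR fences.
classify <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  out <- rep("none", length(x))
  out[!is.na(x) & x > q[2] + fence] <- "high"
  out[!is.na(x) & x < q[1] - fence] <- "low"
  out
}

# Classify within each indicator, then restore the original row order.
records$direction <- unsplit(
  lapply(split(records$value, records$indicator), classify),
  records$indicator
)

# Cross-tabulate high and low outliers by school corporation.
# Values labelled "none" become NA in the factor and are left out.
counts <- table(
  geography = records$geography,
  direction = factor(records$direction, levels = c("high", "low"))
)
print(counts)
```

The resulting table has one row per school corporation, so regions with unusually many high or low outliers can be read off directly.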

The primary aim is to identify outlier data points across indicators and geographies to ascertain whether any indicators or regions exhibit problematic trends. We will contextualize geographic data with detailed information about the locales, providing a nuanced understanding of each school corporation's characteristics. Additionally, we will analyze the availability and utility of visualizations for these indicators, which will help in interpreting and disseminating our findings.

Data set:

CSV files

Anticipated Key Findings and Deliverables:

Our analysis is expected to lead to several key deliverables, including:

A detailed report identifying the outliers for each health indicator, with a special focus on their implications for educational outcomes and public health.

A geographic analysis report that highlights regional patterns and disparities, potentially informing targeted interventions.

A portfolio of data visualizations that succinctly present our findings to various stakeholders, from policymakers to educational leaders.


Currently working on Goal 1

Data Collection: Gather additional data as required, especially geographic information for Goals.

Data Cleaning: Ensure the data is clean and suitable for analysis. This would involve standardizing the formats, handling missing values, and possibly normalizing the data.

Data Preprocessing: Clean and prepare the dataset for analysis, including normalization and treatment of missing values.
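The cleaning and preprocessing steps above might look roughly like the sketch below. The column names, the sample values, and the choice of z-score normalization within each indicator are all assumptions; the project may handle missing values or normalization differently.

```r
# Hypothetical raw data with inconsistent formats and missing values.
raw <- data.frame(
  indicator = c(" A", "A ", "a", "B", "B", "B"),
  year      = c("2019", "2020", "2021", "2019", "2020", "2021"),
  value     = c("10", "12", "n/a", "5", "", "7"),
  stringsAsFactors = FALSE
)

clean <- raw

# Standardize formats: trim whitespace, unify case, fix column types.
clean$indicator <- toupper(trimws(clean$indicator))
clean$year      <- as.integer(clean$year)

# Non-numeric entries such as "n/a" or "" become NA.
clean$value <- suppressWarnings(as.numeric(clean$value))

# Handle missing values by dropping them (one of several possible choices).
clean <- clean[!is.na(clean$value), ]

# Normalize within each indicator using z-scores (an assumed method).
clean$z <- ave(
  clean$value, clean$indicator,
  FUN = function(x) (x - mean(x)) / sd(x)
)
print(clean)
```

Dropping rows is the simplest treatment of missing values; imputation may be preferable if many values are missing.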

After these steps, we will identify the outliers, carry out Goal 3, and then produce the visualizations.