Implementation of K-Means to Classify Poverty Based on Housing Characteristics in Central Java in 2021

Poverty is a condition in which a person is unable to meet basic needs such as food, clothing, shelter, and education, and therefore cannot secure his or her own survival. Development programs, especially those aimed at reducing poverty, can be supported by grouping districts/cities with cluster analysis. Cluster analysis can identify how poverty relates to housing characteristics in Central Java, which can then be taken into consideration so that development programs are better targeted. Cluster analysis is a grouping method in which members of the same group share similar characteristics, while different groups have different characteristics. K-means is one of the data mining algorithms that can be used for clustering. The purpose of this study was to classify poverty in the districts/cities of Central Java based on housing indicators, namely the floor area of the house, the building material of the widest floor, the source of drinking water, the main building material of the roof, and the main fuel for cooking. The study yielded three clusters: cluster 1 with 22 districts and cities, cluster 2 with 5 districts and cities, and cluster 3 with 8 districts and cities. Cluster 1 was characterized by the source of drinking water and the type of fuel used for cooking, cluster 2 by the floor area of the house, and cluster 3 by the material of the widest floor and the main material of the roof.


Introduction
Central Java is a province on the island of Java located between two large provinces, West Java and East Java. It consists of 29 regencies and 6 cities and covers 3.28 million hectares, or around 25.04 percent of the total area of Java Island. The province faces a notable challenge, namely poverty. Poverty refers to a state in which individuals lack the means to fulfil essential needs such as food, clothing, housing, and education, which impedes their ability to ensure their own sustenance and well-being. Development programs, especially those aimed at reducing poverty, can be supported by grouping districts or cities. Districts/cities with homogeneous poverty indicators and characteristics can be placed in the same group through cluster analysis, in which the strength of the relationship between objects is the basis for forming clusters. Cluster analysis is a grouping method in which members of the same group share similar characteristics, while different groups have different characteristics. There are various ways to form clusters, one of which is to set rules that assign members to the same group based on their level of similarity. One data mining algorithm that can be used for clustering is k-means. This study uses k-means because it is a simple clustering method that handles numerical data with fast computation and has been used by many previous researchers.
Previous research by Rianda (2022) on the application of the K-Means and K-Medoids algorithms to grouping provinces in Indonesia based on household housing indicators in 2020 concluded that the K-Means algorithm grouped the provinces better than the K-Medoids algorithm. The purpose of this study was to classify poverty in the districts or cities of Central Java based on housing indicators, namely the floor area of the house, the building material of the widest floor, the source of drinking water, the main building material of the roof, and the main fuel for cooking.

Method
This study used the K-means algorithm, which is an algorithm for clustering. According to Asroni et al. (2018), the clustering process with the K-means algorithm is as follows:
1. Determine the desired number of clusters.
2. Distribute the data among the clusters according to the number that has been determined.
3. Determine the centroid value of each cluster.
4. Calculate the distance from each object to each centroid using the Euclidean formula.
5. Group the objects based on the lowest distance obtained from the Euclidean formula.
6. If the results are not yet appropriate, repeat the iteration from step 3. The iteration stops when the clustering results are the same as in the previous iteration.
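The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study: the function names, the optional `init` argument, the random initialization, and the toy data are all assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of numeric tuples."""
    if init is None:
        # Assumption: start from k distinct observations drawn at random.
        init = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Steps 4-5: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two well-separated groups in two dimensions.
toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
# Initial centroids are passed explicitly here so the run is reproducible.
labels, centroids = kmeans(toy, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
```

With these toy points the first three observations end up in one cluster and the last three in the other. On real data the result depends on the initial centroids, so it is common to try several initializations.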
The centroid values can be determined from the range of values in the data source. According to Sani (2018), the distance is calculated with the Euclidean formula as follows:

$$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where $d$ is the distance between the objects, $p_k$ is the $k$-th coordinate of object $p$, $q_k$ is the $k$-th coordinate of object $q$, $k$ is the index of the coordinate, and $n$ is the number of coordinates.

The main source of information for this research was secondary data, that is, research data obtained indirectly through intermediary media and recorded by other parties. This study used housing-based poverty data for Central Java in 2021, obtained from the Central Statistics Agency (BPS). The data were collected for several indicators: the floor area of the house, the building material of the widest floor (earth), the source of drinking water, the main building material of the roof, and the primary fuel used for cooking. To better target development projects and combat poverty, these data were processed with the K-Means algorithm to form clusters.

Results
The research data consisted of housing characteristics in the districts/cities of Central Java, namely the floor area of the house, the building material of the widest floor, the source of drinking water, the main building material of the roof, and the main fuel used for cooking. There are 35 observations for each variable, one per district/city.

Summary
This section summarizes the data with descriptive statistics for each variable. The summary table shows the characteristics of the data: each variable has 35 observations and no missing values, and the table also reports the mean, median, mode, maximum, and minimum of each variable, among other statistics.
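A summary of this kind could be produced as in the following sketch, which uses only the Python standard library. The function name `summarize`, the use of `None` to mark a missing value, and the sample values are assumptions for illustration. They are not the study's data or tooling.

```python
import statistics


def summarize(values):
    """Descriptive statistics for one variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical percentages for one indicator across a few districts.
x1 = [4.2, 3.1, 5.0, 3.1, 6.4]
summary = summarize(x1)
```

Calling `summarize` once per variable gives one row of the summary table for each of x1 to x5.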

Data Preparation
The first step in data preparation is data input, in which the relevant data are collected and recorded. The next step is selecting the columns to be analyzed. The data are then examined to identify and address any potential outliers. The results of the outlier check are as follows:

Figure 1. Output Outliers

Based on Figure 1, the output values for x1 to x5 were all 0, which means that the data contained no outliers and no observations needed to be removed. The next step was to check whether the variables were correlated. The Pearson correlation method was used because it assesses the linear relationship between two numerical variables. The results were as follows:

Figure 2. Correlation output

Based on Figure 2, the correlation between x1 and x2 was -0.38, between x1 and x3 was -0.35, and so on, up to the correlation between x4 and x5, which was -0.19. The figure indicated a correlation between x3 and x5, marked with **. Consequently, only one of these two variables could be used, and variable x5 was omitted. After x5 was removed, the data were standardized so that every variable was expressed on a consistent scale.
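The three preparation checks described above (outlier count, Pearson correlation, and standardization) can be sketched as follows. The text does not state which outlier rule or software was used, so the 1.5 × IQR rule, the function names, and the sample values are assumptions made for illustration.

```python
import math
import statistics


def count_outliers(values):
    """Count values beyond 1.5 * IQR from the quartiles (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def standardize(values):
    """Z-scores: subtract the mean, divide by the sample standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical values for two indicators across six districts.
x3 = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
x5 = [30.0, 27.0, 25.0, 20.0, 18.0, 15.0]

n_outliers = count_outliers(x3)  # number of flagged observations
r = pearson(x3, x5)              # strongly negative for these toy values
z3 = standardize(x3)             # mean 0, standard deviation 1
```

Standardizing before clustering matters because Euclidean distance is otherwise dominated by whichever variable has the largest numeric range.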

Clustering Distance Measures
The choice of distance measure is an important step in clustering. It defines how the similarity of two elements (x, y) is calculated, and it affects the shape of the clusters. The distance matrix can be seen in the figure below:

Figure 3. Distance Matrix Visualization
In Figure 3, the colour gradient indicates the distance between districts/cities. A deeper purple shade indicates a shorter distance, implying a higher likelihood of belonging to the same cluster. Conversely, a pinker hue indicates a greater distance, suggesting a higher probability of belonging to different clusters.
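The matrix behind a visualization like Figure 3 holds the pairwise distances between observations. The sketch below computes such a matrix with Euclidean distance. The function name and the three sample rows are hypothetical, and the plotting step is omitted.

```python
import math


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])
            dist[i][j] = dist[j][i] = d  # distance is symmetric
    return dist


# Hypothetical standardized observations (one tuple per district).
rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = distance_matrix(rows)
```

Small entries in `dist` correspond to the darker cells of the heat map, and large entries to the lighter ones.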

Calculating K-Means Clustering
The first step is to determine the number of clusters. The number of clusters was determined with the Elbow method, with the following results:

Figure 4. Elbow Outputs

Figure 4 shows that the ideal number of clusters is three, as evidenced by the change in slope at cluster number 3. The cluster members were then determined with k = 3, giving the following results:

Figure 5. Cluster Visualization

Figure 5 shows that cluster 1 comprised 22 districts and cities, cluster 2 comprised 5 districts and cities, and cluster 3 comprised 8 districts and cities. The features of each cluster can then be determined by comparing the standardized data with the original data. The resulting table shows the unstandardized data, that is, the data returned to the original scale. From this table, it can be concluded that Cluster 1's similarity indicators were x3 and x5, Cluster 2's similarity indicator was x1, and Cluster 3's similarity indicators were x2 and x4.
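As an illustration of the Elbow method used to choose the number of clusters, the sketch below computes the total within-cluster sum of squares (WSS) for several values of k. The "elbow" is the k after which WSS stops dropping sharply. The compact k-means, the random restarts, the function names, and the toy data are assumptions included only to keep the example self-contained. They are not the study's code.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Compact k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed random initialization
    labels = None
    for _ in range(max_iter):
        new = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new == labels:  # assignments unchanged: converged
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


def total_wss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )


def elbow_curve(points, k_max, restarts=10):
    """Lowest WSS found over several restarts, for each k from 1 to k_max."""
    return {
        k: min(
            total_wss(points, *kmeans(points, k, seed=s))
            for s in range(restarts)
        )
        for k in range(1, k_max + 1)
    }


# Hypothetical toy data: three groups of three points each.
toy = [
    (0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
    (10.0, 0.1), (10.4, 0.5), (9.8, 0.3),
    (5.0, 9.0), (5.3, 9.4), (4.8, 9.2),
]
curve = elbow_curve(toy, k_max=4)
```

Plotting `curve` with k on the horizontal axis and WSS on the vertical axis gives an elbow plot of the kind shown in Figure 4.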

Discussion
Cluster analysis is a grouping method in which members of the same group share similar characteristics, while different groups have different characteristics. The data used in this study described housing characteristics in the districts and cities of Central Java, namely the floor area of the house, the material of the widest floor, the source of drinking water, the main material of the roof, and the main cooking fuel. The study revealed three distinct clusters. Cluster 1 comprised 22 regencies/cities characterized by similar patterns in the source of drinking water and the fuel used for cooking. Cluster 2 consisted of 5 regencies/cities distinguished by similarities in house floor area. Lastly, cluster 3 encompassed 8 regencies/cities identified by commonalities in the building material of the widest floor and the main roofing material. These results are intended to support the successful implementation of development programs, especially those aimed at reducing poverty in Central Java. They are in line with the research conducted by Anggraini & Muharom (2017) on grouping sub-districts based on the education sector using the k-means cluster method. The variables used in that study were students, the number of schools, the number of teachers, and the number of students.

Conclusion
Based on the above analysis, the following conclusions can be drawn. The regencies of Banyumas, Purbalingga, Banjarnegara, Kebumen, Purworejo, Wonosobo, Magelang, Boyolali, Klaten, Sukoharjo, Wonogiri, Karanganyar, Pati, Kudus, Jepara, Demak, Temanggung, Kendal, and Brebes, as well as the cities of Pekalongan and Tegal, were part of Cluster 1. Semarang Regency and the cities of Magelang, Surakarta, Salatiga, and Semarang were part of Cluster 2. The regencies of Cilacap, Sragen, Grobogan, Blora, Rembang, Batang, Pekalongan, and Pemalang were part of Cluster 3. Cluster 1 was characterized by the source of drinking water and the cooking fuel used, Cluster 2 by the floor area of the house, and Cluster 3 by the building material of the widest floor and the main building material of the roof.