Film Popularity Analysis through Combined K-Means Clustering and Gradient Boosted Trees
DOI:
https://doi.org/10.64366/ijids.v2i2.81
Keywords:
Film Popularity; Machine Learning; Clustering; Explainable AI; Audience Ratings
Abstract
The dynamic and competitive nature of the global film industry presents complex challenges in predicting film popularity, as success is shaped by the interplay of production investment, casting decisions, and audience preferences. This research addresses the limitations of previous studies that have focused primarily on direct relationships, such as budget versus box office returns, by introducing an integrated analytical framework that combines K-Means clustering and Gradient Boosted Trees (GBT) with explainable AI techniques. Utilizing the TMDB movie dataset and constructing features such as actor influence and studio power, the study segments films and predicts audience ratings while providing interpretable visualizations. The results reveal four distinct film clusters and demonstrate that actor influence and budget allocation are the most significant predictors of popularity. The proposed model achieves an R² score of 0.75 and a mean squared error of 0.35 in predicting audience ratings, while cluster analysis shows that Blockbuster films reach the highest average ratings (6.76), and Underperforming films the lowest (2.42). By integrating interpretable predictive modeling and interactive scenario tools, this research offers both theoretical advancement and practical value for industry stakeholders. However, the findings are limited by the available metadata and do not account for factors such as marketing or real-time audience trends, suggesting opportunities for future research to expand the analytical framework.
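The two modelling steps named in the abstract, segmenting films with K-Means and predicting ratings with gradient boosted trees scored by R² and mean squared error, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the data are synthetic rather than TMDB records, the two features stand in hypothetically for budget and actor influence, the boosted trees are depth-one stumps, and the metrics are computed on the training data only.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the label of each point from the last assignment step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def fit_stump(X, residuals):
    """Least-squares regression stump: (feature, threshold, left_mean, right_mean)."""
    best = None
    for j in range(len(X[0])):
        # Dropping the largest value keeps the right-hand side non-empty.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def boost(X, y, rounds=30, lr=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient of squared loss
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        pred = [p + lr * (lm if row[j] <= t else rm) for row, p in zip(X, pred)]
    return base, stumps, pred

def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def r_squared(y, pred):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic toy data (NOT from TMDB): two hypothetical features scaled to [0, 1].
rng = random.Random(42)
films = [(rng.random(), rng.random()) for _ in range(60)]
ratings = [2 + 3 * budget + 2 * actor + rng.gauss(0, 0.3) for budget, actor in films]

labels = kmeans(films, k=4)
for c in sorted(set(labels)):
    members = [r for r, l in zip(ratings, labels) if l == c]
    print("cluster", c, "mean rating", round(sum(members) / len(members), 2))

base, stumps, pred = boost(films, ratings)
print("training MSE:", round(mse(ratings, pred), 3))
print("training R^2:", round(r_squared(ratings, pred), 3))
```

The per-cluster mean mirrors the kind of cluster-level average rating quoted above, and the last two lines show how an R² score and a mean squared error are derived from the same predictions. A realistic evaluation would use a held-out test set and a full tree learner rather than stumps.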
Copyright (c) 2025 Agi Candra Bramantia, Desyanti, Jeperson Hutahaean, Erlin Windia Ambarsari

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).