Literary Style Determination Based on Statistical Hypothesis Testing and KNN Approach
Keywords:
writing style, authorship attribution, two-sample test, re-sampling
Abstract
This paper presents a method for determining literary style, based on a re-sampling approach and character-level features. A text is treated as a sequence of characters (n-grams) generated by different random sources. A bootstrap-like procedure is used to draw samples from the texts, and the Kolmogorov-Smirnov two-sample test and a KNN-based statistic are applied to these samples. Experiments on texts in English and Russian illustrate how the algorithm operates.
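The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameters, so the trigram size, the chunk length, the number of draws, the L1 distance between n-gram profiles, and all function names are assumptions chosen for illustration. The sketch draws random chunks from two texts, compares their character n-gram profiles, and measures with the two-sample Kolmogorov-Smirnov statistic how far the within-text distances lie from the between-text distances. The KNN-based statistic is not shown.

```python
import random
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams in `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def l1_distance(p, q):
    """L1 distance between two sparse profiles (assumed metric, not from the paper)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def draw_chunk(text, size, rng):
    """Bootstrap-like draw: a contiguous chunk starting at a random offset."""
    start = rng.randrange(len(text) - size + 1)
    return text[start:start + size]


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both samples past every value <= v, then compare the CDFs at v.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def style_gap(text_a, text_b, size=200, draws=100, n=3, seed=0):
    """KS statistic between within-text and between-text profile distances."""
    rng = random.Random(seed)
    within, between = [], []
    for _ in range(draws):
        a1 = ngram_profile(draw_chunk(text_a, size, rng), n)
        a2 = ngram_profile(draw_chunk(text_a, size, rng), n)
        b = ngram_profile(draw_chunk(text_b, size, rng), n)
        within.append(l1_distance(a1, a2))   # same source
        between.append(l1_distance(a1, b))   # possibly different sources
    return ks_statistic(within, between)
```

A value of `style_gap` close to 1 suggests that chunks of the second text differ from the first text more than the first text differs from itself; a value close to 0 suggests the two texts are hard to tell apart at the character level. Each text must be at least `size` characters long. Turning the statistic into an accept/reject decision would need a significance threshold, which this sketch leaves out.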
Published
2015-10-30
How to Cite
Kizhaeva, N. A., & Shalymov, D. S. (2015). Literary Style Determination Based on Statistical Hypothesis Testing and KNN Approach. Computer Tools in Education, (5), 14-23. Retrieved from http://cte.eltech.ru/ojs/index.php/kio/article/view/1452
Section
Informational systems
This work is licensed under a Creative Commons Attribution 4.0 International License.