Determining the Authorial Style of Texts Using a Two-Sample Statistical Testing Approach and the K-Nearest Neighbors Method
Keywords:
authorial style, text authorship attribution, text comparison, two-sample test
Abstract
The paper addresses the problem of determining the authorial style of a text. A method based on a resampling procedure is developed. Texts are treated as sequences of characters generated by different random sources. The resampling procedure is used to obtain test fragments of the text. A two-sample test is then used to check whether the samples are drawn from the same population. Results of numerical experiments on English and Russian texts are presented.
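The general idea of the abstract (resample fragments from two texts, then test whether the two sets of fragments come from the same population) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the distance, or how the test is calibrated. The sketch assumes character-frequency vectors as features, Euclidean distance, a nearest-neighbor coincidence statistic (the share of k nearest neighbors that belong to the same sample), and a permutation-based p-value. All function names and parameters are hypothetical.

```python
import numpy as np


def sample_fragments(text, n_fragments, length, rng):
    """Resampling step: draw random contiguous fragments from a text."""
    starts = rng.integers(0, len(text) - length + 1, size=n_fragments)
    return [text[s:s + length] for s in starts]


def char_profile(fragment, alphabet):
    """Assumed feature map: relative frequency of each alphabet character."""
    counts = np.array([fragment.count(ch) for ch in alphabet], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts


def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1, kind="stable")[:, :k]


def coincidence_share(neighbors, labels):
    """Share of (point, neighbor) pairs that carry the same sample label."""
    return float(np.mean(labels[neighbors] == labels[:, None]))


def knn_two_sample_test(x, y, k=3, n_permutations=200, rng=None):
    """Return the observed statistic and a permutation p-value.

    A large statistic means fragments cluster by source, which speaks
    against the hypothesis that both samples share one population.
    """
    rng = np.random.default_rng() if rng is None else rng
    points = np.vstack([x, y])
    labels = np.array([0] * len(x) + [1] * len(y))
    neighbors = knn_indices(points, k)  # geometry is fixed; only labels move
    observed = coincidence_share(neighbors, labels)
    hits = sum(
        coincidence_share(neighbors, rng.permutation(labels)) >= observed
        for _ in range(n_permutations)
    )
    return observed, (hits + 1) / (n_permutations + 1)


def compare_texts(text_a, text_b, alphabet, n_fragments=30, length=500, seed=0):
    """Resample fragments from both texts and run the two-sample test."""
    rng = np.random.default_rng(seed)
    x = np.array([char_profile(f, alphabet)
                  for f in sample_fragments(text_a, n_fragments, length, rng)])
    y = np.array([char_profile(f, alphabet)
                  for f in sample_fragments(text_b, n_fragments, length, rng)])
    return knn_two_sample_test(x, y, rng=rng)
```

Under these assumptions, `compare_texts(text_a, text_b, "abcdefghijklmnopqrstuvwxyz ")` returns the statistic and a p-value; a small p-value suggests that the two texts were produced by different "random sources", that is, in different styles. Fragments drawn with overlap from one text are not independent, so the p-value should be read as a rough indicator rather than an exact significance level.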
References
1. E. Stamatatos, "A survey of modern authorship attribution methods," Journal of the American Society for Information Science and Technology, vol. 60, no. 3, pp. 538–556, 2009.
2. R. Zheng, J. Li, H. Chen, and Z. Huang, "A framework for authorship identification of online messages: Writing-style features and classification techniques," Journal of the American Society for Information Science and Technology, vol. 57, no. 3, pp. 378–393, 2006.
3. M. Koppel and J. Schler, "Authorship verification as a one-class classification problem," Proc. of the 21st International Conference on Machine Learning, New York: ACM Press, p. 62, 2004.
4. S. Meyer zu Eissen, B. Stein, and M. Kulig, "Plagiarism detection without reference collections," Advances in Data Analysis, Berlin, Germany: Springer, pp. 359–366, 2007.
5. M. Koppel, S. Argamon, and A.R. Shimoni, "Automatically categorizing written texts by author gender," Literary and Linguistic Computing, vol. 17, no. 4, pp. 401–412, 2002.
6. F. Mosteller and D.L. Wallace, "Inference in an authorship problem: a comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers," Journal of the American Statistical Association, vol. 58, no. 302, p. 275, 1963.
7. P. Juola, "Authorship attribution," Foundations and Trends in Information Retrieval, vol. 1, no. 3, pp. 233–334, 2006.
8. F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
9. O. Granichin, Z.V. Volkovich, and D. Toledano-Kitai, Randomized Algorithms in Automatic Control and Data Mining, Springer, 2015.
10. O. Granichin, N. Kizhaeva, D. Shalymov, and Z. Volkovich, "Writing style determination using the KNN text model," Proc. of the 2015 IEEE International Symposium on Intelligent Control, Sydney, Australia, September 21-23, 2015, pp. 900–905.
11. A.N. Shiryaev, Probability-1: Elementary Probability Theory, Mathematical Foundations, Limit Theorems. Moscow: MCCME, 2011 (in Russian).
12. B.S. Duran, "A survey of nonparametric tests for scale," Communications in Statistics - Theory and Methods, vol. 5, pp. 1287–1312, 1976.
13. W.J. Conover, M.E. Johnson, and M.M. Johnson, "Comparative study of tests of homogeneity of variances, with applications to the outer continental shelf bidding data," Technometrics, vol. 23, pp. 351–361, 1981.
14. J.H. Friedman and L.C. Rafsky, "Multivariate generalizations of the Wolfowitz and Smirnov two-sample tests," Annals of Statistics, vol. 7, pp. 697–717, 1979.
15. N. Henze, "A multivariate two-sample test based on the number of nearest neighbor type coincidences," Annals of Statistics, vol. 16, pp. 772–783, 1988.
16. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC, 1993.
17. S. Stein and S. Argamon, "A mathematical explanation of Burrows' Delta," Proceedings of Digital Humanities 2006, Paris, France, 2006.
18. W. Oliveira Jr., E. Justino, and L.S. Oliveira, "Comparing compression models for authorship attribution," Forensic Science International, vol. 228, pp. 100–104, 2013.
19. D.I. Holmes and R. Forsyth, "The Federalist revisited: New directions in authorship attribution," Literary and Linguistic Computing, vol. 10, no. 2, pp. 111–127, 1995.
20. J. Savoy, "Authorship attribution based on a probabilistic topic model," Information Processing and Management, vol. 49, pp. 341–354, 2013.
Published
2015-10-30
How to Cite
Kizhaeva, N. A., & Shalymov, D. S. (2015). Determining the Authorial Style of Texts Using a Two-Sample Statistical Testing Approach and the K-Nearest Neighbors Method. Компьютерные инструменты в образовании (Computer Tools in Education), (5), 14-23. Retrieved from http://cte.eltech.ru/ojs/index.php/kio/article/view/1452
Section
Information Systems