Person Identification System from Speech and Laughter Using Machine Learning Algorithms
Abstract
Automated person identification and authentication is paramount for the prevention of cybercrime, for national security, and for the integrity of electoral processes. It is a critical component of Information and Communication Technology (ICT), a mainstay of national development. This paper presents the use of people's speech and laughter for person identification, with a focus on forensic applications in which people laugh while speaking. Features were extracted using the Librosa library in the Python programming language via the Scientific Python Development Environment (Spyder) IDE (version 4.1.3) of the Anaconda distribution, while the Orange data-mining software (version 3.25.0) was used for training, testing and validation of five standard machine learning algorithms: Neural Networks (NN), Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB) and Logistic Regression (LR). Results showed that the neural network classifier gave the best accuracy, followed by the SVM. Combining speech and laughter increased the validation metrics by an average of 17.6% over speech alone and 14.1% over laughter alone. This research is particularly useful in forensics, especially for recognising criminals in conversation.