tidyEmoji: Discover, Count, Categorise, Score, Translate and Relate Emoji in Text

A tidy toolkit for working with the emoji in any text column, such as social-media posts, product reviews, chat logs or survey responses. Unicode is awkward to handle and not every code point is an emoji, which makes emoji statistics fiddly to obtain. 'tidyEmoji' extracts, counts, categorises, sentiment-scores and emotion-scores emoji; converts them to and from text (for accessibility and NLP preprocessing); searches the emoji catalogue; maps emoji co-occurrence and sequences (graph-ready edge lists and n-grams); measures where and how densely emoji are used; and builds document-by-emoji feature tables for machine learning. Detection is grapheme-aware, so skin-tone and multi-person sequences stay intact, and results come back as tidy data frames that slot straight into a 'tidyverse' workflow. The bundled emoji sentiment lexicon is from the Emoji Sentiment Ranking of Kralj Novak et al. (2015), doi:10.1371/journal.pone.0144296, released under CC BY-SA 4.0; the emotion lexicon is from EmoTag1200 of Shoeb & de Melo (2020), https://aclanthology.org/2020.emnlp-main.720/, released under the MIT licence.

Output and naming contract

Every verb follows verb(data, text, ...), takes the text column unquoted, and returns a tibble. Columns added to your data carry a dotted .emoji prefix (.emoji, .emoji_name, .emoji_category, .emoji_sentiment, .emoji_n, ...) so they cannot collide with your own columns; new summary tibbles (e.g. the result of emoji_frequency()) use bare names. The name group always refers to the Unicode top-level category (the term used by the underlying emoji::emojis table). Every glyph-to-metadata join is normalised through a codepoint key that strips the U+FE0F variation selector, so qualified and unqualified emoji forms resolve identically in every verb.
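The effect of stripping U+FE0F can be illustrated with a minimal base-R sketch. The helper codepoint_key() below is hypothetical and written only for this illustration; it is not a 'tidyEmoji' function, and the package's internal key format may well differ. It assumes the key is the space-separated upper-case hex code points of the glyph with U+FE0F removed.

```r
# Hypothetical helper (not part of tidyEmoji): build a join key from a glyph's
# code points, dropping U+FE0F (variation selector-16) along the way.
codepoint_key <- function(x) {
  vapply(x, function(glyph) {
    cps <- utf8ToInt(glyph)          # integer code points of one string
    cps <- cps[cps != 0xFE0F]        # strip the variation selector
    paste(sprintf("%04X", cps), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

qualified   <- "\u2764\uFE0F"  # red heart, fully qualified (with U+FE0F)
unqualified <- "\u2764"        # red heart, unqualified (without U+FE0F)

# Both forms collapse to the same key, so a join on the key matches either.
identical(codepoint_key(qualified), codepoint_key(unqualified))  # TRUE
```

Joining on such a key rather than on the raw glyph means a metadata table only needs one row per emoji, whichever form appears in the text.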

Author

Maintainer: Youzhi Yu <yuyouzhi666@icloud.com>
