tidyEmoji: Discover, Count, Categorise, Score, Translate and Relate Emoji in Text
Source: R/tidyEmoji.R
tidyEmoji-package.Rd

A tidy toolkit for working with the emoji in any text column, such as social-media posts, product reviews, chat logs or survey responses. Unicode is awkward to handle and not every code point is an emoji, which makes emoji statistics fiddly to obtain.

'tidyEmoji' extracts, counts, categorises, sentiment-scores and emotion-scores emoji, and converts them to and from text (for accessibility and NLP preprocessing). It also searches the emoji catalogue, maps emoji co-occurrence and sequences (graph-ready edge lists and n-grams), measures where and how densely emoji are used, and builds document-by-emoji feature tables for machine learning. Detection is grapheme-aware, so skin-tone and multi-person sequences stay intact, and results come back as tidy data frames that slot straight into a 'tidyverse' workflow.

The bundled emoji sentiment lexicon is from the Emoji Sentiment Ranking of Kralj Novak et al. (2015), doi:10.1371/journal.pone.0144296, released under CC BY-SA 4.0. The emotion lexicon is from EmoTag1200 of Shoeb & de Melo (2020), https://aclanthology.org/2020.emnlp-main.720/, released under the MIT licence.
Output and naming contract
Every verb follows verb(data, text, ...), takes the text column unquoted,
and returns a tibble. Columns added to your data carry a dotted
.emoji_* prefix (.emoji, .emoji_name, .emoji_category,
.emoji_sentiment, .emoji_n, ...) so they cannot collide with your own
columns; new summary tibbles (e.g. emoji_frequency()) use bare names.
The name group always refers to the Unicode top-level category (the term used by
the underlying emoji::emojis table). Every glyph-to-metadata join is
normalised through a codepoint key that strips the U+FE0F variation
selector, so qualified and unqualified emoji forms resolve identically in
every verb.
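
To illustrate the idea of a codepoint key, here is a minimal base-R sketch. It is not the package's internal code: emoji_key() is a hypothetical helper, and the space-separated hexadecimal key format is an assumption made for the example. It only shows why dropping U+FE0F lets the qualified and unqualified forms of a glyph match in a join.

```r
# Hypothetical helper, not part of tidyEmoji: build a join key from each
# glyph by dropping the U+FE0F variation selector and writing the remaining
# code points as space-separated hexadecimal.
emoji_key <- function(glyph) {
  vapply(glyph, function(g) {
    cps <- utf8ToInt(g)          # integer code points of one string
    cps <- cps[cps != 0xFE0F]    # strip the variation selector
    paste(sprintf("%04X", cps), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

qualified   <- "\u2764\ufe0f"    # red heart with the variation selector
unqualified <- "\u2764"          # red heart without it

emoji_key(qualified)                                    # "2764"
identical(emoji_key(qualified), emoji_key(unqualified)) # TRUE
```

Because both spellings reduce to the same key, a metadata table keyed this way would match either form of the glyph as it appears in the text.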
Author
Maintainer: Youzhi Yu <yuyouzhi666@icloud.com>