Built a Korean-language LLM that mimics affectionate conversational style ("love language"), trained on 270B tokens from 100K+ real Korean couples' messenger chats, to research AI's ability to reproduce emotion in conversation. Team size: 4
● Accelerated context vector computation for 3M messages by distributing workloads across 4 GPUs, achieving up to 4x speedup.
● Reduced the dimensionality of 1M message vectors and visualized them to identify distinct trends in the message distribution.
● Co-developed a reply-time prediction feature (estimating how long a message takes to get a reply) by fine-tuning the model on 1B tokens, improving prediction accuracy from 12.5% to 40%.
● Built a distributed data processing pipeline that expanded the dataset from 10B to 270B tokens, reducing overfitting.
● Stack: Python, PyTorch, CUDA, Linux, Docker, Kubernetes, NumPy, pandas
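The multi-GPU acceleration bullet above follows a standard data-sharding pattern: split the messages into near-equal chunks, encode each chunk on its own device, and concatenate the results. A minimal sketch, with NumPy standing in for the actual model code; `shard`, `encode`, and `compute_context_vectors` are hypothetical names, and the stub encoder is an assumption (the real pipeline would run the language model's forward pass on each GPU).

```python
import numpy as np

def shard(items, n_shards):
    """Split a list into n_shards near-equal contiguous chunks."""
    base, extra = divmod(len(items), n_shards)
    chunks, start = [], 0
    for i in range(n_shards):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def encode(texts, worker_id):
    # Stand-in encoder (hypothetical): one row per message. The real
    # version would batch-encode `texts` on GPU `worker_id`.
    return np.full((len(texts), 4), float(worker_id))

def compute_context_vectors(messages, n_workers=4):
    # One shard per worker; in the real pipeline each worker is a GPU
    # process, so the shards are encoded in parallel.
    chunks = shard(messages, n_workers)
    parts = [encode(chunk, i) for i, chunk in enumerate(chunks)]
    return np.concatenate(parts, axis=0)
```

Because the shards are independent, throughput scales roughly linearly with worker count, which is consistent with the up-to-4x speedup on 4 GPUs.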
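The dimensionality-reduction bullet can be sketched with PCA via SVD (an assumption; the source does not name the reduction method). The projection below maps high-dimensional message vectors to 2-D points that can then be scatter-plotted to inspect the distribution; `pca_2d` is an illustrative name.

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Economy SVD: rows of Vt are principal directions, sorted by
    # singular value, so Vt[:2] spans the top-2 variance directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T     # shape: (n_samples, 2)

# Usage: reduce synthetic "message vectors" and inspect the spread
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
Y = pca_2d(X)
```

SVD on centered data gives the same components as eigendecomposition of the covariance matrix but is numerically more stable for tall matrices of embeddings.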