
- efficient multi-threading, scaling real-world workloads almost linearly with core count (up to 128 hardware threads on an AMD Ryzen Threadripper 3990X); see the threading sketch after this list
- SIMD vectorization (SSE through AVX-512) and RTM (Restricted Transactional Memory, part of Intel TSX) based acceleration: up to 16x improvement in compute-bound code and even in memory copying; see the AVX2 sketch after this list
- cache-aware algorithms: up to 50x improvement on some workloads; see the tiling sketch after this list
- up to 20 trillion operations per second in CUDA on a GTX 1080 (thousands of times faster than a CPU implementation); see the kernel sketch after this list
- ray tracing at up to the theoretical hardware limit (6.8 gigarays/second on a laptop RTX 2080 GPU) with OWL and OptiX
- expert system and recommendation engine ProbQA
- performance optimization of an initially single-threaded Java program that runs for days to optimize a flight schedule: in the first 20 hours of work, we improved performance by more than 2x while keeping the program single-threaded.
- a GPU ray-tracing application using OptiX, OWL (OptiX Wrapper Library), and RTX hardware.
- vectorization (SSE2/AVX2/FMA) for ALGLIB: https://www.alglib.net/
- high-performance, massively multithreaded networking application for web scraping. The application consists of multiple servers and even more microservices that communicate with each other via HTTP/REST and Windows named pipes.
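
To illustrate the multi-threading bullet above, a minimal sketch (not project code) of the near-linear scaling pattern: a parallel sum split into per-thread chunks with thread-local accumulators. Real workloads additionally need NUMA placement, false-sharing avoidance, and load balancing.

```cpp
// Minimal sketch: parallel sum with per-thread chunks and local accumulators.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t{1} << 26;  // ~67M elements
    std::vector<std::uint32_t> data(n, 1);
    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::uint64_t> partial(nThreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = n * t / nThreads;
            const std::size_t end = n * (t + 1) / nThreads;
            std::uint64_t sum = 0;  // thread-local: no shared writes inside the loop
            for (std::size_t i = begin; i < end; ++i) sum += data[i];
            partial[t] = sum;       // one write per thread at the end
        });
    }
    for (auto& w : workers) w.join();
    const auto total = std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
    std::printf("threads=%u sum=%llu\n", nThreads, (unsigned long long)total);
}
```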
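
For the SIMD bullet, a minimal AVX2 sketch (one rung of the SSE-to-AVX-512 ladder) that sums floats eight lanes at a time; production code adds alignment handling, multiple accumulators for instruction-level parallelism, and runtime CPU dispatch.

```cpp
// Minimal AVX2 sketch (compile with -mavx2): sum a float array 8 lanes at a time.
#include <immintrin.h>
#include <cstddef>

float sum_avx2(const float* p, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));  // 8 floats per step
    // Horizontal reduction of the 8 lanes down to one float.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float total = _mm_cvtss_f32(s);
    for (; i < n; ++i) total += p[i];  // scalar tail for the remainder
    return total;
}
```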
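
The cache-aware bullet usually comes down to blocking (tiling). A minimal sketch that transposes a matrix in tiles so both the reads and the writes stay cache-resident; the tile edge of 32 is an assumed starting point to tune.

```cpp
// Minimal cache-blocking sketch: tiled matrix transpose. Without tiling,
// either the reads or the writes stride through memory and thrash the cache.
#include <cstddef>

void transpose_tiled(const float* src, float* dst, std::size_t rows, std::size_t cols) {
    const std::size_t B = 32;  // tile edge; tune to the L1/L2 cache size
    for (std::size_t i0 = 0; i0 < rows; i0 += B)
        for (std::size_t j0 = 0; j0 < cols; j0 += B)
            for (std::size_t i = i0; i < i0 + B && i < rows; ++i)
                for (std::size_t j = j0; j < j0 + B && j < cols; ++j)
                    dst[j * rows + i] = src[i * cols + j];
}
```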
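
For the CUDA bullet, a minimal grid-stride SAXPY kernel, illustrating the launch pattern rather than any specific client workload.

```cuda
// Minimal CUDA sketch: SAXPY with a grid-stride loop, so one launch covers any n.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];  // fused multiply-add: 2 FLOPs per element
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<256, 256>>>(3.0f, x, y, n);
    cudaDeviceSynchronize();
    std::printf("y[0]=%f\n", y[0]);  // expect 5.0
    cudaFree(x); cudaFree(y);
}
```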
Projects implemented with GPT-3/3.5/4 and other LLMs:
- Tone and sentiment analysis for a large database of email subjects and bodies. From this I determined the best-selling tones, then implemented a program, using the OpenAI API, that lets my customer rewrite emails in the desired tones.
- Question answering over a database of documents, using fine-tuning of OpenAI models together with embedding extraction and storage.
- Extraction of keywords and summaries for a huge list of websites. This started as an OpenAI-based proof of concept, but because of the high API costs, I later switched to an on-premise large language model.
- ChatGPT-like chat using an Alpaca-finetuned GPT-J large language model for customer-support purposes. It runs on-premise, without API calls to OpenAI.
- Performance optimization of the underlying Transformer technology in CUDA; see the tiled-matmul sketch after this list.
- Lossy compression of offloaded weights when training or running inference with limited GPU memory; see the quantization sketch after this list.
- A chatbot and a dating application, where a GPT-4-based (or local-LLM-based) chatbot talks to the user naturally to collect a summary of the user's personality, interests, and preferences. Maximum-weight graph matching is then run for matchmaking, using the distance between user-summary embeddings as the dissimilarity metric (sketched after this list).
- Multiple consulting engagements, including LLMs for editing, legal, and finance.
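
For the Transformer work above, a sketch of one representative CUDA optimization pattern: a shared-memory tiled matrix multiply, assuming square n-by-n matrices with n divisible by the tile size. Real kernels add fp16/tensor cores, fused epilogues, and double buffering.

```cuda
// Minimal sketch: shared-memory tiled matmul, C = A * B (row-major, n % TILE == 0).
// Launch: matmul_tiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(A, B, C, n);
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the A and B tiles into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // whole tile loaded before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }
    C[row * n + col] = acc;
}
```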
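
For the weight-offload item, a minimal sketch assuming simple symmetric int8 quantization per block of weights; real systems choose block sizes, formats (int8, nf4, and so on), and error bounds per layer.

```cpp
// Minimal sketch: symmetric int8 quantization of a weight block (4x smaller),
// with a single per-block scale for dequantization.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantBlock {
    float scale;                   // dequantization multiplier
    std::vector<std::int8_t> q;    // quantized weights
};

QuantBlock quantize(const float* w, std::size_t n) {
    float maxAbs = 0.0f;
    for (std::size_t i = 0; i < n; ++i) maxAbs = std::max(maxAbs, std::fabs(w[i]));
    QuantBlock b{maxAbs / 127.0f, std::vector<std::int8_t>(n)};
    const float inv = (b.scale > 0.0f) ? 1.0f / b.scale : 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        b.q[i] = static_cast<std::int8_t>(std::lround(w[i] * inv));
    return b;
}

void dequantize(const QuantBlock& b, float* out) {
    for (std::size_t i = 0; i < b.q.size(); ++i) out[i] = b.q[i] * b.scale;
}
```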
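
For the matchmaking step, a sketch with assumed inputs: each user already has a summary embedding. Cosine distance serves as the dissimilarity metric, and a greedy pass stands in for exact maximum-weight matching (production would use an exact algorithm such as blossom).

```cpp
// Minimal sketch: pair up users by cosine distance between summary embeddings,
// greedily matching the most similar available pair first.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

float cosineDistance(const Vec& a, const Vec& b) {
    float dot = 0, na = 0, nb = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
    }
    return 1.0f - dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
}

std::vector<std::pair<int, int>> matchUsers(const std::vector<Vec>& emb) {
    std::vector<std::tuple<float, int, int>> edges;  // (distance, i, j)
    for (int i = 0; i < (int)emb.size(); ++i)
        for (int j = i + 1; j < (int)emb.size(); ++j)
            edges.emplace_back(cosineDistance(emb[i], emb[j]), i, j);
    std::sort(edges.begin(), edges.end());           // smallest distance first
    std::vector<bool> taken(emb.size(), false);
    std::vector<std::pair<int, int>> pairs;
    for (const auto& [d, i, j] : edges)
        if (!taken[i] && !taken[j]) {
            taken[i] = taken[j] = true;
            pairs.emplace_back(i, j);
        }
    return pairs;
}
```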
