Google's New AI Can Read 400 Pages

Gemini Pro 1.5 enters the AI Race, Predicting Parkinson's Disease and Trusting AI in Decision Making,

Welcome curious enthusiasts, knowledge seekers and hardcore researchers.

The world of AI is moving fast, Google Bard is gone, Gemini is now front and center. Gemini Pro 1.5 made a splash with context window boasting an eye watering 1 million tokens. In acknowledgement of this feat, today’s One Minute Papers kicks off with a look on the latest research on Gemini followed by how deep learning is being applied to Parkinson’s disease detection ending with advances on how LLM’s are parsing complexed structured data such as documents and tables.

The System

Let’s start.

An Evaluation of GPT-4V and Gemini in Online VQA

Mengchen Liu, Chongyan Chen, Danna Gurari

Key Topics: Visual Question Answering, Multimodal Models, GPT-4V, Gemini, Model Evaluation

Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: 

Result: This paper evaluates the performance of two large multimodal models, GPT-4V and Gemini, in online visual question answering. The study categorizes existing assessments into qualitative and quantitative evaluations, highlighting the strengths and weaknesses of the models. GPT-4V excels in science-related super-topics, particularly Social Sciences, while Gemini performs well in Religion and Spirituality. The analysis underscores the importance of possessing multiple image processing capabilities for effective visual question answering.

An In-depth Look at Gemini's Language Abilities

Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, Graham Neubig

Key Topics: Language Model Evaluation, Gemini vs. GPT Models, Multilingual Capabilities, Machine Translation, Model Performance Analysis

Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time:  

Result: The paper explores the language abilities of the Google Gemini models compared to the OpenAI GPT series. Through a comprehensive evaluation across various tasks, Gemini Pro demonstrates comparable accuracy to GPT 3.5 Turbo in English tasks but excels in translating into other languages. The study highlights Gemini Pro's impressive performance in specific languages like South Levantine Arabic, Romanian, and Mesopotamian Arabic. However, general-purpose language models have not surpassed dedicated machine translation systems for non-English languages. The research provides insights into the strengths and weaknesses of Gemini and GPT models in different language tasks, offering a detailed analysis of their performance.

Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Yuqing Wang, Yun Zhao

Key Topics: Commonsense Reasoning, Multimodal Large Language Models, Dataset Analysis, Contextual Understanding, Emotion Recognition in Images

Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: 

Result: The paper evaluates the commonsense reasoning capabilities of Gemini Pro, a Multimodal Large Language Model (MLLM). Through a comprehensive analysis across 12 diverse datasets, the study reveals that Gemini Pro performs comparably to GPT-3.5 Turbo in language-based tasks but lags behind GPT-4 Turbo in accuracy. The model struggles with tasks requiring deep contextual understanding, abstract reasoning, temporal dynamics, social scenarios, and emotion recognition in images. The findings emphasize the need for further research to enhance specialized domains and nuanced recognition of mental states and emotions in multimodal contexts within LLMs and MLLMs.

Predicting Parkinson's Disease Evolution Using Deep Learning

Maria Frasca, Davide La Torre, Gabriella Pravettoni, Ilaria Cutica

Key Topics: Parkinson's Disease Diagnosis, Deep Learning, MRI Image Analysis, Data Augmentation, Temporal Data Processing

Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: 

Result: The paper presents a deep learning approach to improve the diagnosis and staging of Parkinson's disease, which is crucial for patient care and treatment planning. By leveraging the Parkinson's Progression Markers Initiative dataset and employing a combination of 3D-CNNs and recurrent neural network layers, the study aims to enhance the accuracy of classifying disease progression stages. The use of advanced pre-processing and data augmentation techniques further strengthens the model's performance, potentially contributing to the field of medical AI and neurodegenerative disease research.

Trusting AI in High-stake Decision Making

Ali Saffarini

Key Topics: Trust in AI, Human-AI Interaction, AI Development, Psychological Factors, Decision-Making

Link: here | AI Score: 🚀 | Interest Score: 🧲 🧲 | Reading Time: 

Result: The paper explores the challenges of establishing trust between artificial intelligence systems and humans. It discusses how the absence of human-like traits in AI, such as consciousness and emotions, creates barriers to trust. The study emphasizes the importance of involving people in the AI development process to increase trust. Additionally, it highlights the psychological factors, such as emotionlessness and inflexibility, that influence trust in AI systems. The paper concludes that addressing both technical and psychological barriers is essential for building trust in AI for important decision-making.

Large Language Models are Complex Table Parsers

Bowen Zhao, Changkai Ji, Yuejie Zhang, Wen He, Yingwen Wang, Qing Wang, Rui Feng, Xiaobo Zhang

Key Topics: Complex Table QA, GPT-3.5, Hierarchical Structure Parsing, Dialogue Prompts, Benchmark Dataset Performance

Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 | Reading Time: 

Result: The paper discusses the use of GPT-3.5 as a parser for Complex Table QA tasks. The authors propose a novel approach that leverages GPT-3.5 to enhance the model's ability to understand the hierarchical structure of complex tables. They address the input token limitation of GPT-3.5 by designing single-turn and multi-turn dialogue prompts tailored to different table lengths. Extensive experiments on benchmark datasets show that their method outperforms previous state-of-the-art methods in Complex Table QA tasks. The paper highlights the importance of utilizing multi-turn dialogue to gather valuable context for Complex Table QA.

DocLLM: A layout-aware generative language model for multimodal document understanding

Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu

Key Topics: Multimodal Document Understanding, Generative Language Models, Layout Structures, Optical Character Recognition (OCR), Zero-shot Learning

Link: here | AI Score: 🚀 🚀 | Interest Score: 🧲 🧲 | Reading Time: 

Result: The paper introduces DocLLM, a generative language model that enhances document understanding by incorporating complex layout structures. DocLLM allows for the inclusion of visually rich documents in pre-training without extensive preprocessing. It addresses the limitations of previous models by considering multi-page awareness and spatial layout structures. The model's performance can be improved by using accurate OCR engines. The paper also discusses the methodology for zero-shot results and presents results from various tasks such as key information extraction, natural language inference, and visual question-answering. Overall, DocLLM shows promising results in improving performance for document intelligence tasks.

-That’s all for now-

AI Trivia

Origin of "Artificial Intelligence": In 1956, at the Dartmouth Conference, John McCarthy proposed the term "Artificial Intelligence," which marked the official beginning of AI as a field. This conference brought together researchers interested in neural nets, the theory of computation, and the possibility of replicating human intelligence, setting the foundation for AI research.

McCarthy was a pioneering figure in the field of AI, whose contributions laid foundational stones for the discipline. Not only did he coin the term, but his work spanned several decades and encompassed significant achievements that have profoundly influenced computer science and AI.

ChatGPT