Hosting the Mistral 7B AI Model on Linux Without a GPU and Querying It from a React Component
Why pay for AI tokens when you can self-host on your laptop?
Mistral 7B is a powerful AI model that you can set up and run on a Linux computer without a dedicated graphics card (GPU). This guide explains, in simple terms, how to prepare your computer, install the necessary software, and start using Mistral 7B. You'll learn which tools you need, the steps to follow, and some tips to make sure everything works smoothly. Even if you're new to AI or don't have a GPU, this introduction will help you host Mistral 7B on your own Linux system.
We need LLM inference software that can run the model after we download it locally to our PC or laptop. llama.cpp is a tool that does exactly that. It's distributed as C/C++ source on GitHub, so we need to clone and build it before use.
If the C/C++ compiler toolchain is not installed, install it with:
sudo apt install make gcc g++
Then clone the repository and build it:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
make
Compilation should start; let it finish.
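When the build finishes, the binaries are placed in the repository root. A quick sanity check (note: on newer llama.cpp versions the HTTP server binary is built as llama-server rather than server):
./server --help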
Now that the inference software is ready, let's download the model. We choose Mistral-7B-Instruct-v0.2, but it can be any model of your choice. We need the model in GGUF format, so head to the Hugging Face website to download it.
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
I recommend using the huggingface-hub Python library:
pip3 install huggingface-hub
Download the model file. It's around 7 GB, so it will take some time.
cd models
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q8_0.gguf --local-dir . --local-dir-use-symlinks False
Create a system prompt file, system_prompt.json:
nano ~/llama.cpp/system_prompt.json
with the following content:
{
  "prompt": "You are a PC Configuration Assistant. There are four PC types available: Epic Extreme Power v2, Budget Friend, Speed King, Balance Perfect, according to Speed, Budget, and Power requirements. Select one based on the user's requirements and recommend only one. Always, repeat always, add the name of the PC type in the last line after a :\nUser:",
  "anti_prompt": "User:",
  "assistant_name": "Assistant:"
}
Run the HTTP server that provides the completion API. Note: adjust the paths below to match your setup.
/root/llama.cpp/server -spf /root/llama.cpp/system_prompt.json --host 0.0.0.0 -m /root/llama.cpp/models/mistral-7b-instruct-v0.2.Q8_0.gguf
The HTTP server is now running on port 8080.
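You can verify the completion API from the command line before wiring up the frontend. A minimal sketch using curl against the server's /completion endpoint (the prompt text is just an example):
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "User: I need a budget PC for web browsing\nAssistant:", "n_predict": 64}'
The response is JSON containing the generated content along with the generation settings and timings.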
You may additionally want to run this as a service so it keeps running in the background; a sketch follows.
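A minimal systemd sketch, assuming the same /root/llama.cpp paths as above (adjust the paths and user for your machine), saved as /etc/systemd/system/llama-server.service (the unit name is just an example):
[Unit]
Description=llama.cpp server for Mistral 7B
After=network.target

[Service]
ExecStart=/root/llama.cpp/server -spf /root/llama.cpp/system_prompt.json --host 0.0.0.0 -m /root/llama.cpp/models/mistral-7b-instruct-v0.2.Q8_0.gguf
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then reload and enable it:
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server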
If you have a firewall, open port 8080 so the server can accept traffic.
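For example, if you use ufw (other firewalls have their own equivalents):
sudo ufw allow 8080/tcp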
Opening http://localhost:8080 in a browser should show the server's built-in web UI, which confirms that everything is working. You can use this UI for debugging and testing the model.
Using the completion API in a React.js component.
Add the two React components below, also available at https://github.com/honeydreamssoftwares/llama-cpp-react.
//completion.tsx
interface Params {
api_key?: string;
n_predict: number;
stream: boolean;
}
interface Config {
controller?: AbortController;
api_url?: string;
}
interface CompletionParams extends Params {
prompt: string;
}
interface EventError {
message: string;
code: number;
type: string;
}
interface GenerationSettings {
n_ctx: number;
n_predict: number;
model: string;
seed: number;
temperature: number;
dynatemp_range: number;
dynatemp_exponent: number;
top_k: number;
top_p: number;
min_p: number;
tfs_z: number;
typical_p: number;
repeat_last_n: number;
repeat_penalty: number;
presence_penalty: number;
frequency_penalty: number;
penalty_prompt_tokens: string[];
use_penalty_prompt_tokens: boolean;
mirostat: number;
mirostat_tau: number;
mirostat_eta: number;
penalize_nl: boolean;
stop: string[];
n_keep: number;
n_discard: number;
ignore_eos: boolean;
stream: boolean;
logit_bias: boolean[];
n_probs: number;
min_keep: number;
grammar: string;
samplers: string[];
}
interface EventData {
content: string;
id_slot: number;
stop: boolean;
model: string;
tokens_predicted: number;
tokens_evaluated: number;
generation_settings: GenerationSettings;
prompt: string;
truncated: boolean;
stopped_eos: boolean;
stopped_word: boolean;
stopped_limit: boolean;
stopping_word: string;
tokens_cached: number;
timings: {
prompt_n: number;
prompt_ms: number;
prompt_per_token_ms: number;
prompt_per_second: number;
predicted_n: number;
predicted_ms: number;
predicted_per_token_ms: number;
predicted_per_second: number;
};
}
interface SSEEvent {
event?: unknown;
data?: EventData;
id?: string;
retry?: number;
error?: EventError; // Define a more specific error interface if possible
[key: string]: unknown;
}
type ParsedValue = string | EventData | EventError | undefined;
export async function* llama(prompt: string, params: Params, config: Config = {}): AsyncGenerator<SSEEvent, string, undefined> {
const controller = config.controller ?? new AbortController();
const api_url = config.api_url ?? "";
const completionParams: CompletionParams = {
...params,
prompt
};
const response = await fetch(`${api_url}/completion`, {
method: 'POST',
body: JSON.stringify(completionParams),
headers: {
'Connection': 'keep-alive',
'Content-Type': 'application/json',
'Accept': 'text/event-stream',
...(params.api_key ? { 'Authorization': `Bearer ${params.api_key}` } : {})
},
signal: controller.signal,
});
if(response.body===null){
return "No response";
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
let content = "";
let leftover = "";
try {
let cont = true;
while (cont) {
const result = await reader.read();
if (result.done) {
break;
}
const text = leftover + decoder.decode(result.value);
const endsWithLineBreak = text.endsWith('\n');
const lines = text.split('\n');
if (!endsWithLineBreak) {
leftover = lines.pop() ?? "";
} else {
leftover = "";
}
for (const line of lines) {
const match = /^(\S+):\s(.*)$/.exec(line);
if (match) {
let key = match[1];
const value:ParsedValue = match[2];
const event: Partial<SSEEvent> = {};
if (key === "data" || key === "error") {
// Attempt to parse JSON for data or error fields
try {
if(value){
// eslint-disable-next-line @typescript-eslint/no-unsafe-assignment
event[key] = JSON.parse(value);
}
} catch {
event[key] = value as unknown as EventData & EventError; // Use raw value if JSON parsing fails
}
} else {
if(!key){
key='0';
}
event[key] = value;
}
// Specific processing for parsed data
if (key === "data" && event.data) {
content += event.data.content;
yield event as SSEEvent;
if (event.data.stop) {
cont = false;
break;
}
}
// Handling errors
if (key === "error" && event.error) {
try {
if (event.error.message.includes('slot unavailable')) {
throw new Error('slot unavailable');
} else {
console.error(`llama.cpp error [${event.error.code} - ${event.error.type}]: ${event.error.message}`);
}
} catch(e) {
console.error(`Error parsing error data:`, event.error);
}
}
}
}
}
} catch (e: unknown) {
if (typeof e === "object" && e !== null && "name" in e && (e as { name: string }).name !== 'AbortError') {
console.error("llama error: ", e);
} else if (e instanceof Error) {
// This checks if it's an Error object which is more specific and safer
console.error("llama error: ", e.message);
} else {
// Handle cases where error is not an object or unexpected type
console.error("Unexpected error type:", e);
}
throw e;
} finally {
controller.abort();
}
return content;
}
streamedcontent.tsx is the main entry component.
Note: replace api_url with the address of your llama.cpp backend.
//streamedcontent.tsx
import React, { useState } from 'react';
import { llama } from './completion';
// Define interfaces for expected data types
interface StreamedContentComponentProps {
onPcTypeRecommended: (pcType: string) => void;
}
// Main component definition using React.FC (Functional Component)
const StreamedContentComponent: React.FC<StreamedContentComponentProps> = ({ onPcTypeRecommended }) => {
const [messages, setMessages] = useState<string>(""); // State for storing message strings
const [loading, setLoading] = useState<boolean>(false); // State to track loading status
// Handler for changes in text input, assuming you want to capture input for the llama function
const handleQueryChange = (event: React.ChangeEvent<HTMLInputElement>) => {
setMessages(event.target.value);
};
// Asynchronously fetch or process data, abstracted into its own function for clarity
const fetchStreamedContent = async (query: string) => {
console.log("fetching...",query);
const params = {
n_predict: 512,
stream: true,
};
//Replace with your AI Backend from llama.cpp
const config = {
api_url: 'http://localhost:8080'
};
try {
let newContent="";
for await (const event of llama(query, params, config)) {
if (event.data?.content) {
// Append new content to existing messages
newContent += event.data.content;
setMessages(currentMessages => currentMessages + event.data?.content );
}
if (event.error) {
throw new Error(event.error.message);
}
if (event.data && event.data.stop) {
pcRecommended(newContent);
console.log("Stop received",event.data);
break; // Exit loop if stop signal is received
}
}
} catch (error) {
console.error('Error consuming events:', error);
}
};
const pcRecommended = (currentMessages: string) => {
console.log("pcRecommended filtering...",currentMessages);
const lastColonIndex = currentMessages.lastIndexOf(':');
if (lastColonIndex >= 0) {
let pcType = currentMessages.substring(lastColonIndex + 1).trim();
if (pcType.endsWith('.')) {
pcType = pcType.slice(0, -1).trim();
}
pcType = pcType.replace(/[.,;]$/g, '').trim();
onPcTypeRecommended(pcType);
}
};
// Button click handler that triggers the asynchronous operation
const handleAskClick = () => {
if (!messages.trim()) return; // Prevent running with empty query
setLoading(true); // Set loading true when the process starts
void fetchStreamedContent(messages).finally(() => {
setLoading(false); // Reset loading state when the process completes or fails
});
};
return (
<div className="flex flex-col items-center justify-center space-y-4">
<input
type="text"
onChange={handleQueryChange}
className="w-full rounded-lg border border-gray-300 bg-gray-50 p-4 text-sm focus:border-blue-500 focus:ring-blue-500 dark:border-gray-600 dark:bg-gray-700 dark:text-white"
placeholder="Ask AI"
/>
<button
onClick={handleAskClick}
disabled={loading}
className="px-4 py-2 bg-blue-500 text-white rounded hover:bg-blue-600 disabled:bg-blue-300"
>
{loading ? "Thinking..." : "Ask"}
</button>
<div className="w-full p-4 bg-gray-100 rounded shadow">
{messages.split("\n").map((message, index) => (
<p key={index}>{message}</p>
))}
</div>
</div>
);
};
export default StreamedContentComponent;
Using the component.
In the main App.tsx, include the StreamedContentComponent:
//App.tsx
import StreamedContentComponent from './components/streamedcontent'
import './App.css'
function App() {
const onPcTypeRecommended=(pc:string)=>{
console.log("filtering...",pc);
//Filter logic here
}
return (
<div className="min-h-screen bg-gray-100">
<nav className="bg-white shadow">
<div className="max-w-7xl mx-auto px-4 sm:px-6 lg:px-8">
<div className="flex justify-between h-16">
<div className="flex">
<div className="flex-shrink-0 flex items-center">
LLAMA.cpp Selfhosting Demo
</div>
</div>
<div className="hidden sm:flex sm:items-center sm:ml-6">
<a href="#" className="text-gray-800 hover:text-gray-600 px-3 py-2 rounded-md text-sm font-medium">Home</a>
<a href="#" className="text-gray-800 hover:text-gray-600 px-3 py-2 rounded-md text-sm font-medium">Features</a>
<a href="#" className="text-gray-800 hover:text-gray-600 px-3 py-2 rounded-md text-sm font-medium">About</a>
</div>
</div>
</div>
</nav>
<main className="py-10">
<div className="w-96 mx-auto">
<div className="bg-white overflow-hidden shadow-sm sm:rounded-lg">
<div className="p-6 bg-white border-b border-gray-200">
<h3 className="text-lg font-semibold text-gray-800 mb-4">Get AI recommendation</h3>
<div className="p-4 border rounded-lg shadow-lg bg-white">
<StreamedContentComponent onPcTypeRecommended={onPcTypeRecommended}></StreamedContentComponent>
</div>
</div>
</div>
</div>
</main>
</div>
)
}
export default App
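The commands below assume a Vite + React + TypeScript project, since port 5173 is Vite's default dev port; the class names in the components use Tailwind CSS, which is optional. If you are starting from scratch, a minimal setup sketch (llama-demo is just an example project name):
npm create vite@latest llama-demo -- --template react-ts
cd llama-demo
npm install
Copy completion.tsx and streamedcontent.tsx into src/components/ before continuing.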
Run the dev command:
npm run dev
Open http://localhost:5173 and ask your question.
This React component can now be used anywhere in your project with a self-hosted AI model. Next, it's time to move on to cloud-hosted models for scalability. Stay tuned and subscribe.
Links
https://github.com/honeydreamssoftwares/llama-cpp-react
https://github.com/ggerganov/llama.cpp/tree/master/examples/server
Note: Multiple types of models are supported; check the llama.cpp GitHub repository for the list and fine-tuning instructions.