Data Preprocessing
Data preprocessing is a critical step in building effective RAG systems. The accuracy and relevance of the answers the model generates depend directly on the quality of data preparation. In this section, we will look at the main methods and approaches for preparing text data for RAG.
Data preprocessing is useful not only in RAG: it also comes in handy in GenAI workflows and agents!
Data is king. Data preprocessing is 50% of a RAG system's success.
If you underestimate this step, you can lose 50% of the relevance of your answers. Garbage in, garbage out.
Questions
- We want to split a document into pieces for vectorization. What are the ways to do this?
- Why is splitting every 1,000 characters a bad idea? What about every 100 tokens?
- How do you split specific document types, such as HTML, JSON, or code?
- When might we need to extract text from documents that already appear to be text?
Steps
1. Read carefully about the ways to split text into chunks in LangChain
or watch the lecture
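To get a feel for the API, here is a minimal chunking sketch using LangChain's RecursiveCharacterTextSplitter. It assumes the langchain-text-splitters package is installed; the input file name and the chunk_size/chunk_overlap values are illustrative, not recommendations.

```python
# Minimal chunking sketch (assumes: pip install langchain-text-splitters)
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical input file; in practice this is the text you extracted in step 2.
text = open("my_document.txt", encoding="utf-8").read()

# The splitter recursively tries separators ("\n\n", "\n", " ", ""), so chunks
# tend to respect paragraph and sentence boundaries instead of cutting mid-word.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk (illustrative value)
    chunk_overlap=200,  # overlap shares context between neighboring chunks
)

chunks = splitter.split_text(text)
print(len(chunks), "chunks; first chunk starts with:", chunks[0][:80])
```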
2. Text extraction
There is a huge number of unstructured document types: PDF, DOCX, RTF and the like. The problem is that a single document can contain data in several formats at once:
- text
- tables
- images (including infographics, diagrams, and charts)
- etc.
Before we start splitting documents into chunks, we need to learn how to extract text from them. Practical experience shows that you cannot prepare for text extraction once and for all: every case has its own pitfalls. Still, we have collected a few materials for you below (studying them is optional), followed by a small extraction sketch.
- Upstage AI Document Parser: Revolutionise Complex PDF Data Extraction!
- ✦ Marker: This Open-Source Tool will make your PDFs LLM Ready
- ✦ Extracting Text from PDFs for Large Language Models and RAG (PyMuPDF4llm 💚)
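As a quick taste of what these tools do, here is a minimal sketch of PDF-to-Markdown extraction with pymupdf4llm, which is mentioned in the materials above. The file names are hypothetical, and real documents often need per-case tuning.

```python
# Minimal extraction sketch (assumes: pip install pymupdf4llm)
import pymupdf4llm

# Converts the PDF to Markdown, keeping headings, lists and simple tables,
# which is a convenient intermediate format before chunking.
md_text = pymupdf4llm.to_markdown("report.pdf")  # hypothetical file name

with open("report.md", "w", encoding="utf-8") as f:
    f.write(md_text)
```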
Python libraries for text extraction and preprocessing
- pandas – convenient DataFrame/Series structures; indispensable for loading data, handling missing values, combining different sources, and preparing tabular data for subsequent generation of text examples.
- scikit-learn – tools for scaling (StandardScaler, MinMaxScaler), encoding categorical features (OneHotEncoder, LabelEncoder), and splitting datasets (train_test_split); example: preparing features before training a classifier that takes the generative context into account.
- NLTK – classic NLP modules: tokenization, stop words, stemming/lemmatization; useful for initial text processing before feeding into the LLM.
- spaCy – fast parsing, NER, lemmatization and POS tagging; used to extract entities and structure knowledge in RAG pipelines.
- regex – advanced regular expressions (full Unicode support, lookaround assertions); needed for complex text cleaning and pattern matching.
- ftfy – "fixes text for you": repairs broken Unicode, mangled HTML encodings, and OCR artifacts; use case: cleaning data gathered by web scraping.
- chardet – automatic detection of text file encoding; helps to correctly read documents in different encodings before preprocessing.
- langdetect – a library for detecting the language of a text; used in multilingual RAG solutions for filtering documents and routing them by language.
- clean-text – ready-made text cleaning functions: removing links, emojis, special characters and extra spaces; speeds up corpus preparation before vectorization.
- unstructured – recognition and parsing of PDF, DOCX, HTML, PPTX; extracts "clean" text and metadata to create knowledge sources.
- Apache Tika – a service for extracting text/metadata from many formats; useful in ETL pipelines for large document repositories.
- pdfplumber – detailed work with PDFs: tables, columns, coordinate-based text extraction; suitable for structuring corporate reports.
- PyPDF2 – basic PDF reading/writing functions, merging and splitting pages; used to prepare batches of documents for vector storage.
- BeautifulSoup4 – parsing HTML/XML; used to collect and clean web data before creating wiki-like indexes.
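To show how several of these libraries combine in practice, here is a hedged sketch of cleaning a scraped HTML page before chunking. The file name is hypothetical, and the exact cleaning steps will depend on your corpus.

```python
# Sketch: turning a raw scraped HTML file into clean text
# (assumes: pip install chardet ftfy beautifulsoup4 langdetect)
import chardet
import ftfy
from bs4 import BeautifulSoup
from langdetect import detect

raw_bytes = open("scraped_page.html", "rb").read()  # hypothetical input file

# 1. Detect the encoding so the file is decoded correctly.
encoding = chardet.detect(raw_bytes)["encoding"] or "utf-8"
html = raw_bytes.decode(encoding, errors="replace")

# 2. Repair mojibake and other broken Unicode left over from scraping.
html = ftfy.fix_text(html)

# 3. Strip markup, keeping only the visible text.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator="\n", strip=True)

# 4. Record the language, e.g. to route the document in a multilingual RAG setup.
language = detect(text)
print(language, text[:200])
```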
Extra Steps
E1. How to Set the Chunk Size in Document Splitter
E2. Additional reading: Mastering Text Splitting for Effective RAG with Langchain
Now we know...
We have covered the key techniques for extracting text from various data formats and splitting it into chunks, both of which are essential for preparing data for RAG. Understanding these methods lets you optimize indexing and retrieval of relevant information. You are now ready to apply this knowledge in practice and improve your RAG applications.
Exercises
Food for thought to get your neurons firing:
- How will the quality of a RAG system's responses change if you use very small or very large chunks? What trade-offs exist when choosing a chunk size?
- Imagine you need to process data containing tables and code. Which text splitting strategies will be most effective and why might standard splitters fail?
- In a real project, data may come from different sources (PDF, HTML, JSON, databases) and have different structures. What difficulties might you encounter when creating a universal data preprocessing pipeline and how can they be overcome?
- How can you assess the quality of chunking before the index-building and response-generation stages? Are there metrics or approaches for such an assessment?