Large-scale data processing is the handling and analysis of very large volumes of data in order to extract knowledge, support informed decisions, and solve complex problems. It is crucial in fields such as business, science, and healthcare. The choice of tools and methods depends on the requirements of the task and the available resources: programming languages such as Python, Java, and Scala are commonly used, alongside frameworks such as Apache Flink, Apache Kafka, and Apache Storm.
Researchers have created a new open-source framework called Fondant to simplify and accelerate large-scale data processing. It ships with built-in tools to download, explore, and process data, including reusable components for downloading data from URLs and fetching images.
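To make this concrete, here is a rough sketch of how a Fondant pipeline chaining a URL-loading step with an image-download step might be declared. The class names, registry component names, and arguments below are assumptions made for illustration, not Fondant's confirmed API; consult the project's documentation for the real interface.

```python
# Illustrative sketch only: Pipeline, ComponentOp, the component names, and
# the arguments are assumed and may not match Fondant's actual API.
from fondant.pipeline import ComponentOp, Pipeline

# A pipeline is declared once, then built up from reusable components.
pipeline = Pipeline(pipeline_name="cc_image_pipeline", base_path="./artifacts")

# Hypothetical reusable component that loads a table of image URLs.
load_urls = ComponentOp.from_registry(
    name="load_from_hf_hub",                      # assumed registry name
    arguments={"dataset_name": "fondant-ai/fondant-cc-25m"},
)

# Hypothetical reusable component that downloads the images behind the URLs.
download_images = ComponentOp.from_registry(
    name="download_images",                       # assumed registry name
    arguments={"timeout": 10},
)

pipeline.add_op(load_urls)
pipeline.add_op(download_images, dependencies=load_urls)
```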
The current challenge is that generative AI systems such as Stable Diffusion and DALL-E are trained on hundreds of millions of images scraped from the public internet, including copyrighted works. This creates legal risk and uncertainty for those who use such models, and it is unfair to copyright holders who may not want their work reproduced without consent.
To address this, the researchers have developed a data processing pipeline to build a dataset of 500 million Creative Commons images for training latent diffusion image-generation models. A data processing pipeline is a sequence of steps and tasks that collects, processes, and moves data from one source to another, where it can be stored and analyzed for various purposes.
Creating a custom data processing pipeline involves several steps, and the specific approach varies with the data sources, processing requirements, and tools involved. The researchers take a building-block approach: Fondant pipelines are designed to mix reusable components with custom ones, as sketched below. They also deployed the pipeline to a production environment and set up automation for regular data processing.
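The building-block idea itself is framework-agnostic. The short sketch below assumes nothing about Fondant's own interfaces; it simply shows how a reusable step and a custom step (a hypothetical Creative Commons license filter) can be composed into one pipeline. All function and field names here are made up for illustration.

```python
# Generic illustration of the building-block approach; none of these
# functions are part of Fondant, they are hypothetical stand-ins.
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]

def download_metadata(records: Iterable[Record]) -> Iterable[Record]:
    """Reusable block: pretend to fetch metadata for each image URL."""
    for r in records:
        yield {**r, "metadata_fetched": True}

def keep_creative_commons(records: Iterable[Record]) -> Iterable[Record]:
    """Custom block: keep only records whose license looks like CC."""
    for r in records:
        if str(r.get("license", "")).lower().startswith("cc"):
            yield r

def run_pipeline(records: Iterable[Record], steps: list[Step]) -> list[Record]:
    """Chain the blocks: the output of one step feeds the next."""
    for step in steps:
        records = step(records)
    return list(records)

if __name__ == "__main__":
    urls = [
        {"image_url": "https://example.com/a.jpg", "license": "CC-BY"},
        {"image_url": "https://example.com/b.jpg", "license": "All rights reserved"},
    ]
    result = run_pipeline(urls, [download_metadata, keep_creative_commons])
    print(result)  # only the CC-BY record survives the custom filter
```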
Fondant-cc-25m contains 25 million image URLs with Creative Commons license information that can be accessed in a single pass. The researchers have published detailed step-by-step installation instructions for local users. To run the pipelines locally, users need Docker installed with at least 8 GB of RAM allocated to the Docker environment.
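If, as the project page suggests, the URL list is published as a dataset on the Hugging Face Hub, a quick look at the data could be as simple as the snippet below. The repository id `fondant-ai/fondant-cc-25m` and the column layout are assumptions here; streaming is used so the full 25 million rows never have to be downloaded just to inspect a few.

```python
# Assumes the URL list is hosted on the Hugging Face Hub under
# "fondant-ai/fondant-cc-25m"; the repo id and columns are assumptions.
from datasets import load_dataset

# Streaming avoids pulling all 25M rows to disk just to peek at a few.
ds = load_dataset("fondant-ai/fondant-cc-25m", split="train", streaming=True)

for i, row in enumerate(ds):
    print(row.keys())   # inspect which columns (URL, license info, ...) exist
    print(row)
    if i >= 4:          # look at the first five records only
        break
```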
Because the published dataset may contain sensitive personal information, the researchers designed it to include only non-personal, public information in support of conducting and publishing their open-access research. They note that the filtering process for the dataset is still in progress and that they welcome contributions from other researchers to help build anonymization pipelines for the project. In the future, they plan to add components such as image-based deduplication, automatic captioning, visual quality estimation, watermark detection, face detection, text detection, and more.
Check out the blog article and the project. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Master’s degree in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries that advance technology, and he is passionate about understanding nature with the help of tools such as mathematical models, machine learning models, and artificial intelligence.