Farm and Ranch Data Hygiene

March 16, 2022
By Daniel Foy

Take-Aways

The next generation of IoT and big data sensors, platforms, and programs needs a common protocol that allows quality data produced by on-farm data systems to be integrated quickly and easily, which benefits all on-farm users and service providers.
It is recommended that AgriTech providers and manufacturers begin to report and describe data-cleaning methods, error types and rates, error deletion and correction rates, and differences in outcome with and without remaining outliers.
Automating data cleaning, data classification, and the three-stage process of screening, diagnosing, and editing at farm level, ahead of any intelligence, will increase accuracy in Precision Livestock Farming (PLF).
The current generation of AgriTech should be considered with some caution; make sure any contractual agreement is transparent about data ownership.
If you are purchasing intelligence, be sure you understand the quality of the data produced by a tool or fed into that intelligence, and where that data goes after it has been processed by the AI or ML.

Getting to the Bottom of Data Hygiene with Farm or Ranch AgriTech Systems and Platforms

We’ve all seen a rogue number or two in a table or graph with no idea how or why it got there, a missing data point that produces an empty or error output, or mislabeled or mis-entered data that skews everything and further confuses the situation.

There is an ever-increasing number of new and current-generation data systems, platforms, programs, and Internet of Things (IoT) sensors, all designed with their own limitations, for example:

Range
Battery life
Volume of data
Veracity of the data
Disparate local and cloud locations for storage and processing
Limited compute power
Unfriendly user interfaces and languages

These varying limitations of the current data systems, platforms, programs, and IoT sensors all create disparate insights that are becoming a nightmare to manage, update and evolve on-site for improved decision making. This is not unique to livestock agriculture and is usually just the first wave of any digital revolution that so many other sectors have gone through.

As we look at the next wave of AgriTech IoT tools that will be big data-focused, data hygiene will be part of our management and understanding of farm or ranch level data, and an area of your business that you, as the data owner, will need to have a greater awareness of. That means auditing, testing, and checking the AgriTech systems on-site so that you can be sure they continue to function optimally without disruption, deviation, or erroneous outputs, especially as these services, programs, and platforms can be costly.

Within the US market, processors, retailers, and consumers are going to demand that the farm or ranch meet standards on sustainability, welfare, and quality. Confidence in the validity and accuracy of the data and numbers produced by on-site data systems will be critical in allowing producers to stand by the quality and standard of their products on retailers’ shelves. We can all imagine the kind of “data-gate” situation a brand or farm could experience if the data or resulting calculations are not correct.

The European Union requires its farmers to produce auditable data that can be used to determine whether producers are complying with the necessary standards. These data must be correct and accurate if farmers are going to be incentivized and the current incentive and grant systems are to evolve; otherwise, it could become another VW diesel car scandal. Either way, if the data are incorrect, biased, or missing, and we are unaware of it or have no system to detect these issues, the result can negatively affect animal welfare, crop yield, environment and soil health, and profits, and thereby threaten the industry’s image and consumer confidence.

Data cleaning can be defined as a three-stage process, involving repeated cycles of screening, diagnosing, and editing of suspected data abnormalities. The goal is to have minimal edits and data abnormalities, and this is something the entire AgriTech community, across academia and commercial suppliers, should be moving towards while being transparent about it. Many data errors are detected incidentally during study activities other than data cleaning. However, it is more efficient to detect errors by actively searching for them in a planned way. There is a need for greater automation of data cleaning and the three-stage process at farm level, ahead of the intelligence, to increase our accuracy in Precision Livestock Farming (PLF), Machine Learning (ML), and Artificial Intelligence (AI), which ultimately assists with our aims of sustainability, greater welfare, traceability, and improved profitability.

Data cleaning deals with data problems once they have occurred; error-prevention strategies can reduce many problems but cannot eliminate them. It is not always immediately clear whether a data point is erroneous. Many times, what is detected is a suspected data point or pattern that needs careful examination. Similarly, missing values require further examination. Missing values may be due to interruptions of the data flow or the unavailability of the target information. Hence, predefined industry-wide AgriTech rules for dealing with errors, true missing values, and extreme values should be part of good practice and transparency. At a minimum, these rules need to be available and transparent to the data owner, the supplier of AI and ML services, and the wider AgriTech industry.
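
To make the three stages concrete, here is a minimal sketch in Python. It is an illustration only: the daily milk yield values, the expected range, and the diagnosis rules are all assumptions made up for this example, not an industry rule set.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed plausibility limits for daily milk yield in kg (illustrative only).
EXPECTED_RANGE = (0.0, 80.0)

@dataclass
class Flag:
    index: int
    value: Optional[float]
    reason: str  # "missing" or "outlier"

def screen(values):
    """Stage 1, screening: flag missing values and values outside the expected range."""
    low, high = EXPECTED_RANGE
    flags = []
    for i, v in enumerate(values):
        if v is None:
            flags.append(Flag(i, v, "missing"))
        elif not low <= v <= high:
            flags.append(Flag(i, v, "outlier"))
    return flags

def diagnose(flag):
    """Stage 2, diagnosing: apply predefined (here: hypothetical) rules to each flag."""
    if flag.reason == "missing":
        return "true missing"
    if flag.value < 0:
        return "error"    # a negative yield is physically impossible
    return "suspect"      # extreme but possible, so keep it and ask for review

def edit(values, flags):
    """Stage 3, editing: remove confirmed errors and log every decision."""
    cleaned = list(values)
    log = []
    for flag in flags:
        verdict = diagnose(flag)
        if verdict == "error":
            cleaned[flag.index] = None
        log.append((flag.index, flag.value, verdict))
    return cleaned, log

raw = [31.5, None, -4.0, 120.0, 28.0]
cleaned, log = edit(raw, screen(raw))
```

Note that the suspect value of 120.0 is kept rather than silently deleted, and that the log records what was flagged and why. That log is what makes the cleaning transparent to the data owner.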

The University of Wisconsin’s Dairy Brain’s Agricultural Data Hub (AgDH) offers a useful first five-step method for ingesting the different data streams available to dairy farms, which is helpful in handling farm-level data:

Transporting raw data into a centralized system
Decoding and storing data in a database
Cleaning data to ensure its validity
Homogenizing data by extracting the common features among the different software and farms
Integrating data from the different systems

Each of these steps is crucial to make data available from various sources in a consistent manner, ease algorithmic development and its implementation, and facilitate the deployment of new tools that utilize the integrated data. This historical and current data can then be made available to authenticated users via a single application programming interface (API) hosted through a web service, with the appropriate licensing.
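
The five steps above can be sketched as a small pipeline. This is not the AgDH implementation; the two vendor exports, their column names, and the shared feature names are hypothetical and only show how the steps fit together.

```python
import csv
import io

# Step 1 (transport): hypothetical raw exports from two vendor systems, stored as-is.
RAW = {
    "herd_software": "CowID,MilkKg\n101,32.4\n102,\n",
    "wearable": "animal;rumination_min\n101;455\n102;498\n",
}

# Hypothetical per-source decoding settings and column mappings.
DELIMITER = {"herd_software": ",", "wearable": ";"}
COLUMN_MAP = {
    "herd_software": {"CowID": "cow_id", "MilkKg": "milk_kg"},
    "wearable": {"animal": "cow_id", "rumination_min": "rumination_min"},
}

def decode(source, text):
    """Step 2: parse the raw export into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=DELIMITER[source]))

def clean(rows):
    """Step 3: turn empty strings into None so that gaps stay visible."""
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in rows]

def homogenize(source, rows):
    """Step 4: rename vendor-specific columns to shared feature names."""
    mapping = COLUMN_MAP[source]
    return [{mapping[k]: v for k, v in row.items()} for row in rows]

def integrate(tables):
    """Step 5: merge all sources into one record per cow."""
    merged = {}
    for rows in tables:
        for row in rows:
            merged.setdefault(row["cow_id"], {}).update(row)
    return merged

tables = [homogenize(src, clean(decode(src, text))) for src, text in RAW.items()]
herd = integrate(tables)  # values are kept as strings for brevity
```

In this sketch cow 102 ends up with a milk yield of None rather than an empty string or a zero, so the missing value is carried forward honestly into the integrated record.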

We do need to go beyond this Agricultural Data Hub approach. At the AgriTech service, manufacturer, and supplier level, there needs to be a common protocol for how farm-level data are handled from each and every on-farm system that produces raw or pre-processed data. Doing so allows for transparent commercial validation, traceability, and auditability of farm and ranch data, which also gives data owners a more complete understanding of their data, thereby allowing for true ownership and enhanced value. This standard protocol will also be critical for future PLF AI and ML products, and for the even greater number still under research. Without it, the task becomes immensely complicated and commercially difficult to replicate farm to farm, ranch to ranch.

The integration and accessibility of data can facilitate a wide range of descriptive, diagnostic, predictive, and prescriptive analytics that can be developed and deployed directly on farms to increase animal performance, efficiency, health and welfare, and profit margins, and to decrease the environmental impact of livestock farming systems. This is also a whole new economy for the agriculture service industry to supply ML and AI to livestock production systems. While integrating data within a single farm is hard, the task becomes exponentially more complicated when trying to compare values from one farm with other farms. Similarly, the systems produce data of different quality, which the analyst and data owner should be aware of in order to judge accurately the significance of the resulting analytics.

As data of interest are scattered across various data sources inside and outside of a farm, data extraction must deal with the diverse character or content of the data. These “heterogeneities” (traits) are not limited to the interfaces of source systems and sensors but also concern the data formats. The data sources for a single farm, such as sensors and management information systems, for example, herd management or wearable systems, are from various vendors, and each of these vendors employs its own proprietary extraction processes and data formats.

Most of the “raw data” extracted from the data sources are wrapped into comma-separated values (CSV) or similar export files. These files must be further processed in order to extract the data of interest. A first step toward extracting the data is the mapping of the data to the required levels of granularity. Some measurements, however, are not available per cow or at a highly granular level. For example, raw data about microclimate within farms do not include records per cow. Rather, climate data consist of measurements of temperature, humidity, air quality, solar radiation, precipitation, etc., taken at a specific date and time in a specific location of the farm. To obtain records per cow, these environmental measurements must be mapped to the location data for individual cows. Hence, the measurements of local environmental conditions are mapped to the cows that were near the location at the time of measurement and must therefore have been exposed to the measured climatic conditions. These per-cow records can then be added to lying time, rumination, step count, and production to quantify the impacts of heat stress or to support other management insights.
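
As a rough sketch of this mapping step, the example below matches each climate reading to the cows that were recorded in the same zone around the same time. The zone names, the readings, and the 10-minute tolerance are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical microclimate readings: (timestamp, barn zone, temperature in C, humidity in %).
climate = [
    (datetime(2022, 3, 16, 14, 0), "pen_A", 29.5, 70),
    (datetime(2022, 3, 16, 14, 0), "pen_B", 24.0, 55),
]

# Hypothetical cow location fixes: (timestamp, cow id, barn zone).
locations = [
    (datetime(2022, 3, 16, 13, 56), "cow_101", "pen_A"),
    (datetime(2022, 3, 16, 14, 3), "cow_102", "pen_B"),
    (datetime(2022, 3, 16, 15, 30), "cow_103", "pen_A"),
]

# Assumed tolerance: a cow counts as exposed if it was located in the
# zone within 10 minutes of the climate reading.
MAX_GAP = timedelta(minutes=10)

def map_climate_to_cows(climate, locations, max_gap=MAX_GAP):
    """Create one per-cow record for every climate reading the cow was near."""
    records = []
    for c_time, c_zone, temp_c, humidity in climate:
        for l_time, cow_id, l_zone in locations:
            if l_zone == c_zone and abs(l_time - c_time) <= max_gap:
                records.append({
                    "cow_id": cow_id,
                    "time": c_time,
                    "temp_c": temp_c,
                    "humidity_pct": humidity,
                })
    return records

per_cow = map_climate_to_cows(climate, locations)
```

Here cow_103 receives no climate record, because its only location fix in pen_A is 90 minutes after the reading. A production system would need to decide, and document, how such unmatched animals are treated.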

As more independently developed AI and ML tools are applied to commercial farmer data ecosystems, we must begin to look at the quality and traits of the “raw data” that is coming in from the base sensor systems and platforms. For example, in the case of cleaning wearable sensor data, there are a number of research publications that discuss missing data files and the filtering methods that were applied to these data. If the “raw data” that is coming into a central data ecosystem is missing one day out of a month, how does that affect alerts or any analytics that are generated? If we are missing 140 out of 1,440 data files or data packets for a day from the accelerometer in an animal’s wearable sensor, how does that affect the analysis and resulting outputs? If these files or packets are intermittently missing over a couple of months or longer, what is the effect of that on longer-term analytics and management decision-making? With commercially applied data systems, we will not have the luxury of filtering out data for better results, as is done in the research literature, without great consequences for welfare, sustainability, and profitability.
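
A simple completeness check puts numbers on these questions. The sketch below assumes one packet per minute (1,440 per day, matching the figure above); the two gap patterns are invented to show that the same missing rate can hide very different problems.

```python
EXPECTED_PER_DAY = 1440  # assumed: one packet per minute

def completeness(received, expected=EXPECTED_PER_DAY):
    """Return the number and percentage of packets missing for one day."""
    missing = expected - received
    return {"missing": missing, "missing_pct": round(100 * missing / expected, 2)}

def longest_gap(minutes_received, expected=EXPECTED_PER_DAY):
    """Return the longest run of consecutive missing minutes within the day."""
    present = set(minutes_received)
    longest = current = 0
    for minute in range(expected):
        current = 0 if minute in present else current + 1
        longest = max(longest, current)
    return longest

# Pattern 1: one continuous outage of 140 minutes (10:00 to 12:20).
one_block = [m for m in range(1440) if not 600 <= m < 740]

# Pattern 2: 140 single packets dropped, one every ten minutes.
scattered = [m for m in range(1440) if not (m < 1400 and m % 10 == 0)]

summary = completeness(len(one_block))   # 140 missing, 9.72 percent
gap_block = longest_gap(one_block)       # 140 minutes in a row
gap_scattered = longest_gap(scattered)   # never more than 1 minute in a row
```

Both days are missing just under 10 percent of their packets, but a 140-minute blackout could hide an entire lying bout, while scattered single drops may barely affect a daily summary. Reporting the gap pattern alongside the missing rate gives the analyst a fairer picture.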

Apart from transforming the data to the desired granularity and traits, data may be transformed from farm-specific to globally relevant and comparable across farms, production systems, local environment, etc. To be able to compare all records concerning a specific cow, records using the farm-specific identifiers must be transformed into records that use the corresponding national identifier in order for the records to be comparable across farms.
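
A minimal sketch of that identifier translation is shown below. The farm names, tag numbers, and the "NAT-" identifier format are placeholders invented for this example and do not follow any real national scheme.

```python
# Hypothetical lookup: (farm, farm-specific tag) -> national identifier.
# The same cow carries tag 17 on farm_A and tag 203 after moving to farm_B.
NATIONAL_ID = {
    ("farm_A", "17"): "NAT-0001",
    ("farm_B", "203"): "NAT-0001",
    ("farm_B", "204"): "NAT-0002",
}

def to_national(records, lookup=NATIONAL_ID):
    """Attach the national identifier; keep unmatched records for review."""
    translated, unmatched = [], []
    for rec in records:
        key = (rec["farm"], rec["local_id"])
        if key in lookup:
            translated.append({**rec, "national_id": lookup[key]})
        else:
            unmatched.append(rec)
    return translated, unmatched

records = [
    {"farm": "farm_A", "local_id": "17", "milk_kg": 30.1},
    {"farm": "farm_B", "local_id": "203", "milk_kg": 28.7},
    {"farm": "farm_B", "local_id": "999", "milk_kg": 25.0},
]
translated, unmatched = to_national(records)
```

The first two records resolve to the same national identifier, so they can be compared as one animal across farms. The third has no match and is set aside rather than dropped, which keeps the gap visible to the data owner.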

Automated data cleaning systems must help minimize data entry errors while having the intelligence to limit and identify potential errors; this will be central to a farm or ranch data ecosystem. Exploring and deploying technologies such as non-fungible tokens (NFT) and distributed ledger technologies (DLT) within a farm-level data ecosystem can assist in identifying missing data, understanding biased data, and cleaning. These technologies can also be used to automate and authenticate some aspects of our digital audits, regulations, and standards while enforcing data ownership rights, which will aid a new agricultural data economy. Should North American farmers and ranchers want to compare records for a specific cow using the national identifier, so that the records are comparable across all farms, without losing Fourth Amendment rights, NFT and DLT are what will allow them to operate in today’s digital food industry while supplying what the customer and buyer need to know about the resulting products.

Cleaning the data also means we must begin assigning it a temperature, classifying it as Hot, Warm, or Cold data. This is not an area that has been mentioned within PLF data assessments, but it needs to be understood, as it is important for us to know who is going to see the data, how that data will be needed within other applications, and how frequently it is called up. Classifying data assists with estimating its lifetime value for storage, which helps with the overall architecture of a commercial farm or ranch data ecosystem and with the future analytics that need to be applied to the data. The temperature of a data point may also change over time as greater availability, access, and integration of other data from IoT and data capturing technologies add new value to a farm data ecosystem.
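
One simple way to assign a temperature is by how often a dataset is called up. The sketch below uses read counts over the last 30 days; the thresholds and dataset names are assumptions for illustration, and a real farm would tune them to its own systems.

```python
# Assumed thresholds on reads in the last 30 days; tune these per farm.
HOT_MIN_READS = 30   # read roughly daily or more
WARM_MIN_READS = 1   # read at least once

def data_temperature(reads_last_30_days):
    """Classify a dataset as hot, warm, or cold from its recent access count."""
    if reads_last_30_days >= HOT_MIN_READS:
        return "hot"
    if reads_last_30_days >= WARM_MIN_READS:
        return "warm"
    return "cold"

# Hypothetical datasets and how often each was read in the last 30 days.
access_counts = {
    "activity_alerts": 900,
    "monthly_milk_summary": 4,
    "2019_barn_climate_archive": 0,
}
temperatures = {name: data_temperature(n) for name, n in access_counts.items()}
```

Because the classification is recomputed from recent access, a cold archive would turn warm again as soon as a new analytics tool starts reading it, which reflects how the temperature of data can change over time.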

To Conclude

First, we must start by tackling and embracing Extended Digital Literacy at the farm or ranch level. This is critical for PLF, but also for the control and ownership of the data, the welfare of the animals, and meeting our sustainability goals, and because farms and ranches are now the first lines of data security, which is food security.

As prospective users, farmers, veterinarians, and consultants are hardly data scientists or business intelligence specialists. These users require intuitive query facilities to support them with the analytical tasks at hand within their particular expertise. Data lakes typically do not provide such facilities to the farm team and data owner. For data warehouses, however, more sophisticated analysis tools and query languages exist or can be more easily developed and implemented due to a more uniform structure and higher quality of the collected data.

Secondly, a farm or ranch data ecosystem enforces data ownership, security, and independent business intelligence. Structuring that data ecosystem so that data flows from a data lake to a data warehouse is a good starting point. A data warehouse automates the systems and functions for cleaning and integrating data, producing high-quality data streams that are viable for data analyses in PLF. Missing or erroneous data has a long-term effect on farms, ranches, the data’s value, and the industry at large.

With the next generation of IoT and big data sensors, platforms, and programs, there is a need to ensure that a common protocol is put in place that allows for the easy and quick integration of quality data produced by on-site data systems, and for control and oversight by the data owner. This requires understanding that the farmer and rancher own their data both before and after algorithmic processing, which can be achieved through on-farm data ecosystems. It is recommended that AgriTech providers and manufacturers begin to report and describe data-cleaning methods, error types and rates, error deletion and correction rates, and differences in outcome with and without remaining outliers. That will benefit all parties in the supply chain, the services industry, regulators, and data owners.
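
As an illustration of what such a report could contain, the sketch below computes error, correction, and deletion rates and compares the outcome with and without a remaining outlier. The yield values, the edits, and the report fields are hypothetical; they are not a proposed standard.

```python
from statistics import mean

def cleaning_report(raw, edits, retained_outliers):
    """Summarize a cleaning pass.

    raw: list of raw values
    edits: {index: corrected value, or None if the value was deleted}
    retained_outliers: indices of extreme values that were kept in the data
    """
    cleaned = {i: edits.get(i, v) for i, v in enumerate(raw)}
    kept = {i: v for i, v in cleaned.items() if v is not None}
    corrected = sum(1 for v in edits.values() if v is not None)
    deleted = len(edits) - corrected
    n = len(raw)
    return {
        "records": n,
        "error_rate": len(edits) / n,
        "correction_rate": corrected / n,
        "deletion_rate": deleted / n,
        "mean_with_outliers": mean(list(kept.values())),
        "mean_without_outliers": mean(
            [v for i, v in kept.items() if i not in retained_outliers]
        ),
    }

# Hypothetical daily milk yields (kg) for ten cows.
raw = [30.0, 31.0, 310.0, 29.0, -5.0, 30.0, 75.0, 32.0, 28.0, 29.0]
edits = {2: 31.0, 4: None}  # 310.0 corrected (decimal slip), -5.0 deleted
report = cleaning_report(raw, edits, retained_outliers={6})  # 75.0 kept as a true extreme
```

In this made-up herd, the single retained outlier moves the mean from 30.0 kg to 35.0 kg. Publishing both figures, together with the error rates, lets the data owner see how much a result depends on the cleaning decisions behind it.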

From there, automating data cleaning, classification, and the three-stage process at farm level, ahead of the intelligence, will increase our accuracy in PLF and help with the key areas of concern and opportunity for farms and ranches: sustainability, traceability, welfare, and profitability. This has enormous potential value in helping rural economies, expanding rural wealth, and growing small businesses, while also creating many new jobs that will further contribute to rural economic development and AgriTech innovation.

Right now, the current generation of AgriTech data tools, platforms, and sensors should be considered with some caution. Make sure that data ownership is stated transparently and upfront in contractual agreements, and that suppliers are both willing and able to integrate. Finally, if you are purchasing intelligence, be sure you understand the quality of the data produced by a tool or fed into that intelligence, and where that data goes after it has been processed by the AI or ML.

Farmers and ranchers can start by auditing their current data files and systems while understanding how the Seven V’s of Big Data apply to each of these systems. Data cleaning is part of the veracity of your data, which is linked to validity and visualization, which are ultimately all linked to value.

Lastly, as all farmers know, good hygiene is imperative to running a successful livestock facility. Now we need to add good data hygiene to that cleaning list. As livestock producers transition to this “agricultural digital revolution,” remember that you are going to need to better understand the data systems and the data flow coming in from every part of your business: what quality the data is and how it has been handled prior to entering a central data ecosystem. This is data hygiene, and understanding and implementing it on your farm or ranch will make you the farm of the future.

Farm and Ranch Data Hygiene Terminology

Here are some useful terms for you, the data owner, and your farm and ranch team, to familiarize yourself with around data hygiene:

Data cleaning: Process of detecting, diagnosing, and editing faulty data.

Data editing: Changing the value of data shown to be incorrect.

Data flow: Passage of recorded information through successive information carriers, platforms, and models.

Inlier: Data value falling inside the expected range.

Outlier: Data value falling outside the expected range.

Robust estimation: Estimation of statistical parameters, using methods that are less sensitive to the effect of the outliers than more conventional methods.

Data Lake: A data lake is a location where the first layer of assessment can begin, to make sure the data are coming into the right place, on time, and with the right content and format. A data lake needs to be filled with quality data before we bring it to a data warehouse.

Data Warehouse: This is where data can be appropriately cleaned, addressed, and checked for bias before going on for analysis.

Bias within data: This is any trend or deviation from the truth in data collection, data analysis, interpretation, or publication that can cause false conclusions.
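
To see several of these terms in action, the short sketch below contrasts a conventional estimate (the mean) with robust ones (the median and the median absolute deviation) on a set of made-up daily milk yields containing one outlier. The values are invented for illustration.

```python
from statistics import mean, median

def mad(values):
    """Median absolute deviation: a robust measure of spread."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

# Hypothetical daily milk yields (kg); 300.0 is an outlier, the rest are inliers.
yields = [30.0, 31.0, 29.0, 32.0, 28.0, 300.0]

conventional = mean(yields)     # pulled far upwards by the outlier
robust_centre = median(yields)  # stays close to the inliers
robust_spread = mad(yields)
```

The single outlier drags the mean to 75.0 kg, while the median stays at 30.5 kg with a median absolute deviation of 1.5 kg. This is why robust estimates are useful while suspect values are still being diagnosed.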