About the competition

Proper hand hygiene is one of the most effective measures to halt the spread of COVID-19 and other pathogens. Simple handwashing not only protects individuals from contracting the disease but also prevents transmission to others. The numbers, however, are alarming: 2.2 billion people worldwide lack access to safe water at home, and an additional 1.37 billion people lack handwashing facilities at home. Furthermore, nearly two billion people worldwide rely on healthcare facilities that lack basic water services.

Climate change, population growth, and pollution are threatening the world’s water resources. As the global population continues to expand, the challenge of accessing sufficient water while preserving the integrity of aquatic ecosystems persists. The Pacific Institute collaborates with stakeholders worldwide to address water resource issues and ensure that communities and nature have the water they need to thrive now and in the future.

Understanding water sanitation and ensuring water cleanliness is crucial in both rural and urban areas. One way to achieve this is by assessing the quality of the water we consume daily. The objective of this competition is to train a machine learning model on the water quality data provided in the training file and use it to predict the quality estimation result for the test dataset.

For further information on the competition, including instructions on submitting predictions, please refer to Kaggle’s competition documentation at www.kaggle.com/docs/competitions.

About the data set

The dataset provided in train.csv consists of the following features:
– id: The unique ID for each row.
– categoryA – categoryF: 6 category columns with suffix A to F.
– featureA – featureI: 9 feature columns with suffix A to I.
– compositionA – compositionJ: 10 composition columns with suffix A to J.
– unit: The unit of measurement for the result values.
– result: The measure for water quality (target variable).

The datasets provided can be read using the read_csv() function in the pandas module.

    # code to read the dataset
    import pandas
    train = pandas.read_csv("train.csv")

Acknowledgements

We thank the European Environment Agency and The World Bank for providing this dataset.
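
The column layout listed under "About the data set" can be turned into a quick header check before any modelling. The sketch below is only an illustration: the helper names (expected_columns, missing_columns, read_header) are hypothetical, and it assumes the header uses exactly the names given in the list, with no guarantee about column order.

```python
import csv
import io
import string


def expected_columns(include_target=True):
    """Build the column names described in the data set section."""
    letters = string.ascii_uppercase
    cols = ["id"]
    cols += [f"category{c}" for c in letters[:6]]      # categoryA..categoryF
    cols += [f"feature{c}" for c in letters[:9]]       # featureA..featureI
    cols += [f"composition{c}" for c in letters[:10]]  # compositionA..compositionJ
    cols.append("unit")
    if include_target:
        cols.append("result")  # target column, described for train.csv
    return cols


def missing_columns(header, include_target=True):
    """Return the expected column names that are absent from a header row."""
    present = set(header)
    return [c for c in expected_columns(include_target) if c not in present]


def read_header(csv_file):
    """Read only the first row of an open CSV file object."""
    return next(csv.reader(csv_file))


# Small in-memory demo; a real run would pass open("train.csv", newline="").
demo = io.StringIO(",".join(expected_columns()) + "\n")
print(missing_columns(read_header(demo)))
```

Passing include_target=False gives the list to expect in a file without the result column; whether test.csv omits it is an assumption to verify against the actual file.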

Evaluation Metric

The evaluation metric for this competition is Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated as

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2}$$

where:
– $n$ is the total number of observations in the (public/private) data set,
– $p_i$ is your prediction of target,
– $a_i$ is the actual target for $i$, and
– $\log(x)$ is the natural logarithm of $x$.

Submission Format

For every id in the dataset, the submission file should contain two columns: id, which is the unique id for each data point from the testing dataset, and result, which is the water quality measurement factor. The second column should be a space-delimited string value. The file should contain a header; sample_submission.csv shows the expected format.
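
For local validation, the RMSLE definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the competition's official scorer; the function name rmsle and the input checks are assumptions added here.

```python
import math


def rmsle(predictions, actuals):
    """Root Mean Squared Logarithmic Error between two equal-length sequences."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if len(predictions) == 0:
        raise ValueError("at least one observation is required")

    total = 0.0
    for p, a in zip(predictions, actuals):
        # log(x + 1) is only defined for x > -1
        if p <= -1 or a <= -1:
            raise ValueError("values must be greater than -1")
        # math.log1p(x) computes the natural logarithm of (x + 1)
        total += (math.log1p(p) - math.log1p(a)) ** 2
    return math.sqrt(total / len(predictions))


print(rmsle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical inputs give 0.0
```

Because the error is measured on log(x + 1), the metric penalises relative rather than absolute differences, which is one reason some participants fit models to the log-transformed target.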

About the dataset

This dataset consists of a train.csv file, which contains the unique id for each row, 6 category columns with suffix A to F, 9 feature columns with suffix A to I, 10 composition columns with suffix A to J, a unit column giving the unit of measurement of the result value, and finally a result column, which is the measure of the quality of water as a numerical value to be predicted.
1. The category columns are various categorical features for a data point, such as the country of data collection, the site from which the data is collected, the media of the sample, etc.
2. The feature columns are the various demographic features that affect the pollution of water in a particular region, such as population density, GDP, droughts in a region, literacy rate of students in a region, etc.
3. The composition columns are the compositions of various materials such as paper, plastic waste, cardboard, etc. in the water.
4. The unit value is the unit of measurement in which the result value is measured.
5. The result value is a floating-point number that expresses the quality of water based on the various factors provided in the dataset.

FAQ

What files do I need?
You are required to use the train.csv file to read the tabular dataset provided and train your algorithms to predict the numerical result value.

What am I predicting?
You train your model on train.csv (the training dataset) and then predict the result value for each row in test.csv to create a submission.

Files
– train.csv – the provided training data set.
– test.csv – the provided testing data set.
– sample_submission.csv – a sample submission file in the correct format.

Columns
– id: The unique ID for each row.
– categoryA – categoryF: 6 category columns with suffix A to F.
– featureA – featureI: 9 feature columns with suffix A to I.
– compositionA – compositionJ: 10 composition columns with suffix A to J.
– unit: The unit of measurement for the result values.
– result: The measure for water quality (target variable).

The training dataset provided can be read using the read_csv() function in the pandas module.
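
To make the train-then-predict flow concrete, here is a deliberately simple sketch: a constant baseline plus a submission writer. The function names are hypothetical, and the baseline is an illustration rather than a recommended model. It relies on the fact that, for a metric measured on log(x + 1), the best single constant is the mean of the training targets in log space, transformed back. The writer assumes one numeric result per row under an id,result header; sample_submission.csv remains the authority on the exact format.

```python
import csv
import io
import math


def constant_baseline(train_results):
    """Constant prediction that minimises RMSLE on the given training targets."""
    # Average in log(x + 1) space, then map back with expm1 (inverse of log1p).
    mean_log = sum(math.log1p(y) for y in train_results) / len(train_results)
    return math.expm1(mean_log)


def write_submission(ids, predictions, out_file):
    """Write id/result pairs with a header row to an open text file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["id", "result"])
    for row_id, pred in zip(ids, predictions):
        writer.writerow([row_id, pred])


# In-memory demo with made-up values; real ids would come from test.csv.
train_results = [0.5, 2.0, 8.0]
test_ids = [101, 102]
baseline = constant_baseline(train_results)

buffer = io.StringIO()
write_submission(test_ids, [baseline] * len(test_ids), buffer)
print(buffer.getvalue())
```

Any real model should beat this constant on a held-out split; if it does not, the feature handling or target transform is worth revisiting first.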