How can data science contribute to a greener world?: an exploration featuring machine learning and data mining for environmental facilities and energy end users
Umeå University, Faculty of Science and Technology, Department of Chemistry. (Green Technology and Environmental Economics)
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Hur kan datavetenskap bidra till en grönare värld? : studier med användande av maskininlärning och datamining för miljöanläggningar och energiförbrukning (Swedish)
Abstract [en]

Human society has taken many measures to address environmental issues. For example, wastewater treatment plants (WWTPs) are deployed to alleviate water pollution and the shortage of usable water, and waste-to-energy (WtE) plants are used to recover energy from waste and reduce its environmental impact. However, managing these facilities is taxing because the processes and operations are complex and dynamic. These characteristics hinder a comprehensive and precise understanding of the processes through conventional mechanistic models. On the other hand, with the development of the Fourth Industrial Revolution, large-volume and high-resolution data from automatic online monitoring have become increasingly obtainable. These data usually carry abundant detailed information about process activities that can be used to optimize process control. Data monitoring has also been adopted by resource end users. For example, commercial buildings usually record their energy consumption in order to optimize consumption behavior, eventually saving running costs and reducing their carbon footprint. Once the data have been recorded and retrieved, appropriate data science methods need to be employed to extract the desired information. Data science is a field that covers formulating data-driven solutions, preprocessing data, analyzing data with particular algorithms, and employing the results to support high-level decisions in various application scenarios.

The aim of this PhD project is to explore how data science can contribute to a more sustainable world from the perspectives of both improving the operation of environmental engineering processes and optimizing the activities of energy end users. The major work and corresponding results are as follows:

(1) (Paper I) A machine learning (ML) workflow consisting of Random Forest (RF) models, Deep Neural Network (DNN) models, Variable Importance Measure (VIM) analyses, and Partial Dependence Plot (PDP) analyses was developed and used to model WWTP processes and reveal how operational features affect effluent quality. The case study was conducted on a full-scale WWTP in Sweden with a large dataset (105,763 samples). This paper was the first ML application study to investigate cause-and-effect relationships for full-scale WWTPs. Also, for the first time, time lags between process parameters were treated rigorously so that these relationships could be uncovered accurately. The cause-and-effect findings in this paper can contribute to more sophisticated process control that is more precise and cost-effective.

(2) (Paper II) An upgraded workflow was designed to make the WWTP cause-and-effect investigation more precise, reliable, and comprehensive. Besides RF, two more typical tree-based models, XGBoost and LightGBM, were introduced. Also, two more metrics were adopted for a more comprehensive performance evaluation. A unified and more advanced interpretation method, SHapley Additive exPlanations (SHAP), was employed to aid model comparison and to interpret the optimal models in greater depth. Along with the new local findings, this study delivered two significant general findings for cause-and-effect ML implementations in process industries. First, multi-perspective model comparison is vital for selecting a truly reliable model for interpretation. Second, adopting an accurate and granular interpretation method benefits both model comparison and interpretation.

(3) (Paper III) A novel workflow was proposed to identify the operational factors accountable for boiler failures at WtE plants. In addition to data preprocessing and domain knowledge integration, it mainly comprised feature space embedding and unsupervised clustering. Two methods, PCA + K-means and Deep Embedding Clustering (DEC), were carried out and compared. The workflow fulfilled its objective in a case study on three datasets from a WtE plant in Sweden, and DEC outperformed PCA + K-means on all three datasets. DEC was superior because of its distinctive mechanism, in which the embedding module and K-means are trained simultaneously and iteratively, with information passing in both directions.

(4) (Paper IV) A two-level workflow (data structure level and algorithm mechanism level) was put forward to detect imperceptible anomalies in the energy consumption profiles of commercial buildings. The workflow achieved two objectives: it precisely detected the contextual energy anomalies hidden behind the time variation in the case study, and it investigated the combined influence of data structures and algorithm mechanisms on unsupervised anomaly detection for building energy consumption. The overall conclusion was that contextualization resulted in a less skewed estimation of correlations between instances, and that the algorithms with more local perspectives benefited more from contextualization.
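
The contextualization idea in item (4) can be illustrated with a minimal sketch. It assumes that "context" means the hour of day and uses a plain within-group z-score as a stand-in detector; the abstract does not name the algorithms that were compared, so the function name, the threshold, and the toy readings below are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def contextual_zscores(readings):
    """Score each (context, value) reading against others sharing its context.

    readings: list of (context, value) pairs, e.g. context = hour of day.
    Returns one z-score per reading, computed within its own context group.
    """
    groups = defaultdict(list)
    for context, value in readings:
        groups[context].append(value)
    stats = {c: (mean(v), pstdev(v)) for c, v in groups.items()}
    scores = []
    for context, value in readings:
        mu, sigma = stats[context]
        # A constant group has sigma == 0; treat its members as unremarkable.
        scores.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return scores


# Toy load profile (assumed data): low at night (hour 3), high at midday (hour 12).
# The last night reading (40) would be ordinary at midday but is unusual at night.
readings = [
    (3, 10), (12, 50), (3, 11), (12, 52), (3, 9), (12, 48),
    (3, 10), (12, 51), (3, 10), (12, 49), (3, 40),
]
scores = contextual_zscores(readings)
flagged = [i for i, s in enumerate(scores) if abs(s) > 2.0]  # hypothetical threshold
```

Scored against the pooled readings, the value 40 sits well inside the overall range, which is why an anomaly of this kind stays hidden behind the daily variation until the data are grouped by context.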

Abstract [sv, translated to English]

Today's society has taken many measures to address environmental problems. Examples include wastewater treatment plants, which reduce the effects of water pollution and increase the amount of usable water, and waste-to-energy (WtE) plants, which recover energy from waste and reduce its environmental impact. Managing these facilities can be demanding, however, because the processes are usually complex and dynamic. This complexity can hinder a comprehensive and precise understanding of the processes through conventional mechanistic models. On the other hand, with the development of so-called Industry 4.0, large-volume and high-resolution data from automatic online monitoring have become increasingly available. These data usually reflect detailed information about process activities that can be used to optimize process control. Similarly, data monitoring is assumed to be of great value to end users as well. For example, the energy consumption of commercial buildings is usually recorded in order to optimize energy use and, in the longer run, to save operating costs and reduce carbon footprint. To gain the most knowledge from the collected data, they must be processed with appropriate data science methods that extract the desired information. Data science is a field that covers formulating data-driven solutions, data preprocessing, analyzing data with particular algorithms, and using the results to support decisions in various application scenarios.

The aim of this doctoral project has been to explore how data science can contribute to a more sustainable world from the perspectives of both improving the operation of environmental engineering processes and optimizing energy use in buildings. The results are summarized as follows:

Paper I - A machine learning (ML) workflow consisting of Random Forest (RF) models, Deep Neural Network (DNN) models, Variable Importance Measure (VIM) analyses, and Partial Dependence Plot (PDP) analyses was developed and used to model the wastewater treatment process and identify how operational features affect effluent quality. The case study was carried out at a full-scale treatment plant in Umeå, Sweden, with large data (105,763 samples). This work is one of the first ML applications to investigate cause-and-effect relationships for full-scale treatment plants. For the first time, time lags between process parameters were also treated in order to yield optimal and correct information. The results of this work can contribute to more precise and cost-effective process control.

Paper II - An upgraded workflow was designed to improve the investigation of the treatment plant's cause-and-effect relationships. Besides RF, two more types of decision-tree models, XGBoost and LightGBM, were introduced. In addition, two more metrics were adopted for a more comprehensive performance evaluation. A unified and more advanced interpretation method, SHapley Additive exPlanations (SHAP), was used for model comparison and for interpreting the optimal models. The work resulted in two general observations on cause-and-effect investigation through ML implementations in process industries. First, multi-perspective model comparison is vital for selecting a reliable model for interpretation. Second, implementing a granular interpretation method can serve both model comparison and interpretation.

Paper III - A new workflow was proposed to identify the operational factors linked to unplanned stops of boilers at waste incineration plants. In addition to data preprocessing and domain knowledge integration, it comprised condensing the data into underlying structures and different clustering models. Two methods, PCA + K-means and Deep Embedding Clustering (DEC), were carried out and compared. The methods were applied in a case study on three datasets linked to unplanned stops at a full-scale WtE plant in Umeå, Sweden. DEC was found to outperform PCA + K-means on all three datasets.

Paper IV - A two-level workflow (data structure level and algorithm mechanism level) was presented to detect anomalies in the energy consumption profiles of commercial buildings. The workflow achieved two objectives: it identified the contextual energy anomalies hidden behind the time variation in the case study, and it investigated the combined influence of data structures and algorithm mechanisms on unsupervised anomaly detection for building energy consumption. The overall conclusion was that contextualization resulted in a less skewed estimation of correlations between instances, and that algorithms with more local perspectives benefited more from contextualization.
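
For Paper III, the clustering half of the PCA + K-means baseline can be sketched as plain Lloyd's algorithm. This is only an illustration under stated assumptions: the 2-D points stand in for already-embedded operating data, the initial centres are passed in by hand to keep the run repeatable, and every name is hypothetical. DEC is not shown, since it trains a neural embedding together with the clustering.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples.

    centroids: initial centres, supplied by the caller to keep runs repeatable.
    Returns (labels, centroids) once the assignments stop changing
    or the iteration budget is used up.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster ends up empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy 2-D "embedded" operating points (assumed data) forming two separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centres = kmeans(points, centroids=[points[0], points[3]])
```

In this baseline the embedding is fixed before clustering starts. According to the abstract, DEC instead trains the embedding module and K-means simultaneously and iteratively.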

Place, publisher, year, edition, pages
Umeå: Umeå University, 2021, p. 49
Keywords [en]
Wastewater treatment, Process analytics, Big data, Machine learning, Interpretable AI, Power plants, Failure analysis, Data mining, Buildings, Energy consumption, Anomaly detection
National Category
Computational Mathematics; Computer and Information Sciences; Civil Engineering
Identifiers
URN: urn:nbn:se:umu:diva-190259; ISBN: 978-91-7855-719-6 (electronic); ISBN: 978-91-7855-718-9 (print); OAI: oai:DiVA.org:umu-190259; DiVA, id: diva2:1619076
Public defence
2022-01-18, Glasburen, KBC-huset, Umeå, 13:00 (English)
Available from: 2021-12-17; Created: 2021-12-11; Last updated: 2021-12-13; Bibliographically approved
List of papers
1. A machine learning framework to improve effluent quality control in wastewater treatment plants
2021 (English). In: Science of the Total Environment, ISSN 0048-9697, E-ISSN 1879-1026, Vol. 784, article id 147138. Article in journal (Refereed). Published
Abstract [en]

Due to the intrinsic complexity of wastewater treatment plant (WWTP) processes, it is always challenging to respond promptly and appropriately to the dynamic process conditions in order to ensure the quality of the effluent, especially when operational cost is a major concern. Machine Learning (ML) methods have therefore been used to model WWTP processes in order to avoid various shortcomings of conventional mechanistic models. However, to the best of the authors' knowledge, no ML applications have focused on investigating how operational factors can affect effluent quality. Additionally, the time lags between process steps have always been neglected, making it difficult to explain the relationships between operational factors and effluent quality. Therefore, this paper presents a novel ML-based framework designed to improve effluent quality control in WWTPs by clarifying the relationships between operational variables and effluent parameters. The framework consists of Random Forest (RF) models, Deep Neural Network (DNN) models, Variable Importance Measure (VIM) analyses, and Partial Dependence Plot (PDP) analyses, and uses a novel approach to account for the impact of time lags between processes. Details of the framework are provided along with a demonstration of its practical applicability based on a case study of the Umeå WWTP in Sweden involving a large number of samples (105763) representing the full scale of the plant's operations. Two effluent parameters, Total Suspended Solids in effluent (TSSe) and Phosphate in effluent (PO4e), and thirty-two operational variables are studied. RF models are developed, validated using DNN models as references, and shown to be suitable for VIM and PDP analyses. VIM identifies the variables that most strongly influence TSSe and PO4e, while PDP elucidates their specific effects on TSSe and PO4e. 
The major findings are: (1) Influent temperature is the most influential variable for both TSSe and PO4e, but it affects them in different ways; (2) PO4e depends strongly on the TSS in aeration basins: higher TSS concentrations in aeration basins generally promote PO4 removal, but excess TSS can have negative effects; (3) In general, the impact of TSS in aeration basins on TSSe and PO4e increases with the distance of the basin from the merging outlet, so more attention should be paid to the TSS concentration in the third and fourth aeration basins than in the first and second ones; (4) Returning excessive amounts of sludge through the second return sludge pipe should be avoided because of its adverse impact on TSSe removal. These results could support the development of more advanced control strategies to increase control precision and reduce running costs in the Umeå WWTP and other similarly configured WWTPs. The framework could also be applied to other parameters in WWTPs and to industrial processes in general if sufficient high-resolution data are available.
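
The abstract does not spell out how the time lags between process steps are handled, so the following is only a minimal sketch of one plausible reading: each operational variable is shifted by an assumed, known lag, so that an effluent reading at time t is paired with the upstream reading that could have caused it. The function name, variable names, and lag values are hypothetical.

```python
def align_with_lags(series, target, lags):
    """Pair target[t] with series[name][t - lags[name]] for every variable.

    series: dict mapping variable name -> list of equally spaced readings
    target: list of effluent readings on the same time grid
    lags:   dict mapping variable name -> non-negative lag in time steps
    Returns (rows, y), where rows[i] is a dict of lag-aligned inputs for y[i].
    """
    max_lag = max(lags.values())
    rows, y = [], []
    # Start at max_lag so that every lagged index t - lag is valid.
    for t in range(max_lag, len(target)):
        rows.append({name: values[t - lags[name]] for name, values in series.items()})
        y.append(target[t])
    return rows, y


# Toy data (assumed): the "effluent" equals the inflow reading from 2 steps earlier.
inflow = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dosing = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
effluent = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]

rows, y = align_with_lags(
    {"inflow": inflow, "dosing": dosing},
    effluent,
    {"inflow": 2, "dosing": 1},
)
```

After alignment, each row can be fed to a model such as RF as an ordinary sample. Without the shift, the toy effluent would be compared with inflow readings that had not yet passed through the process.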

Place, publisher, year, edition, pages
Elsevier, 2021
Keywords
Wastewater treatment, Big data, Interpretable AI, Effluent quality, Process analytics
National Category
Water Engineering
Identifiers
urn:nbn:se:umu:diva-182709 (URN); 10.1016/j.scitotenv.2021.147138 (DOI); 000657595400009 (); 34088065 (PubMedID); 2-s2.0-85105698913 (Scopus ID)
Available from: 2021-05-04; Created: 2021-05-04; Last updated: 2021-12-13; Bibliographically approved
2. Towards better process management in wastewater treatment plants: Process analytics based on SHAP values for tree-based machine learning methods
2022 (English). In: Journal of Environmental Management, ISSN 0301-4797, E-ISSN 1095-8630, Vol. 301, article id 113941. Article in journal (Refereed). Published
Abstract [en]

Understanding the mechanisms of pollutant removal in Wastewater Treatment Plants (WWTPs) is crucial for controlling effluent quality efficiently. However, the numerous treatment units, operational factors, and the underlying interactions between these units and factors usually obscure a comprehensive and precise understanding of the processes. We have previously proposed a machine learning (ML) framework to uncover complex cause-and-effect relationships in WWTPs. However, only one interpretable ML model, Random Forest (RF), was studied, and the interpretation method was not granular enough to reveal very detailed relationships between operational factors and effluent parameters. Thus, in this paper, we present an upgraded framework involving three interpretable tree-based models (RF, XGBoost, and LightGBM), three metrics (R2, Root mean squared error (RMSE), and Mean absolute error (MAE)), and a more advanced interpretation system, SHapley Additive exPlanations (SHAP). Details of the framework are provided along with a demonstration of its practical applicability based on a case study of the Umeå WWTP in Sweden. Results show that, for both labels TSSe (Total suspended solids in effluent) and PO4e (Phosphate in effluent), the XGBoost models are optimal whereas the RF models are the least optimal, due to overfitting and polarized fitting. This study has yielded multiple new and significant findings with respect to the control of TSSe and PO4e in the Umeå WWTP and other similarly configured WWTPs. Additionally, this study has produced two important generic findings relating to ML applications for WWTPs (or even other process industries) in terms of cause-and-effect investigations. First, the model comparison should be carried out from multiple perspectives to ensure that underlying details are fully revealed and examined. Second, using a precise, robust, and granular (feature attribution available for individual instances) explanation method can bring extra insight into both model comparison and model interpretation. SHAP is recommended as we found it to be of great value in this study.
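
The "granular" property mentioned above, attribution for an individual instance, rests on Shapley values. The sketch below computes them exactly for a tiny hand-written function under simplifying assumptions: a feature that is "absent" is replaced by a single baseline value, and `toy_model` is a hypothetical stand-in for a trained tree model. It is not the SHAP library's algorithm for tree ensembles, only an illustration of the underlying quantity.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions for one instance x relative to a baseline.

    A feature that is absent from a coalition is set to its baseline value.
    Cost grows factorially with the number of features, so this is only
    practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        # Add features one at a time and credit each with its marginal change.
        for i in order:
            current[i] = x[i]
            value = model(current)
            phi[i] += value - prev
            prev = value
    return [p / factorial(n) for p in phi]


def toy_model(z):
    # Hypothetical stand-in for a trained model: linear terms plus an interaction.
    return 2.0 * z[0] + 1.0 * z[1] + z[0] * z[2]


x = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features involved.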

Place, publisher, year, edition, pages
Elsevier, 2022
Keywords
Wastewater treatment, Process analytics, Interpretable AI, Machine learning, SHapley additive exPlanations
National Category
Environmental Sciences
Identifiers
urn:nbn:se:umu:diva-188792 (URN); 10.1016/j.jenvman.2021.113941 (DOI); 000711337500004 (); 34731954 (PubMedID); 2-s2.0-85117064804 (Scopus ID)
Projects
EcoChange
Available from: 2021-10-22; Created: 2021-10-22; Last updated: 2021-12-13; Bibliographically approved
3. Investigation into causes of boiler failures in waste-to-energy plants with a coupled engineering and data mining solution
(English). Manuscript (preprint) (Other academic)
National Category
Other Civil Engineering; Computational Mathematics
Identifiers
urn:nbn:se:umu:diva-190257 (URN)
Available from: 2021-12-11 Created: 2021-12-11 Last updated: 2021-12-13
4. Towards delicate anomaly detection of energy consumption for buildings: enhance the performance from two levels
(English). Manuscript (preprint) (Other academic)
National Category
Computational Mathematics; Building Technologies
Identifiers
urn:nbn:se:umu:diva-190258 (URN)
Available from: 2021-12-11 Created: 2021-12-11 Last updated: 2021-12-13

Open Access in DiVA

fulltext (12208 kB), 1941 downloads
File information
File name: FULLTEXT01.pdf; File size: 12208 kB; Checksum (SHA-512):
141cb40e20d5c48adde4d77910d5829ce666d32be8f7bddd8a5117b591e9515b4e74106980c24ca0cc6599c57ade94f7b2f67a0a2a45d8361c66f114375bf670
Type: fulltext; Mimetype: application/pdf
spikblad (423 kB), 59 downloads
File information
File name: SPIKBLAD01.pdf; File size: 423 kB; Checksum (SHA-512):
088063d50eddeb40de88c256895a057a54f4f465331a76ad760cd8af42069276e8947a38bc5fbc2aeddf1374d9c75a5032ddd5184c4ce2bce53ed01b78dc3387
Type: spikblad; Mimetype: application/pdf

Authority records

Wang, Dong
