Oracle Cloud Dashboard with Quick Options - Creating a data warehouse

To import the data (received as Excel and CSV files from the customer), the import functionality of SQL Developer is very useful.

To be able to build a data mining model, we needed to restructure and prepare the data to get one single table or view. We needed to get from multiple tables with selected, specific data to one table with all the necessary, aggregated, relevant data.

The following things had to be considered when preparing the data:

- All data that should be analyzed needs to be in one single table or view.
- Data doesn't necessarily need to be persisted into a table to work for a data mining model.
- The data needs to be well-labelled, annotated and cleansed (resolve keys, aggregate numbers to the granularity needed).

Data for mining must exist within a single table or view. It can be prepared as a view if the data shouldn't be moved and duplicated. The information for each case (record) must be stored in a separate row. In our dataset, a case was a certain date during the skiing season with weather and snow information, overnight stays of visitors in the area, and exchange rate information for the most common home countries of the foreign visitors. Information about the date features (week of the year, day of the week, week of the skiing season) was added to give weight to those important features.

For each day, we had the distinct number of skiers that visited the skiing region and used their ski pass. This distinct number of skiers was the target value for the model to predict. We assumed (correctly) that this number depends on the other attributes in the data.

Once the data is in a usable form, it needs to be split into two datasets: one for training the model and one for testing it.

    create table t_test_data as
    select *
    from (
      select ora_hash(date_id, 1, 11) as hashvalue, t.*
      from t_all_data t
      order by date_id
    )
    where hashvalue = 1;

    create table t_training_data as
    select *
    from (
      select ora_hash(date_id, 1, 11) as hashvalue, t.*
      from t_all_data t
      order by date_id
    )
    where hashvalue = 0;

This was done with ORA_HASH to get two datasets that contain different data with no overlaps. The above queries generate two datasets of roughly equal size. The seed value (the third parameter of ORA_HASH) guarantees reproducibility of the datasets. If the two datasets should have a distribution of 80/20, the max bucket value (the second parameter of ORA_HASH) can be changed accordingly.

However, ORA_HASH should only be used if the data isn't skewed. If you want the distribution to be checked before sampling, using the SAMPLE clause is better. This is important if you have skewed data. It is the easier way, but also the more resource-consuming one because of the MINUS. The training data is sampled from the total data, and the test data is generated with a MINUS clause that returns the rest of the data for testing.

    create table t_training_data as
    select *
    from (
      select * from t_all_data SAMPLE (80) SEED (11)
    );

    create table t_test_data as
    select *
    from (
      select * from t_all_data
      MINUS
      select * from t_training_data
    );

Select an algorithm

Before we could create a model, we needed to select an algorithm that fulfils the goal of our data mining task. The DBMS_DATA_MINING documentation lists all the mining functions and the algorithms to choose from.
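The 80/20 variant of the ORA_HASH split mentioned above could look roughly like the following sketch. It assumes the tables t_training_data and t_test_data do not exist yet, and it uses five hash buckets (max bucket value 4), with one bucket going to the test data. The exact proportions depend on how the hash values spread over the date_id values, so the final count query is only a sanity check.

```sql
-- Sketch of an 80/20 split, assuming t_all_data has a date_id column
-- as in the queries above. ORA_HASH(expr, max_bucket, seed) returns a
-- bucket number between 0 and max_bucket, so max_bucket = 4 gives 5 buckets.

create table t_test_data as
select t.*
from   t_all_data t
where  ora_hash(t.date_id, 4, 11) = 0;   -- one of five buckets, roughly 20%

create table t_training_data as
select t.*
from   t_all_data t
where  ora_hash(t.date_id, 4, 11) > 0;   -- remaining four buckets, roughly 80%

-- Sanity check: compare the row counts of the two datasets.
select 'training' as dataset, count(*) as row_count from t_training_data
union all
select 'test' as dataset, count(*) as row_count from t_test_data;
```

Filtering on the hash in the WHERE clause, instead of selecting it as a column, keeps the helper value out of the resulting tables, so it cannot end up as an attribute in the model.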
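As an illustration of how an algorithm choice is passed to DBMS_DATA_MINING, here is a minimal sketch. The text does not say which algorithm was selected, so the combination shown here (the regression mining function with a generalized linear model) is only an assumption based on the target being a number. The names t_model_settings, SKIER_COUNT_MODEL and SKIER_COUNT are hypothetical; only t_training_data and date_id come from the queries above.

```sql
-- Settings table: one row per setting (hypothetical table name).
create table t_model_settings (
  setting_name  varchar2(30),
  setting_value varchar2(4000)
);

begin
  -- Choose the algorithm. GLM is an assumption for this sketch,
  -- not necessarily the algorithm used in the project.
  insert into t_model_settings (setting_name, setting_value)
  values (dbms_data_mining.algo_name,
          dbms_data_mining.algo_generalized_linear_model);
  commit;

  dbms_data_mining.create_model(
    model_name          => 'SKIER_COUNT_MODEL',        -- hypothetical model name
    mining_function     => dbms_data_mining.regression, -- numeric target
    data_table_name     => 'T_TRAINING_DATA',
    case_id_column_name => 'DATE_ID',
    target_column_name  => 'SKIER_COUNT',              -- hypothetical target column
    settings_table_name => 'T_MODEL_SETTINGS');
end;
/
```

Switching to a different algorithm would only mean changing the value inserted for dbms_data_mining.algo_name, as long as the algorithm supports the chosen mining function.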