Highest Voted 'apache-spark' Questions - Database Administrators Stack Exchange

2

votes

1 answer

Can we use Cassandra in place of Hadoop with Spark?

Considering we have a backend written in NodeJS that uses MySQL and Cassandra as its databases, if we want to add Spark to the system to do some data analysis such as recommendation, can we do it with Cassandra (I mean using Spark + Cassandra)…

asked Nov 29 '22 at 04:41

user20551429

69
2

2

votes

1 answer

Best practices for large JOINs - Warehouse or External Compute (e.g. Spark)

I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward, but the amount of data I am processing is very large. I am wondering whether there is a preferred type of technology for very large JOINs. For example, is…

join self-join snowflake apache-spark

asked Feb 24 '22 at 21:06

Arthur Putnam

553
2
6
12

2

votes

2 answers

Check the time that PostgreSQL is taking to automatically create existing indexes when you do bulk insert using copy command from Spark

I am doing a bulk insert from Spark into a Postgres table. The amount of data that I am ingesting is huge: around 120-130 million records. I am first saving the records as multiple CSV files in a distributed storage location, i.e. S3…

postgresql postgresql-performance apache-spark

asked Jan 21 '21 at 11:05

Nikunj Kakadiya

123
6

1

vote

0 answers

Writing large dataset from spark dataframe

We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are currently writing this out to a PostgreSQL DB (also in Azure). Currently we are using the JDBC connector to write row out…

postgresql python jdbc apache-spark databricks

asked Feb 16 '24 at 00:57

Kyle Chamberlin

13
2

1

vote

0 answers

Why does MySQL select offset not work in pyspark?

select distinct day from t1 order by 1 desc limit 1,1 should return the second latest date in t1, but this error is thrown: ParseException: mismatched input ',' expecting {<EOF>, ';'}

mysql apache-spark

asked Mar 09 '23 at 12:01

uhmosdhsjxpbcrstis

11
1

1

vote

1 answer

Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting

I have a fact table and an scd-type-2 dimension table. I want to produce a sales report by region and year. I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little…

query-performance join data-warehouse slowly-changing-dimension apache-spark

asked Mar 07 '23 at 16:49

Kashyap

145
6

1

vote

2 answers

When will the Spark Cassandra connector with support for Spark 3.3 be released?

The master branch has a merged pull request supporting Spark 3.3. How long before this gets built and published on the Maven repository?

cassandra apache-spark

asked Jan 04 '23 at 05:15

Rubber Duck

111
3

1

vote

2 answers

Cassandra-4-Update: Multiple Schema Versions

After upgrading our first node, it has a different schema version (according to nodetool describecluster). This caused Spark jobs to hang because of reoccurring "schema agreement not reached" by metadata.SchemaAgreementChecker. Is this different…

schema upgrade cassandra apache-spark

asked Oct 12 '22 at 12:41

Sven

11
2

1

vote

0 answers

MySQL is not showing the history queries after the ETL is complete

I am using Docker containers for MySQL and Spark. Both containers are on an AWS EC2 instance. The PySpark ETL connects to MySQL with JDBC and starts extracting the data. When the ETL is running I can see the long-running threads/queries in MySQL with command…

mysql docker apache-spark

asked Sep 15 '21 at 21:23

nomad123

19
1
3

1

vote

0 answers

Pyspark error inserting into mysql database table that exists

I am trying to insert into an existing MySQL table using a PySpark JDBC connection; however, I get the following error (picture attached). Can I get assistance with this error? The table exists in the MySQL database. I was successful in inserting with a…

mysql python apache-spark

asked Jul 16 '20 at 16:09

IIShriyaII

11
3

1

vote

0 answers

HIVE + understanding the hive-metastore logs

We have an HDP cluster, version 2.6.4. We run a Spark Streaming app and use a Presto cluster to run Hive queries. When we look at the hivemetastore logs (under /var/log/hive), we can see the following warnings, which repeat…

postgresql hive apache-spark

asked Jan 23 '20 at 20:50

King David

111
1
4

0

votes

0 answers

Py4JJavaError while creating SparkSession

I encountered this error while trying to create a Spark session in a Python virtual environment using VS Code as the IDE. The code I ran and the output are below; please help. Code: spark = SparkSession.builder\ .master("local[1]")\ …

apache-spark

asked Jul 22 '24 at 12:49

tomiealff

1
2

0

votes

1 answer

spark-cassandra-connector read throughput unpredictable

A user reports that the range query throughput is far higher than expected when setting spark.cassandra.input.readsPerSec in the spark-cassandra-connector. Job dependencies: the Java driver version is set to 4.13.0. …

cassandra apache-spark spark-cassandra-connector

asked Nov 07 '23 at 19:56

Paul

416
2
6

0

votes

0 answers

Is there an open source implementation of QGM (Query Graph Model)?

I am building a new system that needs to interact essentially as an SQL backend. We would like to import logical queries into it (e.g. from ApacheSPARQ or Postgres and related things) and want to develop an internal representation (IR) for them. …

postgresql hive apache-spark

asked Sep 25 '21 at 12:13

intel_chris

141
3

0

votes

3 answers

Data storage for analytics

I have to store some amount of data for analytical purposes. The data source produces 2 TB of data per month. Data is collected on a monthly basis (not real-time). Data is fully structured. There are 100+ different columns of data. Availability of SQL…

sql-server postgresql hadoop apache-spark

asked Dec 24 '20 at 12:23

Leeloo

111
5

Questions tagged [apache-spark]